From kenwhistler at att.net Fri May 1 09:17:31 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Fri, 01 May 2015 07:17:31 -0700
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To:
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net>
Message-ID: <55438AFB.6020000@att.net>

Koji,

Personally, I don't have a horse in this race, because I am not responsible for any linebreaking implementation -- so a change for halfwidth katakana wouldn't matter one way or the other to me.

Secondly, there is no formal stability guarantee constraining Line_Break property values (other than the generic guarantee that the property itself or existing aliases cannot be *removed* from the standard). Nor is there any stability guarantee regarding the rest of the algorithm definition in UAX #14. So in principle, the UTC could rewrite it completely. But I doubt that that would be in anybody's interest at this point. ;-)

But as I see it, the way this should work is for the major stakeholders who *do* have implemented linebreaking algorithms depending on UAX #14 working in released products (and that would include people speaking for various browsers and for Apple products in general, I think) to be the ones either pushing for a change, because it would make their behavior more correct and acceptable for Japanese, or pushing back *against* a change, because they depend on UAX #14 stability and would prefer tweaking the behavior in their implementations instead. So I'd like to see a formal proposal for a change (specified *exactly* as to the set of characters affected) brought to the UTC, where implementers and users of ICU could make the case for or against.

The other thing that I think would need to happen here is that any proposal should also provide suggested wording for UAX #14 which would explain why halfwidth katakana specifically need to break with the general principles that were used 15 years ago to assign LB classes based on East_Asian_Width considerations, and instead need to match the LB classes of their fullwidth katakana counterparts. That should be made explicit in the text of UAX #14, so somebody else doesn't "discover" another inconsistency between sets of values and try to change things back later on -- not knowing the rationale for the values.

Because a well-formed proposal for a change like this involves both a justification for a property value change *and* a corresponding fix to annex text, I think this is too late in the cycle to be taken as just beta feedback for the Version 8.0 release, unfortunately. Because of the potential hit on existing implementations (and test cases), this needs full review, and should instead be pushed as an early proposal for the Version 9.0 release cycle.

--Ken

On 5/1/2015 5:33 AM, Koji Ishii wrote:
> I support Makoto for the change. Nobody should appreciate that behavior, whether worked around locally (Firefox, IE) or left unnoticed (Chrome). Rather than implementing yet another workaround in Chrome, I wish it could finally be fixed after 15 years.
>
> [...]
From kojiishi at gmail.com Fri May 1 07:33:49 2015
From: kojiishi at gmail.com (Koji Ishii)
Date: Fri, 1 May 2015 21:33:49 +0900
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To: <55403267.9060202@att.net>
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net>
Message-ID:

I support Makoto for the change. Nobody should appreciate that behavior, whether worked around locally (Firefox, IE) or left unnoticed (Chrome). Rather than implementing yet another workaround in Chrome, I wish it could finally be fixed after 15 years.

If this were an issue where 5 people say break and 5 say not to -- or, considering the long life of the bug, 9 say break and 1 says not to -- I understand that Ken's answer might make more sense. However, I'm quite sure that this is a 10-0 issue. Everyone using UAX #14 has to choose between tailoring, leaving it unnoticed, or won't-fix. I think that kind of thing is better fixed.

Half-width CJK should follow the same line breaking classes as their wide counterparts. From that point of view, half-width Hangul being AL is actually correct. (Note that this is not the same as the full-width forms, which oftentimes have different classes than their narrow counterparts.)

Half-width punctuation marks already have correct classes, so they're fine. The symbols in U+FFE8-FFEE are AL, which also looks incorrect, but I do not find these code points in any CJK legacy encoding. Where did they come from? The logical thing would be to assign the same classes as their wide counterparts, but I can't be sure without knowing where they came from.

Ken, does this change cause problems in terms of the stability policy?

/koji

> On Apr 29, 2015, at 10:22, Ken Whistler wrote:
>
> Taking this thread back to the original question...
>
> The Line_Break property values for halfwidth katakana (lb=AL) and regular katakana (lb=ID) have been stable since they were first defined for Unicode 3.0 -- 15 years ago.
>
> Regardless of whether lb=AL is the optimal assignment for the halfwidth katakana, it seems likely to me that trying to *change* that Line_Break assignment, just for halfwidth katakana, at this late date, would be more destabilizing for existing implementations than helpful.
>
> The citations below show *different* behavior between browsers for linebreaking around halfwidth katakana. That suggests that Firefox and IE11 have already provided tailoring to better match expectations. The correct avenue forward, it seems to me, would be to pursue bugs against browsers that do not show expected behavior, to see if improvements there are feasible, rather than to modify the base Line_Break property values that everybody has to tailor *from*.
>
> Note that this is not *just* a Japanese problem, nor a matter of not matching JIS X 4051. UAX #14 is *not* a direct implementation of JIS X 4051 rules, although it is certainly informed by them and has many Line_Break values introduced to get default behavior closer to the Japanese rules for linebreaking. And the compatibility halfwidth characters in the standard also include halfwidth jamo and symbols, so any changes would also need to be considered in the context of consistency for those and for *Korean* rules, as well as for Japanese.
>
> --Ken
>
> On 4/27/2015 10:57 PM, Makoto Kato wrote:
>> Hi, Suzuki-san. Thank you for the reply.
>>
>>> At present, I have no objection to adding halfwidth katakana to the ideographic class in UAX#14, but I'm unfamiliar with the (negative) impact caused by the lack of halfwidth katakana in it. Could you tell me if you know anything?
>>
>> Since half-width katakana isn't ID, lines don't break around it the way they do for full-width katakana.
>>
>> Firefox and IE11 define half-width katakana as ID, so its line breaking is the same as for full-width katakana. Chrome doesn't define it as ID, so half-width katakana doesn't allow a line break per character.
>>
>> Although I have read JIS X 4051, it doesn't say that half-width katakana and full-width katakana behave differently.
>>
>>> I guess the inclusion or exclusion in other classes, like AI, AL, CJ, JL, JV, JT, SA, might be quite important to realize appropriate line breaking, but the inclusion or exclusion in the ID class does not seem to be important. If the inclusion in the ID class is important, more characters (e.g. Bopomofo) should be considered for full coverage. What do you think?
>>
>> My question is why the half-width katakana characters aren't in the same class as the full-width katakana characters. In the current spec, half-width katakana is defined as AL, so moving it to ID changes the break rule strongly (non-breaking -> break before or after).
>>
>> -- Makoto
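A quick way to see the property split the thread keeps referring to is to query the data Python's standard library exposes. This sketch is illustrative only (it is not part of the original exchange); the stdlib does not expose Line_Break, but it does expose the East_Asian_Width and General_Category values that, per Ken's later message, the LB classes were originally derived from:

    import unicodedata

    # Halfwidth katakana KA (U+FF76) vs. fullwidth katakana KA (U+30AB).
    # The thread states UAX #14 assigns lb=AL to the halfwidth form and
    # lb=ID to the fullwidth form; here we can at least see the H vs. W
    # East_Asian_Width split that the original assignments followed.
    for ch in ("\uFF76", "\u30AB"):
        print(
            f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
            f"ea={unicodedata.east_asian_width(ch)}, "  # 'H' vs. 'W'
            f"gc={unicodedata.category(ch)}"            # 'Lo' for both
        )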
From asmus-inc at ix.netcom.com Fri May 1 09:47:38 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Fri, 01 May 2015 07:47:38 -0700
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To: <55438AFB.6020000@att.net>
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net>
Message-ID: <5543920A.7060906@ix.netcom.com>

On 5/1/2015 7:17 AM, Ken Whistler wrote:
> Koji,
>
> [...]
>
> But as I see it, the way this should work is for the major stakeholders who *do* have implemented linebreaking algorithms depending on UAX #14 working in released products (and that would include people speaking for various browsers and for Apple products in general, I think) to be the ones either pushing for a change, because it would make their behavior more correct and acceptable for Japanese, or pushing back *against* a change, because they depend on UAX #14 stability and would prefer tweaking the behavior in their implementations instead. So I'd like to see a formal proposal for a change (specified *exactly* as to the set of characters affected) brought to the UTC, where implementers and users of ICU could make the case for or against.

I would go further and suggest that the UTC make no change until it has positively heard from a representative sample of users/implementers.

This kind of seemingly innocuous change does affect implementations, but implementers are usually not expecting to have the ground shift under them after a decade or more of stable property assignments. Silence on their part may just as likely be the result of failing to appreciate the possibility of an adverse outcome as of actual acquiescence.

To the degree that the CSS working group relies on UAX#14 as a default in some/any situations, it would be imperative to hear from them as well, before taking any action.

In principle, this should be the stated procedure of the UTC when making any change in long-standing property assignments -- particularly for widely deployed scripts.

That said, with proper buy-in from stakeholders, I see no objection to making a change.

A./

> [...]

From mpsuzuki at hiroshima-u.ac.jp Fri May 1 10:25:24 2015
From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya)
Date: Sat, 02 May 2015 00:25:24 +0900
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To: <55438AFB.6020000@att.net>
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net>
Message-ID: <55439AE4.4020109@hiroshima-u.ac.jp>

Dear Ken,

Ken Whistler wrote:
> The other thing that I think would need to happen here is that any proposal should also provide suggested wording for UAX #14 which would explain why halfwidth katakana specifically need to break with the general principles that were used 15 years ago to assign LB classes based on East_Asian_Width considerations, and instead need to match the LB classes of their fullwidth katakana counterparts. [...]

Excuse me, is there any record of the discussion, 15 years ago, of how the UAX #14 class for halfwidth katakana was decided? If there is, I would like to see a sample text (in halfwidth katakana) and the expected layout result for it.

You commented that the UAX #14 class should not be changed and that tailoring the line breaking behaviour would solve the problem (as Firefox and IE11 did). However, some developers may wonder: "there might be a reason why the UTC put halfwidth katakana in AL -- without understanding it, we cannot determine whether the proposed tailoring should be enabled always, or only for a specific environment (e.g. locale, surrounding text)".

If the UTC can supply the "expected layout result for halfwidth katakana" (used to define the class in the current UAX #14), it would be helpful for developers evaluating the proposed tailoring algorithm.

Regards,
mpsuzuki

From kenwhistler at att.net Fri May 1 11:48:11 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Fri, 01 May 2015 09:48:11 -0700
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To: <55439AE4.4020109@hiroshima-u.ac.jp>
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp>
Message-ID: <5543AE4B.5020904@att.net>

Suzuki-san,

On 5/1/2015 8:25 AM, suzuki toshiya wrote:
> Excuse me, is there any record of the discussion, 15 years ago, of how the UAX #14 class for halfwidth katakana was decided? If there is, I would like to see a sample text (in halfwidth katakana) and the expected layout result for it.
The *founding* document for the UTC discussion of the initial Line_Break property values 15 years ago was:

http://www.unicode.org/L2/L1999/99179.pdf

and the corresponding table draft (before approval and conversion into the final format that was published with UTR #14 -- later /UAX/ #14) was:

http://www.unicode.org/L2/L1999/99180.pdf

There is nothing different or surprising in terms of values there. The halfwidth katakana were lb=AL and the fullwidth katakana were lb=ID in that earliest draft, as of 1999.

What is new information, perhaps, is the explicit correlation that can be found in those documents with the East_Asian_Width properties, and the explanation in L2/99-179 that the EAW property values were explicitly used to make distinctions for the initial LB values.

There is no sample text or expected layout results from that time period, because that was not the basis for the original UTC decisions on any of this. Initial LB values were generated based on existing General_Category and EAW values, using general principles. They were not generated by examining and specifying in detail the line breaking behavior for every single script in the standard, and then working back from those detailed specifications to attempt to create a universal specification that would replicate all of that detailed behavior. Such an approach would have been nearly impossible, given the state of all the data, and might have taken a decade to complete.

That said, Japanese line breaking was no doubt considered as part of the overall background, because the initial design for UTR #14 was informed by experience in implementing line breaking algorithms at Microsoft in the 90's.

> You commented that the UAX #14 class should not be changed and that tailoring the line breaking behaviour would solve the problem (as Firefox and IE11 did). However, some developers may wonder: "there might be a reason why the UTC put halfwidth katakana in AL -- without understanding it, we cannot determine whether the proposed tailoring should be enabled always, or only for a specific environment (e.g. locale, surrounding text)".

See above, in L2/99-179. *That* was the justification. It had nothing to do with specific environment, locale, or surrounding text.

> If the UTC can supply the "expected layout result for halfwidth katakana" (used to define the class in the current UAX #14), it would be helpful for developers evaluating the proposed tailoring algorithm.

UAX #14 was never intended to be a detailed, script-by-script specification of line layout results. It is a default, generic, universal algorithm for line breaking that does a decent, generic job of line breaking in generic contexts without tailoring or specific knowledge of the language, locale, or typographical conventions in use.

UAX #14 is not a replacement for a full specification of kinsoku rules for Japanese, in particular. Nor is it intended as any kind of replacement for JIS X 4051.

Please understand this: UAX #14 does *NOT* tell anyone how Japanese text *should* line break. Instead, it is Japanese typographers, users and standardizers who tell implementers of line break algorithms for Japanese what the expectations for Japanese text should be, in what contexts. It is then the job of the UTC and of the platform and application vendors to negotiate the details of which part of that expected behavior makes sense to cover by tweaking the default line-breaking algorithm and the Line_Break property values for Unicode characters, which part makes sense to cover by adjusting commonly accessible and agreed-upon tailoring behavior (or public standards like CSS), and finally which part should instead be addressed by value-added, proprietary implementations of high-end publishing software.

Regards,

--Ken
From asmus-inc at ix.netcom.com Fri May 1 14:12:59 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Fri, 01 May 2015 12:12:59 -0700
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To: <5543AE4B.5020904@att.net>
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net>
Message-ID: <5543D03B.80603@ix.netcom.com>

Thank you, Ken, for your dedicated archeological efforts.

I would like to emphasize that, at the time, UAX#14 reflected observed behavior, in particular (but not exclusively) for MS products, some of which (at the time) used an LB algorithm that effectively matched an untailored UAX#14.

However, recently, the W3C has spent considerable effort looking into different layout-related algorithms and specifications. If, in that context, a consensus approach is developed that points to a better "default" behavior for untailored UAX#14-style line breaking, I would regard that as a critical mass of support to allow the UTC to consider tinkering with such a long-standing set of property assignments.

This would be true especially if it can be demonstrated that (other than matching legacy behavior) there's no context that would benefit from the existing classification. I note that this was something several posters implied.

So, if implementers of the legacy behavior are amenable to achieving this by tailoring, and if the change augments the number of situations where untailored UAX#14-style line breaking can be used, that would be a win that might offset the cost of a disruptive change.

We've heard arguments why the proposed change is technically superior for Japanese. We now need to find out whether there are contexts where a change would adversely affect users/implementers. Following that, we would look for endorsements of the proposal from implementers or other standards organizations such as the W3C (and, if at all possible, agreement from those implementers who use the untailored algorithm now). With these three preconditions in place, I would support an effort of the UTC to revisit this question.

A./

On 5/1/2015 9:48 AM, Ken Whistler wrote:
> Suzuki-san,
>
> [...]
From kojiishi at gmail.com Sun May 3 11:47:49 2015
From: kojiishi at gmail.com (Koji Ishii)
Date: Mon, 4 May 2015 01:47:49 +0900
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To: <5543D03B.80603@ix.netcom.com>
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net> <5543D03B.80603@ix.netcom.com>
Message-ID:

Thank you so much, Ken and Asmus, for the detailed guidance and history. This helps me a lot.

In terms of time frame, I don't insist on a specific one; Unicode 9 is fine if that works well for all.

I'm not sure how much history and postmortem has to be baked into the section of UAX#14 -- hopefully not much, because I'm not familiar with how it was defined beyond what Ken and Asmus kindly provided in this thread. But from that information, I feel more strongly than before that this was simply an unfortunate oversight. In the document Ken quoted, F and W are distinguished, but H and N are not. In the '90s, the East Asian versions of Office and RichEdit were on my radar, and all of them handled halfwidth katakana as ID for line breaking purposes. That's quite understandable given the number of code points to work on, given the priority of halfwidth katakana, and given the difference between "what line breaking should be" and UAX#14 as Ken noted -- but writing it up as a document doesn't look like an easy task.

I agree that implementers and the CSS WG should be involved, but given that IE and FF have already tailored, and all MS products as well, I guess it should not be too hard. I'm on the Chrome team now, and the only problem for me in fixing this in Chrome is to justify why Chrome should tailor rather than fixing UAX#14 (and the bug priority...)

Either Makoto or I can bring it up to the CSS WG and get back to you.

/koji

On Sat, May 2, 2015 at 4:12 AM, Asmus Freytag (t) wrote:
> Thank you, Ken, for your dedicated archeological efforts.
>
> [...]
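Koji's observation about which Halfwidth and Fullwidth Forms carry which class can be checked directly against LineBreak.txt in the UCD. A sketch, not from the thread: it assumes the file (https://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt) has been downloaded locally, and the classes it prints naturally depend on the version of the data file.

    import re

    # Print the Line_Break assignments inside the Halfwidth and Fullwidth
    # Forms block (U+FF00..U+FFEF), e.g. to verify the AL values reported
    # above for U+FF66..U+FF9F and U+FFE8..U+FFEE.
    line_re = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

    with open("LineBreak.txt", encoding="utf-8") as f:
        for line in f:
            m = line_re.match(line)
            if not m:
                continue  # comment or blank line
            start = int(m.group(1), 16)
            end = int(m.group(2) or m.group(1), 16)
            if end < 0xFF00 or start > 0xFFEF:
                continue  # outside the block
            print(f"{m.group(1)}..{m.group(2) or m.group(1)}  lb={m.group(3)}")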
From asmus-inc at ix.netcom.com Sun May 3 14:53:19 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 03 May 2015 12:53:19 -0700
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To:
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net> <5543D03B.80603@ix.netcom.com>
Message-ID: <55467CAF.4080401@ix.netcom.com>

On 5/3/2015 9:47 AM, Koji Ishii wrote:
> Thank you so much, Ken and Asmus, for the detailed guidance and history. This helps me a lot.
>
> [...] But from that information, I feel more strongly than before that this was simply an unfortunate oversight. In the document Ken quoted, F and W are distinguished, but H and N are not. In the '90s, the East Asian versions of Office and RichEdit were on my radar, and all of them handled halfwidth katakana as ID for line breaking purposes. That's quite understandable given the number of code points to work on, given the priority of halfwidth katakana, and given the difference between "what line breaking should be" and UAX#14 as Ken noted -- but writing it up as a document doesn't look like an easy task.

Koji,

kana are special in that they are not shared among languages. From that perspective, there's nothing wrong with having a "general purpose" algorithm support the rules of the target language (unless that would add undue complexity, which isn't a consideration here).

Based on the data presented informally here in postings, I find your conclusion (oversight) quite believable. The task would therefore be to present the same data in a more organized fashion as part of a formal proposal. Should be doable. I think you'd want to focus on a survey of modern practice in implementations (and if you have data on some of them going back to the '90s, all the better).

From the historical analysis it's clear that there was a desire to create assignments that didn't introduce random inconsistencies between the LB and EAW properties, but that kind of self-consistency check just makes sure that all characters in some group defined by the intersection of property subsets are treated the same (unless there's an overriding reason to differentiate within). It seems entirely plausible that this process misfired for the characters in question -- the more likely so, given that the earliest drafts of the tables were based on an implementation also being created by MS around the same time. That makes any divergence from other MS products even more likely to be an oversight.

I do want to help the UTC establish a precedent of getting changes like this endorsed by a representative sample of implementers and key external standards (where applicable; in this case that would be CSS), to avoid the chance of creating undue disruption (and to increase the chance that the resulting modified algorithm is actually usable off-the-shelf, for example for "default" or "unknown language" type scenarios). Hence my insistence that you go out and drum up support.

But it looks like this should be relatively easy, as there seems to be no strong case for maintaining the status quo, other than that it is the status quo.

A./

> I agree that implementers and the CSS WG should be involved, but given that IE and FF have already tailored, and all MS products as well, I guess it should not be too hard. [...]
>
> Either Makoto or I can bring it up to the CSS WG and get back to you.
>
> /koji
>
> [...]

From richard.wordingham at ntlworld.com Mon May 4 08:47:33 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 4 May 2015 14:47:33 +0100
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To: <55314383.5070507@ix.netcom.com>
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com>
Message-ID: <20150504144733.53247fcb@JRWUBU2>

On Fri, 17 Apr 2015 10:31:47 -0700
"Asmus Freytag (t)" wrote:

> But permit me to ask one question up front. What would be served by making such a sweeping change at this juncture, after 25 years of established practice?

I suspect the idea is to have a way of unobtrusively supplying the Bidi_Mirrored value in a character pick-list, namely by using the words 'OPENING' and 'CLOSING' rather than 'LEFT' and 'RIGHT'. As I dimly recall the use of ']a, b[' to denote an open interval, the proposed solution is not complete, but the complete solution is not obvious to me. I for one don't want to have to choose a non-English locale to type right-to-left text.

Richard.
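The property Richard is alluding to is machine-readable, so a picker does not have to smuggle it into the name. A stdlib sketch (again not part of the thread) that reads Bidi_Mirrored and Bidi_Class for the paired punctuation being discussed:

    import unicodedata

    # U+0028/U+0029 are *named* LEFT/RIGHT PARENTHESIS, but what governs
    # their appearance in right-to-left text is the Bidi_Mirrored property:
    # a mirrored character is drawn with the mirrored glyph whenever its
    # resolved directionality is right-to-left, whatever the name says.
    for ch in "()[]":
        print(
            f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
            f"mirrored={bool(unicodedata.mirrored(ch))}, "  # True for all four
            f"bidi_class={unicodedata.bidirectional(ch)}"   # 'ON' (Other Neutral)
        )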
From richard.wordingham at ntlworld.com Mon May 4 10:07:52 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 4 May 2015 16:07:52 +0100
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To: <278590126.2931716.1430748451728.JavaMail.zimbra@laposte.net>
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <278590126.2931716.1430748451728.JavaMail.zimbra@laposte.net>
Message-ID: <20150504160752.710e1c72@JRWUBU2>

On Mon, 4 May 2015 16:07:31 +0200 (CEST)
marcel.schneider20 at laposte.net wrote:

> The information about OPENING and CLOSING is one part of the Formal Alias issue. The goal is to make the true names better known and to put within reach of people reading English, that is a huge majority, the full bandwidth of Unicode information in real time. Today, IMHO, the information about (and the availability of) formal aliases seems to be out of reach for many software users who need it when searching for information about characters. It therefore seems consistent to make it better available. The same would apply to informative aliases.
>
> Unicode clearly states in NamesList.txt that "this file should not be parsed for machine-readable information".
>
> By the way, all the informative aliases Unicode added for the information of users, implementers and developers are lost, because they seem to be nowhere else in the UCD.

The UCD file you want is ucdxml/ucd.all.grouped.xml or its flat equivalent. On the Unicode site, they exist as zip files, ucdxml/ucd.all.grouped.zip and ucdxml/ucd.all.flat.zip.

Richard.
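Those ucdxml files carry the aliases, and every other property, in attribute form per UAX #42, so the complaint that the aliases are "nowhere else in the UCD" has a machine-readable answer. A rough sketch, assuming ucd.all.flat.xml has been downloaded and unzipped locally, and using the UAX #42 namespace and attribute names (na for the name; lb, ea, Bidi_M; name-alias children for aliases) as I recall them -- verify against the current schema before relying on this:

    import xml.etree.ElementTree as ET

    NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

    # Pull one character's record out of ucd.all.flat.xml (a large file;
    # parsing it all at once takes a while and a fair amount of memory).
    tree = ET.parse("ucd.all.flat.xml")
    for char in tree.iter(f"{NS}char"):
        if char.get("cp") == "FF76":  # HALFWIDTH KATAKANA LETTER KA
            print("name:   ", char.get("na"))
            print("lb:     ", char.get("lb"))
            print("ea:     ", char.get("ea"))
            print("Bidi_M: ", char.get("Bidi_M"))
            for alias in char.findall(f"{NS}name-alias"):
                print("alias:  ", alias.get("alias"), f"({alias.get('type')})")
            break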
> I for one don't want to have to choose a non-English > locale to type right-to-left text. Non-sequitur? A./ From asmus-inc at ix.netcom.com Mon May 4 10:34:26 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 04 May 2015 08:34:26 -0700 Subject: NamesList, =?UTF-8?B?Q29kZcKgQ2hhcnRzLCBJU08vSUVDwqAxMDY0Ng==?= In-Reply-To: <20150504160752.710e1c72@JRWUBU2> References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <278590126.2931716.1430748451728.JavaMail.zimbra@laposte.net> <20150504160752.710e1c72@JRWUBU2> Message-ID: <55479182.2060202@ix.netcom.com> Richard, as I wrote in my previous message, not knowing the first thing about character properties, some people immediately propose to carry all that information in the character name... A./ On 5/4/2015 8:07 AM, Richard Wordingham wrote: > On Mon, 4 May 2015 16:07:31 +0200 (CEST) > marcel.schneider20 at laposte.net wrote: > >> The information about OPENING and CLOSING is one part of the >> Formal Alias issue. The goal is to make the true names better known >> and to allow people reading English, that is a huge majority, to get >> at reach the full bandwith of Unicode information in real time. >> Today, IMHO, the information about (and the availability of) >> formal aliases seems to be out of reach for much software users who >> are confronted with when searching for information about characters. >> It therefore seems to be consistent to make it better available. >> The same would apply to informative aliases. >> >> Unicode clearly states in NamesList.txt, that ?this file should not >> be parsed for machine-readable information?. >> By the way, all the informative aliases Unicode added for >> the information of users, implementers and developers, are lost >> because they seem to be nowhere else in the UCD. > The UCD file you want is ucdxml/ucd.all.grouped.xml or its flat > equivalent. On the Unicode site, they exists as zip files, > ucdxml/ucd.all.grouped.zip and ucdxml/ucd.all.flat.zip. > > Richard. > > From richard.wordingham at ntlworld.com Mon May 4 11:42:26 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 4 May 2015 17:42:26 +0100 Subject: NamesList, =?ISO-8859-1?B?Q29kZaBDaGFydHMsIElTTy9JRUOgMTA2?= =?ISO-8859-1?B?NDY=?= In-Reply-To: <5547911E.4080507@ix.netcom.com> References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> Message-ID: <20150504174226.64433e65@JRWUBU2> On Mon, 04 May 2015 08:32:46 -0700 "Asmus Freytag (t)" wrote: > On 5/4/2015 6:47 AM, Richard Wordingham wrote: > > I suspect the idea is to have a way of unobtrusively supplying the > > Bidi_Mirrored value in a character pick-list, namely the use of the > > words 'OPENING' and 'CLOSING' rather than 'LEFT' and 'RIGHT'. > > Reading this discussion, I sometimes wonder whether people have ever > heard of character properties? I believe most ordinary computer users have not heard of them. Most people do not knowingly have the UCD to hand, or even UnicodeData.txt. > No way to pack all the information into the name, and even character > properties aren't covering all of them. Unfortunately, when choosing a character from a character picker, the most help one is likely to get is the character name. The name is actually quite useful when the glyph is not as one expects or the distinguishing features are not readily visible. 
Sometimes, however, the names are distinctly unhelpful. Perhaps 'DEVANAGARI DANDA' should have a correcting alias 'DANDA' (or 'INDIAN DANDA'?) to reassure people that it is also the Bengali/Tamil etc. danda. > > I for one don't want to have to choose a non-English > > locale to type right-to-left text. > Non-sequitur? No. The clear issue raised was of knowing whether a character's glyph would change with the bidi context. One solution that immediately comes to mind is to display the character in a pick list according to the user's locale. Unfortunately, that will not always work. In these days of Unicode, locales are primarily useful for determining the user interface. Richard. From eliz at gnu.org Mon May 4 11:59:26 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Mon, 04 May 2015 19:59:26 +0300 Subject: NamesList, =?iso-8859-1?Q?Code=A0Charts=2C_ISO=2FIEC=A010646?= In-Reply-To: <20150504174226.64433e65@JRWUBU2> References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2> Message-ID: <83r3qws2hd.fsf@gnu.org> > Date: Mon, 4 May 2015 17:42:26 +0100 > From: Richard Wordingham > > > > I for one don't want to have to choose a non-English > > > locale to type right-to-left text. > > Non-sequitur? > > No. The clear issue raised was of knowing whether a character's glyph > would change with the bidi context. One solution that immediately > comes to mind is to display the character in a pick list according to > the user's locale. User's locale has nothing to do with bidi context, so this would be simply wrong. From asmus-inc at ix.netcom.com Mon May 4 12:22:16 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 04 May 2015 10:22:16 -0700 Subject: NamesList, =?UTF-8?B?Q29kZcKgQ2hhcnRzLCBJU08vSUVDwqAxMDY0Ng==?= In-Reply-To: <20150504174226.64433e65@JRWUBU2> References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2> Message-ID: <5547AAC8.6000806@ix.netcom.com> On 5/4/2015 9:42 AM, Richard Wordingham wrote: > On Mon, 04 May 2015 08:32:46 -0700 > "Asmus Freytag (t)" wrote: > >> On 5/4/2015 6:47 AM, Richard Wordingham wrote: >>> I suspect the idea is to have a way of unobtrusively supplying the >>> Bidi_Mirrored value in a character pick-list, namely the use of the >>> words 'OPENING' and 'CLOSING' rather than 'LEFT' and 'RIGHT'. >> Reading this discussion, I sometimes wonder whether people have ever >> heard of character properties? > I believe most ordinary computer users have not heard of them. Most > people do not knowingly have the UCD to hand, or even UnicodeData.txt. But people writing character pickers really should mine these. > >> No way to pack all the information into the name, and even character >> properties aren't covering all of them. > Unfortunately, when choosing a character from a character picker, the > most help one is likely to get is the character name. The name is > actually quite useful when the glyph is not as one expects or the > distinguishing features are not readily visible. > > Sometimes, however, the names are distinctly unhelpful. Perhaps > 'DEVANAGARI DANDA' should have a correcting alias 'DANDA' (or 'INDIAN > DANDA'?) to reassure people that it is also the Bengali/Tamil etc. > danda. That's because the creator of your character picker didn't add any value. 
>>> I for one don't want to have to choose a non-English
>>> locale to type right-to-left text.
>> Non-sequitur?
> No. The clear issue raised was that of knowing whether a character's glyph would change with the bidi context. One solution that immediately comes to mind is to display the character in a pick list according to the user's locale. Unfortunately, that will not always work. In these days of Unicode, locales are primarily useful for determining the user interface.

I still don't follow. If I edit text, then the mirroring happens in real time. If it doesn't come out as expected, I can change the character (or use markup).

A./

> Richard.

From verdy_p at wanadoo.fr Mon May 4 12:32:37 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 4 May 2015 19:32:37 +0200
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To: <20150504174226.64433e65@JRWUBU2>
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2>
Message-ID:

2015-05-04 18:42 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

> > No way to pack all the information into the name, and even character properties aren't covering all of them.
>
> Unfortunately, when choosing a character from a character picker, the most help one is likely to get is the character name. The name is actually quite useful when the glyph is not as one expects or the distinguishing features are not readily visible.

Character pickers are applications, and not in the scope of the standard itself. It's up to the developers of these applications to provide the necessary localisations according to the expectations of their users for a particular language, script, and/or country/region, or even dialectal variant.

You cannot have a single normative character name (in fact not really a name, but a technical identifier) that will match all users' expectations in all cultures. So Unicode and ISO/IEC 10646 have chosen to use and publish a single stable identifier throughout the standardization process; even if it is bad, it will be kept. These names are not even designed to be suitable for all English users (just consider how CJK sinograms are named; those names are not suitable for anyone...).

There are open projects (outside Unicode, and even outside CLDR itself) to provide common character names in various locales.

From richard.wordingham at ntlworld.com Mon May 4 12:39:56 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 4 May 2015 18:39:56 +0100
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To: <83r3qws2hd.fsf@gnu.org>
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2> <83r3qws2hd.fsf@gnu.org>
Message-ID: <20150504183956.191ecac6@JRWUBU2>

On Mon, 04 May 2015 19:59:26 +0300 Eli Zaretskii wrote:

> > Date: Mon, 4 May 2015 17:42:26 +0100
> > From: Richard Wordingham
> >
> > The clear issue raised was that of knowing whether a character's glyph would change with the bidi context. One solution that immediately comes to mind is to display the character in a pick list according to the user's locale.
>
> User's locale has nothing to do with bidi context, so this would be simply wrong.

If the paragraph embedding level is determined by an overriding profile but there is nothing explicit, should not the locale determine the directionality? If so, the locale will often work for indicating whether a character should be displayed in its left-to-right form or its right-to-left form.

Are you, for example, suggesting that a code chart in Arabic should display U+0028 LEFT PARENTHESIS and U+0029 RIGHT PARENTHESIS using the same glyphs as one in the English language?

Richard.

From asmus-inc at ix.netcom.com Mon May 4 12:49:12 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 04 May 2015 10:49:12 -0700
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To:
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2>
Message-ID: <5547B118.10506@ix.netcom.com>

On 5/4/2015 10:32 AM, Philippe Verdy wrote:

> Character pickers are applications, and not in the scope of the standard itself. It's up to the developers of these applications to provide the necessary...

... additions that make their product usable, including any...

> ...localisations according to the expectations of their users for a particular language, script, and/or country/region, or even dialectal variant.
>
> You cannot have a single normative character name (in fact not really a name, but a technical identifier) that will match all users' expectations in all cultures.

Right.

> So Unicode and ISO/IEC 10646 have chosen to use and publish a single stable identifier throughout the standardization process; even if it is bad, it will be kept. These names are not even designed to be suitable for all English users (just consider how CJK sinograms are named; those names are not suitable for anyone...).
>
> There are open projects (outside Unicode, and even outside CLDR itself) to provide common character names in various locales.

I'm sure there are - there may even be work on a character picker, but do you have any links?

A./

From asmus-inc at ix.netcom.com Mon May 4 13:02:59 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 04 May 2015 11:02:59 -0700
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To: <20150504183956.191ecac6@JRWUBU2>
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2> <83r3qws2hd.fsf@gnu.org> <20150504183956.191ecac6@JRWUBU2>
Message-ID: <5547B453.1080806@ix.netcom.com>

On 5/4/2015 10:39 AM, Richard Wordingham wrote:
>> User's locale has nothing to do with bidi context, so this would be simply wrong.
> If the paragraph embedding level is determined by an overriding profile but there is nothing explicit, should not the locale determine the directionality? If so, the locale will often work for indicating whether a character should be displayed in its left-to-right form or its right-to-left form.
>
> Are you, for example, suggesting that a code chart in Arabic should display U+0028 LEFT PARENTHESIS and U+0029 RIGHT PARENTHESIS using the same glyphs as one in the English language?

The disconnect is that a "character picker", to continue your example, could opt to show the shape that would be chosen based on the direction context of the text input location (caret position). That has nothing to do with the "language" of the names or the "locale" of the user.

A./

From verdy_p at wanadoo.fr Mon May 4 13:12:38 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 4 May 2015 20:12:38 +0200
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To: <5547B118.10506@ix.netcom.com>
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2> <5547B118.10506@ix.netcom.com>
Message-ID:

2015-05-04 19:49 GMT+02:00 Asmus Freytag (t):

> On 5/4/2015 10:32 AM, Philippe Verdy wrote:
> > So Unicode and ISO/IEC 10646 have chosen to use and publish a single stable identifier throughout the standardization process; even if it is bad, it will be kept. These names are not even designed to be suitable for all English users (just consider how CJK sinograms are named; those names are not suitable for anyone...).
> >
> > There are open projects (outside Unicode, and even outside CLDR itself) to provide common character names in various locales.
>
> I'm sure there are - there may even be work on a character picker, but do you have any links?

That list is wide open; some projects will start, others will end. Frequently they will change the names shown in previous versions... But you may just start by looking in Wikipedia, which frequently has articles in lots of languages and provides external links. All editions also list various aliases.

Even during the standardisation process there were multiple names discussed, but for tracking discussions, and to allow plain-text searches to find the related discussions from before the character was finally encoded, the technical identifier coming from a formal proposal was kept. Sometimes there were competing proposals for some characters, but once one of these formal proposals has passed an early stage of balloting, the name is stable and should not change (unless an alias was already listed in the accepted proposal and it has been found to be more frequently used in other early discussions). A limited number of proposed names are considered, and proper localisation is definitely not a goal at this early stage: it would have been impossible to produce the standard and encode so many characters if it had been necessary to provide accurate names matching exactly the most frequent uses (or some rarer uses, or future uses that will arise once the character is encoded).

For lists of character pickers, we have a choice among various kinds of applications: accessories for desktop OSes, word-processor tools, web sites, wikis, articles in online forums and blogs, books and facsimiles (PDF, DjVu, photos...), spreadsheets, input method editors, and custom keyboard layouts for onscreen input (or input on touch devices...). The choice is unlimited and expands every day. Even without developing applications, users are inventive and will name the characters as they want in their informal discussions, mails, chats, SMS, tweets...

The Unicode names list is just a basic set of properties, and its names are just technical identifiers that are part of these properties; translation (or even translatability, even in English) is definitely not a goal. Another way to say it: « You don't like these "names"? Great! In fact none of us really like them. Develop your own list of names, publish it, and try convincing others to use your list! »

From richard.wordingham at ntlworld.com Mon May 4 13:51:13 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 4 May 2015 19:51:13 +0100
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To: <5547AAC8.6000806@ix.netcom.com>
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2> <5547AAC8.6000806@ix.netcom.com>
Message-ID: <20150504195113.6b2844bf@JRWUBU2>

On Mon, 04 May 2015 10:22:16 -0700 "Asmus Freytag (t)" wrote:

> On 5/4/2015 9:42 AM, Richard Wordingham wrote:
> > On Mon, 04 May 2015 08:32:46 -0700 "Asmus Freytag (t)" wrote:
> >> Reading this discussion, I sometimes wonder whether people have ever heard of character properties?
> > I believe most ordinary computer users have not heard of them. Most people do not knowingly have the UCD to hand, or even UnicodeData.txt.
> But people writing character pickers really should mine these.

I agree. However, some don't even give the character name, a lack that can be really annoying with some diacritics. My workaround is to look up the code point in UnicodeData.txt.

> > One solution that immediately comes to mind is to display the character in a pick list according to the user's locale. Unfortunately, that will not always work. In these days of Unicode, locales are primarily useful for determining the user interface.
> I still don't follow. If I edit text, then the mirroring happens in real time. If it doesn't come out as expected, I can change the character (or use markup).

In a perfect world, perhaps 'in real time', but not immediately. Assuming that the paragraph embedding is not set to right-to-left, if I type <beh, U+0028, jeem>, then on typing jeem the glyph for U+0028 changes from concave on the right to concave on the left, and moves from the right to the left of beh. The idea of displaying the glyphs according to the context of the insertion point does have much merit, but it is not so straightforward if the character picker is a separate application and so lacks the information. Of course, the context is not as simple as left-to-right v. right-to-left, especially for brackets.

Richard.
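Richard's workaround above - and Asmus's point that picker authors should mine the character properties - amounts to only a few lines of code once a property library is available. A minimal sketch, assuming ICU4J is on the classpath (the class name PickerInfo is mine, purely illustrative):

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UProperty;

public class PickerInfo {
    public static void main(String[] args) {
        int cp = 0x0028; // U+0028 LEFT PARENTHESIS
        // The formal name, as a picker tooltip might show it.
        System.out.println(UCharacter.getName(cp));
        // Bidi_Mirrored tells the picker that the glyph depends on bidi
        // context - exactly the OPENING/CLOSING question discussed above.
        System.out.println("Bidi_Mirrored: "
                + UCharacter.hasBinaryProperty(cp, UProperty.BIDI_MIRRORED));
    }
}

This only reads the properties; whether and how to show a mirrored glyph in the pick list remains the application's decision, as the thread discusses.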
From verdy_p at wanadoo.fr Mon May 4 14:09:47 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 4 May 2015 21:09:47 +0200
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To: <20150504195113.6b2844bf@JRWUBU2>
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2> <5547AAC8.6000806@ix.netcom.com> <20150504195113.6b2844bf@JRWUBU2>
Message-ID:

2015-05-04 20:51 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

> On Mon, 04 May 2015 10:22:16 -0700 "Asmus Freytag (t)" wrote:
> > But people writing character pickers really should mine these.
>
> I agree. However, some don't even give the character name, a lack that can be really annoying with some diacritics. My workaround is to look up the code point in UnicodeData.txt.

Ideally, a perfect character picker displaying names should allow users to personalize these names, and possibly even save them online to a cloud with their user preferences, possibly with a sharing option allowing users to feed a per-locale database, with votes/ratings, so that this database will progressively be able to return the names that have the best agreement. Such a tool should of course use a local cache when it queries names from the shared database, and should also offer an option to update the cache entirely from a snapshot (just as we perform regular software updates). Such systems do exist for various applications in other domains, e.g. for rating web sites or their security/risk per domain name, or for mail blacklists. The same approach could be used for a localized database of character names. The database would also be able to list known aliases (by sorting them in rating order and extracting the top 10).

With that system it would be even easier to perform plain-text searches for character names using more user-friendly descriptions, capable of finding related characters, or characters suitable for some usage: the character picker would then list all the characters found by name, or by part of their name.

From pedberg at apple.com Mon May 4 16:19:21 2015
From: pedberg at apple.com (Peter Edberg)
Date: Mon, 04 May 2015 14:19:21 -0700
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To: <55467CAF.4080401@ix.netcom.com>
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net> <5543D03B.80603@ix.netcom.com> <55467CAF.4080401@ix.netcom.com>
Message-ID: <4A7F11C4-2F5B-4156-844C-C21E543FCFC7@apple.com>

I have been checking with various groups at Apple. The consensus here is that we would like to see the linebreak value for halfwidth katakana changed to ID.

- Peter E

> On May 3, 2015, at 12:53 PM, Asmus Freytag (t) wrote:
>
> On 5/3/2015 9:47 AM, Koji Ishii wrote:
>> Thank you so much, Ken and Asmus, for the detailed guides and histories. This helps me a lot.
>>
>> In terms of time frame, I don't insist on a specific one; Unicode 9 is fine if that works well for all.
>>
>> I'm not sure how much history and postmortem have to be baked into that section of UAX#14 - hopefully not much, because I'm not familiar with how it was defined beyond what Ken and Asmus kindly provided in this thread. But from that information, I feel more strongly than before that this was simply an unfortunate oversight. In the document Ken quoted, F and W are distinguished, but H and N are not. In the '90s, East Asian versions of Office and RichEdit were on my radar, and all of them handled halfwidth katakana as ID for line-breaking purposes. That's quite understandable given the number of code points to work on, given the priority of halfwidth katakana, and given the difference between "what line breaking should be" and UAX#14 as Ken noted, but writing it up as a document doesn't look like an easy task.
>
> Koji,
>
> kana are special in that they are not shared among languages. From that perspective, there's nothing wrong with having a "general purpose" algorithm support the rules of the target language (unless that would add undue complexity, which isn't a consideration here).
>
> Based on the data presented informally here in postings, I find your conclusion (oversight) quite believable. The task would therefore be to present the same data in a more organized fashion as part of a formal proposal. Should be doable.
>
> I think you'd want to focus on a survey of modern practice in implementations (and if you have data on some of them going back to the '90s, so much the better).
>
> From the historical analysis it's clear that there was a desire to create assignments that didn't introduce random inconsistencies between LB and EAW properties, but that kind of self-consistency check just makes sure that all characters of some group defined by the intersection of property subsets are treated the same (unless there's an overriding reason to differentiate within the group). It seems entirely plausible that this process misfired for the characters in question - the more likely so, given that the earliest drafts of the tables were based on an implementation also being created by MS around the same time. That makes any difference from other MS products even more likely to be an oversight.
>
> I do want to help the UTC establish a precedent of getting changes like that endorsed by a representative sample of implementers and key external standards (where applicable; in this case that would be CSS), to avoid the chance of creating undue disruption (and to increase the chance that the resulting modified algorithm is actually usable off-the-shelf, for example for "default" or "unknown language" type scenarios).
>
> Hence my insistence that you go out and drum up support. But it looks like this should be relatively easy, as there seems to be no strong case for maintaining the status quo, other than that it is the status quo.
>
> A./
>
>> I agree that implementers and the CSS WG should be involved, but given that IE and FF have already tailored, and all MS products as well, I guess it should not be too hard. I'm on the Chrome team now, and the only problem for me in fixing it in Chrome is to justify why Chrome wants to tailor rather than fixing UAX#14 (and the bug priority...)
>>
>> Either Makoto or I can bring it up to the CSS WG and get back to you.
>>
>> /koji
>>
>> On Sat, May 2, 2015 at 4:12 AM, Asmus Freytag (t) wrote:
>>
>> Thank you, Ken, for your dedicated archeological efforts.
>>
>> I would like to emphasize that, at the time, UAX#14 reflected observed behavior, in particular (but not exclusively) for MS products, some of which (at the time) used an LB algorithm that effectively matched an untailored UAX#14.
>>
>> However, recently, the W3C has spent considerable effort looking into different layout-related algorithms and specifications. If, in that context, a consensus approach is developed that would point to a better "default" behavior for untailored UAX#14-style line breaking, I would regard that as a critical mass of support to allow the UTC to consider tinkering with such a long-standing set of property assignments.
>>
>> This would be true especially if it can be demonstrated that (other than matching legacy behavior) there's no context that would benefit from the existing classification. I note that this was something several posters implied.
>>
>> So, if implementers of the legacy behavior are amenable to achieving it by tailoring, and if the change augments the number of situations where untailored UAX#14-style line breaking can be used, that would be a win that might offset the cost of a disruptive change.
>>
>> We've heard arguments why the proposed change is technically superior for Japanese. We now need to find out whether there are contexts where a change would adversely affect users/implementers. Following that, we would look for endorsements of the proposal from implementers or other standards organizations such as the W3C (and, if at all possible, agreement from those implementers who use the untailored algorithm now). With these three preconditions in place, I would support an effort of the UTC to revisit this question.
>>
>> A./
>>
>> On 5/1/2015 9:48 AM, Ken Whistler wrote:
>>> Suzuki-san,
>>>
>>> On 5/1/2015 8:25 AM, suzuki toshiya wrote:
>>>> Excuse me, is there any record of the discussion of the UAX#14 class for halfwidth katakana from 15 years ago? If there is, I want to see a sample text (of halfwidth katakana) and the expected layout result for it.
>>>
>>> The *founding* document for the UTC discussion of the initial Line_Break property values 15 years ago was:
>>>
>>> http://www.unicode.org/L2/L1999/99179.pdf
>>>
>>> and the corresponding table draft (before approval and conversion into the final format that was published with UTR #14 -- later UAX #14) was:
>>>
>>> http://www.unicode.org/L2/L1999/99180.pdf
>>>
>>> There is nothing different or surprising in terms of values there. The halfwidth katakana were lb=AL and the fullwidth katakana were lb=ID in that earliest draft, as of 1999.
>>>
>>> What is new information, perhaps, is the explicit correlation that can be found in those documents with the East_Asian_Width properties, and the explanation in L2/99-179 that the EAW property values were explicitly used to make distinctions for the initial LB values.
>>>
>>> There are no sample texts or expected layout results from that time period, because that was not the basis for the original UTC decisions on any of this. Initial LB values were generated from existing General_Category and EAW values, using general principles. They were not generated by examining and specifying in detail the line breaking behavior for every single script in the standard, and then working back from those detailed specifications to attempt to create a universal specification that would replicate all of that detailed behavior. Such an approach would have been nearly impossible, given the state of all the data, and might have taken a decade to complete.
>>>
>>> That said, Japanese line breaking was no doubt considered as part of the overall background, because the initial design for UTR #14 was informed by experience in implementing line breaking algorithms at Microsoft in the '90s.
>>>
>>>> You commented that the UAX#14 class should not be changed, but that tailoring the line breaking behaviour would solve the problem (as Firefox and IE11 did). However, some developers may wonder: "there might be a reason why the UTC put halfwidth katakana in AL - without understanding it, we could not determine whether the proposed tailoring should be enabled always, or only for a specific environment (e.g. locale, surrounding text)".
>>>
>>> See above, in L2/99-179. *That* was the justification. It had nothing to do with specific environment, locale, or surrounding text.
>>>
>>>> If the UTC can supply the "expected layout result for halfwidth katakana (used to define the class in the current UAX#14)", it would be helpful for developers evaluating the proposed tailoring algorithm.
>>>
>>> UAX #14 was never intended to be a detailed, script-by-script specification of line layout results. It is a default, generic, universal algorithm for line breaking that does a decent, generic job of line breaking in generic contexts without tailoring or specific knowledge of the language, locale, or typographical conventions in use.
>>>
>>> UAX #14 is not a replacement for a full specification of kinsoku rules for Japanese, in particular. Nor is it intended as any kind of replacement for JIS X 4051.
>>>
>>> Please understand this: UAX #14 does *NOT* tell anyone how Japanese text *should* line break. Instead, it is Japanese typographers, users and standardizers who tell implementers of line break algorithms for Japanese what the expectations for Japanese text should be, and in what contexts. It is then the job of the UTC and of the platform and application vendors to negotiate the details: which part of that expected behavior makes sense to cover by tweaking the default line-breaking algorithm and the Line_Break property values for Unicode characters, which part makes sense to cover by adjusting commonly accessible and agreed-upon tailoring behavior (or public standards like CSS), and finally which part should instead be addressed by value-added, proprietary implementations of high-end publishing software.
>>>
>>> Regards,
>>>
>>> --Ken
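For readers who want to inspect the property values under discussion, the Line_Break classes are directly queryable. A minimal sketch, assuming ICU4J (the class name LbClass is mine; the output reflects whatever Unicode version the ICU on the classpath implements, which matters given the change proposed in this thread):

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UProperty;

public class LbClass {
    // Returns the short Line_Break class alias (e.g. "AL", "ID") for a code point.
    static String lb(int cp) {
        int v = UCharacter.getIntPropertyValue(cp, UProperty.LINE_BREAK);
        return UCharacter.getPropertyValueName(UProperty.LINE_BREAK, v,
                UProperty.NameChoice.SHORT);
    }

    public static void main(String[] args) {
        // Per the thread, the fullwidth letter is ID while the halfwidth
        // form was assigned AL - the inconsistency being reported.
        System.out.println("U+30AB KATAKANA LETTER KA:           " + lb(0x30AB));
        System.out.println("U+FF76 HALFWIDTH KATAKANA LETTER KA: " + lb(0xFF76));
    }
}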
From costello at mitre.org Thu May 7 07:46:03 2015
From: costello at mitre.org (Costello, Roger L.)
Date: Thu, 7 May 2015 12:46:03 +0000
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
Message-ID:

Hi Folks,

The JSON specification says that a character may be escaped using this notation: \uXXXX (XXXX are four hex digits)

However, not every sequence of four hex digits corresponds to a Unicode character.

Are there tools to scan a JSON document to detect the presence of \uXXXX, where XXXX does not correspond to any Unicode character?

/Roger

From doug at ewellic.org Thu May 7 12:49:17 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 07 May 2015 10:49:17 -0700
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
Message-ID: <20150507104917.665a7a7059d7ee80bb4d670165c8327d.9376fc2991.wbe@email03.secureserver.net>

"Costello, Roger L." wrote:

> Are there tools to scan a JSON document to detect the presence of \uXXXX, where XXXX does not correspond to any Unicode character?

A tool like this would need to scan the Unicode Character Database, for some given version, to determine which code points have been allocated to a coded character in that version and which have not.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From mark at macchiato.com Thu May 7 13:33:54 2015
From: mark at macchiato.com (Mark Davis)
Date: Thu, 7 May 2015 11:33:54 -0700
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To: <20150507104917.665a7a7059d7ee80bb4d670165c8327d.9376fc2991.wbe@email03.secureserver.net>
References: <20150507104917.665a7a7059d7ee80bb4d670165c8327d.9376fc2991.wbe@email03.secureserver.net>
Message-ID:

The simplest approach would be to use ICU in a little program that scans the file. For example, you could write a little Java program that would scan the file, turn any sequence of (\uXXXX)+ into a String, and then test that string with:

static final UnicodeSet OK = new UnicodeSet("[^[:unassigned:][:surrogate:]]").freeze();
...
// inside the scanning function
boolean isOk = OK.containsAll(slashUString);

It is key that it grabs the entire sequence of \uXXXX in a row; otherwise it will get the wrong answer.

Mark

« Il meglio è l'inimico del bene »

From senn at maya.com Thu May 7 14:23:49 2015
From: senn at maya.com (Jeff Senn)
Date: Thu, 7 May 2015 15:23:49 -0400
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References: <20150507104917.665a7a7059d7ee80bb4d670165c8327d.9376fc2991.wbe@email03.secureserver.net>
Message-ID: <48A335D4-6350-47E2-AB9D-DB1CBA19D9CA@maya.com>

While this may not change the OP's need for such a tool, I read the JSON specification as allowing all code points 0x0000-0xFFFF regardless of whether they map to "valid" Unicode characters.

The allowed use of quoted UTF-16 surrogate pairs for characters with code points > 0xFFFF (without also specifying that unpaired surrogates are invalid) is troubling on the margin, and complicates such a validation.

Another complication is that a "JSON document" might itself be non-ASCII (UTF-8, -16 or -32) and have Unicode characters as literals within quoted strings... Not to mention the ambiguous case of a surrogate pair where half is literal and the other half quoted...
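Fleshing Mark's fragment out into something runnable makes the pairing point concrete. A sketch, assuming ICU4J; the sample JSON literal and the class name ScanEscapes are mine, and a real tool would also need to address Jeff's caveat about literal (non-escaped) halves of surrogate pairs:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.ibm.icu.text.UnicodeSet;

public class ScanEscapes {
    // A maximal run of \uXXXX escapes, so escaped surrogate pairs stay together.
    private static final Pattern RUN = Pattern.compile("(?:\\\\u[0-9A-Fa-f]{4})+");
    private static final UnicodeSet OK =
            new UnicodeSet("[^[:unassigned:][:surrogate:]]").freeze();

    public static void main(String[] args) {
        String json = "{\"a\":\"\\uD83D\\uDE00\",\"b\":\"\\uFFFF\",\"c\":\"\\uDEAD\"}";
        Matcher m = RUN.matcher(json);
        while (m.find()) {
            StringBuilder sb = new StringBuilder();
            for (String esc : m.group().split("(?=\\\\u)")) {
                if (esc.isEmpty()) continue;
                sb.append((char) Integer.parseInt(esc.substring(2), 16));
            }
            // containsAll iterates by code point: a properly paired escape
            // becomes one supplementary code point, while a lone surrogate or
            // an unassigned code point makes the test fail.
            System.out.println(m.group() + " -> "
                    + (OK.containsAll(sb.toString()) ? "ok" : "suspect"));
        }
    }
}

With this sample input the program flags \uFFFF (unassigned - a noncharacter) and \uDEAD (a lone surrogate) but accepts the \uD83D\uDE00 pair, which decodes to the assigned code point U+1F600.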
From daniel.buenzli at erratique.ch Thu May 7 14:35:00 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Thu, 7 May 2015 21:35:00 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References:
Message-ID:

On Thursday, 7 May 2015 at 14:46, Costello, Roger L. wrote:

> The JSON specification says that a character may be escaped using this notation: \uXXXX (XXXX are four hex digits)
>
> However, not every sequence of four hex digits corresponds to a Unicode character.

If we refer to the wording of RFC 7159, they are using imprecise terminology. They mean "any code point in U+0000 to U+FFFF" (since you need escaped surrogate pairs to be able to escape scalar values that are not in the BMP). You can understand their definition of a "character that may be escaped" from this sentence of section 7 [1]:

"Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point."

However, if you are concerned about wrong surrogate sequences or lone surrogate characters (about which the standard sadly has nothing to say [2]), I have written a best-effort JSON parser [3] that reports them and allows you to continue by replacing the offending escape sequences with U+FFFD. There's a test command-line tool named jsontrip in the distribution that allows you, among other things, to report these errors. For example:

> echo '["\uDEAD"]' | jsontrip
-:1.2-1.8: illegal escape, U+DEAD lone low surrogate

Best,

Daniel

[1] https://tools.ietf.org/html/rfc7159#section-7
[2] https://tools.ietf.org/html/rfc7159#section-8.2
[3] http://erratique.ch/software/jsonm

From daniel.buenzli at erratique.ch Thu May 7 14:59:11 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Thu, 7 May 2015 21:59:11 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To: <48A335D4-6350-47E2-AB9D-DB1CBA19D9CA@maya.com>
References: <20150507104917.665a7a7059d7ee80bb4d670165c8327d.9376fc2991.wbe@email03.secureserver.net> <48A335D4-6350-47E2-AB9D-DB1CBA19D9CA@maya.com>
Message-ID: <5CAE0752A9EF48178D1EFAD08B817590@erratique.ch>

On Thursday, 7 May 2015 at 21:23, Jeff Senn wrote:

> Not to mention the ambiguous case of a surrogate pair where half is literal and the other half quoted...

I don't think this is an issue. It's not ambiguous: the standard says that JSON text shall be encoded in UTF-8, UTF-16 or UTF-32, so what you get in this case is a (UTF-16) character stream decoding error.
Best,

Daniel

From markus.icu at gmail.com Thu May 7 14:59:54 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 7 May 2015 12:59:54 -0700
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References:
Message-ID:

I assume that the JSON spec deliberately allows anything that Java and JavaScript allow. In particular, there is no requirement for a Java String or JavaScript string to contain "text", or well-formed UTF-16, or only assigned characters. Some code stores binary data (a sequence of arbitrary 16-bit unsigned integers) in a "string", just because it is easy and fairly efficient to transport.

You should "validate" *text* only when you are certain that it is indeed text. And when you do validate, you might want to be narrower than "assigned character"; for example, you might require Unicode identifiers or XML NMTOKENs or whatever. Also remember that "assigned" and "identifier" and the like depend on the version of Unicode your library currently implements.

markus

From daniel.buenzli at erratique.ch Thu May 7 15:29:27 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Thu, 7 May 2015 22:29:27 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References:
Message-ID:

On Thursday, 7 May 2015 at 21:59, Markus Scherer wrote:

> I assume that the JSON spec deliberately allows anything that Java and JavaScript allow. In particular, there is no requirement for a Java String or JavaScript string to contain "text", or well-formed UTF-16, or only assigned characters. Some code stores binary data (a sequence of arbitrary 16-bit unsigned integers) in a "string", just because it is easy and fairly efficient to transport.
>
> You should "validate" *text* only when you are certain that it is indeed text.

Section 8.2 [1] of the spec specifically says that only strings that represent sequences of Unicode scalar values (they say "characters") are interoperable, and that strings that do not represent such sequences, like "\uDEAD", can lead to unpredictable behaviour.

If you want to transmit binary data reliably in JSON, you must apply some form of binary-to-Unicode-scalar-value encoding (as in most text-based interchange formats).

Best,

Daniel

[1] https://tools.ietf.org/html/rfc7159#section-8.2

From verdy_p at wanadoo.fr Thu May 7 19:16:25 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 8 May 2015 02:16:25 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References:
Message-ID:

It would be more exact to say that JSON strings, just like strings in JavaScript, Java and many other programming languages, are just binary streams of 16-bit code units. The transport syntax of JSON does not even require that the textual syntax itself be encoded in UTF-16; in most cases it will be transported as UTF-8. So before processing a "text/json" content type, you first have to determine an appropriate character encoding for decoding the syntax (in HTTP you would use a MIME header to specify the charset effectively used, but the "text/json" MIME type defaults to UTF-8). The JSON processor will then decode this text and remap it to an internal UTF-16 encoding (for characters that are not escaped), and the "\uXXXX" escapes will be decoded to plain 16-bit code units.

The result will be a stream of 16-bit code units, which can then be output externally and encoded or stored in any convenient encoding that preserves this stream, EVEN if it is not valid UTF-16. If you need UTF-16 validation, that is not the job of JSON itself (or of Java or JavaScript or the like) but depends on the application using the JSON data: some applications will reject the stream as invalid because they expect their input to be a valid UTF (not necessarily UTF-16 or UTF-8), or they may restrict the supported character set even further (e.g. to just ASCII), or support other encodings such as the GSM encoding for SMS, or just use the lowest 8 bits of each code unit.

JSON by itself is neutral; its syntax just assumes that any binary stream of 16-bit code units is encodable and transportable. It could even be used to transport executable binary code or bitmap image data (such as JPEG or PNG), provided that there's a way to represent the effective binary length (when it is not an exact multiple of 16 bits) with additional data transmitted in the JSON-encoded data. (However, the most common way to carry such binary data in JSON is Base64, for example with the "data:" URL scheme; this scheme is commonly used in CSS, which can be safely embedded in JSON strings.)

I don't think this is a bad thing about JSON: JSON strings are NOT equivalent to text (and not all text is valid Unicode text when it uses specific encodings whose character entities don't have a one-to-one mapping to the UCS - for example, private-use characters that require an external agreement if we want to map them to the PUA of the UCS, or characters that an encoding maps to noncharacters of the UCS), even if there's an "assumed" encoding for the characters that are not reserved by the JSON syntax and not represented as escape sequences (an assumption that also rests on an external agreement about the encoding used in the transport).
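Philippe's Base64 aside is the standard answer for carrying binary payloads in JSON. A minimal sketch using the JDK's own codec (the sample bytes and the "data" field name are mine, purely illustrative):

import java.util.Base64;

public class BinaryInJson {
    public static void main(String[] args) {
        byte[] payload = {(byte) 0xDE, (byte) 0xAD, (byte) 0xBE, (byte) 0xEF};
        // Base64 maps arbitrary bytes onto ASCII, which is always a sequence
        // of Unicode scalar values, so the resulting JSON is interoperable.
        String b64 = Base64.getEncoder().encodeToString(payload);
        System.out.println("{\"data\":\"" + b64 + "\"}"); // {"data":"3q2+7w=="}
        byte[] roundTripped = Base64.getDecoder().decode(b64); // original bytes
    }
}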
From daniel.buenzli at erratique.ch Thu May 7 20:22:01 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Fri, 8 May 2015 03:22:01 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References:
Message-ID: <406345450D52417C9DEE234A6C0662A2@erratique.ch>

On Friday, 8 May 2015 at 02:16, Philippe Verdy wrote:
> It would be more exact to say that JSON strings, just like strings in JavaScript, Java and many other programming languages, are just binary streams of 16-bit code units.

I suggest you have a careful read of RFC 7159, as it specifically implies that this is not the model it supports (albeit using broken, or let's say ambiguous/imprecise, Unicode terminology).

> The JSON processor will then decode this text and remap it to an internal UTF-16 encoding (for characters that are not escaped), and the "\uXXXX" escapes will be decoded to plain 16-bit code units. The result will be a stream of 16-bit code units, which can then be output externally and encoded or stored in any convenient encoding that preserves this stream, EVEN if it is not valid UTF-16.

I don't know where you get this from, but you won't find any mention of it in the standard. We are dealing with text, Unicode scalar values, not encodings. At the risk of repeating myself, read section 8.2 of RFC 7159.

Best,

Daniel

From verdy_p at wanadoo.fr Thu May 7 22:08:21 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 8 May 2015 05:08:21 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To: <406345450D52417C9DEE234A6C0662A2@erratique.ch>
References: <406345450D52417C9DEE234A6C0662A2@erratique.ch>
Message-ID:

The RFC is just informative, not normative, and the effective usage and implementations simply support JSON as plain 16-bit streams, even if the transport syntax requires encoding it as plain text (using some UTF, not necessarily UTF-8, even though that is the default).

Try it yourself: you can perfectly well send JSON text containing '\uFFFF' (a noncharacter) or '\uF800' (an unpaired surrogate), and I've not seen any JSON implementation complain about either; when receiving the JSON stream and using it in JavaScript, you'll see no missing code units, no replaced code units, and no exception either.

From verdy_p at wanadoo.fr Thu May 7 22:12:19 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 8 May 2015 05:12:19 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References: <406345450D52417C9DEE234A6C0662A2@erratique.ch>
Message-ID:

2015-05-08 5:08 GMT+02:00 Philippe Verdy:

> Try it yourself: you can perfectly well send JSON text containing '\uFFFF' (a noncharacter) or '\uF800' (an unpaired surrogate), and I've not seen any JSON implementation complain about either; when receiving the JSON stream and using it in JavaScript, you'll see no missing code units, no replaced code units, and no exception either.

Typo: replace F800 with D800, of course.

From petercon at microsoft.com Fri May 8 00:14:41 2015
From: petercon at microsoft.com (Peter Constable)
Date: Fri, 8 May 2015 05:14:41 +0000
Subject: Script / font support in Windows 10
Message-ID:

This page on MSDN that provides an overview of Windows support for different scripts has now been updated for Windows 10:

https://msdnlive.redmond.corp.microsoft.com/en-us/bb688099

Peter

From marc at keyman.com Fri May 8 00:27:27 2015
From: marc at keyman.com (Marc Durdin)
Date: Fri, 8 May 2015 05:27:27 +0000
Subject: Script / font support in Windows 10
In-Reply-To:
References:
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A73B83158@federation.tavultesoft.local>

That page doesn't appear to be visible outside Microsoft. The public link is https://msdn.microsoft.com/en-us/bb688099 I think.

Marc

From petercon at microsoft.com Fri May 8 00:29:18 2015
From: petercon at microsoft.com (Peter Constable)
Date: Fri, 8 May 2015 05:29:18 +0000
Subject: Script / font support in Windows 10
In-Reply-To:
References:
Message-ID:

Oops... my bad: maybe it isn't on the live servers yet. It will be soon. I'll update with the public link when it is.

From costello at mitre.org Fri May 8 04:27:03 2015
From: costello at mitre.org (Costello, Roger L.)
Date: Fri, 8 May 2015 09:27:03 +0000
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References: <406345450D52417C9DEE234A6C0662A2@erratique.ch>
Message-ID:

Philippe Verdy wrote:

> implementations just support JSON as plain 16-bit streams
> Try it yourself: you can perfectly well send JSON text containing '\uFFFF' (a noncharacter) or '\uD800' (an unpaired surrogate), and I've not seen any JSON implementation complain about either

Okay, I gave it a try. I created this string, which contains binary data (a sequence of arbitrary unsigned integers):

"
" When I validated that string against this JSON Schema: { "type" : "string" } using this online validator: https://json-schema-validator.herokuapp.com/ I got an error: Invalid JSON: parse error, line 1 I am pretty sure that Daniel is correct, JSON cannot contain arbitrary bit streams. ? The RFC is just informative not normative Interesting! What does that mean? JSON vendors are free to ignore the JSON RFC and do as they see fit? /Roger From: verdyp at gmail.com [mailto:verdyp at gmail.com] On Behalf Of Philippe Verdy Sent: Thursday, May 07, 2015 11:08 PM To: Daniel B?nzli Cc: Unicode at unicode.org; Costello, Roger L.; Markus Scherer Subject: Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? The RFC is jsut informative not normative, and thez effective usage and implementations just support JSON as plain 16-bit streams, even if the transport syntax requires encoding it in plain-text (using some UTF, not necessarily UTF-8 even if this is the default). Try by yourself, you can perfectly send JSON text containing '\uFFFF' (non-character) or '\uF800' (unpaired surrogate) and I've not seen any JSON implementation complaining about one or the other, when receiving the JSON stream and using it in Javascript, you'll see no missing code unit or replaced code units and no exception as well. 2015-05-08 3:22 GMT+02:00 Daniel B?nzli >: Le vendredi, 8 mai 2015 ? 02:16, Philippe Verdy a ?crit : > It would be more exact to say that JSON strings, just like strings in Javascript and Java or many programming languages are just binary streams of 16-bit code units. I suggest you have a careful read at RFC 7159 as it specifically implies that this is not the model it supports (albeit using broken or let's say ambiguous/imprecise Unicode terminology). > Then the JSON processor will decode this text and will remap it to an internal UTF-16 encoding (for characters that are not escaped) and the "\uXXXX" will be decoded as plain 16-bit code units. The result will be a stream of 16-bit code units, which can then externally be outpout and encoded or stored in any convenient encoding that preserves this stream, EVEN if this is not valid UTF-16. I don't know where you get this from but you won't find any mention of this in the standard. We are dealing with text, Unicode scalar values, not encodings. At the risk of repeating myself, read section 8.2 of RFC 7159. Best, Daniel -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.buenzli at erratique.ch Fri May 8 06:04:08 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Fri, 8 May 2015 13:04:08 +0200 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: References: <406345450D52417C9DEE234A6C0662A2@erratique.ch> Message-ID: Le vendredi, 8 mai 2015 ? 05:08, Philippe Verdy a ?crit : > The RFC is jsut informative not normative, RFC 7159 is not informational, it is a proposed standard. > Try by yourself, you can perfectly send JSON text containing '\uFFFF' (non-character) or '\uF800' (unpaired surrogate) and I've not seen any JSON implementation complaining about one or the other, Well now you have (mine). The RFC is very clear that we are dealing with *text-based* data not *binary* data. 
Maybe programming languages that represent their Unicode strings as possibly invalid UTF-16 sequences will happily ingest this, but as section 8.2 mentions, that may not be the case everywhere: software receiving these values "might return different values for the length of a string value or even suffer fatal runtime exceptions".

Best,

Daniel

From verdy_p at wanadoo.fr Fri May 8 06:48:38 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 8 May 2015 13:48:38 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References: <406345450D52417C9DEE234A6C0662A2@erratique.ch>
Message-ID:

JSON came initially from JavaScript, and it is used extensively with JavaScript. My tests with their JSON parser show that any string that is valid for JavaScript is also valid in JSON (no exception raised, no replaced characters, no deleted characters, even if there are unpaired surrogates or noncharacters like '\uFFFF'). The RFC is deviating from the currently running implementations.

From verdy_p at wanadoo.fr Fri May 8 06:57:20 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 8 May 2015 13:57:20 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References: <406345450D52417C9DEE234A6C0662A2@erratique.ch>
Message-ID:

2015-05-08 11:27 GMT+02:00 Costello, Roger L.:

> Okay, I gave it a try. I created this string, which contains binary data (a sequence of arbitrary unsigned integers):
>
> "
> ------------------------------
> ??}g??
> "

I did not say that these data did not have to be properly escaped. With escaping (\uXXXX), it works with arbitrary sequences of 16-bit code units.
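The disagreement above is easy to observe from Java, whose String type happily holds a lone surrogate but whose UTF-8 encoder will not pass it through. A sketch - behavior as I understand the JDK's defaults; other stacks, such as the Go library mentioned below, substitute U+FFFD instead:

import java.nio.charset.StandardCharsets;

public class LoneSurrogate {
    public static void main(String[] args) {
        // What a permissive parser might hand you for the escape "\uD800".
        String s = "\uD800";
        System.out.println(s.length()); // 1: the code unit is retained in memory
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // A lone surrogate cannot be expressed in well-formed UTF-8, so the
        // encoder substitutes its replacement byte '?' (0x3F).
        System.out.println(Integer.toHexString(utf8[0] & 0xFF)); // 3f
    }
}

This is exactly the interoperability point from section 8.2: the value survives inside one runtime, but changes as soon as it has to cross an encoding boundary.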
Taking a random one mentioned on that page leads me to http://golang.org/pkg/encoding/json/ in which they say that they replace invalid UTF-16 surrogate pairs by U+FFFD. This is really not very surprising since apparently go's strings as text are UTF-8 encoded so when you need to produce your results as UTF-8 then you don't have a lot of solutions... error and/or U+FFFD. In any case deviating or not, that's for good since it would be insane to impose JavaScript's string as a data structure for an interchange format that intents to be universal and *textual*. Best, Daniel From petercon at microsoft.com Fri May 8 09:15:55 2015 From: petercon at microsoft.com (Peter Constable) Date: Fri, 8 May 2015 14:15:55 +0000 Subject: Script / font support in Windows 10 In-Reply-To: References: Message-ID: I think this is the right public link: https://msdn.microsoft.com/en-us/goglobal/bb688099.aspx From: Peter Constable Sent: Thursday, May 7, 2015 10:29 PM To: Peter Constable; unicode at unicode.org Subject: RE: Script / font support in Windows 10 Oops... my bad: maybe it isn't on live servers yet. It will be soon. I'll update with the public link when it is. From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Peter Constable Sent: Thursday, May 7, 2015 10:15 PM To: unicode at unicode.org Subject: Script / font support in Windows 10 This page on MSDN that provides an overview of Windows support for different scripts has now been updated for Windows 10: https://msdnlive.redmond.corp.microsoft.com/en-us/bb688099 Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Fri May 8 09:41:49 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 08 May 2015 07:41:49 -0700 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode =?UTF-8?Q?character=3F?= Message-ID: <20150508074149.665a7a7059d7ee80bb4d670165c8327d.c8e098d352.wbe@email03.secureserver.net> I interpreted Roger Costello's original question literally, that he wanted to find instances of '\uXXXX' that do not represent an ASSIGNED Unicode character. Apologies if this discussion is really about something else. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From mark at macchiato.com Fri May 8 11:04:00 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Fri, 8 May 2015 09:04:00 -0700 Subject: Script / font support in Windows 10 In-Reply-To: References: Message-ID: Thanks! Mark *? Il meglio ? l?inimico del bene ?* On Fri, May 8, 2015 at 7:15 AM, Peter Constable wrote: > I think this is the right public link: > > > > https://msdn.microsoft.com/en-us/goglobal/bb688099.aspx > > > > > > *From:* Peter Constable > *Sent:* Thursday, May 7, 2015 10:29 PM > *To:* Peter Constable; unicode at unicode.org > *Subject:* RE: Script / font support in Windows 10 > > > > Oops? my bad: maybe it isn?t on live servers yet. It will be soon. I?ll > update with the public link when it is. > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org > ] *On Behalf Of *Peter Constable > *Sent:* Thursday, May 7, 2015 10:15 PM > *To:* unicode at unicode.org > *Subject:* Script / font support in Windows 10 > > > > This page on MSDN that provides an overview of Windows support for > different scripts has now been updated for Windows 10: > > > > https://msdnlive.redmond.corp.microsoft.com/en-us/bb688099 > > > > > > > > Peter > -------------- next part -------------- An HTML attachment was scrubbed... 
From richard.wordingham at ntlworld.com Fri May 8 11:49:47 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 8 May 2015 17:49:47 +0100 Subject: Script / font support in Windows 10 In-Reply-To: References: Message-ID: <20150508174947.2fca36c4@JRWUBU2> On Fri, 8 May 2015 14:15:55 +0000 Peter Constable wrote: > I think this is the right public link: > > https://msdn.microsoft.com/en-us/goglobal/bb688099.aspx Does this confirm the intention of Microsoft that at some stage the Universal Shaping Engine (USE) in Windows 10 will support the Tai Tham script? In February we discovered that the USE didn't support syllable-final SAKOT+consonant - the commonest and eponymous use of U+1A60 TAI THAM SIGN SAKOT, which may well be the commonest character in the Tai Tham script. For example, we can't write the name of the city of 'Chiang Rai' in the Tai Tham script using the USE. Richard. From Andrew.Glass at microsoft.com Fri May 8 12:16:01 2015 From: Andrew.Glass at microsoft.com (Andrew Glass (WINDOWS)) Date: Fri, 8 May 2015 17:16:01 +0000 Subject: Script / font support in Windows 10 In-Reply-To: <20150508174947.2fca36c4@JRWUBU2> References: <20150508174947.2fca36c4@JRWUBU2> Message-ID: Hi Richard, I agree that there is some work to be done to ensure correct display of Tai Tham. That work may involve changes to USE in a future update. We will have a panel on Universal Shaping at the upcoming IUC conference. That will be a good opportunity for a discussion between implementers and font developers. If you are able to attend, that would be great. If not, we can certainly go through the proposed changes you have sent. Cheers, Andrew -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham Sent: Friday, May 8, 2015 9:50 AM To: unicode at unicode.org Subject: Re: Script / font support in Windows 10 On Fri, 8 May 2015 14:15:55 +0000 Peter Constable wrote: > I think this is the right public link: > > https://msdn.microsoft.com/en-us/goglobal/bb688099.aspx Does this confirm the intention of Microsoft that at some stage the Universal Shaping Engine (USE) in Windows 10 will support the Tai Tham script? In February we discovered that the USE didn't support syllable-final SAKOT+consonant - the commonest and eponymous use of U+1A60 TAI THAM SIGN SAKOT, which may well be the commonest character in the Tai Tham script. For example, we can't write the name of the city of 'Chiang Rai' in the Tai Tham script using the USE. Richard. From richard.wordingham at ntlworld.com Fri May 8 15:27:18 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 8 May 2015 21:27:18 +0100 Subject: Script / font support in Windows 10 In-Reply-To: References: <20150508174947.2fca36c4@JRWUBU2> Message-ID: <20150508212718.2f6a48b6@JRWUBU2> On Fri, 8 May 2015 17:16:01 +0000 "Andrew Glass (WINDOWS)" wrote: > I agree that there is some work to be done to ensure correct display > of Tai Tham. That work may involve changes to USE in a future update. That's as I understood it, which is why I was surprised by the degree of commitment in the overview. I did wonder if the overview had been written long ago, so its author was unaware of there being issues with USE and Tai Tham. For example, I got the impression that you had contemplated cloning USE and modifying that clone for Tai Tham, so as to keep the USE simpler. (In the meantime, it may make sense to use the USE for Tai Tham, and let the font clean up the inappropriate dotted circles.
I currently do that for applications that use old versions of HarfBuzz.) Also, I hadn't expected you to commit to a timetable. Richard. From richard.wordingham at ntlworld.com Fri May 8 15:47:46 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 8 May 2015 21:47:46 +0100 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: References: <406345450D52417C9DEE234A6C0662A2@erratique.ch> Message-ID: <20150508214746.7570e528@JRWUBU2> On Fri, 8 May 2015 05:08:21 +0200 Philippe Verdy wrote: > Try by yourself, you can perfectly send JSON text containing '\uFFFF' > (non-character) or '\uF800' (unpaired surrogate) and I've not seen > any JSON implementation complaining about one or the other, when > receiving the JSON stream and using it in Javascript, you'll see no > missing code unit or replaced code units and no exception as well. Unicode Consortium standards and recommendations allow non-characters to be sent; as far as I can make out, they are just not to be thought of as unstandardised graphic characters. Richard. From doug at ewellic.org Fri May 8 17:37:57 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 08 May 2015 15:37:57 -0700 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...) Message-ID: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> Richard Wordingham wrote: >> Try by yourself, you can perfectly send JSON text containing '\uFFFF' >> (non-character) or '\uF800' (unpaired surrogate) and I've not seen >> any JSON implementation complaining about one or the other, when >> receiving the JSON stream and using it in Javascript, you'll see no >> missing code unit or replaced code units and no exception as well. > > Unicode Consortium standards and recommendations allow non-characters > to be sent; as far as I can make out, they are just not to be thought > of as unstandardised graphic characters. As I understand it, from a purely Unicode standpoint, there are differences here between noncharacters and unpaired surrogates. Noncharacters are Unicode scalar values, while unpaired surrogates are not. This means noncharacters may appear in a well-formed UTF-8, -16, or -32 string, while unpaired surrogates may not. They may both be part of a "Unicode string" which does not claim to be in any given encoding form. Authoritative corrections are welcome to help solidify my understanding. I don't wish to get involved in debates over JSON. I've read RFC 7159 and I know what it says. -- Doug Ewell | http://ewellic.org | Thornton, CO From daniel.buenzli at erratique.ch Fri May 8 19:26:59 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Sat, 9 May 2015 02:26:59 +0200 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...) In-Reply-To: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> Message-ID: <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> Le samedi, 9 mai 2015 à 00:37, Doug Ewell a écrit : > Noncharacters are Unicode scalar values, Noncharacters are Unicode scalar values, by definitions D14 and D76. > while unpaired surrogates are not. No surrogate code point is a Unicode scalar value, by D71, D73 and D76. > This means noncharacters may appear in a well-formed UTF-8, -16, or > -32 string, I take "appear" to mean "be encoded".
Yes, any Unicode encoding form allows all scalar values to be interchanged, by D79. (However, noncharacters are not designed to be openly interchanged; see "Restricted interchange" on p. 31 of 7.0.0.) > while unpaired surrogates may not. No surrogate code point, *paired or not*, can be encoded in UTF-{8,16,32}, by D92, D91 and D90. All these encoding forms, by definition, assign only Unicode scalar values to code unit sequences (see also the already mentioned p. 31, which clarifies this). However, in UTF-16, code unit sequences may contain surrogate pairs (which taken together represent a Unicode scalar value). > They may both be part of a "Unicode string" which does not claim to be in any given encoding > form. I'm not sure what you mean by that, so I'll let someone else answer. Best, Daniel From verdy_p at wanadoo.fr Fri May 8 19:33:20 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 02:33:20 +0200 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: <821476CFD30C4A6C95CA6319394C723C@erratique.ch> References: <406345450D52417C9DEE234A6C0662A2@erratique.ch> <821476CFD30C4A6C95CA6319394C723C@erratique.ch> Message-ID: 2015-05-08 14:32 GMT+02:00 Daniel Bünzli : > Le vendredi, 8 mai 2015 à 13:48, Philippe Verdy a écrit : > > JSON came initially from Javascript, and it is used extensively with > Javascript. > > But not *only*, for a long time now. > > > The RFC is deviating from the currently running implementations. > > Well, did you test them all? There's quite a big list here: > http://www.json.org. Taking a random one mentioned on that page leads me > to http://golang.org/pkg/encoding/json/ in which they say that they > replace invalid UTF-16 surrogate pairs by U+FFFD. This is really not very > surprising, since apparently Go's strings as text are UTF-8 encoded, so when > you need to produce your results as UTF-8 you don't have a lot of > solutions... error and/or U+FFFD. I've already said that JSON is UTF-8 encoded by default, but this does not mean that JSON invalidates the escape sequence '\uD800' isolated in a string. For this reason JSON strings are not restricted by the textual encoding of their syntactic representation. So no error is returned, there is no replacement by U+FFFD, and even unpaired surrogates are possible, provided that they are escaped. Basically, JSON strings remain equivalent to Javascript strings, where '\uD800' is also a perfectly valid "string". I make the difference between a "string" and plain text. And if the RFC had not been so confusing by mixing terms (notably the term "code point"), it might have become a standard. For now it is just a tentative attempt to standardize it, but it does not work with existing implementations, which have treated JSON since the beginning as a data serialization format based on Javascript syntax: only the items that are not pure data are removed, such as functions/methods and more complex objects like Javascript regexp literals (functionally equivalent to an object constructor) and object references, keeping only strings, numbers, and two structures: ordered arrays and unordered associative arrays (also called dictionaries, which also subsume ordered arrays treated as associative arrays with number keys, thus reducing everything to a single effective structure, even if ordered arrays also have a simpler syntactic sugar to represent them more compactly).
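For instance, here is a quick check one can run in a Javascript console (a sketch: this is behavior observed in common engines, not something the RFC guarantees for every JSON implementation):

// Sketch: Javascript's built-in JSON.parse accepts escaped lone
// surrogates and noncharacters without error or substitution.
var s = JSON.parse('"\\uD800 \\uFFFF"');   // lone surrogate + noncharacter
console.log(s.length);                     // 3 -- three 16-bit code units
console.log(s.charCodeAt(0).toString(16)); // "d800" -- kept as-is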
If you mean that the JSON string "\uD800" is invalid, it is no longer a data serialization for Javascript, or for other languages also using JSON as a possible syntax for serializing data into plain text. JSON was created because XML (the alternative) was too verbose and had restrictions in its "text" elements. It seems that the RFC just wants to apply to JSON the same restrictions as found in XML, but this deviates JSON from its objective, and I'm convinced that such restrictions are not enforced at all in many JSON implementations, which do not attempt to validate whether the value of the represented string is valid plain text. JSON only transforms strings into a valid plain-text representation, using an encoding syntax with separators and escape sequences, nothing else. If the RFC wants to add such restrictions, it is mixing two layers: the syntactic (plain-text) layer and the lower layer for the internally represented values, which are just a stream of code units. And the only difference in that case is the behavior for isolated/unpaired surrogates (not restricted in Javascript or many languages defining "strings", but restricted in plain text; JSON is there to offer the serialization scheme allowing strings to be safely converted to plain text). From daniel.buenzli at erratique.ch Fri May 8 20:27:20 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Sat, 9 May 2015 03:27:20 +0200 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: References: <406345450D52417C9DEE234A6C0662A2@erratique.ch> <821476CFD30C4A6C95CA6319394C723C@erratique.ch> Message-ID: Le samedi, 9 mai 2015 à 02:33, Philippe Verdy a écrit : > 2015-05-08 14:32 GMT+02:00 Daniel Bünzli : > > Well, did you test them all? There's quite a big list here: http://www.json.org. Taking a random one mentioned on that page leads me to http://golang.org/pkg/encoding/json/ in which they say that they replace invalid UTF-16 surrogate pairs by U+FFFD. This is really not very surprising, since apparently Go's strings as text are UTF-8 encoded, so when you need to produce your results as UTF-8 you don't have a lot of solutions... error and/or U+FFFD. > > I've already said that JSON is UTF-8 encoded by default, but this does not mean that JSON invalidates the escape sequence '\uD800' isolated in a string. You didn't get what I said. When a parser has just parsed a JSON string and wants to give it back to the programmer using the language's native strings, and these strings happen to be UTF-8 encoded in that language, then in the presence of such lone surrogates you are stuck and need to do something, as you cannot encode them in the UTF-8 string. (I understand that in *your* interpretation this should not happen, since I should define a special data type to represent these JSON strings so that they behave like JavaScript strings; that would indeed be very practical, none of my language's native string tools could be used on it...) Anyways, we are largely OT at this point. Best, Daniel From richard.wordingham at ntlworld.com Fri May 8 22:13:52 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 9 May 2015 04:13:52 +0100 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)
In-Reply-To: <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> Message-ID: <20150509041352.60c24989@JRWUBU2> On Sat, 9 May 2015 02:26:59 +0200 Daniel Bünzli wrote: > Le samedi, 9 mai 2015 à 00:37, Doug Ewell a écrit : > > Noncharacters are Unicode scalar values, > (However, noncharacters are not designed to be openly interchanged; see > "Restricted interchange" on p. 31 of 7.0.0.) That didn't stop their being openly interchanged. > > They may both be part of a "Unicode string" which does not claim to > > be in any given encoding form. > I'm not sure what you mean by that, so I'll let someone else answer. There are a number of phrases whose declared meanings cannot be deduced from the individual words. A UTF-8, UTF-16 or UTF-32 string defines a sequence of scalar values. However, a Unicode 8-bit, 16-bit or 32-bit string is merely a sequence of 8-bit, 16-bit or 32-bit values that may occur in a UTF-8, UTF-16 or UTF-32 string respectively. This definition has some odd consequences: A Unicode 32-bit string is a UTF-32 string, for UTF-32 is not a multi-word encoding. An arbitrary string of unsigned 32-bit values is not in general a Unicode 32-bit string. All strings of unsigned 16-bit values are Unicode 16-bit strings. Not all (Unicode) 16-bit strings are UTF-16 strings. Not all strings of unsigned 8-bit values are Unicode 8-bit strings, and not all Unicode 8-bit strings are UTF-8 strings. I can't think of a practical use for the specific concepts of Unicode 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are essentially the same as 16-bit strings, and Unicode 32-bit strings are UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in pedantry; there are more useful categories of 8-bit strings that are not UTF-8 strings. Richard. From richard.wordingham at ntlworld.com Fri May 8 22:42:13 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 9 May 2015 04:42:13 +0100 Subject: Surrogates and noncharacters In-Reply-To: <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> Message-ID: <20150509044213.28b48ac8@JRWUBU2> On Sat, 9 May 2015 02:26:59 +0200 Daniel Bünzli wrote: > Le samedi, 9 mai 2015 à 00:37, Doug Ewell a écrit : > > This means noncharacters may appear in a well-formed UTF-8, -16, or > > -32 string, > I take "appear" to mean "be encoded". Yes, any Unicode encoding > form allows all scalar values to be interchanged, by D79. > (However, noncharacters are not designed to be openly interchanged; see > "Restricted interchange" on p. 31 of 7.0.0.) That is irrelevant, for JSON is not restricted to open interchange. Richard. From verdy_p at wanadoo.fr Fri May 8 23:13:33 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 06:13:33 +0200 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...) In-Reply-To: <20150509041352.60c24989@JRWUBU2> References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> Message-ID: 2015-05-09 5:13 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > I can't think of a practical use for the specific concepts of Unicode > 8-bit, 16-bit and 32-bit strings.
Unicode 16-bit strings are > essentially the same as 16-bit strings, and Unicode 32-bit strings are > UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in > pedantry; there are more useful categories of 8-bit strings that are > not UTF-8 strings. > And here you're wrong: a 16-bit string is just a sequence of arbitrary 16-bit code units, but a Unicode string (whatever the size of its code units) adds restrictions for validity (the only restriction being in fact that surrogates, when present in 16-bit strings, i.e. UTF-16, must be paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are forbidden). So the concept of "Unicode string" is in fact the same as valid Unicode text: it is a subset of possible strings, restricted by validation rules: - for 8-bit strings (UTF-8) there are other constraints (not all bytes are acceptable, some pairs of bytes are also restricted, and trailing bytes cannot occur alone); - for 16-bit strings (UTF-16), the only constraint is on isolated/unpaired surrogates; - for 32-bit strings (UTF-32), the only constraint is on the two allowed ranges of encoded code points (U+0000..U+D7FF and U+E000..U+10FFFF). For being "plain text" there are additional restrictions: non-characters are also excluded, and only a small subset of controls (basically tabs and newlines) is allowed (the other controls, including U+0000, are restricted to private protocols and not designed for plain text... except specifically in a few legacy 8-bit "charsets" like VISCII or ISO 2022 or Videotext, which need these controls to represent characters as sequences, possibly with contextual encoding). From verdy_p at wanadoo.fr Fri May 8 23:24:36 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 06:24:36 +0200 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: References: <406345450D52417C9DEE234A6C0662A2@erratique.ch> <821476CFD30C4A6C95CA6319394C723C@erratique.ch> Message-ID: 2015-05-09 3:27 GMT+02:00 Daniel Bünzli : > Le samedi, 9 mai 2015 à 02:33, Philippe Verdy a écrit : > > 2015-05-08 14:32 GMT+02:00 Daniel Bünzli (mailto:daniel.buenzli at erratique.ch)>: > > > Well, did you test them all? There's quite a big list here > http://www.json.org. Taking a random one mentioned on that page leads me > to http://golang.org/pkg/encoding/json/ in which they say that they > replace invalid UTF-16 surrogate pairs by U+FFFD. This is really not very > surprising, since apparently Go's strings as text are UTF-8 encoded, so when > you need to produce your results as UTF-8 you don't have a lot of > solutions... error and/or U+FFFD. > > > I've already said that JSON is UTF-8 encoded by default, but this does > not mean that JSON invalidates the escape sequence '\uD800' isolated in a > string. > You didn't get what I said. When a parser has just parsed a JSON string > and wants to give it back to the programmer using the language's native > strings, and these strings happen to be UTF-8 encoded in that language, > then in the presence of such lone surrogates you are stuck and need to do > something, as you cannot encode them in the UTF-8 string. You are not stuck! You can still regenerate a valid JSON output encoded in UTF-8: it will once again use escape sequences (which are also needed if your text contains the quotation marks used to delimit JSON strings in its syntax).
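Concretely (a sketch assuming an engine with the newer "well-formed JSON.stringify" behavior standardized in ES2019; older engines may emit the raw unpaired code unit instead):

// A lone surrogate parsed from JSON can be re-serialized as an
// escape sequence, so the serialized output remains valid UTF-8.
var lone = JSON.parse('"\\uDEAD"'); // one unpaired 16-bit code unit
console.log(JSON.stringify(lone));  // '"\udead"' -- escaped again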
Unlike UTF-8, JSON has never been designed to restrict its strings so that their represented values are only plain text; it is only a serialization of "strings" to valid plain text using a custom syntax. There's absolutely no need to restrict string values to the same validation rules and the same subset as the set of acceptable plain text: this is not the same layer. One is the string level (in fact not bound to any character encoding and not restricted to text), the other is the plain text, and JSON is the adapter/converter between these two representations. Do not mix these two distinct layers. (This is also the case when someone confuses an XML document with its DOM: not the same layer.) From markus.icu at gmail.com Fri May 8 23:37:40 2015 From: markus.icu at gmail.com (Markus Scherer) Date: Fri, 8 May 2015 21:37:40 -0700 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...) In-Reply-To: References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> Message-ID: On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy wrote: > 2015-05-09 5:13 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > >> I can't think of a practical use for the specific concepts of Unicode >> 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are >> essentially the same as 16-bit strings, and Unicode 32-bit strings are >> UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in >> pedantry; there are more useful categories of 8-bit strings that are >> not UTF-8 strings. >> > > And here you're wrong: a 16-bit string is just a sequence of arbitrary > 16-bit code units, but a Unicode string (whatever the size of its code > units) adds restrictions for validity (the only restriction being in fact > that surrogates, when present in 16-bit strings, i.e. UTF-16, must be > paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are > forbidden). > No, Richard had it right. See for example definition D82 "Unicode 16-bit string" in the standard. (Section 3.9 Unicode Encoding Forms, http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf) I agree that the definitions for Unicode 8-bit and 32-bit strings are not particularly useful. For being "plain text" there are additional restrictions: non-characters > are also excluded, and only a small subset of controls (basically tabs and > newlines) is allowed (the other controls, including U+0000, are restricted > to private protocols and not designed for plain text... except > specifically in a few legacy 8-bit "charsets" like VISCII or ISO > 2022 or Videotext, which need these controls to represent characters > as sequences, possibly with contextual encoding). > Where did you find that definition of "plain text"? Unicode just defines "plain text" by contrast with "rich text", which is text with markup or other such structure. There is no limitation of code points associated with that term. http://unicode.org/glossary/#plain_text markus From verdy_p at wanadoo.fr Sat May 9 00:55:17 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 07:55:17 +0200 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)
In-Reply-To: References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> Message-ID: 2015-05-09 6:37 GMT+02:00 Markus Scherer : > On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy wrote: > >> 2015-05-09 5:13 GMT+02:00 Richard Wordingham < >> richard.wordingham at ntlworld.com>: >> >>> I can't think of a practical use for the specific concepts of Unicode >>> 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are >>> essentially the same as 16-bit strings, and Unicode 32-bit strings are >>> UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in >>> pedantry; there are more useful categories of 8-bit strings that are >>> not UTF-8 strings. >>> >> >> And here you're wrong: a 16-bit string is just a sequence of arbitrary >> 16-bit code units, but a Unicode string (whatever the size of its code >> units) adds restrictions for validity (the only restriction being in fact >> that surrogates, when present in 16-bit strings, i.e. UTF-16, must be >> paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are >> forbidden). >> > > No, Richard had it right. See for example definition D82 "Unicode 16-bit > string" in the standard. (Section 3.9 Unicode Encoding Forms, > http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf) > I was right: D82 refers to "UTF-16", which implies the restriction of validity, i.e. NO isolated/unpaired surrogates (but no exclusion of non-characters). I was right; you and Richard were wrong. From verdy_p at wanadoo.fr Sat May 9 00:56:52 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 07:56:52 +0200 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...) In-Reply-To: References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> Message-ID: Note: I used "16-bit string" in my sentence, NOT "Unicode 16-bit string", which I used in the latter part of my sentence (but also including 8-bit and 32-bit for the same restrictions in "Unicode strings")... So no contradiction. 2015-05-09 7:55 GMT+02:00 Philippe Verdy : > > > 2015-05-09 6:37 GMT+02:00 Markus Scherer : > >> On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy >> wrote: > >>> 2015-05-09 5:13 GMT+02:00 Richard Wordingham < >>> richard.wordingham at ntlworld.com>: >>> >>>> I can't think of a practical use for the specific concepts of Unicode >>>> 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are >>>> essentially the same as 16-bit strings, and Unicode 32-bit strings are >>>> UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in >>>> pedantry; there are more useful categories of 8-bit strings that are >>>> not UTF-8 strings. >>>> >>> >>> And here you're wrong: a 16-bit string is just a sequence of arbitrary >>> 16-bit code units, but a Unicode string (whatever the size of its code >>> units) adds restrictions for validity (the only restriction being in fact >>> that surrogates, when present in 16-bit strings, i.e. UTF-16, must be >>> paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are >>> forbidden). >>> >> >> No, Richard had it right. See for example definition D82 "Unicode 16-bit >> string" in the standard.
(Section 3.9 Unicode Encoding Forms, >> http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf) >> > > I was right: D82 refers to "UTF-16", which implies the restriction of > validity, i.e. NO isolated/unpaired surrogates (but no exclusion of > non-characters). > > I was right; you and Richard were wrong. From verdy_p at wanadoo.fr Sat May 9 01:00:34 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 08:00:34 +0200 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...) In-Reply-To: References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> Message-ID: 2015-05-09 6:37 GMT+02:00 Markus Scherer : > Where did you find that definition of "plain text"? > I have not said that Unicode defines what plain text is. It is defined in the RFC describing the MIME type and giving it the name "plain text". > Unicode just defines "plain text" by contrast with "rich text", which is > text with markup or other such structure. There is no limitation of code > points associated with that term. > http://unicode.org/glossary/#plain_text > This is not a definition, or just a mere definition of "Unicode plain text" (i.e. more restrictive than "plain text"). Please don't add restricting/qualifying words ("Unicode") that I did not use in my sentence **on purpose**. Plain text was defined long before Unicode wrote its informative glossary. From richard.wordingham at ntlworld.com Sat May 9 04:59:57 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 9 May 2015 10:59:57 +0100 Subject: Surrogates and noncharacters In-Reply-To: References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> Message-ID: <20150509105957.66267e13@JRWUBU2> On Sat, 9 May 2015 07:55:17 +0200 Philippe Verdy wrote: > 2015-05-09 6:37 GMT+02:00 Markus Scherer : > > > On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy > > wrote: > >> 2015-05-09 5:13 GMT+02:00 Richard Wordingham < > >> richard.wordingham at ntlworld.com>: WARNING: This post belongs in pedants' corner, or possibly a pantomime. > >>> I can't think of a practical use for the specific concepts of > >>> Unicode 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings > >>> are essentially the same as 16-bit strings, and Unicode 32-bit > >>> strings are UTF-32 strings. 'Unicode 8-bit string' strikes me > >>> as an exercise in pedantry; there are more useful categories of > >>> 8-bit strings that are not UTF-8 strings. > >> And here you're wrong: a 16-bit string is just a sequence of > >> arbitrary 16-bit code units, but a Unicode string (whatever the > >> size of its code units) adds restrictions for validity (the only > >> restriction being in fact that surrogates, when present in 16-bit > >> strings, i.e. UTF-16, must be paired, and in 32-bit (UTF-32) and > >> 8-bit (UTF-8) strings, surrogates are forbidden). You are thinking of a Unicode string as a sequence of codepoints. Now that may be a linguistically natural interpretation of 'Unicode string', but 'Unicode string' has a different interpretation, given in D80. A 'Unicode string' (D80) is a sequence of code-units occurring in some Unicode encoding form.
By this definition, every permutation of the code-units in a Unicode string is itself a Unicode string. UTF-16 is unique in that every code-unit corresponds to a codepoint. (We could extend the Unicode codespace (D9, D10) by adding integers for the bytes of multibyte UTF-8 encodings, but I see no benefit.) A Unicode 8-bit string may have no interpretation as a sequence of codepoints. For example, the 8-bit string <C2 A0> is a Unicode 8-bit string denoting a sequence of one Unicode scalar value, namely U+00A0. <A0 C2> is therefore also a Unicode 8-bit string, but it has no defined or obvious interpretation as a codepoint; it is *not* a UTF-8 string. The string <E0 80 80> is also a Unicode 8-bit string, but is not a UTF-8 string because the sequence is not the shortest representation of U+0000. The 8-bit string <C0 80> is *not* a Unicode 8-bit string, for the byte C0 does not occur in well-formed UTF-8; one does not even need to note that it is not the shortest representation of U+0000. > > No, Richard had it right. See for example definition D82 "Unicode > > 16-bit string" in the standard. (Section 3.9 Unicode Encoding Forms, > > http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf) > I was right: D82 refers to "UTF-16", which implies the restriction of > validity, i.e. NO isolated/unpaired surrogates (but no exclusion of > non-characters). No, D82 merely requires that each 16-bit value be a valid UTF-16 code unit. Unicode strings, and Unicode 16-bit strings in particular, need not be well-formed. For x = 8, 16, 32, a 'UTF-x string', equivalently a 'valid UTF-x string', is one that is well-formed in UTF-x. > I was right; you and Richard were wrong. I stand by my explanation. I wrote it with TUS open at the definitions by my side. Richard. From daniel.buenzli at erratique.ch Sat May 9 06:09:11 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Sat, 9 May 2015 13:09:11 +0200 Subject: Surrogates and noncharacters In-Reply-To: <20150509044213.28b48ac8@JRWUBU2> References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509044213.28b48ac8@JRWUBU2> Message-ID: <7BB6B573C8F448B9BF024AC65B86AC18@erratique.ch> Le samedi, 9 mai 2015 à 05:42, Richard Wordingham a écrit : > > (However, noncharacters are not designed to be openly interchanged; see > > "Restricted interchange" on p. 31 of 7.0.0.) > > That is irrelevant, for JSON is not restricted to open interchange. Irrelevant to what? I never said such a thing. Of course you can have noncharacters in JSON strings. I was just mentioning that it is not *advised* by the standard to interchange noncharacters. In practice you can always have them. Best, Daniel From daniel.buenzli at erratique.ch Sat May 9 07:16:28 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Sat, 9 May 2015 14:16:28 +0200 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: References: <406345450D52417C9DEE234A6C0662A2@erratique.ch> <821476CFD30C4A6C95CA6319394C723C@erratique.ch> Message-ID: <9B5FDBFA5A3B4C32B9C1A0070A36C663@erratique.ch> Le samedi, 9 mai 2015 à 06:24, Philippe Verdy a écrit : > You are not stuck! You can still regenerate a valid JSON output encoded in UTF-8: it will once again use escape sequences (which are also needed if your text contains the quotation marks used to delimit JSON strings in its syntax).
That's a possible resolution, but a very bad one: I can then no longer, in my program, distinguish between the JSON strings "\uDEAD" and "\\uDEAD". This leads exactly to the interoperability problems mentioned in section 8.2 of RFC 7159. You say passing escapes to the programmer is needed if your text contains quotation marks; this is nonsense. A good and sane JSON codec will never let the programmer deal with escapes directly; it is its responsibility to let the programmer deal only with the JSON *data*, not the details of the encoding of the data. As such, it will automatically unescape on decoding, to give you the data represented by the encoding, and automatically escape (if needed) the data you give it on encoding. > Unlike UTF-8, JSON has never been designed to restrict its strings so that their represented values are only plain text; it is only a serialization of "strings" to valid plain text using a custom syntax. You say a lot of things about what JSON is supposed to be or has been designed for. It would be nice to substantiate your claims by pointing at relevant standards. If JSON as in RFC 4627 really wanted to transmit sequences of bytes, I think it would have been *much more* explicit. The introductions of both RFC 4627 (remember, written by the *inventor* of JSON) and RFC 7159 (which obsoletes 4627) say "A string is a sequence of zero or more Unicode characters", and, as we already mentioned, we both agree this is very imprecise. There are two interpretations: * This is a sequence of Unicode scalar values, i.e. text (mine) * This is a sequence of Unicode code points, i.e. a JavaScript string (yours) Now, given this imprecision, the fact is that you cannot ignore that some stupid people that are very wrong, like me, will take the first interpretation. Since this interpretation is less liberal, you will have to cope with it and acknowledge the fact that lone escaped surrogates may not be interpreted correctly in the wild. This leads to the clarification and the interoperability warnings of section 8.2 in RFC 7159. If you read these two paragraphs carefully, you may infer that their "Unicode character" is more likely to be "Unicode scalar value". These paragraphs were not present in RFC 4627, so the latter was really ambiguous; I would, however, say RFC 7159 is not. If you don't agree with that, we are still left with the above two possible interpretations, and if you care about interoperability you should know which interpretation to take. Best, Daniel From verdy_p at wanadoo.fr Sat May 9 07:51:18 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 14:51:18 +0200 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: <9B5FDBFA5A3B4C32B9C1A0070A36C663@erratique.ch> References: <406345450D52417C9DEE234A6C0662A2@erratique.ch> <821476CFD30C4A6C95CA6319394C723C@erratique.ch> <9B5FDBFA5A3B4C32B9C1A0070A36C663@erratique.ch> Message-ID:
> > You say passing escapes to the programmer is needed if your text contains > quotation marks, this is nonsense. A good and sane JSON codec will never > let the programmer deal with escapes directly, it is its responsability to > allow the programmer to only deal with the JSON *data* not the details of > the encoding of the data. Yes, this is part of the codec, the data itself is not modified and does not have to handle the syntax (for quotation marks or escapes). > As such it will automatically unescape on decoding to give you the data > represented by the encoding and automatically escape (if needed) the data > you give it on encoding. > > > Unlike UTF-8, JSON has never been designed to restrict its strings to > have its represented values to be only plain-text, it is a only a > serialization of "strings" to valid plain-text using a custom syntax. > You say a lot of things about what JSON is supposed to be/has been > designed for. It would be nice to substantiate your claims by pointing at > relevant standards. If JSON as in RFC 4627 really wanted to transmit > sequences of bytes I think it would have been *much more* explicit. > No instead it speaks (incorrectly) about code points and mixes the concept with code units. Code units are just code units nothing else, they are not "characters", and certainly not in the meaning of "Unicode abstract characters" and not even "code points" or "scalar values" (and I did not speak about sequences of "bytes", which is the result of the UTF-8 encoding if this is the charset used for the transport of the plain-text JSON syntax) -------------- next part -------------- An HTML attachment was scrubbed... URL: From costello at mitre.org Sat May 9 08:04:10 2015 From: costello at mitre.org (Costello, Roger L.) Date: Sat, 9 May 2015 13:04:10 +0000 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...) In-Reply-To: References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> Message-ID: Hi Folks, Just want you to know, this discussion is EXCELLENT. I am learning a lot. Thank you! /Roger -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat May 9 08:07:12 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 15:07:12 +0200 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: References: <406345450D52417C9DEE234A6C0662A2@erratique.ch> <821476CFD30C4A6C95CA6319394C723C@erratique.ch> <9B5FDBFA5A3B4C32B9C1A0070A36C663@erratique.ch> Message-ID: 2015-05-09 14:51 GMT+02:00 Philippe Verdy : > You say a lot of things about what JSON is supposed to be/has been >> designed for. It would be nice to substantiate your claims by pointing at >> relevant standards. If JSON as in RFC 4627 really wanted to transmit >> sequences of bytes I think it would have been *much more* explicit. >> > > No instead it speaks (incorrectly) about code points and mixes the concept > with code units. > In fact it mixes/confuses three separate concepts, i.e. three layers distinct (that the Unicode standard distinguishes clearly): -1. the internal dataset (values of "strings" as expected by programmers and transmitted via the CODEC of the JSON parser/encoder), using code units in a fixed size (16-bit) -2. 
2. the plain-text syntax of JSON (which is independent of the actual character encoding, but can be formalized as a stream of Unicode code points); 3. the serialization of this plain text in a stream of bytes (using some UTF encoding scheme, or other legacy 8-bit charsets). The initial implementation of JSON, in Javascript, still used today, just performs the adaptation of the internal dataset (16-bit streams) to plain text (layers 1 and 2 above). Then Javascript itself specifies no serialization of its source: this is part of the MIME standard for the transport (using the MIME "charset" attribute of the media type) when using protocols like HTTP or HTTPS, or some external metadata, or a static definition which is system-dependent (for example in local file systems, if they do not store the metadata as a file attribute; a case for which the "BOM" and similar signatures were created, or for which there is specific syntax in some languages like XML or HTML for specifying the charset at the beginning of the file, or by using some "charset guesser"). Here also, Javascript programmers do not have to worry about layers 2 and 3 above; they just have to handle 16-bit streams (same remark for PHP, Java and many other programming languages): they work at layer 1, where there's a single encoding, a single size of code unit for everything, and no restriction on the values of code units. The same thing applies when working with the DOM API in XML, HTML, SVG... From verdy_p at wanadoo.fr Sat May 9 08:11:51 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 15:11:51 +0200 Subject: Surrogates and noncharacters In-Reply-To: <20150509105957.66267e13@JRWUBU2> References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> <20150509105957.66267e13@JRWUBU2> Message-ID: 2015-05-09 11:59 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > No, D82 merely requires that each 16-bit value be a valid UTF-16 code > unit. Unicode strings, and Unicode 16-bit strings in particular, need > not be well-formed. For x = 8, 16, 32, a 'UTF-x string', equivalently a > 'valid UTF-x string', is one that is well-formed in UTF-x. > > > I was right; you and Richard were wrong. > > I stand by my explanation. I wrote it with TUS open at the definitions > by my side. > Except that you are explaining something else. You are speaking about "Unicode strings" which are bound to a given UTF; I was speaking ONLY about "16-bit strings", which were NOT bound to Unicode (and did not have to be). So TUS is completely not relevant here. I have NOT written "Unicode 16-bit strings", only "16-bit strings", and I clearly opposed the two DISTINCT concepts in the SAME sentence, so that no confusion was possible.
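The validity question the two sides are arguing about is mechanical to check. Here is a sketch in JavaScript, whose strings are exactly sequences of arbitrary 16-bit code units (isWellFormedUTF16 is a hypothetical helper name; newer engines ship the equivalent built-in String.prototype.isWellFormed):

// Distinguishes an arbitrary 16-bit string from a well-formed
// UTF-16 string: well-formed means no unpaired surrogates.
function isWellFormedUTF16(s) {
  for (var i = 0; i < s.length; i++) {
    var u = s.charCodeAt(i);
    if (u >= 0xD800 && u <= 0xDBFF) {                   // high surrogate
      var next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false; // unpaired
      i++;                                              // skip the low half
    } else if (u >= 0xDC00 && u <= 0xDFFF) {
      return false;                                     // stray low surrogate
    }
  }
  return true;
}
console.log(isWellFormedUTF16("T\uD800\uDCC1")); // true  -- <0054, D800, DCC1>
console.log(isWellFormedUTF16("\uDCC1T\uD800")); // false -- a permutation of it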
From richard.wordingham at ntlworld.com Sat May 9 09:26:34 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 9 May 2015 15:26:34 +0100 Subject: Surrogates and noncharacters In-Reply-To: References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> <20150509105957.66267e13@JRWUBU2> Message-ID: <20150509152634.47f815f0@JRWUBU2> On Sat, 9 May 2015 15:11:51 +0200 Philippe Verdy wrote: > Except that you are explaining something else. You are speaking about > "Unicode strings" which are bound to a given UTF; I was speaking ONLY > about "16-bit strings", which were NOT bound to Unicode (and did not > have to be). So TUS is completely not relevant here. I have NOT written > "Unicode 16-bit strings", only "16-bit strings", and I clearly opposed > the two DISTINCT concepts in the SAME sentence, so that no confusion > was possible. The long sentence of yours I am responding to is: "And here you're wrong: a 16-bit string is just a sequence of arbitrary 16-bit code units, but a Unicode string (whatever the size of its code units) adds restrictions for validity (the only restriction being in fact that surrogates, when present in 16-bit strings, i.e. UTF-16, must be paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are forbidden)." The point I made is that every string of 16-bit values is (valid as) a Unicode string. Do you accept that? If not, please exhibit a counter-example. In particular, I claim that all 6 permutations of <0054, D800, DCC1> are Unicode strings, but that only two, namely <D800, DCC1, 0054> and <0054, D800, DCC1>, are UTF-16 strings. Richard. From verdy_p at wanadoo.fr Sat May 9 09:54:30 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 16:54:30 +0200 Subject: Surrogates and noncharacters In-Reply-To: <20150509152634.47f815f0@JRWUBU2> References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> <20150509105957.66267e13@JRWUBU2> <20150509152634.47f815f0@JRWUBU2> Message-ID: 2015-05-09 16:26 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > In particular, I claim that all 6 permutations of <0054, D800, DCC1> > are Unicode strings, but that only two, namely <D800, DCC1, 0054> and > <0054, D800, DCC1>, are UTF-16 strings. > Again, you use "Unicode strings" for your 6 permutations, but in your example they have nothing that makes them "Unicode strings", given that you allow arbitrary code units in arbitrary order, including unpaired ones. The 6 permutations are just "16-bit strings" (adding "Unicode" for these 6 permutations gives absolutely no value if you keep your definition, but visibly it cannot fit with the term used in the RFC trying to normalize JSON, with similar confusions!). TUS does not define a "Unicode string" the way you do here. TUS just defines "Unicode 16-bit strings" with a direct reference to UTF-16 (which implies conformance and only accepts the latter two strings, which TUS names "Unicode 16-bit strings", not "UTF-16 strings"...). TUS goes further by then distinguishing its encoding schemes (taking into account their serialization to 8-bit streams, and also considering the byte order, for defining the 3 supported UTF-16 encoding schemes: with or without BOM): then a "UTF-16 string" becomes "UTF-16 encoded text" (UTF-16, UTF-16BE or UTF-16LE).
Note also that I used the term "stream" instead of "string" only to avoid restricting the length (but JSON does not support encoding streams of arbitrary length: all of them must have a start, an end, and a defined bounded length), while streams don't necessarily have any defined length property, independently of the way we measure length: in bytes, code units, code points, combining sequences or grapheme clusters... From richard.wordingham at ntlworld.com Sat May 9 10:51:21 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 9 May 2015 16:51:21 +0100 Subject: Surrogates and noncharacters In-Reply-To: References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> <20150509105957.66267e13@JRWUBU2> <20150509152634.47f815f0@JRWUBU2> Message-ID: <20150509165121.502d9906@JRWUBU2> On Sat, 9 May 2015 16:54:30 +0200 Philippe Verdy wrote: > 2015-05-09 16:26 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > In particular, I claim that all 6 permutations of <0054, D800, DCC1> > > are Unicode strings, but that only two, namely <D800, DCC1, 0054> > > and <0054, D800, DCC1>, are UTF-16 strings. > > Again, you use "Unicode strings" for your 6 permutations, but in your > example they have nothing that makes them "Unicode strings", given > that you allow arbitrary code units in arbitrary order, including > unpaired ones. The 6 permutations are just "16-bit strings" (adding > "Unicode" for these 6 permutations gives absolutely no value if you > keep your definition, but visibly it cannot fit with the term used in > the RFC trying to normalize JSON, with similar confusions!). > TUS does not define a "Unicode string" the way you do here. D80 _Unicode string:_ A code unit sequence containing code units of a particular Unicode encoding form. RW: Note that by this definition, a permutation of a Unicode string is a Unicode string. D82 _Unicode 16-bit string:_ A Unicode string containing only UTF-16 code units. D85 _Well-formed:_ A Unicode code unit sequence that purports to be in a Unicode encoding form is called well-formed if and only if it _does_ follow the specification of that Unicode encoding form. D89 _In a Unicode encoding form:_ A Unicode string is said to be in a particular Unicode encoding form if and only if it consists of a well-formed Unicode code unit sequence of that Unicode encoding form. • A Unicode string consisting of a well-formed UTF-8 code unit sequence is said to be _in UTF-8_. Such a Unicode string is referred to as a _valid UTF-8 string_, or a _UTF-8 string_ for short. • A Unicode string consisting of a well-formed UTF-16 code unit sequence is said to be _in UTF-16_. Such a Unicode string is referred to as a _valid UTF-16 string_, or a _UTF-16 string_ for short. • A Unicode string consisting of a well-formed UTF-32 code unit sequence is said to be _in UTF-32_. Such a Unicode string is referred to as a _valid UTF-32 string_, or a _UTF-32 string_ for short. > TUS just defines "Unicode 16-bit strings" with a direct reference to > UTF-16 (which implies conformance and only accepts the latter two > strings, which TUS names "Unicode 16-bit strings", not "UTF-16 > strings"...) Look at D82 again. It refers to UTF-16 code units and does not otherwise reference UTF-16. If you still do not believe me, consider D89.
Can you think of an example of a Unicode string consisting of UTF-8 code units, UTF-16 code units or UTF-32 code units that is not a UTF-8 string, not a UTF-16 string, and not a UTF-32 string? If you can't, the use of "well-formed" is curiously redundant in D89. Richard. From unicode at lindenbergsoftware.com Sat May 9 01:26:56 2015 From: unicode at lindenbergsoftware.com (Norbert Lindenberg) Date: Fri, 8 May 2015 23:26:56 -0700 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: References: Message-ID: <42E0DD15-9F43-4014-9720-45BD5210FD12@lindenbergsoftware.com> RFC 7158 section 7 [1] provides not only the \uXXXX notation for Unicode code points in the Basic Multilingual Plane, but also a 12-character sequence encoding the UTF-16 surrogate pair (i.e. \uYYYY\uZZZZ with 0xD800 ≤ YYYY < 0xDC00 ≤ ZZZZ ≤ 0xDFFF) for supplementary Unicode code points. A tool checking for escape sequences that don't correspond to any Unicode character must be aware of this, because neither \uYYYY nor \uZZZZ by itself would correspond to any Unicode character, but their combination may well do so. Norbert [1] https://tools.ietf.org/html/rfc7158#section-7 > On May 7, 2015, at 5:46 , Costello, Roger L. wrote: > > Hi Folks, > > The JSON specification says that a character may be escaped using this notation: \uXXXX (XXXX are four hex digits) > > However, not every four hex digits corresponds to a Unicode character. > > Are there tools to scan a JSON document to detect the presence of \uXXXX, where XXXX does not correspond to any Unicode character? > > /Roger > From verdy_p at wanadoo.fr Sat May 9 13:44:32 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 20:44:32 +0200 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: <42E0DD15-9F43-4014-9720-45BD5210FD12@lindenbergsoftware.com> References: <42E0DD15-9F43-4014-9720-45BD5210FD12@lindenbergsoftware.com> Message-ID:
Then this internal stream of 16-bit code units will be exposed to the output using the encoding expected by the JSON client or programming environement. In summary, the refernece to Unicode in the RFCs for JSON is not really necesssary, all it needs to say is that the JSON parsers must be able to accept a file containing any plain-text valid in its transport encoding scheme, and that it will be able to decode from it the stream of 16bit code units and generate a valid output in the encoding expected by the client (when the client is Javascript or Java, the internal encoding will be the same as the exposed encoding ; this won't be true in Lua, or PHP or many C/C++ programs that often prefer using 8-bit strings; Some languages are hybrids and support two kinds of strings: 8-bit strings and 16-bit strings, rarely 32-bit strings) 2015-05-09 8:26 GMT+02:00 Norbert Lindenberg : > RFC 7158 section 7 [1] provides not only the \uXXXX notation for Unicode > code points in the Basic Multilingual Plane, but also a 12-character > sequence encoding the UTF-16 surrogate pair (i.e. \uYYYY\uZZZZ with 0xD800 > ? YYYY < 0xDC00 ? ZZZZ ? 0xDFFF) for supplementary Unicode code points. A > tool checking for escape sequences that don?t correspond to any Unicode > character must be aware of this, because neither \uYYYY nor \uZZZZ by > itself would correspond to any Unicode character, but their combination may > well do so. > > Norbert > > [1] https://tools.ietf.org/html/rfc7158#section-7 > > > > On May 7, 2015, at 5:46 , Costello, Roger L. wrote: > > > > Hi Folks, > > > > The JSON specification says that a character may be escaped using this > notation: \uXXXX (XXXX are four hex digits) > > > > However, not every four hex digits corresponds to a Unicode character. > > > > Are there tools to scan a JSON document to detect the presence of > \uXXXX, where XXXX does not correspond to any Unicode character? > > > > /Roger > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sun May 10 00:42:14 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 10 May 2015 07:42:14 +0200 Subject: Surrogates and noncharacters In-Reply-To: <20150509165121.502d9906@JRWUBU2> References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> <20150509105957.66267e13@JRWUBU2> <20150509152634.47f815f0@JRWUBU2> <20150509165121.502d9906@JRWUBU2> Message-ID: OK, but D80 and D82 have no purpose, except adding the term "Unicode" redundantly to these expressions. - D80 defines "Unicode string" but in fact it just defines a generic "string" as an arbitrary stream of fixed-size code units. This is the basic definition applicable to all languages I've seen (even if they add additional properties or methods in OOP). It is the same as a C/C++ string (if we ignore the additonal convention of using null as a terminator, soething that is not required in the language, but only a convention of its oldest standard libraries; newer libraries encode length separately) - D82 defines "Unicode 16-bit string" but in fact it just defines a generic "16-bit string" as an arbitrary stream of 16-bit code units. This is basically the same as Javascript and Java strings (where they are objects not requiring the null-byte termination but storing the length as an internal property). 
These two rules are not productive at all, except for saying that all values of fixed-size code units are acceptable (including for example 0xFF in 8-bit strings, which is invalid in UTF-8).

Curiously, D80 and D82 also restrict themselves to bounded strings (with a defined length), instead of streams (with undetermined length, no start index, no absolute position, no terminator, but just a special distinct value returned for EOF, or a method to query the current termination state of the stream, which may be time-dependent).

However I wonder what would be the effect of D80 in UTF-32: is <0xFFFFFFFF> a valid "32-bit string"? After all it too contains a single 32-bit code unit (for at least one Unicode encoding form), even if it has no "scalar value" and then does not have to satisfy D89 (for UTF-32)...

If there are confusions in other documents, it's now probably because of the completely unproductive D80 and D82 definitions of specific terms (which are probably not definitions of terms at all, but just fix the needed local context in order to define D89). The two rules D80 and D82 have absolutely no use in TUS outside D89. So D80 and D82 are probably excessive definitions; D89 would be enough (TUS should not have to dictate other lower-level behavior to programming environments or protocols).

2015-05-09 17:51 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

> On Sat, 9 May 2015 16:54:30 +0200
> Philippe Verdy wrote:
>
> > 2015-05-09 16:26 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:
> >
> > > In particular, I claim that all 6 permutations of <0054, D800, DCC1> are Unicode strings, but that only two, namely <D800, DCC1, 0054> and <0054, D800, DCC1>, are UTF-16 strings.
> >
> > Again you use "Unicode strings" for your 6 permutations, but in your example they have nothing that makes them "Unicode strings", given you allow arbitrary code units in arbitrary order, including unpaired ones. The 6 permutations are just "16-bit strings" (adding "Unicode" for these 6 permutations gives absolutely no value if you keep your definition, but visibly it cannot fit with the term used in the RFC trying to normalize JSON, with similar confusions!).
>
> > TUS does not define "Unicode string" the way you do here.
>
> D80 _Unicode string:_ A code unit sequence containing code units of a particular Unicode encoding form
>
> RW: Note that by this definition, a permutation of a Unicode string is a Unicode string.
>
> D82 _Unicode 16-bit string:_ A Unicode string containing only UTF-16 code units.
>
> D85 _Well-formed:_ A Unicode code unit sequence that purports to be in a Unicode encoding form is called well-formed if and only if it _does_ follow the specification of that Unicode encoding form
>
> D89 _In a Unicode encoding form:_ A Unicode string is said to be in a particular Unicode encoding form if and only if it consists of a well-formed Unicode code unit sequence of that Unicode encoding form.
> • A Unicode string consisting of a well-formed UTF-8 code unit sequence is said to be _in UTF-8_. Such a Unicode string is referred to as a _valid UTF-8 string_, or a _UTF-8 string_ for short.
> • A Unicode string consisting of a well-formed UTF-16 code unit sequence is said to be _in UTF-16_. Such a Unicode string is referred to as a _valid UTF-16 string_, or a _UTF-16 string_ for short.
> • A Unicode string consisting of a well-formed UTF-32 code unit sequence is said to be _in UTF-32_.
> Such a Unicode string is referred to as a _valid UTF-32 string_, or a _UTF-32 string_ for short.
>
> > TUS just defines "Unicode 16-bit strings" with a direct reference to UTF-16 (which implies conformance and only accepts the latter two strings, which TUS names "Unicode 16-bit strings", not "UTF-16 strings"...)
>
> Look at D82 again. It refers to UTF-16 code units and does not otherwise reference UTF-16.
>
> If you still do not believe me, consider D89. Can you think of an example of a Unicode string consisting of UTF-8 code units, UTF-16 code units or UTF-32 code units that is not a UTF-8 string, not a UTF-16 string, and not a UTF-32 string? If you can't, the use of "well-formed" is curiously redundant in D89.
>
> Richard.

From richard.wordingham at ntlworld.com Sun May 10 05:23:41 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 10 May 2015 11:23:41 +0100
Subject: Surrogates and noncharacters
Message-ID: <20150510112341.4ea1ea4e@JRWUBU2>

On Sun, 10 May 2015 07:42:14 +0200 Philippe Verdy wrote:

I am replying out of order for greater coherence of my reply.

> However I wonder what would be the effect of D80 in UTF-32: is <0xFFFFFFFF> a valid "32-bit string"? After all it too contains a single 32-bit code unit (for at least one Unicode encoding form), even if it has no "scalar value" and then does not have to satisfy D89 (for UTF-32)...

The value 0xFFFFFFFF cannot appear in a UTF-32 string. Therefore it cannot represent a unit of encoded text in a UTF-32 string. By D77 paragraph 1, "Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange", it is therefore not a code unit. The effect of D77, D80 and D83 is that <0xFFFFFFFF> is a 32-bit string but not a Unicode 32-bit string.

> - D80 defines "Unicode string", but in fact it just defines a generic "string" as an arbitrary stream of fixed-size code units.

No - see argument above.

> These two rules [D80 and D82 - RW] are not productive at all, except for saying that all values of fixed-size code units are acceptable (including for example 0xFF in 8-bit strings, which is invalid in UTF-8)

Do you still maintain this reading of D77? D77 is not as clear as it should be.

> D80 and D82 have no purpose, except adding the term "Unicode" redundantly to these expressions.

I have the cynical suspicion that these definitions were added to preserve the interface definitions of routines processing UCS-2 strings when the transition to UTF-16 occurred. They can also have the (intentional?) side-effect of making more work for UTF-8 and UTF-32 processing, because arbitrary 8-bit strings and 32-bit strings are not Unicode strings.

Richard.
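Richard's distinction can be stated as a small C++ sketch (a hypothetical illustration of his reading of D77/D80/D83, not wording from TUS): any array of 32-bit values is a 32-bit string, but only one whose elements are all Unicode scalar values is a Unicode 32-bit string, which for UTF-32 coincides with being a UTF-32 string.

    #include <cstdint>
    #include <vector>

    // A Unicode scalar value: any code point except the surrogates (D76).
    // These are exactly the values a UTF-32 code unit may take.
    bool is_scalar_value(std::uint32_t v) {
        return v < 0x110000u && !(v >= 0xD800u && v <= 0xDFFFu);
    }

    // <0xFFFFFFFF> is a 32-bit string, but this test rejects it, so it
    // is neither a Unicode 32-bit string nor a UTF-32 string.
    bool is_unicode_32bit_string(const std::vector<std::uint32_t>& s) {
        for (std::uint32_t v : s)
            if (!is_scalar_value(v)) return false;
        return true;
    }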
From haberg-1 at telia.com Sun May 10 13:35:41 2015
From: haberg-1 at telia.com (Hans Aberg)
Date: Sun, 10 May 2015 20:35:41 +0200
Subject: Surrogates and noncharacters
Message-ID: <881564BF-0C35-4947-8CAD-04CFAEB0AC6C@telia.com>

> On 10 May 2015, at 12:23, Richard Wordingham wrote:
>
> > However I wonder what would be the effect of D80 in UTF-32: is <0xFFFFFFFF> a valid "32-bit string"?
>
> The value 0xFFFFFFFF cannot appear in a UTF-32 string. Therefore it cannot represent a unit of encoded text in a UTF-32 string.

Even though the values with the highest bit set are not part of original UTF-32, it can easily be extended, as can original UTF-8, which may be simpler to implement.

From verdy_p at wanadoo.fr Sun May 10 14:19:52 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 10 May 2015 21:19:52 +0200
Subject: Surrogates and noncharacters
Message-ID:

The way I read D77 (code unit), it is not bound to any Unicode encoding form; "the minimal bit combination that can represent a unit of encoded text for processing or interchange" can be of any bit length and can even use a non-binary representation (not bit-based: it could be ternary, or floating point, or base ten, with the remaining bit patterns possibly used for other functions such as clock synchronization/calibration or polarization balancing, leaving only some patterns distinguishable, but not necessarily an exact power of two...).

I don't see why a 32-bit code unit or an 8-bit code unit has to be bound to UTF-32 or UTF-8 in D77; a code unit is just a code unit; it does not have to be assigned any Unicode scalar value or exist in a specific pattern valid for UTF-32 or UTF-8 (in addition, these two UTFs are not the only ones supported; look at SCSU for example, or GB18030, which are also conforming UTFs). A code unit is just one element within an enumerable and finite set of elements that is transmissible to some interface and interchangeable.

It's up to each UTF to define how it can use them: the UTFs are usable on these sets provided that the sets are large enough to contain at least the number of distinct code units required for the UTF to be supported (which means that the actual bit count of the transported code units does not matter; this is out of scope of TUS, which just requires sets with sufficient cardinality).

For these reasons I absolutely do not see why you argue that 0xFFFFFFFF cannot be a valid 32-bit code unit, and then why <0xFFFFFFFF> can't be a valid 32-bit string (or "Unicode 32-bit string", as TUS renames it in D80-D83 in a way that is really unproductive, and in fact confusing).
As well, nothing prohibits supporting the UTF-32 encoding form over a 21-bit stream, using another "encoding scheme" (which could not also be named UTF-32 or UTF-32BE or UTF-32LE, but could be named "UTF-32-21"): the result will be a 21-bit string, but the 21-bit code unit 0x1FFFFF will still be valid.

From richard.wordingham at ntlworld.com Sun May 10 15:44:29 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 10 May 2015 21:44:29 +0100
Subject: Surrogates and noncharacters
Message-ID: <20150510214429.5d1ad31f@JRWUBU2>

On Sun, 10 May 2015 21:19:52 +0200 Philippe Verdy wrote:

> The way I read D77 (code unit), it is not bound to any Unicode encoding form;

Agreed.

> "the minimal bit combination that can represent a unit of encoded text for processing or interchange" can be of any bit length and can even use a non-binary representation (not bit-based: it could be ternary, or floating point, or base ten, with the remaining bit patterns possibly used for other functions such as clock synchronization/calibration or polarization balancing, leaving only some patterns distinguishable, but not necessarily an exact power of two...)

I don't object to that reading, but I'm not sure it's correct.
> I don't see why a 32-bit code unit or an 8-bit code unit has to be bound to UTF-32 or UTF-8 in D77; a code unit is just a code unit; it does not have to be assigned any Unicode scalar value or exist in a specific pattern valid for UTF-32 or UTF-8 (in addition, these two UTFs are not the only ones supported; look at SCSU for example, or GB18030, which are also conforming UTFs):

D77 is definitely not bound to Unicode encoding forms - it gives Shift-JIS as an example of an encoding that has code units.

> A code unit is just one element within an enumerable and finite set of elements that is transmissible to some interface and interchangeable.
>
> It's up to each UTF to define how it can use them: the UTFs are usable on these sets provided that the sets are large enough to contain at least the number of distinct code units required for the UTF to be supported (which means that the actual bit count of the transported code units does not matter; this is out of scope of TUS, which just requires sets with sufficient cardinality):

The critical matter is the number of array elements needed for each scalar value, and the pattern of which elements of the scalar values have the 'same' values.

> For these reasons I absolutely do not see why you argue that 0xFFFFFFFF cannot be a valid 32-bit code unit

Fair point so far. I agree it can be a 32-bit code unit in some character encoding. However, it is not a UTF-32 code unit.

> and then why <0xFFFFFFFF> can't be a valid 32-bit string

I agree that it is a 32-bit string. I don't know what you mean by the word 'valid' in this context.

> (or "Unicode 32-bit string", as TUS renames it in D80-D83 in a way that is really unproductive, and in fact confusing).

I hope you now see that it cannot be a Unicode 32-bit string, for 0xFFFFFFFF is not a UTF-32 code unit. This is a key point in the difference between:

a) x-bit string,
b) Unicode x-bit string, and
c) UTF-x string

For x=8, these are three different things. For x=16 or x=32, these are two different things, but they do not split the same way.

D80-D83 do not directly rename 8-bit strings, 16-bit strings or 32-bit strings as Unicode 8-bit strings, Unicode 16-bit strings or Unicode 32-bit strings. That all 16-bit strings are Unicode 16-bit strings is a consequence of the definition of UTF-16. Similarly, not all 8-bit strings being Unicode 8-bit strings, and not all 32-bit strings being Unicode 32-bit strings, are consequences of the definitions of UTF-8 and UTF-32 respectively.

I agree that the concept of Unicode 8-bit strings is not useful. The separate concept of Unicode 32-bit strings is also not useful, for I contend that all Unicode 32-bit strings are in fact UTF-32 strings. The latter result is an immediate consequence of UTF-32 not being a multi-code-unit encoding.

> As well, nothing prohibits supporting the UTF-32 encoding form over a 21-bit stream, using another "encoding scheme" (which could not also be named UTF-32 or UTF-32BE or UTF-32LE, but could be named "UTF-32-21"): the result will be a 21-bit string, but the 21-bit code unit 0x1FFFFF will still be valid.
>
> 2015-05-10 12:23 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:
>
> > On Sun, 10 May 2015 07:42:14 +0200
> > Philippe Verdy wrote:
> >
> > I am replying out of order for greater coherence of my reply.
> >
> > > However I wonder what would be the effect of D80 in UTF-32: is
> > > <0xFFFFFFFF> a valid "32-bit string"?
> > > After all it too contains a single 32-bit code unit (for at least one Unicode encoding form), even if it has no "scalar value" and then does not have to satisfy D89 (for UTF-32)...
> >
> > The value 0xFFFFFFFF cannot appear in a UTF-32 string. Therefore it cannot represent a unit of encoded text in a UTF-32 string. By D77 paragraph 1, "Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange", it is therefore not a code unit.

Correction: "is therefore not a UTF-32 code unit."

> > The effect of D77, D80 and D83 is that <0xFFFFFFFF> is a 32-bit string but not a Unicode 32-bit string.
> >
> > > - D80 defines "Unicode string", but in fact it just defines a generic "string" as an arbitrary stream of fixed-size code units.
> >
> > No - see argument above.
> >
> > > These two rules are not productive at all, except for saying that all values of fixed-size code units are acceptable (including for example 0xFF in 8-bit strings, which is invalid in UTF-8)

I ask again: Do you still maintain this reading of D77? D77 is not as clear as it should be.

Richard.

From ishida at w3.org Mon May 11 03:25:38 2015
From: ishida at w3.org (Richard Ishida)
Date: Mon, 11 May 2015 09:25:38 +0100
Subject: Notes on Mongolian variant forms
Message-ID: <55506782.2090001@w3.org>

fyi, i have been developing a page

Notes on Mongolian variant forms
http://r12a.github.io/scripts/mongolian/variants

the page compares variant glyph shapes proposed in three documents, and shows what shapes fonts actually produce.

i have been documenting changes at http://lists.w3.org/Archives/Public/public-i18n-mongolian/ - if you want to discuss the page, you are free to join and contribute to that list.

introduction to the page:
======================================
There is some confusion about which shapes should be produced by fonts for Mongolian characters. Most letters have at least one isolated, initial, medial and final shape, but other shapes are produced by contextual factors, such as vowel harmony.

Unicode has a list of standardised variant shapes, dating from 27 November 2013, but that list is not complete and contains what are currently viewed by some as errors. The original list of standardised variants was based on ????? by Professor Quejingzhabu in 2000. A new proposal was published on 20 January 2014, which attempts to resolve the current issues.

The other factor in this is what the actual fonts do. Sometimes they follow the Unicode standardised variants list, other times they diverge from it. Occasionally a majority of implementations appear to diverge in the same way, suggesting that the standardised list should be adapted to reality.

In this document I map the changes between the various proposals, and compare them to various font implementations.

From petercon at microsoft.com Mon May 11 11:45:09 2015
From: petercon at microsoft.com (Peter Constable)
Date: Mon, 11 May 2015 16:45:09 +0000
Subject: Script / font support in Windows 10
Message-ID:

When the update with Windows 10 info was posted, earlier sections for Windows 2000 / XP / XP SP2 were inadvertently deleted. Those have been restored.
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Peter Constable
Sent: Friday, May 8, 2015 7:16 AM
To: unicode at unicode.org
Subject: RE: Script / font support in Windows 10

I think this is the right public link: https://msdn.microsoft.com/en-us/goglobal/bb688099.aspx

From: Peter Constable
Sent: Thursday, May 7, 2015 10:29 PM
To: Peter Constable; unicode at unicode.org
Subject: RE: Script / font support in Windows 10

Oops... my bad: maybe it isn't on live servers yet. It will be soon. I'll update with the public link when it is.

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Peter Constable
Sent: Thursday, May 7, 2015 10:15 PM
To: unicode at unicode.org
Subject: Script / font support in Windows 10

This page on MSDN that provides an overview of Windows support for different scripts has now been updated for Windows 10: https://msdnlive.redmond.corp.microsoft.com/en-us/bb688099

Peter

From doug at ewellic.org Mon May 11 12:44:19 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 11 May 2015 10:44:19 -0700
Subject: Surrogates and noncharacters
Message-ID: <20150511104419.665a7a7059d7ee80bb4d670165c8327d.4f55ecb0f1.wbe@email03.secureserver.net>

Hans Aberg wrote:

>>> However I wonder what would be the effect of D80 in UTF-32: is <0xFFFFFFFF> a valid "32-bit string"?
>>
>> The value 0xFFFFFFFF cannot appear in a UTF-32 string. Therefore it cannot represent a unit of encoded text in a UTF-32 string.
>
> Even though the values with the highest bit set are not part of original UTF-32, it can easily be extended, as can original UTF-8, which may be simpler to implement.

"Original UTF-8," regardless of where defined, only ever encoded scalar values up to 0x7FFFFFFF. See, for example, RFC 2279.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From haberg-1 at telia.com Mon May 11 13:05:23 2015
From: haberg-1 at telia.com (Hans Aberg)
Date: Mon, 11 May 2015 20:05:23 +0200
Subject: Surrogates and noncharacters
Message-ID: <3EE00C84-398E-4A21-B18E-A27D8CB49F21@telia.com>

> On 11 May 2015, at 19:44, Doug Ewell wrote:
>
> "Original UTF-8," regardless of where defined, only ever encoded scalar values up to 0x7FFFFFFF. See, for example, RFC 2279.

The intended meaning is that also original UTF-8 can be extended to full 32-bit by using 6-byte sequences with the leading-byte bit pattern 111111xx.
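A sketch of the extension Hans is describing (assuming the RFC 2279 style of 1- to 6-byte sequences, with the 6-byte lead byte widened from 1111110x to 111111xx so that 2 + 5*6 = 32 payload bits cover the full 32-bit range; this is not standard UTF-8, which RFC 3629 restricts to 4 bytes and U+10FFFF):

    #include <cstdint>
    #include <string>

    // Encode any 32-bit value, RFC 2279 style, with the 6-byte lead byte
    // widened to 111111xx as suggested above. Not standard UTF-8.
    std::string encode_extended_utf8(std::uint32_t v) {
        std::string out;
        if (v < 0x80u) { out += char(v); return out; }
        int cont;                                  // continuation bytes
        if      (v < 0x800u)     cont = 1;         // lead 110xxxxx
        else if (v < 0x10000u)   cont = 2;         // lead 1110xxxx
        else if (v < 0x200000u)  cont = 3;         // lead 11110xxx
        else if (v < 0x4000000u) cont = 4;         // lead 111110xx
        else                     cont = 5;         // lead 111111xx (extended)
        std::uint8_t lead = std::uint8_t(0xFFu << (7 - cont));
        out += char(lead | std::uint8_t(v >> (6 * cont)));
        for (int i = cont - 1; i >= 0; --i)
            out += char(0x80u | ((v >> (6 * i)) & 0x3Fu));
        return out;
    }

For example, encode_extended_utf8(0xFFFFFFFF) yields the six bytes FF BF BF BF BF BF, which no standard UTF-8 decoder would accept.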
From verdy_p at wanadoo.fr Mon May 11 14:25:29 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 11 May 2015 21:25:29 +0200
Subject: Surrogates and noncharacters
Message-ID:

Yes, but this does not mean that 0xFFFFFFFF cannot be used as a (32-bit) code unit in "32-bit strings", even if it is not a valid code point with a valid scalar value in any legacy or standard version of UTF-32.

The limitation to 0x7FFFFFFF was certainly just there to avoid signed/unsigned differences in 32-bit integers (if ever they were in fact converted to larger integers such as 64-bit, which would exhibit differences in APIs returning individual code units).

It's true that in 32-bit integers (signed or unsigned) you cannot differentiate 0xFFFFFFFF from -1 (which is generally the value chosen in C/C++ standard libraries for representing the EOF condition returned by functions or macros like getchar()). But EOF conditions do not need to be differentiated when you are scanning positions in a buffer of 32-bit integers (instead you compare the relative index in the buffer with the buffer length, or the buffer object includes a separate method to test this condition).

But today, where programming environments are going 64-bit by default, the APIs that return an integer when reading individual code positions will return them as 64-bit integers, even if the inner storage uses 32-bit code units: 0xFFFFFFFF will then be returned as a positive integer and not as the -1 used for EOF. This was not yet true when the legacy UTF-32 encoding was created, when a majority of environments were still only running 32-bit or 16-bit code; for the 16-bit code, the 0xFFFF code unit, for the U+FFFF code point, had to be assigned to a noncharacter to limit problems of confusion with the EOF condition in C/C++ or similar APIs in other languages (when they cannot throw an exception instead of returning a distinct EOF value).

Well, there are still a lot of devices running 32-bit code (notably in guest VMs, and in small devices), written in C/C++ with the old standard C library but without OOP features (such as exceptions, or methods on buffering objects). In Java, the "int" datatype (which is 32-bit and signed) has not been extended to 64-bit, even on platforms where 64-bit integers are the internal datatype used by the JVM in its natively compiled binary code.

Once again, "code units" and "x-bit strings" are not bound to any Unicode or ISO/IEC 10646 or legacy RFC constraints related to the current standard UTFs or legacy (obsoleted) UTFs. And I still don't see any productive need for "Unicode x-bit strings" in TUS D80-D83, when all that is needed for conformance is NOT the whole range of valid code units, but only the allowed range of scalar values (for which code units only need to be defined in a large enough set of distinct values: the exact cardinality of this set does not matter, and there can always exist additional valid "code units" not bound to any valid "scalar value" or to the minimal set of distinct "Unicode code units" needed to support the standard Unicode encoding forms).

Even the Unicode scalar values, or the implied values of "Unicode code units", do not have to be aligned with the effective native values of the "code units" used at the lower level...
except for the standard encoding schemes for 8-bit interchanges, where byte order matters... but still not the lower-level bit order and the native hardware representation of individually addressable bytes, which may sometimes be larger than 8 bits, with some other control bits or framing bits, and sometimes even with variable bit sizes depending on their relative position in transport frames!

From richard.wordingham at ntlworld.com Mon May 11 15:43:21 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 11 May 2015 21:43:21 +0100
Subject: Surrogates and noncharacters
Message-ID: <20150511214321.55a94551@JRWUBU2>

On Mon, 11 May 2015 21:25:29 +0200 Philippe Verdy wrote:

> Once again, "code units" and "x-bit strings" are not bound to any Unicode or ISO/IEC 10646 or legacy RFC constraints related to the current standard UTFs or legacy (obsoleted) UTFs.

Who says they are? I'm just saying that the concepts of Unicode x-bit strings are.

Richard.

From haberg-1 at telia.com Mon May 11 16:53:02 2015
From: haberg-1 at telia.com (Hans Aberg)
Date: Mon, 11 May 2015 23:53:02 +0200
Subject: Surrogates and noncharacters
Message-ID: <182274CE-06AD-487C-8E85-9FFEEA54AD94@telia.com>

> On 11 May 2015, at 21:25, Philippe Verdy wrote:
>
> Yes, but this does not mean that 0xFFFFFFFF cannot be used as a (32-bit) code unit in "32-bit strings", even if it is not a valid code point with a valid scalar value in any legacy or standard version of UTF-32.

The reason I did it was to avoid having a check to throw an exception. It merely means that the check for valid Unicode code points, in such a context, must be elsewhere.

> The limitation to 0x7FFFFFFF was certainly just there to avoid signed/unsigned differences in 32-bit integers (if ever they were in fact converted to larger integers such as 64-bit, which would exhibit differences in APIs returning individual code units).

Indeed, so I use uint32_t combined with uint32_t, because char can be signed at the will of the C/C++ compiler implementer.

> It's true that in 32-bit integers (signed or unsigned) you cannot differentiate 0xFFFFFFFF from -1 (which is generally the value chosen in C/C++ standard libraries for representing the EOF condition returned by functions or macros like getchar()). But EOF conditions do not need to be differentiated when you are scanning positions in a buffer of 32-bit integers (instead you compare the relative index in the buffer with the buffer length, or the buffer object includes a separate method to test this condition).
It is a good point - perhaps that was the reason not to allow the highest bit set. But it is not a problem in C++, should it get UTF-32 streams, as they can throw an exception.

> But today, where programming environments are going 64-bit by default, the APIs that return an integer when reading individual code positions will return them as 64-bit integers, even if the inner storage uses 32-bit code units: 0xFFFFFFFF will then be returned as a positive integer and not as the -1 used for EOF.

Right, the C/C++ language specifications say that size_t and friends must be able to hold any size, and similarly for differences. So this forces signed and unsigned 64-bit integral types on a 64-bit platform.

> This was not yet true when the legacy UTF-32 encoding was created, when a majority of environments were still only running 32-bit or 16-bit code; for the 16-bit code, the 0xFFFF code unit, for the U+FFFF code point, had to be assigned to a noncharacter to limit problems of confusion with the EOF condition in C/C++ or similar APIs in other languages (when they cannot throw an exception instead of returning a distinct EOF value).

Right, it might be a non-issue today.

> Well, there are still a lot of devices running 32-bit code (notably in guest VMs, and in small devices), written in C/C++ with the old standard C library but without OOP features (such as exceptions, or methods on buffering objects). In Java, the "int" datatype (which is 32-bit and signed) has not been extended to 64-bit, even on platforms where 64-bit integers are the internal datatype used by the JVM in its natively compiled binary code.

Legacy is a problem.

> Once again, "code units" and "x-bit strings" are not bound to any Unicode or ISO/IEC 10646 or legacy RFC constraints related to the current standard UTFs or legacy (obsoleted) UTFs.
>
> And I still don't see any productive need for "Unicode x-bit strings" in TUS D80-D83, when all that is needed for conformance is NOT the whole range of valid code units, but only the allowed range of scalar values (for which code units only need to be defined in a large enough set of distinct values:
>
> The exact cardinality of this set does not matter, and there can always exist additional valid "code units" not bound to any valid "scalar value" or to the minimal set of distinct "Unicode code units" needed to support the standard Unicode encoding forms).
>
> Even the Unicode scalar values, or the implied values of "Unicode code units", do not have to be aligned with the effective native values of the "code units" used at the lower level... except for the standard encoding schemes for 8-bit interchanges, where byte order matters... but still not the lower-level bit order and the native hardware representation of individually addressable bytes, which may sometimes be larger than 8 bits, with some other control bits or framing bits, and sometimes even with variable bit sizes depending on their relative position in transport frames!

It is perfectly fine considering the Unicode code points as abstract integers, with UTF-32 and UTF-8 encodings that translate them into byte sequences in a computer. The code points that conflict with UTF-16 might have been merely declared not in use until UTF-16 had fallen out of use, replaced by UTF-8 and UTF-32. One is going to check that the code points are valid Unicode values somewhere, so it is hard to see the point of restricting UTF-8 to align it with UTF-16.
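On the EOF point above, one way around overloading -1 is to return the code unit out of band, so the full 32-bit range stays available as data; a minimal C++ sketch of that idea (a hypothetical interface, just to illustrate the alternative to an in-band sentinel):

    #include <cstdint>
    #include <optional>
    #include <vector>

    // End of input is signalled by an empty optional, not by a sentinel
    // value, so 0xFFFFFFFF remains usable as an ordinary code unit.
    std::optional<std::uint32_t> next_unit(const std::vector<std::uint32_t>& buf,
                                           std::size_t& pos) {
        if (pos >= buf.size()) return std::nullopt;  // EOF, out of band
        return buf[pos++];
    }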
From verdy_p at wanadoo.fr Tue May 12 08:45:52 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 12 May 2015 15:45:52 +0200
Subject: Surrogates and noncharacters
Message-ID:

2015-05-11 23:53 GMT+02:00 Hans Aberg:

> It is perfectly fine considering the Unicode code points as abstract integers, with UTF-32 and UTF-8 encodings that translate them into byte sequences in a computer. The code points that conflict with UTF-16 might have been merely declared not in use until UTF-16 had fallen out of use, replaced by UTF-8 and UTF-32.

The deprecation of UTF-16 and UTF-32 as encoding *schemes* ("charsets" in MIME) is already very advanced. But they will certainly not disappear as encoding *forms* for internal use in binary APIs and in several very popular programming languages: Java, Javascript, even C++ on Windows platforms (where it is the 8-bit interface, based on legacy "code pages" and with poor support for the UTF-8 encoding scheme as a Windows "code page", that is now being phased out), C#, J#... UTF-8 will also remain for long the preferred internal encoding for Python and PHP (even if Python also introduced a 16-bit native datatype).

In all cases, programming languages are not based on any Unicode encoding form but on more or less opaque streams of code units, using datatypes that are not constrained by Unicode (because their "character" or "byte" datatype is also used for binary I/O and for supporting the conversion of various binary structures, including executable code, and also because even this datatype is not necessarily 8-bit but may be larger, and not even an even multiple of 8 bits).

> One is going to check that the code points are valid Unicode values somewhere, so it is hard to see the point of restricting UTF-8 to align it with UTF-16.

What I meant when I started discussing in this thread was just to obsolete the unnecessary definitions of "x-bit strings" in TUS. The standard does not need these definitions, and if we want it to be really open to various architectures, languages and protocols, all that is needed is the definition of the "code units" specific to each standard UTF (for an encoding form, or for an encoding scheme when splitting code units into smaller code units and ordering them, by determining only this order and the minimum set of distinct values that these code units must support: we should not speak about "bits", just about "sets" of distinct elements with a sufficient cardinality).

So let's just speak about "UTF-8 code units", "UTF-16 code units", "UTF-32 code units" (not just "code units", and not even "Unicode code units", which is also nonsense given the existence of standardized compression schemes that also define their own "XXX code units"). If the expression "16-bit code units" has been used, it is purely for internal use as a shortcut for the complete name, and these shortcuts are not part of the external entities to standardize (they are not precise enough and cannot be used safely out of their local context): consider these definitions just as "private" ones (in the same sense as in OOP), boxed as internals of TUS seen as a black box. It's not the focus of TUS to discuss what "strings" are: that is just the matter of each integration platform that wants to use TUS.
In summary, the definitions in TUS should be split in two parts: those that are "public" and needed by external references (in other standards), and those that are private (many of them do not even have to be within the generic section of the standard; they should be listed in the appropriate sections needing them locally, also clearly separating the "public" and "private" interfaces). In all cases, the public interfaces must define precise and unambiguous terms, bound to the standard or section of the standard defining them, even if later within that section a shortcut is used as a convenience (to make the text easier to read). We need "scopes" for these definitions (and shorter aliases must be made private).

From haberg-1 at telia.com Tue May 12 08:56:04 2015
From: haberg-1 at telia.com (Hans Aberg)
Date: Tue, 12 May 2015 15:56:04 +0200
Subject: Surrogates and noncharacters
Message-ID: <085B08E1-55BB-4B31-AE58-8B7601DCE857@telia.com>

> On 12 May 2015, at 15:45, Philippe Verdy wrote:
>
> The deprecation of UTF-16 and UTF-32 as encoding *schemes* ("charsets" in MIME) is already very advanced.

UTF-32 is usable for internal use in programs.

> But they will certainly not disappear as encoding *forms* for internal use in binary APIs and in several very popular programming languages: Java, Javascript, even C++ on Windows platforms (where it is the 8-bit interface, based on legacy "code pages" and with poor support for the UTF-8 encoding scheme as a Windows "code page", that is now being phased out), C#, J#...

That is legacy, which may remain for long. For example, C/C++ trigraphs are only being removed now, having long been just a bother for compiler implementers. Java is very old, designed around 32-bit programming with limits on function code size, a limitation of pre-PowerPC CPUs that went out of use in the early 1990s.

> UTF-8 will also remain for long the preferred internal encoding for Python and PHP (even if Python also introduced a 16-bit native datatype).
>
> In all cases, programming languages are not based on any Unicode encoding form but on more or less opaque streams of code units, using datatypes that are not constrained by Unicode (because their "character" or "byte" datatype is also used for binary I/O and for supporting the conversion of various binary structures, including executable code, and also because even this datatype is not necessarily 8-bit but may be larger, and not even an even multiple of 8 bits).

Indeed, that is why UTF-8 was invented for use in Unix-like environments.
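As a concrete reminder of why the per-UTF qualification of "code unit" matters, here is the same scalar value written as code units of each of the three standard encoding forms (values computed by hand for this example; a real program would use a conversion library):

    #include <cstdint>

    // U+1F600 as code units of each encoding form. "Code unit" only
    // means something relative to a particular UTF.
    const std::uint8_t  u8 [] = { 0xF0, 0x9F, 0x98, 0x80 };  // UTF-8: four units
    const std::uint16_t u16[] = { 0xD83D, 0xDE00 };          // UTF-16: surrogate pair
    const std::uint32_t u32[] = { 0x0001F600 };              // UTF-32: one unit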
From verdy_p at wanadoo.fr Tue May 12 09:50:02 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 12 May 2015 16:50:02 +0200
Subject: Surrogates and noncharacters
Message-ID:

2015-05-12 15:56 GMT+02:00 Hans Aberg:

> Indeed, that is why UTF-8 was invented for use in Unix-like environments.

Not the main reason: communication protocols and data storage are also based on 8-bit code units (even if storage groups them into much larger blocks). UTF-8 is the default choice for all Internet protocols because all these protocols are based on these units.

This last remark is true except at the lower levels, on the link interfaces and on physical links, where the unit is the bit, or sometimes even smaller units with fractions of bits, grouped into frames that transport not only data bits but also specific items needed by the physical constraints: maintaining the mean polarity, restricting the frequency bandwidth, reducing noise in lateral bands, synchronizing clocks for data sampling, reducing power usage, allowing adaptation of bandwidth by insertion of new parallel streams in the same shared band, allowing the framing format to change when the signal-to-noise ratio degrades (by using some additional signals normally not used by the normal data stream), adapting to the degradation of the transport medium, or adapting to some emergency situations (or sometimes to local legal requirements) that require reducing usage to leave space for priority traffic (e.g. air regulation or military use)...

Each time the transport medium has to be shared with third parties (this is the case for infrastructure networks, or for the radio frequencies in the public airspace, which may also be shared internationally), or if the medium is known to have a slowly degrading quality (e.g. SSD storage), the transport and storage protocols never use the whole available bandwidth, and they reserve some regulatory space for specific signalling that may be needed to let current usages adapt: the physical format of data streams can change at any time, and what was initially encoded one way will then be encoded another way. (Such things also occur extremely locally, for example on data buses within computers, between the various electronic chips on the same motherboard, or whatever could be plugged into it as optional extensions! Electronic devices are full of bus adapters that have to manage the priority between concurrent, unpredictable traffics, under changing environmental conditions such as the current state of power sources.)

Programmers, however, only see the result in the upper-layer data frames, where they manage bits; from these they can create streams of bytes, which are usable for transport protocols and interchange over a larger network or computing system.
But for the worldwide network (the Internet), everything is based on 8-bit bytes, which are the minimal units of information in all related protocols (and also the maximal units: larger units are not portable, not interoperable over the global network), including for negotiating options in these protocols. UTF-8 is then THE universal encoding that will interoperate everywhere on the Internet, even if locally (in connected hosts) other encodings may be used (which *may* be processed more efficiently) after a simple conversion (this does not necessarily require changing the size of the code units used in local protocols and interfaces; for example, there could be some re-encoding, or data compression or expansion).

From haberg-1 at telia.com Tue May 12 10:58:00 2015
From: haberg-1 at telia.com (Hans Aberg)
Date: Tue, 12 May 2015 17:58:00 +0200
Subject: Surrogates and noncharacters
Message-ID: <26383F58-189A-4167-9530-1CE33EE9536F@telia.com>

> On 12 May 2015, at 16:50, Philippe Verdy wrote:
>
>> Indeed, that is why UTF-8 was invented for use in Unix-like environments.
>
> Not the main reason: communication protocols and data storage are also based on 8-bit code units (even if storage groups them into much larger blocks).

There is some history here:
https://en.wikipedia.org/wiki/UTF-8#History
At the same time, The Internet was also about to emerge as a worldwide network, but Internet was still very limited and full of restrictions, accessible only from a few (very costly) gateways in other countries, and not even with the IP protocol but with many specific protocols (may be you remember the time of CompuServe, billed only in US dollars and only via international payments and costly bank processing fees; you also had to call an international phone number before a few national phone numbers appeared, cooperated by CompuServe and some national or regional services At that time, the Telcos were not even interested to participate and all wanted to develop their own national or regional networks with their own protocols and "national" standards; real competition in telecommunications only started just before Y2K, with the deregulation in North America and some parts of Europe, in fact just in the EEA, before progressively going worldwide when the initial competitors started to restructure/split/merge and aligning their too many technical standards with the need of a common interoperable one that would worlk in all their new local branches). In fact the worldwide Internet would not have become THE global network without the reorganisation of older dereregulated national telcos and the end of their monopoles. The development of "the" Internet, and the development of the UCS, were then completely made in parallel. Both were appearing to replace former national standards in the same domains previously operated by the former monopoles in telecommunications (and that also needed computing and data standards, not just networking standards). In the early time of Internet, the IP protocol was still not really adapted as the universal internetworking protocol (other competitors were also proposed by private companies, notably Token-Ring by IBM, and the X21-X25 family promoted essentially by European telcos (which prefered realtime protocols with warrantied/reserved bandwidth, and commutation by packets instead of by frames of variable sizes). Even today, there are some remaining parts of the X* network family, but only for short-distance private links: e.g. with ATM (in xDSL technologies), or for local buses within electronic devices (under the 1 meter limit), or within some critical missions (realtime constraints used for networking equipements in aircrafts, that have their own standard, wit ha few of them developped recently as adaptation of Internet technologies over channels in a realtime network, generally not structured in a "mesh" but with a "star" topology and dedicated bandwidths). If you want to look for remaining text encoding standards that are still not based on the UCS, look into aircraft technologies, and military equipements (there's also the GSM family of protocols, which continues to keep many legacy proprietary standards, with poor adaptation to Internet technologies and the UCS...) The situation is starting to change now in aircraft/military technology too (first Airbus in Europe, now also adopted by its major US competitors) and mobile networks (4G), with the full integration of the the IEEE Ethernet standard, that allows a more natural and straightforward integration of IP protocols and the UCS standards with it (even if compatibility is kept by reserving a space for former protocols, something that the IEEE Ethernet standard has already facilitated for the Internet we know now, both in worldwide communications, and in private LANs)... 
From sdaoden at yandex.com Tue May 12 13:46:26 2015
From: sdaoden at yandex.com (Steffen Nurpmeso)
Date: Tue, 12 May 2015 20:46:26 +0200
Subject: Surrogates and noncharacters
Message-ID: <20150512184626.1YIf9x0Co6o=%sdaoden@yandex.com>

Hans Aberg wrote:

|> On 12 May 2015, at 16:50, Philippe Verdy wrote:
|>> Indeed, that is why UTF-8 was invented for use in Unix-like environments.
|>
|> Not the main reason: communication protocols and data storage are also based on 8-bit code units (even if storage groups them into much larger blocks).
|
|There is some history here:
| https://en.wikipedia.org/wiki/UTF-8#History

"What happened was this":

http://doc.cat-v.org/bell_labs/utf-8_history

--steffen

From mark at macchiato.com Tue May 12 16:05:29 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 12 May 2015 14:05:29 -0700
Subject: FYI: The world's languages, in 7 maps and charts
Message-ID:

http://www.washingtonpost.com/blogs/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts/

From public at khwilliamson.com Tue May 12 17:19:57 2015
From: public at khwilliamson.com (Karl Williamson)
Date: Tue, 12 May 2015 16:19:57 -0600
Subject: Re: FYI: The world's languages, in 7 maps and charts
Message-ID: <55527C8D.2070406@khwilliamson.com>

On 05/12/2015 03:05 PM, Mark Davis ☕️
wrote:
> http://www.washingtonpost.com/blogs/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts/
> //////

And a critique:

http://languagelog.ldc.upenn.edu/nll/?p=18844

From dzo at bisharat.net Tue May 12 17:47:27 2015
From: dzo at bisharat.net (dzo at bisharat.net)
Date: Tue, 12 May 2015 22:47:27 +0000
Subject: Re: FYI: The world's languages, in 7 maps and charts
Message-ID: <281809194-1431470843-cardhu_decombobulator_blackberry.rim.net-1396854058-@b13.c4.bise6.blackberry>

And a tangent, picking up on a complaint that Swahili wasn't represented on one of the 7 WaPost graphics:

http://niamey.blogspot.com/2015/05/how-many-people-speak-what-in-africa.html

Two other recent posts on this blog ("Beyond Niamey") critique the Africa part of a set of graphics/maps of "Second Most Spoken Languages Worldwide" (on the Olivet Nazarene University site) - another thought-provoking effort that could inform better if redone.

Don Osborn

Sent via BlackBerry by AT&T

From jonathan.rosenne at gmail.com Wed May 13 04:24:45 2015
From: jonathan.rosenne at gmail.com (Jonathan Rosenne)
Date: Wed, 13 May 2015 12:24:45 +0300
Subject: RE: FYI: The world's languages, in 7 maps and charts
Message-ID: <000401d08d5e$a811de90$f8359bb0$@gmail.com>

I have two comments:

- If Hindi and Urdu are counted together, why not Italian and Portuguese?

- According to a lecture some time ago by an Israeli professor (I forgot his name), there are 80 languages actively used in Israel, including Hebrew, Arabic, English (both varieties), Russian, Ukrainian, Yiddish, Ladino, Tagalog, most European languages, and various African and East Asian languages used by the large number of refugees from Africa and foreign workers from East Asia.

Best Regards,

Jonathan Rosenne
054-4246522

From verdy_p at wanadoo.fr Wed May 13 05:37:44 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 13 May 2015 12:37:44 +0200
Subject: Re: FYI: The world's languages, in 7 maps and charts
Message-ID:

Italian and Portuguese are difficult to understand between each other (especially in speech: Italians speak really fast). On the opposite, exchange between standard French and Iberian Portuguese is really easy, with a short time of adaptation, either for native French speakers coming to Portugal for the first time or native Portuguese speakers coming to France. Also there is not much difficulty between French Guiana and Brazil for the two regional variants of the two "standard" languages.
Native Portuguese and native French speakers use approximately the same syntactic structure, similar phonology, similar rhythms, and there is a large common lexicon (also with imports from almost the same set of modern foreign languages or historical languages); if this still does not work, reading remains easy, and besides minor differences in grammatical endings, the lexical roots are the same for most words. Many words in Portuguese are borrowed directly from French with very minor changes, and the creation of new words also uses a similar system of prefixes and suffixes, which are nearly identical.

This is not true of modern Italian, which has accumulated many phonetic transformations since Latin, and which has mixed in very different sets of regional minority languages, and where the transformation of meanings (creation of new lemmas of the same term, creation of irregular words composed by fusion, and many mutations) was much deeper than in French and Portuguese (which were more conservative).

But if we speak about Hindi and Urdu, for a long time they were considered the same language in speech (the writing systems of Urdu were separated only for religious reasons, but religious texts could not be read by the vast majority of people in India). They really split into two languages only when education and literacy progressed a lot, starting in the middle of the 20th century, after the independence of India and then the separation of Pakistan. So the practical difficult differences are only in the written script, but as Urdu is also spoken in India, it is still also written with the Devanagari script (in which case it becomes relatively easy to read for native Hindi readers). Arabic-Devanagari transliterators are still heavily used for Urdu in India. And if Urdu native speakers don't want Hindi, they choose to communicate in English (as a de facto interchange language understood by both communities in India, but also by many Urdu speakers in Pakistan).

For many things, Urdu and Hindi are in a situation quite similar to Serbian Cyrillic vs. Croatian (and the Serbian Latin transliteration is often named "Serbocroatian" and can also be used as an interchange language). Bosnian (or more recently Montenegrin) is in the middle, extremely similar to Serbian Latin. For now the separation is not really justified, except for political rather than cultural reasons: the attempt to separate them is made by artificially introducing neologisms that many people don't know or use correctly, or by inventing new orthographic rules that few people know or follow exactly. Mass media cannot really help, because they are overwhelmed by media in other major languages, or because media in all these newly introduced languages are spread over the same regions; local media are not powerful enough to have a decisive audience that could rapidly influence the evolution toward separate languages, and even where they exist, they often ignore the new artificial rules. In that region, many people belonging to distinct communities have to interchange content every day; the time when Serbocroatian was still a single language is not very old; and even if the Cyrillic script is preferred in Serbia, it is still not the only standard: most people also use the Latin script easily for the same language, and transliterators do a good job, with only very minor differences remaining from the standard orthography of Serbian in each script.
2015-05-13 11:24 GMT+02:00 Jonathan Rosenne : > I have two comments: > > - if Hindi and Urdu are counted together, why not Italian and Portuguese? > [...]

From richard.wordingham at ntlworld.com Wed May 13 19:31:29 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 14 May 2015 01:31:29 +0100 Subject: Regular Expressions and Canonical Equivalence Message-ID: <20150514013129.0b68eb41@JRWUBU2>

What is the current state of play on regular expression engines that acknowledge canonical equivalence? By acknowledge, I mean that they will deem a string to have a match for a pattern if any string canonically equivalent to the string does. I believe this corresponds to the intent of requirement RL2.1 that was in UTS #18 Unicode Regular Expressions until the towel was thrown in and the paragraph survived but the requirement vanished.

I have been putting my own together, but my efforts have bogged down over how to select the match and subexpression matches to report. The relevant theory is not that of regular languages of strings, but of regular languages of 'traces'. I currently leave the results undefined if an algebraic Kleene star is not a regular expression, e.g. (\u0323\u0301)*. It is particularly relevant to using regular expressions for text rendering, e.g. for something like an imitation of Microsoft's Universal Shaping Engine.

I note that ICU is having another attempt at supporting canonical equivalence - http://bugs.icu-project.org/trac/ticket/9111 'Support UREGEX_CANON_EQ'. At least, they are if the User Guide (http://userguide.icu-project.org/strings/regexp) is to be believed. Perhaps not, though, if the old comments in the ticket are taken seriously.
For example, I believe that one should be able to find the Lanna script subscript nga in the word ?????? /k??/ 'half', or the Vietnamese letter ô U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX in the word _buộc_ 'to bind' <U+0062, U+0075, U+1ED9 LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW, U+0063>. As far as I can tell, U+1ED9 is not a letter of the Vietnamese alphabet; it is the combination <U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX, U+0323 COMBINING DOT BELOW> of Vietnamese letter and tone mark. One will not find them if one simply applies the string theory of regular expressions to NFD equivalents, as the initial bug report in the ticket suggests doing. A later comment in the ticket suggests that the alphabet for the string theory should be 'the combining sequences'. (I hope there is no theoretical problem from there being an infinite number of them.) The Vietnamese search would work if the alphabet in the string theory were *Vietnamese* collation elements.

In the text rendering domain, HarfBuzz makes regular expressions work with conversion to NFD by permuting the canonical combining classes on a script-by-script basis. This requires care.

Richard.

From richard.wordingham at ntlworld.com Thu May 14 02:59:59 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 14 May 2015 08:59:59 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150514013129.0b68eb41@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> Message-ID: <20150514085959.433e49af@JRWUBU2>

On Thu, 14 May 2015 01:31:29 +0100 Richard Wordingham wrote:

> I believe this corresponds to the intent of requirement RL2.1 that was in UTS #18 Unicode Regular Expressions until the towel was thrown in and the paragraph survived but the requirement vanished.

I apologise if I am telling those interested what they already know. I couldn't find it written down in terms of NFD strings.

I believe the core of the problem is that Thompson's construction algorithm has to be significantly elaborated for concatenation. When running the non-deterministic finite state machine for the regular expression st, if the string is amnb with ccc(m) != ccc(n), one has to consider the possibility that subsequence an matches expression s and subsequence mb matches expression t. To handle a run of decomposed characters with non-zero canonical combining class, one method adds states of the form (x,y,n) where x is a state for expression s, y is a state for expression t, and n is the non-zero canonical combining class of the last character received.

The additional problem with the (algebraic) Kleene star is that for s* one has to simultaneously consider s, ss, sss and so on, which makes the state machine non-finite. This is probably just a formal problem; once one adds capture groups to the FSM, the memory requirement depends on the size of the string being examined. A solution is to effectively add a loop to the parse structure of the regular expression and add checks to the matching function to avoid unnecessary recursion.

An elegant formal solution to the Kleene star problem interprets (\u0323\u0302)* as (\u0323|\u0302)*. However, that is counter-intuitive, and simply rejecting such expressions would probably be better. Going non-finite is probably better. My *finite* state machine bodge for these cases is to simply match s+ to something uncharacterised between s|ss and s+.

Richard.
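To make the kind of match under discussion concrete, here is a minimal sketch in Python, using only the standard unicodedata module. The function name and the skipping rule are invented for illustration; this is not any engine's actual algorithm. It finds a canonical-equivalent occurrence of ô <U+006F, U+0302> inside buộc, whose NFD form interleaves the dot below between the 'o' and the circumflex:

```python
import unicodedata

def find_canonical(haystack, needle):
    """Return the indices (in NFD(haystack)) of a possibly discontiguous
    match of NFD(needle), or None.  A mark whose non-zero combining class
    differs from that of the needle character being sought may be stepped
    over, because a canonically equivalent reordering can move it aside;
    a starter (class 0) or a mark of the same class blocks the search."""
    h = unicodedata.normalize('NFD', haystack)
    n = unicodedata.normalize('NFD', needle)
    ccc = unicodedata.combining
    for start in range(len(h)):
        positions, i = [], start
        for ch in n:
            while i < len(h) and h[i] != ch:
                if ccc(ch) == 0 or ccc(h[i]) == 0 or ccc(h[i]) == ccc(ch):
                    break        # blocked: no equivalent ordering helps
                i += 1           # different non-zero class: step over it
            if i < len(h) and h[i] == ch:
                positions.append(i)
                i += 1
            else:
                positions = None
                break
        if positions is not None:
            return positions
    return None

# NFD('buộc') is <b, u, o, U+0323, U+0302, c>; the circumflex is found
# at index 4 even though the dot below (class 220) sits in between.
print(find_canonical('bu\u1ED9c', '\u00F4'))   # [2, 4]
```

Note that the returned match is discontiguous: index 3 (the dot below) lies inside the matched span but is not part of the match, which is exactly the submatch-reporting problem raised in the replies below.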
From verdy_p at wanadoo.fr Thu May 14 05:58:29 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 14 May 2015 12:58:29 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150514085959.433e49af@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> Message-ID:

2015-05-14 9:59 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>:

> An elegant formal solution to the Kleene star problem interprets > (\u0323\u0302)* as (\u0323|\u0302)*. However, that is > counter-intuitive

Yes, it is problematic: (ab)* is not the same as (a|b)*, as the first expression requires matching pairs of letters "ab" in that order, while the second matches random strings of "a" and "b" (so the second matches *more* input samples).

Even if you consider canonical equivalences (where the relative order of "ab" does not matter, for example because the two have distinct non-zero combining classes), this does not mean that "a" alone will match in the first expression "(ab)*", even though it MUST match in "(a|b)*".

So the solution is elegant just for simplifying the first level of analysis of "(ab)*" by using "(a|b)*" instead. But then you need to perform a second pass on the match to make sure it contains only complete sequences "ab" in that order (or any other order if they are all combining characters with distinct non-zero combining classes) and no unpaired "a" or "b".

Such a two-pass transform should only be made when subregexps within a "(...)*" contain only alternatives (converted to NFD) such that each of them contains ONLY combining characters with distinct non-zero combining classes. If one of the alternatives "ab" contains any character with combining class 0, or if they have blockers with identical non-zero combining classes, we cannot use this transform.

But this two-pass transform is still elegant: the alternatives where we can use it, and that require a second pass, have a bounded length (it is impossible for them to be longer than 255 code points, given that there cannot be more than 255 *distinct* non-zero combining classes). The current UCD uses a much lower number of non-zero combining classes, so this limit is even lower: the substrings where this transform is possible will be extremely short, and a second pass on them will be extremely fast (using very small string buffers that can stay in memory).

For your example "(\u0323\u0302)*", the characters in the alternatives (COMBINING DOT BELOW and COMBINING CIRCUMFLEX ACCENT), once converted to NFD (which is the same here), use at most two distinct non-zero combining classes and no blocker; so it is safe to transform it to (\u0323|\u0302)* for a first-pass match that will then only check candidate matches in the second pass. Or, more efficiently, a second finite state automaton (FSA) can run in parallel with its own state: in your example this second FSA has just 2 states, the initial state 0 (which is also the final/accept state) and state 1 after matching one character of the pair. When you reach the point where matching (\u0323|\u0302)* with the first level of analysis would terminate, you just need to check the state of the second FSA to see whether it is also in the initial/final/accept state 0 (otherwise this is not a valid accept state for the untransformed (\u0323\u0302)* regexp).
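A minimal sketch of this two-pass idea in Python, assuming nothing beyond the standard re module (the function name is invented for illustration). The first pass applies the relaxed alternation; for the second pass, since classes 220 and 230 commute freely, pairing reduces to comparing counts. Note that the second pass is counting, which is exactly what a finite automaton cannot do - the objection raised in the replies below:

```python
import re

# Pass 1: the relaxed alternation (\u0323|\u0302)* as a character class.
RELAXED = re.compile('[\u0323\u0302]*')

def matches_pair_star(s: str) -> bool:
    """Two-pass check for (\\u0323\\u0302)* under canonical equivalence.
    Pass 1: the relaxed pattern must cover the whole string.
    Pass 2: because classes 220 and 230 commute, some canonically
    equivalent ordering groups the marks into pairs iff the counts of
    the two marks are equal -- a counting check, not a finite automaton."""
    if not RELAXED.fullmatch(s):
        return False
    return s.count('\u0323') == s.count('\u0302')

print(matches_pair_star('\u0323\u0302' * 3))      # True: three pairs
print(matches_pair_star('\u0323\u0323\u0302'))    # False: unpaired mark
```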
However, the most difficult part for regexps supporting canonical equivalence is what to do about returning submatches: they are not necessarily contiguous in the input stream. You can still return a matching substring, but if you use it for performing search/replace operations, it becomes difficult to know where to place the replacement, when the replacement string (even if it was converted first to NFD) may also contain combining characters. It is even worse if the replacement contains blockers that will be inserted in the middle of the non-replaced text (and where can we safely place the remaining characters that sit in the middle of the match but are not part of the match itself?).

One solution is not to exclude these characters in the middle of a match, and to return them too. It is up to the replacement function to check for their existence: the regexp engine can just provide, in addition to the returned matched substring, an index of the characters that are in fact not part of the actual match but present in the middle, instead of just the substring for the match. Or it can return just the exact matching substring, but also an index array containing the relative positions of its characters in the actual input string (in standard matches those indexes would be the sequence of integers 0 to N-1, where N is the length of the matched substring; if the sequence is discontinuous in the input, the sequence will still be increasing, but with some steps higher than 1, leaving some holes, and the last index in that sequence will be equal to or higher than N).
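A small sketch of that index-array idea in Python (the helper name is invented; it builds on the positions returned by find_canonical above, but the literal arguments below make it self-contained): given the increasing indices of a discontiguous match, compute the "holes", the ranges of in-between characters that are not part of the match and that a replacement function would have to preserve.

```python
def match_with_holes(positions, start, end):
    """positions: increasing indices of the matched characters;
    [start, end): the span of the input they are drawn from.
    Returns (positions, holes), where holes lists the index ranges of
    characters inside the span that are not part of the match."""
    holes, prev = [], start
    for p in positions:
        if p > prev:
            holes.append((prev, p))
        prev = p + 1
    if prev < end:
        holes.append((prev, end))
    return positions, holes

# The earlier match of 'ô' in NFD('buộc') occupied indices [2, 4] of the
# span [2, 5); the hole (3, 4) is the dot below that must be kept.
print(match_with_holes([2, 4], 2, 5))   # ([2, 4], [(3, 4)])
```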
From webalorixa at gmail.com Thu May 14 08:24:57 2015 From: webalorixa at gmail.com (Luis de la Orden) Date: Thu, 14 May 2015 14:24:57 +0100 Subject: Re: FYI: The world's languages, in 7 maps and charts In-Reply-To: References: <281809194-1431470843-cardhu_decombobulator_blackberry.rim.net-1396854058-@b13.c4.bise6.blackberry> <000401d08d5e$a811de90$f8359bb0$@gmail.com> Message-ID:

As a speaker of both Portuguese (mother tongue, native) and Spanish (father, not native anymore), with a Catalan connection (dad was from Barcelona and I lived there for a few months; amazing language, love it to bits), I would say these two languages are closer to each other than Italian is to Portuguese. But never so close as to consider them the same, I can assure you :). In fact, even European Portuguese can be a bit hermetic to understand, although its speakers understand Brazilian Portuguese better. This is all down to the fact that they import more cultural products, such as books and TV programmes, from Brazil than the other way around. European Portuguese speakers say that the way we tonalise and inflect the language is softer, but I believe they understand us due to exposure to the language, which in turn teaches them to decipher our pronunciation and the Brazilian linguistic idiosyncrasies.

From doug at ewellic.org Thu May 14 09:08:14 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 14 May 2015 07:08:14 -0700 Subject: Regular Expressions and Canonical Equivalence Message-ID: <20150514070814.665a7a7059d7ee80bb4d670165c8327d.cc9235bbfb.wbe@email03.secureserver.net>

Richard Wordingham wrote:

> For example, I believe that one should be able to find [...] the Vietnamese letter ô U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX in the word _buộc_ 'to bind' <U+0062, U+0075, U+1ED9 LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW, U+0063>. As far as I can tell, U+1ED9 is not a letter of the Vietnamese alphabet; it is the combination <U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX, U+0323 COMBINING DOT BELOW> of Vietnamese letter and tone mark.

What you're looking for in this case is neither an NFC match nor an NFD match, but a language-dependent match, as you imply further down. <1ED9> decomposes to <006F 0323 0302>, and if you want a match with <00F4>, which decomposes to <006F 0302>, your regex engine has to reorder the marks. It sounds unlikely that you'll find such an engine, but there is a lot of Vietnamese-language-specific software out there, so you never know.

-- Doug Ewell | http://ewellic.org | Thornton, CO

From slevin at signpuddle.net Thu May 14 12:25:06 2015 From: slevin at signpuddle.net (Stephen E Slevinski Jr) Date: Thu, 14 May 2015 12:25:06 -0500 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> Message-ID: <5554DA72.4040509@signpuddle.net>

On 5/14/15 5:58 AM, Philippe Verdy wrote: > So the solution is elegant just for simplifying the first level of analysis of "(ab)*" by using "(a|b)*" instead. But then you need to perform a second pass on the match to make sure it contains only complete sequences "ab" in that order (or any other order if they are all combining characters with distinct non-zero combining classes) and no unpaired "a" or "b".

If you always want to find "a" and "b" in a pair without regard to the order, how about the regex: ((ab)|(ba))*

- Steve

From wjgo_10009 at btinternet.com Thu May 14 12:14:57 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 14 May 2015 18:14:57 +0100 (BST) Subject: Tag characters Message-ID: <23012433.62520.1431623697436.JavaMail.defaultUser@defaultHost>

http://www.unicode.org/L2/L2015/15107.htm

Section E.1.3 of the above-linked document is amazing and is about a brilliant new use for some of the tag characters. What else would be possible if the same sort of technique were applied to another base character?

William Overington 14 May 2015

From richard.wordingham at ntlworld.com Thu May 14 12:55:33 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 14 May 2015 18:55:33 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <5554DA72.4040509@signpuddle.net> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <5554DA72.4040509@signpuddle.net> Message-ID: <20150514185533.776a7772@JRWUBU2>

On Thu, 14 May 2015 12:25:06 -0500 Stephen E Slevinski Jr wrote:
> If you always want to find "a" and "b" in a pair without regard to the order, how about the regex: ((ab)|(ba))*

In NFD, the language (\u0323\u0302)* consists of

ε (the empty string)
\u0323\u0302
\u0323\u0323\u0302\u0302
\u0323\u0323\u0323\u0302\u0302\u0302
\u0323\u0323\u0323\u0323\u0302\u0302\u0302\u0302

and so on. Therefore the finite automaton implied by your regex won't work. No regular expression will work. That is mathematically proven. What I have listed above is the standard example of a 'non-regular language', a set of strings that cannot be defined by a finite set of regular expressions.

Richard.

From richard.wordingham at ntlworld.com Thu May 14 13:13:24 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 14 May 2015 19:13:24 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> Message-ID: <20150514191324.1e455c57@JRWUBU2>

On Thu, 14 May 2015 12:58:29 +0200 Philippe Verdy wrote:

> > > An elegant formal solution to the Kleene star problem interprets > > > (\u0323\u0302)* as (\u0323|\u0302)*. However, that is > > > counter-intuitive

The technical term for this is the 'concurrent iteration' - or at least, that's the term used in the 'Book of Traces'.

> For your example "(\u0323\u0302)*", the characters in the alternatives (COMBINING DOT BELOW and COMBINING CIRCUMFLEX ACCENT), once converted to NFD (which is the same here), use at most two distinct non-zero combining classes and no blocker; so it is safe to transform it to (\u0323|\u0302)* for a first-pass match that will then only check candidate matches in the second pass. Or, more efficiently, a second finite state automaton (FSA) can run in parallel with its own state:

You've forgotten the basic problem. A *finite* state automaton cannot count very far; with only n states, it cannot count as far as n. For this simple example, one could simply use something like (\u0323\u0302)\{0,7\}, which should be more than enough for any likely occurrences. It's an interesting challenge, but I think solving it provides satisfaction rather than practical benefit.

> However, the most difficult part for regexps supporting canonical equivalence is what to do about returning submatches: they are not necessarily contiguous in the input stream. [...]
Interestingly, ICU hides that detail from the user. For search and replace on a text buffer, the text to be replaced would be defined by a list of text intervals. If the text is unnormalised, some of the boundaries may divide precomposed characters. If the interval list is compacted, at most one of the intervals will contain a character properly having combining class 0. (U+0F73 and U+0F75 do not count.) If there is such an interval, it will be replaced and the others simply deleted. If there is no such interval, then the choice of insertion point may be more difficult. Indeed, in some cases it could be appropriate to reject the replacement command as undefined in the context. On the other hand, if the text buffer is normalised, then one would be able to have well-defined behaviour, as one does when splitting text into UCA collating elements.

Richard.

From richard.wordingham at ntlworld.com Thu May 14 14:29:06 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 14 May 2015 20:29:06 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150514070814.665a7a7059d7ee80bb4d670165c8327d.cc9235bbfb.wbe@email03.secureserver.net> References: <20150514070814.665a7a7059d7ee80bb4d670165c8327d.cc9235bbfb.wbe@email03.secureserver.net> Message-ID: <20150514202906.34c33518@JRWUBU2>

On Thu, 14 May 2015 07:08:14 -0700 "Doug Ewell" wrote:

> What you're looking for in this case is neither an NFC match nor an NFD match, but a language-dependent match, as you imply further down. <1ED9> decomposes to <006F 0323 0302>, and if you want a match with <00F4>, which decomposes to <006F 0302>, your regex engine has to reorder the marks. It sounds unlikely that you'll find such an engine, but there is a lot of Vietnamese-language-specific software out there, so you never know.

There's no more reordering than is involved in doing a Vietnamese collation-based search, where one has to split <006F 0323 0302> up into collating elements <006F 0302><0323>. Possibly a back-tracking regular expression would reorder the string.

My experimental canonical-equivalence-respecting regular expression engine is designed in the same manner as the Thompson construction - it is a non-deterministic finite automaton (except for the effects of capturing parts of the input string) composed of a hierarchy of non-deterministic finite automata. States are identified as strings of scalars following the hierarchy. The engine checks whether a string matches a regular expression. The engine decomposes the string to NFD. This keeps the automaton for the concatenation of two regular expressions simple. I will now show how it handles the search. The regular expression to match against is \u00f4.* - a character U+00F4 followed by anything, including nothing.
My program essentially produced the following output, with comments added later indicated by #:

$ ./regex '\u00f4.*' ộ    # Arguments are regular expression and string
Simple Unicode regex "\u006F\u0302"    # First half of regular expression as the automaton actually sees it.
Initial states: 0) L0    # Initial state - expecting 'o', in first half of expression. 'L' = left.
=o=10:20:=    # Gets 'o'
L0 => L1    # Changes state to expecting combining circumflex
=0323=20:30:=    # Gets combining dot below (U+0323)
L1 => N001220:1:*    # N => State for concatenation of regular expressions; both automata are run. 001 => Substring length within state identifier. 220 => Combining class of U+0323. Characters with this ccc or lower may no longer be processed by the left-hand automaton. : is punctuation for readability of state. 1 => Left half still expecting combining circumflex. * => only state for regex ".*".
=0302=30:06:=    # Gets combining circumflex (U+0302). The engine runs a non-deterministic finite automaton. It now branches to 3 states.
N001220:1:* => N001220:M:*    # Left half has now reached the end of the expected string.
N001220:1:* => R* (match)    # On transferring to an accept state of the \u00f4 automaton, only the .* automaton needs to be processed.
N001220:1:* => N001230:1:*    # Possibly the combining circumflex is to match the '.*'. The combining class is updated to 230. This 230 will actually block the \u00f4 automaton from reaching an accept state from this state, for a combining circumflex can henceforth only be considered by the .* automaton.

At no point has the input string been reordered.

Richard.

From wjgo_10009 at btinternet.com Thu May 14 15:40:19 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 14 May 2015 21:40:19 +0100 (BST) Subject: Tag characters In-Reply-To: <23012433.62520.1431623697436.JavaMail.defaultUser@defaultHost> References: <23012433.62520.1431623697436.JavaMail.defaultUser@defaultHost> Message-ID: <11890020.82119.1431636019782.JavaMail.defaultUser@defaultHost>

> What else would be possible if the same sort of technique were applied to another base character?

Thinking about this further, could the technique be used to solve the requirements of section 8 Longer Term Solutions of http://www.unicode.org/reports/tr51/tr51-2.html ?

Both colour pixel map and colour OpenType vector font solutions would be possible.

Colour voxel map and colour vector 3D solids solutions are worth thinking about too, as fun coding thought experiments that could possibly lead to useful practical results.

William Overington 14 May 2015

From shervinafshar at gmail.com Thu May 14 16:26:32 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Thu, 14 May 2015 14:26:32 -0700 Subject: Tag characters In-Reply-To: <11890020.82119.1431636019782.JavaMail.defaultUser@defaultHost> References: <23012433.62520.1431623697436.JavaMail.defaultUser@defaultHost> <11890020.82119.1431636019782.JavaMail.defaultUser@defaultHost> Message-ID:

> Thinking about this further, could the technique be used to solve the requirements of section 8 Longer Term Solutions

IMO, the industry-preferred longer-term solution for emoji (which is also discussed in that section, with a few existing examples) is not going to be based on characters.

- Shervin
From doug at ewellic.org Thu May 14 17:13:39 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 14 May 2015 15:13:39 -0700 Subject: Tag characters Message-ID: <20150514151339.665a7a7059d7ee80bb4d670165c8327d.ce7108c845.wbe@email03.secureserver.net>

http://www.unicode.org/L2/L2015/15107.htm points indirectly to:

http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf

which says:

> The proposal has two parts
>
> 1. Un-deprecate TAG characters E0020-E007E.

Hee hee. Hee hee.

> 2. Define a character as the 'base' for a following sequence of TAG characters that specifies a region or subregion to be represented using a sequence of TAG characters. There are two possibilities for the base character:
>
> a. Preferred: Use the Unicode 7.0 character WAVING WHITE FLAG:
> 1F3F3;WAVING WHITE FLAG;So;0;ON;;;;;N;;;;;
> The advantage is no new characters need be encoded.

"Add language to UTR #51 describing the mechanism given in 2A" means that U+1F3F3 will be the tag introducer, basically the "flag emoji" equivalent of U+E0001 LANGUAGE TAG.

I think I understand why the TAG/CANCEL TAG start-end mechanism which was invented for Plane 14 language tags wasn't reused for flag emoji. Adding U+E0002 FLAG TAG would have implied that the sequence ends with CANCEL TAG. Flags don't have scope, and there is no need to indicate the end of the sequence explicitly for scoping purposes, as there is with tagged text.

I assume that existing text with U+1F3F3 followed by no tag characters should continue to display the waving white flag glyph, whereas text conforming to this new mechanism should suppress that glyph and show the Scottish, Welsh, Delawarean, or Nordlending flag instead.

> Using the following notation -
> B designates the chosen base character (U+1F3F3 or new U+1F1E5)
> TL designates a TAG LATIN CAPITAL LETTER (A..Z)
> TD designates a TAG DIGIT (ZERO..NINE)
> TH designates TAG HYPHEN-MINUS
>
> - a well-formed sequence for designating flags for ISO 3166-1, 3166-2 or UN M49 codes would be
>
> B ((TL{2} (TH (TL|TD){3})?) | (TD{3}))

Will the subdivision sequence always be exactly 3 characters long? CLDR ticket #8423 seems to say that ISO 3166-2 code elements that are only 1 or 2 characters long will be prepended with "xx" or "x" to make them all exactly 3. Obviously some research will need to be done to ensure this doesn't result in conflicts with existing code elements, and of course 3166-2 makes no promises that future assignments will deliberately avoid such a conflict.
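A quick way to see the shape of that grammar is to spell it out as a regular expression over the actual code points. The sketch below (Python, standard re module only) is built directly from the pattern quoted above with the preferred base U+1F3F3; it illustrates the proposal as written, not whatever mechanism is eventually standardized, and the example sequences are assumptions of mine:

```python
import re

# The proposal's well-formedness pattern over concrete code points:
#   B ((TL{2} (TH (TL|TD){3})?) | (TD{3}))
B  = '\U0001F3F3'                  # WAVING WHITE FLAG as the base
TL = '[\U000E0041-\U000E005A]'     # TAG LATIN CAPITAL LETTER A..Z
TD = '[\U000E0030-\U000E0039]'     # TAG DIGIT ZERO..NINE
TH = '\U000E002D'                  # TAG HYPHEN-MINUS

WELL_FORMED = re.compile(
    '{B}(?:{TL}{{2}}(?:{TH}(?:{TL}|{TD}){{3}})?|{TD}{{3}})$'.format(
        B=B, TL=TL, TD=TD, TH=TH))

# France (ISO 3166-1 FR): base + TAG F + TAG R -- Doug's <1F3F3 E0046 E0052>
fr = '\U0001F3F3\U000E0046\U000E0052'
# Scotland (ISO 3166-2 GB-SCT): base + G B + hyphen + S C T
sct = ('\U0001F3F3\U000E0047\U000E0042\U000E002D'
       '\U000E0053\U000E0043\U000E0054')
# World (UN M49 001): base + three tag digits
world = '\U0001F3F3\U000E0030\U000E0030\U000E0031'

for seq in (fr, sct, world):
    print(bool(WELL_FORMED.match(seq)))    # True, True, True
```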
Will both mechanisms, old and new, be available for encoding national flags? For example, for a French flag: <1F1EB 1F1F7> or <1F3F3 E0046 E0052>

> In CLDR 28, LDML will define a unicode_subdivision_subtag which also provides validity criteria for the codes used for regional subdivisions (see CLDR ticket #8423). When representing regional subdivisions using ISO 3166-2 codes, only those codes that are valid for the LDML unicode_subdivision_subtag should be used.

I note that a preliminary file is already available at http://unicode.org/repos/cldr/trunk/common/supplemental/subdivisions.xml .

-- Doug Ewell | http://ewellic.org | Thornton, CO

From verdy_p at wanadoo.fr Thu May 14 19:10:36 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 15 May 2015 02:10:36 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150514191324.1e455c57@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> Message-ID:

2015-05-14 20:13 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>:

> You've forgotten the basic problem. A *finite* state automaton cannot count very far; with only n states, it cannot count as far as n.

I did not forget it; this is why there is a second pass (or a second FSA running in parallel to indicate its own accept state). You have to combine the two state variables to get the final combined state and determine whether it is a final accept state. But one of the two state variables has an upper bound which is not only finite but very small (it has at most 255 possible values).

Typical regexp engines do not create the full deterministic automaton with all its states (it would require a lot of memory due to combinatorial effects); they handle multiple state variables in parallel and use a rendez-vous system to test them in order to determine whether we have an accept state or a fail state (for which we must roll back). So even if one of the states is not bounded in terms of length, the other one (exploring the possible runs of reorderable, non-blocking combining characters) is clearly finite, so you don't need to count very far. You just need a single additional byte of storage for the second state variable in the global state of your FSA.
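The "multiple state variables in parallel" point is the standard way of running a nondeterministic automaton without ever building its deterministic expansion. A minimal sketch follows (plain Python; the function name and the toy automaton are mine). Note how the literal automaton for (\u0323\u0302)* also illustrates the objection quoted above: once the marks arrive in NFD order, the pair-by-pair automaton dies.

```python
# Simulate a nondeterministic automaton by advancing a *set* of live
# states in parallel, instead of materializing the deterministic FSA.
def run_nfa(transitions, start, accept, s):
    """transitions maps (state, char) -> iterable of next states."""
    states = {start}
    for ch in s:
        states = {nxt for st in states
                      for nxt in transitions.get((st, ch), ())}
        if not states:              # every parallel branch has failed
            return False
    return bool(states & accept)

# The automaton for (\u0323\u0302)* taken literally: state 0 expects
# U+0323, state 1 expects U+0302, and 0 is the accept state.
t = {(0, '\u0323'): {1}, (1, '\u0302'): {0}}
print(run_nfa(t, 0, {0}, '\u0323\u0302\u0323\u0302'))  # True: two pairs
print(run_nfa(t, 0, {0}, '\u0323\u0323\u0302\u0302'))  # False: NFD order
```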
The size of the global state variable only depends on the number of alternatives in your regexp, and it is also bounded (limited by the length of the source regexp): even if your regexp specification string is 1000 characters long, you know that you will never need more than 1000 bytes to represent it, though of course it will not be a simple 32-bit integer. This structure can represent billions of billions of billions of possible states without transforming the FSA into a pure deterministic FSA driven by a single integer, and without having to build a single MxN transition matrix (with M columns for each possible character class and N rows for each deterministic state, where each cell contains the next deterministic state: that will not work). Even if your regexp is so complex that it requires a specification string 100KB long, your global state variable will never be longer than 100KB.

But of course, this structure is a bit less easy to use when advancing: you have to advance all active states in parallel, using the current input character with each transition submatrix (which is really small as well, with just a couple of elements that can fit in a small fixed-size structure: an accepted character, limited to 21 bits with Unicode, or a character-class index, plus a few flags - 3 bits - saying whether this character can advance, whether the current state is an accept state or a failure state, or indicating the presence of an alternative, with an index to its own branch specifying which elementary state variable is used for that alternative within the structure of the global state variable).

In summary, the global state variable is just a small array of 32-bit integers for the most complex regexps you will encounter (I don't think that 100KB regexps are very common; almost all of them are below 1KB, so your global state variable will fit in 4KB of memory, and the transition matrix will also fit in 4KB).

From verdy_p at wanadoo.fr Thu May 14 19:38:17 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 15 May 2015 02:38:17 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150514191324.1e455c57@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> Message-ID:

2015-05-14 20:13 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>:

> If the interval list is compacted, at most one of the intervals will contain a character properly having combining class 0.

This is not a sufficient condition. There is also the case where two intervals contain combining characters with the same combining class: their relative order is significant, because one blocks the other (it limits the allowed reorderings that are canonically equivalent). And if the replacement string also adds its own blockers, the situation is worse...
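The blocking point is easy to demonstrate with the standard normalizer. A Python one-off, using only unicodedata (the choice of U+0316/U+0317 as the same-class example marks is mine): marks of different combining classes commute under canonical equivalence, while marks of the same class do not.

```python
import unicodedata

def nfd(s):
    return unicodedata.normalize('NFD', s)

# Different classes commute: dot below (220) and circumflex (230).
a = 'o\u0323\u0302'
b = 'o\u0302\u0323'
print(nfd(a) == nfd(b))   # True: same canonical form, equivalent strings

# Equal classes do not: grave below and acute below are both class 220,
# so their relative order is significant -- no reordering is allowed.
c = 'o\u0316\u0317'
d = 'o\u0317\u0316'
print(nfd(c) == nfd(d))   # False: not canonically equivalent
```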
There's no simple way to determine what to do by just returning a replacement string that the regexp engine will insert itself in the output text. The best that can be done is for the regexp engine to give a full view not only of the characters within matches, but also of the characters in the middle that are not part of the match: instead of performing the insertion itself (from a single expression for the replacement text), it would call a callback function that also analyses the non-matched characters in the middle to decide what to do with them. You should then be able to choose between several replacement patterns, including placeholders for the unmatched intervals - say, numbered placeholders with negative values $-1, $-2, ..., with positive or null numbers used for the classical array of matched captures $0, $1, ... (But for these additional captures that are not part of the match, you need a way to indicate their placement within the true matched captures; and not all positive captures share the same set of negative captures, nor at the same positions.)

Note that to make sure we can perform safe replacements within normalized text, and to make sure the result will also be normalized, the negative captures need to include some characters that are not in the middle of a match: all the other combining characters with non-zero combining class before the matched string (if the matched string does not start with a character with combining class 0), and those after it that have a higher combining class than the last character in the positive capture. If the positive capture is an empty string, the first negative capture will include all combining characters with distinct non-zero combining classes before the insertion point of that empty positive capture, and the second one will include all non-zero combining characters after the insertion point that have distinct non-zero combining classes. (These two negative captures are bounded in length to at most 255 characters, just like the negative captures added for parts of the input that are in the middle of a positive capture.)

So far I have never seen any regexp engine supporting the concept of "negative captures"; all of them return only positive ones, including when they allow the replacement to be a callback rather than just a static string with optional placeholders.

From petercon at microsoft.com Thu May 14 21:44:21 2015 From: petercon at microsoft.com (Peter Constable) Date: Fri, 15 May 2015 02:44:21 +0000 Subject: Tag characters In-Reply-To: References: <23012433.62520.1431623697436.JavaMail.defaultUser@defaultHost> <11890020.82119.1431636019782.JavaMail.defaultUser@defaultHost> Message-ID:

And yet UTC devotes lots of effort (with an entire subcommittee) to encoding more emoji as characters, but no effort toward any preferred longer-term solution not based on characters.
Peter

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Shervin Afshar Sent: Thursday, May 14, 2015 2:27 PM To: wjgo_10009 at btinternet.com Cc: unicode at unicode.org Subject: Re: Tag characters

> IMO, the industry-preferred longer-term solution for emoji (which is also discussed in that section, with a few existing examples) is not going to be based on characters. [...]

From shervinafshar at gmail.com Thu May 14 22:11:37 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Thu, 14 May 2015 20:11:37 -0700 Subject: Future of Emoji? (was Re: Tag characters) Message-ID:

Peter,

This very topic was discussed in the last meeting of the subcommittee, and my impression is that there are plans to promote the use of embedded graphics (aka stickers), either through expansions to section 8 of TR51 or through some other means. It should also be noted that none of the current members of Unicode seem to have a sticker-based implementation (with the exception of an experimental limited trial by Twitter[1]).

[1]: http://mashable.com/2015/04/16/twitter-star-wars-emoji/

- Shervin

On Thu, May 14, 2015 at 7:44 PM, Peter Constable wrote: > And yet UTC devotes lots of effort (with an entire subcommittee) to encoding more emoji as characters, but no effort toward any preferred longer-term solution not based on characters. [...]
From petercon at microsoft.com Thu May 14 23:37:37 2015 From: petercon at microsoft.com (Peter Constable) Date: Fri, 15 May 2015 04:37:37 +0000 Subject: Future of Emoji? (was Re: Tag characters) In-Reply-To: References: Message-ID:

Skype uses stickers, including animated stickers. Here's the documented set:

https://support.skype.com/en/faq/FA12330/what-is-the-full-list-of-emoticons

And if you search, you'll find lots more 'hidden' emoticons, like '(bartlett)'.

Peter

From: Shervin Afshar [mailto:shervinafshar at gmail.com] Sent: Thursday, May 14, 2015 8:12 PM To: Peter Constable Cc: unicode at unicode.org Subject: Future of Emoji? (was Re: Tag characters)

> This very topic was discussed in the last meeting of the subcommittee, and my impression is that there are plans to promote the use of embedded graphics (aka stickers) [...]
[1]: http://www.unicode.org/L2/L2015/15059-emoji-im-yahoo.pdf [2]: http://www.unicode.org/L2/L2015/15058-emoji-im-msn.pdf [3]: http://www.huffingtonpost.com/2014/10/14/facebook-stickers-comments_n_5982546.html [4]: https://creator.line.me/en/guideline/ ? Shervin On Thu, May 14, 2015 at 9:37 PM, Peter Constable wrote: > Skype uses stickers, including animated stickers. Here?s the documented > set: > > > > https://support.skype.com/en/faq/FA12330/what-is-the-full-list-of-emoticons > > > > And if you search, you?ll find lots more ?hidden? emoticons, like > ?(bartlett)?. > > > > > > > > Peter > > > > > > *From:* Shervin Afshar [mailto:shervinafshar at gmail.com] > *Sent:* Thursday, May 14, 2015 8:12 PM > *To:* Peter Constable > *Cc:* unicode at unicode.org > *Subject:* Future of Emoji? (was Re: Tag characters) > > > > Peter, > > > > This very topic was discussed in last meeting of the subcommittee and my > impression is that there are plans to promote the use of embedded graphics > (aka stickers) either through expansions to section 8 of TR51 or through > some other means. It should also be noted that none of current members of > Unicode seem to have a sticker-based implementation (with the exception of > an experimental limited trial by Twitter[1]). > > > > [1]: http://mashable.com/2015/04/16/twitter-star-wars-emoji/ > > > > > ? Shervin > > > > On Thu, May 14, 2015 at 7:44 PM, Peter Constable > wrote: > > And yet UTC devotes lots of effort (with an entire subcommittee) to > encode more emoji as characters, but no effort toward any preferred longer > term solution not based on characters. > > > > > > Peter > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Shervin > Afshar > *Sent:* Thursday, May 14, 2015 2:27 PM > *To:* wjgo_10009 at btinternet.com > *Cc:* unicode at unicode.org > *Subject:* Re: Tag characters > > > > Thinking about this further, could the technique be used to solve the > requirements of > section 8 Longer Term Solutions > > > > IMO, the industry preferred longer term solution (which is also discussed > in that section with few existing examples) for emoji, is not going to be > based on characters. > > > > > ? Shervin > > > > On Thu, May 14, 2015 at 1:40 PM, William_J_G Overington < > wjgo_10009 at btinternet.com> wrote: > > > What else would be possible if the same sort of technique were applied > to another base character? > > > Thinking about this further, could the technique be used to solve the > requirements of > > section 8 Longer Term Solutions > > of > > http://www.unicode.org/reports/tr51/tr51-2.html > > ? > > > Both colour pixel map and colour OpenType vector font solutions would be > possible. > > > Colour voxel map and colour vector 3d solids solutions are worth thinking > about too as fun coding thought experiments that could possibly lead to > useful practical results. > > > > > William Overington > > > 14 May 2015 > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Fri May 15 02:10:03 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 15 May 2015 08:10:03 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> Message-ID: <20150515081003.1984d0c4@JRWUBU2> On Fri, 15 May 2015 02:10:36 +0200 Philippe Verdy wrote: > 2015-05-14 20:13 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > On Thu, 14 May 2015 12:58:29 +0200 > > Philippe Verdy wrote: > > > > > 2015-05-14 9:59 GMT+02:00 Richard Wordingham < > > > richard.wordingham at ntlworld.com>: > > > > > > > An elegant formal solution to the Kleene star problem interprets > > > > (\u0323\u0302)* as (\u0323|\u0302)*. However, that is > > > > counter-intuitive > > > > The technical term for this is the 'concurrent iteration' - or at > > least, that's the term used in the 'Book of Traces'. > > > > > For your example "(\u0323\u0302)*" the characters in the > > > alternatives (COMBINING DOT BELOW and COMBINING ACUTE ACCENT), > > > once converted to NFD (which is the same here) are just using at > > > most two distinct non-zero combining classes and no blocker; so > > > it is safe to trasform it to (\u0323|\u0302)* for a first pass > > > matching that will then only check candidate matchs in the second > > > pass. or more efficiently, a second finite state automata (FSA) > > > running in parallel with its own state: > > > > You've forgotten the basic problem. A *finite* state automaton > > cannot count very far; with only n states, it cannot count as far > > as n. > > > > I did not forget it, this is why there's a second pass (or a second > FSA running in parallel to indicate its own accept state). You have > to combine the two states variables to get the final combined state > to determine if it is a final accept state. Your description makes no sense to me as a description of a finite state automaton. Now, a program to check whether a trace matching {\u0323|\u0302)* matches (\u0323\u0302)* is very simple. It just counts the number of times \u0323 occurs and the number of times \u0302 occurs, and returns whether they are equal. The two counters are the key variables (and one could just keep the difference in the counts). However, this is not a finite state automaton. Now, to some extent I am cheating by assuming that the characters are delivered in NFD order. If I did not do this, to construct the non-deterministic finite automaton (NDFA) for the concatenation of two sets / regular expressions, the triples (x, y, n) of (left NDFA state, right NDFA state, highest non-zero ccc assigned to righthand component) would need to be expanded. The third component would become a list of non-zero ccc's - in principle 2^254 values, but in fact rather fewer as not all 255 ccc values are used by Unicode. It is still finite. I prefer to keep the complexity out of the regular expression engine proper. Given a NDFA recognising a set of NFD strings, one can convert it to a deterministic finite automaton (DFA), say X, provided one does not run out of memory or time. One can then 'easily' construct a DFA Y recognising the canonical equivalents of the strings. The state in DFA Y reached by string x is defined to be the state reached by the string to_NFD(x) in DFA X. This method relies on the identity to_NFD(to_NFD(x)z) = to_NFD(xz). 
This handles the recognition of a string canonically equivalent to \u0323\u0302. (The constructions above are sledge hammers; the NDFAs have many unreachable states.) However, recognising canonical equivalents of (\u0323\u0302)* via an FSM is rather more difficult; to be precise, it cannot be done by an FSM. Richard. From abdo.alrhman.aiman at gmail.com Fri May 15 09:18:47 2015 From: abdo.alrhman.aiman at gmail.com (=?UTF-8?B?2LnYqNivINin2YTYsdit2YXYp9mGINij2YrZhdmG?=) Date: Fri, 15 May 2015 17:18:47 +0300 Subject: Arabic diacritics Message-ID: hi, regarding the Arabic diacritics. e.g. for the Shadda, we have: 1. The form that people type: http://unicode-table.com/en/0651/ 2. An Isolated form. It should be the same, but looks different in the Unicode table, which is confusing me now. http://unicode-table.com/en/FE7C/ 3. A medial form: http://unicode-table.com/en/FE7D/ When do I use 1/2, and when do I use 3? some diacritics has e.g. isolated and medial forms. Some have only one of these forms, some have both. So, where does each of them go? respectfully -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Fri May 15 10:45:03 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 15 May 2015 16:45:03 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> Message-ID: <20150515164503.2c8624f0@JRWUBU2> On Fri, 15 May 2015 02:38:17 +0200 Philippe Verdy wrote: > 2015-05-14 20:13 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > If the interval list is compacted, at most one of the intervals will > > contain a character properly having combining class 0. > > This is not a sufficent condition, there is also the case where two > intervals contain combining characters with the same combining class: > their relative order is significant because one is blocking the other > (it limits the alllowed reorderings that are canonically equivalent). If two fully decomposed characters of combining class 0 are included in the match to a subexpression, all the characters between them will be included. The needs you perceive would be met by providing the start and end points of the locations of the non-starters flanking the matching string on the sides where it starts with a non-starter or ends with a character with non-zero rccc. (U+00E2 would probably have to count as a non-starter for your purposes.) However, I'm not sure that passing the positions would not suffice. Don't forget that the input string can be rearranged, preserving canonical equivalence, so that the captured string is actually contiguous. I think this discussion on search and replace would benefit from some examples. I don?t see your problem. Is it based on experience? I have some fairly simple examples. My first example is the replacement of ? by U+00E2 in the 4-character string bu?c . U+1ED9 has the full decomposition . The substring ? has the discontiguous position, in inclusive:exclusive notation: Component 1 at Position 2:Component 2 at Position 2 (content U+006F) Component 3 at Position 2:Whole at Position 3 (content U+0302) Now, the regular expression syntax for an identified substring suggests that it is contiguous. For substitution, it therefore makes most sense to view the whole string as though it were the canonically equivalent , a form in which the identified substring is contiguous. 
From abdo.alrhman.aiman at gmail.com Fri May 15 09:18:47 2015 From: abdo.alrhman.aiman at gmail.com (عبد الرحمان أيمن) Date: Fri, 15 May 2015 17:18:47 +0300 Subject: Arabic diacritics Message-ID:

hi, regarding the Arabic diacritics: e.g. for the Shadda, we have:
1. The form that people type: http://unicode-table.com/en/0651/
2. An isolated form. It should be the same, but looks different in the Unicode table, which is confusing me now. http://unicode-table.com/en/FE7C/
3. A medial form: http://unicode-table.com/en/FE7D/
When do I use 1/2, and when do I use 3? Some diacritics have, e.g., isolated and medial forms. Some have only one of these forms, some have both. So, where does each of them go?
respectfully

From richard.wordingham at ntlworld.com Fri May 15 10:45:03 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 15 May 2015 16:45:03 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> Message-ID: <20150515164503.2c8624f0@JRWUBU2>

On Fri, 15 May 2015 02:38:17 +0200 Philippe Verdy wrote:
> 2015-05-14 20:13 GMT+02:00 Richard Wordingham:
> > If the interval list is compacted, at most one of the intervals will contain a character properly having combining class 0.
>
> This is not a sufficient condition; there is also the case where two intervals contain combining characters with the same combining class: their relative order is significant because one is blocking the other (it limits the allowed reorderings that are canonically equivalent).

If two fully decomposed characters of combining class 0 are included in the match to a subexpression, all the characters between them will be included. The needs you perceive would be met by providing the start and end points of the locations of the non-starters flanking the matching string on the sides where it starts with a non-starter or ends with a character with non-zero rccc. (U+00E2 would probably have to count as a non-starter for your purposes.) However, I'm not sure that passing the positions would not suffice. Don't forget that the input string can be rearranged, preserving canonical equivalence, so that the captured string is actually contiguous.

I think this discussion on search and replace would benefit from some examples. I don't see your problem. Is it based on experience? I have some fairly simple examples.

My first example is the replacement of ô by U+00E2 in the 4-character string buộc <U+0062, U+0075, U+1ED9, U+0063>. U+1ED9 has the full decomposition <U+006F, U+0323, U+0302>. The substring ô has the discontiguous position, in inclusive:exclusive notation: Component 1 at Position 2:Component 2 at Position 2 (content U+006F), and Component 3 at Position 2:Whole at Position 3 (content U+0302). Now, the regular expression syntax for an identified substring suggests that it is contiguous. For substitution, it therefore makes most sense to view the whole string as though it were the canonically equivalent <U+0062, U+0075, U+006F, U+0302, U+0323, U+0063>, a form in which the identified substring is contiguous. Replacement should therefore create something canonically equivalent to <U+0062, U+0075, U+00E2, U+0323, U+0063>.

In terms of program logic, I would expect the string editing to proceed something like this:
1. Decompose characters that straddle range boundaries, so:
   a. String becomes <U+0062, U+0075, U+006F, U+0323, U+0302, U+0063>
   b. Identified substring location updates to:
      i. Whole at Position 2: Whole at Position 3 (content U+006F)
      ii. Whole at Position 4: Whole at Position 5 (content U+0302)
2. First portion contains a character with canonical combining class 0, so replace it by the replacement string.
3. Delete other portions.
4. Apply any normalisation requirements.

For my second example, let the replacement string be <…> instead. I would expect the same logic to apply, yielding a substring <…>, and would not be concerned by its not being canonically equivalent to <…>.

For my third example, consider the replacement of U+0302 by U+031B COMBINING HORN in the 6-character string buộc <U+0062, U+0075, U+006F, U+0323, U+0302, U+0063>. The character is at location Whole at Position 4:Whole at Position 5. The identified substring does not contain any characters of canonical combining class 0. U+031B has ccc=216 and U+0323 has ccc=220, so it matters little how the characters between U+006F and U+0063 are arranged - the results are canonically equivalent and the substitution should be made without complaint.

For my fourth example, consider again the replacement of U+0302 in the 6-character string buộc <U+0062, U+0075, U+006F, U+0302, U+0323, U+0063>, but this time by U+0068 LATIN SMALL LETTER H. We now have a problem. Applying the substitution at this location yields the string buoḥc (dot below the 'h'), while applying the substitution to the string in NFD form yields buọhc (dot below the 'o'), which is visually distinct.

In some ways this is similar to the problem of grouping text into collating elements for collation. The Unicode Collation Algorithm resolves conflicts on the basis of the NFD form. Requiring the string to be in strict NFD might not be suitable - it breaks compatibility ideographs. Also, I can imagine wanting to make global substitutions so as to undo ill effects of normalisation. There are many different ways to handle the problem, and I can imagine a rich selection of flags for a substitution routine. I would urge, however, that the replacement text should be contiguous in some canonical equivalent of the resulting string.

Richard.
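A minimal sketch of the four-step edit above, applied to the first example (Python; replace_discontiguous and its portion-list interface are hypothetical, not an existing API):

    import unicodedata

    def nfd(s):
        return unicodedata.normalize('NFD', s)

    def replace_discontiguous(text, portions, replacement):
        """Sketch of the four-step edit. `portions` is an ordered list of
        (start, end) ranges into the NFD form of `text`; the first portion is
        assumed to hold the base character (ccc=0), per step 2."""
        s = nfd(text)                       # step 1: decompose
        out, prev = [], 0
        for i, (start, end) in enumerate(portions):
            out.append(s[prev:start])
            if i == 0:
                out.append(replacement)     # step 2: replace the first portion
            prev = end                      # step 3: drop the other portions
        out.append(s[prev:])
        return unicodedata.normalize('NFC', ''.join(out))  # step 4: renormalize

    # Example 1: in "buộc" (NFD: b u o U+0323 U+0302 c) the match for "ô" is
    # <o at 2:3> plus <U+0302 at 4:5>; replacing it by U+00E2 gives "buậc".
    assert replace_discontiguous('bu\u1ED9c', [(2, 3), (4, 5)], '\u00E2') == 'bu\u1EADc'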
From petercon at microsoft.com Fri May 15 10:46:39 2015 From: petercon at microsoft.com (Peter Constable) Date: Fri, 15 May 2015 15:46:39 +0000 Subject: Future of Emoji? (was Re: Tag characters) In-Reply-To: References: Message-ID:

MSN Messenger supported extensible stickers years ago. A couple of sites still offering add-ons:
http://www.getsmile.com/
http://www.smileys4msn.com/

Peter

From: Shervin Afshar [mailto:shervinafshar at gmail.com] Sent: Thursday, May 14, 2015 10:40 PM Subject: Re: Future of Emoji? (was Re: Tag characters)
> Good point. I missed these while looking into compatibility symbols. Of course, as with the Yahoo[1] and MSN[2] Messenger emoji sets, most of these are mappable to current or proposed sets of Unicode emoji (e.g. Lips Sealed → U+1F910 ZIPPER-MOUTH FACE). It would be interesting to see how the extended support for flags, most smiley faces, objects, etc. on all platforms would affect this approach. My idea of a sticker-based solution is something more like Facebook's[3] or Line's[4] implementations.
> [1]: http://www.unicode.org/L2/L2015/15059-emoji-im-yahoo.pdf
> [2]: http://www.unicode.org/L2/L2015/15058-emoji-im-msn.pdf
> [3]: http://www.huffingtonpost.com/2014/10/14/facebook-stickers-comments_n_5982546.html
> [4]: https://creator.line.me/en/guideline/
> – Shervin
> On Thu, May 14, 2015 at 9:37 PM, Peter Constable wrote:
>> Skype uses stickers, including animated stickers. Here's the documented set: https://support.skype.com/en/faq/FA12330/what-is-the-full-list-of-emoticons And if you search, you'll find lots more 'hidden' emoticons, like '(bartlett)'. Peter
>> From: Shervin Afshar Sent: Thursday, May 14, 2015 8:12 PM Subject: Future of Emoji? (was Re: Tag characters)
>>> Peter, this very topic was discussed in the last meeting of the subcommittee, and my impression is that there are plans to promote the use of embedded graphics (aka stickers), either through expansions to section 8 of TR51 or through some other means. It should also be noted that none of the current members of Unicode seem to have a sticker-based implementation (with the exception of an experimental limited trial by Twitter[1]). [1]: http://mashable.com/2015/04/16/twitter-star-wars-emoji/ – Shervin
>>> On Thu, May 14, 2015 at 7:44 PM, Peter Constable wrote:
>>>> And yet UTC devotes lots of effort (with an entire subcommittee) to encode more emoji as characters, but no effort toward any preferred longer term solution not based on characters. Peter
>>>> From: Shervin Afshar Sent: Thursday, May 14, 2015 2:27 PM Subject: Re: Tag characters
>>>>> IMO, the industry-preferred longer term solution (which is also discussed in that section with a few existing examples) for emoji is not going to be based on characters. – Shervin
>>>>> On Thu, May 14, 2015 at 1:40 PM, William_J_G Overington wrote:
>>>>>> > What else would be possible if the same sort of technique were applied to another base character?
>>>>>> Thinking about this further, could the technique be used to solve the requirements of section 8 Longer Term Solutions of http://www.unicode.org/reports/tr51/tr51-2.html ? Both colour pixel map and colour OpenType vector font solutions would be possible. Colour voxel map and colour vector 3d solids solutions are worth thinking about too, as fun coding thought experiments that could possibly lead to useful practical results. William Overington, 14 May 2015

From petercon at microsoft.com Fri May 15 10:57:56 2015 From: petercon at microsoft.com (Peter Constable) Date: Fri, 15 May 2015 15:57:56 +0000 Subject: Future of Emoji? (was Re: Tag characters) In-Reply-To: References: Message-ID:

Ah, yes. And Messenger 'winks'. E.g., http://www.msn-tools.net/free-msn-winks-1.htm
I note that this has .swf files, and that's what we saw one of the Japanese carriers saying they'd be moving to instead of PUA characters.

Peter
From moyogo at gmail.com Fri May 15 11:09:29 2015 From: moyogo at gmail.com (Denis Jacquerye) Date: Fri, 15 May 2015 16:09:29 +0000 Subject: Arabic diacritics In-Reply-To: References: Message-ID:

You should use ARABIC SHADDA U+0651 in all positions. The presentation forms (isolated, medial, final forms) are for compatibility with legacy systems. See what is said in http://www.unicode.org/versions/Unicode7.0.0/ch09.pdf about the Arabic Presentation Forms-B.

Cheers,

On Fri, 15 May 2015 at 15:53 عبد الرحمان أيمن wrote:
> hi, regarding the Arabic diacritics: e.g. for the Shadda, we have [...] So, where does each of them go?
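The compatibility relationship Denis describes is recorded in the UCD and is visible from Python's unicodedata (a quick sketch):

    import unicodedata

    # U+0651 is the character to use in new text; the Arabic Presentation
    # Forms-B code points carry compatibility decompositions back to it.
    print(unicodedata.decomposition('\u0651'))   # '' (no decomposition)
    print(unicodedata.decomposition('\uFE7C'))   # '<isolated> 0020 0651'
    print(unicodedata.decomposition('\uFE7D'))   # '<medial> 0640 0651'

    # NFKC folds the legacy forms to SPACE/TATWEEL plus the real shadda:
    assert unicodedata.normalize('NFKC', '\uFE7C') == ' \u0651'
    assert unicodedata.normalize('NFKC', '\uFE7D') == '\u0640\u0651'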
> > > > [1]: http://www.unicode.org/L2/L2015/15059-emoji-im-yahoo.pdf > > [2]: http://www.unicode.org/L2/L2015/15058-emoji-im-msn.pdf > > [3]: > http://www.huffingtonpost.com/2014/10/14/facebook-stickers-comments_n_5982546.html > > [4]: https://creator.line.me/en/guideline/ > > > > > ? Shervin > > > > On Thu, May 14, 2015 at 9:37 PM, Peter Constable > wrote: > > Skype uses stickers, including animated stickers. Here?s the documented > set: > > > > https://support.skype.com/en/faq/FA12330/what-is-the-full-list-of-emoticons > > > > And if you search, you?ll find lots more ?hidden? emoticons, like > ?(bartlett)?. > > > > > > > > Peter > > > > > > *From:* Shervin Afshar [mailto:shervinafshar at gmail.com] > *Sent:* Thursday, May 14, 2015 8:12 PM > *To:* Peter Constable > *Cc:* unicode at unicode.org > *Subject:* Future of Emoji? (was Re: Tag characters) > > > > Peter, > > > > This very topic was discussed in last meeting of the subcommittee and my > impression is that there are plans to promote the use of embedded graphics > (aka stickers) either through expansions to section 8 of TR51 or through > some other means. It should also be noted that none of current members of > Unicode seem to have a sticker-based implementation (with the exception of > an experimental limited trial by Twitter[1]). > > > > [1]: http://mashable.com/2015/04/16/twitter-star-wars-emoji/ > > > > > ? Shervin > > > > On Thu, May 14, 2015 at 7:44 PM, Peter Constable > wrote: > > And yet UTC devotes lots of effort (with an entire subcommittee) to > encode more emoji as characters, but no effort toward any preferred longer > term solution not based on characters. > > > > > > Peter > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Shervin > Afshar > *Sent:* Thursday, May 14, 2015 2:27 PM > *To:* wjgo_10009 at btinternet.com > *Cc:* unicode at unicode.org > *Subject:* Re: Tag characters > > > > Thinking about this further, could the technique be used to solve the > requirements of > section 8 Longer Term Solutions > > > > IMO, the industry preferred longer term solution (which is also discussed > in that section with few existing examples) for emoji, is not going to be > based on characters. > > > > > ? Shervin > > > > On Thu, May 14, 2015 at 1:40 PM, William_J_G Overington < > wjgo_10009 at btinternet.com> wrote: > > > What else would be possible if the same sort of technique were applied > to another base character? > > > Thinking about this further, could the technique be used to solve the > requirements of > > section 8 Longer Term Solutions > > of > > http://www.unicode.org/reports/tr51/tr51-2.html > > ? > > > Both colour pixel map and colour OpenType vector font solutions would be > possible. > > > Colour voxel map and colour vector 3d solids solutions are worth thinking > about too as fun coding thought experiments that could possibly lead to > useful practical results. > > > > > William Overington > > > 14 May 2015 > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
From verdy_p at wanadoo.fr Fri May 15 15:09:13 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 15 May 2015 22:09:13 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150515081003.1984d0c4@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> Message-ID:

2015-05-15 9:10 GMT+02:00 Richard Wordingham:
> Your description makes no sense to me as a description of a finite state automaton.

This is because you don't understand the issue!

> Now, a program to check whether a trace matching (\u0323|\u0302)* matches (\u0323\u0302)* is very simple. It just counts the number of times \u0323 occurs and the number of times \u0302 occurs, and returns whether they are equal.

This is wrong. \u0323\u0323\u0302\u0302 and \u0323\u0302\u0323\u0302 would pass your counting test (which does not work in an FSA), but they are NOT canonically equivalent, because the identical combining characters are blocking each other (so arbitrary ordering is not possible).

I maintain what I said: you don't need arbitrary counting, and an FSA is possible (both an NFA using compound states, and the derived DFA if ever you want to resolve compound states to a single integer, though the transition tables will explode dramatically). Once again, we cannot have pairs of strings where you cannot isolate BOUNDED substrings (between blockers) whose canonical equivalence you can check. At most you'll have only 254 combining characters to check that have distinct non-zero combining classes. So the FSA implementation is perfectly possible, for canonical equivalences only... This evidently does not work if you are performing regexp searches using looser equivalences, such as compatibility equivalence.
From verdy_p at wanadoo.fr Fri May 15 15:21:56 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 15 May 2015 22:21:56 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150515164503.2c8624f0@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515164503.2c8624f0@JRWUBU2> Message-ID:

2015-05-15 17:45 GMT+02:00 Richard Wordingham:
> I think this discussion on search and replace would benefit from some examples. I don't see your problem. Is it based on experience? I have some fairly simple examples.

Just consider a regexp that attempts to search for and substitute "é" (for example by "è"), and that has to locate it wherever it is, in NFC form (a single character) or NFD form (a combining sequence). You'll also have to match cases where there are other intermediate combining characters (with a distinct non-zero combining class, different from the combining class of the acute accent) between the base letter and the acute accent. You then have to return discontiguous matches, but your replacement string "è" should still preserve the other combining characters.

The situation is even worse if you are looking for strings in which you want to discard only some combining characters (the replacement is empty): there may be several discontiguities in the matches. Now imagine that the replacement string is to replace all these distinct combining characters by a single one. (Such things would be done by filters that want to eliminate combining characters not suitable for a given language, or because a linguistic orthographic rule permits substitution of foreign combining characters, e.g. dropping combining dots above, or replacing all combining characters below, except the cedilla, by a single one such as a low line. The same would happen for languages that have changed or simplified their orthography of combining characters, or that use two distinct orthographic conventions that you want to convert between.)

From richard.wordingham at ntlworld.com Fri May 15 16:57:03 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 15 May 2015 22:57:03 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> Message-ID: <20150515225703.20771426@JRWUBU2>

On Fri, 15 May 2015 22:09:13 +0200 Philippe Verdy wrote:
> This is wrong. \u0323\u0323\u0302\u0302 and \u0323\u0302\u0323\u0302 would pass your counting test (which does not work in an FSA), but they are NOT canonically equivalent, because the identical combining characters are blocking each other (so arbitrary ordering is not possible).
TUS7.0: D108 Reorderable pair: Two adjacent characters A and B in a coded character sequence are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0.

Now, ccc(U+0302) = 230 > 220 = ccc(U+0323) > 0, so (U+0302, U+0323) is a reorderable pair.

TUS7.0: D109 Canonical Ordering Algorithm: In a decomposed character sequence D, exchange the positions of the characters in each Reorderable Pair until the sequence contains no more Reorderable Pairs.

The normalisation process on <U+0323, U+0302, U+0323, U+0302> first replaces it by <U+0323, U+0323, U+0302, U+0302>. There are then no more reorderable pairs, so that has reduced it to form NFD. Therefore <U+0323, U+0323, U+0302, U+0302> and <U+0323, U+0302, U+0323, U+0302> *are* canonically equivalent.

> So the FSA implementation is perfectly possible, for canonical equivalences only...

I now vaguely follow your argument, but it depends on the erroneous claim that <U+0323, U+0323, U+0302, U+0302> and <U+0323, U+0302, U+0323, U+0302> are not canonically equivalent.

> This evidently does not work if you are performing regexp searches using looser equivalences, such as compatibility equivalence.

I completely fail to understand this remark; it makes no difference whether one uses canonical or compatibility equivalence.

Richard.
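D108 and D109 are short enough to run directly; a sketch of the bubble-sort formulation of canonical reordering (Python, assuming fully decomposed input), which shows the two disputed sequences falling together:

    import unicodedata

    def canonical_order(marks: str) -> str:
        """D109 as a bubble sort: swap adjacent Reorderable Pairs
        (D108: ccc(A) > ccc(B) > 0) until none remain."""
        s = list(marks)
        changed = True
        while changed:
            changed = False
            for i in range(len(s) - 1):
                a = unicodedata.combining(s[i])
                b = unicodedata.combining(s[i + 1])
                if a > b > 0:
                    s[i], s[i + 1] = s[i + 1], s[i]
                    changed = True
        return ''.join(s)

    # <U+0323, U+0302, U+0323, U+0302> has one reorderable pair and sorts to
    # <U+0323, U+0323, U+0302, U+0302>, so the two strings are equivalent.
    assert canonical_order('\u0323\u0302\u0323\u0302') == '\u0323\u0323\u0302\u0302'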
From verdy_p at wanadoo.fr Fri May 15 17:31:53 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 16 May 2015 00:31:53 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150515225703.20771426@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> Message-ID:

2015-05-15 23:57 GMT+02:00 Richard Wordingham:
> Now, ccc(U+0302) = 230 > 220 = ccc(U+0323) > 0, so (U+0302, U+0323) is a reorderable pair.

I do NOT contest that U+0323 and U+0302 can reorder, but the fact that U+0323 blocks another occurrence of U+0323, because it has the **same** combining class.

From richard.wordingham at ntlworld.com Fri May 15 17:54:22 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 15 May 2015 23:54:22 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> Message-ID: <20150515235422.3e347dc3@JRWUBU2>

On Sat, 16 May 2015 00:31:53 +0200 Philippe Verdy wrote:
> I do NOT contest that U+0323 and U+0302 can reorder, but the fact that U+0323 blocks another occurrence of U+0323, because it has the **same** combining class.

How does that stop <U+0323, U+0323, U+0302, U+0302> and <U+0323, U+0302, U+0323, U+0302> being canonically equivalent?

TUS7.0: D109 'Canonical Ordering Algorithm' says: "In a decomposed character sequence D, exchange the positions of the characters in each Reorderable Pair until the sequence contains no more Reorderable Pairs."

There is no mention of blocking in D109.

Richard.

From verdy_p at wanadoo.fr Fri May 15 19:04:55 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 16 May 2015 02:04:55 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150515235422.3e347dc3@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> Message-ID:

But do you agree that we still need to match pairs of distinct characters in your example? If you just count the total, it will be wrong with (\u0302\u0302\u0323)* if you transform it into (\u0302|\u0302|\u0323)*, which is fully equivalent to (\u0302|\u0323)*, because what you want is not matching pairs but triples (your counter check would now have to make sure there are twice as many occurrences of \u0302 as occurrences of \u0323).
If not, you need to roll back (by one or two characters, possibly more) until you satisfy the condition, but you won't know just by seeing the characters and advancing that your sequence is terminated: it is only at the end that you have to do this check, and only then can you roll back.

The analysis cannot be deterministic, or it requires keeping track of all acceptable positions previously seen that could satisfy the condition; as the sequence for (\u0302\u0302\u0323)* can be extremely long, keeping this track for possible rollbacks could be costly. For example, consider this regexp:

(\u0302\u0302\u0323)*(\u0302\u0303)*\u0302

Can you still transform it and correctly infer the type of counters you need for the final check (before rollbacks) if you replace it with (\u0302|\u0323)*(\u0302|\u0303)*\u0302, which is fully equivalent to (\u0302|\u0303|\u0323)*\u0302? You'd need to check that there are exactly:
- (2n+1) occurrences of \u0302
- (n) occurrences of \u0303
- (n) occurrences of \u0323

But it won't be enough, because \u0302 and \u0303 have the same combining class and cannot be reordered. So within the first regexp, (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302, the first iterated subregexp will need to scan first within the part that the second iterated subregexp is to match, and you cannot predict where it will stop. It may even scan completely through it (if you have not encountered any \u0303) and have eaten the last \u0302. At this point, you may see that the first iterated subregexp cannot contain any \u0303, so the first rollback to do will be to roll back to just before the first occurrence of \u0303. But the counter check may still be wrong, and you'll have to roll back through one or two occurrences of \u0302 in order to find the location where the first iterated subregexp is satisfied. At that point the one or two occurrences of \u0302 that you've rolled back will be counted as being part of the second iterated subregexp, or may even be the final occurrence needed to match the end of the regexp.

I don't see how you can support this regexp with a DFA; you absolutely need an NFA (and the counters you want to add do not offer any decisive help).
From mark at macchiato.com Fri May 15 19:18:56 2015 From: mark at macchiato.com (Mark Davis ☕️) Date: Fri, 15 May 2015 17:18:56 -0700 Subject: Tag characters In-Reply-To: References: <23012433.62520.1431623697436.JavaMail.defaultUser@defaultHost> <11890020.82119.1431636019782.JavaMail.defaultUser@defaultHost> Message-ID:

The consortium is in no position to enhance protocols *itself* for exchanging images. That's firmly in other groups' hands. We can try to noodge them a bit, but what *will* make a difference is when the *vendors* of sticker solutions put pressure on the different groups responsible for the protocols to provide interoperability for images. Because there is a lot of growth in sticker solutions, I would expect there to be more such pressure. And even so, I expect it will take some time for those to be deployed.

We've said what our longer-term position is, and I think we all pretty much agree with that; exchanging images is much more flexible. However, we do have strong short-term pressure to show that we are responsive and responsible in adding emoji. And our adding a reasonable number of emoji per year is not going to stop Line or Skype from adding stickers! There are a few possible scenarios, and it's hard to predict the results. It could be that emoji are largely supplanted by stickers in 5 years; could be 10; could be that they both coexist indefinitely. I have no 🔮, and neither does anyone else...

Mark

*« Il meglio è l'inimico del bene »*

On Thu, May 14, 2015 at 7:44 PM, Peter Constable wrote:
> And yet UTC devotes lots of effort (with an entire subcommittee) to encode more emoji as characters, but no effort toward any preferred longer term solution not based on characters. [...]
From verdy_p at wanadoo.fr Fri May 15 19:41:33 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 16 May 2015 02:41:33 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> Message-ID:

With an NFA, the representation is completely different. The regexp

(\u0302\u0302\u0323)*(\u0302\u0303)*\u0302

is just transformed into:

(⊢\u0302⊢\u0302⊢\u0323|⊢\u0302⊢\u0323⊢\u0302|⊢\u0302⊢\u0302⊢\u0323)*(⊢\u0302⊢\u0303|⊢\u0303⊢\u0302)*⊢\u0302⊢

where I noted with the "tack" (⊢) the 15 relative positions **in this new regexp** where there's a need to check if the input matches a character or character class. Note that in this transform, all allowed permutations of canonically equivalent substrings are added; given that these substrings are bounded in length in the initial regexp, and there's a limited number of permutations, the result is still bounded.

The state of the NFA is represented as a set of these positions (here a bitset with 15 bits). The initial state has only the first bit set to true; the final accept state must have just the 15th bit set to true. When you scan the input, you have to test the input character at each position whose bit is currently true in the state, checking whether the associated character or character class matches the input, and then advance this bit. For that you use a second, separate bitset, initially empty (all 15 bits set to false): to advance bit n in the state, you set bit (n+1) to true in the second bitset, but to advance from bit 15, you don't set any.

You may also want to avoid generating these permutations:

(⊢\u0302⊢\u0302⊢\u0323)*(ˀ⊢\u0302⊢\u0303)*ˀ⊢\u0302⊢

Here I noted with the "combining glottal stop" (ˀ) the positions where you have to count the characters ONLY in the subsequence, i.e. my "tacks" that are just after the asterisks. However, in both cases (either with the generated permutations, or using counters) you'll need to use backtracking for rollbacks. Performing a rollback in an NFA is not easy! You have to remember the bitsets representing the state of the NFA before you advanced it to the next state... If you did not want to generate the permutations but only use counters, the backtrace to keep must also contain these counters (I have no idea how to safely roll back those counters; my opinion is that it will not work, and that generating the permutations, even if it increases the number of "tack" positions in the transformed regexp, is MUCH simpler and does not really incur a significant cost in terms of memory).

But it's true that the allowed reorderings implied by canonical equivalences (and those that are NOT allowed because they are blocked) are really challenging!
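For what it's worth, the position-set scheme described above is essentially the textbook NFA simulation; a minimal sketch in Python for the plain-string reading of the first starred group plus the trailing mark (the three alternatives here are the distinct permutations; canonical reordering and the rollback question are deliberately left out):

    X, Y = '\u0302', '\u0323'                   # the two marks in the example
    ALTS = [(X, X, Y), (X, Y, X), (Y, X, X)]     # permutations inside the star
    FINAL = X                                    # the trailing \u0302

    def matches(s: str) -> bool:
        # NFA state = set of positions ("tacks"); every live position is
        # carried forward in parallel, so no backtracking is needed.
        states = {'start'}
        for ch in s:
            nxt = set()
            for st in states:
                if st == 'start':
                    for a, alt in enumerate(ALTS):   # enter an alternative...
                        if alt[0] == ch:
                            nxt.add((a, 1))
                    if ch == FINAL:                  # ...or take the final mark
                        nxt.add('done')
                elif st != 'done':
                    a, i = st
                    if ALTS[a][i] == ch:
                        nxt.add('start' if i + 1 == len(ALTS[a]) else (a, i + 1))
            states = nxt
        return 'done' in states

    assert matches(X)                  # zero repeats, then the final X
    assert matches(Y + X + X + X)      # one repeat (Y X X), then the final X
    assert not matches(X + X + Y)      # a repeat, but the final X is missing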
From kenwhistler at att.net Fri May 15 20:21:02 2015 From: kenwhistler at att.net (Ken Whistler) Date: Fri, 15 May 2015 18:21:02 -0700 Subject: A few emoji per year... (was: Re: Tag characters) In-Reply-To: References: <23012433.62520.1431623697436.JavaMail.defaultUser@defaultHost> <11890020.82119.1431636019782.JavaMail.defaultUser@defaultHost> Message-ID: <55569B7E.7040401@att.net>

And to put Mark's comments in some statistical perspective, in the context of all the media hype: the true "big bang" for emoji in Unicode was Version 6.0, released over 4-1/2 years ago now. *That* was the Unicode release that added hundreds and hundreds of emoji for Japanese carrier interoperability, as well as the regional indicator mechanism for the representation of flag pictographs. But at the time, relatively few people noticed, because no Unicode emoji were on phones yet.

Unicode 7.0, which resulted in the huge media splash about emoji last year, actually only added 103 emoji, and the majority of those were very old news: old-fashioned pictographs for Webdings compatibility. There were only a few high-visibility, emotionally catchy new additions among that set, such as the CHIPMUNK and the you-know-what-I'm-talking-about hand gesture, that convinced people this was a bigger-deal release than it was. But suddenly everything was visible on phones, and that made all the difference for the general public.

Unicode 8.0 is about to be released, and it will have just 41 emoji additions -- among them the 5 emoji modifiers that are already available on phones to address the emoji diversity issue. And the UTC just approved 38 new emoji candidates that will be the likely basis of the emoji additions for Unicode 9.0 next year. Once we get through the Unicode 8.0 and Unicode 9.0 cycles, this process will have settled into a kind of routine -- and it will be apparent to all what the likely scale and scope of future emoji additions *as Unicode characters* will be: a few dozen per year, carefully picked based on a set of criteria now to be set out in the new UTR #51 regarding emoji.

The sky isn't falling here. ;-) The Unicode Consortium has not suddenly transmogrified into the Emoji Consortium. People will get used to the fact that a few dozen new emoji characters get added to the standard every year -- ho hum.
And for folks who can't wait through the two-years-from-proposal-to-implementation cycles of character encoding committees, well... those stickers are out there waiting for you.

--Ken

On 5/15/2015 5:18 PM, Mark Davis ☕️ wrote:
> However, we do have strong short-term pressure to show that we are responsive and responsible in adding emoji. And our adding a reasonable number of emoji per year is not going to stop Line or Skype from adding stickers!

From andrewcwest at gmail.com Sat May 16 03:15:34 2015 From: andrewcwest at gmail.com (Andrew West) Date: Sat, 16 May 2015 09:15:34 +0100 Subject: A few emoji per year... (was: Re: Tag characters) In-Reply-To: <55569B7E.7040401@att.net> References: <23012433.62520.1431623697436.JavaMail.defaultUser@defaultHost> <11890020.82119.1431636019782.JavaMail.defaultUser@defaultHost> <55569B7E.7040401@att.net> Message-ID:

On 16 May 2015 at 02:21, Ken Whistler wrote:
> And for folks who can't wait through the two-years-from-proposal-to-implementation cycles of character encoding committees, well...

... don't worry, the UTC will simply bypass the normal ISO ballot cycle, and fast-track them into the next available version of Unicode.

Andrew

From richard.wordingham at ntlworld.com Sat May 16 09:14:28 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 16 May 2015 15:14:28 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> Message-ID: <20150516151428.550def44@JRWUBU2>

On Sat, 16 May 2015 02:04:55 +0200 Philippe Verdy wrote:
> But do you agree that we still need to match pairs of distinct characters in your example?

The original point I made was that (\u0323\u0302)*, as applied to 'traces' of Unicode strings under canonical equivalence, was only a regular expression if one reinterpreted the *-operator. The key points established in the theory of 'trace monoids', as applied to fully decomposed Unicode strings, are:

1) If a set ('language') A of Unicode strings under canonical equivalence can be recognised by a *finite* state machine and, for each string in A:
a) the string contains a starter, or
b) all characters in the string have the same canonical combining class,
then there is a *finite* state machine that recognises A*, with the normal interpretation as the set of concatenations of zero or more members of A.

2) Every set recognised by a *finite* state machine can be written in the form of a regular expression using optionality, bracketing, alternation, concatenation and Kleene star. Moreover, Kleene star will only be applied to sets satisfying the condition above.

Moreover, the expression could be used to check the string as converted to NFD. That sounds like very good news until you remember that *searching* for the canonical equivalent of U+00F4 in an NFD string needs something like .*(o)[:^ccc = 230:]*(\u0302).* This expression has two capture groups.
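A sketch of that search on an NFD string, with the intervening-marks test written out via unicodedata.combining (the helper name is hypothetical, and starters are treated as blocking, which the bare regex does not say):

    import unicodedata

    def find_o_circumflex_nfd(s: str):
        """Find (position_of_o, position_of_U+0302) pairs matching canonical
        equivalents of U+00F4 in an NFD string: an 'o' followed by U+0302 with
        only non-starter marks of ccc != 230 in between."""
        hits = []
        for i, ch in enumerate(s):
            if ch != 'o':
                continue
            j = i + 1
            while j < len(s) and unicodedata.combining(s[j]) not in (0, 230):
                j += 1
            if j < len(s) and s[j] == '\u0302':
                hits.append((i, j))
        return hits

    # "buộc" in NFD is b u o U+0323 U+0302 c: the circumflex is separated
    # from the 'o' by the ccc=220 dot below, so the match is discontiguous.
    assert find_o_circumflex_nfd(unicodedata.normalize('NFD', 'bu\u1ED9c')) == [(2, 4)]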
A *finite* automaton acting on Unicode traces won't support (\u0323\u0302)*. My preferred solution, if support is required, is to bend the finite automaton to simultaneously consider an unbounded number of repeats of a subexpression. This works for me because I store states as strings and allocate each string from the heap. The amount of memory required is sublinear in the length of the string being searched.

> If you just count the total, it will be wrong with (\u0302\u0302\u0323)* if you transform it into (\u0302|\u0302|\u0323)*, which is fully equivalent to (\u0302|\u0323)*... The analysis cannot be deterministic, or it requires keeping track of all acceptable positions previously seen that could satisfy the condition; as the sequence for (\u0302\u0302\u0323)* can be extremely long, keeping this track for possible rollbacks could be costly.

I don't do roll-backs. I use a non-deterministic finite automaton that is equivalent to a deterministic finite automaton or, confronted with this type of rational expression (it ain't regular for traces!), a non-deterministic, slightly non-finite automaton.

Now, capture groups do destroy the finiteness of the automaton, and it looks like a matter of trade-offs. There is an example on the regular expression page in the ICU user guide, searching AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC for (A+)+B. Roll-backs make this exponentially slow. My code runs through this faster than it can display its progress, which is par for the course. Now, my implementation of capture groups is far from complete. At present, I capture all thousand or so possibilities, as I have no logic to determine what is required. If I set it to capture all occurrences of A+, I can just perceive an increase in run time when I pipe the progress reporting to /dev/null. As I augment the recognition-related state with the capture information, the number of active states is quadratic in the string length, and the logic to maintain the list of states occupied is quadratic in the number of states, so the time to run varies as the fourth power of the string length.

> (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302
>
> Can you still transform it and correctly infer the type of counters you need for the final check (before rollbacks) if you replace it with...
>
> I don't see how you can support this regexp with a DFA; you absolutely need an NFA (and the counters you want to add do not offer any decisive help).

The reinterpretation of this expression as a regular expression for traces substitutes 'concurrent iteration' for Kleene star. Each trace in the bracketed expression that lacks a character with ccc=0 is replaced by its maximal subtraces of each canonical combining class. Under this scheme, (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302 would be interpreted as (\u0302\u0302|\u0323)*(\u0302\u0303)*\u0302.
As I said before, that is unlikely to be what the user means by expressions like these.

I'm not sure what you mean by 'NFA'. Do you mean a 'back-tracking automaton'? To support (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302, I would use my extension of the non-deterministic finite automaton to process it. If you like, that is how I would do the counting - by the number of incomplete matches to \u0323\u0302\u0302. Note that I take characters from the input string in NFD order, along with their (fractional) positions in the input string. I process \u0302\u0302\u0323 as a very simple regex, \u0323\u0302\u0302. The state is simply the position of the next character, plus two states for 'all matched' and 'cannot match'.

Running it with input \u0323\u0302\u0302\u0302, which it did recognise, did show one problem. My engine doesn't notice that, when looking for the first factor, \u0323\u0302\u0302, it is not possible for \u0302 to belong to a subsequent factor. Instead it progresses to a dead-end state where all subsequent input is assumed to be part of another factor. Supporting Unicode properties may make fixing this messy. I had been living with this because these dead-end states are killed on receipt of a starter, and runs of non-starters are normally not very long. No precomposed character decomposes to more than three of them. I saw the need as being for something that runs correctly, rather than for something that runs correctly and fast.

When checking whether a string matches, once I have fixed the problem of dead-end states, there will be, for each state, one capturing group for the last U+0323 encountered and, at most, one capturing group for the last or penultimate U+0302 encountered. While the number of states is unbounded, the number of possible states at any point is uniformly bounded. Searching for the pattern is a bit more complicated, as each U+0323 or U+0302 could be the last such character in a matching subtrace.

Richard.

From richard.wordingham at ntlworld.com Sat May 16 10:02:39 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 16 May 2015 16:02:39 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> Message-ID: <20150516160239.16638123@JRWUBU2>

On Sat, 16 May 2015 02:41:33 +0200 Philippe Verdy wrote:
> With an NFA, the representation is completely different. The regexp
> (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302
> is just transformed into:
> (⊢\u0302⊢\u0302⊢\u0323|⊢\u0302⊢\u0323⊢\u0302|⊢\u0302⊢\u0302⊢\u0323)*(⊢\u0302⊢\u0303|⊢\u0303⊢\u0302)*⊢\u0302⊢
> where I noted with the "tack" (⊢) the 15 relative positions **in this new regexp** where there's a need to check if the input matches a character or character class.

The old regex is a pattern for use with the trace monoid of Unicode strings under canonical equivalence. From its appearance, I presume the new regex is intended for use with strings, and that the third run of codepoints is meant to be ⊢\u0323⊢\u0302⊢\u0302 rather than a repeat of ⊢\u0302⊢\u0302⊢\u0323. There is an annoying error. You appear to assume that U+0302 COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, but they don't; they have the same combining class, namely 230. I'm going to assume that 0303 is a typo for 0323.
\u0323\u0323\u0302\u0302\u0302\u0302 is canonically equivalent to \u0302\u0302\u0323\u0302\u0323\u0302, which clearly matches the corrected old regex (\u0302\u0302\u0323)*(\u0302\u0323)*\u0302. However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the corrected new regex (?\u0302?\u0302?\u0323|?\u0302?\0323?\u0302|?\u0323?\u0302?\u0302)*(?\u0302?\u0323|?\u0323?\u0302)*?\u0302? This example goes straight to the problem with the recommended way of using string-based regular expression engines. Using NFD throughout works fine if one is working with whole words. If fails if one is working with sequences of combining marks and there is any complexity. > But it's true that the allowed reorderings implied by canonical > equivalences (and those that are NOT allowed because they are > blocked) are really challenging ! They are not challenging at all. Once you have eliminated the precomposed characters and characters with singleton decompositions, you are left with the trace monoid of Unicode strings under canonical equivalence. All you have to remember is that two characters commute if and only if they have different positive canonical combining classes. Richard. From verdy_p at wanadoo.fr Sat May 16 11:29:18 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 16 May 2015 18:29:18 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150516160239.16638123@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> <20150516160239.16638123@JRWUBU2> Message-ID: 2015-05-16 17:02 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > There is an annoying error. You appear to assume that U+0302 COMBINING > CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, but they don't; > they have the same combining class, namely 230. I'm going to assume > that 0303 is a typo for 0323. Not a typo, and I did not made the assumption you suppose because I chose then so that they were effectively using the **same** combining class, so that they do not commute. It was the key fact of my argument that destroys your argumentation. Reread carefully and use the example string I gave and don't assume I wanted to write u0323 instead of u0303. And you'll see that backtracing is necessary for this case (EVEN if you don't care about capture groups but you are only interested in the global capture $0). -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Sat May 16 12:07:13 2015 From: doug at ewellic.org (Doug Ewell) Date: Sat, 16 May 2015 11:07:13 -0600 Subject: Tag characters Message-ID: <794493C42D714C3C8A58D2F45AA36663@DougEwell> L2/15-145R says: > On some platforms that support a number of emoji flags, there is > substantial demand to support additional flags for the following: > [...] > Certain supra-national regions, such as Europe (European Union flag) > or the world (e.g. United Nations flag). These can be represented > using UN M49 3-digit codes, for example "150" for Europe or "001" for > World. These are uncomfortable equivalence classes. Not all countries in Europe are members of the European Union, and the concept of "United Nations" is not really the same by definition as "all countries in the world." The remaining UN M.49 code elements that don't have a 3166-1 equivalent seem wholly unsuited for this mechanism (and those that do, don't need it). 
There are no flags for "Middle Africa" or "Latin America and the Caribbean" or "Landlocked developing countries." Some trans-national organizations might _almost_ seem as if they could be shoehorned into an M.49 code element, like identifying 035 "South-Eastern Asia" with the ASEAN flag, but this would be problematic for the same reasons as 150 and 001. Among the ISO 3166-1 "exceptionally reserved" code elements are "EU" for "European Union" and "UN" for "United Nations." If these flags are the use cases, why not simply use those alpha-2 code elements, instead of burdening the new mechanism with the 3-digit syntax? -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Sat May 16 14:28:12 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 16 May 2015 21:28:12 +0200 Subject: Tag characters In-Reply-To: <794493C42D714C3C8A58D2F45AA36663@DougEwell> References: <794493C42D714C3C8A58D2F45AA36663@DougEwell> Message-ID: 2015-05-16 19:07 GMT+02:00 Doug Ewell : > L2/15-145R says: > > On some platforms that support a number of emoji flags, there is >> substantial demand to support additional flags for the following: >> [...] >> Certain supra-national regions, such as Europe (European Union flag) >> or the world (e.g. United Nations flag). These can be represented >> using UN M49 3-digit codes, for example "150" for Europe or "001" for >> World. >> > > These are uncomfortable equivalence classes. Not all countries in Europe > are members of the European Union But the flag of the European in fact belongs to the Council of Europe that created it 30 years before the European Community adopted it. According to the Coucil of Europe, the flag is appropriate for ALL countries in Europe. In summary the flag does represents *not only* the EU. It is suitable as well for Russia, Belarussia (even if its seat is suspended in the Coucil of Europe), or Kazakhstan and Turkey (even if only a part of these countries is in Europe). > and the concept of "United Nations" is not really the same by definition > as "all countries in the world." > Yes but the UN recognizes a set of territories (not always their government) that covers the whole world (including Antarctica where no government is also recognized, as well as territorial waters of these territories, plus the international waters that the UN protects). Not all countries also are required to become members of the UN (the Holy See/Vatica is not a full member, but it is recognized; same remark for Palestine). So the UN has a competence on the whole world, and all people of the world can legally seek protection from the UN, wherever they live, or even if they have no country to recognize them a nationality). If you want to seek territories where the UN has no authority at all, the nearest ones are on the Moon ! -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Sat May 16 15:33:55 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 16 May 2015 21:33:55 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> <20150516160239.16638123@JRWUBU2> Message-ID: <20150516213355.7891b4b6@JRWUBU2> On Sat, 16 May 2015 18:29:18 +0200 Philippe Verdy wrote: > 2015-05-16 17:02 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > There is an annoying error. You appear to assume that U+0302 > > COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, but > > they don't; they have the same combining class, namely 230. I'm > > going to assume that 0303 is a typo for 0323. > > > Not a typo, and I did not made the assumption you suppose because I > chose then so that they were effectively using the **same** combining > class, so that they do not commute. In that case you have an even worse problem. Neither the trace nor the string \u0303\u0302\u0302 matches the pattern (\u0302\u0302\0323)*(\u0302\0303)*\u0302, but the string does match the regular expression (?\u0302?\u0302?\0323|?\u0302?\0323?\u0302|?\u0302?\u0302?\0323)*(?\u0302?\ 0303|?\0303?\u0302)*?\u0302? You've transformed (\u0302\u0303) into (?\u0302?\0303|?\0303?\u0302), but that is unnecessary and wrong, because U+0302 and U+0303 do not commute. > It was the key fact of my argument that destroys your argumentation. Which argument? Restoring the \u303, the fact that remains that \u0323\u0323\u0302\u0302\u0302\u0302 is canonically equivalent to \u0302\u0302\u0323\u0302\u0323\u0302, which clearly matches the corrected old regex (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302. However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the corrected new regex (?\u0302?\u0302?\u0323|?\u0302?\0323?\u0302|?\u0323?\u0302?\u0302)*(?\u0302?\u0303|?\u0303?\u0302)*?\u0302? Do you claim that this argument is destroyed? If it is irrelevant, why is it irrelevant? It shows that your transform does not solve the original problem of missed matches. > Reread carefully and use the example string I gave and don't assume I > wanted to write u0323 instead of u0303. I'm not at all sure what your example string is. I ran my program to watch its progression with input \u0323\u0323\u0302\u0302, which does not match the pattern, and attach the outputs for your scorn. I have added comments started by #. # NDE = new dead end - I could tweak the program so this state is not entered. # NDE! = new dead end that might not be easy to avoid. # ODE = old dead end - derived from a state already labelled ODE or NDE. # ODE! = old dead end - derived from a state already labelled ODE! or NDE!. Here are the run outputs, with blank lines added to assist formatting. $ ./regex -b '(\u0302\u0302\u0323)*(\u0302\u0303)*\u0302' '\u0323\u0302\u0323\u0302' # ignore line wrapping above. Examining /home/richard/unicode/UCD/7.00/PropertyAliases.txt. Examining /home/richard/unicode/UCD/7.00/PropertyValueAliases.txt. Examining /home/richard/unicode/UCD/7.00/SpecialCasing.txt. Examining /home/richard/unicode/UCD/7.00/Scripts.txt. Examining /home/richard/unicode/UCD/7.00/PropList.txt. 
Simple Unicode regex "\u0323\u0302\u0302" Simple ASCII regex "" # I construct A* = (|A+) Simple Unicode regex "\u0302\u0303" Simple Unicode regex "\u0302" Initial states: 0) LLLL0 # The states are named according to a hierarchy of regexes. # LL = regex (\u0302\u0302\u0323)* # LLL = regex (\u0302\u0302\u0323)+ # LLLL = regex \u0302\u0302\u0323. # This is implemented as 'Simple Unicode regex "\u0323\u0302\u0302"'. # 0 means about to compare with character at offset 0, i.e. 0 1) LLRM # LLR = Empty string regex. # M = matched 2) LRLL0 # LR = regex (\u0302\u0303)* # LRL = regex (\u0302\u0303)+ # LRLL = regex \u0302\u0303 3) LRRM # LRR = Empty string regex. 4) R0 # R = regex \u0302 =0323=00:06:= # Get U+0323 from whole (=0) at byte 0 of argument LLLL0 => LLLL2 LLLL0 => LLLN001220:0:L2 # NDE! =0323=012:018:= # Note that string is input in NFD order. LLLL2 => LLLN001220:2:L2 # Now running LLLL and LLLR, whose states have relative names 2 and L2. # LLLR is a clone of LLL. # This recursion enables the recognition of unrecognisable Kleene # stars. It makes the automaton non-finite. # 001 is length in hex of name of relative state of LLLL # 220 means non-starters of ccc <= 220 will not be fed to LLLL LLLN001220:0:L2 => LLLN001220:0:N001220:2:L2 # ODE! =0302=06:012:= LLLN001220:2:L2 => LLLN001220:4:L2 LLLN001220:2:L2 => LLLN001230:2:L4 # NDE LLLN001220:2:L2 => LN00D230:LN001220:2:L2:LL2 # NDE # L = regex (\u0302\u0302\u0323)*(\u0302\u0303)* # NDE LLLN001220:2:L2 => LN00D230:LN001220:2:L2:LN001230:0:L2 # NDE LLLN001220:2:L2 => N00E230:LLN001220:2:L2:M # NDE LLLN001220:0:N001220:2:L2 => LLLN001230:0:N001220:4:L2 # ODE! LLLN001220:0:N001220:2:L2 => LLLN001230:0:N001230:2:L4 # ODE! LLLN001220:0:N001220:2:L2 => LN017230:LN001220:0:N001220:2:L2:LL2 # ODE! LLLN001220:0:N001220:2:L2 => # Line-break is email artefact. LN017230:LN001220:0:N001220:2:L2:LN001230:0:L2 # ODE! LLLN001220:0:N001220:2:L2 => N018230:LLN001220:0:N001220:2:L2:M # ODE! =0302=018:024:= LLLN001220:4:L2 => LLLN001220:M:L2 # Redundant - should purge somehow. LLLN001220:4:L2 => LLLL2 # Regex LLLL 'recognised' - rename LLLRL as LLLL. LLLN001220:4:L2 => LLLN001230:4:L4 # NDE LLLN001220:4:L2 => LN00D230:LN001220:4:L2:LL2 # NDE LLLN001220:4:L2 => LN00D230:LN001220:4:L2:LN001230:0:L2 # NDE LLLN001220:4:L2 => N00E230:LLN001220:4:L2:M # NDE LLLN001230:2:L4 => LLLN001230:2:LM # ODE LLLN001230:2:L4 => LLLN001230:2:L0 # ODE LLLN001230:2:L4 => LN00D230:LN001230:2:L4:LL2 # ODE LLLN001230:2:L4 => LN00D230:LN001230:2:L4:LN001230:0:L2 # ODE LLLN001230:2:L4 => N00E230:LLN001230:2:L4:M # ODE LN00D230:LN001220:2:L2:LL2 => LN00D230:LN001220:2:L2:LN001230:2:L2 # ODE LN00D230:LN001220:2:L2:LL2 => N019230:N00D230:LN001220:2:L2:LL2:M # ODE LN00D230:LN001220:2:L2:LN001230:0:L2 => # Line-break is e-mail artefact LN00D230:LN001220:2:L2:LN001230:0:N001230:2:L2 # ODE LN00D230:LN001220:2:L2:LN001230:0:L2 => # Line-break is email artefact N023230:N00D230:LN001220:2:L2:LN001230:0:L2:M # ODE LLLN001230:0:N001220:4:L2 => LLLN001230:0:N001220:M:L2 # ODE! LLLN001230:0:N001220:4:L2 => LLLN001230:0:L2 # ODE! LLLN001230:0:N001220:4:L2 => LLLN001230:0:N001230:4:L4 # ODE! LLLN001230:0:N001220:4:L2 => LN017230:LN001230:0:N001220:4:L2:LL2 # ODE! LLLN001230:0:N001220:4:L2 => # Line-break is e-mail artefact. LN017230:LN001230:0:N001220:4:L2:LN001230:0:L2 # ODE! LLLN001230:0:N001220:4:L2 => N018230:LLN001230:0:N001220:4:L2:M # ODE! LLLN001230:0:N001230:2:L4 => LLLN001230:0:N001230:2:LM # ODE! LLLN001230:0:N001230:2:L4 => LLLN001230:0:N001230:2:L0 # ODE! 
LLLN001230:0:N001230:2:L4 => LN017230:LN001230:0:N001230:2:L4:LL2 # ODE! LLLN001230:0:N001230:2:L4 => # Line-break is e-mail artefact LN017230:LN001230:0:N001230:2:L4:LN001230:0:L2 # ODE! LLLN001230:0:N001230:2:L4 => N018230:LLN001230:0:N001230:2:L4:M # ODE! LN017230:LN001220:0:N001220:2:L2:LL2 => LN017230:LN001220:0:N001220:2:L2:LN001230:2:L2 # ODE! LN017230:LN001220:0:N001220:2:L2:LL2 => N023230:N017230:LN001220:0:N001220:2:L2:LL2:M # ODE! LN017230:LN001220:0:N001220:2:L2:LN001230:0:L2 => LN017230:LN001220:0:N001220:2:L2:LN001230:0:N001230:2:L2 # ODE! LN017230:LN001220:0:N001220:2:L2:LN001230:0:L2 => N02D230:N017230:LN001220:0:N001220:2:L2:LN001230:0:L2:M # ODE! End marker is at 024:OVF > And you'll see that backtracing is necessary for this case (EVEN if > you don't care about capture groups but you are only interested in > the global capture $0). What I see is the desirability of some optimisation, but no problem in principle. Now I might see something different with your intended example - but until I see it I think my examination would be overwhelmed by dead-end state propagations. If you are making the point that a backtracking automaton might need to backtrack, then I won't dispute that point. Richard. From srl at icu-project.org Sat May 16 15:39:17 2015 From: srl at icu-project.org (Steven R. Loomis) Date: Sat, 16 May 2015 13:39:17 -0700 Subject: Tag characters In-Reply-To: <794493C42D714C3C8A58D2F45AA36663@DougEwell> References: <794493C42D714C3C8A58D2F45AA36663@DougEwell> Message-ID: <02F75A52-3E46-449D-8144-D63A087E8383@icu-project.org> See the meeting minutes and the actual utr51. Enviado desde nuestro iPhone. > El may 16, 2015, a las 10:07 AM, Doug Ewell escribi?: > > L2/15-145R says: > >> On some platforms that support a number of emoji flags, there is >> substantial demand to support additional flags for the following: >> [...] >> Certain supra-national regions, such as Europe (European Union flag) >> or the world (e.g. United Nations flag). These can be represented >> using UN M49 3-digit codes, for example "150" for Europe or "001" for >> World. > > These are uncomfortable equivalence classes. Not all countries in Europe are members of the European Union, and the concept of "United Nations" is not really the same by definition as "all countries in the world." > > The remaining UN M.49 code elements that don't have a 3166-1 equivalent seem wholly unsuited for this mechanism (and those that do, don't need it). There are no flags for "Middle Africa" or "Latin America and the Caribbean" or "Landlocked developing countries." > > Some trans-national organizations might _almost_ seem as if they could be shoehorned into an M.49 code element, like identifying 035 "South-Eastern Asia" with the ASEAN flag, but this would be problematic for the same reasons as 150 and 001. > > Among the ISO 3166-1 "exceptionally reserved" code elements are "EU" for "European Union" and "UN" for "United Nations." If these flags are the use cases, why not simply use those alpha-2 code elements, instead of burdening the new mechanism with the 3-digit syntax? > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From doug at ewellic.org Sat May 16 16:01:24 2015 From: doug at ewellic.org (Doug Ewell) Date: Sat, 16 May 2015 15:01:24 -0600 Subject: Tag characters In-Reply-To: <02F75A52-3E46-449D-8144-D63A087E8383@icu-project.org> References: <794493C42D714C3C8A58D2F45AA36663@DougEwell> <02F75A52-3E46-449D-8144-D63A087E8383@icu-project.org> Message-ID: <9E6D62BF9816458A83577364CB380E54@DougEwell> Steven R. Loomis wrote: > See the meeting minutes and the actual utr51. Sorry, I didn't find anything dealing with numeric codes in Section E.1.3 of the meeting minutes, and the copy of UTR #51 at unicode.org doesn't appear to have been updated for anything beyond the existing RIS. What specifically should I be looking for? -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Sun May 17 09:33:15 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 17 May 2015 16:33:15 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150516213355.7891b4b6@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> <20150516160239.16638123@JRWUBU2> <20150516213355.7891b4b6@JRWUBU2> Message-ID: 2015-05-16 22:33 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Sat, 16 May 2015 18:29:18 +0200 > Philippe Verdy wrote: > > > 2015-05-16 17:02 GMT+02:00 Richard Wordingham < > > richard.wordingham at ntlworld.com>: > > > > > There is an annoying error. You appear to assume that U+0302 > > > COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, but > > > they don't; they have the same combining class, namely 230. I'm > > > going to assume that 0303 is a typo for 0323. > > > > > > Not a typo, and I did not made the assumption you suppose because I > > chose then so that they were effectively using the **same** combining > > class, so that they do not commute. > > In that case you have an even worse problem. Neither the trace nor the > string \u0303\u0302\u0302 matches the pattern > (\u0302\u0302\0323)*(\u0302\0303)*\u0302, but the string does match the > regular expression > (?\u0302?\u0302?\0323|?\u0302?\0323?\u0302|?\u0302?\u0302?\0323)*(?\u0302?\ > 0303|?\0303?\u0302)*?\u0302? > > You've transformed (\u0302\u0303) into (?\u0302?\0303|?\0303?\u0302), > but that is unnecessary and wrong, because U+0302 and U+0303 do not > commute. Oh right! Thanks for pointing, it was intended you can read it as. (?\u0302?\u0302?\0323|?\u0302?\0323?\u0302|?\u0302?\u0302?\0323)*(?\u0302?\0303)*?\u0302? But my argument remains because of the presence of \0302 in the second subregexp (which additionally is a separate capture, but here I'm not concentrating on the impact in numbered captures, but only on the global capture aka $0) > > It was the key fact of my argument that destroys your argumentation. > > However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the > corrected new regex > > (?\u0302?\u0302?\u0323|?\u0302?\0323?\u0302|?\u0323?\u0302?\u0302)*(?\u0302?\u0303)*?\u0302? > > Do you claim that this argument is destroyed? If it is irrelevant, why > is it irrelevant? It shows that your transform does not solve the > original problem of missed matches. > Why doesn't it solve it? Note that the notation with tacks is just the first transform. Of course you can optimize it by factorizing the common prefixes in each alternative. 
In the following the 1st and 4th tacks have some common followers in their lists of characters or character classes they expect (for advancing to the next tack), but the 2nd and 5th tack expect different followers. (?\u0302?\u0302?\u0323|?\u0302?\0323?\u0302|?\u0323?\u0302?\u0302)*(?\u0302?\u0303)*?\u0302? OK I understand the need for "counting" characters present in regexps when they are sharing the same combining classes, but counting does not work correctly, in fact you have to keep counters for each distinct combining character with non-zero combining class for how they contribute to the total length of the "star" group. They also don't contribute necessarily to the same total when the regexp specifies them multiple times (a simple measurment of the total length of the group is evidently not enough, all counters must be exact multiples of the number of occurences (counter[c]) of each combining character (c) in the original untransformed content of each alternative in the star group, and the second factor n of this multiple must be identical for all counters The total length is in pseudo-code: { sum=0; for(c:v in counter) sum += v; return sum; } but it has no use by itself. If the number of (non-repeated) original untransformed alternative are in mustoccur[] the check to perform is this pseudo-code: var n = null; foreach(c:m in mustoccur) { checkthat(counter[c] % m == 0); if (n == null) n = counter[c] / m; else checkthat(counter[c] / m == n); } > > Reread carefully and use the example string I gave and don't assume I > > wanted to write u0323 instead of u0303. > > I'm not at all sure what your example string is My example was the original regexp without the notation tacks: (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302 It exposes some of the critical difficulties, first for returning correct global matches (but then also for for captures, and the effect of "ungreedy" options of Perl (and PCRE working in Perl-compatible mode or in extended mode) and most regexp engines (whose default behavior is "greedy"): the "ungreedy" option causes significant slowdowns with additional rollbacks or more work to maintain an efficient backtracing directly in the current state of the automata (if you attempt to use deterministic rules everywhere it is possible). But we know that it's not possible for all regexps in general, otherwise regexp engines would just be simple LR(n) parsers with bounded n, or even simpler LALR parsers like Bison/Yacc but without their backtracing support for "shift"/"reduce" ambiguities, these LALR parsers are also greedy by default and resolve ambiguities by "shifting" first, leaving the code determine what to do when after shiting there's a parse error caused by unexpected values, but LALR parsers do not have a clean way to handle the correct rollback to the last ambiguous shift/reduce state with a special match rule, and they do not support trying "reduce" first to get the "ungreedy" behavior as they cannot return to this state to choose the "shift" alternative). That's why since long lexers are written with regexps, and syntaxic scanners written preferably with LALR parsers which cannot work alone without a separate lexer. But using LALR parsers does not work with common languages like Fortran; it works for parsing language like C/C++ because they are specified so that shift/reduce ambiguities are resolved using "shift" always (i.e. the greedy behavior). 
Very few parser generator support both working mode (except the excellent PCCS that I have used since the early 1990's when it was still not rewritten in Java and was a student project in Purdue University, and that combines all the advantages of regexps and LR parsers, with very clean control of backtracing, it also supports layered parsing with multiple local parsers if needed, even without wrining any piece of output code, you can describe the full syntaxic and lexical rules of almost all languages in a single specification). -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sun May 17 09:45:18 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 17 May 2015 16:45:18 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150516213355.7891b4b6@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> <20150516160239.16638123@JRWUBU2> <20150516213355.7891b4b6@JRWUBU2> Message-ID: 2015-05-16 22:33 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > I'm not at all sure what your example string is. I ran my program to > watch its progression with input \u0323\u0323\u0302\u0302, which does > not match the pattern, and attach the outputs for your scorn. I have > added comments started by #. > Sorry for not commenting it, this is the internal tricks and outputs of your program, and your added comments does not allow me to interpret what all this means, i.e. the exact role of the notations with sequences or "L" or "R" or "N", and what the "=>" notation means (I suppose this is noting an advance rule and that the left-hand side is the state before, the right-hand-side is the state after, but I don't see where is the condition (the character or character class to match, or an error condition). You've only "explained" partly the NDE and ODE comments and the "!" when it is appended. Is that really what your regexp engine outputs as its internally generated parser tables (only "friendly" serialized as a "readable" text) ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun May 17 11:52:56 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 17 May 2015 17:52:56 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> <20150516160239.16638123@JRWUBU2> <20150516213355.7891b4b6@JRWUBU2> Message-ID: <20150517175256.1bc136f4@JRWUBU2> On Sun, 17 May 2015 16:45:18 +0200 Philippe Verdy wrote: > 2015-05-16 22:33 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > I'm not at all sure what your example string is. I ran my program > > to watch its progression with input \u0323\u0323\u0302\u0302, which > > does not match the pattern, and attach the outputs for your scorn. > > I have added comments started by #. > > > > Sorry for not commenting it, this is the internal tricks and outputs > of your program, and your added comments does not allow me to > interpret what all this means, i.e. 
the exact role of the notations > with sequences or "L" or "R" or "N", and what the "=>" notation means > (I suppose this is noting an advance rule and that the left-hand side > is the state before, the right-hand-side is the state after, but I > don't see where is the condition (the character or character class to > match, or an error condition). You've only "explained" partly the NDE > and ODE comments and the "!" when it is appended. 'ODE' and 'NDE' mean the transitions should not occur when I finish my current set of edits. The exclamation mark means the optimisation I first though of wouldn't eliminate it. > Is that really what your regexp engine outputs as its internally > generated parser tables (only "friendly" serialized as a "readable" > text) ? When running the regex, I really do hold the states in forms like LLLL2 and LLLN001220:2:L2. (The colons are unnecessary; I included them for readability.) It's designed for proof of principle, rather than high speed. There is also a tree corresponding to the analysis of the regex; the nodes record how the lower level regexes are combined. The branching nodes in the example are for sequences. In the simplest case, a matching expression will, in some canonically equivalent form, be the concatenation of a string matching the left hand node and a string matching the right mode. For iterations ('*' and '+', though I treat '+' as basic), the tree does not need a corresponding right branch, as all the information about the regex is held in the left branch. An 'L' means that the input sequence is proceeding through the left branch. An 'R' means that it has completed its passage through the left branch, and is now proceeding through the right branch. All this would be applicable if I were ignoring canonical equivalence. An 'N' (for 'normalisation') means that parsing is passing through the region where the normalisation has interleaved the left and right hand component strings. As I consider each fresh character, I have to consider its canonical combining class. The string for the state records what ccc is blocked from the left hand string. As I take the characters from the input string in NFD order, I only need to remember the highest blocked ccc. The first character I receive with a lower ccc will be a starter, at which point I will only be progressing the right hand component string. For the state in the parent regex, I record the 'N' (as opposed to an 'L' or 'R'), the highest blocked ccc, the state in the left-hand regex and the state in the right-hand regex. The input characters are recorded in the form ==::= The character location is recorded in the from . The part is a single digit. 0 means whole character, 1 means first character in character decomposing to multiple characters, 2 means second and so on. Thus, as the first U+0302 is stored as 6-character escape code '\u0302', I record the position as: =0302=06:012:= I then record the consequential transition from each state to another state. As the basic structure is that of a non-deterministic finite automaton (as at https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton ), there may be no or many transitions from a particular state. There are no error conditions as such. As I record each transition, I record whether there is now a match to the whole regex and whether the state is a duplicate. Detecting duplicates is part of the key to the classical NFA's better resistance to 'pathological inputs' compared to back-tracking algorithms. 
There are two main state numberings for the bottom level regexes. The main bottom level regex is a simple regex with no alternates or groupings. The engine propagates the simple regex as a string and records the state as the byte offset of the next character to compare against. The regex is stored in Latin-1 or UTF-8. (Latin-1 is not suitable for precomposed characters.) Thus when the first character input is U+0323 and is compared against the regex \u0323\u0302\u0302, the state for the regex changes from 0 to 2, as U+0323 occupies 2 bytes in UTF-8. This is recorded as an overall state transition 'LLLL0 => LLLL2'. When all characters in the string have been matched, the state becomes 'M'. The simple regexes have one-to-many state progressions to handle iterations and optionality ('*', '+' and '?'). The second system is for Unicode properties. The state records the composition of precomposed characters by using the accumulated codepoint as the state. However, the state also includes a success flag for ease of composing the acceptance or otherwise of the overall state and to determine transitions from one regex to the next. My program does not calculate what the characters are for a state transition to occur. Instead, it calculates what transitions occur in response to an input character. Richard. From richard.wordingham at ntlworld.com Sun May 17 19:03:02 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 18 May 2015 01:03:02 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> <20150516160239.16638123@JRWUBU2> <20150516213355.7891b4b6@JRWUBU2> Message-ID: <20150518010302.79f2b871@JRWUBU2> On Sun, 17 May 2015 16:33:15 +0200 Philippe Verdy wrote: > 2015-05-16 22:33 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > On Sat, 16 May 2015 18:29:18 +0200 > > Philippe Verdy wrote: > > > > > 2015-05-16 17:02 GMT+02:00 Richard Wordingham < > > > richard.wordingham at ntlworld.com>: > > > > > > > There is an annoying error. You appear to assume that U+0302 > > > > COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, > > > > but they don't; they have the same combining class, namely > > > > 230. I'm going to assume that 0303 is a typo for 0323. > > > > > > > > > Not a typo, and I did not made the assumption you suppose because > > > I chose then so that they were effectively using the **same** > > > combining class, so that they do not commute. > > > > In that case you have an even worse problem. Neither the trace nor > > the string \u0303\u0302\u0302 matches the pattern > > (\u0302\u0302\0323)*(\u0302\0303)*\u0302, but the string does match > > the regular expression > > (?\u0302?\u0302?\0323|?\u0302?\0323?\u0302|?\u0302?\u0302?\0323)*(?\u0302?\ > > 0303|?\0303?\u0302)*?\u0302? > > > > You've transformed (\u0302\u0303) into > > (?\u0302?\0303|?\0303?\u0302), but that is unnecessary and wrong, > > because U+0302 and U+0303 do not commute. > > > Oh right! Thanks for pointing, it was intended you can read it as. > > (?\u0302?\u0302?\0323|?\u0302?\0323?\u0302|?\u0302?\u0302?\0323)*(?\u0302?\0303)*?\u0302? 
> > But my argument remains because of the presence of \0302 in the second > subregexp (which additionally is a separate capture, but here I'm not > concentrating on the impact in numbered captures, but only on the > global capture aka $0) > > > > > It was the key fact of my argument that destroys your > > > argumentation. > > > > However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the > > corrected new regex > > > > (?\u0302?\u0302?\u0323|?\u0302?\0323?\u0302|?\u0323?\u0302?\u0302)*(?\u0302?\u0303)*?\u0302? > > > > Do you claim that this argument is destroyed? If it is irrelevant, > > why is it irrelevant? It shows that your transform does not solve > > the original problem of missed matches. > > > > Why doesn't it solve it? Sorry, my example wasn't quite right. It should have two combining dots below and five circumflexes, not four as I wrote it. I will first explain how my NDnear-FA handles it - I have now removed the generation of the dead end states. Initial states: 0) LLLL0 # Starting the \u0302\u0302\u0323 factor, # implemented as \u0323\u032\u0320 1) LLRM # Completed the zero trip alternative to (\u0302\u0302\u0323)+ # Not actually useful. 2) LRLL0 # Starting the \u0302\u0303 factor 3) LRRM # Completed the zero trip alternative to (\u0302\u0303)+ 4) R0 # Starting the \u0302 factor =0323=00:06:= LLLL0 => LLLL2 # \u0323\u0302\u0302 factor progressed as far as \u0323 =0323=06:012:= LLLL2 => LLLN001220:2:L2 # Progressing 2 successive repeats of factor. # Both have progressed as far as \u0323. # Finiteness would restrict me to, say, 3 repeats # in progress. # The states of the finite DFA are a cross product of 3 copies of # the DFAs for \u0323\u0302\u0302 and 2 copies of the set of relevant # ccc values. By no means all of these states are used. # In the Kleene stars of the regular expression guaranteed by # recognisability, 3 copies caters for the worst case, xyz, where x # has a starter and ends in a non-starter, y consists of non-starters # with the same canonical combining class, and z starts with # non-starter and contains a starter, e.g. # x = \u0f40\u0f74, y = \u0f7a\u0f7a\u0f7a, z = \u0f71\u0f42 # to_NFD(xyz) = \u0f40\u0f71\u0f7a\u0f7a\u0f7a\u0f74\u0f42 =0302=012:018:= LLLN001220:2:L2 => LLLN001220:4:L2 # Still progressing two factors # First has progressed to \u0323\u0302 and second to # \u0323. The other way round has been pruned by the # automated observation that if \u0302 is blocked from # first factor, the factor cannot be completed. =0302=018:024:= LLLN001220:4:L2 => LLLN001220:M:L2 # Completed the first factor LLLN001220:4:L2 => LLLL2 # As first factor is complete, remove it from # consideration and relabel second factor as # first. =0302=024:030:= LLLL2 => LLLL4 # \u0323\0302\u0302 completed as far as \u0323\u0302 =0302=030:036:= LLLL4 => LLLLM # \u0323\u0302\u0302 is complete. LLLL4 => LRLL0 # So start \u0302\u0303 factor. LLLL4 => LRRM # Alternatively, completed the zero trip option of # (\u0302\u0303)* LLLL4 => R0 # Or, we have progressed as far as the final \u0302 LLLL4 => LLLL0 # Or, start another \u0323\u0302\u0302 =0302=036:042:= LRLL0 => LRLL2 # Got as far as \u0302 in \u0302\u0303 R0 => RM (match) # Or completed the final \u0302. End marker is at 042:OVF Could you please talk me through how your system recognises the string \u0323\u0323\u0302\u0302\u0302\u0302\u0302 as matching the regex. I can't work out how it is supposed to work from your description. Richard. 
From abdo.alrhman.aiman at gmail.com Mon May 18 06:49:27 2015 From: abdo.alrhman.aiman at gmail.com (=?UTF-8?B?2LnYqNivINin2YTYsdit2YXYp9mGINij2YrZhdmG?=) Date: Mon, 18 May 2015 14:49:27 +0300 Subject: Arabic diacritics In-Reply-To: References: Message-ID: many thanks, this exactly the needed information :) respectfully 2015-05-15 19:09 GMT+03:00 Denis Jacquerye : > You should use ARABIC SHADDA U+0651 in all positions. The presentation > forms (isolated, medial, final forms) are for compatibility with legacy > systems. > See what is said in http://www.unicode.org/versions/Unicode7.0.0/ch09.pdf > about the Arabic Presentation Forms-B. > > Cheers, > > > On Fri, 15 May 2015 at 15:53 ??? ??????? ???? < > abdo.alrhman.aiman at gmail.com> wrote: > >> hi, >> >> regarding the Arabic diacritics. e.g. for the Shadda, we >> have: >> >> 1. The form that people type: >> http://unicode-table.com/en/0651/ >> >> 2. An Isolated form. It should be the same, but looks different in the >> Unicode table, which is confusing me now. >> http://unicode-table.com/en/FE7C/ >> >> 3. A medial form: >> http://unicode-table.com/en/FE7D/ >> >> When do I use 1/2, and when do I use 3? >> >> some diacritics has e.g. isolated and medial forms. Some have >> only one of these forms, some have both. So, where does each of them go? >> >> respectfully >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon May 18 13:19:01 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 18 May 2015 11:19:01 -0700 Subject: Flag tags with U+1F3F3 and subtypes Message-ID: <20150518111901.665a7a7059d7ee80bb4d670165c8327d.7866019baa.wbe@email03.secureserver.net> L2/15-145R says: > In CLDR 28, LDML will define a unicode_subdivision_subtag which also > provides validity criteria for the codes used for regional > subdivisions (see CLDR ticket #8423). When representing regional > subdivisions using ISO 3166-2 codes, only those codes that are valid > for the LDML unicode_subdivision_subtag should be used. The preliminary subdivisions.xml file includes entries like this: (GB-SCT) for the Scottish flag and <1F3F3 E0047 E0042 E002D E004E E004C E004B> (GB-NLK) for the North Lanarkshire council area flag -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From markus.icu at gmail.com Mon May 18 13:28:18 2015 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 18 May 2015 11:28:18 -0700 Subject: Flag tags with U+1F3F3 and subtypes In-Reply-To: <20150518111901.665a7a7059d7ee80bb4d670165c8327d.7866019baa.wbe@email03.secureserver.net> References: <20150518111901.665a7a7059d7ee80bb4d670165c8327d.7866019baa.wbe@email03.secureserver.net> Message-ID: On Mon, May 18, 2015 at 11:19 AM, Doug Ewell wrote: > Is the new mechanism intended to allow flag tags that include either > "subtype" values or "contains" values? As far as I can tell from your quotes, CLDR will say what's valid (plus containment info), and Unicode permits you to show a flag for any valid tag. North Lanarkshire seems perfectly fine. I am curious to see if the redundant hyphen will be part of the syntax. markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Mon May 18 13:35:45 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 18 May 2015 19:35:45 +0100 Subject: Regexes, Canonical Equivalence and Backtracking of Input Message-ID: <20150518193545.51cb95b8@JRWUBU2> Philippe and I have got bogged down in a long discussion of how to parse traces of Unicode strings under canonical equivalence against non-regular Kleene star of regular expressions. Fortunately, such expressions can be expected to have very little use. A seemingly simple example is the regex \u0f73* i.e. any number of occurrences of U+0F73 TIBETAN VOWEL SIGN II, and not \u0f71\u0f72*. An example of a string matching under canonical equivalence is 0F71 0F71 0F72 0F72. I believe we both thought that characters would arrive from the trace in a deterministic order. Now, many regular expression engines back-track their parsing of the input string (no-one has reported working with input traces). A possibly useful trick would be for characters to be taken from the input file in accordance with the matching to the pattern, with input also back-tracked if matching fails. The notion of next character would depend on the state of the parsing algorithm. In the example above, the engine would just take the input in the order 0F71 0F72 0F71 0F72. Match found, job done. One advantage of this scheme is that there would be no need for adjustments to deal with the interleaving of adjacent matches to successive subexpressions. There would be no nagging worry that one's rational expression was not a regular expression when applied to traces. Any theoreticians around may be wondering how this magic is achieved. The simple answer is that the non-finiteness has been transferred to: (1) the back-tracking through parse options; and (2) the algorithm to walk through the character sequencing options. The algorithm itself should be tractable - Mark Davis has published an algorithm to generate all strings canonically equivalent to a Unicode string, and what we need might not be so complex. I offer this thought up as it seems that, for a regex engine working on traces with deterministic input, the byte code for a regex concatenation AB or iteration A* is much more complicated than the code for the subregexes A and B. I have a worry that the length of the compiled code might even be exponential with the length of the regex. (I may be wrong - there might be a limit to what one can do for worst case complexity of the interleaving.) Choosing the input to match the regex would remove this problem. Richard. From andrewcwest at gmail.com Mon May 18 13:37:06 2015 From: andrewcwest at gmail.com (Andrew West) Date: Mon, 18 May 2015 19:37:06 +0100 Subject: Flag tags with U+1F3F3 and subtypes In-Reply-To: <20150518111901.665a7a7059d7ee80bb4d670165c8327d.7866019baa.wbe@email03.secureserver.net> References: <20150518111901.665a7a7059d7ee80bb4d670165c8327d.7866019baa.wbe@email03.secureserver.net> Message-ID: On 18 May 2015 at 19:19, Doug Ewell wrote: > > Is the new mechanism intended to allow flag tags that include either > "subtype" values or "contains" values? For example: That is my understanding. 
> <1F3F3 E0047 E0042 E002D E0053 E0043 E0054> (GB-SCT) > for the Scottish flag > > and > > <1F3F3 E0047 E0042 E002D E004E E004C E004B> (GB-NLK) > for the North Lanarkshire council area flag I don't believe that North Lanarkshire has an associated flag, which I think is the case for most UK counties and councils (Cornwall, Devon and Dorset all have flags, but they may be the exceptions). In fact not all of the four nations comprising the UK have a flag -- for political reasons there is no official flag for Northern Ireland, so I do not know what an implementation would display for <1F3F3 E0047 E0042 E002D E004E E0049 E0052> (GB-NIR), perhaps just a plain flag emblazoned with "GB-NIR". Andrew From verdy_p at wanadoo.fr Mon May 18 13:47:19 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 18 May 2015 20:47:19 +0200 Subject: Flag tags with U+1F3F3 and subtypes In-Reply-To: References: <20150518111901.665a7a7059d7ee80bb4d670165c8327d.7866019baa.wbe@email03.secureserver.net> Message-ID: The hyphen is not redundant in ISO 3166 that defines primary codes with variable length (even if ISO 3166 part 1 for now only use two-letter codes). Sometime in a future, two letters will not be enough even in ISO 3166-1, if countries continue to split/merge (this does not happen frequently but is occurs every few years; and it will not be possible to reuse old codes that are maintained for a long period). May be then we'll have ISO 3166-1 codes using digits (such as "A1" or "1A"), but this will cause some problems to map them to IETF ccTLD codes (within the DNS root registry). As well the UN M.49 numeric codes will get full if it continues with its current allocation scheme (using ranges of numbers by continental regions). Or the other solution will be to extend the set of allowed letters. 2015-05-18 20:28 GMT+02:00 Markus Scherer : > On Mon, May 18, 2015 at 11:19 AM, Doug Ewell wrote: > >> Is the new mechanism intended to allow flag tags that include either >> "subtype" values or "contains" values? > > > As far as I can tell from your quotes, CLDR will say what's valid (plus > containment info), and Unicode permits you to show a flag for any valid tag. > North Lanarkshire seems perfectly fine. > > I am curious to see if the redundant hyphen will be part of the syntax. > > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon May 18 14:05:49 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 18 May 2015 21:05:49 +0200 Subject: Regexes, Canonical Equivalence and Backtracking of Input In-Reply-To: <20150518193545.51cb95b8@JRWUBU2> References: <20150518193545.51cb95b8@JRWUBU2> Message-ID: 2015-05-18 20:35 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > The algorithm itself should be tractable - Mark Davis has published > an algorithm to generate all strings canonically equivalent to a > Unicode string, and what we need might not be so complex. Even this algorithm from Mark Davis will fail in this case: - You can use it easily to transform a regexp containing (\u0F73) into a regexp containing (\u0F73|\u0F71\u0F72|\u0F71\u0F72) - But this leaves the same problem for unbounded repetititions with the "+" or "*" or "{m,}" operators. 
- However you can use it for bounded repetitions with "{m,n}", provided that "n" is not too large because the total number of expendaned alternatives (without repetitions) explodes exponentially with a power proportional to "n" (the base of the exponent depends on the basic non-repeated string and the number of canonical equivalents it has. Now all the problem is how to do the backtracking, and if it works, and how to expose the matched captures (which will still be discontiguous, including $0) and then how you can perform a safe find&replace operation: it is hard to specify the replacement with simple "$n" placeholders, you need more complex placeholders for handling discontiguous matches: $n has to become not just a string, but an object whose default "tostring" property is the exact content of the match, but other properties are needed to expose the interleaving characters, or some context before and after the match (notably when these contexts contain combining characters that are NOT blocked by the match itself. Backtracing is an internal thing before even handling matches, they occur where there is still NO match to return, even if the regexp engine offers a way to use a callback instead of a basic replacement string containing "$n" placeholders, so this callback would not be called. -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon May 18 14:33:37 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 18 May 2015 20:33:37 +0100 Subject: Flag tags with U+1F3F3 and subtypes In-Reply-To: References: <20150518111901.665a7a7059d7ee80bb4d670165c8327d.7866019baa.wbe@email03.secureserver.net> Message-ID: <20150518203337.4949e7cc@JRWUBU2> On Mon, 18 May 2015 19:37:06 +0100 Andrew West wrote: > > <1F3F3 E0047 E0042 E002D E004E E004C E004B> (GB-NLK) > > for the North Lanarkshire council area flag > > I don't believe that North Lanarkshire has an associated flag, which I > think is the case for most UK counties and councils (Cornwall, Devon > and Dorset all have flags, but they may be the exceptions). In fact > not all of the four nations comprising the UK have a flag -- for > political reasons there is no official flag for Northern Ireland, so I > do not know what an implementation would display for <1F3F3 E0047 > E0042 E002D E004E E0049 E0052> (GB-NIR), perhaps just a plain flag > emblazoned with "GB-NIR". As the Ulster Banner is still in use, and still does unofficially represent Northern Ireland, perhaps it should have its own codepoint. I'm not sure of the strength of the argument for St Patrick's Cross. Perhaps it too should have its own codepoint, especially if it is evolving from being a flag of Ireland (apparently not used by the Irish rugby union team) to a flag of Northern Ireland. Richard. From eliz at gnu.org Mon May 18 14:40:21 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Mon, 18 May 2015 22:40:21 +0300 Subject: Regexes, Canonical Equivalence and Backtracking of Input In-Reply-To: <20150518193545.51cb95b8@JRWUBU2> References: <20150518193545.51cb95b8@JRWUBU2> Message-ID: <83mw11ekt6.fsf@gnu.org> > Date: Mon, 18 May 2015 19:35:45 +0100 > From: Richard Wordingham > > Mark Davis has published an algorithm to generate all strings > canonically equivalent to a Unicode string Where can I find the description of that algorithm? 
From doug at ewellic.org Mon May 18 15:10:38 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 18 May 2015 13:10:38 -0700 Subject: Flag tags with U+1F3F3 and subtypes Message-ID: <20150518131038.665a7a7059d7ee80bb4d670165c8327d.b91abf14bc.wbe@email03.secureserver.net> Markus Scherer wrote: > As far as I can tell from your quotes, CLDR will say what's valid > (plus containment info), and Unicode permits you to show a flag for > any valid tag. North Lanarkshire seems perfectly fine. I'm under the impression that this will be a standard Unicode mechanism, defined in principle by TUS and in detail by the upcoming revision of UTR #51, with data (but no additional rules) supplied by CLDR. > I am curious to see if the redundant hyphen will be part of the > syntax. Like Philippe, I don't believe the hyphen is "redundant." ISO 3166-2 requires it (Section 5.2), and the syntax diagram at the end of L2/15-145R shows it: B ((TL{2} (TH (TL|TD){3})?) | (TD{3})) where TH is TAG HYPHEN-MINUS. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From doug at ewellic.org Mon May 18 15:14:32 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 18 May 2015 13:14:32 -0700 Subject: Flag tags with U+1F3F3 and subtypes Message-ID: <20150518131432.665a7a7059d7ee80bb4d670165c8327d.e4910e849c.wbe@email03.secureserver.net> I know I'll regret this... Philippe Verdy wrote: > Sometime in a future, two letters will not be enough even in ISO > 3166-1, if countries continue to split/merge (this does not happen > frequently but is occurs every few years; and it will not be possible > to reuse old codes that are maintained for a long period). ISO 3166-1 already defines alpha-3 and numeric code elements, as well as alpha-2. ISO 3166/MA has added approximately one code element per year on average since the breakup of the Soviet Union. There are approximately 336 unassigned alpha-2 code elements, and if any of the assigned ones is withdrawn, it can be recycled in 50 years. > May be then we'll have ISO 3166-1 codes using digits (such as "A1" or > "1A"), but this will cause some problems to map them to IETF ccTLD > codes (within the DNS root registry). Adapting to this challenge, if and when it arises, should be child's play for the DNS, which has recently introduced TLDs like ".???????????" (or ".xn--clchc0ea0b2g2a9gcd" if one prefers). > As well the UN M.49 numeric codes will get full if it continues with > its current allocation scheme (using ranges of numbers by continental > regions). Or the other solution will be to extend the set of allowed > letters. UN M.49 numeric code elements (equivalent to ISO 3166-1) are assigned alphabetically by English country name, or as close as possible, with some exceptions related to historical names. There are no allocations by geographical region. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Mon May 18 15:26:43 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 18 May 2015 22:26:43 +0200 Subject: Flag tags with U+1F3F3 and subtypes In-Reply-To: <20150518131432.665a7a7059d7ee80bb4d670165c8327d.e4910e849c.wbe@email03.secureserver.net> References: <20150518131432.665a7a7059d7ee80bb4d670165c8327d.e4910e849c.wbe@email03.secureserver.net> Message-ID: 2015-05-18 22:14 GMT+02:00 Doug Ewell : > I know I'll regret this... 
> You should not > > Philippe Verdy wrote: > > > Sometime in a future, two letters will not be enough even in ISO > > 3166-1, if countries continue to split/merge (this does not happen > > frequently but is occurs every few years; and it will not be possible > > to reuse old codes that are maintained for a long period). > > ISO 3166-1 already defines alpha-3 and numeric code elements, as well as > alpha-2. > But how to work with the 2 letters limitation when the world wants more stability in codes (this was an important reason why ISO 639 was not fully integrated in IETF tags, and why the IETF tags have chosen the stability by keeping also the codes that hbave been deleted in ISO 639, but only deprecated in IETF language tags (BCP47). We've already seen the famous reuse before 50 years (do you remember when CS was reassigned just a few months after it was discarded after an initial introduction for some months in Serbia-Montenegro?) ISO coding standard are known to be unstable. This would also be true of the UCS if Unicode did not push its stability pact with ISO! But now let's remembers that parts of ISO 3166 are also included (not fully) in BCP47 tags that require the stability. IT will prohibit reassignments by ISO (or if this happens, this will break BCP47 and et IETF will reject the change and will use another subtag if needed. So country codes cannot be reassigned (and we can expect many more merges/splits or changes of regimes in the many troubled areas of the world. -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon May 18 15:32:02 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 18 May 2015 21:32:02 +0100 Subject: Regexes, Canonical Equivalence and Backtracking of Input In-Reply-To: References: <20150518193545.51cb95b8@JRWUBU2> Message-ID: <20150518213202.19ef7cd2@JRWUBU2> On Mon, 18 May 2015 21:05:49 +0200 Philippe Verdy wrote: > 2015-05-18 20:35 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > The algorithm itself should be tractable - Mark Davis has published > > an algorithm to generate all strings canonically equivalent to a > > Unicode string, and what we need might not be so complex. > > > Even this algorithm from Mark Davis will fail in this case: How so? The regexp is \u0F73*, which is converted to a non-capturing (\u0F71\u0F72)*. Given a string 0F40 0F71 0F73 0F42 representing the trace, matching will fail at 0F40 and an attempt will be made starting at the 0F71. The input string handling part will then present a run of three non-starters: \u0F71 \u0F71 \u0F72 I think the process is even simpler than I first thought. The engine will look for a match for \u0F71, and take it from this list, leaving \u0F71 \u0F72. It will then look for a match for \u0F72, and take it form the list, leaving \u0F71. It will then look for a match for \u0F71, and take it from the list. It will then look for a match for \u0F72. It will fail, and then back track, disgorging the \0F71. The input 'stream' now looks like \u0F71 \u0F42. This will match nothing; it is after the matching substream. The matching substring is: None of 0F40, all of 0F71, the second part of 0F72 and none of 0F42. Its value, as a trace, is 0F71 0F72. > - You can use it easily to transform a regexp containing (\u0F73) > into a regexp containing (\u0F73|\u0F71\u0F72|\u0F71\u0F72) That is *not* what I am suggesting. The regex needs decomposing, but no other transformations. 
It is the string representing the input trace that is expanded.

> - But this leaves the same problem for unbounded repetitions with
> the "+" or "*" or "{m,}" operators.

Not at all - that is the beauty of the scheme. On the regex side, \u0F73* is as straightforward as non-capturing (\u0061\u0062)*. Putting back the unused fragments of the run of non-starters in the input trace is the most difficult part.

> Now all the problem is how to do the backtracking,

Yes, that may be more difficult than I thought. Comparing against literal characters is simple, but it may be more complicated when matching against a list of alternative characters. Backtracking schemes may not be set up to try the next character on a list of alternatives, e.g. so that pattern (\u0f72|\u0f71)\u0f72 matches input string 0F71 0F72. The alternative (\u0f72|\u0f71) would first take the 0F72, and only on backtracking would it take the 0F71 instead. This is an issue with traces that does not appear with strings.

Richard.

From verdy_p at wanadoo.fr Mon May 18 15:43:42 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 18 May 2015 22:43:42 +0200
Subject: Flag tags with U+1F3F3 and subtypes
In-Reply-To: <20150518131038.665a7a7059d7ee80bb4d670165c8327d.b91abf14bc.wbe@email03.secureserver.net>
References: <20150518131038.665a7a7059d7ee80bb4d670165c8327d.b91abf14bc.wbe@email03.secureserver.net>
Message-ID:

If ever the country codes used in BCP 47 become full (all pairs of letters used), then some time before this happens we could see new prefixes added before a new range of codes. It is possible to use a 1-letter prefix for new country/territory code extensions, but with some maintenance of BCP 47 parsing rules (notably, the letter used should not be reordered with other singleton prefixes).

But I feel it will first be simpler to assign a special 2-letter code like "C1-" followed by a new series of 2-letter country codes. (ccTLDs will survive; in fact, with the development of new gTLDs not limited to 2 characters, new countries will prefer asking for a more descriptive gTLD, even if they don't have a 2-letter ccTLD. Or 2-letter codes will be deprecated in favor of 3-letter codes, but the IETF will keep all the existing 2-letter ccTLDs as long as their sponsors support them and don't require changing to another TLD, even if this breaks existing URLs encoded throughout the web.)

There's no requirement for ISO 3166 codes to match exactly with a TLD in the global DNS (this has long been the case for the ".uk" ccTLD, because ".gb" is almost unused). But the stability of country codes is desirable as well in URLs (stored within encoded documents), for which it will be hard to make global substitutions: the solution could be to use tracking dates to resolve domain names, but the worldwide DNS currently does not support this type of query by date, registrars would not like to have to keep history files for long, and software/OS developers don't want to include and maintain such data in their domain name resolving clients.

It is however possible that at some point in the future the existing URLs requiring domain names will be deprecated in favor of unique IDs (e.g. based on IPv6): users won't see domain names, but labels retrieved from some whois-like database, or shown by search engines and possibly translated. It would be an improvement even if it breaks the business of existing registrars (however, registrars will still have business selling PKI-related services). These IDs can also be used in URIs.
In fact the DNS system is already antique in its design (with its very strange and complex encoding for IDNA that no one can read).

From richard.wordingham at ntlworld.com Mon May 18 15:46:44 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 18 May 2015 21:46:44 +0100
Subject: Regexes, Canonical Equivalence and Backtracking of Input
In-Reply-To: <83mw11ekt6.fsf@gnu.org>
References: <20150518193545.51cb95b8@JRWUBU2> <83mw11ekt6.fsf@gnu.org>
Message-ID: <20150518214644.024f8c42@JRWUBU2>

On Mon, 18 May 2015 22:40:21 +0300 Eli Zaretskii wrote:

>> Date: Mon, 18 May 2015 19:35:45 +0100
>> From: Richard Wordingham
>>
>> Mark Davis has published an algorithm to generate all strings
>> canonically equivalent to a Unicode string
>
> Where can I find the description of that algorithm?

Section 5 of http://unicode.org/notes/tn5/ . There's a lot of detail missing, and it's easy to overlook the Hangul syllables. The complete code is rather more complicated than it looks from the wording, especially if you want successive candidates on successive calls. You also need to include the legal permutations of the non-starters - the code as given only delivers the FCD canonical equivalents.

On further thought, I also think it's actually unnecessary for this application.

Richard.

From verdy_p at wanadoo.fr Mon May 18 15:56:47 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 18 May 2015 22:56:47 +0200
Subject: Regexes, Canonical Equivalence and Backtracking of Input
In-Reply-To: <20150518213202.19ef7cd2@JRWUBU2>
References: <20150518193545.51cb95b8@JRWUBU2> <20150518213202.19ef7cd2@JRWUBU2>
Message-ID:

Isn't it possible for your basic substitution to transform \u0F73 into a character class [\u0F71\u0F72\u0F73] that the regexp considers as a single entity to check? In that case, backtracking for matching \u0F73*\u0F72 is simpler: [\u0F71\u0F72\u0F73]*\u0F72, as it just requires backtracking over one character class (instead of one character).
It is also possible to transform \u0F73*\u0F72 into the really equivalent: (\u0F71\u0F72)*\u0F72 | (\u0F72\u0F71)*\u0F72 | (\u0F73)*\u0F72 (assuming that in the non-capturing group you are already performing canonical reorderings using counters - as many counters as there are distinct ccc values in these groups, excluding blockers, which create groups that are always matched separately without any need to backtrack "through" them: if this does not match at a blocking position, there's no other alternative possible, so this is a definitive non-match).
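The transformation the two posters are debating can be checked directly with Python's unicodedata module. The following is a minimal illustrative sketch (not either poster's actual engine, and the function name is the editor's own) showing that U+0F73 decomposes under NFD into U+0F71 followed by U+0F72, and how the example trace above looks once decomposed:

```python
import unicodedata

# U+0F73 TIBETAN VOWEL SIGN II canonically decomposes to
# U+0F71 TIBETAN VOWEL SIGN AA + U+0F72 TIBETAN VOWEL SIGN I.
assert unicodedata.normalize('NFD', '\u0F73') == '\u0F71\u0F72'

def decompose_pattern_literal(ch):
    """Replace a pattern literal by its NFD code points -- the
    preprocessing step Richard describes for the regex side."""
    return unicodedata.normalize('NFD', ch)

# The input trace 0F40 0F71 0F73 0F42 from the example above, in NFD:
trace = unicodedata.normalize('NFD', '\u0F40\u0F71\u0F73\u0F42')
print([hex(ord(c)) for c in trace])
# ['0xf40', '0xf71', '0xf71', '0xf72', '0xf42'] -- the run of three
# non-starters 0F71 0F71 0F72 sits between the two starters.
```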
From richard.wordingham at ntlworld.com Mon May 18 16:14:11 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 18 May 2015 22:14:11 +0100
Subject: Regexes, Canonical Equivalence and Backtracking of Input
In-Reply-To:
References: <20150518193545.51cb95b8@JRWUBU2> <20150518213202.19ef7cd2@JRWUBU2>
Message-ID: <20150518221411.4c508924@JRWUBU2>

On Mon, 18 May 2015 22:56:47 +0200 Philippe Verdy wrote:

> Isn't it possible for your basic substitution to transform \u0F73
> into a character class [\u0F71\u0F72\u0F73] that the regexp considers
> as a single entity to check?
> In that case, backtracking for matching \u0F73*\u0F72 is simpler:
> [\u0F71\u0F72\u0F73]*\u0F72, as it just requires backtracking over
> one character class (instead of one character).

I'm still waiting for your explanation of how your scheme for European diacritics (as used in SE Asia) would work. This thread is intended for the idea of using the regex to decide which character to take as the next character from the input trace. In the other thread, I'm still not sure whether you're working with traces or strings.

Richard.

From doug at ewellic.org Mon May 18 16:38:19 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 18 May 2015 14:38:19 -0700
Subject: Flag tags with U+1F3F3 and subtypes
Message-ID: <20150518143819.665a7a7059d7ee80bb4d670165c8327d.dd6af4f7c2.wbe@email03.secureserver.net>

Philippe Verdy wrote:

>> ISO 3166-1 already defines alpha-3 and numeric code elements, as well
>> as alpha-2.
>
> But how do we work within the two-letter limitation when the world
> wants more stability in codes? This was an important reason why ISO
> 639 was not fully integrated into IETF tags, and why the IETF tags
> chose stability by also keeping the codes that have been deleted in
> ISO 639, which are only deprecated in IETF language tags (BCP 47).

I assume you're aware of the extent of my involvement in BCP 47, so this is a semi-rhetorical question. If and when ISO 3166/MA manages to use up all of the remaining 336 unassigned code elements -- nearly half of the TOTAL possible code space of 676 two-letter combinations -- the corresponding numeric code elements will be assigned as BCP 47 region subtags instead.

> We've already seen the famous reuse well before 50 years (do you
> remember when CS was reassigned just a few months after it was
> discarded, after an initial introduction for some months for
> Serbia-Montenegro?)

What actually happened was, 'CS' was withdrawn for Czechoslovakia and then assigned to Serbia and Montenegro. At that time, the waiting period was five years; the 'CS' incident is what resulted in the change to 50 years.

> But now let's remember that parts of ISO 3166 are also included (not
> fully) in BCP 47 tags, which require stability. That will prohibit
> reassignments by ISO (or, if a reassignment happens, it will break
> BCP 47, and the IETF will reject the change and use another subtag if
> needed).

Again, I'm guessing you already know that I know how BCP 47 works. ISO 3166/MA can recycle alpha-2 code elements 50 years after withdrawal if they feel like it. BCP 47 can't prevent that. That's why BCP 47 has a mechanism to work around that possibility.

> So country codes cannot be reassigned (and we can expect many more
> merges/splits or changes of regimes in the many troubled areas of the
> world).

Changes of regimes don't usually result in new 3166 code elements. The same is true for merges (look at DE/DD or YE/YD). New and changed country names usually do.
--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From doug at ewellic.org Mon May 18 16:55:02 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 18 May 2015 14:55:02 -0700
Subject: [OT] RE: Flag tags with U+1F3F3 and subtypes
Message-ID: <20150518145502.665a7a7059d7ee80bb4d670165c8327d.255a63ba7a.wbe@email03.secureserver.net>

Philippe Verdy wrote:

> If ever the country codes used in BCP 47 become full (all pairs of
> letters used), then some time before this happens we could see new
> prefixes added before a new range of codes. It is possible to use a
> 1-letter prefix for new country/territory code extensions, but with
> some maintenance of BCP 47 parsing rules (notably, the letter used
> should not be reordered with other singleton prefixes).

This would be a major revision to BCP 47, it would have nothing to do with reordering, and it would not in any case involve 1-letter prefixes, which already have a different meaning. And the time frame we are talking about is reminiscent of Ken's estimate of when 17 planes will no longer be enough for Unicode.

> But I feel it will first be simpler to assign a special 2-letter code
> like "C1-" followed by a new series of 2-letter country codes

We actually thought about this stuff over in LTRU. Really. I'm not the least bit concerned about the DNS. Five years from now they could be assigning TLDs consisting entirely of emoji. This is no longer relevant to flag tags or anything else Unicode.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From verdy_p at wanadoo.fr Mon May 18 17:08:27 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 19 May 2015 00:08:27 +0200
Subject: Flag tags with U+1F3F3 and subtypes
In-Reply-To: <20150518143819.665a7a7059d7ee80bb4d670165c8327d.dd6af4f7c2.wbe@email03.secureserver.net>
References: <20150518143819.665a7a7059d7ee80bb4d670165c8327d.dd6af4f7c2.wbe@email03.secureserver.net>
Message-ID:

2015-05-18 23:38 GMT+02:00 Doug Ewell:

> Philippe Verdy wrote:
>
>> So country codes cannot be reassigned (and we can expect many more
>> merges/splits or changes of regimes in the many troubled areas of the
>> world).
>
> Changes of regimes don't usually result in new 3166 code elements. The
> same is true for merges (look at DE/DD or YE/YD). New and changed
> country names usually do.

I included merges only to be complete, because they frequently occur a little while after a split (though not with the former partner). But of course merges are much less frequent than splits. And in today's globalized world, splits are even easier than they were in the past (where merges were the result of invasions/wars/conquests). The rate of splits is in fact accelerating through history, even in countries living in peace; this does not mean that they terminate all their partnerships, just that they take the right to create their own alliances. There are reasons for them: cultural (language), national taxes, economic difficulties in some regions, unemployment, management of resources (water, constructible or cultivable soils), but the most important reasons are political (defiance between political parties, or brutality against minorities and mutual misunderstanding)...

In the last 50 years the most important changes came from decolonization and its independences (that was completed at the end of the 1970s). But now we are seeing splits into much smaller entities, and this can occur in many more places.
With ISO 3166-2 the situation within countries is much more complex and changes more frequently. In Europe, most countries are undergoing large changes in their administrative divisions. The changes that will occur next year in French regions are still not taken into account in ISO 3166-2, nor is the change that is already effective within one department, split in two parts with only one remaining as a department, the other being a group of communes erected into a new territorial collectivity taking all powers of its former department for local administration only, but with the national power still not divided in what is now a "circonscription départementale" with the same departmental prefecture as before the split. The hierarchical model of subdivisions has in fact lots of exceptions (look at Spain, the UK, Germany; it was already true for France and the US, but now it is also occurring even in the metropolitan area). In fact we can see several parallel layers of subdivisions, but for different legal roles/missions.

ISO 3166-1 also assumes that everything is a country, but that is already wrong for some dependent territories (not all) of France, the UK, the US, the Netherlands, Spain, and possibly some islands of China. And these codes also don't map correctly to effective national divisions (the encoding for claims in Antarctica remains ambiguous, depending on who uses the data). There are also reservations for things that are not countries but groups of countries (EU, WIPO areas...), and there could exist new codes for other international alliances (these look like "merges" except that they are not full merges and the entities continue to coexist separately).

From verdy_p at wanadoo.fr Mon May 18 17:25:33 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 19 May 2015 00:25:33 +0200
Subject: [OT] RE: Flag tags with U+1F3F3 and subtypes
In-Reply-To: <20150518145502.665a7a7059d7ee80bb4d670165c8327d.255a63ba7a.wbe@email03.secureserver.net>
References: <20150518145502.665a7a7059d7ee80bb4d670165c8327d.255a63ba7a.wbe@email03.secureserver.net>
Message-ID:

2015-05-18 23:55 GMT+02:00 Doug Ewell:

> Philippe Verdy wrote:
>
>> If ever the country codes used in BCP 47 become full (all pairs of
>> letters used), then some time before this happens we could see new
>> prefixes added before a new range of codes. It is possible to use a
>> 1-letter prefix for new country/territory code extensions, but with
>> some maintenance of BCP 47 parsing rules (notably, the letter used
>> should not be reordered with other singleton prefixes).
>
> This would be a major revision to BCP 47, it would have nothing to do
> with reordering,

It would have to do with reordering, because all subtags after the primary language subtag in BCP 47 are optional, and you can distinguish them only by their length *or* by the role assigned to specific singletons: there's already the "x" singleton exception (which is ordered at the end), but the other singletons are currently described as using a canonical order, and they are used only for encoding variants unrelated to region subtags or even to the languages.
Very few singletons are used in fact. (The singleton subtags occurring at the start of the tag are also treated separately from the others: they could be used to support new syntaxes for BCP 47 tags, but for now we just have "i-", deprecated but still valid, and "x-" for private use; for all other letters there's no parsing defined for now, their syntax is unknown, and they are not interchangeable without a standard, so they are used only for private use. Another constraint comes from the length limit of subtags: the first subtag is either a special singleton or a primary language code using 2 or 3 letters for now. Some BCP 47 use an empty first subtag, i.e. the tag starts with a hyphen; double hyphens could be used as extensions to change the parsing rules locally and possibly return to the next logical subtag, and could be used to encode international organizations without needing a formal "exceptional reservation" in ISO 3166-1; for example "*-EU" could have been encoded as "--O-EU", and we could have the same system for NATO, EEA, EFTA... There's still ample space for extensions of parsing rules in BCP 47, but not in ISO 3166.)

ISO 3166 also encodes some 4-letter codes, but they are not used in BCP 47 (so there's no confusion with 4-letter script codes).

From doug at ewellic.org Mon May 18 17:50:50 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 18 May 2015 15:50:50 -0700
Subject: [OT] RE: Flag tags with U+1F3F3 and subtypes
Message-ID: <20150518155050.665a7a7059d7ee80bb4d670165c8327d.cdebbe1b8e.wbe@email03.secureserver.net>

This is why I knew I would regret it. Clearing up some errors here. No more posts from me on this non-Unicode topic after this one.

Philippe Verdy wrote:

>> This would be a major revision to BCP 47, it would have nothing to do
>> with reordering,
>
> It would have to do with reordering, because all subtags after the
> primary language subtag in BCP 47 are optional, and you can
> distinguish them only by their length *or* by the role assigned to
> specific singletons: there's already the "x" singleton exception
> (which is ordered at the end), but the other singletons are currently
> described as using a canonical order, and they are used only for
> encoding variants unrelated to region subtags or even to the
> languages.

All non-initial singletons introduce an extension, except for 'x', which introduces a private-use sequence, and which must be last. Even if an extension were defined to hold top-level region information, WHICH WILL NEVER HAPPEN, it would not matter whether that extension appeared before or after other extensions, because it would be an extension and not a region subtag.

> but for now we just have "i-", deprecated but still valid,

"i-" is not deprecated.

> for all other letters there's no parsing defined for now, their syntax
> is unknown, and they are not interchangeable without a standard, so
> they are used only for private use

Extension 't' was defined in 2011 and 'u' in 2010. They have well-defined syntax, specified in RFC 6497 and 6067 respectively. Undefined singletons may not be used for private use.

> some BCP47 use an empty first subtag, i.e. the tag starts with a
> hyphen;

Absolutely, utterly false.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????
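As background to the singleton rules Doug cites: the following is a hedged, illustrative sketch of how non-initial singletons partition a tag into extensions and a private-use sequence. It is not a validating BCP 47 parser (real validation needs the full RFC 5646 ABNF and the IANA registry, and grandfathered "i-" tags are out of scope here); the function name is the editor's own.

```python
def split_extensions(tag):
    """Partition a language tag into (main subtags, extensions, private use).
    Non-initial singletons introduce extensions; 'x' introduces the
    private-use sequence and must come last. Illustrative only."""
    subtags = tag.lower().split('-')
    main, extensions, private = [], {}, []
    current = None
    for st in subtags:
        if len(st) == 1:                  # a singleton subtag
            if st == 'x':
                current = private         # everything after 'x' is private use
            else:
                if st in extensions:
                    raise ValueError("repeated singleton: " + st)
                current = extensions.setdefault(st, [])
        elif current is not None:
            current.append(st)
        else:
            main.append(st)
    return main, extensions, private

# Extension 'u' (RFC 6067) plus a private-use sequence on a Danish tag:
print(split_extensions("da-DK-u-ca-gregory-x-private"))
# (['da', 'dk'], {'u': ['ca', 'gregory']}, ['private'])
```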
From verdy_p at wanadoo.fr Mon May 18 18:25:54 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 19 May 2015 01:25:54 +0200
Subject: Regexes, Canonical Equivalence and Backtracking of Input
In-Reply-To: <20150518221411.4c508924@JRWUBU2>
References: <20150518193545.51cb95b8@JRWUBU2> <20150518213202.19ef7cd2@JRWUBU2> <20150518221411.4c508924@JRWUBU2>
Message-ID:

I don't work with strings, but with what you seem to call "traces", though I call them sets of states. They are in fact bitsets, which may be compacted or just stored as arrays of bytes containing just 1 useful bit, which may be a bit faster; byte arrays are just simpler to program. These live in a stack. (I'll use bitsets later to make the structure more compact if needed, but for now this is fast enough and not memory-intensive, even for large regexps with many repetitions with "+/*/{m,n}" or variable parts.) The internal matcher uses NFD, but needs to track the positions in the original buffered input for returning captured matches.

There's some optimization to reduce the size of the bitsets, by defining classes. The representation of classes in Unicode is more challenging than with plain ASCII or ISO 8859-*; for this reason I limit their length (the difference between the smallest and highest code point), and above this size the classes are just defined as a sorted string of pairs of code points: I can perform a binary search in that string and look at the position (with an even position the character is part of the class; with an odd position, the character is not part of it).

Thanks to a previous message you posted, I noted that my code does not work correctly with Hangul precomposed syllables. (I perform the decomposition to NFD of the input on the fly in the input buffer, but the buffer is incorrectly advanced when there's a match to the next character, and it can skip one or two characters of the original input instead of code points in the NFD-transformed input.) I don't have extensive cases for testing Hangul; I have much more for Latin, Greek, Cyrillic and Arabic, but also too few for Hebrew, where "pathological" cases of regexps are certainly more likely to occur than in Latin, even compared with Vietnamese and its frequent double diacritics.

For now, with the complex cases of replacements, I have no precise syntax defined for specifying replacements as a simple string with placeholders. I just allow these matches to be passed as objects (rather than just strings) to a callback that performs the substitutions itself, using the array of captures given by the engine to the callback. I have no idea for now about how to handle the special cases occurring when computing the actual replacements: the callback can insert/delete subsequences anywhere in the input buffer, which is limited in size by the extent of $0, plus any intermediate characters when there's a discontinuity, plus their left and right contexts when the match still does not include the full combining sequences. (For most use cases, the left context is empty, but the right context is frequently non-empty and contains all combining characters over the last base which is part of the match. The callback also does not have to modify the input buffer if it does not want to perform replacements in it; in that case the input buffer is read-only and I don't need to fill the contexts, which remain empty.) There are also left and right context variables for *each* capture group (some of them may be partly or fully inside another returned capture group).
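The buffer-advance bug Philippe describes above (one original character becoming several NFD code points) is typically avoided by decomposing up front and keeping a map from each NFD code point back to the index of the original character it came from. A minimal sketch under that assumption: it decomposes character by character and then applies the canonical ordering step itself, which is enough to illustrate the bookkeeping, though a real engine would use its normalizer's own API.

```python
import unicodedata

def nfd_with_index_map(text):
    """Decompose to NFD while recording, for every NFD code point, the
    index of the original character it came from. A precomposed Hangul
    syllable maps to 2-3 jamo that all carry the same original index,
    so a matcher can advance the original buffer correctly."""
    pairs = []  # (nfd code point, index in original text)
    for i, ch in enumerate(text):
        for c in unicodedata.normalize('NFD', ch):
            pairs.append((c, i))
    # Canonical Ordering Algorithm: bubble-sort adjacent nonzero
    # combining classes into nondecreasing order (stable).
    changed = True
    while changed:
        changed = False
        for j in range(len(pairs) - 1):
            a = unicodedata.combining(pairs[j][0])
            b = unicodedata.combining(pairs[j + 1][0])
            if a > b > 0:
                pairs[j], pairs[j + 1] = pairs[j + 1], pairs[j]
                changed = True
    return ''.join(c for c, _ in pairs), [i for _, i in pairs]

# U+AC01 HANGUL SYLLABLE GAG decomposes to three jamo, all mapping to
# original index 0; U+0F73 decomposes to two vowel signs at index 1.
nfd, idx = nfd_with_index_map('\uAC01\u0F73')
print([hex(ord(c)) for c in nfd], idx)
# ['0x1100', '0x1161', '0x11a8', '0xf71', '0xf72'] [0, 0, 0, 1, 1]
```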
Finally, a question: I suppose that, like many programmers, you have read the famous "green dragon" book on compilers by Aho/Sethi/Ullman. I can understand the terminology they use when speaking about automata (which is found in many other places), but apparently you are using some terms that I have to guess from their context. Good books on the subject are now becoming difficult to find (or they are more expensive now), and too difficult to use on the web. (For such very technical topics, it really helps to have a printed copy that you can annotate, explore, or have beside you instead of on a screen, and printing ebooks is not an option if they are voluminous.) Maybe you have other books to recommend. But finding these books in libraries is now becoming difficult when many are closing or reducing their collections (and I don't like buying books on the Internet).

For the rest, I tend to just describe what I've made or used or experimented with, even if the terms are not the best ones (some of my references are in French, and difficult to translate). On difficult topics like this one, I'm not paid to perform research and I can only do that in my spare time, from time to time, until I can make something stable enough for a limited use (without experimental features). In the past I could work on such research topics, but now we are pressed to use existing libraries and not spend a lot of time; we sell smaller incremental but limited improvements, and we know what is voluntarily limited and left unimplemented.

From verdy_p at wanadoo.fr Mon May 18 18:36:21 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 19 May 2015 01:36:21 +0200
Subject: [OT] RE: Flag tags with U+1F3F3 and subtypes
In-Reply-To: <20150518155050.665a7a7059d7ee80bb4d670165c8327d.cdebbe1b8e.wbe@email03.secureserver.net>
References: <20150518155050.665a7a7059d7ee80bb4d670165c8327d.cdebbe1b8e.wbe@email03.secureserver.net>
Message-ID:

2015-05-19 0:50 GMT+02:00 Doug Ewell:

>> but for now we just have "i-", deprecated but still valid,
>
> "i-" is not deprecated.

In the IANA database they are all replaced. I call that "deprecated" a bit abusively, but there's no longer any interest in them.

>> for all other letters there's no parsing defined for now, their
>> syntax is unknown, and they are not interchangeable without a
>> standard, so they are used only for private use
>
> Extension 't' was defined in 2011 and 'u' in 2010. They have
> well-defined syntax, specified in RFC 6497 and 6067 respectively.
You are speaking of extension subtags after the initial subtag; I did not discuss them. I was just speaking about the initial subtag (before the first hyphen), where "t" and "u" are not defined: only "x" and "i" are defined there ("i" is not defined among the other singletons for trailing subtags).

> Undefined singletons may not be used for private use.

For private use (meaning NOT for interchange) NOTHING is forbidden; you are never bound to any standard. There are lots of places where these private extensions are used and not discussed.

>> some BCP47 use an empty first subtag, i.e. the tag starts with a
>> hyphen;
>
> Absolutely, utterly false.

Absolutely, utterly true, but a word was missing in my sentence: "some BCP 47 extensions" (which are private, local only to a specific software in its internal data).

From richard.wordingham at ntlworld.com Mon May 18 19:44:17 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 19 May 2015 01:44:17 +0100
Subject: Regexes, Canonical Equivalence and Backtracking of Input
In-Reply-To:
References: <20150518193545.51cb95b8@JRWUBU2> <20150518213202.19ef7cd2@JRWUBU2> <20150518221411.4c508924@JRWUBU2>
Message-ID: <20150519014417.38d7115a@JRWUBU2>

On Tue, 19 May 2015 01:25:54 +0200 Philippe Verdy wrote:

> I don't work with strings, but with what you seem to call "traces",

For the concept of traces, Wikipedia suffices: https://fr.wikipedia.org/wiki/Mono%C3%AFde_des_traces . As far as text manipulation is concerned, the word 'trace' is an idealisation of how Latin text is written. Base letters advance the writing point, so they commute with nothing - canonical combining class 0. Ideally, marks of different canonical combining classes do not interact with one another when writing, so they commute. In general, marks of the same canonical combining class interact with one another, be it only to move the subsequent one further from the base letter, so they do not commute.

The traces I refer to are the equivalence classes of Unicode strings modulo canonical equivalence. To apply the theory, I have to regard decomposable characters as notations for sequences of 1 to 4 indecomposable characters. The notion works with compatibility equivalence, and one could use a stronger notion of equivalence, so that compatibility ideographs did not have singleton decompositions.

Thus, as strings, \u0323\u0302 and \u0302\u0323 are distinct, but as traces, they are identical. The lexicographic normal form that is most useful is simply NFD. The indecomposable characters are ordered by canonical combining class, and beyond that it doesn't matter; one may as well use the codepoint.

> but that I call sets of states (they are in fact bitsets, which may be
> compacted or just stored as arrays of bytes containing just 1 useful
> bit, which may be a bit faster; byte arrays are just simpler to
> program), in a stack (I'll use bitsets later to make the structure
> more compact, if needed, but for now this is fast enough and not
> memory-intensive even for large regexps with many repetitions with
> "+/*/{m,n}" or variable parts).

Your 'bitset' sounds like a general-purpose type, and like an implementation detail that surfaces in your discussion.

> The internal matcher uses NFD, but needs to track the positions in
> the original buffered input for returning captured matches.

That's how I'm working.
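Richard's point above that \u0323\u0302 and \u0302\u0323 are the same trace can be checked mechanically: two strings are canonically equivalent exactly when their NFDs are equal. A small illustration using U+0323 COMBINING DOT BELOW (ccc 220) and U+0302 COMBINING CIRCUMFLEX ACCENT (ccc 230) on an arbitrary base letter:

```python
import unicodedata

s1 = 'q\u0323\u0302'   # q + dot below + circumflex
s2 = 'q\u0302\u0323'   # q + circumflex + dot below

# Distinct as strings...
assert s1 != s2
# ...but identical as traces: marks of *different* combining classes
# commute, so both normalize to the same NFD (the ccc-sorted form).
assert unicodedata.normalize('NFD', s1) == unicodedata.normalize('NFD', s2)

# Marks of the *same* class do not commute: two ccc-220 marks keep
# their relative order, so these are two different traces.
t1 = 'q\u0323\u0324'   # dot below, then diaeresis below
t2 = 'q\u0324\u0323'   # diaeresis below, then dot below
assert unicodedata.normalize('NFD', t1) != unicodedata.normalize('NFD', t2)
```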
I do not regard decomposable characters as atomic; I am emotionally happy working with fractions of characters.

> ... Greek, Cyrillic and Arabic, but also too few for Hebrew, where
> "pathological" cases of regexps are certainly more likely to occur
> than in Latin, even compared with Vietnamese and its frequent double
> diacritics.

I was just thinking that respecting canonical equivalence might be very useful for Hebrew, particularly when dealing with text with accents.

> Finally, a question:
>
> I suppose that, like many programmers, you have read the famous
> "green dragon" book on compilers by Aho/Sethi/Ullman. I can
> understand the terminology they use when speaking about automata
> (which is found in many other places), but apparently you are using
> some terms that I have to guess from their context.

No, I started off by hunting the web to try and work out what was special about a regular expression, and found the articles in Wikipedia quite helpful. When working out how to make matching respect canonical equivalence, I started out with normalising strings to NFD. Only after I had generalised the closure properties of regular languages from strings to these representative forms (with the exception of Kleene star) did I finally discover what I had long suspected: that I was not the first person to investigate regular expressions on non-free monoids. What does surprise me is that I cannot find any evidence that anyone else has made the connection between trace monoids and Unicode strings under canonical equivalence. I would like to update the article on the trace monoid with its most important example, Unicode strings under canonical equivalence, but, alas, that seems to be 'original research'!

I'm beginning to think that 'letting the regex choose the input character' might be a better method of dealing with interleaving of subexpressions even for 'non-deterministic' engines, i.e. those which follow all possible paths in parallel. I'll have to compare the relevant complexities.

> Good books on the subject are now becoming difficult to find (or they
> are more expensive now), and too difficult to use on the web. Maybe
> you have other books to recommend.

Google Books, in English, gives access to a very helpful chapter on regular languages in trace monoids in 'The Book of Traces'. I found Russ Cox's Internet notes on regular expressions helpful, though not everyone agrees with his love of non-determinism.

Richard.

From mark at macchiato.com Tue May 19 00:18:59 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Mon, 18 May 2015 22:18:59 -0700
Subject: Tag characters
In-Reply-To: <794493C42D714C3C8A58D2F45AA36663@DougEwell>
References: <794493C42D714C3C8A58D2F45AA36663@DougEwell>
Message-ID:

A few notes. A more concrete proposal will be in a PRI to be issued soon, and people will have a chance to comment more then. (I'm not trying to discourage discussion, just pointing out that there will be something more concrete relatively soon to comment on -- people are pretty busy getting 8.0 out the door right now.)

The principal reason for 3-digit codes is that this is the mechanism used by BCP 47 in case ISO screws up codes (as they did for CS). The syntax does not need to follow the 3166 syntax - the codes correspond but are not the same anyway. So we didn't see the necessity for the hyphen, syntactically.

There is a difference between EU and UN; the former is in BCP 47. That being said, we could look at making the exceptionally reserved codes valid for this purpose (or at least the UN code). It appears that there are only 3 exceptionally reserved codes that aren't in BCP 47: EZ, UK, UN.

Just because a code is valid doesn't mean that there is a flag associated with it, just as the fact that you can have the BCP 47 code ja-Ahom-AQ doesn't mean that it denotes anything useful. I'd expect vendors not to waste time with non-existent flags. However, we could also discuss having a mechanism in CLDR to help provide guidelines as to which subdivisions are suitable as flags.

Mark

*« Il meglio è l'inimico del bene »*

On Sat, May 16, 2015 at 10:07 AM, Doug Ewell wrote:

> L2/15-145R says:
>
>> On some platforms that support a number of emoji flags, there is
>> substantial demand to support additional flags for the following:
>> [...]
So we didn't see the necessity for the hyphen, syntactically. There is a difference between EU and UN; the former is in BCP47. That being said, we could look at making the exceptionally reserved codes valid for this purpose (or at least the UN code). It appears that there are only 3 exceptionally reserved codes that aren't in BCP47: EZ, UK, UN. Just because a code is valid doesn't mean that there is a flag associated with it. Just like the fact that you can have the BCP47 code ja-Ahom-AQ doesn't mean that it denotes anything useful. I'd expect vendors to not waste time with non-existent flags. However, we could also discuss having a mechanism in CLDR to help provide guidelines as to which subdivisions are suitable as flags. Mark *? Il meglio ? l?inimico del bene ?* On Sat, May 16, 2015 at 10:07 AM, Doug Ewell wrote: > L2/15-145R says: > > On some platforms that support a number of emoji flags, there is >> substantial demand to support additional flags for the following: >> [...] >> Certain supra-national regions, such as Europe (European Union flag) >> or the world (e.g. United Nations flag). These can be represented >> using UN M49 3-digit codes, for example "150" for Europe or "001" for >> World. >> > > These are uncomfortable equivalence classes. Not all countries in Europe > are members of the European Union, and the concept of "United Nations" is > not really the same by definition as "all countries in the world." > > The remaining UN M.49 code elements that don't have a 3166-1 equivalent > seem wholly unsuited for this mechanism (and those that do, don't need it). > There are no flags for "Middle Africa" or "Latin America and the Caribbean" > or "Landlocked developing countries." > > Some trans-national organizations might _almost_ seem as if they could be > shoehorned into an M.49 code element, like identifying 035 "South-Eastern > Asia" with the ASEAN flag, but this would be problematic for the same > reasons as 150 and 001. > > Among the ISO 3166-1 "exceptionally reserved" code elements are "EU" for > "European Union" and "UN" for "United Nations." If these flags are the use > cases, why not simply use those alpha-2 code elements, instead of burdening > the new mechanism with the 3-digit syntax? > > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue May 19 07:57:58 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 19 May 2015 14:57:58 +0200 Subject: Tag characters In-Reply-To: References: <794493C42D714C3C8A58D2F45AA36663@DougEwell> Message-ID: 2015-05-19 7:18 GMT+02:00 Mark Davis ?? : > There is a difference between EU and UN; the former is in BCP47. That > being said, we could look at making the exceptionally reserved codes valid > for this purpose (or at least the UN code). It appears that there are only > 3 exceptionally reserved codes that aren't in BCP47: EZ, UK, UN. > There are also reserved codes for WIPO areas; there are special codes requested by ITU and UPU or not removed from ISO3166 also on their demand for maintaining their own standards (may be there will be other codes requested by IATA and OACI or some international railways organisation, or maritime organisation for oceans in the "international waters"). 
Thankfully, for now we don't have to handle a specific "region" code for the Moon or "divisions" of the solar system, or even for some groups of orbital airspace over the Earth (from stratospheric to geostationary), as for now they are still considered international (and country laws only apply to individual pieces of equipment, or when they have to fall back to the ground, or preferably the oceans)... We could as well imagine other regions like the poles, or hemispheres, or 1-hour (15°) bands of longitude (excluding polar areas within the arctic/antarctic circle, or within the ±85° circle commonly used in geography for showing maps with Mercator projections).

There are various standards that define codes for their regions; some of them have political importance, and some have specific localized data associated with them, for which there must not exist collisions with existing or future ISO 3166-1 country codes. For such applications, however, applications should use the concept of "namespace" to qualify each code source (ISO 3166 being just one of them, the IETF being another, the local application using another namespace if needed for its regions; the same remark also applies if there's a need for private codes for "pseudo-languages" or "pseudo-language-variants" or "pseudo-scripts"), and with the mechanism of namespaces you could even track versions (like it is used in XMLNS).

From verdy_p at wanadoo.fr Tue May 19 07:58:33 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 19 May 2015 14:58:33 +0200
Subject: Regexes, Canonical Equivalence and Backtracking of Input
In-Reply-To: <20150519014417.38d7115a@JRWUBU2>
References: <20150518193545.51cb95b8@JRWUBU2> <20150518213202.19ef7cd2@JRWUBU2> <20150518221411.4c508924@JRWUBU2> <20150519014417.38d7115a@JRWUBU2>
Message-ID:

2015-05-19 2:44 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

>> Good books on the subject are now becoming difficult to find (or they
>> are more expensive now), and too difficult to use on the web. Maybe
>> you have other books to recommend.
>
> Google Books, in English, gives access to a very helpful chapter on
> regular languages in trace monoids in 'The Book of Traces'.

[OT] It's interesting to see that books on this topic were published mostly after 1994. As I finished my training around that period, the subject was largely not covered before; and now that I live in a small city with no good scientific library, finding books in English on such topics is extremely rare (the only books I see are those published in French in the "for Dummies" series, and I find them completely uninteresting). As a consequence I buy many fewer scientific books now.

However, Wikipedia is not a convenient place for extensive (but progressive) coverage of a topic. (The one-page limit has a consequence: it's difficult to learn from these articles, and you can read them only if you already know most of the covered topics, or else you have to navigate randomly over many pages through random links.) Wikipedia remains useful only if you can isolate your search to a few smaller subtopics.
Wikibooks and Wikisource would be more useful for such extensive studies, but their content is very small. (For legal reasons, Wikisource cannot contain many scientific books about theories that were written after WW2: unfortunately this covers almost all research performed on computing theories, which exploded only after the 1960s, and in many areas the research was also protected by extensive patents in addition to copyrights; so the interesting books are published in English, extremely rarely translated, have a limited distribution, are expensive, and are found only in very few libraries, and only in some cities that have a scientific university; public libraries also don't have these books, which are too expensive.)

Now there's the net, but even Google Books exposes only some pages. (For the rest, Google Books proposes books that are even more expensive than in normal bookshops, and from random sellers that are frequently not trustworthy: e.g. I will never buy anything from Amazon if Amazon is not the seller, or from other similar large platforms on which you don't know who the seller is, or where the seller also wants us to pay abusive delivery/shipping costs without giving any warranty on the product and without even allowing us to trace the order; there are too many abusers, or sellers of products with severe defects; I prefer using French online selling platforms; in addition this saves money on taxes if the seller is in the EU; otherwise we experience long delivery delays in customs, and we also need to pay the tax on delivery, in addition to the initial cost, plus the currency exchange fees charged by the bank; all of these can easily double the total cost, and in the end there may also be a big disappointment with the product, and it's impossible to return it and get a refund.)

In summary, it is really bad that libraries are disappearing in many places, or are reduced to selling only a limited catalog "for the dummies" or popular books advertised in the media. The variety of books available for sale is decreasing dramatically now. The net cannot replace the books that you want to read slowly and keep as references for later reuse... except if the e-books you can buy online offer an option to get a "print on demand" of good quality, with reasonable costs and delays for the delivery (some French editors are proposing this "on demand printing" service, even for books from some other foreign editors). Note that this is not limited to scientific books; the system could be used for delivering all kinds of books (including literature, photography, magazines, newspapers, or rare research papers available only in one public university library, which could get some fees helping them to renew their own purchases...).

From doug at ewellic.org Tue May 19 10:19:09 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 19 May 2015 08:19:09 -0700
Subject: Tag characters
Message-ID: <20150519081909.665a7a7059d7ee80bb4d670165c8327d.adffac01f6.wbe@email03.secureserver.net>

Mark Davis ☕️ wrote:

> A more concrete proposal will be in a PRI to be issued soon, and
> people will have a chance to comment more then.

I'll hold off on most other questions until the PRI appears.

> The principal reason for 3-digit codes is that this is the mechanism
> used by BCP 47 in case ISO screws up codes (as they did for CS).

Hopefully the MA will adhere to the new 50-year limit.
The example given in the proposal talked about trans-national flags.

> The syntax does not need to follow the 3166 syntax - the codes
> correspond but are not the same anyway. So we didn't see the necessity
> for the hyphen, syntactically.

Well, the codes are the same, but you're defining a new syntax, so you get to remove the hyphen if you want to. But again, the proposal didn't say that.

> There is a difference between EU and UN; the former is in BCP 47.

I didn't know that was relevant to flag tagging.

> Just because a code is valid doesn't mean that there is a flag
> associated with it.

Of course not. I'd also not expect CLDR or Unicode or even vendors to keep track of every state and territory flag around the world. Vendors will support some subset of flags of their choice, just as they currently do, and that's consistent with existing Unicode principles about not having to display every possible character.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From wjgo_10009 at btinternet.com Tue May 19 11:25:37 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 19 May 2015 17:25:37 +0100 (BST)
Subject: Tag characters
Message-ID: <3532384.57721.1432052737187.JavaMail.defaultUser@defaultHost>

Doug Ewell wrote:

> Hopefully the MA will adhere to the new 50-year limit. The example
> given in the proposal talked about trans-national flags.

What is MA please?

A 50-year limit seems far too short a time. With that figure, a document could have its meaning retrospectively changed at least 20 years before its copyright runs out, and maybe a lot longer before its copyright runs out, maybe as much as 80 years before its copyright runs out, or even longer!

Surely for archiving our culture, and the British Library is actively archiving, there should never be a retrospective change of meaning.

William Overington

19 May 2015

From doug at ewellic.org Tue May 19 12:01:14 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 19 May 2015 10:01:14 -0700
Subject: Tag characters
Message-ID: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net>

William_J_G Overington wrote:

>> Hopefully the MA will adhere to the new 50-year limit.
>
> What is MA please?

Maintenance Agency: http://www.iso.org/iso/home/standards/country_codes.htm

> A 50-year limit seems far too short a time.

There are two types of people: those who feel 50 years is too short, and those who feel it is too long. Fifty years is much better than five, which was the previous limit.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From petercon at microsoft.com Tue May 19 22:22:28 2015
From: petercon at microsoft.com (Peter Constable)
Date: Wed, 20 May 2015 03:22:28 +0000
Subject: Tag characters
In-Reply-To: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net>
References: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net>
Message-ID:

Evidently there are more than two types of people. There are those who feel 50 years is long enough; there are others who feel that five years is long enough; there are likely others who feel that 75 or 30 or some other value is long enough. Then there are also those who feel that any finite length is probably not long enough.
Peter

From wjgo_10009 at btinternet.com Wed May 20 11:29:28 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 20 May 2015 17:29:28 +0100 (BST)
Subject: Tag characters
In-Reply-To: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net>
References: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net>
Message-ID: <7859769.54831.1432139368462.JavaMail.defaultUser@defaultHost>

Peter Constable wrote as follows.

> Evidently there are more than two types of people. There are those who
> feel 50 years is long enough; there are others who feel that five
> years is long enough; there are likely others who feel that 75 or 30
> or some other value is long enough. Then there are also those who feel
> that any finite length is probably not long enough.

Unicode is about long-term stability. Hopefully the people in charge of the codes to be used for the flags will agree never to reuse a code.

Whether they do or not, would it be good to add an option into the tag coding of the flags whereby at the end one may optionally add TAG COLON and then at least four TAG DIGIT characters, those TAG DIGIT characters representing a year?

This feature would be ready if a future archivist finds the need to edit a text from years before so that it would display as its author intended, and indeed an author could use the method now so as to lock in his or her meaning.

This could also be of use now so as to display such items as the flag of the USA at various historical periods. It would be helpful if a particular year were chosen for normalization purposes: for example, so that the flag of the USA used in the 1940s and most of the 1950s would have one particular year rather than just any year within the period when that particular design of flag was in use. Also for other flags at various historical periods.

It has been speculated that, had Scotland left the United Kingdom as a result of the referendum last year (in the event, the people voted for Scotland to stay in the United Kingdom), the flag of the United Kingdom would have been changed, though some people advocated keeping it the same anyway.

William Overington

20 May 2015

From doug at ewellic.org Wed May 20 12:35:34 2015
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 20 May 2015 10:35:34 -0700
Subject: Tag characters
Message-ID: <20150520103534.665a7a7059d7ee80bb4d670165c8327d.e4427fe41b.wbe@email03.secureserver.net>

William_J_G Overington wrote:

> Hopefully the people in charge of the codes to be used for the flags
> will agree never to reuse a code.

Normally I would completely agree about the need for archival stability.

In this case, however, we are talking about flags used primarily as emoji, like the one in my signature block. People will pop these flags into their text messages alongside "party" or "celebration" icons.
I'm not sure the requirement for stability is quite as critical as it might be.

However...

> Whether they do or not, would it be good to add an option into the tag
> coding of the flags whereby at the end one may optionally add TAG
> COLON and then at least four TAG DIGIT characters, those TAG DIGIT
> characters representing a year?

It's remarkable how similar this suggestion is to a discussion between Philippe and me two years ago. There is currently no well-known coding system for flags -- the owner of the "Flags of the World" site doesn't know of one -- and there should be. (The term "flag code" already has two meanings that are very different from this, which makes it hard to find information.)

Getting UTC to accept the extended syntax of a standard like this would, of course, require that the standard gain reasonable acceptance and popularity beforehand. Requiring it to become an ISO standard might not be unreasonable.

If you want to discuss this specific idea further, please write to me privately and *not to the list*.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From verdy_p at wanadoo.fr Wed May 20 13:38:14 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 20 May 2015 20:38:14 +0200
Subject: Tag characters
In-Reply-To: <20150520103534.665a7a7059d7ee80bb4d670165c8327d.e4427fe41b.wbe@email03.secureserver.net>
References: <20150520103534.665a7a7059d7ee80bb4d670165c8327d.e4427fe41b.wbe@email03.secureserver.net>
Message-ID:

Well, for now a reasonably stable standard exists: URLs, which can point to a collection of page names (each site can choose its own registry to name/encode the flags). URLs then return images (you can make a site that returns images in several formats and with variable sizes, or with some transforms such as rotations, flips, animations...).

Instead of just isolated URLs, you can organize them with a base URL, or a static URL with a query (acting as a resolver address), and then append the URN (the name or code of the flag, which can include historic variants), and then allow the base URL to be replaced: keep just part of the URL (the end of the pathname, or part of the query string) as "standard" and you get what is generally termed a "mirror". Mirrors, however, are not necessarily bound to remain on the web; they can be any local store (e.g. a local file, or a folder in your filesystem).

Basically, even the existing FOTW site (and its mirrors) can already be seen as supporting these relatively stable URNs (provided that the site is not constantly restructuring its URLs, and file names are kept, or at least resolved by keeping internal redirecting links).

So what is needed is just a way to support URLs. However, URLs today can be IRIs and contain most of Unicode, and we cannot duplicate this code. It is however possible to do that by using the character sets used by Punycode (for domain names). But if FOTW just designs a naming convention for the paths it supports, so that it uses only a restricted set (ASCII letters, digits, and punctuation, with only some restrictions on slashes and controls), it is possible to use them as partial path names (excluding also file extensions in file names) that can be used as URNs and act as identifiers (all other parameters, such as size, transforms and image formats, should be separate parameters). And with this restricted set, it is possible to encode them in a stable (but still very extensible) way.
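The tag-character mechanism this thread keeps circling maps ASCII to the U+E0000 tag block and terminates with U+E007F CANCEL TAG. A minimal sketch of that encoding follows; note that the base character is an assumption here (the thread's subject uses U+1F3F3 WAVING WHITE FLAG, while the mechanism as eventually standardized in UTS #51 uses U+1F3F4 WAVING BLACK FLAG with lowercase subdivision codes and no hyphen), and the function name is the editor's own.

```python
TAG_OFFSET = 0xE0000          # TAG SPACE..TAG TILDE shadow ASCII 0x20..0x7E
CANCEL_TAG = '\U000E007F'     # ends the tag sequence

def flag_tag_sequence(subdivision_code, base='\U0001F3F3'):
    """Encode e.g. 'gbsct' (Scotland) as base + tag characters + CANCEL TAG.
    The default base is the draft's U+1F3F3; swap in U+1F3F4 for the form
    that was eventually standardized."""
    tags = ''.join(chr(TAG_OFFSET + ord(c)) for c in subdivision_code)
    return base + tags + CANCEL_TAG

seq = flag_tag_sequence('gbsct')
print(['U+%04X' % ord(c) for c in seq])
# ['U+1F3F3', 'U+E0067', 'U+E0062', 'U+E0073', 'U+E0063', 'U+E0074', 'U+E007F']
```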
2015-05-20 19:35 GMT+02:00 Doug Ewell : > William_J_G Overington > wrote: > > > Hopefully the people in charge of the codes to be used for the flags > > will agree never to reuse a code. > > Normally I would completely agree about the need for archival stability. > > In this case, however, we are talking about flags used primarily as > emoji, like the one in my signature block. People will pop these flags > into their text messages alongside "party" or "celebration" icons. I'm > not sure the requirement for stability is quite as critical as it might > be. > > However... > > > Whether they do or not, would it be good to add an option into the tag > > coding of the flags whereby at the end one may optionally add TAG > > COLON then at least four TAG DIGIT characters, those TAG DIGIT > > characters representing the year? > > It's remarkable how similar this suggestion is to a discussion between > Philippe and me two years ago. There is currently no well-known coding > system for flags -- the owner of the "Flags of the World" site doesn't > know of one -- and there should be. (The term "flag code" already has > two meanings that are very different from this, which makes it hard to > find information.) > > Getting UTC to accept the extended syntax of a standard like this would, > of course, require that the standard gain reasonable acceptance and > popularity beforehand. Requiring it to become an ISO standard might not > be unreasonable. > > If you want to discuss this specific idea further, please write to me > privately and *not to the list*. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed May 20 13:57:53 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 20 May 2015 11:57:53 -0700 Subject: Tag characters Message-ID: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> Philippe Verdy wrote: > Well, for now a reasonably stable standard exists: URLs, which can > point to a collection of page names (each site can choose its own > registry to name/encode the flags) URLs are the opposite of stability. Anyone can post whatever they like, publish the URL, then change or remove the content at any time. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From richard.wordingham at ntlworld.com Wed May 20 17:28:56 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 20 May 2015 23:28:56 +0100 Subject: Tag characters In-Reply-To: <7859769.54831.1432139368462.JavaMail.defaultUser@defaultHost> References: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net> <7859769.54831.1432139368462.JavaMail.defaultUser@defaultHost> Message-ID: <20150520232856.01363823@JRWUBU2> On Wed, 20 May 2015 17:29:28 +0100 (BST) William_J_G Overington wrote: > This could also be of use now so as to display such items as the flag > of the USA at various historical periods. It would be helpful if a > particular year were chosen for normalization purposes: for example, > for the flag of the USA used in the 1940s and most of the 1950s, to have > one particular year rather than just any year within the period > when that particular design of flag was in use. That is a singularly poor example. An example that would jar is the use of the tricolour to represent France in an account of the Hundred Years' War, or the present German flag to represent Germany in an account of the Second World War.
A problem we have is that flags are not stable enough to use in plain text that is to last a human lifetime. > It has been speculated that had Scotland left the United Kingdom as a > result of the referendum last year (in the event, the people voted > for Scotland to stay in the United Kingdom), the flag of the > United Kingdom would have been changed, though some people > advocated keeping it the same anyway. It won't be kept if England secedes from the UK so as to leave the European Union. It may not be a likely outcome, but it's certainly a possibility. Richard. From verdy_p at wanadoo.fr Wed May 20 18:47:01 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 21 May 2015 01:47:01 +0200 Subject: Tag characters In-Reply-To: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> Message-ID: URLs were initially designed to be stable (and this is still a strong recommendation). However, I did not describe just URLs but URNs (for which URLs are just resolvers locating them). URNs share with URLs (and URIs in general, as well as the UCS) the initial "U", which is intended to be universal (both in space and in time). The problem is that it is still open to anyone who does not want to maintain this stability (and also, because URLs have a time limit, namely the registration period of their domain name, their universality in time is limited). The web is also currently having difficulties maintaining its universality in space (see the ongoing political discussions about its "neutrality"). URNs, however, should be stable... provided that there is a stable registry for maintaining the references. (The UCS is stable only because this registry exists and is managed by a joint authority, still active and with enough participants that no other attempts are made to compete with it with the same success.) Stability largely depends on the status of the standard that supports it, and on the number of interested people who want to participate. It is never guaranteed over a long time, as any participant may decide to retire from the project. But stability also requires that the participants do not change their minds about the project; such a change is less likely to occur if there are lots of users of the standard. Even the UCS has had its own history of instability in its early versions. And it is very difficult to maintain this stability when people frequently contest it (sometimes in the UCS this means that a new property must be designed to satisfy more people, but this also adds to the total cost of management of the whole standard). However, new sets of characters are now slowing down. The remaining ones are a few isolates to complement existing scripts, or scripts that are extremely similar in structure to existing ones, for which completely new solutions rarely need to be designed. The most important difficulties are solved, even for the remaining scripts that need to be encoded... except the more recent addition of emoji, where we still cannot see how they will be bounded in scope (and I count flags within emoji), and scripts with complex layouts for which standard solutions are still missing (e.g. SignWriting, hieroglyphs and old cuneiforms). We'll probably have more discussions about conventional symbols used in signalisation (e.g. signals on roads, including traffic lights, and marks on the ground), or conventional signs on products (standard conformance marks...) and various security-related symbols. We know we are stable only for alphabetic/phonetic scripts, but we have lots of candidate symbols and ideograms (whose creation and explosion are definitely not finished, and do not concern just CJK scripts). Industry and legislation are creating new symbols every day around the world... and also deprecating a lot at almost the same rate. So yes, URLs can be stable, but only those from recognized standards bodies that want to keep them stable (e.g. URLs to W3C standards are stable... but not necessarily all those linking to temporary discussions. The same is true for URLs to temporary work documents used by the UTC or ISO, or by the W3C itself, where documents may be moved elsewhere into archives and into other formats, losing some formatting details). 2015-05-20 20:57 GMT+02:00 Doug Ewell : > Philippe Verdy wrote: > > > Well, for now a reasonably stable standard exists: URLs, which can > > point to a collection of page names (each site can choose its own > > registry to name/encode the flags) > > URLs are the opposite of stability. Anyone can post whatever they like, > publish the URL, then change or remove the content at any time. -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Wed May 20 19:15:28 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Wed, 20 May 2015 17:15:28 -0700 Subject: Tag characters In-Reply-To: <20150520232856.01363823@JRWUBU2> References: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net> <7859769.54831.1432139368462.JavaMail.defaultUser@defaultHost> <20150520232856.01363823@JRWUBU2> Message-ID: <555D23A0.2000808@ix.netcom.com> Have there been any discussions of the flag alphabet? (Signal flags). They are not that infrequently used online or in print, although the concentration tends to be higher in publications/sites geared to nautical audiences (not that different from chess pieces and chess publications). Now, before you leap on the "it's just a font" bandwagon, consider that the signal flags not only represent letters and digits, but also contain special pennants for functions like "repeat once" to "repeat four times" as well as a number of special flags that are associated with two-letter codes. Also, the use of certain individual flags has conventional meanings other than the letter itself, so a reference to the flag in text would not necessarily survive a font substitution, because you'd lose the fact that you are talking about flags. Some of these uses have spread to enthusiasts; for example, divers like to use the old "PO" flag (that curiously is now obsolete for this purpose) as a logo for their sport. The "diver down flag" (flag "A") is now a different one in the International Regulations for the Prevention of Collisions at Sea (IRPCAS), but for emoji-style use that would not matter, as the other one (whatever its origin) is now the recognized tribal symbol for divers. It seems to me that when schemes for representing sets of flags are discussed, it would be useful to keep open the ability to use the same scheme for signal flags -- perhaps with a different base character to avoid collisions in the letter codes.
A./ From richard.wordingham at ntlworld.com Wed May 20 20:08:06 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 21 May 2015 02:08:06 +0100 Subject: Tag characters In-Reply-To: <555D23A0.2000808@ix.netcom.com> References: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net> <7859769.54831.1432139368462.JavaMail.defaultUser@defaultHost> <20150520232856.01363823@JRWUBU2> <555D23A0.2000808@ix.netcom.com> Message-ID: <20150521020806.3bbaea6e@JRWUBU2> On Wed, 20 May 2015 17:15:28 -0700 "Asmus Freytag (t)" wrote: > Have there been any discussions of the flag alphabet? (Signal flags). > It seems to me that when schemes for representing sets of flags are > discussed, it would be useful to keep open the ability to use the > same scheme for signal flags -- perhaps with a different base > character to avoid collisions in the letter codes. If these are worthy of coding, I think the Unified Canadian Aboriginal Syllabics would be a better model - encode the form, not the semantic. Braille is another precedent. Richard. From Shawn.Steele at microsoft.com Wed May 20 20:14:57 2015 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Thu, 21 May 2015 01:14:57 +0000 Subject: Tag characters In-Reply-To: <20150521020806.3bbaea6e@JRWUBU2> References: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net> <7859769.54831.1432139368462.JavaMail.defaultUser@defaultHost> <20150520232856.01363823@JRWUBU2> <555D23A0.2000808@ix.netcom.com> <20150521020806.3bbaea6e@JRWUBU2> Message-ID: I've always been a bit partial to them and found it odd that they are intentionally not included in Unicode. Especially the novel concepts like the repeats. -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham Sent: Wednesday, May 20, 2015 6:08 PM To: unicode at unicode.org Subject: Re: Tag characters On Wed, 20 May 2015 17:15:28 -0700 "Asmus Freytag (t)" wrote: > Have there been any discussions of the flag alphabet? (Signal flags). > It seems to me that when schemes for representing sets of flags are > discussed, it would be useful to keep open the ability to use the same > scheme for signal flags -- perhaps with a different base character to > avoid collisions in the letter codes. If these are worthy of coding, I think the Unified Canadian Aboriginal Syllabics would be a better model - encode the form, not the semantic. Braille is another precedent. Richard. From doug at ewellic.org Wed May 20 21:11:25 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 20 May 2015 20:11:25 -0600 Subject: Tag characters In-Reply-To: References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> Message-ID: Philippe Verdy wrote: > URLs were initially designed to be stable (and this is still a strong > recommendation). [+ 559 words] It doesn't matter if they were designed to be stable. Users don't keep them stable. I can't believe we're debating whether URLs are stable on a list where people have raised concerns about whether 50 years is stable enough for ISO 3166-1. In any event, URLs that point to images would be an awful basis for an encoding. -- Doug Ewell | http://ewellic.org | Thornton, CO ????
From eric.muller at efele.net Wed May 20 23:57:09 2015 From: eric.muller at efele.net (Eric Muller) Date: Wed, 20 May 2015 21:57:09 -0700 Subject: Tag characters In-Reply-To: References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> Message-ID: <555D65A5.4090705@efele.net> On 5/20/2015 7:11 PM, Doug Ewell wrote: > In any event, URLs that point to images would be an awful basis for an > encoding. I would make an exception for the URL http://unicode.org/Public/8.0.0/ucd/StandardizedFlags.html. Eric. From asmus-inc at ix.netcom.com Thu May 21 00:13:17 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Wed, 20 May 2015 22:13:17 -0700 Subject: Tag characters In-Reply-To: References: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net> <7859769.54831.1432139368462.JavaMail.defaultUser@defaultHost> <20150520232856.01363823@JRWUBU2> <555D23A0.2000808@ix.netcom.com> <20150521020806.3bbaea6e@JRWUBU2> Message-ID: <555D696D.1000309@ix.netcom.com> On 5/20/2015 6:14 PM, Shawn Steele wrote: > I've always been a bit partial to them and found it odd that they are intentionally not included in Unicode. Especially the novel concepts like the repeats. :) If I were to write an actual proposal I would suggest naming them after their international/modern use, but with the understanding that the actual interpretation would be based on whatever signalling system you intend to follow. None of the existing users would be helped by having them named after their shapes and colors. That is because some of the shapes and colors are a bit complex and nobody I know learns them by description. In a way, this is also what we do for many standard alphabets. We encode LATIN SMALL LETTER O, not "small letter looking like a round circle", and we leave it to the language whether to pronounce that long like an "oh" or short, as in "hot" (for English) or more as an "oo" sound, as in Swedish. We pick a conventional name for the element of the alphabet, and then allow variations in use. (Some of the consonants show much greater variation in pronunciation.) When I said "naming", I meant we should use the alphabetic abbreviations that they are associated with, so that we can fit them into an open-ended system, like the other flags. Then, whatever techniques we will be using (such as UFLs - Universal Flag Locators) would apply to them analogously to the national flags. A./ > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham > Sent: Wednesday, May 20, 2015 6:08 PM > To: unicode at unicode.org > Subject: Re: Tag characters > > On Wed, 20 May 2015 17:15:28 -0700 > "Asmus Freytag (t)" wrote: > >> Have there been any discussions of the flag alphabet? (Signal flags). >> It seems to me that when schemes for representing sets of flags are >> discussed, it would be useful to keep open the ability to use the same >> scheme for signal flags -- perhaps with a different base character to >> avoid collisions in the letter codes. > If these are worthy of coding, I think the Unified Canadian Aboriginal Syllabics would be a better model - encode the form, not the semantic. > Braille is another precedent. > > Richard.
> > From asmus-inc at ix.netcom.com Thu May 21 00:14:45 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Wed, 20 May 2015 22:14:45 -0700 Subject: Tag characters In-Reply-To: <555D65A5.4090705@efele.net> References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> <555D65A5.4090705@efele.net> Message-ID: <555D69C5.9040901@ix.netcom.com> On 5/20/2015 9:57 PM, Eric Muller wrote: > On 5/20/2015 7:11 PM, Doug Ewell wrote: >> In any event, URLs that point to images would be an awful basis for >> an encoding. > > I would make an exception for the URL > http://unicode.org/Public/8.0.0/ucd/StandardizedFlags.html. > > Eric. > > > Currently that gives me Not Found The requested URL /Public/8.0.0/ucd/StandardizedFlags.html was not found on this server. :) However, I agree, all we need to do is create a UFL (Universal Flag Locator) and we can keep it as stable as we want. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Thu May 21 10:46:01 2015 From: petercon at microsoft.com (Peter Constable) Date: Thu, 21 May 2015 15:46:01 +0000 Subject: Tag characters In-Reply-To: <555D69C5.9040901@ix.netcom.com> References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> <555D65A5.4090705@efele.net> <555D69C5.9040901@ix.netcom.com> Message-ID: Would Unicode really want to get into the business of running a UFL service? P From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag (t) Sent: Wednesday, May 20, 2015 10:15 PM To: Eric Muller; unicode at unicode.org Subject: Re: Tag characters On 5/20/2015 9:57 PM, Eric Muller wrote: On 5/20/2015 7:11 PM, Doug Ewell wrote: In any event, URLs that point to images would be an awful basis for an encoding. I would make an exception for the URL http://unicode.org/Public/8.0.0/ucd/StandardizedFlags.html. Eric. Currently that gives me Not Found The requested URL /Public/8.0.0/ucd/StandardizedFlags.html was not found on this server. :) However, I agree, all we need to do is create a UFL (Universal Flag Locator) and we can keep it as stable as we want. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From shizhao at gmail.com Thu May 21 10:06:12 2015 From: shizhao at gmail.com (shi zhao) Date: Thu, 21 May 2015 15:06:12 +0000 Subject: =?UTF-8?B?c2ltcGxpZmllZCBDaGluZXNlIHdvcmRzIO+8iOWcnyvku47vvIk=?= Message-ID: simplified Chinese words （土+从）, Hanyu pinyin: zong1, not in Unihan. simplified Chinese: (土+从) traditional Chinese: 㙡 (U+3661) see http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=3661&useutf8=false http://glyphwiki.org/wiki/u2ff0-u571f-u4ece http://www.cnki.net/kcms/detail/Detail.aspx?dbname=CJFD2014&filename=KJSY201404019&v=MjA1NzdMdktMaWZZZDdHNEg5WE1xNDlFYllRSGZYZ3h2UjhRbUV3SlRReVFybVJFRnJDVVJMK2ZZdVJ1RkN2bFU=&filetitle=%E4%BB%8E%E8%AF%AF%E5%90%8D%E2%80%9C%E9%B8%A1%E6%9E%9E%E8%8F%8C%E2%80%9D%E7%9C%8B%E7%A7%91%E6%8A%80%E5%90%8D%E8%AF%8D%E8%A7%84%E8%8C%83%E5%8C%96 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ???_???_????????_???.pdf Type: application/pdf Size: 133340 bytes Desc: not available URL: From verdy_p at wanadoo.fr Thu May 21 11:25:16 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 21 May 2015 18:25:16 +0200 Subject: Tag characters In-Reply-To: References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> Message-ID: 2015-05-21 4:11 GMT+02:00 Doug Ewell : > Philippe Verdy wrote: > > URLs were initially designed to be stable (and this is still a strong >> recommendation). >> > [+ 559 words] > > It doesn't matter if they were designed to be stable. Users don't keep > them stable. > > I can't believe we're debating whether URLs are stable on a list where > people have raised concerns about whether 50 years is stable enough for ISO > 3166-1. > I just say that the URL encoding itself is stable and allows one to use them for stable references. The W3C itself uses URIs (in fact just URLs, even if they don't return a resource when queried) for making XML schemas identifiable. In SGML there are similar stable identifiers (but in a naming scheme). In both cases they are meant to make identifiers unique and stable over time. A URL does NOT have to return stable content; it JUST has to remain stable by itself. There's absolutely no obligation for its associated content to be accessible or retrievable. It will survive even if the referenced content is later changed or deleted: a URL is a valid URI, it is an identifier. -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Thu May 21 12:49:57 2015 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 21 May 2015 18:49:57 +0100 Subject: =?UTF-8?B?UmU6IHNpbXBsaWZpZWQgQ2hpbmVzZSB3b3JkcyDvvIjlnJ8r5LuO77yJ?= In-Reply-To: References: Message-ID: Hi Shi Zhao, The character 土+从 is not yet in Unicode, but it is scheduled for inclusion in CJK Extension F. You can see the character here (http://www.unicode.org/L2/L2014/14271-n4637.pdf on p. 148), but you should not rely on the code point, which will surely change. Andrew On 21 May 2015 at 16:06, shi zhao wrote: > simplified Chinese words （土+从）, Hanyu pinyin: zong1, not in Unihan. > > simplified Chinese: (土+从) > traditional Chinese: 㙡 (U+3661) > > see > http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=3661&useutf8=false > http://glyphwiki.org/wiki/u2ff0-u571f-u4ece > > http://www.cnki.net/kcms/detail/Detail.aspx?dbname=CJFD2014&filename=KJSY201404019&v=MjA1NzdMdktMaWZZZDdHNEg5WE1xNDlFYllRSGZYZ3h2UjhRbUV3SlRReVFybVJFRnJDVVJMK2ZZdVJ1RkN2bFU=&filetitle=%E4%BB%8E%E8%AF%AF%E5%90%8D%E2%80%9C%E9%B8%A1%E6%9E%9E%E8%8F%8C%E2%80%9D%E7%9C%8B%E7%A7%91%E6%8A%80%E5%90%8D%E8%AF%8D%E8%A7%84%E8%8C%83%E5%8C%96 > > From eik at iki.fi Thu May 21 14:52:34 2015 From: eik at iki.fi (Erkki I Kolehmainen) Date: Thu, 21 May 2015 22:52:34 +0300 Subject: Tag characters In-Reply-To: References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> <555D65A5.4090705@efele.net> <555D69C5.9040901@ix.netcom.com> Message-ID: <005901d093ff$aec230d0$0c469270$@fi> I don't think so. Sincerely, Erkki Lähettäjä: Unicode [mailto:unicode-bounces at unicode.org] Puolesta Peter Constable Lähetetty: 21. toukokuuta 2015 18:46 Vastaanottaja: Asmus Freytag (t); Eric Muller; unicode at unicode.org Aihe: RE: Tag characters Would Unicode really want to get into the business of running a UFL service?
P From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag (t) Sent: Wednesday, May 20, 2015 10:15 PM To: Eric Muller; unicode at unicode.org Subject: Re: Tag characters On 5/20/2015 9:57 PM, Eric Muller wrote: On 5/20/2015 7:11 PM, Doug Ewell wrote: In any event, URLs that point to images would be an awful basis for an encoding. I would make an exception for the URL http://unicode.org/Public/8.0.0/ucd/StandardizedFlags.html. Eric. Currently that gives me Not Found The requested URL /Public/8.0.0/ucd/StandardizedFlags.html was not found on this server. :) However, I agree, all we need to do is create a UFL (Universal Flag Locator) and we can keep it as stable as we want. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Thu May 21 15:25:56 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Thu, 21 May 2015 13:25:56 -0700 Subject: Tag characters In-Reply-To: References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> <555D65A5.4090705@efele.net> <555D69C5.9040901@ix.netcom.com> Message-ID: <555E3F54.6020907@ix.netcom.com> On 5/21/2015 8:46 AM, Peter Constable wrote: > > Would Unicode really want to get into the business of running a UFL > service? > I suspect both Eric and I may have been slightly tongue-in-cheek with respect to UFLs... ... not sure about anybody else. Cheers, A./ > > P > > *From:*Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of > *Asmus Freytag (t) > *Sent:* Wednesday, May 20, 2015 10:15 PM > *To:* Eric Muller; unicode at unicode.org > *Subject:* Re: Tag characters > > On 5/20/2015 9:57 PM, Eric Muller wrote: > > On 5/20/2015 7:11 PM, Doug Ewell wrote: > > In any event, URLs that point to images would be an awful > basis for an encoding. > > > I would make an exception for the URL > http://unicode.org/Public/8.0.0/ucd/StandardizedFlags.html > . > > Eric. > > > Currently that gives me > > > Not Found > > The requested URL /Public/8.0.0/ucd/StandardizedFlags.html was > not found on this server. > > > :) > > However, I agree, all we need to do is create a UFL (Universal Flag > Locator) and we can keep it as stable as we want. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Fri May 22 06:01:13 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 22 May 2015 12:01:13 +0100 (BST) Subject: Tag characters and localizable sentence technology (from Tag characters) Message-ID: <32759766.22530.1432292473336.JavaMail.defaultUser@defaultHost> Tag characters and localizable sentence technology (from Tag characters) I refer to the following documents, the first about localizable sentences and the second about, amongst other matters, applying tag characters using a new encoding format. http://www.unicode.org/L2/L2013/13079-loc-sentance.pdf http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf Starting from the idea of the markup bubble from the first document and applying the tag method and the ISO standard document method from the second document, there arises the following possibility for the future of localizable sentence technology.
A single character would be added into Unicode, the name of the character being LOCALIZABLE SENTENCE BASE CHARACTER, and then the plain text encoding of a particular localizable sentence would be defined as being expressed as the LOCALIZABLE SENTENCE BASE CHARACTER character followed by the code for the localizable sentence specified in the ISO [number] document, the code being expressed using tag characters. Please find attached a design for the glyph for the LOCALIZABLE SENTENCE BASE CHARACTER character. I designed the glyph by adapting and then combining the designs for localizable sentence markup bubble brackets from the first of the two documents referenced earlier in this text. Each localizable sentence, carefully written so that its meaning does not rely on any sentence previously used in the same document, would have a meaning expressed in words and possibly also a glyph: the more commonly used localizable sentences would each have a glyph, while other localizable sentences need not, though some could, as desired. William Overington 22 May 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: glyph_for_localizable_sentence_base_character.png Type: image/png Size: 872 bytes Desc: not available URL: From baskar115 at gmail.com Sat May 23 07:41:36 2015 From: baskar115 at gmail.com (baskar raj) Date: Sat, 23 May 2015 18:11:36 +0530 Subject: Regarding Unicode for new Symbol Message-ID: Hi, Is it possible to get a Unicode code point for a new symbol, designed for a commonly used word, for example let's say "and", which can be used in conjunction with numbers or letters? So is it possible to file an application seeking Unicode? -------------- next part -------------- An HTML attachment was scrubbed... URL: From tomasek at etf.cuni.cz Sat May 23 13:50:19 2015 From: tomasek at etf.cuni.cz (Petr Tomasek) Date: Sat, 23 May 2015 20:50:19 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: <20150330.000738.23342035.wl@gnu.org> References: <20150330.000738.23342035.wl@gnu.org> Message-ID: <20150523185019.GA7442@ebed.etf.cuni.cz> On Mon, Mar 30, 2015 at 12:07:38AM +0200, Werner LEMBERG wrote: > > > That's quite some variety. There are also the three-quarter flat and > > sharp in Western music to consider. I'll be able to dig into this > > after I get back to Ireland from Sweden on Friday. > > You should check the Standard Music Font Layout (SmuFL) for details; > it also has a freely available font that covers it. > > http://www.smufl.org > > The recent version of the specification can be found at > > http://www.smufl.org/files/smufl-1.12.pdf > > Werner Hm, it seems that there is much more to be encoded in Unicode than just the quarter-tone signs... Petr From asmus-inc at ix.netcom.com Sat May 23 14:09:33 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sat, 23 May 2015 12:09:33 -0700 Subject: Regarding Unicode for new Symbol In-Reply-To: References: Message-ID: <5560D06D.1080305@ix.netcom.com> On 5/23/2015 5:41 AM, baskar raj wrote: > Hi, > Is it possible to get a Unicode code point for a new symbol, designed for a > commonly used word, for example let's say "and", which can be used in > conjunction with numbers or letters? So is it possible to file an > application seeking Unicode? > > Generally, there is a problem with newly invented symbols (for any purpose).
It is often impossible to predict whether they will become successful, get widely adopted and thus become an essential part of written text. When Unicode encodes something, it is permanent. If it encodes a symbol that ultimately fails or quickly falls out of use, that failure is now permanent. That fact alone forces Unicode to be very cautious. There are some obvious exceptions. New currency symbols are being invented regularly. But as soon as they are officially declared, practically everyone using that currency has a need to use that symbol in text. Such symbols are practically guaranteed to be successful in a way that other novel symbols are not. Your case sounds like more of the latter; it would seem highly uncertain whether people will adopt your invention. As a result, Unicode would most likely want to encode your symbol only after it has proven itself, and not as a first step. So, while it is "possible" it appears extremely unlikely in this case, unless there are circumstances that you have not mentioned, such as official government support in the form of a spelling reform or something of that nature. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat May 23 17:45:38 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 24 May 2015 00:45:38 +0200 Subject: Regarding Unicode for new Symbol In-Reply-To: References: Message-ID: But there's already a symbol encoded for this common word: it is part of the ASCII subset (&) and is already encoded as a symbol (even if initially it was designed as a cursive simplification of a ligature of the Latin letters "et", and used also within words containing these letters, in addition to the Latin word "et" itself). Some fonts still make the ligature more evident, but as a symbol it allows more variation of its shape (it is also used in a trademark symbol for the Orange telecommunication group, with a specific design, but for such usage as a logo, the encoded character is not suitable: logos are transported as images to specify also this shape and color design, not encoded in the character itself). 2015-05-23 14:41 GMT+02:00 baskar raj : > Hi, > Is it possible to get a Unicode code point for a new symbol, designed for a > commonly used word, for example let's say "and", which can be used in > conjunction with numbers or letters? So is it possible to file an > application seeking Unicode? > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat May 23 18:00:59 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 24 May 2015 01:00:59 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: <20150523185019.GA7442@ebed.etf.cuni.cz> References: <20150330.000738.23342035.wl@gnu.org> <20150523185019.GA7442@ebed.etf.cuni.cz> Message-ID: 2015-05-23 20:50 GMT+02:00 Petr Tomasek : > Hm, it seems that there is much more to be encoded in Unicode than just > the quarter-tone signs... > Clearly not a valid argument against encoding a character. There are plenty of characters still not encoded even in scripts already encoded; this never meant that the encoded part should have been stalled until the set was "complete". Each encoded character has to be evaluated individually, even if it makes sense to add characters in groups when their association in that group is necessary to make them usable (for example, it would have been nonsense in any language to encode only Latin vowels without any consonant, but it would have been meaningful to encode only basic Arabic consonants and postpone the encoding of basic vowels). The merits of an encoding proposal are measured by its usage and usability in a well-established (orthographic) convention. It is important then to explore what this convention is and why more than one character is needed together for that convention. Then we can compare with other competing conventions what they have in common (this is what Unicode considers a "script", even if it is not necessarily for writing spoken languages). -------------- next part -------------- An HTML attachment was scrubbed... URL: From baskar115 at gmail.com Sat May 23 23:55:50 2015 From: baskar115 at gmail.com (baskar raj) Date: Sun, 24 May 2015 10:25:50 +0530 Subject: Regarding Unicode for new Symbol In-Reply-To: References: Message-ID: i just gave "and" as an example (verdy), i am just curious to know if we propose a symbol for a word does Unicode encode it or accept when it is already used by a small community of users, shall we claim in letterlike symbols (00–4F). (any possibility) or we can only implement in private use area until it is recognized - which is not possible for small mediums to get widely recognized other than bigger names like Microsoft or Apple proposing. On Sun, May 24, 2015 at 4:15 AM, Philippe Verdy wrote: > But there's already a symbol encoded for this common word: it is part of > the ASCII subset (&) and is already encoded as a symbol (even if initially > it was designed as a cursive simplification of a ligature of the Latin > letters "et", and used also within words containing these letters, in > addition to the Latin word "et" itself). > Some fonts still make the ligature more evident, but as a symbol it allows > more variation of its shape (it is also used in a trademark symbol for the > Orange telecommunication group, with a specific design, but for such usage > as a logo, the encoded character is not suitable: logos are transported as > images to specify also this shape and color design, not encoded in the > character itself). > > 2015-05-23 14:41 GMT+02:00 baskar raj : > >> Hi, >> Is it possible to get a Unicode code point for a new symbol, designed for a >> commonly used word, for example let's say "and", which can be used in >> conjunction with numbers or letters? So is it possible to file an >> application seeking Unicode? >> >> >> > -- Kind Regards, M Baskar Raj -------------- next part -------------- An HTML attachment was scrubbed... URL: From eik at iki.fi Sun May 24 03:02:49 2015 From: eik at iki.fi (Erkki I Kolehmainen) Date: Sun, 24 May 2015 11:02:49 +0300 Subject: VS: Regarding Unicode for new Symbol In-Reply-To: References: Message-ID: <000001d095f8$07cfeec0$176fcc40$@fi> You are not the first one to come up with this kind of a proposal (even for sentences), which has never received any noticeable support - for good reasons, I might add. Erkki I. Kolehmainen Tilkankatu 12 A 3, 00300 Helsinki, Finland Mob: +358400825943, Tel / Fax (by arr.): +358943682643 Lähettäjä: Unicode [mailto:unicode-bounces at unicode.org] Puolesta baskar raj Lähetetty: 24.
toukokuuta 2015 07:56 Vastaanottaja: verdy_p at wanadoo.fr; unicode Unicode Discussion; asmus-inc at ix.netcom.com Aihe: Re: Regarding Unicode for new Symbol i just gave "and" as an example (verdy), i am just curious to know if we propose a symbol for a word does Unicode encode it or accept when it is already used by a small community of users, shall we claim in letterlike symbols (00–4F). (any possibility) or we can only implement in private use area until it is recognized - which is not possible for small mediums to get widely recognized other than bigger names like Microsoft or Apple proposing. On Sun, May 24, 2015 at 4:15 AM, Philippe Verdy wrote: But there's already a symbol encoded for this common word: it is part of the ASCII subset (&) and is already encoded as a symbol (even if initially it was designed as a cursive simplification of a ligature of the Latin letters "et", and used also within words containing these letters, in addition to the Latin word "et" itself). Some fonts still make the ligature more evident, but as a symbol it allows more variation of its shape (it is also used in a trademark symbol for the Orange telecommunication group, with a specific design, but for such usage as a logo, the encoded character is not suitable: logos are transported as images to specify also this shape and color design, not encoded in the character itself). 2015-05-23 14:41 GMT+02:00 baskar raj : Hi, Is it possible to get a Unicode code point for a new symbol, designed for a commonly used word, for example let's say "and", which can be used in conjunction with numbers or letters? So is it possible to file an application seeking Unicode? -- Kind Regards, M Baskar Raj -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun May 24 04:25:53 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 24 May 2015 10:25:53 +0100 Subject: Regarding Unicode for new Symbol In-Reply-To: References: Message-ID: <20150524102553.1ce9a877@JRWUBU2> On Sun, 24 May 2015 10:25:50 +0530 baskar raj wrote: > i just gave "and" as an example (verdy), i am just curious to know if > we propose a symbol for a word does Unicode encode it or accept when > it is already used by a small community of users, shall we claim in > letterlike symbols (00–4F). (any possibility) > or we can only implement in private use area until it is recognized - > which is not possible for small mediums to get widely recognized > other than bigger names like Microsoft or Apple proposing. In general, a private use character can be promoted by including it in a generally useful font and providing soft keyboards that allow its use. There are two major exceptions to this - combining marks and characters that require a rendering engine. It might even be possible to get round these problems in many cases with a *lot* of ingenuity in the soft keyboards. I believe AAT fonts are a solution for the Apple world, but OpenType may be more difficult, and may need tackling application by application and renderer by renderer even with open source software. Another possible method would be to subvert the rendering engine. For open source applications, fonts using (SIL) Graphite often work. While Tai Tham was being encoded, I successfully used the PUA for generating word lists and successfully converted them to Unicode once the encoding was approved. My viewing tools were limited, and I was delighted when OpenOffice started supporting Graphite and when a version of Firefox appeared that also supported Graphite. There is another solution, which is *bad* but can work well for a short period. That solution is for a font to hijack a code point with the desired properties relevant to rendering. One solution along these lines, which may not yet be usable, would be to use a character with the right properties and then use a variation sequence to substitute one's own unrelated glyph. Gaps in character assignments tend to be used for these purposes (Lao is a good example), but renderer support varies. I remember that Windows XP initially didn't support U+0BB6 TAMIL LETTER SHA when using its native rendering stack. Richard. From tomasek at etf.cuni.cz Sun May 24 06:32:40 2015 From: tomasek at etf.cuni.cz (Petr Tomasek) Date: Sun, 24 May 2015 13:32:40 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: References: <20150330.000738.23342035.wl@gnu.org> <20150523185019.GA7442@ebed.etf.cuni.cz> Message-ID: <20150524113240.GA15445@ebed.etf.cuni.cz> On Sun, May 24, 2015 at 01:00:59AM +0200, Philippe Verdy wrote: > 2015-05-23 20:50 GMT+02:00 Petr Tomasek : > > > Hm, it seems that there is much more to be encoded in Unicode than just > > the quarter-tone signs... > > > > Clearly not a valid argument against encoding a character. Where do I argue against encoding a character? I was just surprised by how many musical symbols there are which would benefit from being encoded in Unicode. No less and no more. P.T. > There are > plenty of characters still not encoded even in scripts already encoded; > this never meant that the encoded part should have been stalled until the > set was "complete". > Each encoded character has to be evaluated individually, even if it makes > sense to add characters in groups when their association in that group is > necessary to make them usable (for example, it would have been nonsense > in any language to encode only Latin vowels without any consonant, but it > would have been meaningful to encode only basic Arabic consonants and > postpone the encoding of basic vowels). > The merits of an encoding proposal are measured by its usage and usability > in a well-established (orthographic) convention. It is important then to > explore what this convention is and why more than one character is needed > together for that convention. Then we can compare with other competing > conventions what they have in common (this is what Unicode considers a > "script", even if it is not necessarily for writing spoken languages). > From samjnaa at gmail.com Sun May 24 07:25:02 2015 From: samjnaa at gmail.com (Shriramana Sharma) Date: Sun, 24 May 2015 17:55:02 +0530 Subject: 25CC for dotted circle, but what for dashed box? Message-ID: I hope the subject line makes it clear. What character is to be used when a dashed box such as that shown for special-rendering characters in the code chart is required to be actually shown in text? -- Shriramana Sharma ???????????? ???????????? From samjnaa at gmail.com Sun May 24 10:36:10 2015 From: samjnaa at gmail.com (Shriramana Sharma) Date: Sun, 24 May 2015 21:06:10 +0530 Subject: 25CC for dotted circle, but what for dashed box? In-Reply-To: References: Message-ID: Nice -- I was searching for "DASHED BOX" since that's what TUS 7.0 ch 24.1 refers to it as and there are too many "SQUARE" characters... -- Shriramana Sharma ???????????? ????????????
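The PUA-then-convert workflow Richard Wordingham describes earlier in this thread can be pictured in a few lines. This is a sketch only: the PUA assignments below are hypothetical (such assignments are private by definition), and the pairing with the final code points is invented for illustration.

    # Sketch of converting provisional PUA text to approved code points once
    # an encoding is accepted. The mapping is hypothetical; a real project
    # would list its own private assignments against the published chart.
    PUA_TO_FINAL = {
        0xE000: 0x1A20,  # provisional letter -> U+1A20 TAI THAM LETTER HIGH KA
        0xE001: 0x1A21,  # provisional letter -> another approved code point
        0xE002: 0x1A23,  # (these pairings are invented for illustration)
    }

    def convert(text):
        """Replace provisional PUA code points, leaving everything else intact."""
        return text.translate(PUA_TO_FINAL)

    wordlist = "\uE000\uE001 \uE002"   # text drafted with PUA assignments
    print(convert(wordlist))           # the same text in the approved encoding

Because the conversion is a plain code-point substitution, word lists and other plain-text data survive the transition unchanged apart from the scalar values, which is what makes the PUA-first approach workable.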
From verdy_p at wanadoo.fr Sun May 24 13:52:26 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 24 May 2015 20:52:26 +0200 Subject: Regarding Unicode for new Symbol In-Reply-To: <20150524102553.1ce9a877@JRWUBU2> References: <20150524102553.1ce9a877@JRWUBU2> Message-ID: 2015-05-24 11:25 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > There is another solution, which is *bad* but can work well for a short > period. That solution is for a font to hijack a code point with the > desired properties relevant to rendering. > It is not so bad when the usage is limited to some documents using specific fonts designed for this purpose. OK, it is not fully interchangeable, but it can be good for the start (including for creating documents showing the proposal for a new encoding). However, if we want to limit the propagation of this "bad" encoding in documents not specifically linked to a specific font, a good solution is to embed that font directly in the document (the PDF format is suitable for that, but you can also do that with HTML documents using embedded SVG images, which can themselves be embedded in SVG fonts embeddable in the document itself). No need to use a variation sequence (unless it is also recognized specifically in that embedded font). But it is not general enough for all complex scripts that require specific layout rules (GSUB/GPOS), notably when they are contextual. In summary we come back to the use of collections of glyphs (SVG) without any actual text rendering engine. With HTML5, the embedding of SVG is greatly facilitated (and can also be automated with some custom JavaScript transforming an easily composable syntax into a sequence of text and images). You can even apply some limited CSS styling that can apply to both the text and inline SVG images, provided your SVG is designed to be scalable within the current text line metrics, for example when it uses a "viewBox" attribute but not the "width" and "height" attributes that should be set by the default HTML box model: it will, however, work reliably only for full clusters occupying the standard line height and vertical alignment relative to the baseline, not for individual characters if they are combining or using some contextual layout. Now it's up to you to invent your own syntax for making the transform into sequences of plain text and inline images. However, you won't get some font-specific features such as hinting for small font sizes (SVG fonts currently have no standard way to include hinting instructions in order to transform the geometry of paths according to the physical device, and there are also difficulties with the specification of sizes in CSS, for example on Hi-DPI displays such as smartphones, or with the zoom in/out feature of browsers: it requires fine tuning not with the CSS "logical pixel" unit, scaled in logical "dpi", but with the newer "dppx" unit, plus some other metrics related to subpixels of the rendering surface, or the relative alignment of pixels with the physical positions, which are not necessarily in a simple grid, but mapped using "screening" techniques which are very common when printing). 
As far as I know, "font hinting" is still a work in progress (and has been for a long time); it is also very complex in TrueType/OpenType and has no real standard (only a few specialists can use it to design specific fonts and it is not easily reusable elsewhere), so nothing in this domain is supported by SVG fonts (for small font sizes the current solution is still to use bitmap images instead, assuming that the HTML rendering engine is using its best efforts to map the logical pixels of bitmaps into physical pixels or subpixels on the rendering surface, and to preserve their intended color gamut and contrasts without excessive distortions); in fact neither TrueType/OpenType nor SVG and CSS have any decent support for "screening" techniques, like those that have existed in PostScript for several decades; and for this reason, publishers still **love** PostScript for the fine tuning of the typography and images and for getting the best final result that the final printing medium can support. So PostScript fonts are definitely not dead, but they are still not sufficiently supported and used for display, due to lack of equivalent support in OSes and browsers (even in HTML5, there's still no decent support in the newest "canvas", which still has lots of quirks at this level, and also doesn't support any suitable screening). And most popular printers do not even have PostScript (it is replaced by the capabilities of the printer drivers doing all the work, via the more limited graphic APIs of the OS used by applications: those printers only support simple bitmaps). It is then still difficult to create, for the widest range of devices, any document embedding simultaneously plain text rendered with fonts, scalable images (such as SVG), and bitmap images (including photography), without first assuming some physical properties of the rendering surface (while also taking into account local preferences of the final user, such as zoom level, colorimetric profiles, choice of paper and print quality, or multiple displays). The "WYSIWYG" concept is just an advertised goal, but still a myth, as it is largely not implemented or supported. 2015-05-21 4:11 GMT+02:00 Doug Ewell : > Philippe Verdy wrote: > > URLs were initially designed to be stable (and this is still a strong >> recommendation). >> > [+ 559 words] > > It doesn't matter if they were designed to be stable. Users don't keep > them stable. > > I can't believe we're debating whether URLs are stable on a list where > people have raised concerns about whether 50 years is stable enough for ISO > 3166-1. > > In any event, URLs that point to images would be an awful basis for an > encoding. -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.muller at efele.net Tue May 26 08:48:51 2015 From: eric.muller at efele.net (Eric Muller) Date: Tue, 26 May 2015 06:48:51 -0700 Subject: Tag characters In-Reply-To: <555E3F54.6020907@ix.netcom.com> References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> <555D65A5.4090705@efele.net> <555D69C5.9040901@ix.netcom.com> <555E3F54.6020907@ix.netcom.com> Message-ID: <556479C3.9040805@efele.net> An HTML attachment was scrubbed... URL: From pzi at ingerman.org Tue May 26 09:45:37 2015 From: pzi at ingerman.org (Peter Zilahy Ingerman, PhD) Date: Tue, 26 May 2015 10:45:37 -0400 Subject: Tag characters In-Reply-To: <556479C3.9040805@efele.net> References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> <555D65A5.4090705@efele.net> <555D69C5.9040901@ix.netcom.com> <555E3F54.6020907@ix.netcom.com> <556479C3.9040805@efele.net> Message-ID: <55648711.5000203@ingerman.org> Aww... I was SURE you meant UFOs! On 2015-05-26 09:48, Eric Muller wrote: > On 5/21/2015 1:25 PM, Asmus Freytag (t) wrote: >> On 5/21/2015 8:46 AM, Peter Constable wrote: >>> >>> Would Unicode really want to get into the business of running a UFL >>> service? >>> >> >> I suspect both Eric and I may have been slightly tongue-in-cheek >> with respect to UFLs...
> > Actually, I was serious. > > Eric. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Wed May 27 02:53:52 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 27 May 2015 08:53:52 +0100 (BST) Subject: Tag characters In-Reply-To: References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> <555D65A5.4090705@efele.net> <555D69C5.9040901@ix.netcom.com> Message-ID: <6653799.7367.1432713232779.JavaMail.defaultUser@defaultHost> Peter Constable wrote as follows: > Would Unicode really want to get into the business of running a UFL service? Well, Unicode is about precision, interoperability and long-term stability, and, given, in relation to one particular specified base character followed by some tag characters, that a particular sequence of Unicode characters is intended to lead to the display of an image representing a particular flag, it seems to me highly reasonable that the Unicode Technical Committee might seriously consider providing that facility. William Overington 27 May 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From moyogo at gmail.com Wed May 27 03:18:13 2015 From: moyogo at gmail.com (Denis Jacquerye) Date: Wed, 27 May 2015 08:18:13 +0000 Subject: =?UTF-8?Q?Re=3A_FYI=3A_The_world=E2=80=99s_languages=2C_in_7_maps_and_char?= =?UTF-8?Q?ts?= In-Reply-To: References: <281809194-1431470843-cardhu_decombobulator_blackberry.rim.net-1396854058-@b13.c4.bise6.blackberry> <000401d08d5e$a811de90$f8359bb0$@gmail.com> Message-ID: The South China Morning Post published a similar infographic: A world of languages - and how many speak them http://www.scmp.com/infographics/article/1810040/infographic-world-languages -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Wed May 27 05:22:38 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Wed, 27 May 2015 12:22:38 +0200 Subject: =?UTF-8?Q?Re=3A_FYI=3A_The_world=E2=80=99s_languages=2C_in_7_maps_and_char?= =?UTF-8?Q?ts?= In-Reply-To: References: <281809194-1431470843-cardhu_decombobulator_blackberry.rim.net-1396854058-@b13.c4.bise6.blackberry> <000401d08d5e$a811de90$f8359bb0$@gmail.com> Message-ID: Hmmm. How accurate can it be? They forgot Austria, and got Switzerland wrong by almost a power of 10. Mark *« Il meglio è l'inimico del bene »* On Wed, May 27, 2015 at 10:18 AM, Denis Jacquerye wrote: > The South China Morning Post published a similar infographic: > A world of languages - and how many speak them > > http://www.scmp.com/infographics/article/1810040/infographic-world-languages > -------------- next part -------------- An HTML attachment was scrubbed... URL: From moyogo at gmail.com Wed May 27 09:59:37 2015 From: moyogo at gmail.com (Denis Jacquerye) Date: Wed, 27 May 2015 14:59:37 +0000 Subject: FYI: The world's languages, in 7 maps and charts In-Reply-To: References: <281809194-1431470843-cardhu_decombobulator_blackberry.rim.net-1396854058-@b13.c4.bise6.blackberry> <000401d08d5e$a811de90$f8359bb0$@gmail.com> Message-ID: The data used to build the infographic comes from Ethnologue.com. 
http://www.ethnologue.com/language/deu does not indicate the Standard German L1 population in Austria and gives a population of 727,000 Standard German L1 speakers in Switzerland (the difference is counted as Swiss German L1 speakers). On Wed, 27 May 2015 at 11:22 Mark Davis ☕️ wrote: > Hmmm. How accurate can it be? They forgot Austria, and got Switzerland > wrong by almost a power of 10. > > > Mark > > *« Il meglio è l'inimico del bene »* > > On Wed, May 27, 2015 at 10:18 AM, Denis Jacquerye > wrote: > >> The South China Morning Post published a similar infographic: >> A world of languages - and how many speak them >> >> http://www.scmp.com/infographics/article/1810040/infographic-world-languages >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From clarkcox3 at gmail.com Wed May 27 10:57:38 2015 From: clarkcox3 at gmail.com (clarkcox3 at gmail.com) Date: Wed, 27 May 2015 08:57:38 -0700 Subject: FYI: The world's languages, in 7 maps and charts In-Reply-To: References: <281809194-1431470843-cardhu_decombobulator_blackberry.rim.net-1396854058-@b13.c4.bise6.blackberry> <000401d08d5e$a811de90$f8359bb0$@gmail.com> Message-ID: <60B6D84E-453F-489E-9F16-8BEB2919833B@gmail.com> If the various Chinese languages/dialects are similar enough to be counted in a single category, then certainly Swiss German is similar enough to the German spoken in Germany and Austria to be counted in the same category. Sent from my iPhone > On May 27, 2015, at 07:59, Denis Jacquerye wrote: > > The data used to build the infographic comes from Ethnologue.com. > http://www.ethnologue.com/language/deu does not indicate the Standard German L1 population in Austria and gives a population of 727,000 Standard German L1 speakers in Switzerland (the difference is counted as Swiss German L1 speakers). > >> On Wed, 27 May 2015 at 11:22 Mark Davis ☕️ wrote: >> Hmmm. How accurate can it be? They forgot Austria, and got Switzerland wrong by almost a power of 10. >> >> >> Mark >> >> « Il meglio è l'inimico del bene » >> >>> On Wed, May 27, 2015 at 10:18 AM, Denis Jacquerye wrote: >>> The South China Morning Post published a similar infographic: >>> A world of languages - and how many speak them >>> http://www.scmp.com/infographics/article/1810040/infographic-world-languages -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Wed May 27 11:10:46 2015 From: petercon at microsoft.com (Peter Constable) Date: Wed, 27 May 2015 16:10:46 +0000 Subject: Tag characters In-Reply-To: <6653799.7367.1432713232779.JavaMail.defaultUser@defaultHost> References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> <555D65A5.4090705@efele.net> <555D69C5.9040901@ix.netcom.com> <6653799.7367.1432713232779.JavaMail.defaultUser@defaultHost> Message-ID: Well, the same reasoning could also argue for the contrapositive (a→b ⇒ ¬b→¬a): that UTC should not consider endorsing such a tag scheme. Peter From: William_J_G Overington [mailto:wjgo_10009 at btinternet.com] Sent: Wednesday, May 27, 2015 12:54 AM To: unicode at unicode.org; Peter Constable; eric.muller at efele.net; asmus-inc at ix.netcom.com Subject: Re: Tag characters Peter Constable wrote as follows: > Would Unicode really want to get into the business of running a UFL service?
From wjgo_10009 at btinternet.com  Wed May 27 11:26:07 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 27 May 2015 17:26:07 +0100 (BST)
Subject: Tag characters and in-line graphics (from Tag characters)
Message-ID: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost>

Tag characters and in-line graphics (from Tag characters)

This document suggests a way to use the method of a base character together with tag characters to produce a graphic. The approach is theoretical and has not, at this time, been tried in practice.

The application in mind is to enable the graphic for an emoji character to be included within a plain text stream, though there will hopefully be other applications.

The base character could be either an existing character, such as U+1F5BC FRAME WITH PICTURE, or a new character as decided. Tests could be carried out using a Private Use Area character as the base character.

The explanation here is intended to explain the suggested technique by examples, as a basis for discussion. In each example, please consider that the characters listed are each the tag version of the character used here, and that they all as a group follow one base character. The examples are deliberately short so as to explain the idea. A real use example might have around two hundred or so tag characters following the base character, maybe more, sometimes fewer.

Examples of displays: each example is left to right along the line, then lines down the page from upper to lower.

7r means 7 pixels red
7r5y means 7 pixels red then 5 pixels yellow
7r5y-3b means 7 pixels red then 5 pixels yellow then next line then 3 pixels blue

Examples of colours available:

k black
n brown
r red
o orange
y yellow
g green (0, 255, 0)
b blue
m magenta
e grey
w white
c cyan
p pink
d dark grey
i light grey (thus avoiding using lowercase l so as to avoid confusion with figure 1)
f deeper green (foliage colour) (0, 128, 0)

Next line request:

- moves to the next line

Local palette requests:

192R224G64B2s means store as local palette colour 2 the colour (R=192, G=224, B=64)
7,2u means 7 pixels using local palette colour 2

Local glyph memory, for use in compressing a document where the same glyph is used two or more times in the document:

3t7r means this is local glyph 3 being defined at its first use in the document as 7 red pixels
3h means here local glyph 3 is being used

The above is for bitmaps. It would be possible to use a similar technique to specify a vector glyph as used in fontmaking, using on-curve and off-curve points specified as X, Y coordinates together with N for on-curve and F for off-curve. There would need to be a few other commands so as to specify places in the tag character stream where the definition of a contour starts, so as to separate the definitions of the glyphs for a colour font, and so on. This could be made OpenType compatible so that a received glyph could be added into a font.

Please feel free to suggest improvements. One improvement could be as to how to build a Unicode code point into a picture so that a font could be transmitted.

William Overington

27 May 2015
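To make the run-length notation above concrete, here is a minimal decoder sketch in Python for the simplest subset of the scheme (run counts, single-letter colours, and '-' as the next-line request); the palette and glyph-memory commands are omitted, and nothing here is part of any actual proposal or implementation:

    import re

    # Single-letter colour codes from the proposal above.
    COLOURS = {"k": "black", "n": "brown", "r": "red", "o": "orange",
               "y": "yellow", "g": "green", "b": "blue", "m": "magenta",
               "e": "grey", "w": "white", "c": "cyan", "p": "pink",
               "d": "dark grey", "i": "light grey", "f": "foliage green"}

    def decode(runs):
        """Decode e.g. '7r5y-3b' into rows of (pixel count, colour) runs."""
        rows, row = [], []
        for count, colour, newline in re.findall(r"(\d+)([a-z])|(-)", runs):
            if newline:          # '-' is the next-line request
                rows.append(row)
                row = []
            else:
                row.append((int(count), COLOURS[colour]))
        rows.append(row)
        return rows

    print(decode("7r5y-3b"))
    # [[(7, 'red'), (5, 'yellow')], [(3, 'blue')]]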
From doug at ewellic.org  Wed May 27 12:06:41 2015
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 27 May 2015 10:06:41 -0700
Subject: RE: Tag characters and in-line graphics (from Tag characters)
Message-ID: <20150527100641.665a7a7059d7ee80bb4d670165c8327d.9c484cc1df.wbe@email03.secureserver.net>

William_J_G Overington wrote:

> Please feel free to suggest improvements.

http://en.wikipedia.org/wiki/Scalable_Vector_Graphics

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From doug at ewellic.org  Wed May 27 12:49:31 2015
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 27 May 2015 10:49:31 -0700
Subject: RE: Tag characters
Message-ID: <20150527104931.665a7a7059d7ee80bb4d670165c8327d.7f06f3d380.wbe@email03.secureserver.net>

On Tuesday, May 19, Mark Davis ☕️ wrote:

> A more concrete proposal will be in a PRI to be issued soon,

If the new mechanism is intended "for Unicode 8.0," as stated in the minutes at http://www.unicode.org/L2/L2015/15107.htm#143-M1 ...

... and if Unicode 8.0 is "planned for release in June, 2015," as stated on the Beta Review page...

... and if June 2015 starts in less than a week...

... shouldn't we be seeing that PRI real soon now?

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From kenwhistler at att.net  Wed May 27 13:08:44 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Wed, 27 May 2015 11:08:44 -0700
Subject: Re: Tag characters
In-Reply-To: <20150527104931.665a7a7059d7ee80bb4d670165c8327d.7f06f3d380.wbe@email03.secureserver.net> References: <20150527104931.665a7a7059d7ee80bb4d670165c8327d.7f06f3d380.wbe@email03.secureserver.net>
Message-ID: <5566082C.4060904@att.net>

Doug,

Read on in the minutes to the next day. 143-C27 and related actions.

There are a few things to keep in mind here.

1. The un-deprecation of the tags U+E0020..U+E007E *is* part of the UCD for Unicode 8.0. The change has already taken place in the revised beta files now posted (see PropList.txt), and will be part of the 8.0 release next month.

2. UTR #51, while scheduled to come out at the same time as the Unicode 8.0 release, is a UTR and is not formally either a part of the Unicode Standard per se, nor a formal part of the Unicode 8.0 release.

3. As per the minutes, when the approved version of UTR #51 is first published, more or less simultaneously with the Unicode 8.0 release (and explaining other aspects of emoji related to the release, such as the use of emoji modifiers), it will *not* yet contain the flag-tag discussion and mechanism.

4. Once the PRI is up, it will be used as the basis for the next proposed update of UTR #51. And the review of that proposed update and publication of the *subsequent* revision of UTR #51 need not wait for the next Unicode release (9.0 in summer, 2016). So at that point, the flag-tag mechanism will be available for use *with* Unicode 8.0 -- it just won't be a formal part of the release per se.

Clear?

--Ken

On 5/27/2015 10:49 AM, Doug Ewell wrote:

> ... shouldn't we be seeing that PRI real soon now?
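Ken's point 1 can be checked directly against the beta data files. A minimal sketch, assuming a locally downloaded copy of PropList.txt from the UCD (data lines there have the form "codepoint(s) ; property # comment"):

    # Sketch: check whether the tag characters U+E0020..U+E007E are still
    # listed with the Deprecated property in a local copy of PropList.txt.

    def deprecated_ranges(path="PropList.txt"):
        ranges = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.split("#")[0].strip()   # drop trailing comments
                if ";" not in line:
                    continue
                fields, prop = line.split(";")[0], line.split(";")[1]
                if prop.strip() != "Deprecated":
                    continue
                lo, _, hi = fields.strip().partition("..")
                ranges.append((int(lo, 16), int(hi or lo, 16)))
        return ranges

    deprecated = deprecated_ranges()
    still_deprecated = all(any(lo <= cp <= hi for lo, hi in deprecated)
                           for cp in range(0xE0020, 0xE007F))
    print("tag characters still deprecated:", still_deprecated)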
From doug at ewellic.org  Wed May 27 14:06:26 2015
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 27 May 2015 12:06:26 -0700
Subject: RE: Tag characters
Message-ID: <20150527120626.665a7a7059d7ee80bb4d670165c8327d.ff6d41f607.wbe@email03.secureserver.net>

Ken Whistler wrote:

> Read on in the minutes to the next day. 143-C27 and related actions.

Ah. Thank you. Now I understand what Steven meant by "read the minutes," too.

That's the problem with reading individual items in meeting minutes: each item is a snapshot in time, and the next day of the meeting might have brought no change, or a big change.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From mark at macchiato.com  Wed May 27 14:10:53 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Wed, 27 May 2015 21:10:53 +0200
Subject: Re: FYI: The world's languages, in 7 maps and charts
In-Reply-To: References: <281809194-1431470843-cardhu_decombobulator_blackberry.rim.net-1396854058-@b13.c4.bise6.blackberry> <000401d08d5e$a811de90$f8359bb0$@gmail.com>
Message-ID:

I think it gives a misleading picture to only include mother-language speakers, rather than all languages (at a reasonable level of fluency). Every Swiss German is fluent in High German.

Part of the problem is that it is very hard to get good data on the multiple languages that people speak (a huge number of people are fluent in more than one) and on the level of fluency in each. That alone makes it difficult to do accurate representations. That level of accuracy may not be necessary to get a general picture, but when the map purports to go into great detail...

Mark

*« Il meglio è l'inimico del bene »*

On Wed, May 27, 2015 at 4:59 PM, Denis Jacquerye wrote:

> The data used to build the infographic comes from Ethnologue.com.
> http://www.ethnologue.com/language/deu does not indicate the Standard
> German L1 population in Austria and gives a population of 727,000 Standard
> German L1 speakers in Switzerland (the difference is counted as Swiss
> German L1 speakers).

From jimbreen at gmail.com  Wed May 27 18:15:28 2015
From: jimbreen at gmail.com (Jim Breen)
Date: Thu, 28 May 2015 09:15:28 +1000
Subject: Re: FYI: The world's languages, in 7 maps and charts
Message-ID:

"Mark Davis" wrote:

>> Hmmm. How accurate can it be? They forgot Austria, and got Switzerland
>> wrong by almost a power of 10.

I was a little surprised to see only 15.6 million Australians speak English, which led me to wonder what the other 8 million of us speak. I see that the Ethnologue site they used quotes the 2006 Australian census as saying the population was 15.6 million.
I can't imagine where they got that, as that census reported the population as being just under 20 million. The 2011 census recorded the population at 21.7 million. I guess if they are prone to using inaccurate data from old sources, it explains some of the other oddities in that map.

Jim Breen

--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University

From mark at kli.org  Wed May 27 18:41:39 2015
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 27 May 2015 19:41:39 -0400
Subject: Re: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost>
Message-ID: <55665633.8040503@kli.org>

I think I've figured out the philosophy WJGO is trying to follow here. "We should have a way to encode graphics in Unicode." "We should have a way to encode programming instructions in Unicode." How about "We should have a way to encode sound-waves in Unicode"? Or "We should have a way to encode *moving* graphics, maybe with sound, in Unicode"? Now, he didn't say the last two, in fairness to him. But I think that's the thinking.

WJGO, not *everything* computers do has to be part of Unicode. Doing so essentially makes *everything* that wants to support "Unicode" have to be... well, pretty much *everything* all other computers are. We have graphics formats that encode graphics; they're *good* at it. They're made for it. We have sound formats for encoding sounds. We have various bytecodes for programming -- different ones, written by different people, that do things in different ways, because one size does not fit all. Unicode can't be the one size. It was never intended to. Don't make Unicode into an operating system, or worse, THE operating system. It's a character encoding. For encoding characters.

~mark

On 05/27/2015 12:26 PM, William_J_G Overington wrote:

> This document suggests a way to use the method of a base character
> together with tag characters to produce a graphic. The approach is
> theoretical and has not, at this time, been tried in practice.

From srl at icu-project.org  Wed May 27 22:04:21 2015
From: srl at icu-project.org (Steven R. Loomis)
Date: Wed, 27 May 2015 22:04:21 -0500
Subject: Re: Tag characters
In-Reply-To: <20150527120626.665a7a7059d7ee80bb4d670165c8327d.ff6d41f607.wbe@email03.secureserver.net> References: <20150527120626.665a7a7059d7ee80bb4d670165c8327d.ff6d41f607.wbe@email03.secureserver.net>
Message-ID: <5AC7D996-8BA2-4F31-9BD8-5B8B18026C96@icu-project.org>

Thanks, Ken; and yes, Doug: http://www.unicode.org/L2/L2015/15107.htm#143-C27 was the reference I was looking for when I wrote my too-brief reply earlier. My apologies.

S

Sent from our iPhone.

> On May 27, 2015, at 2:06 PM, Doug Ewell wrote:
>
> Ah. Thank you. Now I understand what Steven meant by "read the minutes,"
> too.
From wjgo_10009 at btinternet.com  Thu May 28 06:50:09 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 28 May 2015 12:50:09 +0100 (BST)
Subject: Re: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <55665633.8040503@kli.org> References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <55665633.8040503@kli.org>
Message-ID: <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost>

Responding to Mark E. Shoulson:

The big advantage of this new format is that the result is an unambiguous Unicode plain text file, and it could be placed within a file of plain text without having to make the whole document a markup file in some format. Plain text is the key advantage.

The following may be useful as a guide to the original problem that I am trying to solve.

http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term

I tried to apply the brilliant new "base character followed by tag characters" format to the problem.

In the future, maybe Serif DrawPlus will have the ability to export a picture to this new format.

William Overington

28 May 2015

From idou747 at gmail.com  Wed May 27 23:48:23 2015
From: idou747 at gmail.com (Chris)
Date: Thu, 28 May 2015 14:48:23 +1000
Subject: Arrow dingbats
Message-ID:

Unicode has the arrow dingbats ⬅⬆⬇⬈⬉⬊⬋ in the range 2B05, with names like "LEFTWARDS BLACK ARROW". Conspicuously missing is the right arrow.

The closest one can find is 27A1 "BLACK RIGHT ARROW" ➡. But everywhere I can see that has this arrow, it looks a lot different to the other arrows, with a narrower body and head.

Whose fault is this, and who will fix it?

From doug at ewellic.org  Thu May 28 09:53:42 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 28 May 2015 07:53:42 -0700
Subject: "Unicode of Death"
Message-ID: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>

Unicode is in the news today as some folks with waaay too much time on their hands have discovered a string consisting of Latin, Arabic, Devanagari, and CJK characters that crashes Apple devices when it appears as a pop-up message.

Although most people seem to identify it correctly as a CoreText bug, there are a handful, as you might expect, who attribute it to some shady weirdness in Unicode itself. My favorite quote from a Reddit user was this:

"Every character you use has a unicode value which tells your phone what to display. One of the unicode values is actually never-ending and so when the phone tries to read it it goes into an infinite loop which crashes it."

I've read TUS Chapter 4 and UTR #23 and I still can't find the "never-ending" Unicode property.

Perhaps astonishingly to some, the string displays fine on all my Windows devices. Not all apps get the directionality right, but no crashes.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From doug at ewellic.org  Thu May 28 10:03:41 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 28 May 2015 08:03:41 -0700
Subject: RE: Arrow dingbats
Message-ID: <20150528080341.665a7a7059d7ee80bb4d670165c8327d.212293419c.wbe@email03.secureserver.net>

Chris wrote:

> The closest one can find is 27A1 "BLACK RIGHT ARROW" ➡. But everywhere
> I can see that has this arrow, it looks a lot different to the other
> arrows, with a narrower body and head.
>
> Whose fault is this, and who will fix it?

U+2B95 RIGHTWARDS BLACK ARROW ⮕ might be a better fit.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????
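For reference, the formal UCD names of the arrows under discussion can be listed programmatically. A small sketch (it requires a Python build whose unicodedata tables are at least Unicode 7.0, since U+2B95 was only added then); the name asymmetry between 27A1 and 2B95 is exactly the point of the thread:

    import unicodedata

    # The black arrows discussed in this thread: the U+2B05..U+2B07 set,
    # the dingbat at U+27A1, and the Unicode 7.0 addition U+2B95.
    for cp in (0x2B05, 0x2B06, 0x2B07, 0x27A1, 0x2B95):
        print(f"U+{cp:04X}  {chr(cp)}  {unicodedata.name(chr(cp))}")

    # U+2B05  ⬅  LEFTWARDS BLACK ARROW
    # U+2B06  ⬆  UPWARDS BLACK ARROW
    # U+2B07  ⬇  DOWNWARDS BLACK ARROW
    # U+27A1  ➡  BLACK RIGHTWARDS ARROW
    # U+2B95  ⮕  RIGHTWARDS BLACK ARROW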
From boldewyn at gmail.com  Thu May 28 10:34:53 2015
From: boldewyn at gmail.com (Manuel Strehl)
Date: Thu, 28 May 2015 17:34:53 +0200
Subject: Re: Arrow dingbats
In-Reply-To: <20150528080341.665a7a7059d7ee80bb4d670165c8327d.212293419c.wbe@email03.secureserver.net> References: <20150528080341.665a7a7059d7ee80bb4d670165c8327d.212293419c.wbe@email03.secureserver.net>
Message-ID:

Interesting! Out of curiosity: how come this was recognized in Unicode 7? Is that documented anywhere?

2015-05-28 17:03 GMT+02:00 Doug Ewell :

> U+2B95 RIGHTWARDS BLACK ARROW ⮕ might be a better fit.

From timothy at greenwood.name  Thu May 28 10:47:10 2015
From: timothy at greenwood.name (Tim Greenwood)
Date: Thu, 28 May 2015 15:47:10 +0000
Subject: Re: "Unicode of Death"
In-Reply-To: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net> References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID:

Must be that same evil Unicode consortium that is destroying civilization by inventing emoji. The Guardian article has been edited since yesterday, when it did actually claim that Unicode invented all emoji.

http://gu.com/p/4997q

On Thu, May 28, 2015 at 11:04 AM Doug Ewell wrote:

> Unicode is in the news today as some folks with waaay too much time on
> their hands have discovered a string consisting of Latin, Arabic,
> Devanagari, and CJK characters that crashes Apple devices when it
> appears as a pop-up message.
From shervinafshar at gmail.com  Thu May 28 11:06:01 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Thu, 28 May 2015 09:06:01 -0700
Subject: Re: "Unicode of Death"
In-Reply-To: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net> References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID:

> Unicode is in the news today as some folks with waaay too much time on
> their hands have discovered a string consisting of Latin, Arabic,
> Devanagari, and CJK characters that crashes Apple devices when it
> appears as a pop-up message.

We should be thankful to those folks with "waaay too much time on their hands" for discovering these for us all.

> Although most people seem to identify it correctly as a CoreText bug,

Any good technical write-up about this?

– Shervin

From verdy_p at wanadoo.fr  Thu May 28 11:12:25 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 18:12:25 +0200
Subject: Re: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost> References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <55665633.8040503@kli.org> <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost>
Message-ID:

There's no advantage, because what you want to create is effectively another markup language with its own syntax, but requiring new obscure characters that most applications and users will not be able to interpret and render correctly in the way intended by you, and still leaving out many things that are specific needs for images (e.g. colorimetry profiles, aspect ratio of pixels in bitmaps, undesired effects that must be controlled, such as moiré artefacts).

You don't need new characters to create a markup language and its syntax. Today the world goes very well with HTML(5), which is now the best markup language for documents (including for inserting embedded images that don't require any external request, or embedding special effects on images, such as animation or dynamic layouts for adapting the document to the rendering device, with the help of CSS and JavaScript, which are also embeddable).

At least with HTML5 they don't try to reinvent the image formats, and there's ample space for supporting multiple image formats tuned for specific needs (e.g.
JPEG, PNG, GIF, SVG, TIFF...), including animation and video, and synchronization of images and audio in time for videos, or with user interactions. These are designed separately and benefit from patient research made over a long time (your desired format, still undocumented, is largely below the level needed for images, independently of the markup syntax you want to create to support them, and independently of the fact that you also want to encode these syntactic elements with new characters, something that is absolutely not needed for any markup language).

In summary, you are reinventing the wheel.

2015-05-28 13:50 GMT+02:00 William_J_G Overington :

> The big advantage of this new format is that the result is an unambiguous
> Unicode plain text file, and it could be placed within a file of plain text
> without having to make the whole document a markup file in some format.

From doug at ewellic.org  Thu May 28 11:16:31 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 28 May 2015 09:16:31 -0700
Subject: RE: Arrow dingbats
Message-ID: <20150528091631.665a7a7059d7ee80bb4d670165c8327d.c67f478dad.wbe@email03.secureserver.net>

Manuel Strehl wrote:

> Interesting! Out of curiosity: How come this was recognized in Unicode
> 7? Is that documented anywhere?

NamesList.txt contains this entry for the left arrow:

2B05	LEFTWARDS BLACK ARROW
	x (black rightwards arrow - 27A1)
	x (rightwards black arrow - 2B95)

I don't know how U+2B95 came to be encoded in 7.0 when all of the similar U+2B0x arrows had been in place since 4.0. Presumably, before then, it was felt that U+27A1 was an appropriate fit, though as Chris idou747 pointed out, not all fonts show perfect symmetry here.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From doug at ewellic.org  Thu May 28 11:18:16 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 28 May 2015 09:18:16 -0700
Subject: RE: "Unicode of Death"
Message-ID: <20150528091816.665a7a7059d7ee80bb4d670165c8327d.955113905b.wbe@email03.secureserver.net>

Shervin Afshar wrote:

> Any good technical write-up about this?

Haven't seen one yet. Just a lot of "OMG, look at this" so far.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From shervinafshar at gmail.com  Thu May 28 12:13:16 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Thu, 28 May 2015 10:13:16 -0700
Subject: Re: "Unicode of Death"
In-Reply-To: <20150528091816.665a7a7059d7ee80bb4d670165c8327d.955113905b.wbe@email03.secureserver.net> References: <20150528091816.665a7a7059d7ee80bb4d670165c8327d.955113905b.wbe@email03.secureserver.net>
Message-ID:

I'm no iOS dev, but it seems like CoreText is trying[1] to truncate text for SpringBoard (to shorten it with ellipses to fit the notification box) and it crashes and burns with a segmentation fault[2].

FWIW, Reddit abides[3][4] and reacts with "Unicode Suppressor"[5]... heh... as if!

[1]: http://pastebin.com/cQyQE7Ws
[2]: http://stackoverflow.com/questions/12601286/i-am-getting-a-lot-of-sigsegv-exception-in-my-ios-app-crash-report-and-that-too
[3]: https://www.reddit.com/r/apple/comments/37e8c1/malicious_text_message/crm4h4x
[4]: http://www.reddit.com/r/iphone/comments/37eaxs/um_can_someone_explain_this_phenomenon/crm3adg
[5]: https://www.myrepospace.com/profile/effective/688319/Unicode_Suppresor

– Shervin
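The general hazard behind a bug like this is truncating a string at an arbitrary code point, which can cut a combining sequence in half. A minimal sketch of a safer ellipsis truncation (illustrative only, not Apple's code; a real implementation should truncate at UAX #29 grapheme cluster boundaries, for example with ICU's BreakIterator):

    import unicodedata

    def truncate(text, limit):
        """Truncate to at most `limit` code points plus an ellipsis,
        backing up so no combining marks are stranded at the cut."""
        if len(text) <= limit:
            return text
        cut = limit
        # General category 'M*' = combining marks; don't separate them
        # from their base character.
        while cut > 0 and unicodedata.category(text[cut]).startswith("M"):
            cut -= 1
        return text[:cut] + "\u2026"

    s = "ni\u0303o"        # "niño" spelled with a combining tilde
    print(truncate(s, 2))  # "n…", not "ni…" plus a stranded tilde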
From andrewcwest at gmail.com  Thu May 28 14:13:02 2015
From: andrewcwest at gmail.com (Andrew West)
Date: Thu, 28 May 2015 20:13:02 +0100
Subject: Re: Arrow dingbats
Message-ID:

On 28 May 2015 at 05:48, Chris wrote:

> Unicode has the arrow dingbats ⬅⬆⬇⬈⬉⬊⬋ in the range 2B05, with names
> like "LEFTWARDS BLACK ARROW". Conspicuously missing is the right arrow.
>
> Whose fault is this,

The three left/up/downwards black arrows were added at the request of North Korea, so I guess you can blame Kim Jong-Il for the missing rightwards arrow... perhaps the North Korean army never went to the right.

> and who will fix it?

It was fixed in Unicode 7.0 last year with the addition of U+2B95 RIGHTWARDS BLACK ARROW. Of course, it may not be fixed for you and other users unless you have a font installed that supports all the arrows in a consistent style.

I don't know why the character was added in 7.0, but it may have been prompted by the same question as yours that was asked on this list in 2013.

Andrew

From verdy_p at wanadoo.fr  Thu May 28 14:46:55 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 21:46:55 +0200
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

Is there a symbol that can represent the "Bunny hill" symbol used in North America and some other American territories with mountains, to designate the ski pistes open to novice skiers (those pistes are signaled with green signs in Europe)?

I'm looking for the symbol itself, not the color, or the form of the sign.

For example, blue pistes in Europe are signaled by a green circle in America, but we have a symbol for the circle; red pistes in Europe are signaled by a blue square in America, but we have a symbol for the square; black pistes in Europe are signaled by a black diamond in America, but we also have such a "black" diamond in Unicode.

But I can't find an equivalent to the American "Bunny hill" signal, equivalent to green pistes in Europe (this is a problem for webpages related to skiing: do we have to embed an image?).

From shervinafshar at gmail.com  Thu May 28 14:59:55 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Thu, 28 May 2015 12:59:55 -0700
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

Single and double diamond?
https://bbliss176.files.wordpress.com/2011/02/symbols2_jpg.jpg
http://1.bp.blogspot.com/_2Rc9ifOGLYg/TO5fF0XNTSI/AAAAAAAAIxE/RJPvVDD6gLM/s1600/caution-double-black-diamond.jpg
http://thumbs.dreamstime.com/z/double-black-diamond-sign-legend-ski-slopes-map-40955860.jpg

– Shervin

On Thu, May 28, 2015 at 12:46 PM, Philippe Verdy wrote:

> Is there a symbol that can represent the "Bunny hill" symbol used in North
> America and some other American territories with mountains, to designate
> the ski pistes open to novice skiers (those pistes are signaled with green
> signs in Europe)?

From leoboiko at namakajiri.net  Thu May 28 15:02:07 2015
From: leoboiko at namakajiri.net (Leonardo Boiko)
Date: Thu, 28 May 2015 17:02:07 -0300
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

You could use U+1F407 RABBIT combined with U+20E4 COMBINING ENCLOSING UPWARD POINTING TRIANGLE, and pretend the triangle is a hill: 🐇⃤

If only we had a combining rabbit, we could add rabbits to U+1F3D4 SNOW CAPPED MOUNTAIN. Or anything else.

2015-05-28 16:46 GMT-03:00 Philippe Verdy :

> But I can't find an equivalent to the American "Bunny hill" signal,
> equivalent to green pistes in Europe (this is a problem for webpages
> related to skiing: do we have to embed an image?).

From Shawn.Steele at microsoft.com  Thu May 28 15:04:11 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Thu, 28 May 2015 20:04:11 +0000
Subject: RE: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

So is double black diamond a separate symbol? Or just two of the black diamond?

And Blue-Black?

I'm drawing a blank on a specific bunny sign; in my experience those are usually just green.

Aren't there a lot of cartography symbols for various systems that aren't present in Unicode?
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy
Sent: Thursday, May 28, 2015 12:47 PM
To: unicode Unicode Discussion
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices

Is there a symbol that can represent the "Bunny hill" symbol used in North America and some other American territories with mountains, to designate the ski pistes open to novice skiers (those pistes are signaled with green signs in Europe)?

From verdy_p at wanadoo.fr  Thu May 28 15:03:43 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 22:03:43 +0200
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

Well, also these symbols, if you want (these are not really "diamonds"), but the wordpress page forgets the "bunny hill". It starts only with the green circle (in fact a black disc colored in green), which maps to blue pistes in Europe.

2015-05-28 21:59 GMT+02:00 Shervin Afshar :

> Single and double diamond?
>
> https://bbliss176.files.wordpress.com/2011/02/symbols2_jpg.jpg
From verdy_p at wanadoo.fr  Thu May 28 15:10:23 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 22:10:23 +0200
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

A single "black diamond" symbol would be sufficient, I think (in fact a black square rotated 45°, not the same as the symbol from card decks, which typically has borders rounded inward).

The effective color does not really matter here; it can be generated by styling the text, something necessary anyway with the European piste colors, which don't use any specific symbol, but signs that are most frequently circular, or sometimes shaped as squares or "diamonds". So for the "black diamond" it just means that this is a symbol fully filled with the text color (like other Unicode characters named "BLACK").

2015-05-28 22:04 GMT+02:00 Shawn Steele :

> So is double black diamond a separate symbol? Or just two of the black
> diamond?

From verdy_p at wanadoo.fr  Thu May 28 15:11:26 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 22:11:26 +0200
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

Very poor suggestion, I think. This is a single symbol by itself.

2015-05-28 22:02 GMT+02:00 Leonardo Boiko :

> You could use U+1F407 RABBIT combined with U+20E4 COMBINING ENCLOSING
> UPWARD POINTING TRIANGLE, and pretend the triangle is a hill: 🐇⃤
From shervinafshar at gmail.com  Thu May 28 15:11:12 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Thu, 28 May 2015 13:11:12 -0700
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

Well... to pick the nit, these shapes are rhombi, known colloquially as "diamonds".

So what's the symbol for "bunny hill" in Europe?

– Shervin

On Thu, May 28, 2015 at 1:03 PM, Philippe Verdy wrote:

> Well, also these symbols, if you want (these are not really "diamonds"),
> but the wordpress page forgets the "bunny hill".

From verdy_p at wanadoo.fr  Thu May 28 15:16:01 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 22:16:01 +0200
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

I said it: there's no symbol in Europe for pistes, just colors. The American "Bunny hill" maps to "green" pistes in Europe (the European piste colors are used also for drawing their ways on maps, not just found in signage).
Piste signs are typically all the same shape in the same station (most often discs), and the text on them (if present) shows the name or number of the piste in the station, or just an arrow showing the direction to follow.

2015-05-28 22:11 GMT+02:00 Shervin Afshar :

> Well... to pick the nit, these shapes are rhombi, known colloquially as
> "diamonds".
>
> So what's the symbol for "bunny hill" in Europe?

From shervinafshar at gmail.com  Thu May 28 15:25:02 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Thu, 28 May 2015 13:25:02 -0700
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

Makes sense. But it doesn't seem like we need any new symbols. I think one of these should do for hard and extra-hard slopes:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Aname%3D%2FDIAMOND%2F%3A%5D&g=

Also, I'm not at all against making use of the actual 🐇 we have. I will not hold my breath for a combining rabbit symbol, though.

– Shervin

On Thu, May 28, 2015 at 1:16 PM, Philippe Verdy wrote:

> I said it: there's no symbol in Europe for pistes, just colors. The
> American "Bunny hill" maps to "green" pistes in Europe.
>> >> So what's the symbol for "bunny hill" in Europe? >> >> ? Shervin >> >> On Thu, May 28, 2015 at 1:03 PM, Philippe Verdy >> wrote: >> >>> Well also these symbols, if you want (these are not really "diamonds"), >>> but the wordpress page forgets the "bunny hill". It starts only with the >>> green circle (in fact a black disc colored in green) which maps to blue >>> pistes in Europe. >>> >>> 2015-05-28 21:59 GMT+02:00 Shervin Afshar : >>> >>>> Single and double diamond? >>>> >>>> https://bbliss176.files.wordpress.com/2011/02/symbols2_jpg.jpg >>>> >>>> http://1.bp.blogspot.com/_2Rc9ifOGLYg/TO5fF0XNTSI/AAAAAAAAIxE/RJPvVDD6gLM/s1600/caution-double-black-diamond.jpg >>>> >>>> http://thumbs.dreamstime.com/z/double-black-diamond-sign-legend-ski-slopes-map-40955860.jpg >>>> >>>> >>>> ? Shervin >>>> >>>> On Thu, May 28, 2015 at 12:46 PM, Philippe Verdy >>>> wrote: >>>> >>>>> Is there a symbol that can represent the "Bunny hill" symbol used in >>>>> North America and some other American territories with mountains, to >>>>> designate the ski pistes open to novice skiers (those pistes are signaled >>>>> with green signs in Europe). >>>>> >>>>> I'm looking for the symbol itself, not the color, or the form of the >>>>> sign. >>>>> >>>>> For example blue pistes in Europe are designed with a green circle in >>>>> America, but we have a symbol for the circle; red pistes in Europe are >>>>> signaled by a blue square in America, but we have a symbol for the square; >>>>> black pistes in Europe are signaled by a black diamond in America, but we >>>>> also have such "black" diamond in Unicode. >>>>> >>>>> But I can't find an equivalent to the American "Bunny hill" signal, >>>>> equivalent to green pistes in Europe (this is a problem for webpages >>>>> related to skiing: do we have to embed an image ?). >>>>> >>>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f407.png Type: image/png Size: 1902 bytes Desc: not available URL: From leoboiko at namakajiri.net Thu May 28 15:33:40 2015 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Thu, 28 May 2015 17:33:40 -0300 Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices In-Reply-To: References: Message-ID: Serious question: Has someone discussed a generic combining mechanism? I mean, characters with an effect like "combine the last two". Say, '!' + '?' + COMBINING OVERLAY = '?'. '!' + '!' + COMBINING SIDE BY SIDE = '?', and so on. Similar in spirit to the Ideographic Description Characters, but meant to actually tell the rendering system to combine stuff. 2015-05-28 17:25 GMT-03:00 Shervin Afshar : > Makes sense. But it doesn't seem like we need any new symbols. I think one > of these should do for hard and extra-hard slopes: > > > http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Aname%3D%2FDIAMOND%2F%3A%5D&g= > > Also, I'm not at all against making use of the actual [image: ??]we have. > I will not hold my breath for a combining rabbit symbol though. > > ? Shervin > > On Thu, May 28, 2015 at 1:16 PM, Philippe Verdy > wrote: > >> I saif it: there's no symbol in Europe for pistes, just colors. The >> American "Bunny hill" maps to "green" pistes in Europe. >> (the European piste colors are used also for drawing their ways on maps, >> not just found in signages). 
From doug at ewellic.org  Thu May 28 15:44:22 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 28 May 2015 13:44:22 -0700
Subject: RE: Arrow dingbats
Message-ID: <20150528134422.665a7a7059d7ee80bb4d670165c8327d.cf04b7950e.wbe@email03.secureserver.net>

Andrew West wrote:

> I don't know why the character was added in 7.0, but it may have been
> prompted by the same question as yours that was asked on this list in
> 2013.

And the answer, from Michel Suignard in http://www.unicode.org/mail-arch/unicode-ml/y2013-m10/0079.html :

> Rejoice!
> Added in 2B95 in Unicode 7.0
>
> (was added when the Wingdings set was added with Amendment 1 of
> 10646:2012, part of the target set for 7.0)

--
Doug Ewell | http://ewellic.org | Thornton, CO ????
From doug at ewellic.org  Thu May 28 15:59:03 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 28 May 2015 13:59:03 -0700
Subject: RE: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID: <20150528135903.665a7a7059d7ee80bb4d670165c8327d.67f49d0e57.wbe@email03.secureserver.net>

http://www.signsofthemountains.com/what-do-the-symbols-on-ski-trail-signs-mean-d/
http://news.outdoortechnology.com/2015/02/04/ski-slope-rating-symbols-mean-really-mean/

Looks like a green circle is the symbol for a beginner slope. (The first link also shows that "piste" is the European word for what we call a trail, run, or slope.) There is no difference between a "bunny slope" and a "beginner" or "novice" slope.

Unicode has some suitable filled circles (particularly U+2B24 and U+25CF), and it has a green apple, heart, and book, but as yet no green circle.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From verdy_p at wanadoo.fr  Thu May 28 16:00:29 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 23:00:29 +0200
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

What you'd like is in fact similar to the zero-width joiner, between two combining sequences, to make them overlap. A sort of "negative-width" joiner that we could call an "overlay joiner". So '!' + OVERLAY JOINER + '?' = '‽'.

But in legacy charsets, this role was encoded as a BACKSPACE control (it was used to produce combining accents as well, by combining a letter and a *spacing* accent), and I think it is still a solution for the same problem without needing a new character. So '!' + BACKSPACE + '?' = '‽'.

2015-05-28 22:33 GMT+02:00 Leonardo Boiko :

> Serious question: has someone discussed a generic combining mechanism? I
> mean, characters with an effect like "combine the last two".
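The legacy overstrike convention is easy to illustrate. A hypothetical post-processor sketch (the overstrike table here is invented for the example, with only the '!' over '?' pair mapped to U+203D INTERROBANG):

    # Legacy line-printer convention: X BACKSPACE Y meant "print Y over X".
    # This hypothetical post-processor folds known overstrike pairs into
    # single Unicode characters.
    OVERSTRIKES = {frozenset("!?"): "\u203D"}   # '!' over '?' = INTERROBANG

    def resolve_overstrikes(text):
        out = []
        i = 0
        while i < len(text):
            if text[i] == "\b" and out and i + 1 < len(text):
                base = out.pop()
                pair = frozenset((base, text[i + 1]))
                out.append(OVERSTRIKES.get(pair, base))
                i += 2
            else:
                out.append(text[i])
                i += 1
        return "".join(out)

    print(resolve_overstrikes("!\b?"))   # prints: ‽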
From billposer2 at gmail.com Thu May 28 16:01:55 2015
From: billposer2 at gmail.com (Bill Poser)
Date: Thu, 28 May 2015 14:01:55 -0700
Subject: "Unicode of Death"
In-Reply-To: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID:

No doubt the evil Unicode Consortium is in league with the Trilateral
Commission, the Elders of Zion, and the folks at NASA who faked the moon
landing.... :)

On Thu, May 28, 2015 at 7:53 AM, Doug Ewell wrote:

> Unicode is in the news today as some folks with waaay too much time on
> their hands have discovered a string consisting of Latin, Arabic,
> Devanagari, and CJK characters that crashes Apple devices when it
> appears as a pop-up message.
>
> Although most people seem to identify it correctly as a CoreText bug,
> there are a handful, as you might expect, who attribute it to some shady
> weirdness in Unicode itself. My favorite quote from a Reddit user was
> this:
>
> "Every character you use has a unicode value which tells your phone what
> to display. One of the unicode values is actually never-ending and so
> when the phone tries to read it it goes into an infinite loop which
> crashes it."
>
> I've read TUS Chapter 4 and UTR #23 and I still can't find the
> "never-ending" Unicode property.
>
> Perhaps astonishingly to some, the string displays fine on all my
> Windows devices. Not all apps get the directionality right, but no
> crashes.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO

From Shawn.Steele at microsoft.com Thu May 28 16:07:11 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Thu, 28 May 2015 21:07:11 +0000
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To: <55677762.3060805@oracle.com>
References: <55677762.3060805@oracle.com>
Message-ID:

I'm wondering if it's a regional thing; I haven't seen it, at least in
the mostly-western parts of North America. An east coast thing?

From: Jim Melton [mailto:jim.melton at oracle.com]
Sent: Thursday, May 28, 2015 1:16 PM
To: Shawn Steele
Cc: verdy_p at wanadoo.fr; unicode Unicode Discussion
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices

I no longer ski, but I did so for many years, mostly (but not exclusively)
in the western United States. I never encountered, at any USA ski
hill/mountain/resort, a special symbol for "bunny hills", which are
typically represented by the green circle meaning "beginner". That's
anecdotal evidence at best, but my observations cover numerous skiing
sites. I have encountered such a symbol in Europe and in New Zealand, but
not in the USA. (I have not had the pleasure of skiing in Canada and am
thus unable to speak about ski areas in that country.)

The double black diamond would appear to be a unique symbol worthy of
encoding, simply because the only valid typographical representation (in
the USA) is two single black diamonds stacked one above the other and
touching at the points.

Hope this helps,
Jim

On 5/28/2015 2:04 PM, Shawn Steele wrote:

So is double black diamond a separate symbol? Or just two of the black
diamond?

And Blue-Black?

I'm drawing a blank on a specific bunny sign; in my experience those are
usually just green.

Aren't there a lot of cartography symbols for various systems that aren't
present in Unicode?

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy
Sent: Thursday, May 28, 2015 12:47 PM
To: unicode Unicode Discussion
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices

Is there a symbol that can represent the "Bunny hill" symbol used in
North America and some other American territories with mountains, to
designate the ski pistes open to novice skiers (those pistes are signaled
with green signs in Europe)?

I'm looking for the symbol itself, not the color, or the form of the sign.

For example, blue pistes in Europe are designated with a green circle in
America, but we have a symbol for the circle; red pistes in Europe are
signaled by a blue square in America, but we have a symbol for the square;
black pistes in Europe are signaled by a black diamond in America, but we
also have such a "black" diamond in Unicode.

But I can't find an equivalent to the American "Bunny hill" signal,
equivalent to green pistes in Europe (this is a problem for webpages
related to skiing: do we have to embed an image?).

--
========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)    Phone: +1.801.942.0144
Chair, ISO/IEC JTC1/SC32 and W3C XML Query WG    Fax  : +1.801.942.3345
Oracle Corporation       Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive     Alternate email: jim dot melton at acm dot org
Sandy, UT 84093-1063 USA Personal email: SheltieJim at xmission dot com
========================================================================
= Facts are facts.  But any opinions expressed are the opinions       =
= only of myself and may or may not reflect the opinions of anybody   =
= else with whom I may or may not have discussed the issues at hand.  =
========================================================================

From idou747 at gmail.com Thu May 28 16:08:24 2015
From: idou747 at gmail.com (Chris)
Date: Fri, 29 May 2015 07:08:24 +1000
Subject: Arrow dingbats
In-Reply-To:
References:
Message-ID:

So it sounds like 27A1 came first. Then 2B05 etc. was added to complete
the set with 27A1, except that it didn't complete the set, because nobody
aligned the glyphs. Then they added U+2B95 in a second attempt to complete
the set? (Why not just fix the old arrow?) Except that nobody seems to
have U+2B95 aligned either. On unicode-table.com it looks totally
different, and Mac doesn't even have it.

Is there any hope this will actually fix it? Has the Unicode Consortium
made it clear to one and all that U+2B95 is supposed to align?

> On 29 May 2015, at 5:13 am, Andrew West wrote:
>
> On 28 May 2015 at 05:48, Chris wrote:
>>
>> Unicode has the arrow dingbats in the range 2B05 with names like
>> 'LEFTWARDS BLACK ARROW'; conspicuously missing is the right arrow.
>>
>> But everywhere I can see that has this arrow, it looks a lot different
>> to the other arrows, with a narrower body and head.
>>
>> Whose fault is this,
>
> The three left/up/downwards black arrows were added at the request of
> North Korea, so I guess you can blame Kim Jong-Il for the missing
> rightwards arrow ... perhaps the North Korean army never went to the
> right.
>
>> and who will fix it?
>
> It was fixed in Unicode 7.0 last year with the addition of U+2B95
> RIGHTWARDS BLACK ARROW. Of course, it may not be fixed for you and
> other users unless you have a font installed that supports all the
> arrows in a consistent style.
>
> I don't know why the character was added in 7.0, but it may have been
> prompted by the same question as yours that was asked on this list in
> 2013.
>
> Andrew

From verdy_p at wanadoo.fr Thu May 28 16:11:49 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 23:11:49 +0200
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To: <55677762.3060805@oracle.com>
References: <55677762.3060805@oracle.com>
Message-ID:

Some documentation also suggests that the two diamonds are not stacked one
above the other, but placed horizontally. It's a good point for using only
one symbol, encoding it twice in plain text if needed.
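That plain-text approach is easy to demonstrate with existing characters.
In the Python sketch below, the specific code-point choices are
illustrative only (Doug Ewell's U+2B24 would serve as well as U+25CF for
the circle); the double black diamond is simply the single diamond
encoded twice:

    # Plain-text stand-ins for the North American trail-rating symbols;
    # the color (green/blue/black) is left to styling, per this thread.
    RATINGS = {
        "beginner":     "\u25cf",        # U+25CF BLACK CIRCLE
        "intermediate": "\u25a0",        # U+25A0 BLACK SQUARE
        "advanced":     "\u25c6",        # U+25C6 BLACK DIAMOND
        "expert":       "\u25c6\u25c6",  # the diamond, encoded twice
    }
    for level, symbol in RATINGS.items():
        print(f"{symbol}\t{level}")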
From Shawn.Steele at microsoft.com Thu May 28 16:15:13 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Thu, 28 May 2015 21:15:13 +0000
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To:
References: <55677762.3060805@oracle.com>
Message-ID:

I'm used to them being next to each other. So the entire discussion seems
to be about how to encode a concept vs. how to get the shape you want with
existing code points. If you just want the perfect shape, then maybe an
SVG is a better choice. If we're talking about describing ski-run
difficulty levels in plain text, then the hodge-podge of glyphs being
offered in this thread seems kinda hacky to me.

-Shawn

From verdy_p at wanadoo.fr Thu May 28 16:16:35 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 23:16:35 +0200
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To: <20150528135903.665a7a7059d7ee80bb4d670165c8327d.67f49d0e57.wbe@email03.secureserver.net>
References: <20150528135903.665a7a7059d7ee80bb4d670165c8327d.67f49d0e57.wbe@email03.secureserver.net>
Message-ID:

The "green" physical color does not need encoding. A black disc is enough,
just like the black square and the black diamond/rhombus; the rest is
styling. There's also the orange oval (horizontal) used for free-ride
areas in America. (In Europe there is no symbol, but the yellow color is
used for some authorized "free-ride" pistes in Switzerland; in France,
free-riding is strictly regulated, and there's no signage, as these areas
are not open to the general public: they are too risky, and such signs
could bring too many skiers to dangerous areas without proper training
and equipment.)
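The division of labor Philippe describes, where the character supplies
only the shape and the styling supplies the color, would look something
like this on a skiing web page. A minimal Python sketch; the HTML and the
class name are invented for illustration:

    # Emit a trail-rating marker for a web page: the character data is just
    # a black disc; the green comes from CSS.
    def rating_marker(symbol: str, color: str, label: str) -> str:
        return f'<span class="piste" style="color:{color}">{symbol}</span> {label}'

    print(rating_marker("\u25cf", "green", "beginner (bunny hill)"))
    # prints: <span class="piste" style="color:green">●</span> beginner (bunny hill)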
From shervinafshar at gmail.com Thu May 28 16:20:17 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Thu, 28 May 2015 14:20:17 -0700
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To:
References: <55677762.3060805@oracle.com>
Message-ID:

Since the double diamond has map and map-legend usage, it might be a good
idea to have it encoded separately. I know that I'm stating the obvious
here, but the important point is doing the research and showing that it
has widespread usage.

– Shervin
From verdy_p at wanadoo.fr Thu May 28 16:26:00 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 23:26:00 +0200
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To: <20150528135903.665a7a7059d7ee80bb4d670165c8327d.67f49d0e57.wbe@email03.secureserver.net>
References: <20150528135903.665a7a7059d7ee80bb4d670165c8327d.67f49d0e57.wbe@email03.secureserver.net>
Message-ID:

2015-05-28 22:59 GMT+02:00 Doug Ewell :

> Looks like a green circle is the symbol for a beginner slope. (The first
> link also shows that "piste" is the European word for what we call a
> trail, run, or slope.) There is no difference between a "bunny slope"
> and a "beginner" or "novice" slope.

The difference is obvious in Europe, where the "novice" difficulty is
marked by green pistes (slopes below 30%, or almost flat) and the
"beginner/moderate" difficulty is marked by blue pistes (slopes of about
30-35%).

Even America must have this "novice" difficulty, with areas mostly used by
young children (with their parents not skiing but following them on foot,
and a restriction on speed); these areas are protected so that other
skiers will not pass through them. In fact, if you remain in these novice
areas you cannot reach any speed that could cause dangerous collisions:
you have to "push" to advance, otherwise you'll slow down naturally and
stop on the snow.

These areas can be used by walkers, and by hikers on snowshoes
("raquettes").

From lang.support at gmail.com Thu May 28 16:36:32 2015
From: lang.support at gmail.com (Andrew Cunningham)
Date: Fri, 29 May 2015 07:36:32 +1000
Subject: "Unicode of Death"
In-Reply-To:
References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID:

Not the first time Unicode crashes things. There was the Google Chrome bug
on OS X that crashed the tab for any Syriac text.

A.

--
Andrew Cunningham
Project Manager, Research and Development (Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunningham at slv.vic.gov.au
       lang.support at gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/

From Shawn.Steele at microsoft.com Thu May 28 16:44:58 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Thu, 28 May 2015 21:44:58 +0000
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To:
References: <20150528135903.665a7a7059d7ee80bb4d670165c8327d.67f49d0e57.wbe@email03.secureserver.net>
Message-ID:

Typically we have "slow" zones, which include both "novice" areas and
congested areas. Additionally, the "novice" part of a slope often has a
rope fence delineating it from the rest of the slope. However, on the
maps, etc., it's usually just off to the side of a green run and doesn't
have a special symbol.
From leob at mailcom.com Thu May 28 16:56:39 2015
From: leob at mailcom.com (Leo Broukhis)
Date: Thu, 28 May 2015 14:56:39 -0700
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To:
References: <55677762.3060805@oracle.com>
Message-ID:

Being used in maps and map legends is not a sufficient condition for
encoding a symbol. If it were, all symbols used in physical maps would
have been encoded, including each and every mineral and rare metal.

Leo

From verdy_p at wanadoo.fr Thu May 28 17:00:32 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 29 May 2015 00:00:32 +0200
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To:
References: <20150528135903.665a7a7059d7ee80bb4d670165c8327d.67f49d0e57.wbe@email03.secureserver.net>
Message-ID:

The ropes (or other barriers) are also present in Europe, but those areas
are considered true "pistes" by themselves, even if they are relatively
short. In frequent cases they are connected upward to a blue piste (not
for novices), but "slow down" warnings are displayed on them, and the
regulations require taking care of every skier that could be in front of
you. Various tools are used to force skiers to slow down, including
forcing them to slalom between barriers, including flat sections or
sections going upward, and adding a large rest area around the
interconnection.

The European green pistes for novices are also relatively well separated
from the blue pistes (used by all other skiers and interconnected with
more difficult ones: red and black): if there's a blue piste, it will most
often run parallel, separated physically by barriers. This limits the
number of intersections or the need for interconnections (the only
intersection is then at the station itself, in a crowded area near the
equipment that brings skiers to the upper part of the piste).

But my initial question was about the symbol that I have seen (partly)
documented, without an actual image, for ski stations in the US. Maybe the
"bunny hill" symbol is specific to one station and not used elsewhere, or
there are other similar symbols used locally. I wonder if this is not
simply the symbol/logo of a local ski school...
From Shawn.Steele at microsoft.com Thu May 28 17:06:59 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Thu, 28 May 2015 22:06:59 +0000
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To:
References: <20150528135903.665a7a7059d7ee80bb4d670165c8327d.67f49d0e57.wbe@email03.secureserver.net>
Message-ID:

What is the image? Curiosity killed the bunny. I expect that it's limited
to a single ski area or maybe a region.

From verdy_p at wanadoo.fr Thu May 28 17:07:13 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 29 May 2015 00:07:13 +0200
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To:
References: <55677762.3060805@oracle.com>
Message-ID:

Not just maps, but documentation. Ski resorts deliver a lot of
documentation, including material explaining safety rules or promoting
their equipment. And the symbols are used on signs (the pistes themselves
are not colored; the snow is still white!). In fact, maps are the least
common use of these symbols (there are far fewer maps available), and
skiers don't have to follow a map when they practice their sport; they
follow the signs. You'll find a large map display only in stations, and
rough maps in documentation that don't show many of the details seen on
the terrain (details that constantly vary across the seasons or with the
weather conditions, so a map will not really help). It's more important
to train people in the signage they'll encounter.
From michel at suignard.com Thu May 28 17:08:09 2015
From: michel at suignard.com (Michel Suignard)
Date: Thu, 28 May 2015 22:08:09 +0000
Subject: Arrow dingbats
In-Reply-To:
References:
Message-ID:

Wingdings added way more arrows; check the 1F800-1F8FF Supplemental
Arrows-C block. In the process, many unifications happened with existing
arrows, resulting, among other things, in the addition of 2B95 and the
re-use, in the context of Wingdings, of many already encoded characters. I
wrote various documents while working on Wingdings, posted on the UTC web
site, that explain the rationale in more detail.

Obviously, when working with a posteriori unification, we sometimes have
to adjust the glyphs in the charts slightly to make the set consistent.
For example, we may use Wingdings glyphs for some characters that were
encoded before we added Wingdings. If you look at the chart page for the
block 2B00-2BFF, it is totally obvious how the set in 2B05-2B0D and 2B95
go together, and there are cross references in the names list to make that
explicit.

Glyph consistency is something I take very seriously when creating charts,
because so many people look at the chart glyphs as the reference, and
given the various sources it is not a simple matter. I use a complex mix
of fonts to get where we are now. By no means does unicode-table.com
represent a reference for these matters. How the characters get
implemented in various platforms and fonts is beyond my control, but at
least I work on having a decent reference in the official Unicode PDF
charts (and 10646).

Michel

From shervinafshar at gmail.com Thu May 28 17:21:41 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Thu, 28 May 2015 15:21:41 -0700
Subject: Encoding map symbols (was: Re: "Bunny hill" symbol...)
Message-ID:

Sufficiency of conditions for encoding is decided on a case-by-case basis
by the UTC. According to the existing criteria for encoding symbols, being
a symbol used in maps and map legends contributes to several of the
criteria in that document and strengthens the case for acceptance. Maybe
all symbols used in physical maps in the world *could* be encoded, if a
strong, compelling case can be presented for them to be used in text
environments. The fact that widely used map symbols have not been encoded
so far does not mean that a strong case cannot be made for encoding them.

Personally speaking, I'm currently researching a proposal for encoding
some of the USGS symbols as well as some other general map symbols.

– Shervin

On Thu, May 28, 2015 at 2:56 PM, Leo Broukhis wrote:

> Being used in maps and map legends is not a sufficient condition for
> encoding a symbol. If it were, all symbols used in physical maps would
> have been encoded, including each and every mineral and rare metal.
-- Shervin

On Thu, May 28, 2015 at 2:56 PM, Leo Broukhis wrote:
> Being used in maps and map legends is not a sufficient condition for
> encoding a symbol. If it were, all symbols used in physical maps would
> have been encoded, including each and every mineral and rare metal.
>
> Leo
>
> On Thu, May 28, 2015 at 2:20 PM, Shervin Afshar wrote:
> > Since the double-diamond has map and map legend usage, it might be a good
> > idea to have it encoded separately. I know that I'm stating the obvious
> > here, but the important point is doing the research and showing that it
> > has widespread usage.
> >
> > -- Shervin
> >
> > On Thu, May 28, 2015 at 2:15 PM, Shawn Steele <Shawn.Steele at microsoft.com> wrote:
> >> I'm used to them being next to each other. So the entire discussion
> >> seems to be about how to encode a concept vs how to get the shape you
> >> want with existing code points. If you just want the perfect shape,
> >> then maybe an svg is a better choice. If we're talking about describing
> >> ski-run difficulty levels in plain text, then the hodge-podge of glyphs
> >> being offered in this thread seems kinda hacky to me.
> >>
> >> -Shawn
> >>
> >> From: verdyp at gmail.com [mailto:verdyp at gmail.com] On Behalf Of Philippe Verdy
> >> Sent: Thursday, May 28, 2015 2:12 PM
> >> To: Jim Melton
> >> Cc: Shawn Steele; unicode Unicode Discussion
> >> Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
> >>
> >> Some documentation also suggests that the two diamonds are not stacked
> >> one above the other, but placed horizontally. It's a good point for
> >> using only one symbol, encoding it twice in plain text if needed.
> >>
> >> 2015-05-28 22:15 GMT+02:00 Jim Melton:
> >> I no longer ski, but I did so for many years, mostly (but not
> >> exclusively) in the western United States. I never encountered, at any
> >> USA ski hill/mountain/resort, a special symbol for "bunny hills", which
> >> are typically represented by the green circle meaning "beginner".
> >> That's anecdotal evidence at best, but my observations cover numerous
> >> skiing sites. I have encountered such a symbol in Europe and in New
> >> Zealand, but not in the USA. (I have not had the pleasure of skiing in
> >> Canada and am thus unable to speak about ski areas in that country.)
> >>
> >> The double black diamond would appear to be a unique symbol worthy of
> >> encoding, simply because the only valid typographical representation
> >> (in the USA) is two single black diamonds stacked one above the other
> >> and touching at the points.
> >>
> >> Hope this helps,
> >> Jim
> >>
> >> On 5/28/2015 2:04 PM, Shawn Steele wrote:
> >> So is double black diamond a separate symbol? Or just two of the black
> >> diamond?
> >>
> >> And Blue-Black?
> >>
> >> I'm drawing a blank on a specific bunny sign; in my experience those
> >> are usually just green.
> >>
> >> Aren't there a lot of cartography symbols for various systems that
> >> aren't present in Unicode?
> >> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mark at kli.org  Thu May 28 18:42:14 2015
From: mark at kli.org (Mark E. Shoulson)
Date: Thu, 28 May 2015 19:42:14 -0400
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost>
References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <55665633.8040503@kli.org> <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost>
Message-ID: <5567A7D6.6060102@kli.org>

As was pointed out to me, essentially what you are saying is that you reject my premise that one size does not fit all. You would prefer *everything* be in plain text, "so you wouldn't have to use other formats for it." You're essentially converting plain text into THE format for everything. But it isn't suited for that. If you really believe one size should fit all in this way, I think the problem is that pretty much all of the rest of the computer science community doesn't agree with you. Sorry.

~mark

On 05/28/2015 07:50 AM, William_J_G Overington wrote:
> Responding to Mark E. Shoulson:
>
> The big advantage of this new format is that the result is an unambiguous Unicode plain text file and could be placed within a file of plain text without having to make the whole document a markup file to some format. Plain text is the key advantage.
>
> The following may be useful as a guide to the original problem that I am trying to solve.
>
> http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term
>
> I tried to apply the brilliant new "base character followed by tag characters" format to the problem.
>
> In the future, maybe Serif DrawPlus will have the ability to export a picture to this new format.
>
> William Overington
>
> 28 May 2015

From idou747 at gmail.com  Thu May 28 21:37:25 2015
From: idou747 at gmail.com (John)
Date: Thu, 28 May 2015 19:37:25 -0700 (PDT)
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <5567A7D6.6060102@kli.org>
References: <5567A7D6.6060102@kli.org>
Message-ID: <1432867044809.9dc7c15b@Nodemailer>

"Today the world goes very well with HTML(5) which is now the best markup language for documents (including for inserting embedded images that don't require any external request)..."

If I had a large document that reused a particular character thousands of times, would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space-efficient way? Part of the reason, at least, for having any code system rather than just pixels and images is to efficiently and consistently encode data. Unicode has private use ranges of codes. I can see an argument that it would be desirable to be able to send someone text with private use ranges and have the header define some default renderings. I'm not sure that replacing a document of 100,000 characters with 100,000 embedded html5 images would achieve that.

Mark E. Shoulson wrote:
> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kenwhistler at att.net  Thu May 28 22:14:19 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Thu, 28 May 2015 20:14:19 -0700
Subject: Arrow dingbats
In-Reply-To: 
References: 
Message-ID: <5567D98B.4080006@att.net>

Michel Suignard (editor of ISO/IEC 10646) responded to these questions, but let me augment his response with some more detailed history here. (Pardon the length of the reply, but these things tend never to be as simple as people assume and hope they are.)

On 5/28/2015 2:08 PM, Chris wrote:
> So it sounds like 27a1 came first. Then 2b05 etc was added to complete
> the set with 27a1, except that it didn't complete the set because
> nobody aligned the glyphs. Then they added U+2B95 in a 2nd attempt to
> complete the set? (Why not just fix the old arrow?)

O.k. That is *roughly* correct, but only very roughly.

U+27A1 BLACK RIGHTWARDS ARROW

That *did* come first. It has a Unicode Age=V1_1, dating back to 1993 in the standard. (Actually, its Unicode history goes back even further, but 1993 is enough for this discussion.) U+27A1 was part of the set of dingbats encoded for compatibility with the ITC Zapf Dingbats series 100, which saw widespread early commercial implementation on PostScript printers and was widely used as a font encoding back in the 80's and early 90's.

An important thing to note about the Zapf Dingbat arrows (go look at the Unicode code chart for the 27XX block) is that almost all of those arrows are exclusively right-facing:

http://www.unicode.org/charts/PDF/U2700.pdf

It was assumed at the time that in actual implementations that used these arrows in documents, they would be used by PostScript drivers that had arbitrary scale and rotate functions that would allow, among other things, the rotation of an arrow to display in any orientation. The Unicode *character* encoding of these was, rather, intended as a code point compatibility mapping that would enable Unicode mapping of documents that had used font-encoded Zapf dingbats simply as symbolic "blorts" in text. This compatibility issue explains why, back in 1993, the whole set of Dingbat arrows was not elaborated into character-encoded rotational sets of symbols (i.e. rightwards, leftwards, upwards, downwards, ...).

U+2B05 LEFTWARDS BLACK ARROW

That one (and the near-complete rotational set of similar black arrows at U+2B05..U+2B0D) have a Unicode Age=V4_0 (2003). Andrew West was correct in identifying the source of these. They were brought to SC2/WG2 and proposed for encoding by the DPRK, back in 2001, for compatibility with a North Korean standard. See page 5 of the pdf in:

http://www.unicode.org/L2/L2001/01349-N2374-DPRK-AddSymbols.pdf

That is the proximate source of these "black arrows" in the Unicode Standard (along with the white versions at U+2B00..U+2B04). The glyphs that were used for these arrows in Unicode 4.0 are also derived from that source. However, the fact that WG2 N2374 (i.e., the DPRK) did not ask for also encoding a separate "RIGHTWARDS BLACK ARROW" indicates that they considered the existing U+27A1 BLACK RIGHTWARDS ARROW to suffice for mapping to their standard.

The fact that "nobody aligned the glyphs" in 2003, when these were published, was partly because: a) the glyphs were inherited from the proposal document and then ISO ballot documents, and nobody commented on or required them to be changed in ballot comments, and b) nobody much cared, because these were compatibility additions for a DPRK standard, and weren't mapped to any commercial sets at the time, anyway.
The glyphs for U+2B05..U+2B0D remained unchanged in the standard from Unicode 4.0 through Unicode 6.3. (Again, because nobody had any strong reason to do otherwise.) And that explains why, as implementations of the Unicode 4.0 (and later) repertoire came to be more widely supported in fonts, the glyphs for U+2B05 tended to have a relatively narrow arrow shaft that matched the Unicode charts.

The unification of the rotational set U+2B05..U+2B0D with the existing ITC Zapf Dingbat U+27A1 was *implicit* in the encoding, but was not explicitly called out by anything other than a note in the names list for the 2BXX block that pointed to the 27XX block for "Other white and black arrows to complete this set". In practice, most people just put glyphs in fonts that matched the code charts.

U+2B95 RIGHTWARDS BLACK ARROW

This one has a Unicode Age=V7_0 (2014). It was added as a result of a complete re-rationalization of all of the arrow symbols in the standard, required, as Michel Suignard noted, to deal with the addition of compatibility characters to cover the multitude of arrow symbols in the Wingding sets. If you want to see the explicit rationale and the point at which this happened, see page 21 in the pdf of:

http://www.unicode.org/L2/L2012/12130-n4239.pdf

That was the disposition of comments for PDAM 1.2 to ISO/IEC 10646 3rd edition. And the relevant note from the editor is:

"To complete the set of BLACK ARROW in 2B05..2B0D a new character is added: 2B95 RIGHTWARDS BLACK ARROW (The character 27A1 BLACK RIGHTWARDS ARROW in the dingbat block is not an appropriate match for the other 9 characters)."

This happened in the context of mapping against multiple Wingding arrow shapes, which were at the time being added to the standard in explicit rotational sets. Doing this consistently required a rationalization of the shapes and aspects of the white and black arrows in the 2BXX block. And the explicit changes that ended up in the Unicode code charts can be traced back to the following repertoire chart:

http://www.unicode.org/L2/L2012/12128-n4244.pdf

See pages 36 and 49 of the pdf. Page 49, in particular, shows explicitly what Michel pointed out: the addition of the new character 2B95 was deliberately aligned with the glyph changes for 2B05, etc.

So now, finally, to your question: "Why not fix the old arrow?" Well, Michel explained that in WG2 N4239. If you are going to map an entire set to Wingdings (as opposed to a then decade-old proposal document from the DPRK), it makes sense to use appropriate glyphs for that, in the context of all the other additions. But it is *not* appropriate to retroactively pick out the old ITC Zapf Dingbats series 100 glyph (from amongst a set of others with very explicit shapes) and change *that* glyph just to make the rotational set complete. Hence the addition of U+2B95 as the best solution for Unicode 7.0.

> Except that nobody seems to have U+2B95 aligned either.

It takes a while for font implementations to catch up with the standard. The glyphs for U+2B05..U+2B0D have been in fonts for some time now, and the multitude of arrow additions for Unicode 7.0 are relatively new and not yet fully supported in many fonts. *When* a font adjusts for the addition of the new sets of arrows, however, it *should* take into account the explicit glyph updates for U+2B00..U+2B0D, which were clearly intentional, as part of all this work on the arrows to cover Wingdings.
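If you want to check what your own environment knows about the completed set, a quick sketch (Python, assuming a build whose unicodedata tables are at least Unicode 7.0) lists the rotational set next to the old dingbat:

    import unicodedata

    # The Unicode 4.0 rotational set of white and black arrows, the
    # Unicode 7.0 completion, and the original 1993 Zapf dingbat.
    for cp in list(range(0x2B00, 0x2B0E)) + [0x2B95, 0x27A1]:
        print("U+%04X %s" % (cp, unicodedata.name(chr(cp))))

A build with older tables will raise ValueError for U+2B95, which is itself a quick way to tell whether the 7.0 addition is present on your system.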
> On unicode-table.com it looks totally different,

You cannot depend on unicode-table.com for definitive information about glyphs. That site is not coordinated with or sanctioned by the Unicode Consortium. If you want definitive information about encoding and current representative glyphs for each character, please go instead to:

http://www.unicode.org/charts/

> and Mac doesn't even have it.

Implementations may well lag in the addition of new sets of symbols from Unicode 7.0.

> Is there any hope this will actually fix it?

Yes.

> Has the unicode consortium made it clear to one and all that U+2B95 is supposed to align?

Yes. (See above.)

--Ken

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From asmus-inc at ix.netcom.com  Fri May 29 00:46:58 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Thu, 28 May 2015 22:46:58 -0700
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To: 
References: <55677762.3060805@oracle.com>
Message-ID: <5567FD52.6020007@ix.netcom.com>

On 5/28/2015 2:15 PM, Shawn Steele wrote:
> I'm used to them being next to each other. So the entire discussion
> seems to be about how to encode a concept vs how to get the shape you
> want with existing code points. If you just want the perfect shape,
> then maybe an svg is a better choice. If we're talking about
> describing ski-run difficulty levels in plain text, then the
> hodge-podge of glyphs being offered in this thread seems kinda hacky
> to me.
>
> -Shawn

*Symbols have a rather different relation between identity and collection of typical shapes than letters.*

For symbols, the way they are re-used in different conventions is different as well.

For letters, in many scripts, what matters is that a) they represent a member of an alphabet (a subset of a script), and b) readers and writers can agree *which* member of the alphabet is intended (identity). This identity selection is the sum total of the "semantics" of the character, when it comes to letters.

Some symbols, like the integral signs, are closely tied to a well-defined notation, which in turn governs the acceptability of the range of visual representations. For general symbols you quickly get to the situation where the shape *is* the identity.

For geometric shapes, you can't really predict how they are going to be used and in which conventions. (That is true for the more generically shaped punctuation marks as well, like the period.) Because you can't predict the use to be made of them, what you need to guarantee the writer (author) is that the shape he or she sees is what the reader will see, so that the author can make the determination that the symbol represents the notational element, or the concept, that was intended.

That means you really need to approach the encoding of symbols differently from letters, where the latter have a well-established "identity" and the only task for a visual representation is to give enough unambiguous detail so as to be able to select that identity from a restricted set. (Hence the wide range of wonderfully whimsical decorative fonts.)

It's useless to treat some "concept" as the functional equivalent of a letter's membership in an alphabet. Unlike the case of writing systems, neither authors nor readers have the same kind of prior agreement on how much you can vary a shape and still refer to the same concept. (Obviously, even among symbols there is some variation in this regard.)
As a result, you simply need to allow the encoding to become more shape based, so that authors can create documents that do not have to rely on a missing agreement with readers about which other shapes may or may not be substituted successfully without affecting the semantics (not of the code point, but of the text).

A./

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jknappen at web.de  Fri May 29 02:32:30 2015
From: jknappen at web.de (Jörg Knappen)
Date: Fri, 29 May 2015 09:32:30 +0200
Subject: Aw: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To: 
References: <55677762.3060805@oracle.com>
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From wjgo_10009 at btinternet.com  Fri May 29 03:38:19 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 29 May 2015 09:38:19 +0100 (BST)
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <5567A7D6.6060102@kli.org>
References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <55665633.8040503@kli.org> <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost> <5567A7D6.6060102@kli.org>
Message-ID: <211801.9901.1432888699177.JavaMail.defaultUser@defaultHost>

Responding to Mark E. Shoulson:

> As was pointed out to me, essentially what you are saying is you reject my premise that one size does not fit all.

Well, I do not know where that came from, but no, I do not reject that premise. There is plain text, there is HTML, there is XML.

HTML is good for web pages.

Plain text is, amongst other applications, good for text messages.

The format that I am suggesting would allow the image for a non-standard emoji character to be included in a text message, with the image located at the correct place in the text.

I have not purported that it become the only format for transmitting images.

> You would prefer *everything* be in plain text, "so you wouldn't have to use other formats for it." You're essentially converting plain text into THE format for everything.

No. Use the best format for the task that is being carried out. I am enthusiastic that as much as possible can be done in open source formats, rather than an end user of computing equipment needing to rely on expensive proprietary software packages with proprietary file formats that cannot be accessed without expensive software.

> If you really believe one size should fit all in this way, ...

But I don't.

Just because I opine that plain text is best for some applications, and I have suggested a format that would allow a graphic to be included directly in a plain text file, does not mean that I opine that everything should be plain text.

For example, I use HTML files, gif files, png files, pdf files, wav files, and TTF files as appropriate.

http://www.users.globalnet.co.uk/~ngo/library.htm
http://www.users.globalnet.co.uk/~ngo/spec0001.htm
http://www.users.globalnet.co.uk/~ngo/song1018.htm
http://www.users.globalnet.co.uk/~ngo/song1021.htm

I have embedded a wav file in a pdf and published the result on the web.

http://www.users.globalnet.co.uk/~ngo/the_mobile_art_shop.pdf

Suppose that a plain text file is to include just one non-standard emoji graphic. How would that be done otherwise than by the format that I am suggesting?

What if there were three such non-standard emoji graphics needed in the plain text file, the second graphic being used twice? How would that be done otherwise than by the format that I am suggesting?
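To make the shape of such a sequence concrete: the TAG characters U+E0020..U+E007E shadow printable ASCII, and U+E007F is CANCEL TAG, so a base character followed by a tagged payload can be built mechanically. The sketch below (Python, with the base character and the payload chosen arbitrarily for illustration) only shows the code points involved; nothing in Unicode 8.0 assigns any image semantics to such a sequence.

    # Build a "base character + tag characters" sequence, terminated by
    # CANCEL TAG. The payload here is illustrative only.
    def tag_sequence(base, payload):
        assert all(0x20 <= ord(c) <= 0x7E for c in payload)
        return base + "".join(chr(0xE0000 + ord(c)) for c in payload) + "\U000E007F"

    seq = tag_sequence("\U0001F4BC", "demo")  # U+1F4BC is an arbitrary base
    print(["U+%04X" % ord(c) for c in seq])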
William Overington

29 May 2015

From andrewcwest at gmail.com  Fri May 29 05:30:40 2015
From: andrewcwest at gmail.com (Andrew West)
Date: Fri, 29 May 2015 11:30:40 +0100
Subject: KPS 9566 mappings (was Re: Arrow dingbats)
Message-ID: 

As someone who supports opening of KPS 9566 encoded files in my software (BabelPad), I am interested in those characters proposed by the DPRK (http://std.dkuug.dk/jtc1/sc2/wg2/Docs/n2374.pdf) that were not accepted for encoding but which are still in the latest version of the DPRK standard, KPS 9566-2012(?). Red Star OS 3.0 Unicode-maps most of them to the PUA, which is not satisfactory in most cases.

LEFTWARDS SCISSORS = KPS 9566-2012 ACD5

There are five scissors characters at 2700..2704, but they are all right-facing. I think it would not be unreasonable to encode a left-facing scissors character for compatibility with KPS 9566. Alternatively, standardized variants for left-facing and right-facing scissors could be defined for all of 2700..2704, but that might open a nasty precedent that we come to regret, so I would prefer simply encoding a single left-facing scissors character.

CIRCLED UPWARD INDICATION = KPS 9566-2012 ACD4

This could be represented as U+1F446 WHITE UP POINTING BACKHAND INDEX + U+20DD COMBINING ENCLOSING CIRCLE.

WHITE UP-POINTING TRIANGLE WITH BLACK TRIANGLE = KPS 9566-2012 A2F1
WHITE UP-POINTING TRIANGLE WITH HORIZONTAL FILL = KPS 9566-2012 A2F2
WHITE UP-POINTING TRIANGLE WITH UPPER LEFT TO LOWER RIGHT FILL = KPS 9566-2012 A2F3
WHITE UP-POINTING TRIANGLE WITH UPPER RIGHT TO LOWER LEFT FILL = KPS 9566-2012 A2F4

I don't know why these were not accepted for encoding. As far as I can tell, they cannot be represented by any current Unicode character, and I think it would be reasonable to encode them for compatibility with KPS 9566.

RIGHT PARENTHESIS WITH FULL STOP = KPS 9566-2012 A1DC
RIGHT DOUBLE ANGLE BRACKET WITH FULL STOP = KPS 9566-2012 A1DD

I understand why these were not accepted for encoding, but the precedent of U+2047 DOUBLE QUESTION MARK, U+2048 QUESTION EXCLAMATION MARK, and U+2049 EXCLAMATION QUESTION MARK -- which I believe were encoded because they are used in vertically oriented Mongolian text, and it is problematic to embed ?? etc. horizontally in vertical text -- suggests that it may be appropriate to encode these two characters for compatibility with KPS 9566.

VULGAR FRACTION ONE HALF WITH HORIZONTAL BAR = KPS 9566-2012 A7FA
VULGAR FRACTION ONE THIRD WITH HORIZONTAL BAR = KPS 9566-2012 A7FB
VULGAR FRACTION TWO THIRDS WITH HORIZONTAL BAR = KPS 9566-2012 A7FC
VULGAR FRACTION ONE QUARTER WITH HORIZONTAL BAR = KPS 9566-2012 A7FD
VULGAR FRACTION THREE QUARTERS WITH HORIZONTAL BAR = KPS 9566-2012 A7FE

These contrast with KPS 9566 A7CA..A7CE, which are vulgar fractions with a diagonal bar. The issue of distinguishing between a horizontal and a diagonal fraction slash is not restricted to North Korea, and I think that there is an argument to be made for defining standardized variants for all vulgar fraction characters to specify a glyph with either a horizontal bar or a diagonal bar.

HAMMER AND SICKLE AND BRUSH
CIRCLED HAMMER AND SICKLE AND BRUSH

I assume that there is no appetite to encode these symbols for the Workers' Party of Korea, and so mapping them to the PUA is appropriate.

There is also the proposed VERTICAL TILDE character, which was not accepted for encoding but which Red Star OS 3.0 Unicode-maps to U+2E2F VERTICAL TILDE, added in Unicode 5.1 for Cyrillic transliteration. This mapping does not seem wholly satisfactory to me, and I wonder whether it would not be better to simply encode a PRESENTATION FORM FOR VERTICAL TILDE at FE1A.
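For a converter that has to do something with these code points today, the provisional choices discussed above end up looking like the following sketch (Python; the mapping table reflects this message's suggestions, not settled assignments, and the PUA fallback code point is purely hypothetical):

    # Provisional KPS 9566-2012 fallbacks, per the discussion above.
    KPS_FALLBACK = {
        0xACD4: "\U0001F446\u20DD",  # CIRCLED UPWARD INDICATION as a sequence
        0xACD5: "\u2702",            # LEFTWARDS SCISSORS: nearest (right-facing) match
        0xA1DC: ").",                # RIGHT PARENTHESIS WITH FULL STOP, decomposed
    }

    def map_kps(cp):
        # Unmapped code points go to a private use character here, standing
        # in for the Red Star OS 3.0 PUA mappings mentioned above; the
        # specific PUA assignment is hypothetical.
        return KPS_FALLBACK.get(cp, "\uE000")

    print(map_kps(0xACD5))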
Andrew

From alolita.sharma at gmail.com  Fri May 29 10:51:50 2015
From: alolita.sharma at gmail.com (Alolita Sharma)
Date: Fri, 29 May 2015 08:51:50 -0700
Subject: "Unicode of Death"
In-Reply-To: 
References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID: 

Seems like we may see a temporary fix for iOS.

http://www.businessinsider.com/apple-issues-temporary-siri-workaround-iphone-crash-unicode-text-message-bug-2015-5

Best,
Alolita

On Thu, May 28, 2015 at 2:36 PM, Andrew Cunningham wrote:
> Not the first time unicode crashes things. There was the google chrome bug
> on osx that crashed the tab for any syriac text.
>
> A.
>
> On Friday, 29 May 2015, Bill Poser wrote:
> > No doubt the evil Unicode Consortium is in league with the Trilateral
> > Commission, the Elders of Zion, and the folks at NASA who faked the moon
> > landing.... :)
> >
> > On Thu, May 28, 2015 at 7:53 AM, Doug Ewell wrote:
> >> Unicode is in the news today as some folks with waaay too much time on
> >> their hands have discovered a string consisting of Latin, Arabic,
> >> Devanagari, and CJK characters that crashes Apple devices when it
> >> appears as a pop-up message.
> >>
> >> Although most people seem to identify it correctly as a CoreText bug,
> >> there are a handful, as you might expect, who attribute it to some shady
> >> weirdness in Unicode itself. My favorite quote from a Reddit user was
> >> this:
> >>
> >> "Every character you use has a unicode value which tells your phone what
> >> to display. One of the unicode values is actually never-ending and so
> >> when the phone tries to read it it goes into an infinite loop which
> >> crashes it."
> >>
> >> I've read TUS Chapter 4 and UTR #23 and I still can't find the
> >> "never-ending" Unicode property.
> >>
> >> Perhaps astonishingly to some, the string displays fine on all my
> >> Windows devices. Not all apps get the directionality right, but no
> >> crashes.
> >>
> >> --
> >> Doug Ewell | http://ewellic.org | Thornton, CO ????
>
> --
> Andrew Cunningham
> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From leob at mailcom.com  Fri May 29 11:09:47 2015
From: leob at mailcom.com (Leo Broukhis)
Date: Fri, 29 May 2015 09:09:47 -0700
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <211801.9901.1432888699177.JavaMail.defaultUser@defaultHost>
References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <55665633.8040503@kli.org> <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost> <5567A7D6.6060102@kli.org> <211801.9901.1432888699177.JavaMail.defaultUser@defaultHost>
Message-ID: 

> The format that I am suggesting would allow the image for a non-standard emoji character to be included in a text message, with the image located at the correct place in the text.
A more common occurrence is the need to include a non-standard character in a text message, be it a ski piste symbol or an obscure CJK ideogram. Have you thought of embedding TrueType in Unicode?

Leo

On Fri, May 29, 2015 at 1:38 AM, William_J_G Overington wrote:
> [...]

From shervinafshar at gmail.com  Fri May 29 11:16:50 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Fri, 29 May 2015 09:16:50 -0700
Subject: "Unicode of Death"
In-Reply-To: 
References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID: 

> Ask Siri to "read unread messages."

Siri saves the day :).

-- Shervin

On Fri, May 29, 2015 at 8:51 AM, Alolita Sharma wrote:
> Seems like we may see a temporary fix for iOS.
>
> http://www.businessinsider.com/apple-issues-temporary-siri-workaround-iphone-crash-unicode-text-message-bug-2015-5
>
> Best,
> Alolita
>
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From wjgo_10009 at btinternet.com  Fri May 29 11:31:00 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 29 May 2015 17:31:00 +0100 (BST)
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: 
References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <55665633.8040503@kli.org> <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost>
Message-ID: <33086681.59649.1432917060255.JavaMail.defaultUser@defaultHost>

Responding to Philippe Verdy:

> There's no advantage because what you want to create is effectively another markup language with its own syntax (but requiring new obscure characters that most applications and users will not be able to interpret and render correctly in the way intended by you, ...

Well, if the format became accepted as part of Unicode, then appropriate applications could well be produced that would interpret the format and display an image in the desired place.

> ... and with still many things you have forgotten about the specific needs for images (e.g. colorimetry profiles, aspect ratio of pixels with bitmaps, undesired effects that must be controlled such as "moiré" artefacts).

The format is at present just a basic suggestion. Rather than just stating what you consider I have forgotten and dismissing the format, how about joining in the progress and specifying what you consider needs adding to the format, and perhaps suggesting how to add in that functionality in the style that the format uses.

> You don't need new characters to create a markup language and its syntax.
> Today the world goes very well with HTML(5) which is now the best markup
> language for documents (including for inserting embedded images that don't
> require any external request, or embedding special effects on images, such
> as animation or dynamic layouts for adapting the document to the rendering
> device, with the help of CSS and Javascript that are also embeddable).

The two questions that I asked in my response to a post by Mark E. Shoulson are relevant here.

Suppose that a plain text file is to include just one non-standard emoji graphic. How would that be done otherwise than by the format that I am suggesting?

What if there were three such non-standard emoji graphics needed in the plain text file, the second graphic being used twice? How would that be done otherwise than by the format that I am suggesting?

> At least with HTML5 they don't try to reinvent the image formats, and there's ample space for supporting multiple image formats tuned for specific needs (e.g. JPEG, PNG, GIF, SVG, TIFF...), including animation and video, and synchronization of images and audio in time for videos, or with user interactions. They are designed separately and benefit from patient research made over many years (your desired format, still undocumented, is largely under the level needed for images, independently of the markup syntax you want to create to support them, and independently of the fact that you also want to encode these syntactic elements with new characters, something that is absolutely not needed for any markup language).

Well, it is undocumented apart from posts in this thread because I have put forward the format for discussion. A pdf document for consideration by the Unicode Technical Committee could be produced and submitted if there is interest in the format, the content of the pdf document perhaps including suggestions from this thread, if any such suggestions are forthcoming.

> In summary, you are reinventing the wheel.

Well, this is progress, producing an additional format for expressing an image for application in various specific specialised circumstances.

William Overington

29 May 2015

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From doug at ewellic.org  Fri May 29 13:12:46 2015
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 29 May 2015 11:12:46 -0700
Subject: Tag characters and in-line graphics (from Tag characters)
Message-ID: <20150529111246.665a7a7059d7ee80bb4d670165c8327d.10c55b41ea.wbe@email03.secureserver.net>

William_J_G Overington wrote:

>> There's no advantage because what you want to create is effectively
>> another markup language with its own syntax (but requiring new
>> obscure characters that most applications and users will not be able
>> to interpret and render correctly in the way intended by you, ...
>
> Well, if the format became accepted as part of Unicode then
> appropriate applications could well be produced that would interpret
> the format and display an image in the desired place.

I think this cuts to the heart of what people have been trying to say all along. Historically, Unicode was not meant to be the means by which brand new ideas are run up the proverbial flagpole to see if they will gain traction.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????
From verdy_p at wanadoo.fr  Fri May 29 14:07:45 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 29 May 2015 21:07:45 +0200
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <1432867044809.9dc7c15b@Nodemailer>
References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer>
Message-ID: 

2015-05-29 4:37 GMT+02:00 John:
> "Today the world goes very well with HTML(5) which is now the best markup
> language for documents (including for inserting embedded images that don't
> require any external request)..."
>
> If I had a large document that reused a particular character thousands of
> times, would this HTML markup require embedding that character thousands
> of times, or could I define the character once at the beginning of the
> sequence, and then refer back to it in a space-efficient way?

HTML(5) allows defining *once* entities for images that can then be reused thousands of times without repeating their definition. You can do this as well with CSS styles: just define a class for a small element. This element may still be an "image", but the semantic is carried by the class you assign to it. You are not required to provide an external source URL for that image if the CSS style provides the content.

You may also use PUAs for the same purpose (however, I have not seen how CSS allows styling individual characters in text elements, as these characters are not elements, and there's no defined selector for pseudo-elements matching a single character). PUAs are perfectly usable in the situation where you have embedded a custom font in your document for assigning glyphs to characters (you can still do that, but I would avoid TrueType/OpenType for this purpose, and would use the SVG font format, which is valid in CSS, for defining a collection of glyphs). If the document is not restricted to be standalone, of course you can use links to an external shared CSS stylesheet and to this SVG font referenced by the stylesheet.

With such an approach, you don't even need to use classes on elements; you use plain text with very compact PUAs. It's up to you to decide whether the document must be standalone (embedding everything it needs) or may use external references for missing definitions; HTML allows both (and SVG as well, when it contains plain-text elements).
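As a small runnable illustration of the define-once, reuse-many-times point (a sketch only: it uses an internal entity in an XML serialization, which is where that mechanism actually lives; in plain HTML5 you would reach for a CSS class instead, as described above):

    # "Define once, refer back cheaply": an internal XML entity, expanded
    # by Python's stock parser. The entity value is just a placeholder.
    import xml.etree.ElementTree as ET

    doc = '<!DOCTYPE d [<!ENTITY glyph "[custom glyph]">]><d>&glyph;&glyph;&glyph;</d>'
    print(ET.fromstring(doc).text)  # -> [custom glyph][custom glyph][custom glyph]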
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From verdy_p at wanadoo.fr  Fri May 29 15:23:22 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 29 May 2015 22:23:22 +0200
Subject: "Unicode of Death"
In-Reply-To: 
References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID: 

2015-05-28 23:36 GMT+02:00 Andrew Cunningham:
> Not the first time unicode crashes things. There was the google chrome bug
> on osx that crashed the tab for any syriac text.

"Unicode crashes things"? Unicode has nothing to do with those crashes, which are caused by bugs in applications that make incorrect assumptions (in fact, not even related to the characters themselves, but to the supposed behavior of the layout engine). Programmers and designers, for example, VERY frequently forget the constraints for RTL languages and make incorrect assumptions about left and right sides when sizing objects, or they don't expect that the cursor will advance backward and forget that some measurements can be negative: if they use this negative value to compute the size of a bitmap rendering surface, they'll get an out-of-memory condition and unchecked null pointers returned, and then they will crash, having assumed the buffer was effectively allocated.

These are the same kind of bugs as the too-common buffer overruns with unchecked assumptions: the code is kept because "it works as is" in their limited immediate tests.

Producing full-coverage tests is a difficult and lengthy task that programmers do not always have the time to do, when they are urged to produce a workable solution for some clients and then given no time to improve the code before the same code is distributed to a wider range of clients.

Commercial staff do that frequently; they can't even read the technical limitations even when they are documented by programmers... In addition, the commercial staff like selling software that will cause customers to ask for support... which will be billed! After that, programmers are overwhelmed by bug reports and support requests, and have even less time to design the other things that they are working on and still have to produce. QA tools may help programmers in this case by providing statistics about the effective costs of producing new software with better quality, and the cost of supporting it when it contains too many bugs: commercial teams like those statistics because they can convert them to costs, commercial margins, and billing rates. (When such QA tools are not used, programmers will rapidly leave; they are fed up by the growing pressure to do ever more in the same time, with a growing number of "urgent" support requests.)
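To make that failure mode concrete, a minimal sketch (hypothetical layout code in Python, not any real engine):

    # Hypothetical layout code: an RTL run can yield a negative total
    # advance; an unchecked width then poisons the allocation size, and
    # the crash happens far from the actual mistake.
    def allocate_surface(advances, height):
        width = sum(advances)          # may be negative for an RTL run
        if width <= 0 or height <= 0:  # the check that buggy code omits
            raise ValueError("invalid surface size %d x %d" % (width, height))
        return bytearray(width * height)

    allocate_surface([5, 7, 4], 16)    # fine
    # allocate_surface([5, -12, 4], 16) would raise instead of crashing later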
Those that say "Unicode crashes things" do the same thing: they make broad unchecked assumptions about how things are really made or how things are actually working.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lang.support at gmail.com  Fri May 29 18:20:08 2015
From: lang.support at gmail.com (Andrew Cunningham)
Date: Sat, 30 May 2015 09:20:08 +1000
Subject: "Unicode of Death"
In-Reply-To: 
References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID: 

Geez Philippe,

It was tongue in cheek.

A.

On Saturday, 30 May 2015, Philippe Verdy wrote:
> [...]

--
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunningham at slv.vic.gov.au
lang.support at gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From c933103 at gmail.com  Fri May 29 19:20:42 2015
From: c933103 at gmail.com (gfb hjjhjh)
Date: Sat, 30 May 2015 08:20:42 +0800
Subject: Some questions about Unicode's CJK Unified Ideograph
In-Reply-To: 
References: 
Message-ID: 

Hello, I am new to this mailing list and have some questions about Unicode that I am looking for answers to, or guidance on. Can anyone provide me some information regarding any of the questions below, or point out where I should look for answers instead?

1. I have seen a chinese character ??? in a Vietnamese dictionary, NHAT DUNG THUONG DAM DICTIONARY, which is digitized at http://www.nomfoundation.org/common/show.php?detail=2117 , and I have also checked the Unihan database, which does not include this character. I then read http://unicode.org/pending/proposals.html , which lists the requirements and processes needed to propose a new character to Unicode, and which points to mailing lists for help.

So, a.) In http://www.unicode.org/alloc/Pipeline.html , it shows that CJK Extension E and F have already been accepted, but where can I check those proposals to see if the character is in them or not?
And b.) It says that to propose a new character, the proposal must include information about someone who would agree to provide a computer font for publishing the standard. Does that mean I have to provide info about someone who is anticipated to agree to do so, or do I need to contact them for their agreement first? And does that mean I can just put info of someone who is making a free full-coverage Unicode CJK font into the proposal?

And c.) Just like question (b), do "names and addresses of appropriate contacts within national body or user organizations" represent the Vietnamese government in this case?

2. Are combining characters like U+20DD intended to work with all different types of characters, or is it some problem related to implementation? When I write ゆ⃝ (Japanese Hiragana Letter Yu + Combining Enclosing Circle) the two appear to be separate in most fonts I use, but if I change the Hiragana Yu into a conventional = sign or some Latin character, most fonts are at least somehow able to put them together. Or, is there any better/alternative representation in Unicode that can show Japanese hiragana yu in a circle?

3. From what I read, Unicode records different regional glyphs for a single character. Is there a character in Vietnamese chu nom that is also present in other languages (Chinese, Japanese, Vietnamese), but has some special feature in the glyph that makes it different from all the other variants, so that if the computer system displays that character, I can immediately tell it is displaying the character as Vietnamese chu nom rather than as a character of the other languages? Furthermore, is simply using 'vi' in CSS's lang parameter sufficient to force browsers to show the chu nom glyph instead of other glyphs, or is something like vi-Nom or vi-Hani or Han-Nom needed? (This part is less directly related to Unicode, so I don't know if this is a suitable place to ask; please tell me if it is not.)

4. In CJK Symbols and Punctuation, the proper name mark and the book name mark are not included. While there are characters like U+2584, U+FE33, U+FE4F, and U+FE34 in Unicode that are more or less a representation of the two symbols, they do not appear below or to the left of typed characters when text flow is horizontal/vertical; instead, they occupy their own space, which makes them of little use in daily life. And while the proper name mark and book name mark can be represented by text editing software and CSS, those representations are not ideal, and the marks do match the "Criteria for Encoding Symbols". Is it possible to make a new Unicode symbol, or change some current symbol into one that could appear in a suitable place relative to other characters when typed? And a property of the symbol is that when used in a case like ???? in which ?? and ?? are two different proper names (place names), an underline should go below them without any separation between the characters ? and ? or ? and ? (when text is written horizontally), but at the same time the underline should not be linked between ? and ?, as ? is the end of the first place name while ? is the start of the other.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kenwhistler at att.net  Fri May 29 20:50:28 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Fri, 29 May 2015 18:50:28 -0700
Subject: Some questions about Unicode's CJK Unified Ideograph
In-Reply-To: 
References: 
Message-ID: <55691764.4030802@att.net>

On 5/29/2015 5:20 PM, gfb hjjhjh wrote:
> 1. I have seen a chinese character ??? in a Vietnamese dictionary, NHAT DUNG THUONG DAM DICTIONARY.
>
> So, a.) In http://www.unicode.org/alloc/Pipeline.html , it shows that CJK Extension E and F have already been accepted, but where can I check those proposals to see if the character is in them or not?

For Extension E, you can check the following code chart:

http://www.unicode.org/charts/PDF/Unicode-8.0/U80-2B820.pdf

See: U+2C89A..U+2C931 (pp. 54-56 of the pdf) for the relevant radical (#149). But I don't see that character in the list of Extension E characters.

Extension F is harder to track down, because it has not yet been approved by the UTC, and comes in two pieces, with different progression so far in the ISO committee. Perhaps somebody on this list who has better access to the relevant documents can let you know whether ??? can be found in those sets.

> And b.) It says that to propose a new character, the proposal must include information about someone who would agree to provide a computer font for publishing the standard. Does that mean I have to provide info about someone who is anticipated to agree to do so, or do I need to contact them for their agreement first? And does that mean I can just put info of someone who is making a free full-coverage Unicode CJK font into the proposal?

It would require (eventually) provision of a font with correct display of just the character proposed -- but in the case of CJK additions, these first go through a process of collection and review by the Ideographic Rapporteur Group. The best thing to do is to work with a national body concerned with CJK characters and ensure that they include this character on their list of submissions for IRG review.

> And c.) Just like question (b), do "names and addresses of appropriate contacts within national body or user organizations" represent the Vietnamese government in this case?

If the character has not been submitted to the IRG for review, it would probably be best to work through the Vietnamese national standards body. Again, people on this list may be able to provide you the correct contact information for them.

> 2. Are combining characters like U+20DD intended to work with all different types of characters, or is it some problem related to implementation? When I write ゆ⃝ (Japanese Hiragana Letter Yu + Combining Enclosing Circle) the two appear to be separate in most fonts I use, but if I change the Hiragana Yu into a conventional = sign or some Latin character, most fonts are at least somehow able to put them together. Or, is there any better/alternative representation in Unicode that can show Japanese hiragana yu in a circle?

Combining enclosing marks in principle could work with most characters, but in practice most arbitrary combinations do not work very well, because they would require very complicated font support.
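The sequence itself is easy to produce and inspect; whether it actually renders as a circled ゆ is purely a question of font support. A quick sketch (Python, any version with reasonably current Unicode tables):

    import unicodedata

    # HIRAGANA LETTER YU followed by COMBINING ENCLOSING CIRCLE; the
    # sequence is well formed even when fonts fail to stack the pair.
    for ch in "\u3086\u20DD":
        print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))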
Is it > possible to make a new unicode symbol, or change some current symbol > into one that could appear in suitable place of other characters when > typed? And a property of the symbol is that when used in case like ? > ??? which ?? and ?? are two different proper name (place name), > so an underline should go below them without any separation between > the character ?and? or ?and? (when text are written horizontally), > but at the same time the underline should not be linked between ? and > ? as ? is the end of first place name while ? is the start of the > other. > What you are talking about is, indeed, best handled by text styling attributes, rather than by individual character encoding. These are various CJK-specific underlining styles (for horizontal text layout) or sidelining styles (for vertical text layout). It is precisely because these require highlighting for ranges of characters (without breaks) that this kind of text decoration is handled best by style attributes (or markup), rather than by individual combining symbols. The characters U+FE33, U+FE34, U+FE4F (but not U+2584) are compatibility characters only for mapping to old Chinese standards that had individual characters encoded for these underlining or sidelining text highlights, but which required specialized text layout programs to make any use of them. --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From mpsuzuki at hiroshima-u.ac.jp Sat May 30 00:46:21 2015 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Sat, 30 May 2015 14:46:21 +0900 Subject: ["Unicode"] Re: Some questions about Unicode's CJK Unified Ideograph In-Reply-To: <55691764.4030802@att.net> References: <55691764.4030802@att.net> Message-ID: <55694EAD.6030604@hiroshima-u.ac.jp> Hi, Please let me ask a slightly off-topic question, ? = ??? (not ???) is coded at U+46E9. Of course, the unification between ? vs ? is not applied basically, so the separated encoding of ??? would be reasonable (if there is a requirement), but I want to know whether Vietnamese user community distinguishes ??? and ??? semantically. Do you know anything? Regards, mpsuzuki Ken Whistler wrote: > > > On 5/29/2015 5:20 PM, gfb hjjhjh wrote: >> >> 1. I have seen a chinese character ??? from a Vietnamese dictionary >> NHAT DUNG THUONG DAM DICTIONARY* * >> > >> So, a.) In http://www.unicode.org/alloc/Pipeline.html , it show that >> CJK Extension E and F have already been accepted, but where can I >> check those proposals to see if the xharacter is in them or not? >> > > For Extension E, you can check the following code chart: > > http://www.unicode.org/charts/PDF/Unicode-8.0/U80-2B820.pdf > > See: U+2C89A..U+2C931 (pp. 54-56 of the pdf) for the relevant > radical (#149). But I don't see that character in the list of > Extension E characters. > > Extension F is harder to track down, because it has not yet been > approved by the UTC, and comes in two pieces, with different > progression so far in the ISO committee. Perhaps somebody on this list > who has better access to the relevant documents can let you > know whether ??? can be found in those sets. > >> and b.) 
it say to propose a new character, the proposal must include >> information about someone who would agree to provide a computer font >> for publishing the standard, do that mean i have to provide info about >> someone who is anticipated to agree on doing so or do i need to >> contact them for their agreement first, and does that mean I can just >> put info of someone who are making free full unicode CJK coverage font >> into the proposal?, >> > > It would require (eventually) provision of a font with correct display > of just the character proposed -- but in the case of CJK additions, these > first go through a process of collection and review by the Ideographic > Rapporteur Group. The best thing to do is to work with a national > body concerned with CJK characters and ensure that they include > this character on their list of submissions for IRG review. > >> and c.) just like the question (b), do "names and addresses of >> appropriate contacts within national body or user organizations" >> represent Vietnamese government in this case? >> > > If the character has not been submitted to the IRG for review, it would > probably be best to work through the Vietnamese national standards > body. Again, people on this list may be able to provide you the > correct contact information for them. > >> 2. Is combined characters like U+20DD intended to work with all >> different type of characters, or is it some problem related to >> implementation ? as I when i write ?? (Japanese Hiragana Letter Yu + >> Combining Enclosing Circle) appear to be separate on most font I use, >> but if I change the Hiragana Yu into a conventional = sign or some >> latin character, most fonts are at least somehow able to put them >> together. Or, is there any better/alternative representation in >> unicode that can show japanese hiragana yu in a circle? >> > > Combining enclosing marks in principle could work with most characters, > but in practice most arbitrary combinations do not work very well, > because they would require very complicated font support. > >> 4.In CJK Symbols and Punctuation, Proper name mark and Book name mark >> are not included. While there are charactera like U+2584, U+FE33, >> U+FE4F, and U+FE34 in unicode that is more or less a representation >> for the two symbol, they do not appear below or on the left of typed >> characters when text flow is horizontal/vertical, and instead, they >> occupy their own space which make them having little use in daily >> life, and while the proper name mark and book name mark can >> represented by text editing softwares and css but those representation >> are not ideal and they do match "Criteria for Encoding Symbols". Is it >> possible to make a new unicode symbol, or change some current symbol >> into one that could appear in suitable place of other characters when >> typed? And a property of the symbol is that when used in case like ? >> ??? which ?? and ?? are two different proper name (place name), >> so an underline should go below them without any separation between >> the character ?and? or ?and? (when text are written horizontally), >> but at the same time the underline should not be linked between ? and >> ? as ? is the end of first place name while ? is the start of the >> other. >> > > What you are talking about is, indeed, best handled by text styling > attributes, > rather than by individual character encoding. These are various CJK-specific > underlining styles (for horizontal text layout) or sidelining styles (for > vertical text layout). 
It is precisely because these require > highlighting for > ranges of characters (without breaks) that this kind of text decoration is > handled best by style attributes (or markup), rather than by individual > combining symbols. > > The characters U+FE33, U+FE34, U+FE4F (but not U+2584) are compatibility > characters only for mapping to old Chinese standards that had individual > characters encoded for these underlining or sidelining text highlights, > but which required specialized text layout programs to make any use > of them. > > --Ken > From wjgo_10009 at btinternet.com Sat May 30 03:47:05 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 30 May 2015 09:47:05 +0100 (BST) Subject: Tag characters and in-line graphics (from Tag characters) Message-ID: <1573163.7044.1432975625923.JavaMail.defaultUser@defaultHost> Responding to Doug Ewell: > I think this cuts to the heart of what people have been trying to say all along. > Historically, Unicode was not meant to be the means by which brand new ideas are run up the proverbial flagpole to see if they will gain traction. History is interesting and can be a good guide, yet many things that are an accepted part of Unicode today started as new ideas that gained traction and became implemented. So history should not be allowed to be a reason to restrict progress. For example, there was the extension from 1 plane to 17 planes. There was the introduction of emoji support. There was the introduction of the policy of colour sometimes being a recorded property rather than having just the original monochrome recording policy. There has been the change of encoding policy that facilitated the introduction of the Indian Rupee character into Unicode and ISO/IEC 10646 far more quickly than had been thought possible, so that the encoding was ready for use when needed. There has been the recent encoding policy change regarding encoding of pure electronic use items taking place without (extensive prior use using a Private Use Area encoding), such as the encoding of the UNICORN FACE. There is the recent change to the deprecation status of most of the tag characters and the acceptance of the base character followed by tag characters technique so as to allow the specifying of a larger collection of particular flags. ---- The two questions that I asked in my response to a post by Mark E. Shoulson are relevant here. Suppose that a plain text file is to include just one non-standard emoji graphic. How would that be done otherwise than by the format that I am suggesting? What if there were three such non-standard emoji graphics needed in the plain text file, the second graphic being used twice. How would that be done otherwise than by the format that I am suggesting? William Overington 30 May 2015 From andrewcwest at gmail.com Sat May 30 04:19:03 2015 From: andrewcwest at gmail.com (Andrew West) Date: Sat, 30 May 2015 10:19:03 +0100 Subject: Some questions about Unicode's CJK Unified Ideograph In-Reply-To: <55691764.4030802@att.net> References: <55691764.4030802@att.net> Message-ID: On 30 May 2015 at 02:50, Ken Whistler wrote: > > 1. I have seen a chinese character ??? from a Vietnamese dictionary NHAT > DUNG THUONG DAM DICTIONARY > > Extension F is harder to track down, because it has not yet been > approved by the UTC, and comes in two pieces, with different > progression so far in the ISO committee. Perhaps somebody on this list > who has better access to the relevant documents can let you > know whether ??? can be found in those sets. 
It's not in my lists of F1 and F2 characters. > 2. Is combined characters like U+20DD intended to work with all different > type of characters, or is it some problem related to implementation ? as I > when i write ?? (Japanese Hiragana Letter Yu + Combining Enclosing Circle) > appear to be separate on most font I use, but if I change the Hiragana Yu > into a conventional = sign or some latin character, most fonts are at least > somehow able to put them together. Or, is there any better/alternative > representation in unicode that can show japanese hiragana yu in a circle? > > Combining enclosing marks in principle could work with most characters, > but in practice most arbitrary combinations do not work very well, > because they would require very complicated font support. It's not that complicated, but I think most fonts don't support arbitrary combinations with combining enclosing circle because there is little or no demand for them. BabelStone Han displays Japanese Hiragana Letter Yu + Combining Enclosing Circle quite well, but on the other hand it does not work so well with CJK ideographs, and fails with Latin letters and punctuation. ? > 4.In CJK Symbols and Punctuation, Proper name mark and Book name mark are > not included. While there are charactera like U+2584, U+FE33, U+FE4F, and > U+FE34 in unicode that is more or less a representation for the two symbol, > they do not appear below or on the left of typed characters when text flow > is horizontal/vertical, and instead, they occupy their own space which make > them having little use in daily life, and while the proper name mark and > book name mark can represented by text editing softwares and css but those > representation are not ideal and they do match "Criteria for Encoding > Symbols". Is it possible to make a new unicode symbol, or change some > current symbol into one that could appear in suitable place of other > characters when typed? And a property of the symbol is that when used in > case like ???? which ?? and ?? are two different proper name (place name), > so an underline should go below them without any separation between the > character ?and? or ?and? (when text are written horizontally), but at the > same time the underline should not be linked between ? and ? as ? is the end > of first place name while ? is the start of the other. > > > What you are talking about is, indeed, best handled by text styling > attributes, rather than by individual character encoding. I agree. However, if you really do want to represent underlining of proper names at the character encoding level, then you would have to do something like put U+0332 Combining Low Line after each character to be underlined, and select a font that supports Combining Low Line with CJK ideographs. BabelStone Han supports this low-level method of underlining CJK ideographs, but if you want a space in the underlining between 美國 and 紐約 you would have to insert a very thin space (U+200A Hair Space in this example) between the characters. ? Andrew -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: MeiguoNiuyue.png Type: image/png Size: 27233 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: circled yu.png Type: image/png Size: 26781 bytes Desc: not available URL: From wjgo_10009 at btinternet.com Sat May 30 04:22:34 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 30 May 2015 10:22:34 +0100 (BST) Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <55665633.8040503@kli.org> <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost> <5567A7D6.6060102@kli.org> <211801.9901.1432888699177.JavaMail.defaultUser@defaultHost> Message-ID: <22703395.8755.1432977754153.JavaMail.defaultUser@defaultHost> Responding to Leo Broukhis: > A more common occurrence is the need to include a non-standard character in a text message, be it a ski piste symbol or an obscure CJK ideogram. Have you thought of embedding TrueType in Unicode? Not congruently so, yet, in effect, yes, as I have considered including individual OpenType-compatible glyphs in a base character followed by tag characters format. OpenType is a development from TrueType that can achieve more than can TrueType on its own. There is a little about this in the last two paragraphs of the following post. http://www.unicode.org/mail-arch/unicode-ml/y2015-m05/0218.html There would need to be a few additions to make it work effectively: for example, a value for each of advance width, ascent maximum, descent maximum and fontunits per em. William Overington 30 May 2015 From idou747 at gmail.com Sat May 30 09:14:05 2015 From: idou747 at gmail.com (John) Date: Sat, 30 May 2015 07:14:05 -0700 (PDT) Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: Message-ID: <1432995244747.7fa720f5@Nodemailer> Hmm, these "once entities" of which you speak, do they require javascript? Because I'm not sure that what we are looking for here is static documents requiring a full programming language. But let's say for a moment that html5 can, or could, do the job here. Then to make the dream come true that you could just cut and paste text that happened to contain a custom character to somewhere else, and nothing untoward would happen, would mean that everything in the computing universe should allow full blown html. So every Java Swing component, every Apple gui component, every .NET component, every windows component, every browser, every Android and iOS component would allow text entry of HTML entities. OK, so let's say everyone agrees with this course of action, now the universal text format is HTML. But in this new world where anywhere that previously you could input text, you can now input full blown html, does that actually make sense? Does it make sense that you can, for example, put full blown HTML inside an H1 tag in html itself? That's a lot of recursion going on there. Or in an MS-Excel cell? Or interspersed in some otherwise fairly regular text in a Word document? I suppose someone could define a strict limited subset of HTML to be that subset that makes sense in ALL textual situations. That subset would be something like just defining things that act like characters, and not like a full blown rendering engine. But who would define that subset? Not the HTML groups, because their mandate is to define full blown rendering engines. It would be more likely to be something like the Unicode group. And also, in this brave new world where HTML5 is the new standard text format, what would the binary format of it be? I mean, if I have the string of unicode characters that should be rendered as such? Or would it be text that happens to contain greater than symbol, I, M and G? It would have to be the former I guess, and thereby there would no longer be a unicode symbol for the mathematical greater than symbol. Rather there would be a unicode symbol for opening a HTML tag, and the text code for greater than would be &gt;. Never again would a computer store > to mean greater than.
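To illustrate with a contrived sketch (nothing here beyond the standard HTML entity rules, not any specific proposal):

    rendered text:  3 > 2, and <b> opens a tag
    stored markup:  3 &gt; 2, and &lt;b&gt; opens a tag

Every component that touches the text would have to agree on that escaping, everywhere, forever.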
Do we want HTML to be so pervasive? Not sure it deserves that. And from a programmer's point of view, he wants to be able to iterate over an array of characters and treat each one the same way, regardless of whether it is a custom character or not. Without that kind of programmatic abstraction, the whole thing can never gain traction. I don't think fully blown HTML embedded in your text can fulfill that. A very strictly defined subset possibly could. Sure, HTML5 can RENDER stuff adequately, if the only aim of the game is to provide a correct rendering. But to be able to actually treat particular embedded images as characters, and have some programming library see that abstraction consistently, I'm not sure I'm convinced that is possible. Not without nailing down exactly what HTML elements in what particular circumstances constitute a "character". I guess in summary, yes, we have the technology already to render anything. But I don't think the whole standards framework does anything to allow the computing universe to actually exchange custom characters as if they were just any other text. Someone would actually have to work on a standard to do that, not just point to html5. On Saturday, 30 May 2015 at 5:08 am, Philippe Verdy , wrote: 2015-05-29 4:37 GMT+02:00 John : "Today the world goes very well with HTML(5) which is now the best markup language for documents (including for inserting embedded images that don't require any external request)". If I had a large document that reused a particular character thousands of times, would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space efficient way? HTML(5) allows defining *once* entities for images that can then be reused thousands of times without repeating their definition. You can do this as well with CSS styles, just define a class for a small element. This element may still be an "image", but the semantic is carried by the class you assign to it. You are not required to provide an external source URL for that image if the CSS style provides the content. You may also use PUAs for the same purpose (however I have not seen how CSS allows to style individual characters in text elements as these characters are not elements, and there's no defined selector for pseudo-elements matching a single character). PUAs are perfectly usable in the situation where you have embedded a custom font in your document for assigning glyphs to characters (you can still do that, but I would avoid TrueType/OpenType for this purpose, but would use the SVG font format which is valid in CSS, for defining a collection of glyphs). If the document is not restricted to be standalone, of course you can use links to an external shared CSS stylesheet and to this SVG font referenced by the stylesheet. With such approach, you don't even need to use classes on elements, you use plain-text with very compact PUAs (it's up to you to decide if the document must be standalone (embedding everything it needs) or must use external references for missing definitions, HTML allows both (and SVG as well when it contains plain-text elements). -------------- next part -------------- An HTML attachment was scrubbed...
URL: From verdy_p at wanadoo.fr Sat May 30 13:50:21 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 30 May 2015 20:50:21 +0200 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <1573163.7044.1432975625923.JavaMail.defaultUser@defaultHost> References: <1573163.7044.1432975625923.JavaMail.defaultUser@defaultHost> Message-ID: 2015-05-30 10:47 GMT+02:00 William_J_G Overington : > Responding to Doug Ewell: > > > I think this cuts to the heart of what people have been trying to say > all along. > > > Historically, Unicode was not meant to be the means by which brand new > ideas are run up the proverbial flagpole to see if they will gain traction. > > History is interesting and can be a good guide, yet many things that are > an accepted part of Unicode today started as new ideas that gained traction > and became implemented. So history should not be allowed to be a reason to > restrict progress. > > For example, there was the extension from 1 plane to 17 planes. > Actually this was a restriction of the UCS to *only* 17 planes. Before that the UCS contained 31-bit code points, i.e. 32768 planes! If you're speaking about the old Unicode 1.0, it was then still not the UCS, and it was incompatible with the UCS in many important parts; the initial target of Unicode was only to have an "industry standard" immediately usable between a few software providers (Unicode 1.0 was then not an international standard, forget it!). -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat May 30 16:56:26 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 30 May 2015 23:56:26 +0200 Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices In-Reply-To: References: <55677762.3060805@oracle.com> Message-ID: But observations show that the vertical stacking is not universal. Horizontal stacking is also used in direction signs. My opinion is that they are just two separate "diamonds" and not a single symbol. Quite equivalent to the situation with the classification of hotels with stars (generally aligned horizontally but not always, we can see them also arranged vertically, or on two rows 1+1, 1+2 or 2+1 or 2+3 or 3+2...) I don't think the exact layout of individual symbols (diamond, star, ...) is semantically significant, only their number is important (and the fact they are grouped together on the same medium with the same foreground/background colors or texturing and the same sizes). 2015-05-29 9:32 GMT+02:00 "Jörg Knappen" : > From the description of the symbol it looks like a geometric shape. I > think it is worth to be encoded as a geometric shape (TWO BLACK DIAMONDS > VERTICALLY STACKED or something like this) with a note * bunny hill. It may > have (or find in future) other uses. > > --Jörg Knappen > > *Gesendet:* Donnerstag, 28. Mai 2015 um 23:20 Uhr > *Von:* "Shervin Afshar" > *An:* "Shawn Steele" > *Cc:* "verdy_p at wanadoo.fr" , "unicode Unicode > Discussion" , "Jim Melton" > *Betreff:* Re: "Bunny hill" symbol, used in America for signaling ski > pistes for novices > Since the double-diamond has map and map legend usage, it might be a > good idea to have it encoded separately. I know that I'm stating the > obvious here, but the important point is doing the research and showing > that it has widespread usage. > > – Shervin > > On Thu, May 28, 2015 at 2:15 PM, Shawn Steele > wrote: >> >> I'm used to them being next to each other.
So the entire discussion >> seems to be about how to encode a concept vs how to get the shape you want >> with existing code points. If you just want the perfect shape, then maybe >> an svg is a better choice. If we?re talking about describing ski-run >> difficulty levels in plain-text, then the hodge-podge of glyphs being >> offered in this thread seems kinda hacky to me. >> >> >> >> -Shawn >> >> >> >> *From:* verdyp at gmail.com [mailto:verdyp at gmail.com] *On Behalf Of *Philippe >> Verdy >> *Sent:* Thursday, May 28, 2015 2:12 PM >> *To:* Jim Melton >> *Cc:* Shawn Steele; unicode Unicode Discussion >> *Subject:* Re: "Bunny hill" symbol, used in America for signaling ski >> pistes for novices >> >> >> >> Some documentations also suggest that the two diamonds are not stacked >> one above the other, but horizontally. It's a good point for using only one >> symbol, encoding it twice in plain-text if needed. >> >> >> >> 2015-05-28 22:15 GMT+02:00 Jim Melton : >> >> I no longer ski, but I did so for many years, mostly (but not >> exclusively) in the western United States. I never encountered, at any USA >> ski hill/mountain/resort, a special symbol for "bunny hills", which are >> typically represented by the green circle meaning "beginner". That's >> anecdotal evidence at best, but my observations cover numerous skiing >> sites. I have encountered such a symbol in Europe and in New Zealand, but >> not in the USA. (I have not had the pleasure of skiing in Canada and am >> thus unable to speak about ski areas in that country.) >> >> The double black diamond would appear to be a unique symbol worthy of >> encoding, simply because the only valid typographical representation (in >> the USA) is two single black diamonds stacked one above the other and >> touching at the points. >> >> Hope this helps, >> Jim >> >> >> On 5/28/2015 2:04 PM, Shawn Steele wrote: >> >> So is double black diamond a separate symbol? Or just two of the black >> diamond? >> >> >> >> And Blue-Black? >> >> >> >> I?m drawing a blank on a specific bunny sign, in my experience those are >> usually just green. >> >> >> >> Aren?t there a lot of cartography symbols for various systems that aren?t >> present in Unicode? >> >> >> >> *From:* Unicode [mailto:unicode-bounces at unicode.org >> ] *On Behalf Of *Philippe Verdy >> *Sent:* Thursday, May 28, 2015 12:47 PM >> *To:* unicode Unicode Discussion >> *Subject:* "Bunny hill" symbol, used in America for signaling ski pistes >> for novices >> >> >> >> Is there a symbol that can represent the "Bunny hill" symbol used in >> North America and some other American territories with mountains, to >> designate the ski pistes open to novice skiers (those pistes are signaled >> with green signs in Europe). >> >> >> >> I'm looking for the symbol itself, not the color, or the form of the sign. >> >> >> >> For example blue pistes in Europe are designed with a green circle in >> America, but we have a symbol for the circle; red pistes in Europe are >> signaled by a blue square in America, but we have a symbol for the square; >> black pistes in Europe are signaled by a black diamond in America, but we >> also have such "black" diamond in Unicode. >> >> >> >> But I can't find an equivalent to the American "Bunny hill" signal, >> equivalent to green pistes in Europe (this is a problem for webpages >> related to skiing: do we have to embed an image ?). 
>> >> >> >> >> >> -- >> >> ======================================================================== >> >> Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144 >> >> Chair, ISO/IEC JTC1/SC32 and W3C XML Query WG Fax : +1.801.942.3345 >> >> Oracle Corporation Oracle Email: jim dot melton at oracle dot com >> >> 1930 Viscounti Drive Alternate email: jim dot melton at acm dot org >> >> Sandy, UT 84093-1063 USA Personal email: SheltieJim at xmission dot com >> >> ======================================================================== >> >> = Facts are facts. But any opinions expressed are the opinions = >> >> = only of myself and may or may not reflect the opinions of anybody = >> >> = else with whom I may or may not have discussed the issues at hand. = >> >> ======================================================================== >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Sat May 30 18:21:44 2015 From: doug at ewellic.org (Doug Ewell) Date: Sat, 30 May 2015 16:21:44 -0700 Subject: Tag characters and in-line graphics (from Tag characters) Message-ID: <20150530162143.665a7a7059d7ee80bb4d670165c8327d.d600649964.wbe@email03.secureserver.net> Note: Everything below is my personal opinion and does not represent any official Unicode Consortium or UTC position. William_J_G Overington wrote: >> Historically, Unicode was not meant to be the means by which brand >> new ideas are run up the proverbial flagpole to see if they will gain >> traction. > > History is interesting and can be a good guide, yet many things that > are an accepted part of Unicode today started as new ideas that gained > traction and became implemented. So history should not be allowed to > be a reason to restrict progress. I used "historically" to distinguish between the pre- and post-Emoji Revolution eras. There have clearly been changes recently, but there is still at least a minimal expectation that proposed characters will fulfill a demonstrated need. I'm not seeing any truly novel, untested ideas in the list below that Unicode implemented purely on speculation. > For example, there was the extension from 1 plane to 17 planes. That was an architectural extension, brought about by the realization that 64K code points wasn't enough for even the original scope. There's no comparison. > There was the introduction of emoji support. Emoji proponents would argue that "emoji support" began in 1.0 with the inclusion of various dingbats. But even emoji are arguably "characters" in some sense. They aren't a mini-language used to define images pixel by pixel. > There was the introduction of the policy of colour sometimes being a > recorded property rather than having just the original monochrome > recording policy. There isn't any such policy. There is a variation selector to suggest that the rendering engine show certain characters in "emoji style" instead of "text style," and there are characters with colors in their names, but there is no policy that specific colors are "recorded" as part of the encoding. YELLOW HEART could conformantly appear in any color. > There has been the change of encoding policy that facilitated the > introduction of the Indian Rupee character into Unicode and ISO/IEC > 10646 far more quickly than had been thought possible, so that the > encoding was ready for use when needed. That's not a change to what types of things get encoded. 
It's a procedural change, one which I would agree has been applied with increasing creativity. > There has been the recent encoding policy change regarding encoding of > pure electronic use items taking place without (extensive prior use > using a Private Use Area encoding), such as the encoding of the > UNICORN FACE. This is probably your best analogy. People like Asmus have addressed it, saying it's not reasonable to expect users to adopt PUA solutions and wait for them to catch on. > There is the recent change to the deprecation status of most of the > tag characters and the acceptance of the base character followed by > tag characters technique so as to allow the specifying of a larger > collection of particular flags. There must have been a great wailing and gnashing of teeth over that decision. So many statements were made over the years about the basic evilness of tag characters. But the concept of representing flags was already agreed upon as a "compatibility" measure, and the Regional Indicator Symbols solution was a compromise that allowed expansion beyond the 10 flags that Japanese telcos chose to include. RIS were an architectural decision. The tag solution (to be fully outlined in a future PRI) was another architectural decision. Neither (I believe) is analogous to a scope decision to start encoding different types of non-character things as if they were characters, and as I have said before, assigning a glyph to a thing that isn't a character doesn't make it one. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From Shawn.Steele at microsoft.com Sat May 30 18:34:38 2015 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Sat, 30 May 2015 23:34:38 +0000 Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices In-Reply-To: References: <55677762.3060805@oracle.com> Message-ID: I guess it depends on what you're representing. If it is the concept of "double black", then maybe a separate symbol and the "font" or other selectors determine if it's vertically or horizontally rendered. From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy Sent: Saturday, May 30, 2015 2:56 PM To: Jörg Knappen Cc: Shervin Afshar; unicode Unicode Discussion Subject: Re: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices But observations show that the vertical stacking is not universal. Horizontal stacking is also used in direction signs. My opinion is that they are just two separate "diamonds" and not a single symbol. Quite equivalent to the situation with the classification of hotels with stars (generally aligned horizontally but not always, we can see them also arranged vertically, or on two rows 1+1, 1+2 or 2+1 or 2+3 or 3+2...) I don't think the exact layout of individual symbols (diamond, star, ...) is semantically significant, only their number is important (and the fact they are grouped together on the same medium with the same foreground/background colors or texturing and the same sizes). 2015-05-29 9:32 GMT+02:00 "Jörg Knappen" >: From the description of the symbol it looks like a geometric shape. I think it is worth to be encoded as a geometric shape (TWO BLACK DIAMONDS VERTICALLY STACKED or something like this) with a note * bunny hill. It may have (or find in future) other uses. --Jörg Knappen Gesendet: Donnerstag, 28.
Mai 2015 um 23:20 Uhr Von: "Shervin Afshar" > An: "Shawn Steele" > Cc: "verdy_p at wanadoo.fr" >, "unicode Unicode Discussion" >, "Jim Melton" > Betreff: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices Since the double-diamond has map and map legend usage, it might be a good idea to have it encoded separately. I know that I'm stating the obvious here, but the important point is doing the research and showing that it has widespread usage. ? Shervin On Thu, May 28, 2015 at 2:15 PM, Shawn Steele > wrote: I?m used to them being next to each other. So the entire discussion seems to be about how to encode a concept vs how to get the shape you want with existing code points. If you just want the perfect shape, then maybe an svg is a better choice. If we?re talking about describing ski-run difficulty levels in plain-text, then the hodge-podge of glyphs being offered in this thread seems kinda hacky to me. -Shawn From: verdyp at gmail.com [mailto:verdyp at gmail.com] On Behalf Of Philippe Verdy Sent: Thursday, May 28, 2015 2:12 PM To: Jim Melton Cc: Shawn Steele; unicode Unicode Discussion Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices Some documentations also suggest that the two diamonds are not stacked one above the other, but horizontally. It's a good point for using only one symbol, encoding it twice in plain-text if needed. 2015-05-28 22:15 GMT+02:00 Jim Melton >: I no longer ski, but I did so for many years, mostly (but not exclusively) in the western United States. I never encountered, at any USA ski hill/mountain/resort, a special symbol for "bunny hills", which are typically represented by the green circle meaning "beginner". That's anecdotal evidence at best, but my observations cover numerous skiing sites. I have encountered such a symbol in Europe and in New Zealand, but not in the USA. (I have not had the pleasure of skiing in Canada and am thus unable to speak about ski areas in that country.) The double black diamond would appear to be a unique symbol worthy of encoding, simply because the only valid typographical representation (in the USA) is two single black diamonds stacked one above the other and touching at the points. Hope this helps, Jim On 5/28/2015 2:04 PM, Shawn Steele wrote: So is double black diamond a separate symbol? Or just two of the black diamond? And Blue-Black? I?m drawing a blank on a specific bunny sign, in my experience those are usually just green. Aren?t there a lot of cartography symbols for various systems that aren?t present in Unicode? From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy Sent: Thursday, May 28, 2015 12:47 PM To: unicode Unicode Discussion Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices Is there a symbol that can represent the "Bunny hill" symbol used in North America and some other American territories with mountains, to designate the ski pistes open to novice skiers (those pistes are signaled with green signs in Europe). I'm looking for the symbol itself, not the color, or the form of the sign. For example blue pistes in Europe are designed with a green circle in America, but we have a symbol for the circle; red pistes in Europe are signaled by a blue square in America, but we have a symbol for the square; black pistes in Europe are signaled by a black diamond in America, but we also have such "black" diamond in Unicode. 
But I can't find an equivalent to the American "Bunny hill" signal, equivalent to green pistes in Europe (this is a problem for webpages related to skiing: do we have to embed an image ?). -- ======================================================================== Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144 Chair, ISO/IEC JTC1/SC32 and W3C XML Query WG Fax : +1.801.942.3345 Oracle Corporation Oracle Email: jim dot melton at oracle dot com 1930 Viscounti Drive Alternate email: jim dot melton at acm dot org Sandy, UT 84093-1063 USA Personal email: SheltieJim at xmission dot com ======================================================================== = Facts are facts. But any opinions expressed are the opinions = = only of myself and may or may not reflect the opinions of anybody = = else with whom I may or may not have discussed the issues at hand. = ======================================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Sat May 30 19:34:50 2015 From: prosfilaes at gmail.com (David Starner) Date: Sun, 31 May 2015 00:34:50 +0000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <20150530162143.665a7a7059d7ee80bb4d670165c8327d.d600649964.wbe@email03.secureserver.net> References: <20150530162143.665a7a7059d7ee80bb4d670165c8327d.d600649964.wbe@email03.secureserver.net> Message-ID: I would say that a system would conform with Unicode in having yellow heart red (in a non-monochrome font) as well as if it made it a cross. Either way it's violating character identity. I'd say that being monochromatic is now like being monospaced; it's suboptimal for a Unicode implementation, but hardly something Unicode can condemn as nonconformant. On 4:25pm, Sat, May 30, 2015 Doug Ewell wrote: > Note: Everything below is my personal opinion and does not represent any > official Unicode Consortium or UTC position. > > William_J_G Overington > wrote: > > >> Historically, Unicode was not meant to be the means by which brand > >> new ideas are run up the proverbial flagpole to see if they will gain > >> traction. > > > > History is interesting and can be a good guide, yet many things that > > are an accepted part of Unicode today started as new ideas that gained > > traction and became implemented. So history should not be allowed to > > be a reason to restrict progress. > > I used "historically" to distinguish between the pre- and post-Emoji > Revolution eras. There have clearly been changes recently, but there is > still at least a minimal expectation that proposed characters will > fulfill a demonstrated need. > > I'm not seeing any truly novel, untested ideas in the list below that > Unicode implemented purely on speculation. > > > For example, there was the extension from 1 plane to 17 planes. > > That was an architectural extension, brought about by the realization > that 64K code points wasn't enough for even the original scope. There's > no comparison. > > > There was the introduction of emoji support. > > Emoji proponents would argue that "emoji support" began in 1.0 with the > inclusion of various dingbats. But even emoji are arguably "characters" > in some sense. They aren't a mini-language used to define images pixel > by pixel. > > > There was the introduction of the policy of colour sometimes being a > > recorded property rather than having just the original monochrome > > recording policy. > > There isn't any such policy. 
There is a variation selector to suggest > that the rendering engine show certain characters in "emoji style" > instead of "text style," and there are characters with colors in their > names, but there is no policy that specific colors are "recorded" as > part of the encoding. YELLOW HEART could conformantly appear in any > color. > > > There has been the change of encoding policy that facilitated the > > introduction of the Indian Rupee character into Unicode and ISO/IEC > > 10646 far more quickly than had been thought possible, so that the > > encoding was ready for use when needed. > > That's not a change to what types of things get encoded. It's a > procedural change, one which I would agree has been applied with > increasing creativity. > > > There has been the recent encoding policy change regarding encoding of > > pure electronic use items taking place without (extensive prior use > > using a Private Use Area encoding), such as the encoding of the > > UNICORN FACE. > > This is probably your best analogy. People like Asmus have addressed it, > saying it's not reasonable to expect users to adopt PUA solutions and > wait for them to catch on. > > > There is the recent change to the deprecation status of most of the > > tag characters and the acceptance of the base character followed by > > tag characters technique so as to allow the specifying of a larger > > collection of particular flags. > > There must have been a great wailing and gnashing of teeth over that > decision. So many statements were made over the years about the basic > evilness of tag characters. > > But the concept of representing flags was already agreed upon as a > "compatibility" measure, and the Regional Indicator Symbols solution was > a compromise that allowed expansion beyond the 10 flags that Japanese > telcos chose to include. RIS were an architectural decision. The tag > solution (to be fully outlined in a future PRI) was another > architectural decision. Neither (I believe) is analogous to a scope > decision to start encoding different types of non-character things as if > they were characters, and as I have said before, assigning a glyph to a > thing that isn't a character doesn't make it one. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shawn.Steele at microsoft.com Sat May 30 22:02:11 2015 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Sun, 31 May 2015 03:02:11 +0000 Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices In-Reply-To: References: Message-ID: I?m really curious to see one of these signs. Is it a regional thing? From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Leonardo Boiko Sent: Thursday, May 28, 2015 1:02 PM To: Philippe Verdy Cc: unicode Unicode Discussion Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices You could use U+1F407 RABBIT combined with U+20E4 COMBINING ENCLOSING UPWARD POINTING TRIANGLE, and pretend the triangle is a hill. ?? ? If only we had a combining rabbit, we could add rabbits to U+1F3D4 SNOW CAPPED MOUNTAIN. Or anything else. 2015-05-28 16:46 GMT-03:00 Philippe Verdy >: Is there a symbol that can represent the "Bunny hill" symbol used in North America and some other American territories with mountains, to designate the ski pistes open to novice skiers (those pistes are signaled with green signs in Europe). 
I'm looking for the symbol itself, not the color, or the form of the sign. For example blue pistes in Europe are designed with a green circle in America, but we have a symbol for the circle; red pistes in Europe are signaled by a blue square in America, but we have a symbol for the square; black pistes in Europe are signaled by a black diamond in America, but we also have such "black" diamond in Unicode. But I can't find an equivalent to the American "Bunny hill" signal, equivalent to green pistes in Europe (this is a problem for webpages related to skiing: do we have to embed an image ?). -------------- next part -------------- An HTML attachment was scrubbed... URL: From c933103 at gmail.com Sun May 31 03:43:12 2015 From: c933103 at gmail.com (gfb hjjhjh) Date: Sun, 31 May 2015 16:43:12 +0800 Subject: Some questions about Unicode's CJK Unified Ideograph In-Reply-To: <55694EAD.6030604@hiroshima-u.ac.jp> References: <55691764.4030802@att.net> <55694EAD.6030604@hiroshima-u.ac.jp> Message-ID: Thanks for the answers. As for ??? versus ???: as I don't have much knowledge about Vietnamese, and the character is from chu han instead of chu nom, I don't really know whether there are any semantic differences between the two, but at least the one usage of ??? shown in the word on that dictionary page would be something like "dumb, mute", which was not listed as part of the meaning of the character ? in Wiktionary. And for the proper name mark and book name mark: while I see the point that they would be best achieved via word processor styling or a markup language, is it then a good idea to integrate things similar to a markup language into Unicode, like creating a character ps that indicates the start of a proper name mark and pe for the end of a proper name mark, so that typing psPROPERNAMEpe would result in something similar to PROPERNAME? And when using the workaround suggested by Andrew, yes, the hair space works, but it adds a gap between the characters with a width equal to an 'i'. I have also tried characters like U+200C or U+034F, which do not work. And it seems that BabelStone Han does not support U+1AB6? And is there any vertical edition of the two characters... -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Sun May 31 06:05:10 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sun, 31 May 2015 04:05:10 -0700 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <1432995244747.7fa720f5@Nodemailer> References: <1432995244747.7fa720f5@Nodemailer> Message-ID: <556AEAE6.2040203@ix.netcom.com> John, reading this discussion, I agree with your reductio ad absurdum of infinitely nested HTML. But I think you are onto something with your hypothetical example of the "subset that works in ALL textual situations". There's clearly a use case for something like it, and I believe many people would intuitively agree on a set of features for it. What people seem to have in mind is something like "inline" text. Something beyond a mere stream of plain text (with effectively every character rendered visibly), but still limited in important ways by general behavior of inline text: a string of it, laid out, must wrap and line break, any objects included in it must behave like characters (albeit of custom width, height and appearance), and so on. Paragraph formatting, stacked layout, header levels and all those good things would not be available.
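For instance, a fragment like this stays entirely within such a subset (an HTML-flavoured sketch only, with a hypothetical double-diamond.svg; the concrete syntax is beside the point, the feature set is what matters):

    Ski the <b>double black</b> run
    <img src="double-diamond.svg" alt="double black diamond" style="height:1em">
    if you dare.

Everything in it wraps, line breaks and measures like ordinary characters, while an <h1> or a <table> would fall outside the subset.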
With such a subset clearly defined, many quirky limitations might no longer be necessary; any container that today only takes plain text could be upgraded to take "inline text". I can see some inline containers retaining a nesting limitation, but I could imagine that it is possible to arrive at a consistent definition of such inline format. Going further, I can't shake the impression that without a clean definition of an inline text format along those lines, any attempts at making stickers and similar solutions "stick" are doomed to failure. The interesting thing in defining such a format is not how to represent it in HTML or CSS syntax, but in describing what feature sets it must (minimally) support. Doing it that way would free existing implementations of rich text to map native formats onto that minimally required subset and to add them to their format translators for HMTL or whatever else they use for interchange. Only with a definition can you ever hope to develop a processing model. It won't be as simple as for plain text strings, but it should be able to support common abstractions (like iteration by logical unit). It would have to support the management of external resources - if the inline format allows images, custom fonts, etc. one would need a way to manage references to them in the local context. If your skeptical position proves correct in that this is something that turns out to not be tractable, then I think you've provided conclusive proof why stickers won't happen and why encoding emoji was the only sensible decision Unicode could have taken. A./ On 5/30/2015 7:14 AM, John wrote: > > Hmm, these "once entities" of which you speak, do they require > javascript? Because I'm not sure what we are looking for here is > static documents requiring a full programming language. > > But let's say for a moment that html5 can, or could do the job here. > Then to make the dream come true that you could just cut and paste > text that happened to contain a custom character to somewhere else, > and nothing untoward would happen, would mean that everything in the > computing universe should allow full blown html. So every Java Swing > component, every Apple gui component, every .NET component, every > windows component, every browser, every Android and IOS component > would allow text entry of HTML entities. OK, so let's say everyone > agrees with this course of action, now the universal text format is HTML. > > But in this new world where anywhere that previously you could input > text, you can now input full blown html, does that actually make > sense? Does it make sense that you can for example, put full blown > HTML inside a H1 tag in html itself? That's a lot of recursion going > on there. Or in a MS-Excel cell? Or interspersed in some otherwise > fairly regular text in a Word document? > > I suppose someone could define a strict limited subset of HTML to be > that subset that makes sense in ALL textual situations. That subset > would be something like just defining things that act like characters, > and not like a full blown rendering engine. But who would define that > subset? Not the HTML groups, because their mandate is to define full > blown rendering engines. It would be more likely to be something like > the unicode group. > > And also, in this brave new world where HTML5 is the new standard text > format, what would the binary format of it be? I mean, if I have the > string of unicode characters that should be rendered as such? 
Or would it be text that happens to > contain greater than symbol, I, M and G? It would have to be the > former I guess, and thereby there would no longer be a unicode symbol > for the mathematical greater than symbol. Rather there would be a > unicode symbol for opening a HTML tag, and the text code for greater > than would be > Never again would a computer store > to mean > greater than. Do we want HTML to be so pervasive? Not sure it deserves > that. > > And from a programmers point of view, he wants to be able to iterate > over an array of characters and treat each one the same way, > regardless if it is a custom character or not. Without that kind of > programmatic abstraction, the whole thing can never gain traction. I > don't think fully blown HTML embedded in your text can fulfill that. A > very strictly defined subset, possibly could. Sure HTML5 can RENDER > stuff adquately, if the only aim of the game is provide a correct > rendering. But to be able to actually treat particular images embedded > as characters, and have some programming library see that abstraction > consistently, I'm not sure I'm convinced that is possible. Not without > nailing down exactly what html elements in what particular > circumstances constitute a "character". > > I guess in summary, yes we have the technology already to render > anything. But I don't think the whole standards framework does > anything to allow the computing universe to actually exchange custom > characters as if they were just any other text. Someone would actually > have to work on a standard to do that, not just point to html5. > > > On Saturday, 30 May 2015 at 5:08 am, Philippe Verdy > >, wrote: > > > 2015-05-29 4:37 GMT+02:00 John >: > > "Today the world goes very well with HTML(5) which is now the > bext markup language for document (including for inserting > embedded images that don?t require any external request? > If I had a large document that reused a particular character > thousands of times, would this HTML markup require embedding > that character thousands of times, or could I define the > character once at the beginning of the sequence, and then > refer back to it in a space efficient way? > > > HTML(5) allows defining *once* entities for images that can then > be reused thousands of times without repeting their definition. > You can do this as well with CSS styles, just define a class for a > small element. This element may still be an "image", but the > semantic is carried by the class you assign to it. You are not > required to provide an external source URL for that image if the > CSS style provides the content. > > You may also use PUAs for the same purpose (however I have not > seen how CSS allows to style individual characters in text > elements as these characters are not elements, and there's no > defined selector for pseudo-elements matching a single character). > PUAs are perfectly usable in the situation where you have embedded > a custom font in your document for assigning glyphs to characters > (you can still do that, but I would avoid TrueType/OpenType for > this purpose, but would use the SVG font format which is valid in > CSS, for defining a collection of glyphs). > > If the document is not restricted to be standalone, of course you > can use links to an external shared CSS stylesheet and to this SVG > font referenced by the stylesheet. 
With such approach, you don't > even need to use classes on elements, you use plain-text with very > compact PUAs (it's up to you to decide if the document must be > standalone (embedding everything it needs) or must use external > references for missing definitions, HTML allows both (and SVG as > well when it contains plain-text elements). > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Sun May 31 06:42:41 2015 From: andrewcwest at gmail.com (Andrew West) Date: Sun, 31 May 2015 12:42:41 +0100 Subject: Some questions about Unicode's CJK Unified Ideograph In-Reply-To: References: <55691764.4030802@att.net> <55694EAD.6030604@hiroshima-u.ac.jp> Message-ID: On 31 May 2015 at 09:43, gfb hjjhjh wrote: > > As of ??? versus ???, as I don't have much knowledge about Vietnamese and > the character is from chu han instead of chu nom, I don't really know if > there are any semantic difference between the two, but at least the one > usage of ??? shown in the word on that dictionary page would be something > like "dumb, mute" which were not listed as part of the meaning of the > character ? in wiktionary. The way CJK unification works, you don't need to show that there is a semantic difference between the two forms, just that the form is used in a reputable source. Can you send me off-list a scan of the character from the Vietnamese dictionary you mention? > And for the proper name mark and book name mark, while i see the point that > it wiuld be best achieve via word processor styling or markup language, so > is it a good idea to integrating things similar to markup language into > unicode, like create a character ps that indicate start of proper name mark > and pe for end of proper name mark, then typing psPROPERNAMEpe would result > in something similar to PROPERNAME? I think you can achieve the appropriate styling for web pages using CSS: http://www.w3.org/TR/2013/WD-css-text-decor-3-20130103/#text-decoration-style-property > And if using the work around suggested by Andrew, yes the hair space work > but it a distance between characters a gap with width equal to an 'i'. Have > also tried characters like u+200c or u+034f which does not work. Even with OpenType it is not easy to contextually create a gap between two combining underlines as the characters are not adjacent (I don't think it is impossible, but the only way I can think of doing it is rather unpleasant; perhaps other font experts on this list know an easy way of doing it). > and it seem > like babelstone han is not supporting U+1AB6? U+1AB6 is supported in the next release of BabelStone Han (due for release very soon, probably within the next week or two). > and is there any vertical > edition of the two characters... The combining underline and wavy line characters will work OK with a vertically oriented CJK font (they will display on the left). Unfortunately BabelStone does not currently work very well in vertical orientation. Andrew From idou747 at gmail.com Sun May 31 07:33:44 2015 From: idou747 at gmail.com (John) Date: Sun, 31 May 2015 05:33:44 -0700 (PDT) Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <556AEAE6.2040203@ix.netcom.com> References: <556AEAE6.2040203@ix.netcom.com> Message-ID: <1433075623556.38b645ad@Nodemailer> Yes, Asmus good post. But I don?t really think HTML, even a subset, is really the right solution. 
> And if using the work around suggested by Andrew, yes the hair space works, but it puts between the characters a gap with a width equal to an 'i'. Have also tried characters like U+200C or U+034F, which do not work.

Even with OpenType it is not easy to contextually create a gap between two combining underlines as the characters are not adjacent (I don't think it is impossible, but the only way I can think of doing it is rather unpleasant; perhaps other font experts on this list know an easy way of doing it).

> and it seems like BabelStone Han is not supporting U+1AB6?

U+1AB6 is supported in the next release of BabelStone Han (due for release very soon, probably within the next week or two).

> and is there any vertical edition of the two characters...

The combining underline and wavy line characters will work OK with a vertically oriented CJK font (they will display on the left). Unfortunately BabelStone does not currently work very well in vertical orientation.

Andrew

From idou747 at gmail.com Sun May 31 07:33:44 2015 From: idou747 at gmail.com (John) Date: Sun, 31 May 2015 05:33:44 -0700 (PDT) Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <556AEAE6.2040203@ix.netcom.com> References: <556AEAE6.2040203@ix.netcom.com> Message-ID: <1433075623556.38b645ad@Nodemailer>

Yes, Asmus, good post. But I don't really think HTML, even a subset, is really the right solution.

I'm reminded of the design of XML itself: it is supposed to start with a header that defines what that XML will conform to. Those definitions contain some unique identifiers of that XML schema, which happen to be URLs. The URL is partly just a convenient unique identifier, but also, the XML engine, if it doesn't know about that schema, could go to that URL, download the schema, and check that the XML conforms to that schema.

Similarly, imagine a text format that had a header with something like:

\uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345

Now all the characters following in the text will interpret characters that start with 12345 with respect to that character set. What would you find at facebook.com/charsets/pusheen-the-cat-emoji/? You might find bitmaps, truetype fonts, vector graphics, etc. You might find many, many representations of that character set that your rendering engine could cache for future use. The text format wouldn't be reliant on today's favorite rendering technology, whether bitmap, truetype fonts, or whatever. Right now, if you go to a website that references unicode that your platform doesn't know about, you see nothing. If a format like this existed, character sets would be infinitely extensible, everybody on earth could see the characters, even if their platform wasn't previously aware of them, and the format would be independent of today's rendering technologies. Let's face it, HTML5 changes every few years, and I don't think anybody wants the fundamental textual representation dependent on an entire layout engine. And also, the whole range of what HTML5 can do, even some subset, is too much information. You don't necessarily want your text to embed the actual character set. Perhaps that might be a useful option, but I think most people would want to uniquely identify the character set, in a way that an engine can download it, but without defining the actual details itself. Of course, certain charsets would probably become pervasive enough that platforms would just include them for convenience. Emojis by major messaging platforms. Maybe characters related to specialised domains like, I don't know, mapping or specialised work domains or whatever. But without having to be subservient to the central unicode committee.

As someone who is a keen user of Facebook messenger, and who sees them bring out a new set of emoji almost every week, I think the world will soon be totally bored with the plain basic emoji that unicode has defined.

-- Chris
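A rough sketch in Python of how a receiving engine might read such a header; the \uCHARSET syntax, the URL, and the 12345 prefix are all John's invented example, not an existing format:

    import re

    # John's hypothetical declaration: \uCHARSET:<url>,<prefix>
    HEADER = re.compile(r"\\uCHARSET:(?P<url>[^,\s]+),(?P<prefix>\d+)")

    def charset_table(text):
        """Map each declared prefix to the URL where renderings
        (bitmaps, fonts, vector graphics...) could be fetched and cached."""
        return {int(m["prefix"]): m["url"] for m in HEADER.finditer(text)}

    doc = "\\uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345\n..."
    print(charset_table(doc))
    # {12345: 'facebook.com/charsets/pusheen-the-cat-emoji/'}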
On Sun, May 31, 2015 at 9:06 PM, Asmus Freytag (t) wrote:

> John, reading this discussion, I agree with your reductio ad absurdum of infinitely nested HTML.
>
> But I think you are onto something with your hypothetical example of the "subset that works in ALL textual situations".
>
> There's clearly a use case for something like it, and I believe many people would intuitively agree on a set of features for it.
>
> What people seem to have in mind is something like "inline" text. Something beyond a mere stream of plain text (with effectively every character rendered visibly), but still limited in important ways by the general behavior of inline text: a string of it, laid out, must wrap and line break, any objects included in it must behave like characters (albeit of custom width, height and appearance), and so on. Paragraph formatting, stacked layout, header levels and all those good things would not be available.
>
> With such a subset clearly defined, many quirky limitations might no longer be necessary; any container that today only takes plain text could be upgraded to take "inline text". I can see some inline containers retaining a nesting limitation, but I could imagine that it is possible to arrive at a consistent definition of such an inline format.
>
> Going further, I can't shake the impression that without a clean definition of an inline text format along those lines, any attempts at making stickers and similar solutions "stick" are doomed to failure.
>
> The interesting thing in defining such a format is not how to represent it in HTML or CSS syntax, but in describing what feature sets it must (minimally) support. Doing it that way would free existing implementations of rich text to map native formats onto that minimally required subset and to add them to their format translators for HTML or whatever else they use for interchange.
>
> Only with a definition can you ever hope to develop a processing model. It won't be as simple as for plain text strings, but it should be able to support common abstractions (like iteration by logical unit). It would have to support the management of external resources - if the inline format allows images, custom fonts, etc. one would need a way to manage references to them in the local context.
>
> If your skeptical position proves correct in that this is something that turns out to not be tractable, then I think you've provided conclusive proof why stickers won't happen and why encoding emoji was the only sensible decision Unicode could have taken.
>
> A./
>
> On 5/30/2015 7:14 AM, John wrote:
>
>> Hmm, these "once entities" of which you speak, do they require javascript? Because I'm not sure what we are looking for here is static documents requiring a full programming language.
>>
>> But let's say for a moment that html5 can, or could, do the job here. Then to make the dream come true that you could just cut and paste text that happened to contain a custom character to somewhere else, and nothing untoward would happen, would mean that everything in the computing universe should allow full blown html. So every Java Swing component, every Apple gui component, every .NET component, every windows component, every browser, every Android and iOS component would allow text entry of HTML entities. OK, so let's say everyone agrees with this course of action, now the universal text format is HTML.
>>
>> But in this new world where anywhere that previously you could input text, you can now input full blown html, does that actually make sense? Does it make sense that you can, for example, put full blown HTML inside a H1 tag in html itself? That's a lot of recursion going on there. Or in a MS-Excel cell? Or interspersed in some otherwise fairly regular text in a Word document?
>>
>> I suppose someone could define a strict limited subset of HTML to be that subset that makes sense in ALL textual situations. That subset would be something like just defining things that act like characters, and not like a full blown rendering engine. But who would define that subset? Not the HTML groups, because their mandate is to define full blown rendering engines. It would be more likely to be something like the unicode group.
>>
>> And also, in this brave new world where HTML5 is the new standard text format, what would the binary format of it be? I mean, if I have the string of unicode characters <img ...>, is that an image definition that should be rendered as such? Or would it be text that happens to contain greater than symbol, I, M and G? It would have to be the former I guess, and thereby there would no longer be a unicode symbol for the mathematical greater than symbol. Rather there would be a unicode symbol for opening a HTML tag, and the text code for greater than would be &gt;. Never again would a computer store > to mean greater than. Do we want HTML to be so pervasive? Not sure it deserves that.
>>
>> And from a programmers point of view, he wants to be able to iterate over an array of characters and treat each one the same way, regardless of whether it is a custom character or not. Without that kind of programmatic abstraction, the whole thing can never gain traction. I don't think fully blown HTML embedded in your text can fulfill that. A very strictly defined subset possibly could. Sure, HTML5 can RENDER stuff adequately, if the only aim of the game is to provide a correct rendering. But to be able to actually treat particular images embedded as characters, and have some programming library see that abstraction consistently, I'm not sure I'm convinced that is possible. Not without nailing down exactly what html elements in what particular circumstances constitute a "character".
>>
>> I guess in summary, yes we have the technology already to render anything. But I don't think the whole standards framework does anything to allow the computing universe to actually exchange custom characters as if they were just any other text. Someone would actually have to work on a standard to do that, not just point to html5.
>>
>> On Saturday, 30 May 2015 at 5:08 am, Philippe Verdy wrote:
>>
>> 2015-05-29 4:37 GMT+02:00 John:
>>
>> "Today the world goes very well with HTML(5) which is now the best markup language for documents (including for inserting embedded images that don't require any external request)."
>> If I had a large document that reused a particular character thousands of times, would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space efficient way?
>>
>> HTML(5) allows defining *once* entities for images that can then be reused thousands of times without repeating their definition. You can do this as well with CSS styles: just define a class for a small element. This element may still be an "image", but the semantic is carried by the class you assign to it. You are not required to provide an external source URL for that image if the CSS style provides the content.
>>
>> You may also use PUAs for the same purpose (however I have not seen how CSS allows styling individual characters in text elements, as these characters are not elements, and there's no defined selector for pseudo-elements matching a single character). PUAs are perfectly usable in the situation where you have embedded a custom font in your document for assigning glyphs to characters (you can still do that, but I would avoid TrueType/OpenType for this purpose, and would instead use the SVG font format, which is valid in CSS, for defining a collection of glyphs).
>>
>> If the document is not restricted to be standalone, of course you can use links to an external shared CSS stylesheet and to this SVG font referenced by the stylesheet. With such an approach, you don't even need to use classes on elements; you use plain text with very compact PUAs (it's up to you to decide if the document must be standalone, embedding everything it needs, or must use external references for missing definitions; HTML allows both, and SVG as well when it contains plain-text elements).

From andrewcwest at gmail.com Sun May 31 07:55:50 2015 From: andrewcwest at gmail.com (Andrew West) Date: Sun, 31 May 2015 13:55:50 +0100 Subject: Some questions about Unicode's CJK Unified Ideograph In-Reply-To: References: <55691764.4030802@att.net> <55694EAD.6030604@hiroshima-u.ac.jp> Message-ID:

On 31 May 2015 at 12:42, Andrew West wrote:

> Even with OpenType it is not easy to contextually create a gap between two combining underlines as the characters are not adjacent [...]

Ignore that, I wasn't thinking straight. It can be done easily using OpenType.

Andrew

From jsbien at mimuw.edu.pl Sun May 31 09:32:36 2015 From: jsbien at mimuw.edu.pl (Janusz S. Bień) Date: Sun, 31 May 2015 16:32:36 +0200 Subject: the usage of LATIN SMALL LETTER A WITH STROKE Message-ID: <86lhg43ji3.fsf@mimuw.edu.pl>

I'm curious what was the motivation for adding the character to Unicode. I understand the proposal is somewhere in the archives; perhaps it is available on the Internet?

The only usage I'm aware of (with the exception of my own for historical Polish) is that found in Wiktionary: ⱥ is also used for the sign for avo, the small form of Pataca.

Best regards

Janusz

-- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From andrewcwest at gmail.com Sun May 31 09:56:32 2015 From: andrewcwest at gmail.com (Andrew West) Date: Sun, 31 May 2015 15:56:32 +0100 Subject: the usage of LATIN SMALL LETTER A WITH STROKE In-Reply-To: <86lhg43ji3.fsf@mimuw.edu.pl> References: <86lhg43ji3.fsf@mimuw.edu.pl> Message-ID:

On 31 May 2015 at 15:32, Janusz S. Bień wrote:

> I'm curious what was the motivation for adding the character to Unicode. I understand the proposal is somewhere in the archives; perhaps it is available on the Internet?

Please see http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2942.doc.

Andrew

From gansmann at uni-bonn.de Sun May 31 10:01:36 2015 From: gansmann at uni-bonn.de (Gerrit Ansmann) Date: Sun, 31 May 2015 17:01:36 +0200 Subject: the usage of LATIN SMALL LETTER A WITH STROKE In-Reply-To: <86lhg43ji3.fsf@mimuw.edu.pl> References: <86lhg43ji3.fsf@mimuw.edu.pl> Message-ID:

On Sun, 31 May 2015 16:32:36 +0200, Janusz S. Bień wrote:

> I'm curious what was the motivation for adding the character to Unicode.

According to the Code Chart for Latin Extended-B (http://www.unicode.org/charts/PDF/U0180.pdf), it's used for Sencoten. It was also used in some old Norwegian texts (for a start, see here: http://en.wikipedia.org/wiki/Christian_Kølle).
From jsbien at mimuw.edu.pl Sun May 31 10:03:32 2015 From: jsbien at mimuw.edu.pl (Janusz S. Bień) Date: Sun, 31 May 2015 17:03:32 +0200 Subject: the usage of LATIN SMALL LETTER A WITH STROKE In-Reply-To: References: <86lhg43ji3.fsf@mimuw.edu.pl> Message-ID: <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl>

Quote/Cytat - Andrew West (Sun 31 May 2015 04:56:32 PM CEST):

> Please see http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2942.doc.

Thank you very much for your quick answer!

Would you be so kind as to point me to the proposal for the upper case of "A WITH STROKE", or advise me how to look for it in the archive?

Best regards

Janusz

-- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From jsbien at mimuw.edu.pl Sun May 31 10:17:57 2015 From: jsbien at mimuw.edu.pl (Janusz S. Bień) Date: Sun, 31 May 2015 17:17:57 +0200 Subject: the usage of LATIN SMALL LETTER A WITH STROKE In-Reply-To: References: <86lhg43ji3.fsf@mimuw.edu.pl> Message-ID: <20150531171757.141310hr7rh5t4px@mail.mimuw.edu.pl>

Quote/Cytat - Gerrit Ansmann (Sun 31 May 2015 05:01:36 PM CEST):

> According to the Code Chart for Latin Extended-B (http://www.unicode.org/charts/PDF/U0180.pdf), it's used for Sencoten. It was also used in some old Norwegian texts (for a start, see here: http://en.wikipedia.org/wiki/Christian_Kølle).

Thank you very much for the link about old Norwegian (I was aware of Sencoten).

Best regards

JSB

-- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From asmus-inc at ix.netcom.com Sun May 31 10:50:05 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sun, 31 May 2015 08:50:05 -0700 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <1433075623556.38b645ad@Nodemailer> References: <556AEAE6.2040203@ix.netcom.com> <1433075623556.38b645ad@Nodemailer> Message-ID: <556B2DAD.6050204@ix.netcom.com>

On 5/31/2015 5:33 AM, Chris-as-John wrote:

> Yes, Asmus, good post. But I don't really think HTML, even a subset, is really the right solution.

The longer I think about this, what would be needed would be something like an "abstract" format: a specification of the capabilities to be supported and the types of properties needed to support them in an extensible way. HTML and CSS would possibly become an implementation of such a specification.

There would still be a place for a character set, that is Unicode, as an efficient way to implement the most basic and most standard features of text contents, but perhaps with some extension mechanism that can handle various extensions.

The first level of extension is support for recent (or rare) code points in the character set (additional fonts, etc., as you mention).

The next level of extension could be support for collections of custom entities that are not available as character sets (stickers and the like).
And finally, there would have to be a way to deal with "one-offs", such as actual images that do not form categorizable sets, but are used in an ad-hoc manner and behave like custom characters.

And so on.

It should be possible to describe all of this in a way that allows it to be mapped to HTML and CSS or to any other rich text format -- the goal, after all, is to make such "inline text" as widely and effortlessly interchangeable as plain text is today (or at least nearly so).

By keeping the specification abstract, you could accommodate both SGML-like formats where ascii-string markup is intermixed with the text, as well as pure text buffers with placeholder code points and links to external data.

But, however bored you are with plain Unicode emoji, as long as there isn't an agreed upon common format for rich "inline text", I see very little chance that those cute facebook emoji will do anything other than firmly keep you in that particular ghetto.

A./

> I'm reminded of the design of XML itself [...]
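As a toy illustration of the "inline text" abstraction being discussed (not any existing API): embedded objects behave like characters, and a program can iterate by logical unit without caring which kind each unit is.

    from dataclasses import dataclass

    @dataclass
    class EmbeddedObject:      # a sticker, custom glyph, or one-off image
        source_url: str        # external resource a renderer may fetch
        alt_text: str          # plain-text fallback

    # "Inline text" as a sequence of logical units: each unit is either an
    # ordinary character (str) or an EmbeddedObject.
    def to_plain_text(units):
        return "".join(u if isinstance(u, str) else u.alt_text for u in units)

    msg = ["H", "i", " ", EmbeddedObject("//example.net/cat.svg", "[cat]")]
    print(to_plain_text(msg))  # -> "Hi [cat]"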
From frederic.grosshans at gmail.com Sun May 31 11:20:31 2015 From: frederic.grosshans at gmail.com (Frédéric Grosshans) Date: Sun, 31 May 2015 18:20:31 +0200 Subject: the usage of LATIN SMALL LETTER A WITH STROKE In-Reply-To: <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> References: <86lhg43ji3.fsf@mimuw.edu.pl> <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> Message-ID: <556B34CF.2040106@gmail.com>

On 31/05/2015 17:03, Janusz S. Bień wrote:

> Would you be so kind as to point me to the proposal for the upper case of "A WITH STROKE", or advise me how to look for it in the archive?
The upper case was introduced for Sencoten, and the proposal is here: http://www.unicode.org/L2/L2004/04170-sencoten.pdf (found by googling sencoten site:unicode.org)

Frédéric

From doug at ewellic.org Sun May 31 11:44:24 2015 From: doug at ewellic.org (Doug Ewell) Date: Sun, 31 May 2015 10:44:24 -0600 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <20150530162143.665a7a7059d7ee80bb4d670165c8327d.d600649964.wbe@email03.secureserver.net> Message-ID: <4BC2309D56004EFFA43592EBE3248D2E@DougEwell>

David Starner wrote:

> I would say that a system would conform with Unicode in having YELLOW HEART red (in a non-monochrome font) as well as if it made it a cross. Either way it's violating character identity. I'd say that being monochromatic is now like being monospaced; it's suboptimal for a Unicode implementation, but hardly something Unicode can condemn as nonconformant.

This seems fair and sensible. My main point was that being monochromatic (i.e. black) is conformant, and was an attempt to challenge the statement about character color "sometimes being a recorded property." I don't see any Unicode character properties that identify color, only character names, which don't carry property information.

-- Doug Ewell | http://ewellic.org | Thornton, CO

From jsbien at mimuw.edu.pl Sun May 31 13:05:49 2015 From: jsbien at mimuw.edu.pl (Janusz S. Bień) Date: Sun, 31 May 2015 20:05:49 +0200 Subject: the usage of LATIN SMALL LETTER A WITH STROKE In-Reply-To: <556B34CF.2040106@gmail.com> References: <86lhg43ji3.fsf@mimuw.edu.pl> <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> <556B34CF.2040106@gmail.com> Message-ID: <20150531200549.65196yuxvuorqyrh@mail.mimuw.edu.pl>

Quote/Cytat - Frédéric Grosshans (Sun 31 May 2015 06:20:31 PM CEST):

> The upper case was introduced for Sencoten, and the proposal is here: http://www.unicode.org/L2/L2004/04170-sencoten.pdf

Thank you very much for both pieces of information. The proposal makes me curious about past and present Unicode policy, e.g. whether it would be accepted if submitted now. But this is a completely different question, to which I will perhaps return in the future.

Thanks again to all who responded.

Best regards

Janusz

-- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień
- University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From verdy_p at wanadoo.fr Sun May 31 16:26:45 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 31 May 2015 23:26:45 +0200 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <556B2DAD.6050204@ix.netcom.com> References: <556AEAE6.2040203@ix.netcom.com> <1433075623556.38b645ad@Nodemailer> <556B2DAD.6050204@ix.netcom.com> Message-ID:

The "abstract format" already exists, also for HTML, with the MIME "charset" extension of the media type "text/plain" (it can also be embedded in a meta tag, where the HTML source file is just stored in a filesystem, so that a webserver can parse it and provide the correct MIME header if the webserver has no repository for metadata and must infer the media type from the file content itself with some guesser).

It also exists in various conventions for source code (recognized by editors such as vi(m) or Emacs), or for Unix shells, using embedded "magic" identifiers near the top of the file.

You can use it to send an identifier for a private charset without having to request a registration of the charset in the IANA database (which is not intended for private encodings). The private charset can be named in a unique way (consider using a private charset name based on a domain name you own, such as "x-www.example.net-mycharset-1" if you own the domain name "example.net"). It will be enough for the initial experimentation for a few years (or more, provided that you renew this domain name). Your charset can contain various definitions: a mapping of your codepoints (including PUAs, or standard codepoints, or "hacked" codepoints if you have no other solution to get the correct character properties working with existing algorithms such as case mappings, collation, or layout behavior in text renderers).

Such a solution would allow a more predictable management of PUAs, by allowing control of their scope of use: they are bound, in some magic header of the document, to a private charset that remains reasonably unique. For example, "x-example.net-mycharset-1" would map to a URL like "//www.example.net/mycharset/1/" containing some schema (it could be the base address of an XML or JSON file, of a web font containing the relevant glyphs, and of a character properties database to override the default ones from the standard: if you already know this private charset in your application, you don't need to download any of these files; the URL is just an identifier and your file can still be used in standalone mode, just like you can parse many standard XML schemas by just recognizing the URLs assigned to the XML namespaces, without even having to find a DTD or XML schema definition from an external resource; if needed, your app can contain a local repository in some cache folder where you can extend the number of private "charsets" that can be recognized).

----

Full interoperability will still not be possible if you need to mix, in the same document, texts encoded with different private charsets (there's always a risk of collision) without a way to re-encode some of them to a joined charset without the collisions, by inferring a new private charset (it's not impossible to do; after all, this is done already with XML schemas that you can mix together: you just need to rename the XML namespaces, keeping the URLs to which they are bound, when there's a collision on the XML namespace names, a situation that occurs sometimes because of versioning where some features of a schema are not fully upward compatible).

Yes, this complicates things a bit, but much less than when using documents in which PUA assignments are not negotiated at all (even minimally, to make sure they are compatible when mixing sources), and for which there exists for now absolutely no protocol defined for such negotiation (TUS says that PUAs are usable and interchangeable under "private mutual agreement" but still provides no scheme for supporting such mutual agreement, and for this reason PUAs are almost always rejected, and people want true permanent assignments for characters that are very specific, badly documented, or insufficiently known to have reliable permanent properties).

So let's think about securing the use of PUAs with some identification scheme (for plain-text formats, it should just be allowed to negotiate a single charset for the whole, using the "magic" header tricks that have long been used by charset guessers, including for autodetecting UTF-8 encoded files).

This would also solve the chicken-and-egg problem where we need more sources to attest an effective usage before encoding new characters, but developing this usage is extremely difficult (and much slower) with our modern technologies where most documents are now handled numerically (in the past it was possible to create a metal font and use it immediately to start editing books, and there were many more people using handwriting and drawings, so it was much less difficult to invent new characters than it is today, unless you're a big company that has enough resources to develop this usage alone, such as the Japanese telcos or Google, Yahoo, Samsung or Microsoft introducing new sets of Emojis for their instant messaging platforms, with tons of developers working for them to develop a wide range of services around them...)

However, I'm not saying that Unicode should specify how such a private charset containing private assignments could be inserted in headers. I just think that it should describe a mechanism and give examples of how common text formats are already used to convey some "magic" identifiers near the top of the file; then we could describe a service allowing one to locate and retrieve the associated definitions for this identifier, and some interchangeable format for this information.

2015-05-31 17:50 GMT+02:00 Asmus Freytag (t):

> [...]
From idou747 at gmail.com Sun May 31 18:33:49 2015 From: idou747 at gmail.com (Chris) Date: Mon, 1 Jun 2015 09:33:49 +1000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <556AEAE6.2040203@ix.netcom.com> <1433075623556.38b645ad@Nodemailer> <556B2DAD.6050204@ix.netcom.com> Message-ID: <2FF69E18-C2E6-4EA2-89D6-323D416EF459@gmail.com>

Of course, anyone can invent a character set. The difficult bit is having a standard way of combining custom character sets. That's why a standard would be useful. And while stuff like this can, to some extent, be recognised by magic numbers and unique strings in headers, such things are unreliable. Just because example.net/mycharset/ appears near the start of a document doesn't necessarily mean it was meant to define a character set. Maybe it was a document discussing character sets. And while it is tempting to allow the "container" to define the "header" information, whether the container be HTML defining something in its HEAD tag, or some proprietary format (MS-Word), or whatever, that doesn't really solve anybody's problem in a standard way. For a start, what if you want to copy text to the clipboard? You want the thing receiving it to be able to interpret it in a self-contained way.

The 2 obvious implementations for a standard seem to be:

1) A standard (optional) header. Perhaps if the string starts with a special character, then follows a header defining charsets first. These would allocate character ranges for custom characters, and point to where their renderings can be found. Standard programming libraries on all platforms would invisibly act appropriately on these headers. If you concatenated strings with conflicting namespaces, standard libraries would seamlessly reallocate one of the custom namespaces and merge the headers.

2) Make a new character set, let's call it UTF-64. 32 bits would be allocated for custom character sets. Anybody could apply to a central authority to be allocated a custom id (32 bits = 4 billion ids). A central location, kind of like a domain name system, would map that id to the URL where the canonical definition for that character set is.

The 2nd option has the advantage that the file format is fixed width like normal plain text documents. Concatenating custom character set strings is no issue. The canonical location for a character set isn't forevermore mapped to a particular domain owner. Nothing about the meaning of the characters is defined in the actual bits other than the unique id. The disadvantage is that it needs a central authority to maintain the list of ids and map them to domains.

> On 1 Jun 2015, at 7:26 am, Philippe Verdy wrote:
>
> [...]
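A sketch of the resolution step in the scheme Philippe describes above; the identifier layout and the derived URL follow his x-www.example.net-mycharset-1 example, and all of it is illustrative rather than an existing mechanism:

    # Private charsets already known to the application (e.g. shipped with
    # it, or cached in a local repository folder): identifier -> definitions.
    LOCAL_REPOSITORY = {
        "x-www.example.net-mycharset-1": {"glyphs": "mycharset1.svg"},
    }

    def resolve_private_charset(identifier):
        if identifier in LOCAL_REPOSITORY:
            # Known charset: the URL stays a pure identifier and the
            # document still works in standalone mode.
            return LOCAL_REPOSITORY[identifier]
        # Otherwise derive the definition URL from the identifier:
        # "x-www.example.net-mycharset-1" -> "//www.example.net/mycharset/1/"
        body = identifier.removeprefix("x-")
        domain, _, rest = body.partition("-")
        url = "//" + domain + "/" + rest.replace("-", "/") + "/"
        # A real implementation would fetch the schema, web font and
        # character-property overrides from this URL, then cache them.
        return {"pending": url}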
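And a sketch of Chris's option 2; the name UTF-64, the 32/32 bit split, and the central registry are his hypothetical proposal:

    # One fixed-width 64-bit unit per character: the high 32 bits name the
    # custom character set (an id issued by the central registry; 0 could
    # mean plain Unicode), the low 32 bits are the code point within it.
    def utf64_encode(charset_id, code_point):
        assert 0 <= charset_id < 2**32 and 0 <= code_point < 2**32
        return (charset_id << 32) | code_point

    def utf64_decode(unit):
        return unit >> 32, unit & 0xFFFFFFFF

    unit = utf64_encode(charset_id=12345, code_point=0x42)
    assert utf64_decode(unit) == (12345, 0x42)
    # Concatenation needs no namespace merging: every unit already
    # carries the id of the charset it belongs to.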
> For example, "x-example.net-mycharset-1" would map to a URL like "//www.example.net/mycharset/1/" containing some schema (it could be the base address of an XML or JSON file, and of a web font containing the relevant glyphs, and of a character properties database to override the default ones from the standard: if you already know this private charset in your application, you don't need to download any of these files, the URL is just an identifier and your file can still be used in standalone mode, just like you can parse many standard XML schemas by just recognizing the URLs assigned to the XML namespaces, without even having to find a DTD or XML schema definition from an external resource; if needed, your app can contain a local repository in some cache folder where you can extend the number of private "charsets" that can be recognized).
>
> ----
>
> Full interoperability will still not be possible if you need to mix, in the same document, texts encoded with different private charsets (there's always a risk of collision) without a way to re-encode some of them to a joined charset without the collisions, by inferring a new private charset (it's not impossible to do; after all, this is done already with XML schemas that you can mix together: you just need to rename the XML namespaces, keeping the URLs to which they are bound, when there's a collision on the XML namespace names, a situation that occurs sometimes because of versioning where some features of a schema are not fully upward compatible).
>
> Yes, this complicates things a bit, but much less than when using documents in which PUA assignments are not negotiated at all (even minimally, to make sure they are compatible when mixing sources), and for which there exists for now absolutely no protocol defined for such negotiation (TUS says that PUAs are usable and interchangeable under "private mutual agreement" but still provides no scheme for supporting such mutual agreement, and for this reason PUAs are almost always rejected, and people want true permanent assignments for characters that are very specific, badly documented, or insufficiently known to have reliable permanent properties).
>
> So let's think about securing the use of PUAs with some identification scheme (for plain-text formats, it should just be allowed to negotiate a single charset for the whole, using the "magic" header tricks that have long been used by charset guessers, including for autodetecting UTF-8 encoded files).
>
> This would also solve the chicken-and-egg problem where we need more sources to attest an effective usage before encoding new characters, but developing such usage is extremely difficult (and much slower) in our modern technologies where most documents are now handled digitally (in the past it was possible to create a metal font and use it immediately to start editing books, and there were many more people using handwriting and drawings, so it was much less difficult to invent new characters than it is today, unless you're a big company that has enough resources to develop this usage alone, such as Japanese telcos or Google, Yahoo, Samsung or Microsoft introducing new sets of Emojis for their instant messaging platforms, with tons of developers working for them to develop a wide range of services around it...)
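> To give a rough idea (the field names are invented, nothing here is normative), the definitions resource at "//www.example.net/mycharset/1/" could be a small JSON file such as:
>
>     {
>       "charset": "x-www.example.net-mycharset-1",
>       "glyphs": "font.svg",
>       "properties": {
>         "U+E000": { "category": "So", "line_break": "ID", "east_asian_width": "W" }
>       }
>     }
>
> with the glyphs file and the property overrides resolved relative to that base URL, exactly as described above.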
> However, I'm not saying that Unicode should specify how such a private charset containing private assignments could be inserted in headers (I just think that it should describe a mechanism and give examples of how common text formats are already used to convey some "magic" identifiers near the top of the file; then we could describe a service allowing one to locate and retrieve the associated definitions for this identifier, and some interchangeable format for this information).
>
> 2015-05-31 17:50 GMT+02:00 Asmus Freytag (t):
>
> On 5/31/2015 5:33 AM, Chris-as-John wrote:
>>
>> Yes, Asmus, good post. But I don't really think HTML, even a subset, is really the right solution.
>
> The longer I think about this, what would be needed would be something like an "abstract" format: a specification of the capabilities to be supported and the types of properties needed to support them in an extensible way. HTML and CSS would possibly become an implementation of such a specification.
>
> There would still be a place for a character set, that is, Unicode, as an efficient way to implement the most basic and most standard features of text content, but perhaps with some extension mechanism that can handle various extensions.
>
> The first level of extension is support for recent (or rare) code points in the character set (additional fonts, etc., as you mention).
>
> The next level of extension could be support for collections of custom entities that are not available as character sets (stickers and the like).
>
> And finally, there would have to be a way to deal with "one-offs", such as actual images that do not form categorizable sets, but are used in an ad-hoc manner and behave like custom characters.
>
> And so on.
>
> It should be possible to describe all of this in a way that allows it to be mapped to HTML and CSS or to any other rich text format -- the goal, after all, is to make such "inline text" as widely and effortlessly interchangeable as plain text is today (or at least nearly so).
>
> By keeping the specification abstract, you could accommodate both SGML-like formats where ASCII-string markup is intermixed with the text, as well as pure text buffers with placeholder code points and links to external data.
>
> But, however bored you are with plain Unicode emoji, as long as there isn't an agreed-upon common format for rich "inline text", I see very little chance that those cute Facebook emoji will do anything other than firmly keep you in that particular ghetto.
>
> A./
>
>> I'm reminded of the design of XML itself: it is supposed to start with a header that defines what that XML will conform to. Those definitions contain a unique identifier of that XML schema, which happens to be a URL. The URL is partly just a convenient unique identifier, but also, the XML engine, if it doesn't know about that schema, could go to that URL, download the schema, and check that the XML conforms to that schema.
>>
>> Similarly, imagine a text format that had a header with something like:
>> \uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345
>>
>> Now the text that follows will interpret characters that start with 12345 with respect to that character set. What would you find at facebook.com/charsets/pusheen-the-cat-emoji/ ? You might find bitmaps, TrueType fonts, vector graphics, etc. You might find many representations of that character set that your rendering engine could cache for future use.
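>> (Purely hypothetically, a document using that header might then read:
>>
>>     \uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345
>>     Look at \u{12345,0007} chasing \u{12345,0012}!
>>
>> where \u{12345,0007} stands for character 7 of the custom set registered under the prefix 12345. The escape syntax here is invented; the point is only that the body refers back to a single header.)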
>> The text format wouldn't be reliant on today's favorite rendering technology, whether bitmap, TrueType fonts, or whatever. Right now, if you go to a website that references Unicode that your platform doesn't know about, you see nothing. If a format like this existed, character sets would be infinitely extensible, everybody on earth could see the characters, even if their platform wasn't previously aware of them, and the format would be independent of today's rendering technologies. Let's face it, HTML5 changes every few years, and I don't think anybody wants the fundamental textual representation dependent on an entire layout engine. And also, the whole range of what HTML5 can do, even some subset, is too much information. You don't necessarily want your text to embed the actual character set. Perhaps that might be a useful option, but I think most people would want to uniquely identify the character set, in a way that an engine can download it, but without defining the actual details itself. Of course, certain charsets would probably become pervasive enough that platforms would just include them for convenience: emojis by major messaging platforms, maybe characters related to specialised domains like, I don't know, mapping or specialised work domains or whatever. But without having to be subservient to the central Unicode committee.
>>
>> As someone who is a keen user of Facebook Messenger, and who sees them bring out a new set of emoji almost every week, I think the world will soon be totally bored with the plain basic emoji that Unicode has defined.
>>
>> --
>> Chris
>>
>> On Sun, May 31, 2015 at 9:06 PM, Asmus Freytag (t) wrote:
>>
>> Reading this discussion, I agree with your reductio ad absurdum of infinitely nested HTML.
>>
>> But I think you are onto something with your hypothetical example of the "subset that works in ALL textual situations".
>>
>> There's clearly a use case for something like it, and I believe many people would intuitively agree on a set of features for it.
>>
>> What people seem to have in mind is something like "inline" text: something beyond a mere stream of plain text (with effectively every character rendered visibly), but still limited in important ways by the general behavior of inline text: a string of it, laid out, must wrap and line-break, any objects included in it must behave like characters (albeit of custom width, height and appearance), and so on. Paragraph formatting, stacked layout, header levels and all those good things would not be available.
>>
>> With such a subset clearly defined, many quirky limitations might no longer be necessary; any container that today only takes plain text could be upgraded to take "inline text". I can see some inline containers retaining a nesting limitation, but I could imagine that it is possible to arrive at a consistent definition of such an inline format.
>>
>> Going further, I can't shake the impression that without a clean definition of an inline text format along those lines, any attempts at making stickers and similar solutions "stick" are doomed to failure.
>>
>> The interesting thing in defining such a format is not how to represent it in HTML or CSS syntax, but in describing what feature sets it must (minimally) support. Doing it that way would free existing implementations of rich text to map native formats onto that minimally required subset and to add them to their format translators for HTML or whatever else they use for interchange.
>> Only with a definition can you ever hope to develop a processing model. It won't be as simple as for plain text strings, but it should be able to support common abstractions (like iteration by logical unit). It would have to support the management of external resources -- if the inline format allows images, custom fonts, etc., one would need a way to manage references to them in the local context.
>>
>> If your skeptical position proves correct, in that this is something that turns out not to be tractable, then I think you've provided conclusive proof why stickers won't happen and why encoding emoji was the only sensible decision Unicode could have taken.
>>
>> A./
>>
>> On 5/30/2015 7:14 AM, John wrote:
>>>
>>> Hmm, these "once entities" of which you speak, do they require javascript? Because I'm not sure that what we are looking for here is static documents requiring a full programming language.
>>>
>>> But let's say for a moment that html5 can, or could, do the job here. Then to make the dream come true, that you could just cut and paste text that happened to contain a custom character to somewhere else and nothing untoward would happen, would mean that everything in the computing universe should allow full-blown HTML. So every Java Swing component, every Apple GUI component, every .NET component, every Windows component, every browser, every Android and iOS component would allow text entry of HTML entities. OK, so let's say everyone agrees with this course of action; now the universal text format is HTML.
>>>
>>> But in this new world, where anywhere that previously you could input text you can now input full-blown HTML, does that actually make sense? Does it make sense that you can, for example, put full-blown HTML inside an H1 tag in HTML itself? That's a lot of recursion going on there. Or in an MS-Excel cell? Or interspersed in some otherwise fairly regular text in a Word document?
>>>
>>> I suppose someone could define a strict, limited subset of HTML to be that subset that makes sense in ALL textual situations. That subset would be something like just defining things that act like characters, and not like a full-blown rendering engine. But who would define that subset? Not the HTML groups, because their mandate is to define full-blown rendering engines. It would be more likely to be something like the Unicode group.
>>>
>>> And also, in this brave new world where HTML5 is the new standard text format, what would the binary format of it be? I mean, would I have to write the string of Unicode characters "&gt;" to mean greater-than? Do we want HTML to be so pervasive? Not sure it deserves that.
>>>
>>> And from a programmer's point of view, he wants to be able to iterate over an array of characters and treat each one the same way, regardless of whether it is a custom character or not. Without that kind of programmatic abstraction, the whole thing can never gain traction. I don't think full-blown HTML embedded in your text can fulfill that. A very strictly defined subset possibly could. Sure, HTML5 can RENDER stuff adequately, if the only aim of the game is to provide a correct rendering. But to be able to actually treat particular embedded images as characters, and have some programming library see that abstraction consistently, I'm not sure I'm convinced that is possible. Not without nailing down exactly what HTML elements in what particular circumstances constitute a "character".
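>>> To sketch the kind of abstraction I mean (in made-up Python, with an invented charset id; nothing here is standard):
>>>
>>>     from dataclasses import dataclass
>>>
>>>     @dataclass
>>>     class CustomChar:
>>>         charset: str   # invented identifier, e.g. "example.net/mycharset/1/"
>>>         index: int     # code point within that custom set
>>>
>>>     # Ordinary characters and custom characters are iterated and
>>>     # measured through the same interface, one logical unit at a time.
>>>     def display_width(unit) -> int:
>>>         return 2 if isinstance(unit, CustomChar) else 1
>>>
>>>     text = ["H", "i", " ", CustomChar("example.net/mycharset/1/", 7)]
>>>     print(sum(display_width(u) for u in text))  # prints 5
>>>
>>> That is the kind of consistency a library would need to provide, and I don't see full-blown HTML giving it to you.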
>>> I guess in summary, yes, we have the technology already to render anything. But I don't think the whole standards framework does anything to allow the computing universe to actually exchange custom characters as if they were just any other text. Someone would actually have to work on a standard to do that, not just point to html5.
>>>
>>> On Saturday, 30 May 2015 at 5:08 am, Philippe Verdy wrote:
>>>
>>> 2015-05-29 4:37 GMT+02:00 John:
>>> "Today the world goes very well with HTML(5) which is now the best markup language for documents (including for inserting embedded images that don't require any external request)."
>>> If I had a large document that reused a particular character thousands of times, would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space-efficient way?
>>>
>>> HTML(5) allows defining *once* entities for images that can then be reused thousands of times without repeating their definition. You can do this as well with CSS styles: just define a class for a small element. This element may still be an "image", but the semantics are carried by the class you assign to it. You are not required to provide an external source URL for that image if the CSS style provides the content.
>>>
>>> You may also use PUAs for the same purpose (however, I have not seen how CSS allows styling individual characters in text elements, as these characters are not elements, and there's no defined selector for pseudo-elements matching a single character). PUAs are perfectly usable in the situation where you have embedded a custom font in your document for assigning glyphs to characters (you can still do that, but I would avoid TrueType/OpenType for this purpose, and would instead use the SVG font format, which is valid in CSS, for defining a collection of glyphs).
>>>
>>> If the document is not restricted to being standalone, of course you can use links to an external shared CSS stylesheet and to this SVG font referenced by the stylesheet. With such an approach, you don't even need to use classes on elements; you use plain text with very compact PUAs (it's up to you to decide whether the document must be standalone, embedding everything it needs, or must use external references for missing definitions; HTML allows both, and SVG as well when it contains plain-text elements).

From prosfilaes at gmail.com Sun May 31 20:29:27 2015
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 01 Jun 2015 01:29:27 +0000
Subject: the usage of LATIN SMALL LETTER A WITH STROKE
In-Reply-To: <20150531200549.65196yuxvuorqyrh@mail.mimuw.edu.pl>
References: <86lhg43ji3.fsf@mimuw.edu.pl> <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> <556B34CF.2040106@gmail.com> <20150531200549.65196yuxvuorqyrh@mail.mimuw.edu.pl>
Message-ID:

On Sun, May 31, 2015 at 11:09 AM Janusz S. Bien wrote:

> The proposal makes me curious about past and present Unicode policy,
> e.g. would it be accepted if submitted now.

Why wouldn't it? Unicode has, if anything, seemed to become more flexible about adding characters that see any sort of use.