From unicode at unicode.org Sat Aug 4 11:51:54 2018
From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode)
Date: Sat, 4 Aug 2018 18:51:54 +0200 (CEST)
Subject: Diacritic marks in parentheses
In-Reply-To: <20180727072247.GA1728455@phare.normalesup.org>
References: <104663606.43684.1532601618204@ox.hosteurope.de> <1411637396.5643.1532603563943.JavaMail.www@wwinf1m18> <0b3f7895-1428-0363-4ce5-da82cce00032@ix.netcom.com> <20180727072247.GA1728455@phare.normalesup.org>
Message-ID: <2086677586.6791.1533401514672@ox.hosteurope.de>

Arthur Reutenauer:
> On Thu, Jul 26, 2018 at 03:41:47PM -0700, Mark Davis ?? via Unicode wrote:
>> Ein??? A???rzt???? hat eine??? Studenti???n gesehen.
> ?eine??? Student?????? gesehen?.

I certainly would not advocate going to such extremes. My issue was with putting the parentheses at the level where they belong, which would instead yield something more like

    Ein(e) Ärzt(in) hat eine(n) Student(e/i)n gesehen.

This is not how it would actually be used, though. Those short forms are mostly used outside proper prose, e.g. in diagrams, tables or forms.

Belated thanks to Marcel Schneider for pointing me to the Unicode 7.0 character I had somehow failed to find, U+1ABB COMBINING PARENTHESES ABOVE (not used in the samples above).

From unicode at unicode.org Thu Aug 9 00:37:21 2018
From: unicode at unicode.org (Shriramana Sharma via Unicode)
Date: Thu, 9 Aug 2018 11:07:21 +0530
Subject: Usage of emoji in coding contexts!
Message-ID:

First time I'm seeing this (maybe others have seen this already):

https://github.com/wei/pull

Emoji being used in commit messages for classifying the nature of the commit – bug fixes, feature additions etc.

Now *that*'s a nice creative usage of emoji IMO?

I see they haven't always used the actual emoji characters but sometimes :coloned-tags: (or what do you call it), but I presume the GitHub system will convert them to the actual characters before displaying?

--
Shriramana Sharma ????????????
???????????? ????????????????????????

From unicode at unicode.org Thu Aug 9 02:09:57 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Thu, 9 Aug 2018 09:09:57 +0200
Subject: Usage of emoji in coding contexts!
In-Reply-To:
References:
Message-ID:

Very amusing. But interesting how it catches your eye when scanning a list.

Mark

On Thu, Aug 9, 2018 at 7:37 AM, Shriramana Sharma via Unicode <unicode at unicode.org> wrote:
> First time I'm seeing this (maybe others have seen this already):
>
> https://github.com/wei/pull
>
> Emoji being used in commit messages for classifying the nature of the commit – bug fixes, feature additions etc
>
> Now *that*'s a nice creative usage of emoji IMO?
>
> I see they haven't used them always as the actual emoji characters but sometimes as :coloned-tags: (or what do you call it) but I presume the GitHub system will convert it to the actual characters before displaying?
>
> --
> Shriramana Sharma ???????????? ???????????? ????????????????????????

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Thu Aug 9 06:48:59 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 9 Aug 2018 13:48:59 +0200
Subject: Usage of emoji in coding contexts!
In-Reply-To:
References:
Message-ID:

It's just complicated to select a coherent emoji for that (in the edit comment). My opinion is that such icons may be selected from a list as part of the GitHub "tagging" system; these icons may then appear automatically (but as there are multiple candidate tags, each one configured with its own color, there may as well be multiple emojis). The problem with this approach is that such leading emoji are difficult to edit once the GitHub edit is committed. Some of the emojis selected look very strange, or may not be the best ones (e.g. the pizza slice chosen). Some edits could not have a suitable emoji selected (e.g.
merge commits could use an icon like a Y-shaped arrow with two tails but one leading arrow: such an icon is already used by GitHub, but not in that description field). I bet this icon/emoji should be a separate field. And it could also allow setting background/foreground color for the text using a convenient palette (tested also in the presence of colored links: not all background/text colors are suitable, as seen in the color options for "Tags").

This is not just for GitHub: you have an equivalent of GitHub tags, with classification "Labels" in Gmail for example. Emojis are starting to be used in email subject lines too (but most often only by spammers trying to defeat antispam filters: most often, emojis in email subjects are strong indicators of spam or very harassing commercial ads! As they have no actual legal meaning, advertisers tend to use these emojis just to avoid publishing a statement that would be legally binding to them: these emojis are almost always defective and give false information; they are also too prominent, as if the email senders were more important than everything else the recipients are really interested in; they are almost always unnecessarily distracting, and not as important as what senders think).

2018-08-09 9:09 GMT+02:00 Mark Davis ?? via Unicode :
> Very amusing. But interesting how it catches your eye when scanning a list.
>
> Mark
>
> On Thu, Aug 9, 2018 at 7:37 AM, Shriramana Sharma via Unicode <unicode at unicode.org> wrote:
>> First time I'm seeing this (maybe others have seen this already):
>>
>> https://github.com/wei/pull
>>
>> Emoji being used in commit messages for classifying the nature of the commit – bug fixes, feature additions etc
>>
>> Now *that*'s a nice creative usage of emoji IMO?
>>
>> I see they haven't used them always as the actual emoji characters but sometimes as :coloned-tags: (or what do you call it) but I presume the GitHub system will convert it to the actual characters before displaying?
>>
>> --
>> Shriramana Sharma ???????????? ???????????? ????????????????????????

From unicode at unicode.org Thu Aug 9 02:28:17 2018
From: unicode at unicode.org (George Pollard via Unicode)
Date: Thu, 9 Aug 2018 19:28:17 +1200
Subject: Usage of emoji in coding contexts!
In-Reply-To:
References:
Message-ID:

I've seen this codified a little in other repositories, e.g. ?? means 'only a formatting/stylistic change'.

On Thu, 9 Aug 2018 at 19:18 Mark Davis ?? via Unicode wrote:
> Very amusing. But interesting how it catches your eye when scanning a list.
>
> Mark
>
> On Thu, Aug 9, 2018 at 7:37 AM, Shriramana Sharma via Unicode <unicode at unicode.org> wrote:
>> First time I'm seeing this (maybe others have seen this already):
>>
>> https://github.com/wei/pull
>>
>> Emoji being used in commit messages for classifying the nature of the commit – bug fixes, feature additions etc
>>
>> Now *that*'s a nice creative usage of emoji IMO?
>>
>> I see they haven't used them always as the actual emoji characters but sometimes as :coloned-tags: (or what do you call it) but I presume the GitHub system will convert it to the actual characters before displaying?
>>
>> --
>> Shriramana Sharma ???????????? ???????????? ????????????????????????

From unicode at unicode.org Fri Aug 10 15:33:59 2018
From: unicode at unicode.org (Julian Wels via Unicode)
Date: Fri, 10 Aug 2018 22:33:59 +0200
Subject: Thoughts on Emoji Selection Process
Message-ID:

Hi there!

In light of the recently featured 179 proposed Emoji Draft Candidates for Emoji 12.0, I'd like to ask whether the selection factors for future emojis shouldn't be more restrictive, or rather just enforced more strongly.

Extreme Specificity

For instance, one thing that struck me as odd in previous releases was the tendency to extreme specificity.
I always thought of Emoji as symbols and not as concrete images. In a lot of ways Emoji already do that. Every Emoji in the "Smileys" category represents an emotion that can be used to enrich the meaning of text messages, and that's perfect!

Then we have a lot of Objects such as:

- Hamburger: Represents fast food.
- Apple: Represents (healthy) food.
- Bomb: Represents threat.
- Wheelchair (12.0): Represents physical disabilities.

And those are all great objects because they also function as symbols! But there are far more, very specific or redundant objects (just from the 12.0 proposal alone):

- Guide Dog: Represents a specific physical disability.
- Service Dog: Represents a specific physical disability.
- Motorized wheelchair: Represents a specific physical disability.
- Mechanical arm: Represents a specific physical disability.
- Mechanical leg: Represents a specific physical disability.
- Ear with a hearing aid: Represents a specific physical disability.

For one, I think that it should not be the job of Emojis to express as many words as possible. And, although it seems a bit counterintuitive, having all those symbols for the sake of more inclusivity is extremely exclusionary! What about the hundreds of other disabilities that are not listed here?

In case you get the impression this is a problem with disabilities or the 12.0 proposal, let me show you what I like to call the Emoji "Family-Problem". First of all: I don't think that a man and a woman should represent all couples, and I don't think a family should be represented by a man, a woman, and their children. But I also don't believe that we should try to include every possible variant that comes to mind, because as stated before, this will lead to more exclusion through specificity. Currently, we have (among others): 1 father with 2 sons, 2 fathers with 2 sons, 2 fathers with 1 son, 2 fathers with 1 son and 1 daughter, 2 fathers with 1 daughter, and so on.
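[As background to the combinations enumerated above: these family emoji are not separate atomic characters but ZWJ sequences, i.e. ordinary person emoji joined with U+200D ZERO WIDTH JOINER per UTS #51. A minimal Python sketch; the `family` helper is purely illustrative, not part of any standard API, and whether a given combination renders as one composed glyph depends entirely on font support:]

```python
# Family emoji are ZWJ sequences: single person emoji joined with
# U+200D ZERO WIDTH JOINER (see UTS #51, "Emoji ZWJ Sequences").
ZWJ = "\u200D"  # ZERO WIDTH JOINER

MAN = "\U0001F468"    # U+1F468 MAN
WOMAN = "\U0001F469"  # U+1F469 WOMAN
GIRL = "\U0001F467"   # U+1F467 GIRL
BOY = "\U0001F466"    # U+1F466 BOY

def family(*members: str) -> str:
    """Join person emoji with ZWJ, requesting one composed glyph."""
    return ZWJ.join(members)

two_fathers_one_son = family(MAN, MAN, BOY)  # one glyph where the font supports it
single_mom_one_boy = family(WOMAN, BOY)      # unsupported sequences fall back to
                                             # the individual emoji side by side

# Underneath, each sequence is just a short string of code points:
assert len(two_fathers_one_son) == 5  # 3 people + 2 joiners
assert two_fathers_one_son.count(ZWJ) == 2
```

This fallback behavior is why arbitrary combinations degrade gracefully: a renderer that lacks a precomposed glyph for a sequence simply shows its component emoji in a row.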
But, for instance, single moms or dads with one child are missing, who by the way are in some places a very neglected part of society. And this is exactly my point: there are so many representations that every missing one is basically an insult.

Sidenote: I think the solution for the "Family-Problem" should be M-M, F-M and F-F combinations to represent couples, then the same again plus a girl and a boy to represent families. Next, add a man, woman, boy and girl emoji separately, and people can represent their families without any restriction whatsoever if they want. Plus they can express their skin colors, and even pets can be added! Because, as with every written language, symbols (or words) can be linked together to create a new meaning!

Cultural Iconography

Another thing that is worrisome is the proposed addition of a traditional Indian piece of clothing in 12.0. This is extremely specific to one culture, and I'm not sure if we want to open the gate for: "Which culture is included in Unicode and which is not?". Maybe we want that! Maybe we don't. But I think there should at least be a discussion about additions that carry such consequences. I know that there are tons of Chinese symbols in there already, but even the selection factors on the Unicode website state that the presence of a lot of material from former versions should not be a basis of justification for future additions. For instance, the Tokyo Tower emoji does not justify the Eiffel Tower emoji. [link]

Emotions

And my last point, maybe even the most important one: there are currently 63 candidates for Emoji 12.0 and only one (ONE!) is an actual smiley. And I think this category is the most important (and also the most used by far), because people use symbols of emotions to add meaning to their text messages that cannot be easily expressed with words.
I loved the addition of "Face With Raised Eyebrow", "Exploding Head" and "Face With Monocle" in Emoji 10.0 because they add value to texting!

Conclusion

So what do I mean when I say "future Emoji selection should be more restrictive"?

1) There should be a large push on actual smileys.
2) The "Selection Factors for Exclusion" should be taken a lot more seriously, especially for overly specific submissions. They are pretty comprehensive but apparently just poorly enforced.
3) Very specific submissions should encourage the addition of broader symbols that would still include the initial submission.
4) Additions that through their mere existence would exclude symbols that are not (currently) present should be discussed (e.g. cultural iconography).
5) Additions that are made for the sake of inclusion, which of course is generally a good thing, should especially be checked against the four statements above, because the mindless addition of inclusive emojis can lead to exclusion.

Final thoughts

I really love emoji, and I think it's wonderful that everyone at Unicode strives to make it more inclusive and progressive. But to me, it feels like we have symbolism for the sake of symbolism: on the one hand Emoji as a symbol of inclusion and progress in the world, on the other the idea that Emoji still have an actual symbolic meaning. Because to hear people say "It's so nice to finally see the introduction of 'Person in Steamy Room'" and then observe how they don't use it can't be a good direction for future Emoji releases.

Julian ??

From unicode at unicode.org Fri Aug 10 18:14:00 2018
From: unicode at unicode.org (Charlotte Buff via Unicode)
Date: Sat, 11 Aug 2018 01:14:00 +0200
Subject: Thoughts on Emoji Selection Process
Message-ID:

> Extreme Specificity
> For instance, one thing that struck me as odd in previous releases was the tendency to extreme specificity.
> I always thought of Emoji as symbols and not as concrete images.

Unicode is chock-full of useless, redundant emoji that nobody ever types, all of which were originally justified because of "high usage expectations". We're now stuck with such timeless emoji classics as Water Polo, Raised Back of Hand, Place of Worship, Petri Dish, and Mother Christmas while actual, substantial problems in the emoji standard have remained unaddressed for several years, because the ESC is absolutely bloody clueless about literally everything they do.

The trouble is not that the rules for emoji submissions are fundamentally flawed; the trouble is that said rules are completely ignored whenever the ESC feels like it. A squirrel is too similar to a chipmunk, but a softball must be disunified from a baseball. A donkey is too similar to a horse, but we really needed that lab coat emoji because the regular coat that was added just one year prior just doesn't cut it. There is no system, and I highly suspect that there never was one in the first place.

From unicode at unicode.org Fri Aug 10 20:25:46 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Fri, 10 Aug 2018 17:25:46 -0800
Subject: Thoughts on Emoji Selection Process
In-Reply-To:
References:
Message-ID:

Charlotte Buff wrote,

> A squirrel is too similar to a chipmunk, but a
> softball must be disunified from a baseball.

Let's not be too harsh on the ESC. The set of in-line pictures which some might use to adorn text is open-ended. The ESC has to deal with the really tough questions every day. Like, is there a semantic difference between PICTURE OF SQUIRREL vs. PICTURE OF CHIPMUNK or BASEBALL vs. SOFTBALL or AUSTALOPITHECINE RIDING THREE-LEGGED HORSE vs. AUSTALOPITHECINE WITH MOHAWK RIDING THREE-LEGGED HORSE, and so forth. I wouldn't be able to make such difficult decisions without flipping a coin, so I'd doff my hat to the ESC if I wore one.
From unicode at unicode.org Fri Aug 10 22:12:31 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Fri, 10 Aug 2018 19:12:31 -0800
Subject: Thoughts on Emoji Selection Process
In-Reply-To:
References:
Message-ID:

AUSTRALOPITHECINE, it was the all-caps that threw me.

From unicode at unicode.org Sat Aug 11 06:58:02 2018
From: unicode at unicode.org (Charlotte Buff via Unicode)
Date: Sat, 11 Aug 2018 13:58:02 +0200
Subject: Thoughts on Emoji Selection Process
Message-ID:

[James Kass wrote:]
> Let's not be too harsh on the ESC. The set of in-line pictures which some might use to adorn text is open-ended. The ESC has to deal with the really tough questions every day. Like, is there a semantic difference between PICTURE OF SQUIRREL vs. PICTURE OF CHIPMUNK or BASEBALL vs. SOFTBALL or AUSTALOPITHECINE RIDING THREE-LEGGED HORSE vs. AUSTALOPITHECINE WITH MOHAWK RIDING THREE-LEGGED HORSE, and so forth. I wouldn't be able to make such difficult decisions without flipping a coin, so I'd doff my hat to the ESC if I wore one.

There is no semantic difference between a softball and a baseball. They are literally the same object, just in slightly different sizes. There isn't a semantic difference between a squirrel and a chipmunk either (mainly because they don't represent anything beyond their own identities, just like the majority of modern emoji inventions), but at the very least they are *different things*. Not to mention that the softball was added, by the ESC's very own admission, for the sole and only purpose of "improving gender representation", and anyone who has heard of my name in the context of Unicode before can tell you what a massive hypocrisy that is. As I said, there is no system.
The ESC only approves emoji submissions if they personally like them, or to make themselves look vaguely more progressive and open-minded than they really are, but not *too* open-minded, you see, because then we would have to put actual, proper thought into the issues we're dealing with. Mark Davis hates me already for rightfully calling out his many shortcomings, so I might as well say it like it is and alienate the rest of the ESC as well. I have no doubt that many ESC members are competent enough for their job; the point is that, collectively, the ESC is not.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sat Aug 11 07:21:13 2018
From: unicode at unicode.org (Julian Bradfield via Unicode)
Date: Sat, 11 Aug 2018 13:21:13 +0100 (BST)
Subject: Thoughts on Emoji Selection Process
References:
Message-ID:

On 2018-08-11, Charlotte Buff via Unicode wrote:
> There is no semantic difference between a softball and a baseball. They are literally the same object, just in slightly different sizes. There isn't a semantic difference between a squirrel and a chipmunk either (mainly because they don't represent anything beyond their own identities just like the majority of modern emoji inventions), but at the very least they are *different things*.

I think you don't understand the meaning of "semantic", "literally", or "the same". Which is a pity, because I'm all in sympathy with your general attitude to emoji and Unicode. I'm not just being pedantic - I can't even work out what you're attempting to say in this paragraph.

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
From unicode at unicode.org Sat Aug 11 08:56:37 2018
From: unicode at unicode.org (Julian Wels via Unicode)
Date: Sat, 11 Aug 2018 15:56:37 +0200
Subject: Thoughts on Emoji Selection Process
In-Reply-To:
References:
Message-ID:

James Kass wrote,

> Like, is there a semantic difference between PICTURE OF SQUIRREL vs. PICTURE OF CHIPMUNK or BASEBALL vs. SOFTBALL or AUSTALOPITHECINE RIDING THREE-LEGGED HORSE vs. AUSTALOPITHECINE WITH MOHAWK RIDING THREE-LEGGED HORSE, and so forth.

Yeah, they should deal with those questions, but right now I imagine they would just add all of those. Just that the Australopithecine would have all gender, hair-style and ball-holding modifiers. And this is polluting a system where things that once were added can't be removed.

It's all so contradictory: on the one hand they encourage the use of sticker packs as a long-term solution, on the other they say they want to add 60 new emoji per year. Also, as I just recently discovered, Hair Components were added in 11.0, which will just lead to an absurd amount of complexity. And what is achieved with that? Which gap does this fill? Who will use such specific Emojis frequently (B. Expected usage level)? And if this is not F. Overly specific or close to an L. Exact Image then I don't know what is! I'm not really sure we can cut them any slack here...

Julian ???

On Sat, Aug 11, 2018 at 3:32 AM James Kass via Unicode wrote:
> Charlotte Buff wrote,
>
>> A squirrel is too similar to a chipmunk, but a
>> softball must be disunified from a baseball.
>
> Let's not be too harsh on the ESC. The set of in-line pictures which some might use to adorn text is open-ended. The ESC has to deal with the really tough questions every day. Like, is there a semantic difference between PICTURE OF SQUIRREL vs. PICTURE OF CHIPMUNK or BASEBALL vs. SOFTBALL or AUSTALOPITHECINE RIDING THREE-LEGGED HORSE vs. AUSTALOPITHECINE WITH MOHAWK RIDING THREE-LEGGED HORSE, and so forth.
> I wouldn't be able to make such difficult decisions without flipping a coin, so I'd doff my hat to the ESC if I wore one.

From unicode at unicode.org Sat Aug 11 12:36:34 2018
From: unicode at unicode.org (Charlotte Buff via Unicode)
Date: Sat, 11 Aug 2018 19:36:34 +0200
Subject: Thoughts on Emoji Selection Process
Message-ID:

> I think you don't understand the meaning of "semantic", "literally", or "the same". Which is a pity, because I'm all in sympathy with your general attitude to emoji and Unicode.
> I'm not just being pedantic - I can't even work out what you're attempting to say in this paragraph.

A softball is just a slightly bigger baseball. There is no other difference between them. We now have two emoji that mean exactly the same thing: a small ball made from cork or rubber, wrapped in leather with a stitched seam, that is hit with a bat and caught with a glove. And if they had some metaphorical meaning (which I don't think they do), they would also both represent the same concepts, because they are simply minute variations of the same object.

Chipmunks and squirrels are clearly different species, but pretty much all characteristics they have that would be relevant to the average emoji user are identical. They are small, furry rodents living in forests that eat and bury nuts. Any meaning you could assign to a pictograph of a chipmunk in a textual conversation is also shared with squirrels and vice versa.

From unicode at unicode.org Sat Aug 11 16:58:36 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sat, 11 Aug 2018 13:58:36 -0800
Subject: Thoughts on Emoji Selection Process
In-Reply-To:
References:
Message-ID:

Charlotte Buff wrote,

> Mark Davis hates me already for rightfully calling
> out his many shortcomings, so I might as well say it
> like it is and alienate the rest of the ESC as well.

Nobody's perfect. We all have our strengths and weaknesses; it's part of the human condition. Although alienating people can bring considerable short-term satisfaction, in the long run building bridges trumps building walls.

Conventional character encoding concerns may well be of secondary importance with respect to emoji. The driving force may have more to do with sales and marketing. In this regard, emoji are "special". Hence, if we approach emoji encoding issues in the traditional manner, ESC decisions might appear baffling or unreasonable. But if we broaden our horizons and allow that sales and marketing concerns are a factor, we might gain a little clarity and a better understanding.

Just sayin'. ?

From unicode at unicode.org Sat Aug 11 20:45:03 2018
From: unicode at unicode.org (Julian Wels via Unicode)
Date: Sun, 12 Aug 2018 03:45:03 +0200
Subject: Thoughts on Emoji Selection Process
In-Reply-To:
References:
Message-ID:

I followed up on the name Charlotte Buff in association with Unicode and found many documents already describing what I said in my original mail: multiple times in the document registry, together with tons of other helpful suggestions on how to make Emoji better. However, none of these suggestions have apparently ever been taken seriously enough to cause changes. So I can understand Charlotte Buff's anger for the most part.

James Kass wrote:
> The driving force may have more to do with sales and marketing. In this regard, emoji are "special". Hence, if we approach emoji encoding issues in the traditional manner, ESC decisions might appear baffling or unreasonable. But if we broaden our horizons and allow that sales and marketing concerns are a factor, we might gain a little clarity and a better understanding.
I'm not approaching Emoji in the same manner as other character sets in Unicode, but they are still part of an industry-wide encoding standard that should not be misused for marketing gags and should still be handled like a standard, with certain norms and criteria.

Also, there was so much useless stuff added to the Emoji set that it just cannot be explained by "sales and marketing" alone. Charlotte Buff, for example, made an excellent case against the addition of colored squares and circles in 12.0. There also was a suggestion on how to do gender right in emoji, which I think would have been an easy and smart solution without any compromise regarding marketing. I really wonder if no one in the Emoji Subcommittee has these exact thoughts, because this is not just about correct representation; it's about maintaining an encoding standard in a more or less future-proof way! So maybe Emoji encoding should be approached more traditionally given where we are right now.

And I ask you all honestly: is there no solution in sight, other than being ignored when submitting to the document registry?

Julian ??

On Sun, Aug 12, 2018 at 12:06 AM James Kass via Unicode wrote:
> Charlotte Buff wrote,
>
>> Mark Davis hates me already for rightfully calling
>> out his many shortcomings, so I might as well say it
>> like it is and alienate the rest of the ESC as well.
>
> Nobody's perfect. We all have our strengths and weaknesses; it's part of the human condition. Although alienating people can bring considerable short-term satisfaction, in the long run building bridges trumps building walls.
>
> Conventional character encoding concerns may well be of secondary importance with respect to emoji. The driving force may have more to do with sales and marketing. In this regard, emoji are "special". Hence, if we approach emoji encoding issues in the traditional manner, ESC decisions might appear baffling or unreasonable.
> But if we broaden our horizons and allow that sales and marketing concerns are a factor, we might gain a little clarity and a better understanding.
>
> Just sayin'. ?

From unicode at unicode.org Sun Aug 12 02:27:46 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sat, 11 Aug 2018 23:27:46 -0800
Subject: Thoughts on Emoji Selection Process
In-Reply-To:
References:
Message-ID:

Julian Wels wrote,

> Also, there was so much useless stuff added to the
> Emoji-Set, that just cannot be explained by "sales
> and marketing" alone. Charlotte Buff, for example,
> made an excellent case against the addition of
> colored squares and circles in 12.0.

Sales & Marketing can explain *anything*. That's what they *do*. Some marketing hotshot comes up with a bunch of cool ideas and they try some of them out. If any of them catch on like wildfire, swell! But if one of them fails, it's not because it was a bum idea in the first place, it's due to market trends.

Taking a couple of Charlotte Buff's generic concerns from earlier in this thread while keeping Sales & Marketing in mind,

>> The ESC only approves emoji submissions if they
>> personally like them, ...

Naturally. How marketable is something one doesn't like?

>> ... or to make themselves look vaguely more
>> progressive and open-minded than they really are ...

Of course. In the advertising world, image is everything.

> ... So maybe Emoji encoding should be approached more
> traditionally given where we are right now.

Unicode's traditional approach has been to encode what is or what was rather than what might be. But the emoji are an evolving set, so they don't fall within that tradition. Any requests for clarification of the evolving set of encoding practices at any stage of the evolution seem like reasonable requests. It's unfortunate if such requests go unanswered.
> And I ask you all honestly: Is there no solution
> in sight, other than being ignored when submitting
> to the document registry?

Committees are somewhat political in nature. There are two proven ways to curry favor with a politician. One is to become a lobbyist, which means finding out what the subject wants and providing it. Everybody likes cash, but we don't call it a "bribe", we call it a "consulting fee". The other way is to become a toady. Since neither of those vocations seems suitable for any of us participating in this thread, perhaps it's time to mend some fences and/or build some bridges.

From unicode at unicode.org Sun Aug 12 06:30:29 2018
From: unicode at unicode.org (Charlotte Buff via Unicode)
Date: Sun, 12 Aug 2018 13:30:29 +0200
Subject: Thoughts on Emoji Selection Process
Message-ID:

[James Kass wrote:]
> Naturally. How marketable is something one doesn't like?

That is the issue. You are supposed to think that the emoji submission process is bureaucratic in nature, when in reality it all hinges on the personal preferences of a handful of unaccountable, largely unknown people. Everything you hear about emoji proposals, from the UTC's own instructions and guidelines on the Unicode homepage to lazy clickbait articles written by shady (and sadly increasingly also not-so-shady) news outlets, is meant to make you believe that "everyone's voice is equal" and that all you need to do to get something you care about implemented in emoji is to write a document addressing a few enumerated issues and mail it to the Consortium, but that is not how it works. The UTC will gladly ignore heaps and heaps of evidence and statistics and whatnot if they feel indifferent towards a proposed emoji, while simultaneously fast-tracking their own pet ideas into the standard without any sort of documentation *cough* Ice Cube *cough*. If you're gonna be evil, at least have the guts to be open about it.
Nobody is forcing you to pretend that there are official procedures still in place.

From unicode at unicode.org Sun Aug 12 06:51:43 2018
From: unicode at unicode.org (Charlotte Buff via Unicode)
Date: Sun, 12 Aug 2018 13:51:43 +0200
Subject: Thoughts on Emoji Selection Process
In-Reply-To:
References:
Message-ID:

[James Kass wrote:]
> Nobody's perfect. We all have our strengths and weaknesses; it's part of the human condition. Although alienating people can bring considerable short-term satisfaction, in the long run building bridges trumps building walls.

I would be inclined to agree with you, if it weren't for the fact that I have been dealing with the ESC for two years now. I used to be nice and diplomatic, back when I was still convinced that these people were genuinely interested in developing a decent product. Back when I still thought that they were actually trying to do good, but just didn't quite know how. Do you want to know what "building bridges" achieved? Bloody nothing. They ignored literally every single word I had written and marched onward regardless.

I am sick of sugarcoating their flaws. They mess up again and again and again, and they refuse to mend or even acknowledge their mistakes. If they can't deal with criticism straight to their faces then they shouldn't be in these positions. People like Andrew West, Michael Everson, Christoph Päper, Eduardo Marín Silva, and even myself sacrifice their time to develop and document detailed solutions to many problems the ESC has created, but they simply don't care. They are too busy churning out these stupid pictographs year after year because that's what gives them publicity. Who cares that 80% of the emoji standard is horribly broken? What could the Emoji Subcommittee possibly do about that?

2018-08-11 23:58 GMT+02:00 James Kass :
> Charlotte Buff wrote,
>
>> Mark Davis hates me already for rightfully calling
out his many shortcomings, so I might as well say it > ? like it is and alienate the rest of the ESC as well. > > Nobody's perfect. We all have our strengths and weaknesses; it's part > of the human condition. Although alienating people can bring > considerable short-term satisfaction, in the long run building bridges > trumps building walls. > > Conventional character encoding concerns may well be of secondary > importance with respect to emoji. The driving force may have more to > do with sales and marketing. In this regard, emoji are "special". > Hence, if we approach emoji encoding issues in the traditional manner, > ESC decisions might appear baffling or unreasonable. But if we > broaden our horizons and allow that sales and marketing concerns are a > factor, we might gain a little clarity and a better understanding. > > Just sayin'. ? > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Aug 12 17:03:27 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 12 Aug 2018 14:03:27 -0800 Subject: Thoughts on Emoji Selection Process In-Reply-To: References: Message-ID: Charlotte Buff wrote, > Do you want to know what "building bridges" > achieved? Bloody nothing. They ignored literally > every single word I had written and marched > onward regardless. So much for rhetoric, eh? Sorry if I've underestimated the scope of the dilemma. It's best to understand both sides of an issue. When one faction posts criticisms and questions, and the other side fails to respond, it leaves everyone with a one-sided viewpoint. As speculation, the ESC members probably have other responsibilities besides grinding out pictographs. (Day jobs, real world, etc.) Emoji popularity combined with click-bait news articles urging every yob in town to submit documents suggests that the ESC may simply be overwhelmed with such documents, some of which were probably written in crayon. 
As the emoji character set evolves, so do procedures. It's possible that people are simply scrambling around trying to do too much at once. I am the most unlikely apologist for the ESC imaginable; I'm just trying to be fair. Alienating the very people who are the only ones competent to respond to your questions and concerns won't get questions answered or concerns addressed. Hoping someone with the answers will respond to this thread, but not holding my breath while waiting. In this particular case, a lack of response might be more informative than an actual one. From unicode at unicode.org Mon Aug 13 06:39:50 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 13 Aug 2018 03:39:50 -0800 Subject: Thoughts on Emoji Selection Process In-Reply-To: References: Message-ID: Charlotte Buff wrote, > ... I have been dealing with the ESC for two > years now. Two years passes in the blink of an eye. Elsewhere you mention several names including Andrew West and Michael Everson. Both of them have been working with, against, or around various committees and members for about two decades now. Infinite patience is essential; if one doesn't have it, it has to be feigned. > I used to be nice ... That may have been a tactical error. This is the 21st century and one has to be rude just to get noticed. Besides, once people find out you are a nice person, they have a tendency to step all over you. > ... Back when I still thought that they were > actually trying to do good, but just didn't > quite know how. Most people don't perceive themselves as villains. > I am sick of sugarcoating their flaws. They probably didn't like it anyway. Sugarcoating flaws calls attention to them and attracts flies. It's been said that a friend is someone who likes us in spite of our many faults. The friendly thing to do would be to overlook flaws, focus on strengths, and find some kind of common ground. (If any.) 
> If they can't deal with criticism straight to their > faces then they shouldn't be in these positions. Agreed, as long as it's constructive criticism, tolerably polite, offering viable alternatives/solutions, and provides for them to "save face" (because where image is everything, looking good is considered important). Some people deal with criticism by shunning the critic, preventing any recurrence. Some Sales & Marketing people can be *so* hypersensitive. > Who cares that 80% of the emoji standard is > horribly broken? What could the Emoji Subcommittee > possibly do about that? Well, they could break the other 20%. Heh heh. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 13 02:44:02 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Mon, 13 Aug 2018 08:44:02 +0100 (BST) Subject: (offline humour) Re: Thoughts on Emoji Selection Process In-Reply-To: References: Message-ID: <8616567.3155.1534146242186.JavaMail.defaultUser@defaultHost> James Kass wrote: > ... the ESC may simply be overwhelmed with such documents, some of which were probably written in crayon. Maybe a document written in crayon is so that the author can wax lyrical: is it fair to chalk the author off? :-) https://en.oxforddictionaries.com/definition/crayon William Overington Monday 13 August 2018 From unicode at unicode.org Tue Aug 14 06:16:35 2018 From: unicode at unicode.org (Andre Schappo via Unicode) Date: Tue, 14 Aug 2018 11:16:35 +0000 Subject: Thoughts on Emoji Selection Process In-Reply-To: References: Message-ID: On 10 Aug 2018, at 21:33, Julian Wels via Unicode > wrote: Cultural Iconography Another thing that is worrisome is the proposed addition of a traditional Indian piece of clothing in 12.0. This is extremely specific to one culture, and I'm not sure if we want to open the gate for: "Which culture is included in Unicode and which is not?". Maybe we want that! Maybe we don't. 
But I think there should at least be a discussion about additions that carry such consequences. I know that there are tons of Chinese symbols in there already, but even the selection-factors on the Unicode-website state that the fact that there is a lot of stuff in there from former versions should not be a basis of justification for future additions. For instance, the Tokyo Tower-Emoji does not justify the Eiffel Tower-Emoji. [link] Unicode is an essential building block for software internationalisation. I consider including cultural icon emoji in Unicode to be an essential part of internationalisation. The more cultures that are included the better. Actually I think a specific aim of ESC could be, in the long term, to encompass all cultures. ESC could encourage cultural icon emoji submissions. André Schappo -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 14 09:55:08 2018 From: unicode at unicode.org (Julian Wels via Unicode) Date: Tue, 14 Aug 2018 16:55:08 +0200 Subject: Thoughts on Emoji Selection Process In-Reply-To: References: Message-ID: I mean I'd love to have this discussion, and maybe you could even turn me around to your side of this argument if the current way of Emoji development wouldn't be such a hot mess. My initial mail essentially said: "Stop adding random stuff, until you find a way to streamline your process." (in an abstract sense). So I'm just generally against uncontrolled development. If the ESC were to say: "We committed ourselves to add around 10 Emojis each year, representing another culture." then this would be amazing. But it appears that right now they'd say: "We committed ourselves to embrace different cultures with Emoji.", then add 50 western Emoji, 36 Indian, and 12 African and then never speak of it again. So for me, it's all about controlled development that leaves us with a clean and well-organized set of Emojis. 
Right now it shows that we can't have that due to a lack of communication and ambition on the part of the ESC. Julian On Tue, Aug 14, 2018 at 1:26 PM Andre Schappo via Unicode < unicode at unicode.org> wrote: > > > On 10 Aug 2018, at 21:33, Julian Wels via Unicode > wrote: > > Cultural Iconography > Another thing that is worrisome is the proposed addition of a > traditional Indian piece of clothing in 12.0. This is extremely specific to > one culture, and I'm not sure if we want to open the gate for: "Which > culture is included in Unicode and which is not?". Maybe we want that! > Maybe we don't. But I think there should at least be a discussion about > additions that carry such consequences. > I know that there are tons of Chinese symbols in there already, but even > the selection-factors on the Unicode-website state that the fact that there > is a lot of stuff in there from former versions should not be a basis of > justification for future additions. For instance, the Tokyo Tower-Emoji > does not justify the Eiffel Tower-Emoji. [link] > > > > Unicode is an essential building block for software internationalisation. > I consider including cultural icon emoji in Unicode to be an essential part > of internationalisation. The more cultures that are included the better. > Actually I think a specific aim of ESC could be, in the long term, to > encompass all cultures. ESC could encourage cultural icon emoji submissions. > > André Schappo > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Aug 15 03:32:41 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 15 Aug 2018 00:32:41 -0800 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) Message-ID: Suppose there's someone who has been working with the ESC for a while and whose frustration level has passed the boiling point. Let's call this person "X". 
X has become so angry that X is distilling recent experiences into an exposé article for submission to the media. The media outlet, if responsible journalists, would fact-check the article. Would the fact-checking find proof, or would it be determined that it is simply a, uh, dissing contest between two or more personalities? (If the latter, one of the tabloids might buy the article. They just *love* dissing contests.) The original thread includes some sweeping allegations concerning competence and integrity, but offers no specific examples. Even though many people do it daily, it's best not to make judgments without evidence. A list member kindly sent me links to a pair of documents. L2/17-147 http://www.unicode.org/L2/L2017/17147-emoji-subcommittee.pdf L2/17-192 http://www.unicode.org/L2/L2017/17192-response-cmts.pdf The first one, L2/17-147 (by West, Buff, and Päper), is a request for more ESC transparency. It raises a couple of legitimate concerns: (1) requests complete public documentation of all incoming submissions, and (2) requests a public roster of ESC members. The requests seem reasonable. The second one, L2/17-192 (by Davis and Edberg), rejects the first one. A superficial analysis might persuade someone that the ESC does things the way they like to do things and are going to continue to do things, so neener-neener. But if we examine the reasoning behind L2/17-192, it does make some sense. For (1), there are too many submissions and the vast majority of them are D.O.A. Why spend resources documenting non-starters? L2/17-192 goes on to explain the way viable submissions become public. For (2), which the ESC rejected first, the underlying reasoning is clearly stated. The roster is in a perpetual state of flux, there is no fixed membership, there is no membership list. Putting aside any obvious advantages anonymity offers over accountability, any committee with a constantly shifting membership is unstable by definition. 
Why would any committee want to make its instability a matter of public record? Putting aside any snide humor, it does appear that the ESC responds to requests/suggestions and is willing to work with submitters. (Based on one example, at least.) On one hand, there's a group who is interested in exploiting the emoji ranges to advance corporate commercial concerns. On another hand, there are emoji enthusiasts who want the sterling reputation of excellence Unicode has earned to continue far into the future. There's got to be some common ground here. Why not shake those hands, find that common ground, and explore it together? And have some fun while doing it. Aren't the emoji supposed to be fun? From unicode at unicode.org Fri Aug 17 04:35:54 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Fri, 17 Aug 2018 10:35:54 +0100 (BST) Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> Message-ID: <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> May I mention please a situation that may be of interest as indicative of some of the issues with the present system. In the discussion after the end of the lecture "Unicode Emoji: How do we standardize that je ne sais quoi?" at the Internationalization & Unicode Conference 39 in October 2015, a gentleman in the audience raised the possibility of emoji for 'I' and for 'You'. https://www.youtube.com/watch?v=9ldSVbXbjl4 I was not there but I have viewed the video several times. I decided that trying to design emoji for 'I' and for 'You' seemed interesting so I decided to have a go at designing some. 
However pictures of people with arrows seemed to be ambiguous in meaning and also they seemed to need to be too detailed for rendering in mobile telephone messages and in many situations in web pages and emails generally. So eventually I decided that abstract designs would be a good solution to the problem. I designed abstract emoji in two colours, yet such that a monochrome fallback display would still be recognisable. There is a web page in my webspace that has more information about this, including the designs, including links back to several posts in the archives of this mailing list, some by me, some by other people. http://www.users.globalnet.co.uk/~ngo/abstract_emoji.htm At one stage I sent a copy of the PDF (Portable Document Format) document to 'docsubmit' and was informed that it was not in the correct format for a submission. A problem that I find with 'docsubmit' is that there is no indication of who is running it; replies are just from 'docsubmit' with no name of a human being. I had sent the document with the idea that if it were included in the Document Register then discussion could take place that could lead to progress. So I considered trying to prepare a document compliant with the submission rules, for emoji for 'I' and for 'You' as mentioned in the discussion period in the conference session, together with perhaps a few other personal pronouns as well. I have not produced such a document yet for a number of reasons. It takes time to produce such a document. I am happy to spend that time producing a document but I am somewhat deterred by the possibility that the document might just be discarded and never get anywhere with no explanation. It is not clear whether abstract emoji would be accepted. If I remember correctly at one stage the ESC (Emoji Subcommittee) opined that abstract emoji were possible, though possibly the UTC (Unicode Technical Committee) was of the opposite opinion, and there seems to be no clarity on the matter at present. 
All proposals for new emoji seem to now require those blue and red charts from Google Trends. I have never understood why these are needed and what they are supposed to prove. That being for any emoji proposal. When it comes to a proposal for emoji for 'I' and for 'You', I cannot decide what Google Trends chart would be of any relevance to support an emoji proposal. I opine that emoji for 'I' and for 'You' are a good idea. I am happy to try to submit a proposal if there is a reasonable prospect of success, but the requirement for Google Trends pictures seems to be a block to me being able to do that at present as I just do not understand it all. If anyone out there would like to help get emoji for 'I' and for 'You' and maybe a few other personal pronouns encoded, whether using my designs or other designs, whether abstract designs or not abstract designs, whether with me being involved or otherwise, then that would be welcome. William Overington Thursday 16 August 2018 From unicode at unicode.org Sat Aug 18 03:07:17 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 18 Aug 2018 00:07:17 -0800 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: William Overington wrote, > All proposals for new emoji seem to now require > those blue and red charts from Google Trends. > > I have never understood why these are needed and > what they are supposed to prove. If an emoji being proposed represents a concept which is popular, its potential popularity *as an emoji* can perhaps be estimated by seeing how many people are making web searches for the concept. 
The red and blue chart should compare the proposed emoji to a "reference emoji", but I don't really understand why, either. No doubt there's some kind of reasoning behind this, but to me it's like comparing apples to oranges. > When it comes to a proposal for emoji for 'I' > and for 'You', I cannot decide what Google Trends > chart would be of any relevance to support an emoji > proposal. https://trends.google.com/trends/explore?q=personal%20pronoun%20emoji&geo=US Well, *those* keywords don't look very promising. > I decided that trying to design emoji for 'I' and > for 'You' seemed interesting so I decided to have > a go at designing some. > > However pictures of people with arrows seemed to > be ambiguous in meaning ... > > So eventually I decided that abstract designs > would be a good solution to the problem. Hand gestures such as an overview of a finger pointing away (for "YOU") and a thumbs-up with the thumb pointing inward at about a hundred degree angle (for "I") might work. But since body language and hand gestures differ between cultures, those gestures might only be recognizable to westerners. For example, I've seen Japanese people refer to themselves with a hand gesture which is their own pointing finger touching their own nose. And the thumbs up gesture which means "everything is jake" or "everything is hunky-dory" in my locale means something vastly different south of the international border between the U.S. and Mexico. Even an emoji pair representing Narcissus gazing fondly upon reflection (with the upper figure emphasized for "I" and the lower for "YOU") might be western-centric. Or too subtle or cerebral. But an abstract design for those pronouns would remain abstract unless people *like* it and use it. > I am happy to spend that time producing a document > but I am somewhat deterred by the possibility that > the document might just be discarded and never get > anywhere with no explanation. 
Yet your initial submission was rejected with an explanation. > It is not clear whether abstract emoji would be > accepted. If I remember correctly at one stage > ESC (Emoji Subcommittee) opined that abstract > emoji were possible, ... ... but unlikely? Perhaps if a corporate sponsor designed a set of abstract emoji and got some agreement from some of the other corporate sponsors, *those* abstract emoji would be pushed towards the character encoding stage. But such abstractions proposed by a John Doe or Job Lowe strike me as unlikely candidates. Quoting from: http://www.unicode.org/emoji/proposals.html "... Simple words ("NEW") or abstract symbols (???) would not qualify as emoji." Also, in the section "Selection Factors for Exclusion", the part headed "L. Exact Images" would seem to rule out any such abstractions. From unicode at unicode.org Sat Aug 18 09:13:09 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Sat, 18 Aug 2018 15:13:09 +0100 (BST) Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: <13396503.16197.1534601589320.JavaMail.defaultUser@defaultHost> James Kass wrote: > Quoting from: > http://www.unicode.org/emoji/proposals.html > "... Simple words ("NEW") or abstract symbols (???) would not qualify as emoji." Well, that is quite clear. In order for abstract emoji to become encoded, that rule would need to be either removed, or made waivable in some instances at the discretion of the Unicode Technical Committee. > Also, in the section "Selection Factors for Exclusion", the part headed "L. Exact Images" would seem to rule out any such abstractions. Hmmm, maybe not. What is stated in the document is as follows. 
> Emoji are by their nature subject to variation in order to have consistent graphic designs for a full set. Precise images (such as from a specific visual meme) are not appropriate as emoji; images such as GIFs or PNGs should be used in such cases, instead of emoji characters. The designs that I have produced for abstract emoji of personal pronouns could be drawn, whilst each retaining enough of their shape information to still convey the intended meaning, in, say, the style of the Comic Sans font. So the designs that I produced are not necessarily subject to that ruling; yet I do need to add that the designs that I produced are somewhat constrained against as much variation as is possible for many emoji. Yet the designs that I produced have about as much flexibility as to glyph design as do letters of the English alphabet. > Once an emoji is released, it is typically used for a wide variety of items that have similar visual appearance. Well, if some people use the same code point for a variety of things then that is a matter for them! One can only do so much in trying to convey meaning without distortion of meaning. Referring to the designs in the following document, http://www.users.globalnet.co.uk/~ngo/Some_designs_for_emoji_of_personal_pronouns.pdf some readers may be interested to know how I arrived at the general structure of those designs. It all goes back to when I was new to learning French. The present tense of the verb être ("to be") was set out something like in the text diagram below, though the underscore characters are added here by me so as to try to produce a fairly reasonable display in a text diagram that may become displayed in a variety of fonts. Hopefully the diagram will look good with a monospaced font. 
je suis ______ nous sommes
tu es ________ vous êtes
il est _______ ils sont
elle est _____ elles sont

So horizontally there is singular and plural, and vertically there is first person, second person and then third person on each of two rows for two genders. So my designs are based on that layout. One square for singular, two squares horizontally side by side for plural. The location of the square or squares is then upper left corner for first person, middle left not in any corner for second person, and lower left corner for third person. Then there are a few additional lines for various third person personal pronouns so as to distinguish male from female and from both genders together, together with a slightly anomalous location of a square at lower middle not in any corner for the personal pronoun 'one'. William Overington Saturday 18 August 2018 From unicode at unicode.org Sat Aug 18 20:55:42 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 18 Aug 2018 17:55:42 -0800 Subject: Tales from the Archives Message-ID: http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/0180.html Back in 2000, William Overington asked about ligation for Latin and mentioned something about preserving older texts digitally. John Cowan replied with some information about ZWJ/ZWNJ and I offered a link to a Unicode-based font, Junicode, which had (at that time) coverage for archaic letters already encoded, and which used the PUA for unencoded ligatures. At that time, OpenType support was primitive and not generally available. If I'm not mistaken, the word "ligation" for typographic ligature forming had not yet been coined. IIRC John Hudson borrowed the medical word some time after that particular Unicode e-mail thread. (One poster in that thread called it "ligaturing".) Peter Constable replied and explained clearly how ligation was expected to work for Latin in Unicode. John Cowan posted again and augmented the information which Peter Constable had provided. 
The information from Peter and John was instructional and helpful and furthered the education of at least one neophyte. Back then, display issues were on everyone's mind. Many questions about display issues were posted to this list. Unicode provided some novel methods of encoding complex scripts, such as for Indic, but those methods didn't actually work anywhere in the real world, so users stuck to the "ASCII-hack" fonts that actually did work. When questions about display issues and other technical aspects of Unicode were posted, experts from everywhere quickly responded with helpful pointers and explanations. Eighteen years pass, display issues have mostly gone away, nearly everything works "out-of-the-box", and list traffic has dropped dramatically. Today's questions are usually either highly technical or emoji-related. Recent threads related to emoji included some questions and issues which remain unanswered in spite of the fact that there are list members who know the answers. This gives the impression that the Unicode public list has become passé. That's almost as sad as looking down the archive posts, seeing the names of the posters, and remembering colleagues who no longer post. So I'm wondering what changed, but I don't expect an answer. 
From unicode at unicode.org Sun Aug 19 01:37:44 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 18 Aug 2018 22:37:44 -0800 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: <13396503.16197.1534601589320.JavaMail.defaultUser@defaultHost> References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> <13396503.16197.1534601589320.JavaMail.defaultUser@defaultHost> Message-ID: William Overington wrote, > The designs that I have produced for abstract emoji of > personal pronouns could be drawn, whilst each retaining > enough of their shape information to still convey the > intended meaning, in, say, the style of the Comic Sans > font. So the designs that I produced are not necessarily > subject to that ruling; yet I do need to add that the > designs that I produced are somewhat constrained against > as much variation as is possible for many emoji. Yet the > designs that I produced have about as much flexibility as > to glyph design as do letters of the English alphabet. Exactly, except for the part about 'not necessarily subject to that ruling'. Quoting from, https://www.thoughtco.com/what-is-alphabet-1689080 ... which is quoting from Mitchell Stephens, The Rise of the Image, the Fall of the Word. Oxford University Press, 1998 ... "In about 1500 B.C., the world's first alphabet appeared among the Semites in Canaan. It featured a limited number of abstract symbols (at one point thirty-two, later reduced to twenty-two) out of which most of the sounds of speech could be represented. The Old Testament was written in a version of this alphabet. ..." (Of course, nobody called it "The Old Testament" back then.) Do you consider alphabetic letters to be anything other than abstract symbols? 
You've devised a set of abstract symbols to depict personal pronouns based on typical verb conjugation diagrams. It's my opinion that such symbols aren't emoji candidates, but I am not an emoji expert. From unicode at unicode.org Sun Aug 19 04:20:56 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 19 Aug 2018 01:20:56 -0800 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> <13396503.16197.1534601589320.JavaMail.defaultUser@defaultHost> Message-ID: My apologies for my last post. I realize now that William Overington was referring to "exact images" rather than "abstract symbols" exclusions. My opinion stands, though, FWIW. From unicode at unicode.org Sun Aug 19 09:25:47 2018 From: unicode at unicode.org (Marius Spix via Unicode) Date: Sun, 19 Aug 2018 16:25:47 +0200 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: <20180819162547.39117490@spixxi> William Overington wrote: > > I decided that trying to design emoji for 'I' and for 'You' seemed > interesting so I decided to have a go at designing some. > > However pictures of people with arrows seemed to be ambiguous in > meaning and also they seemed to need to be too detailed for rendering > in mobile telephone messages and in many situations in web pages and > emails generally. So eventually I decided that abstract designs would > be a good solution to the problem. 
I also played with a similar idea, which requires a new GSUB LookupType, let's call it 9: reader-dependent substitution. The idea is that the reader of the text will see a different glyph depending on whether he/she is the author of the text. For example, if you use the codepoint for ME, all other readers see the glyph for YOU and vice versa. This is for example usable in instant messaging and social networking services. In the attachment you find some ideas for the following emoji:

IDEOGRAM FOR ME / IDEOGRAM FOR YOU
IDEOGRAM FOR TWO OF US / IDEOGRAM FOR YOU TWO
IDEOGRAM FOR WE ALL / IDEOGRAM FOR YOU ALL
IDEOGRAM FOR ME AND ANOTHER PERSON / IDEOGRAM FOR YOU AND ANOTHER PERSON
IDEOGRAM FOR ME AND MULTIPLE OTHER PERSONS / IDEOGRAM FOR YOU AND MULTIPLE OTHER PERSONS
IDEOGRAM FOR YOU AND ME (the counterpart has no own codepoint, but is mirrored, as you may arrange other emoji to the left or right)

The following emoji may look equal independent of the reader:

IDEOGRAM FOR ANOTHER PERSON
IDEOGRAM FOR TWO OTHER PERSONS
IDEOGRAM FOR MULTIPLE OTHER PERSONS

The rendering engine requires a flag indicating whether the user is the author or not. I think it would be possible to implement. What about this idea? Regards, Marius Spix -------------- next part -------------- A non-text attachment was scrubbed... Name: youme.png Type: image/png Size: 3035 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 488 bytes Desc: Digital signature from OpenPGP URL: From unicode at unicode.org Sun Aug 19 10:01:28 2018 From: unicode at unicode.org (Alan Wood via Unicode) Date: Sun, 19 Aug 2018 15:01:28 +0000 (UTC) Subject: Tales from the Archives In-Reply-To: References: Message-ID: <2096117.2386243.1534690888302@mail.yahoo.com> James I think you have answered your own question: nearly everything works "out-of-the-box". Unicode is just there, and most computer users have probably never heard of it. 
I routinely produce web pages with English, French, Russian and Chinese text and a few symbols, and don't even think about whether other people can see everything displayed properly. Long ago, the response to the question "Why can't I see character x" was often to install a copy of the Code2000 font and send the fee ($10?) to James Kass by airmail. These days, Windows 10 can display all of the major living languages (and I expect Macs can too, but I can't afford one now that I have retired). Some of the frequent posters have probably passed away, while others (like me) have got older, and slowed down and/or developed new interests. Best regards Alan Wood http://www.alanwood.net (Unicode, special characters, pesticide names) On Sunday, 19 August 2018, 03:05:41 GMT+1, James Kass via Unicode wrote: http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/0180.html Back in 2000, William Overington asked about ligation for Latin and mentioned something about preserving older texts digitally. John Cowan replied with some information about ZWJ/ZWNJ and I offered a link to a Unicode-based font, Junicode, which had (at that time) coverage for archaic letters already encoded, and which used the PUA for unencoded ligatures. At that time, OpenType support was primitive and not generally available. If I'm not mistaken, the word "ligation" for typographic ligature forming had not yet been coined. IIRC John Hudson borrowed the medical word some time after that particular Unicode e-mail thread. (One poster in that thread called it "ligaturing".) Peter Constable replied and explained clearly how ligation was expected to work for Latin in Unicode. John Cowan posted again and augmented the information which Peter Constable had provided. The information from Peter and John was instructional and helpful and furthered the education of at least one neophyte. Back then, display issues were on everyone's mind. Many questions about display issues were posted to this list.
Unicode provided some novel methods of encoding complex scripts, such as for Indic, but those methods didn't actually work anywhere in the real world, so users stuck to the "ASCII-hack" fonts that actually did work. When questions about display issues and other technical aspects of Unicode were posted, experts from everywhere quickly responded with helpful pointers and explanations. Eighteen years pass, display issues have mostly gone away, nearly everything works "out-of-the-box", and list traffic has dropped dramatically. Today's questions are usually either highly technical or emoji-related. Recent threads related to emoji included some questions and issues which remain unanswered in spite of the fact that there are list members who know the answers. This gives the impression that the Unicode public list has become passé. That's almost as sad as looking down the archive posts, seeing the names of the posters, and remembering colleagues who no longer post. So I'm wondering what changed, but I don't expect an answer. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Aug 19 14:03:19 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sun, 19 Aug 2018 21:03:19 +0200 Subject: Tales from the Archives In-Reply-To: <2096117.2386243.1534690888302@mail.yahoo.com> References: <2096117.2386243.1534690888302@mail.yahoo.com> Message-ID: You and Alan both raise good issues and make good points. I'd mention a couple of others. When we started Unicode, there were not a lot of alternatives to a general-purpose discussion email list for internationalization, but now there are many. Often the technical discussions are moved to more specific forums. There are interesting discussions on the identification of Unicode spoofing (because of look-alikes) on a variety of forums dealing with security, for example.
I suspect many of the font rendering issues have widespread solutions now (as Alan notes) and that discussions of remaining issues have shifted to forums on OpenType. There are some very intense discussions of Mongolian model issues, but those also tend to be handled in different venues. Work on ICU / CLDR also tends to take place in many cases in the comments on particular tickets, rather than in email lists. The work of the consortium has also broadened significantly beyond encoding and issues closely related to encoding. Here's a slide to illustrate that. (The first 24 slides in the deck are to give people some context and perspective on what the Unicode Consortium does before focusing on a narrower issue.) https://docs.google.com/presentation/d/1QAyfwAn_0SZJ1yd0WiQgoJdG7djzDiq2Isb254ymDZc/edit#slide=id.g38b1fcd632_0_166 Mark On Sun, Aug 19, 2018 at 5:06 PM Alan Wood via Unicode wrote: > James > > I think you have answered your own question: nearly everything works > "out-of-the-box". > > Unicode is just there, and most computer users have probably never heard > of it. I routinely produce web pages with English, French, Russian and > Chinese text and a few symbols, and don't even think whether other people > can see everything displayed properly. > > Long ago, the response to the question "Why can't I see character x" was > often to install a copy of the Code2000 font and send the fee ($10?) to > James Kass by airmail. > > These days, Windows 10 can display all of the major living languages (and > I expect Macs can too, but I can't afford one now that I have retired). > > Some of the frequent posters have probably passed away, while others (like > me) have got older, and slowed down and/or developed new interests.
> > Best regards > > Alan Wood > http://www.alanwood.net (Unicode, special characters, pesticide names) > > > On Sunday, 19 August 2018, 03:05:41 GMT+1, James Kass via Unicode < > unicode at unicode.org> wrote: > > > http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/0180.html > > Back in 2000, William Overington asked about ligation for Latin and > mentioned something about preserving older texts digitally. John > Cowan replied with some information about ZWJ/ZWNJ and I offered a > link to a Unicode-based font, Junicode, which had (at that time) > coverage for archaic letters already encoded, and which used the PUA > for unencoded ligatures. > > At that time, OpenType support was primitive and not generally > available. If I'm not mistaken, the word "ligation" for typographic > ligature forming had not yet been coined. IIRC John Hudson borrowed > the medical word some time after that particular Unicode e-mail > thread. (One poster in that thread called it "ligaturing".) > > Peter Constable replied and explained clearly how ligation was > expected to work for Latin in Unicode. John Cowan posted again and > augmented the information which Peter Constable had provided. The > information from Peter and John was instructional and helpful and > furthered the education of at least one neophyte. > > Back then, display issues were on everyone's mind. Many questions > about display issues were posted to this list. Unicode provided some > novel methods of encoding complex scripts, such as for Indic, but > those methods didn't actually work anywhere in the real world, so > users stuck to the "ASCII-hack" fonts that actually did work. > > When questions about display issues and other technical aspects of > Unicode were posted, experts from everywhere quickly responded with > helpful pointers and explanations. > > Eighteen years pass, display issues have mostly gone away, nearly > everything works "out-of-the-box", and list traffic has dropped > dramatically. 
Today's questions are usually either highly technical > or emoji-related. > > Recent threads related to emoji included some questions and issues > which remain unanswered in spite of the fact that there are list > members who know the answers. > > This gives the impression that the Unicode public list has become > passé. That's almost as sad as looking down the archive posts, seeing > the names of the posters, and remembering colleagues who no longer > post. > > So I'm wondering what changed, but I don't expect an answer. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Aug 19 17:41:55 2018 From: unicode at unicode.org (Leo Broukhis via Unicode) Date: Sun, 19 Aug 2018 15:41:55 -0700 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: On Fri, Aug 17, 2018 at 2:35 AM, William_J_G Overington via Unicode < unicode at unicode.org> wrote: > > I decided that trying to design emoji for 'I' and for 'You' seemed > interesting so I decided to have a go at designing some. > Why don't we just encode Blissymbolics, where pronouns are already expressible as abstract symbols, and emojify them? Leo -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 20 08:08:46 2018 From: unicode at unicode.org (Steffen Nurpmeso via Unicode) Date: Mon, 20 Aug 2018 15:08:46 +0200 Subject: Tales from the Archives In-Reply-To: References: Message-ID: <20180820130846.h7mij%steffen@sdaoden.eu> James Kass via Unicode wrote in : ...
|Eighteen years pass, display issues have mostly gone away, nearly |everything works "out-of-the-box", and list traffic has dropped |dramatically. Today's questions are usually either highly technical |or emoji-related. | |Recent threads related to emoji included some questions and issues |which remain unanswered in spite of the fact that there are list |members who know the answers. | |This gives the impression that the Unicode public list has become |passé. That's almost as sad as looking down the archive posts, seeing |the names of the posters, and remembering colleagues who no longer |post. | |So I'm wondering what changed, but I don't expect an answer. I have the impression that many things which have been posted here some years ago are now only available via some Forums or other browser based services. What is posted here seems to be mostly a duplicate of the blog only. (And the website has its pitfalls too, for example [1] is linked from [2], but does not exist.) [1] http://www.unicode.org/resources/readinglist.html [2] http://www.unicode.org/publications/ --steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt) From unicode at unicode.org Mon Aug 20 09:09:21 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 20 Aug 2018 06:09:21 -0800 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: Leo Broukhis responded to William Overington: >> I decided that trying to design emoji for 'I' and for 'You' seemed >> interesting so I decided to have a go at designing some.
> > Why don't we just encode Blissymbolics, where pronouns are already > expressible as abstract symbols, and emojify them? Emoji enthusiasts seeking to devise a universal pictographic set might be well-advised to build from existing work such as Blissymbolics. I think William Overington's designs are clever, though. Anyone who has ever studied a foreign language (or even their own language) would easily and quickly recognize the intended meanings of the symbols once they understand the derivation. From unicode at unicode.org Mon Aug 20 09:20:59 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 20 Aug 2018 07:20:59 -0700 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 20 09:30:12 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 20 Aug 2018 06:30:12 -0800 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: There are enthusiasts who want to add many cool emoji to the set and who may be frustrated by the process and new character limits. There are other enthusiasts who apparently want to add even more emoji with the idea of producing some kind of universal pictographic system. They'd likely need personal pronouns for something like that and are probably even more frustrated. Then there's the corporate interests who also want to add more cool emoji, as long as they are cool enough, and within limits. 
There's some common ground there, but it's easy to understand that the enthusiasts are stymied by the pace. With a limit of sixty new emoji per year, it would take quite a while before the regular enthusiasts are satisfied and it would take decades to encode any kind of universal pictographic system. What the enthusiasts need is a large block of characters in which to experiment. A place where proposed and pending (or rejected) emoji could be sorted, stored, mapped, documented, and published without any lengthy delays. A range from which such emoji could be transmitted to other enthusiasts as computer plain text, and by prior agreement the recipient could display the emoji as the sender intended. Would two complete planes of Unicode be large enough for that? Deseret and Phaistos, as two examples, were being used in a Unicode environment way before they were added to The Unicode Standard. There were web pages published in Deseret before Deseret was accepted into Unicode, and newer pages weren't using "ASCII-hack" fonts. Enthusiasts could form their own ad-hoc committee and set up some form of registry for pre-Unicode emoji using the Private Use Planes of Unicode. Vendor support wouldn't be likely, at least not right away, but vendor support isn't happening any time soon for most proposed emoji anyway. Since emoji enthusiasts come from all walks of life, there's surely someone who can whip up an app or an add-on. Plus, conventional fonts can be made for the black and white fallback glyphs, and those would get things going while awaiting apps/add-ons. If usage of these new emoji snowballs as much as the enthusiasts expect it to, then the search engine trending might be tuned to the individual PUA character and give an *exact* reading of just how popular any particular proposed emoji is. And *those* figures would tend to support the promotion of specific candidates into regular Unicode if the figures were high enough. 
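For reference, the Private Use ranges in question are fixed by the standard and easy to enumerate; a minimal Python sketch (the helper function name is just for illustration):

```python
# Sketch: classify a code point against Unicode's three Private Use ranges,
# i.e. the BMP Private Use Area plus the two Private Use Planes (15 and 16).

PUA_RANGES = [
    (0xE000, 0xF8FF),      # BMP Private Use Area (6,400 code points)
    (0xF0000, 0xFFFFD),    # Plane 15, Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Plane 16, Supplementary Private Use Area-B
]

def is_private_use(cp: int) -> bool:
    """True if the code point lies in any Private Use range."""
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)

# Total space available for private agreements: 137,468 code points.
total = sum(hi - lo + 1 for lo, hi in PUA_RANGES)
print(total)
print(is_private_use(0x1F600))  # False: U+1F600 is a regularly encoded emoji
```

So the two Private Use Planes alone offer over 131,000 code points, far more than any plausible registry of pre-Unicode emoji would need.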
And if these new emoji turn out to be just a passing fad, no harm done. As The Universal Character Set, it should be able to support the needs of all users. And with the Private Use Areas, it does. As a caveat, some Unicode cognoscenti express disdain for the PUA, so there would be some people who would call a PUA solution either batty or crazy. But such PUA solutions have the advantage of getting things up-and-running and allowing specialists and enthusiasts to exchange exactly the kind of information they want to exchange, such as the anarchy symbol, without needing anybody's approval or permission. Which might explain the disdain. https://en.wikipedia.org/wiki/Private_Use_Areas From unicode at unicode.org Mon Aug 20 13:22:13 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Mon, 20 Aug 2018 11:22:13 -0700 Subject: Tales from the Archives In-Reply-To: <20180820130846.h7mij%steffen@sdaoden.eu> References: <20180820130846.h7mij%steffen@sdaoden.eu> Message-ID: <12e6ad91-89e4-ec87-85ad-8fc4ab3f6631@att.net> Steffen, Are you looking for the Unicode list email archives? https://www.unicode.org/mail-arch/ Those contain list content going back all the way to 1994. --Ken On 8/20/2018 6:08 AM, Steffen Nurpmeso via Unicode wrote: > I have the impression that many things which have been posted here > some years ago are now only available via some Forums or other > browser based services. What is posted here seems to be mostly > a duplicate of the blog only.
From unicode at unicode.org Mon Aug 20 13:47:49 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 20 Aug 2018 11:47:49 -0700 Subject: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) Message-ID: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> James Kass wrote: > As a caveat, some Unicode cognoscenti express disdain for the PUA, so > there would be some people who would call a PUA solution either batty > or crazy. I'm concerned that the constant "health warnings" about avoiding the PUA may have scared everyone away from this primary use case. Yes, you run the risk of someone else's PUA implementation colliding with yours. That's why you create a Private Use Agreement, and make sure it's prominently available to people who want to use your solution. It's not like there are hundreds of PUA schemes anyway. Yes, you will have to convert any existing data if the solution ever gets encoded in Unicode. That happened for Deseret and Shavian, and maybe others, and the sky didn't fall. People forget that it was the PUA in Shift-JIS, by Japanese mobile providers, that provided the platform for emoji to take off to such an extent that... well, we know the rest. If private-use is good enough for a legacy encoding, it ought to be good enough for Unicode. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Mon Aug 20 14:12:42 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Mon, 20 Aug 2018 21:12:42 +0200 Subject: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) In-Reply-To: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> Message-ID: > ... some people who would call a PUA solution either batty > or crazy. I don't think it is either batty or crazy. 
People can certainly use the PUA to interchange text (assuming that they have downloaded fonts and keyboards or some other input method beforehand), and it can definitely serve as a proof of concept. Plain symbols, with no interactions between them (like changing shape with complex scripts), no combining/non-spacing marks, no case mappings, and so on, are the best possible case for PUA. The only caution I would give is that people shouldn't expect general purpose software to do anything with PUA text that depends on character properties. Mark On Mon, Aug 20, 2018 at 8:52 PM Doug Ewell via Unicode wrote: > James Kass wrote: > > > As a caveat, some Unicode cognoscenti express disdain for the PUA, so > > there would be some people who would call a PUA solution either batty > > or crazy. > > I'm concerned that the constant "health warnings" about avoiding the PUA > may have scared everyone away from this primary use case. > > Yes, you run the risk of someone else's PUA implementation colliding > with yours. That's why you create a Private Use Agreement, and make sure > it's prominently available to people who want to use your solution. It's > not like there are hundreds of PUA schemes anyway. > > Yes, you will have to convert any existing data if the solution ever > gets encoded in Unicode. That happened for Deseret and Shavian, and > maybe others, and the sky didn't fall. > > People forget that it was the PUA in Shift-JIS, by Japanese mobile > providers, that provided the platform for emoji to take off to such an > extent that... well, we know the rest. If private-use is good enough for > a legacy encoding, it ought to be good enough for Unicode. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Mon Aug 20 14:38:30 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 20 Aug 2018 12:38:30 -0700 Subject: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) Message-ID: <20180820123830.665a7a7059d7ee80bb4d670165c8327d.2bbe13127f.wbe@email03.godaddy.com> Mark Davis wrote: > The only caution I would give is that people shouldn't expect general > purpose software to do anything with PUA text that depends on > character properties. Very true, and a good point. People with creative PUA ideas do sometimes expect this to magically work. I have anecdotes, if anyone is interested off-list. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Mon Aug 20 17:22:33 2018 From: unicode at unicode.org (Steffen Nurpmeso via Unicode) Date: Tue, 21 Aug 2018 00:22:33 +0200 Subject: Tales from the Archives In-Reply-To: <12e6ad91-89e4-ec87-85ad-8fc4ab3f6631@att.net> References: <20180820130846.h7mij%steffen@sdaoden.eu> <12e6ad91-89e4-ec87-85ad-8fc4ab3f6631@att.net> Message-ID: <20180820222233.iwl8c%steffen@sdaoden.eu> Terrible! Ken Whistler wrote in <12e6ad91-89e4-ec87-85ad-8fc4ab3f6631 at att.net>: |Steffen, | |Are you looking for the Unicode list email archives? | |https://www.unicode.org/mail-arch/ | |Those contain list content going back all the way to 1994. Dear Ken Whistler, no, and yes, having an archive is very good, though your statement from 1997-07-16 ("Plan 9 (a Unix OS) uses UTF-8") i cannot agree with (it feels very different from Unix). It was just that i have read, on one of the mailing-lists i am subscribed to, a citation of a Unicode statement that i had never seen on the Unicode mailing-list. It is very awkward, but i _again_ cannot find what attracted my attention, even with the help of a search engine. I think "faith alone will reveal the true name of shuruq" (1997-07-18).
--steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt) From unicode at unicode.org Mon Aug 20 18:23:13 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Mon, 20 Aug 2018 16:23:13 -0700 Subject: Tales from the Archives In-Reply-To: <20180820222233.iwl8c%steffen@sdaoden.eu> References: <20180820130846.h7mij%steffen@sdaoden.eu> <12e6ad91-89e4-ec87-85ad-8fc4ab3f6631@att.net> <20180820222233.iwl8c%steffen@sdaoden.eu> Message-ID: Steffen noted: On 8/20/2018 3:22 PM, Steffen Nurpmeso via Unicode wrote: > It was just that i have read on one of the mailing-lists i am > subscribed to a cite of a Unicode statement that i have never read > of anything on the Unicode mailing-list. It is very awkward, but > i_again_ cannot find what attracted my attention, even with the > help of a search machine. I think "faith alone will reveal the > true name of shuruq" (1997-07-18). > > --steffen Fortunately, since I collect everything, this one has not been lost to the mists of history yet. So here you go, another "tale from the archives", aka "every character has a story". --Ken =================================================================== From kenw Thu Sep 18 14:23 PDT 1997 Date: Thu, 18 Sep 1997 14:20:29 -0700 From: kenw (Kenneth Whistler) Message-Id: <9709182120.AA16670 at birdie.sybase.com> To: unicode at unicode.org Subject: War over 'shuruq' narrowly averted Cc: kenw Dateline: Geneva, Thursday, September 18, 1997 The ISOnominalists and the SInominalists met today at the bargaining table in their long-running dispute over whether the correct name of U+05BC should be: HEBREW POINT DAGESH OR MAPIQ (shuruq) or HEBREW POINT DAGESH OR MAPIQ OR SHURUQ After considerable posturing and threats by both sides, opposing camps reluctantly agreed that a compromise solution was preferable to open flamewar. 
Unnamed sources state that the new name to be revealed in a press conference this evening is: HEBREW POINT DAGESH OR MAPIQ (or shuruq) Both sides have also now agreed to focus their attention jointly at countering the antinomianist camp, which claims that no names can be imposed by human moral strictures, and that faith alone will reveal the true name of shuruq. ============================================================= From unicode at unicode.org Mon Aug 20 18:49:45 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Mon, 20 Aug 2018 19:49:45 -0400 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: On 08/20/2018 10:20 AM, Asmus Freytag via Unicode wrote: > On 8/20/2018 7:09 AM, James Kass via Unicode wrote: >> Leo Broukhis responded to William Overington: >> >>>> I decided that trying to design emoji for 'I' and for 'You' seemed >>>> interesting so I decided to have a go at designing some. >>> Why don't we just encode Blissymbolics, where pronouns are already >>> expressible as abstract symbols, and emojify them? >> Emoji enthusiasts seeking to devise a universal pictographic set might >> be well-advised to build from existing work such as Blissymbolics. >> >> I think William Overington's designs are clever, though. Anyone who >> has ever studied a foreign language (or even their own language) would >> easily and quickly recognize the intended meanings of the symbols once >> they understand the derivation. >> > What about languages that don't have or don't use personal pronouns. > Their speakers might find their use odd or awkward. 
> The same for many other grammatical concepts: they work reasonably > well if used by someone from a related language, or for linguists > trained in general concepts, but languages differ so much in what they > express explicitly that if any native speaker transcribes the features > that are exposed (and not implied) in their native language it may not > be what a reader used to a different language is expecting to see. > Most of the emoji are heavily dependent on a presumed culture anyway. The smiley-faces maybe could be argued to be cross-cultural (facial expressions are the same for all people; well, mostly), though even then the styling is cultural. But a lot of the rest are culture-dependent, and that's fine and how it should be, IMO. That said, I think William Overington's designs are generally opaque and incomprehensible. James Kass says, "Anyone who has ever studied a foreign language (or even their own language) would easily and quickly recognize the intended meanings of the symbols *once they understand the derivation*." (emphasis added). Well, yeah, once you tell me what something means, I know what it means! The point of emoji is that they already make some sort of "obvious" sense; admittedly, to those who are in the covered culture. (You can't say the same would be true of pronoun emoji for linguists, because no linguist would ever look at those symbols and think, "Oh right! Pronouns!" Yes, they'll make sense *once explained* and once you're told they're pronouns, but that's not the same thing.) Moreover, they are once again an attempt to shoehorn Overington's pet project, "language-independent sentences/words," which are still generally deemed out of scope for Unicode. ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 20 18:55:27 2018 From: unicode at unicode.org (Mark E.
Shoulson via Unicode) Date: Mon, 20 Aug 2018 19:55:27 -0400 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: <0124f1a2-01e5-80c6-35b7-8143b71437da@kli.org> On 08/20/2018 10:30 AM, James Kass via Unicode wrote: > As The Universal Character Set, it should be able to support the needs > of all users. And with the Private Use Areas, it does. Here, I agree with you. This kind of experimentation is exactly what the PUA is for, especially for these putative "universal pictographic systems" which will need space to hold the whole system, since the individual signs won't mean much unless you understand the system (which I know I said was an argument against encoding them at all, but that's the point of the PUA: see if you can get some traction, if people really DO find it useful, etc. Then you can make me eat my words.) I think it's been suggested a few times. Go forth into the PUA, and make it yours, then! ~mark From unicode at unicode.org Mon Aug 20 19:04:34 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Mon, 20 Aug 2018 20:04:34 -0400 Subject: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> Message-ID: <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> On 08/20/2018 03:12 PM, Mark Davis ☕️ via Unicode wrote: > > ... some people who would call a PUA solution either batty > > or crazy. > > I don't think it is either batty or crazy.
People can certainly use > the PUA to interchange text (assuming that they have downloaded fonts > and keyboards or some other input method beforehand), and it can > definitely serve as a proof of concept. Plain symbols, with no > interactions between them (like changing shape with complex scripts), > no combining/non-spacing marks, no case mappings, and so on, are the > best possible case for PUA. It is kind of a bummer, though, that you can't experiment (easily? or at all?) in the PUA with scripts that have complex behavior, or even not-so-complex behavior like accents & combining marks, or RTL direction (here, also, am I speaking true? Is there a block of RTL PUA also? I guess there's always RLO, but meh.) Still, maybe it doesn't really matter much: your special-purpose font can treat any codepoint any way it likes, right? ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 20 19:17:21 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Mon, 20 Aug 2018 17:17:21 -0700 Subject: Private Use areas In-Reply-To: <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> Message-ID: On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: > Is there a block of RTL PUA also? No. --Ken From unicode at unicode.org Mon Aug 20 19:53:18 2018 From: unicode at unicode.org (via Unicode) Date: Tue, 21 Aug 2018 08:53:18 +0800 Subject: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) In-Reply-To: <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> Message-ID: <444142b31601a3fbbdbb765e47cbd125@koremail.com> On 2018-08-21 08:04, Mark E.
Shoulson via Unicode wrote: > On 08/20/2018 03:12 PM, Mark Davis ☕️ via Unicode wrote: > >>> ... some people who would call a PUA solution either batty > or >> crazy. >> >> I don't think it is either batty or crazy. People can certainly use >> the PUA to interchange text (assuming that they have downloaded >> fonts and keyboards or some other input method beforehand), and >> it can definitely serve as a proof of concept. Plain symbols – with no interactions between them (like changing >> shape with complex scripts), no combining/non-spacing marks, no case >> mappings, and so on – are the best possible case for PUA. > > It is kind of a bummer, though, that you can't experiment (easily? or > at all?) in the PUA with scripts that have complex behavior, or even > not-so-complex behavior like accents & combining marks, or RTL > direction (here, also, am I speaking true? Is there a block of RTL > PUA also? I guess there's always RLO, but meh.) Still, maybe it > doesn't really matter much: your special-purpose font can treat any > codepoint any way it likes, right? > Not all properties come from the font. For example a Zhuang character PUA font, which supplements CJK ideographs, does not rotate characters 90 degrees when changing from RTL to vertical display of text. John Knightley > ~mark From unicode at unicode.org Mon Aug 20 19:53:58 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 20 Aug 2018 16:53:58 -0800 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: Mark E. Shoulson wrote, > ...
James Kass says, "Anyone who has ever studied a > foreign language (or even their own language) would > easily and quickly recognize the intended meanings > of the symbols once they understand the derivation." > ... Well, yeah, once you tell me what something > means, I know what it means! The point of emoji is > that they already make some sort of "obvious" > sense – admittedly, to those who are in the covered > culture. To be clear, I do not think William Overington's personal pronoun symbol designs would make valid emoji candidates. I'm only talking about the symbols as abstract symbols. Blissymbolics, as pointed out by Leo Broukhis, might be good candidates for "emojification". Emoji are pictographic. Abstract symbols are not. From unicode at unicode.org Mon Aug 20 20:47:30 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 20 Aug 2018 17:47:30 -0800 Subject: Tales from the Archives In-Reply-To: <2096117.2386243.1534690888302@mail.yahoo.com> References: <2096117.2386243.1534690888302@mail.yahoo.com> Message-ID: Alan Wood wrote, > Long ago, the response to the question "Why can't I > see character x" was often to install a copy of the > Code2000 font and send the fee ($10 ?) to James Kass > by airmail. It was always only $5. (About twenty years ago, Alan was the first to register it. I still have the envelope.) > Some of the frequent posters have probably passed away, > while others (like me) have got older, and slowed down > and/or developed new interests. We don't get any younger, it's true. Both time and people move on.
Best regards, James Kass From unicode at unicode.org Mon Aug 20 20:57:42 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 20 Aug 2018 17:57:42 -0800 Subject: Tales from the Archives In-Reply-To: References: <2096117.2386243.1534690888302@mail.yahoo.com> Message-ID: Mark Davis wrote, > https://docs.google.com/presentation/d/1QAyfwAn_0SZJ1yd0WiQgoJdG7djzDiq2Isb254ymDZc/edit#slide=id.g38b1fcd632_0_166 > That's an effective presentation. I especially liked the two Stephen Colbert clips. Mark makes a good point here about how specialized Unicode technical issues have their own special forums now. I'd really not taken that into account, and should have. The public list is geared towards being a forum for developers and Unicoders getting together to discuss various aspects of the Standard and its implementation, along with general public users with specific questions/concerns. Another feature of this list is that new character proposals can be vetted here, whether for a single new character or an entire new script. The display issues of yore no longer exist. Technical and specific aspects of Unicode have special forums. New character proposals are usually written by pros who have been through the process many times before. It doesn't really leave us much to discuss, does it? I'll be standing by in case anyone posts a question about how to display "character-x" on their Windows 98 system. Sigh. 
From unicode at unicode.org Mon Aug 20 22:20:48 2018 From: unicode at unicode.org (Rebecca Bettencourt via Unicode) Date: Mon, 20 Aug 2018 20:20:48 -0700 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: On Mon, Aug 20, 2018 at 5:53 PM, James Kass via Unicode wrote: > Blissymbolics, as pointed out > by Leo Broukhis, might be good candidates for "emojification". > Why don't we just get Blissymbolics encoded as it is? -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 20 14:32:20 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Mon, 20 Aug 2018 20:32:20 +0100 (BST) Subject: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) Message-ID: <18901449.46712.1534793540040.JavaMail.defaultUser@defaultHost> Doug Ewell wrote: > Yes, you run the risk of someone else's PUA implementation colliding with yours. That's why you create a Private Use Agreement, and make sure it's prominently available to people who want to use your solution. It's not like there are hundreds of PUA schemes anyway. Yes, that is generally true. However, a situation where that does not matter is if one just wishes to include some specially designed glyphs of one's own design in a PDF (Portable Document Format) document and one uses a Private Use Area encoding simply so that the PDF document with a subset of the glyphs of the font embedded in the PDF can be produced using a desktop publishing program. That is, one makes the font, one installs the font, one uses the font within the desktop publishing package. 
I have used that technique and the technique worked very well as the Windows operating system treated my font the same way as it did other fonts. With the desktop publishing package that I am using (Serif PagePlus version X7) that is only using the plane zero Private Use Area. Thus the providing of information to anyone reading the PDF document is as displayed glyphs rather than as code points. The availability of the Private Use Area allowed me to make such code point assignments for the glyphs that I had designed and then use those code points in a manner entirely compatible with The Unicode Standard. William Overington Monday 20 August 2018 From unicode at unicode.org Tue Aug 21 01:50:39 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 20 Aug 2018 22:50:39 -0800 Subject: Unicode 11 Georgian uppercase vs. fonts In-Reply-To: References: <20180726204652.39387370@JRWUBU2> <3C168136-067E-4FE3-AB25-F8CED964A035@evertype.com> Message-ID: (from 2018-07-27) > Michael Everson responded, > >>> If members of the Georgian user community want to consider this a stylistic difference, they are free to do so. >> >> It isn't a stylistic difference. It is a different use of capital letters than Latin, Cyrillic and other scripts use them. suppose that english was written with a bicameral script, but english users only used the upper case letters for emphasis. in other words, personal names (like bela lugosi), place names (like bechuanaland), and book titles (like "the bridge over the river kwai") would always be in lower case. if someone needed to emphasize something by SHOUTING, they would use all-caps to make this stylistic distinction. if english users called upper case "harcourt" and lower case "fenton", there would be no earthly reason for them to consider switching from fenton to harcourt to be anything other than a stylistic difference.
along comes a consortium with script experts and computer encoding experts who rightfully determine that the difference between harcourt and fenton is actually a casing difference, even though the english writing system does not actually use casing in a manner consistent with other bicameral scripts. so the consortium, tasked with breaking down elements of text for computer entry, exchange, and storage, encodes the english script as a casing script. would that action by the consortium alter my perception (as a typical member of the english user community) that the difference between harcourt and fenton is simply stylistic? HECK, no! the same applies to georgian. or any script. whatever the consortium does for computer text processing purposes should NEVER be interpreted as an effort to make the users change their perceptions of their OWN writing systems. we've been through this kind of thing before, with tamil as a notorious example. best regards, james kass From unicode at unicode.org Tue Aug 21 03:01:46 2018 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Tue, 21 Aug 2018 09:01:46 +0100 (BST) Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: On 2018-08-20, Mark E. Shoulson via Unicode wrote: > Moreover, they [William's pronoun symbols] are once again an attempt to shoehorn Overington's pet > project, "language-independent sentences/words," which are still > generally deemed out of scope for Unicode. 
I find it increasingly hard to understand why William's project is out of scope (apart from the "demonstrate use first, then encode" principle, which is in any case not applied to emoji), when emoji are language-independent words - or even sentences: the GROWING HEART emoji is (I presume) supposed to be a language-independent way of saying "I love you more every day". Which seems rather more fatuous as a thing to put in a writing-systems standard than the things I think William would want. Not that I want to hear any more about William's unmentionables; I just wish emoji were equally unmentionable. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Tue Aug 21 05:01:56 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 21 Aug 2018 11:01:56 +0100 Subject: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) In-Reply-To: <444142b31601a3fbbdbb765e47cbd125@koremail.com> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> Message-ID: <20180821110156.453c129a@JRWUBU2> On Tue, 21 Aug 2018 08:53:18 +0800 via Unicode wrote: > On 2018-08-21 08:04, Mark E. Shoulson via Unicode wrote: > > Still, maybe it > > doesn't really matter much: your special-purpose font can treat any > > codepoint any way it likes, right? > Not all properties come from the font. For example a Zhuang character > PUA font, which supplements CJK ideographs, does not rotate > characters 90 degrees, when change from RTL to vertical display of > text. Isn't that supposed to be treated by an OpenType feature such as 'vert'? Or does the rendering stack get in the way? However, one might need reflowing text to be about 40% WJ. Richard. 
From unicode at unicode.org Tue Aug 21 05:08:16 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 21 Aug 2018 03:08:16 -0700 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: <1317044b-7de2-f4b2-9baf-18ccc85a475e@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 21 08:17:21 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 21 Aug 2018 05:17:21 -0800 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: Rebecca Bettencourt wrote, > Why don't we just get Blissymbolics encoded as it is? Blissymbols remain in the Pipeline, which still links the Everson proposal from 1998. The Scripts Encoding Initiative ( http://linguistics.berkeley.edu/sei/ ) page, http://linguistics.berkeley.edu/sei/scripts-not-encoded.html shows Blissymbols and links the same proposal. Blissymbolics Communication International, http://www.blissymbolics.org/ will likely produce the next proposal. Both Scripts Encoding Initiative and Blissymbolics Communication International depend upon funding.
From unicode at unicode.org Tue Aug 21 09:56:51 2018 From: unicode at unicode.org (Adam Borowski via Unicode) Date: Tue, 21 Aug 2018 16:56:51 +0200 Subject: Private Use areas In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> Message-ID: <20180821145651.75orx5kfrtlzhfel@angband.pl> On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote: > On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: > > Is there a block of RTL PUA also? > > No. Perhaps there should be? What about designating a part of the PUA to have a specific property? Only certain properties matter enough: * wide * RTL * combining as most others are better represented in the font itself. This could be done either by parceling one of existing PUA ranges: planes 15 and 16 are virtually unused thus any damage would be negligible; or perhaps by allocating a new range elsewhere. Meow! -- What Would Jesus Do, MUD/MMORPG edition: * multiplay with an admin char to benefit your mortal [Mt3:16-17] * abuse item cloning bugs [Mt14:17-20, Mt15:34-37] * use glitches to walk on water [Mt14:25-26] From unicode at unicode.org Tue Aug 21 12:21:43 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Tue, 21 Aug 2018 19:21:43 +0200 Subject: Private Use areas In-Reply-To: <20180821145651.75orx5kfrtlzhfel@angband.pl> (Adam Borowski via Unicode's message of "Tue, 21 Aug 2018 16:56:51 +0200") References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> Message-ID: <86h8jnab4o.fsf@mimuw.edu.pl> On Tue, Aug 21 2018 at 16:56 +0200, unicode at unicode.org writes: > On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote: >> On 8/20/2018 5:04 PM, Mark E.
Shoulson via Unicode wrote: >> > Is there a block of RTL PUA also? >> >> No. > > Perhaps there should be? > > What about designating a part of the PUA to have a specific property? Only > certain properties matter enough: > * wide > * RTL > * combining > as most others are better represented in the font itself. > > This could be done either by parceling one of existing PUA ranges: planes 15 > and 16 are virtually unused thus any damage would be negligible; or perhaps > by allocating a new range elsewhere. I don't think it's a good idea. I think PUA users should provide the properties of the characters used in a form analogical to the Unicode itself, and the software should be able to use this additional information. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Tue Aug 21 12:36:04 2018 From: unicode at unicode.org (Steven R. Loomis via Unicode) Date: Tue, 21 Aug 2018 10:36:04 -0700 Subject: Private Use areas In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> Message-ID: 2011 Thread: https://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0124.html Please read in particular these two: - https://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0174.html - https://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0212.html (tl;dr: 1. the PUA set is fixed, 2. being private, the properties may be overridable by conformant implementations.) On Mon, Aug 20, 2018 at 5:17 PM Ken Whistler via Unicode < unicode at unicode.org> wrote: > > > On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: > > Is there a block of RTL PUA also? > > No. > > --Ken > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue Aug 21 13:03:41 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Tue, 21 Aug 2018 11:03:41 -0700 Subject: Private Use areas In-Reply-To: <20180821145651.75orx5kfrtlzhfel@angband.pl> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> Message-ID: <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote: > On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote: >> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: >>> Is there a block of RTL PUA also? >> No. > Perhaps there should be? This is a periodic suggestion that never goes anywhere--for good reason. (You can search the email archives and see that it keeps coming up.) Presuming that this question was asked in good faith... > > What about designating a part of the PUA to have a specific property? The problem with that is that assigning *any* non-default property to any PUA code point would break existing implementations' assumptions about PUA character properties and potentially create havoc with existing use. > Only certain properties matter enough: That is an un-demonstrated assertion that I don't think you have thought through sufficiently. > * wide > * RTL RTL is not some binary counterpart of LTR. There are 23 values of Bidi_Class, and anyone who wanted to implement a right-to-left script in PUA might well have to make use of multiple values of Bidi_Class. Also, there are two major types of strong right-to-leftness: Bidi_Class=R and Bidi_Class=AL. Should a "RTL PUA" zone favor Arabic type behavior or non-Arabic type behavior? > * combining Also not a binary switch. Canonical_Combining_Class is a numeric value, and any value but ccc=0 for a PUA character would break normalization. 
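[Editor's note: Ken's point about default property assumptions is easy to verify. A minimal sketch using Python's standard-library unicodedata module (the specific code points are arbitrary, one from each Private Use range) shows the immutable defaults that conformant implementations bake in:]

```python
import unicodedata

# One code point from each Private Use range: BMP, Plane 15, Plane 16.
for cp in (0xE000, 0xF0000, 0x100000):
    ch = chr(cp)
    print("U+%05X" % cp,
          unicodedata.category(ch),       # gc=Co (Private_Use)
          unicodedata.bidirectional(ch),  # Bidi_Class=L, i.e. strong left-to-right
          unicodedata.combining(ch))      # ccc=0, i.e. not a combining mark
```

[Any scheme that assigned a PUA code point a different Bidi_Class or a nonzero combining class would contradict what this library, and every other conformant implementation, already reports, which is exactly the breakage described above.]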
Then for the General_Category, there are three types of "marks" that count as combining: gc=Mn, gc=Mc, gc=Me. Which of those would be favored in any PUA assignment? > as most others are better represented in the font itself. Really? Suppose someone wants to implement a bicameral script in PUA. They would need case mappings for that, and how would those be "better represented in the font itself"? Or how about digits? Would numeric values for digits be "better represented in the font itself"? How about implementation of punctuation? Would segmentation properties and behavior be "better represented in the font itself"? > > This could be done either by parceling one of existing PUA ranges: planes 15 > and 16 are virtually unused thus any damage would be negligible; That is simply an assertion -- and not the kind of assertion that the UTC tends to accept on spec. I rather suspect that there are multiple participants on this email list, for example, who *do* have implementations making extensive use of Planes 15/16 PUA code points for one thing or another. > or perhaps > by allocating a new range elsewhere. See: https://www.unicode.org/policies/stability_policy.html The General_Category property value Private_Use (Co) is immutable: the set of code points with that value will never change. That guarantee has been in place since 1996, and is a rule that binds the UTC. So nope, sorry, no more PUA ranges. > Meow! Grrr! ;-) As I see it, the only feasible way for people to get specialized behavior for PUA ranges involves first ceasing to assume that somehow they can jawbone the UTC into *standardizing* some ranges for some particular use or another. That simply isn't going to happen. 
People who assume this is somehow easy, and that the UTC are a bunch of boneheads who stand in the way of obvious solutions, do not -- I contend -- understand the complicated interplay of character properties, stability guarantees, and implementation behavior baked into system support libraries for the Unicode Standard. The way forward for folks who want to do this kind of thing is: 1. Define a *protocol* for reliable interchange of custom character property information about PUA code points. 2. Convince more than one party to actually *use* that protocol to define sets of interchangeable character property definitions. 3. Convince at least one implementer to support that protocol to create some relevant interchangeable *behavior* for those PUA characters. And if the goal for #3 is to get some *system* implementer to support the protocol in widespread software, then before starting any of #1, #2, or #3, you had better start instead with: 0. Create a consortium (or other ongoing organization) with a 10-year time horizon and participation by at least one major software implementer, to define, publicize, and advocate for support of the protocol. (And if you expect a major software implementer to participate, you might need to make sure you have a business case defined that would warrant such a 10-year effort!) --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 21 13:23:48 2018 From: unicode at unicode.org (Rebecca Bettencourt via Unicode) Date: Tue, 21 Aug 2018 11:23:48 -0700 Subject: Private Use areas In-Reply-To: <86h8jnab4o.fsf@mimuw.edu.pl> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> Message-ID: On Tue, Aug 21, 2018 at 10:21 AM, Janusz S. Bień
via Unicode < unicode at unicode.org> wrote: > I think PUA users should provide the > properties of the characters used in a form analogical to the Unicode > itself, and the software should be able to use this additional > information. > I already provide this myself for my uses of the PUA as well as the CSUR and any vendor-specific agreements I can find: http://www.kreativekorp.com/charset/PUADATA/ Of course there is no way to get software to use this information. I have entertained the idea of being able to embed this information into the font itself as OpenType tables, e.g.:

PUAB -> Blocks.txt
PUAC -> CaseFolding.txt
PUAW -> EastAsianWidth.txt
PUAL -> LineBreak.txt
PUAD -> UnicodeData.txt

I've actually invented table names for the majority of UCD files, but those are probably the most relevant. The table names for the more obscure files get rather... creative, e.g.:

PUA[ -> BidiBrackets.txt
PUA] -> BidiMirroring.txt

That alone may get some people to think twice about this idea. :P -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 21 15:08:49 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 21 Aug 2018 21:08:49 +0100 Subject: Private Use areas In-Reply-To: <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> Message-ID: <20180821210849.56aef231@JRWUBU2> On Tue, 21 Aug 2018 11:03:41 -0700 Ken Whistler via Unicode wrote: > On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote: > Really? Suppose someone wants to implement a bicameral script in PUA. > They would need case mappings for that, and how would those be > "better represented in the font itself"? Or how about digits?
Would > numeric values for digits be "better represented in the font itself"? > How about implementation of punctuation? Would segmentation > properties and behavior be "better represented in the font itself"? The least intrusive way of defining the meaning of a graphic (sensu lato) character is by a font, in a very wide sense that would interpret a Unicode code chart as a font. Without a font in this sense, normal characters in the PUA have no meaning. If one insists on a font to have an interpretation, then: (1) PUA characters in plain text are meaningless - I believe that's pretty much the position now. (2) Different schemes can co-exist, even within the same formatted document, by having different formats. This is the case now. It then makes sense to store the properties in the font, which needs to be saved with or in the document for the document to continue to make sense. Casing and digits are luxuries. Are we not told that searching should be done by collation? We then do not need case-folding! Interpreting the preferred representation of Roman numerals does not use Unicode properties beyond the approximate principle of one character, one codepoint. As to segmentation, my understanding was that there were no characters available to indicate word boundaries in scriptio continua; the closest one has is line-breaking suggestions. If my memory serves me right, SIL Graphite fonts can hold line-breaking information. Richard. 
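[Editor's note: the kind of machine-readable PUA property data discussed above, "in a form analogical to the Unicode itself", could be as simple as a UnicodeData.txt-style record file for the private-use assignments. A minimal Python sketch; the code points, names, and property values below are hypothetical, invented purely for illustration:]

```python
# Hypothetical UnicodeData.txt-style records for PUA assignments.
# Fields (as in the real UCD): code;name;gc;ccc;bidi;...
PUA_DATA = """\
E000;MY SCRIPT LETTER KA;Lo;0;R;;;;;N;;;;;
E001;MY SCRIPT VOWEL SIGN I;Mn;0;NSM;;;;;N;;;;;
E002;MY SCRIPT DIGIT ONE;Nd;0;R;;;1;1;N;;;;;
"""

def parse_pua_data(text):
    """Map code point -> (name, general_category, ccc, bidi_class)."""
    props = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        props[cp] = (fields[1], fields[2], int(fields[3]), fields[4])
    return props

props = parse_pua_data(PUA_DATA)
print(props[0xE001])  # -> ('MY SCRIPT VOWEL SIGN I', 'Mn', 0, 'NSM')
```

[The hard part, as the thread notes, is not parsing such a file but getting more than one implementation to honor it.]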
From unicode at unicode.org Tue Aug 21 15:15:35 2018 From: unicode at unicode.org (Adam Borowski via Unicode) Date: Tue, 21 Aug 2018 22:15:35 +0200 Subject: Private Use areas In-Reply-To: <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> Message-ID: <20180821201535.mfgzkrszsqweps23@angband.pl> On Tue, Aug 21, 2018 at 11:03:41AM -0700, Ken Whistler via Unicode wrote: > > On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote: > > On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote: > > > On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: > > > > Is there a block of RTL PUA also? > > > No. > > Perhaps there should be? > > This is a periodic suggestion that never goes anywhere--for good reason. > (You can search the email archives and see that it keeps coming up.) > > Presuming that this question was asked in good faith... Oif, looks like mere months of inattentive lurking are not enough (the thread I got pointed to was from 2011). Apologies. > > or perhaps by allocating a new range elsewhere. > See: > > https://www.unicode.org/policies/stability_policy.html > > The General_Category property value Private_Use (Co) is immutable: the set > of code points with that value will never change. > > That guarantee has been in place since 1996, and is a rule that binds the > UTC. So nope, sorry, no more PUA ranges. Right. > The way forward for folks who want to do this kind thing is: > > 1. Define a *protocol* for reliable interchange of custom character property > information about PUA code points. [...] > And if the goal for #3 is to get some *system* implementer to support the > protocol in widespread software, then before starting any of #1, #2, or #3, > you had better start instead with: > > 0. 
Create a consortium (or other ongoing organization) with a 10-year time > horizon and participation by at least one major software implementer, to > define, publicize, and advocate for support of the protocol. Heh, good point. I wonder, perhaps a long-lived consortium tasked with assigning properties to characters already exists? So your answer _does_ provide a way to go: any PUA use that's no longer private, or any problem someone has with character properties, should go through official channels here instead of inventing one's own standard. With my existing hats on (Debian fonts team member, and someone who messes with terminals in general) I already have two such itches to scratch. Thus, it sounds like I should do the research, prepare a write-up, and then come back to harass you folks with inane questions. Inventing new solutions that work around instead of with you is a bad idea... Meow! -- From unicode at unicode.org Tue Aug 21 16:59:19 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 21 Aug 2018 14:59:19 -0700 Subject: Private Use areas Message-ID: <20180821145919.665a7a7059d7ee80bb4d670165c8327d.5eca04c37c.wbe@email03.godaddy.com> Ken Whistler wrote: > The way forward for folks who want to do this kind of thing is: > > 1. Define a *protocol* for reliable interchange of custom character > property information about PUA code points. I've often thought that would be a great idea. You can't get to steps 2 and 3 without step 1. I'd gladly participate in such a project.
-- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue Aug 21 17:23:00 2018 From: unicode at unicode.org (Rebecca Bettencourt via Unicode) Date: Tue, 21 Aug 2018 15:23:00 -0700 Subject: Private Use areas In-Reply-To: <20180821145919.665a7a7059d7ee80bb4d670165c8327d.5eca04c37c.wbe@email03.godaddy.com> References: <20180821145919.665a7a7059d7ee80bb4d670165c8327d.5eca04c37c.wbe@email03.godaddy.com> Message-ID: On Tue, Aug 21, 2018 at 3:02 PM Doug Ewell via Unicode wrote: > Ken Whistler wrote: > > > The way forward for folks who want to do this kind thing is: > > > > 1. Define a *protocol* for reliable interchange of custom character > > property information about PUA code points. > > I've often thought that would be a great idea. You can't get to steps 2 > and 3 without step 1. I'd gladly participate in such a project. > As would I. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 21 18:45:10 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Tue, 21 Aug 2018 19:45:10 -0400 Subject: Private Use areas In-Reply-To: <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> Message-ID: <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote: > > > On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote: >> On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote: >>> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: >>>> Is there a block of RTL PUA also? >>> No. >> Perhaps there should be? > > This is a periodic suggestion that never goes anywhere--for good > reason. (You can search the email archives and see that it keeps > coming up.) 
> > Presuming that this question was asked in good faith... Yeah, I know there has been talk about such things, and I also knew that whether or not there was an RTL block (which I did not remember for certain), there weren't going to be any *changes* in the PUA, and we were going to have to make do with what there was. There's no way to anticipate all the possible properties people would want in the PUA, though I remember thinking it was probably wrong to make the PUA *strongly* LTR; I know there's a not-strongly flavor too. Best we can do is shout loudly at OpenType tables and hope to cram in behavior (or at least appearance, which is more likely all we can get) that vaguely resembles what we're after. And that's not SO awful, given what we're dealing with. > > As I see it, the only feasible way for people to get specialized > behavior for PUA ranges involves first ceasing to assume that somehow > they can jawbone the UTC into *standardizing* some ranges for some > particular use or another. That simply isn't going to happen. People > who assume this is somehow easy, and that the UTC are a bunch of > boneheads who stand in the way of obvious solutions, do not -- I > contend -- understand the complicated interplay of character > properties, stability guarantees, and implementation behavior baked > into system support libraries for the Unicode Standard. The whole point of the PUA is that it *isn't* standardized (by the UTC). It might have been nice to make some more varied choices of things that couldn't be left unspecified, but you're still going to wind up with "but there aren't any PUA codepoints that are JUST what I need!" And, as said, it's too late now.
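Mark's aside about the PUA being *strongly* LTR is easy to check against stock Unicode property data. The snippet below is illustrative only, using Python's standard `unicodedata` module:

```python
import unicodedata

# The PUA's default property values can be inspected directly: PUA code
# points carry Bidi_Class=L (strong left-to-right) and
# General_Category=Co ("other, private use") in the UCD.
pua_char = '\uE000'          # first BMP private-use code point
print(unicodedata.bidirectional(pua_char))  # 'L'  -> strongly LTR
print(unicodedata.category(pua_char))       # 'Co' -> private use

# This is why an RTL use of the PUA has to fight the defaults: the Bidi
# Algorithm treats the run as left-to-right unless it is wrapped in
# explicit directional controls, e.g. an RLI...PDI isolate.
rtl_forced = '\u2067' + pua_char + '\u2069'  # U+2067 RLI, U+2069 PDI
```

Wrapping every run in isolates is exactly the kind of workaround an RTL private-use script is forced into, since the default properties cannot be changed.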
~mark From unicode at unicode.org Tue Aug 21 21:50:18 2018 From: unicode at unicode.org (Andrew Cunningham via Unicode) Date: Wed, 22 Aug 2018 12:50:18 +1000 Subject: Private Use areas In-Reply-To: <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> Message-ID: On Wednesday, 22 August 2018, Mark E. Shoulson via Unicode < unicode at unicode.org> wrote: > On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote: > >> >> > Best we can do is shout loudly at OpenType tables and hope to cram in > behavior (or at least appearance, which is more likely all we can get) that > vaguely resembles what we're after. And that's not SO awful, given what > we're dealing with. > >> >> At the moment I am looking at implementing three unencoded Arabic characters in the PUA. For the foreseeable future OpenType is a non-starter, so I will look at implementing them in Graphite tables in a font. Andrew -- Andrew Cunningham lang.support at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Wed Aug 22 04:58:58 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 22 Aug 2018 11:58:58 +0200 Subject: Private Use areas In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> Message-ID: Maybe this debate could find an end if there was a way to encode "private use variants", so that we can override an existing character with correct properties by creating a custom variant, which would immediately inherit the properties of the base character on which it is encoded. But for now there are no private use variant codes (PUV). I think that a small block of 16 codes (maybe even less) would be largely enough (given that it would be used only in pairs after any standard character). They could be used after any base character, possibly even after a combining character (so the default combining class for these PUV should be 0). For now there's still no way to have variant sequences unless they are registered and standardized by Unicode, but registration should not be needed (forbidden) for sequences containing PUV. I think there's a usage pattern for such schemes. Their default (spacing) glyph could be a dotted circle with a single hex digit inside; it would be itself non-joining, it would be itself bidi-neutral and used only after a base character from which it would inherit the directionality (so the glyph would appear automatically on the correct side). Actual fonts implementing these PUV sequences would treat the PUV sequences as distinct unbreakable entities mapped to their own abstract character, and subject to common ligation. On Wed, 22 Aug 2018 at 04:58, Andrew Cunningham via Unicode < unicode at unicode.org> wrote: > > > On Wednesday, 22 August 2018, Mark E.
Shoulson via Unicode < > unicode at unicode.org> wrote: > >> On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote: >> >>> >>> >> Best we can do is shout loudly at OpenType tables and hope to cram in >> behavior (or at least appearance, which is more likely all we can get) that >> vaguely resembles what we're after. And that's not SO awful, given what >> we're dealing with. >> >>> >>> > At the moment I am looking at implementing three unencoded Arabic > characters in the PUA. > > For the foreseeable future OpenType is a non-starter, so I will look at > implementing them in Graphite tables in a font. > > Andrew > > > > -- > Andrew Cunningham > lang.support at gmail.com > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Aug 23 04:31:34 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 23 Aug 2018 10:31:34 +0100 Subject: Private Use areas In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> Message-ID: <20180823103134.00645f90@JRWUBU2> On Wed, 22 Aug 2018 11:58:58 +0200 Philippe Verdy via Unicode wrote: > For now there's still no way to have variant sequences unless they are > registered and standardized by Unicode but registration should not be > needed (forbidden) for sequences containing PUV. I believe this scheme is no worse than hack encodings that use Latin character codes for other characters. These schemes often work. (Indeed, the currently best method of getting Tai Tham displayed as rich text that I can find is to use a transliteration-type encoding and a special font, though I can now get pretty close using the proper character codes in the order laid down in the proposals.)
The major problems I can see with appropriating variation sequences are: (1) It might be restricted to base characters - I have no experimental evidence on whether this would happen. Fonts can happily convert base characters to combining characters, though this works best if Latin line-breaking rules take effect. (2) The appropriated variation sequence might be assigned a meaning - but this is no worse than the general ambiguity of PUA characters. (3) Some base characters get special treatment. For example, I had to change my transliteration scheme because hyphen-minus is treated specially by MS Edge - I was using it as a digraph disjunctor - and so clusters were not being formed. In this case, I would have come unstuck as soon as line-wrapping started, so it was a bad choice anyway. Or are there significant renderers that deliberately ignore variation selectors in unregistered, unstandardised variation sequences? I don't recall any problems from when we were discussing variation sequences for chess pieces. For supplementing a script, it might be best to start at VARIATION-SELECTOR-256, and work down if need be with specialist characters. Richard. From unicode at unicode.org Thu Aug 23 05:28:00 2018 From: unicode at unicode.org (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?= via Unicode) Date: Thu, 23 Aug 2018 12:28:00 +0200 Subject: Aw: Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Thu Aug 23 05:48:52 2018 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Thu, 23 Aug 2018 03:48:52 -0700 Subject: Aw: Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: On 8/23/2018 3:28 AM, "Jörg Knappen" wrote: > Asmus, > I know your style of humor, but to keep it straight: > All known human languages, even Pirahã, have pronouns for "I" and "you". And languages like Japanese tend to use them - mostly not. Even if the concepts are known, and can be named, there are deep differences across languages concerning the need or conventions for demarcating them with words in any given context. Replacing words by symbols is not going to fix this - the only way to get a 'universal' system of symbolic expression is to invent a new language, with its own conventions for use of these symbols in any given context. A./ > --Jörg Knappen > *Sent:* Monday, 20 August 2018 at 16:20 > *From:* "Asmus Freytag via Unicode" > *To:* unicode at unicode.org > *Subject:* Re: Thoughts on working with the Emoji Subcommittee (was > Re: Thoughts on Emoji Selection Process) > > What about languages that don't have or don't use personal pronouns? > Their speakers might find their use odd or awkward. > > The same for many other grammatical concepts: they work reasonably > well if used by someone from a related language, or for linguists > trained in general concepts, but languages differ so much in what they > express explicitly that if any native speaker transcribes the features > that are exposed (and not implied) in their native language it may not > be what a reader used to a different language is expecting to see.
> > A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Aug 23 07:10:35 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Thu, 23 Aug 2018 14:10:35 +0200 Subject: Private Use areas In-Reply-To: (Rebecca Bettencourt via Unicode's message of "Tue, 21 Aug 2018 11:23:48 -0700") References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> Message-ID: <86ftz5cmh0.fsf@mimuw.edu.pl> On Tue, Aug 21 2018 at 11:23 -0700, unicode at unicode.org writes: > On Tue, Aug 21, 2018 at 10:21 AM, Janusz S. Bień via Unicode wrote: > > I think PUA users should provide the > properties of the characters used in a form analogous to the Unicode > data itself, and the software should be able to use this additional > information. > > I already provide this myself for my uses of the PUA as well as the > CSUR and any vendor-specific agreements I can find: > > http://www.kreativekorp.com/charset/PUADATA/ I would prefer to see the data in a repository, so others can comment and contribute. As for "any vendor-specific agreements", do MUFI and LINCUA qualify? https://folk.uib.no/hnooh/mufi/ http://andron-typeforum.xobor.de/t10f13-Towards-a-linguistic-corporate-use-area-LINCUA.html > > Of course there is no way to get software to use this information. What kind of software do you have in mind? I'm primarily interested in the locally developed programs https://bitbucket.org/jsbien/unihistext/ https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/ and in Emacs - to my disappointment, it looks like the Unicode data are set at compile time, but perhaps this can be negotiated with the developers. Best regards Janusz -- , Janusz S.
Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Thu Aug 23 10:39:15 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 23 Aug 2018 17:39:15 +0200 Subject: Private Use areas In-Reply-To: <20180823103134.00645f90@JRWUBU2> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> <20180823103134.00645f90@JRWUBU2> Message-ID: You are confusing things: I do not propose "hacking" existing codes, but instead adding new codes for private variations. It's then up to PUV sequence authors to choose an appropriate base character that can have the properties they want to be inherited by the private-use variation sequence, or to choose a base character that will provide some reasonable reading if rendered as is (by renderers or fonts not implementing the private variation sequence, given that they will also append a symbol for the PUV itself after the standard character). Also I do not want to change anything about any existing variation sequences (using VS1 and so on) and their encoding policies, requiring a prior registration and standardisation. On Thu, 23 Aug 2018 at 11:42, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Wed, 22 Aug 2018 11:58:58 +0200 > Philippe Verdy via Unicode wrote: > > > For now there's still no way to have variant sequences unless they are > > registered and standardized by Unicode but registration should not be > > needed (forbidden) for sequences containing PUV. > > I believe this scheme is no worse than hack encodings that use Latin > character codes for other characters. These schemes often work.
> (Indeed, the currently best method of getting Tai Tham displayed as rich > text that I can find is to use a transliteration-type encoding and a > special font, though I can now get pretty close using the proper > character codes in the order laid down in the proposals.) > > The major problems I can see with appropriating variation sequences > are: > (1) It might be restricted to base characters - I have no > experimental evidence on whether this would happen. Fonts can happily > convert base characters to combining characters, though this works > best if Latin line-breaking rules take effect. > > (2) The appropriated variation sequence might be assigned a meaning - > but this is no worse than the general ambiguity of PUA characters. > > (3) Some base characters get special treatment. For example, I had > to change my transliteration scheme because hyphen-minus is treated > specially by MS Edge - I was using it as a digraph disjunctor - and > so clusters were not being formed. In this case, I would have come > unstuck as soon as line-wrapping started, so it was a bad choice anyway. > > Or are there significant renderers that deliberately ignore variation > selectors in unregistered, unstandardised variation sequences? I don't > recall any problems from when we were discussing variation > sequences for chess pieces. > > For supplementing a script, it might be best to start at > VARIATION-SELECTOR-256, and work down if need be with specialist > characters. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Thu Aug 23 11:11:05 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 23 Aug 2018 17:11:05 +0100 Subject: Private Use areas In-Reply-To: <86ftz5cmh0.fsf@mimuw.edu.pl> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> Message-ID: <20180823171105.058ac317@JRWUBU2> On Thu, 23 Aug 2018 14:10:35 +0200 "Janusz S. Bień via Unicode" wrote: > What kind of software do you have in mind? > > I'm primarily interested in the locally developed programs > > https://bitbucket.org/jsbien/unihistext/ > > https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/ It looks as though the security certificates are awry - has someone forgotten to pay the protection money to the right people? (Firefox objects with "The page you are trying to view cannot be shown because the authenticity of the received data could not be verified.") > and in Emacs - to my disappointment, it looks like the Unicode data are > set at compile time, but perhaps this can be negotiated with the > developers. Can you be more specific? For Indic rearrangement I had to define syllables myself with definitions which I then added to composition-function-table. Unfortunately, I then hit the problem that I had to define Indic rearrangement myself, and OpenType fonts fall into several incompatible families, which is why I haven't released a general solution. My emacs kit for Tai Tham is given via http://www.wrdingham.co.uk/lanna/toolkit.html (a probable kinsman got the 'o'), but there are a lot of odds and ends that need sorting out. I would expect that you would be able to override any relevant 'compiler' settings via your Emacs start-up file - I expect Eli Zaretskii will be along soon with more details.
Of course, you could always revert to the old tradition and recompile Emacs yourself - though it may need something like MinGW to compile for Windows. Richard. From unicode at unicode.org Thu Aug 23 11:26:42 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 23 Aug 2018 17:26:42 +0100 Subject: Private Use areas In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> <20180823103134.00645f90@JRWUBU2> Message-ID: <20180823172642.55f167a6@JRWUBU2> On Thu, 23 Aug 2018 17:39:15 +0200 Philippe Verdy via Unicode wrote: > You make a confusion: I do not propose "hacking" existing codes, but > instead adding new codes for private variations. It's then up to PUV > sequence authors to choose an appropropriate base character that can > have the properties they want to be inherited by the private-use > variation sequence, or to choose a base character that will provide > some reasonnable reading if rendererd as is (by renderers or fonts > not implementing the pricate viaration sequence, give nthat they will > also append a symbol for the PUV itself after the standard character). Variation sequences cannot be used to add new characters. Most PUA characters are used to represent new characters. A standard-conformant private variation sequence would generally achieve the same effect as could be achieved by a font feature (typically one of the cvxx, though possibly one of the ssxx), though using font features would be fiddlier and have more limited support, and variation sequences would facilitate data processing. Richard. 
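One concrete property behind Richard's remark that "variation sequences would facilitate data processing": variation selectors are default-ignorable, so software that does not understand a given sequence — registered or private — can strip the selector and fall back to the base character. A sketch, using only the Python standard library:

```python
# Variation selectors (U+FE00..U+FE0F and U+E0100..U+E01EF) are
# default-ignorable code points: a process that does not understand a
# particular variation sequence can drop the selector and operate on the
# base character, which is what makes variation sequences friendlier to
# plain data processing than PUA stand-ins.
VS_RANGES = ((0xFE00, 0xFE0F), (0xE0100, 0xE01EF))

def strip_variation_selectors(text):
    return ''.join(
        ch for ch in text
        if not any(lo <= ord(ch) <= hi for lo, hi in VS_RANGES)
    )

# A hypothetical private variation sequence: base letter + U+E01EF
# VARIATION SELECTOR-256, the selector Richard suggests starting from.
seq = 'A\U000E01EF'
assert strip_variation_selectors(seq) == 'A'   # searching/collation see 'A'
```

A PUA character in the same position would instead be opaque: there is no base character for uncomprehending software to fall back to.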
From unicode at unicode.org Thu Aug 23 11:46:50 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 23 Aug 2018 18:46:50 +0200 Subject: Private Use areas In-Reply-To: <20180823172642.55f167a6@JRWUBU2> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> <20180823103134.00645f90@JRWUBU2> <20180823172642.55f167a6@JRWUBU2> Message-ID: On Thu, 23 Aug 2018 at 18:31, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Thu, 23 Aug 2018 17:39:15 +0200 > Philippe Verdy via Unicode wrote: > > > You are confusing things: I do not propose "hacking" existing codes, but > > instead adding new codes for private variations. It's then up to PUV > > sequence authors to choose an appropriate base character that can > > have the properties they want to be inherited by the private-use > > variation sequence, or to choose a base character that will provide > > some reasonable reading if rendered as is (by renderers or fonts > > not implementing the private variation sequence, given that they will > > also append a symbol for the PUV itself after the standard character). > > Variation sequences cannot be used to add new characters. Remember that I did not speak about existing variation sequences? Only about newly encoding private use variation sequences which do not have to obey the policy of existing VS, and whose purpose would be to inherit most properties (notably direction, breaking, spacing, general category of another existing character). > Most PUA > characters are used to represent new characters. Nor was I speaking about PUA characters. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Aug 23 13:30:52 2018 From: unicode at unicode.org (Janusz S.
=?utf-8?Q?Bie=C5=84?= via Unicode) Date: Thu, 23 Aug 2018 20:30:52 +0200 Subject: Private Use areas In-Reply-To: <20180823171105.058ac317@JRWUBU2> (Richard Wordingham via Unicode's message of "Thu, 23 Aug 2018 17:11:05 +0100") References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> <20180823171105.058ac317@JRWUBU2> Message-ID: <86lg8x9bqb.fsf@mimuw.edu.pl> On Thu, Aug 23 2018 at 17:11 +0100, unicode at unicode.org writes: > On Thu, 23 Aug 2018 14:10:35 +0200 > "Janusz S. Bień via Unicode" wrote: > >> What kind of software do you have in mind? >> >> I'm primarily interested in the locally developed programs >> >> https://bitbucket.org/jsbien/unihistext/ >> >> https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/ > > It looks as though the security certificates are awry - has someone > forgotten to pay the protection money to the right people? (Firefox > objects with "The page you are trying to view cannot be shown because > the authenticity of the received data could not be verified.") I see no such problems with Firefox ESR 52.9.0 on Debian testing. Moreover the program reports that the certificate is valid till 04/21/2020. > >> and in Emacs - to my disappointment, it looks like the Unicode data are >> set at compile time, but perhaps this can be negotiated with the >> developers. > > Can you be more specific? I often search characters by name with C-x 8 Return. I would like to use it also for MUFI characters; I already have the name list (the example directory at https://bitbucket.org/jsbien/unihistext/). I haven't looked very closely into the problem and don't remember the details now, but my impression was that it's not simple. Best regards Janusz -- , Janusz S.
Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Thu Aug 23 13:34:20 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Thu, 23 Aug 2018 20:34:20 +0200 Subject: Private Use areas In-Reply-To: <20180823172642.55f167a6@JRWUBU2> (Richard Wordingham via Unicode's message of "Thu, 23 Aug 2018 17:26:42 +0100") References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> <20180823103134.00645f90@JRWUBU2> <20180823172642.55f167a6@JRWUBU2> Message-ID: <86h8jl9bkj.fsf@mimuw.edu.pl> On Thu, Aug 23 2018 at 17:26 +0100, unicode at unicode.org writes: > On Thu, 23 Aug 2018 17:39:15 +0200 > Philippe Verdy via Unicode wrote: > >> You are confusing things: I do not propose "hacking" existing codes, but >> instead adding new codes for private variations. It's then up to PUV >> sequence authors to choose an appropriate base character that can >> have the properties they want to be inherited by the private-use >> variation sequence, or to choose a base character that will provide >> some reasonable reading if rendered as is (by renderers or fonts >> not implementing the private variation sequence, given that they will >> also append a symbol for the PUV itself after the standard character). > > Variation sequences cannot be used to add new characters. Most PUA > characters are used to represent new characters. A > standard-conformant private variation sequence would generally achieve > the same effect as could be achieved by a font feature (typically one > of the cvxx, though possibly one of the ssxx), This is a typical but IMHO obsolete perspective.
Fonts are for *rendering*; new characters and variants are more and more often needed for *input* of real-life old texts with sufficient precision. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Thu Aug 23 13:49:31 2018 From: unicode at unicode.org (Rebecca Bettencourt via Unicode) Date: Thu, 23 Aug 2018 11:49:31 -0700 Subject: Private Use areas In-Reply-To: <86ftz5cmh0.fsf@mimuw.edu.pl> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> Message-ID: On Thu, Aug 23, 2018 at 5:10 AM, Janusz S. Bień wrote: > > I already provide this myself for my uses of the PUA as well as the > > CSUR and any vendor-specific agreements I can find: > > > > http://www.kreativekorp.com/charset/PUADATA/ > > I would prefer to see the data in a repository, so others can > comment and contribute. > That is actually my intent for the future. Though it's not quite ready yet: https://github.com/kreativekorp/charset/tree/master/puadata That's the data in a "pre-compiled" form; it's turned into a "proper" PUADATA directory using this script: https://github.com/kreativekorp/charset/blob/master/bin/build-public.py As for "any vendor-specific agreements", do MUFI and LINCUA qualify? > I certainly do want to see MUFI and LINCUA provided in this form, but I put them in a different category along with CSUR. I basically have three categories of PUA agreements: Fonts - PUA assignments specific to a font family, e.g. Constructium, Fairfax, Nishiki-teki, Quivira, Junicode, etc. Public - PUA agreements meant to be widely used, e.g. CSUR, UCSUR, MUFI, LINCUA, etc. Vendors - PUA assignments meant to be used by a single vendor or platform, e.g. Adobe, Apple, etc. but also Linux, MirOS, etc. Thank you for those links by the way.
I had tried to find charts for MUFI in the past but had somehow been unsuccessful. > Of course there is no way to get software to use this information. > > What kind of software do you have in mind? > Unicode-related utilities, text editors to start with. You pretty much hit the nail on the head with uniname and emacs as examples. :) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Aug 23 14:17:15 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Thu, 23 Aug 2018 22:17:15 +0300 Subject: Private Use areas In-Reply-To: <86lg8x9bqb.fsf@mimuw.edu.pl> (unicode@unicode.org) References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> <20180823171105.058ac317@JRWUBU2> <86lg8x9bqb.fsf@mimuw.edu.pl> Message-ID: <83r2ioao5g.fsf@gnu.org> > Date: Thu, 23 Aug 2018 20:30:52 +0200 > Cc: Richard Wordingham > From: "Janusz S. Bień via Unicode" > > >> and in Emacs - to my disappointment, it looks like the Unicode data are > >> set at compile time, but perhaps this can be negotiated with the > >> developers. > > > > Can you be more specific? > > I often search characters by name with C-x 8 Return. I would like to use > it also for MUFI characters; I already have the name list (the example > directory at https://bitbucket.org/jsbien/unihistext/). I haven't looked > very closely into the problem and don't remember the details now, but my > impression was that it's not simple. What is "it" in the last sentence? IOW, what is not simple about that with Emacs? It is true that the Unicode-related data is produced at build time, but only some of that is actually recorded in the Emacs binary; the rest is loaded on demand. But all the data is stored in data structures that are mutable, given some Lisp programming.
(It is not clear to me which part of the Unicode data you would like to change; are you talking about adding characters to the list of those defined by Unicode? If you are using the PUA codepoints, it's possible that you will need to update Emacs's notion of PUA as well.) From unicode at unicode.org Thu Aug 23 14:47:03 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Thu, 23 Aug 2018 21:47:03 +0200 Subject: Private Use areas In-Reply-To: <83r2ioao5g.fsf@gnu.org> (Eli Zaretskii's message of "Thu, 23 Aug 2018 22:17:15 +0300") References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> <20180823171105.058ac317@JRWUBU2> <86lg8x9bqb.fsf@mimuw.edu.pl> <83r2ioao5g.fsf@gnu.org> Message-ID: <86va80987c.fsf@mimuw.edu.pl> On Thu, Aug 23 2018 at 22:17 +0300, eliz at gnu.org writes: >> Date: Thu, 23 Aug 2018 20:30:52 +0200 >> Cc: Richard Wordingham >> From: "Janusz S. Bień via Unicode" >> >> >> and in Emacs - to my disappointment, it looks like the Unicode data are >> >> set at compile time, but perhaps this can be negotiated with the >> >> developers. >> > >> > Can you be more specific? >> >> I often search characters by name with C-x 8 Return. I would like to use >> it also for MUFI characters; I already have the name list (the example >> directory at https://bitbucket.org/jsbien/unihistext/). I haven't looked >> very closely into the problem and don't remember the details now, but my >> impression was that it's not simple. > > What is "it" in the last sentence? IOW, what is not simple about that > with Emacs? I'm very glad you joined the discussion. My needs are very simple, for example C-x 8 Return LATIN CAPITAL LETTER A WITH MACRON AND BREVE [MUFI] should yield the character with the code E010. I can provide the list of names and codes.
> It is true that the Unicode-related data is produced at build time, > but only some of that is actually recorded in the Emacs binary; the > rest is loaded on demand. But all the data is stored in data > structures that are mutable, given some Lisp programming. I was never fluent in Lisp programming and by now I have forgotten almost everything I knew, so it's not a task for me. I was thinking about submitting a feature request, but I have also forgotten the proper procedure for doing that. Moreover, I had the impression that I'm the only person who needs it... > > (It is not clear to me which part of the Unicode data you would like > to change; are you talking about adding characters to the list of > those defined by Unicode? If you are using the PUA codepoints, it's > possible that you will need to update Emacs's notion of PUA as well.) Yes, I would like the PUA codepoints to be handled analogously to the standard ones. What do you mean by Emacs's notion of PUA? Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Thu Aug 23 15:37:39 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 23 Aug 2018 21:37:39 +0100 Subject: Private Use areas In-Reply-To: <86h8jl9bkj.fsf@mimuw.edu.pl> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> <20180823103134.00645f90@JRWUBU2> <20180823172642.55f167a6@JRWUBU2> <86h8jl9bkj.fsf@mimuw.edu.pl> Message-ID: <20180823213739.24365c81@JRWUBU2> On Thu, 23 Aug 2018 20:34:20 +0200 "Janusz S. Bień via Unicode" wrote: > This is a typical but IMHO obsolete perspective. Fonts are for > *rendering*; new characters and variants are more and more often > needed for *input* of real-life old texts with sufficient precision.
If we're talking about glyphs which don't actually correspond to new characters, then that sounds like a good case for private use variation selectors. To quote Tully, "Abusus non tollit usum". Richard. From unicode at unicode.org Thu Aug 23 16:15:10 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 23 Aug 2018 22:15:10 +0100 Subject: Emacs Verbose Character Entry (was Private Use Areas) In-Reply-To: <86va80987c.fsf@mimuw.edu.pl> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> <20180823171105.058ac317@JRWUBU2> <86lg8x9bqb.fsf@mimuw.edu.pl> <83r2ioao5g.fsf@gnu.org> <86va80987c.fsf@mimuw.edu.pl> Message-ID: <20180823221510.54c6c43f@JRWUBU2> On Thu, 23 Aug 2018 21:47:03 +0200 "Janusz S. Bień via Unicode" wrote: > My needs are very simple, for example C-x 8 Return LATIN CAPITAL > LETTER A WITH MACRON AND BREVE [MUFI] should yield the character with > the code E010. I can provide the list of names and codes. While it should obviously yield, if anything, or for 'LATIN CAPITAL LETTER A WITH MACRON AND BREVE', it would probably be more important to recognise formal aliases, such as 'LAO LETTER LO' for the input of the Lao letter lo ling (U+0EA5 LAO LETTER LO LOOT), not to be confused with the Lao letter lo lot (a.k.a. ro rot), U+0EA3 LAO LETTER LO LING.
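The point about formal aliases can be illustrated with the two Lao corrections, which are real entries in NameAliases.txt. The sketch below uses toy tables rather than the full UCD and simply consults the alias table when the formal-name table misses:

```python
# Name lookup that also honours formal aliases, as suggested above.
# Toy tables; a real tool would read UnicodeData.txt and NameAliases.txt.

NAMES = {
    "LAO LETTER LO LING": 0x0EA3,   # formal name of U+0EA3
    "LAO LETTER LO LOOT": 0x0EA5,   # formal name of U+0EA5
}

# NameAliases.txt carries "correction" aliases for these two letters,
# whose formal names do not match the letters they actually encode.
ALIASES = {
    "LAO LETTER RO": 0x0EA3,
    "LAO LETTER LO": 0x0EA5,
}

def char_by_name(name):
    cp = NAMES.get(name)
    if cp is None:
        cp = ALIASES.get(name)
    if cp is None:
        raise KeyError(name)
    return chr(cp)

print(hex(ord(char_by_name("LAO LETTER LO"))))  # 0xea5
```

The design choice is simply lookup order: the formal name wins, and aliases are a fallback, so a corrected alias can never shadow a different character's formal name.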
From unicode at unicode.org Thu Aug 23 20:43:39 2018 From: unicode at unicode.org (Julian Wels via Unicode) Date: Fri, 24 Aug 2018 03:43:39 +0200 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: I think Blissymbols could be a separate, well-defined script in Unicode because they are already more or less well defined by their respective groups. This community of interest can lobby for these implementations as a whole instead of multiple individuals separately. Emoji were born in quite a different way and are in no way as well defined as Blissymbols are, for example. There is no self-governing forum of people to discuss the future of emoji and forthcoming additions. Obviously, because they gained international attention just as they were added to the Unicode Standard, but also maybe because "working with the Emoji Subcommittee" is rather hard. The conversation about Blissymbols made me think about a way to solve the current communication problem, although it might be a bit radical: Why not remove the authority to propose new emojis from the ESC and give it to a dedicated, public Emoji Community? Such a community could formulate additional guidelines for upcoming emojis, draft roadmaps and send a quarterly proposal to the ESC for individual approval. Unicode Members could still express ideas and exercise power through participating in the community and appointing people to the ESC. [image: diagram.png] This change would remove pressure and workload from the ESC while retaining most of the control, especially the last word, but the emoji standard would benefit from a dedicated community. I'm just putting this out there. What are your thoughts on this?
Do you think this is unreasonable, or achievable? Julian ?? On Tue, Aug 21, 2018 at 3:25 PM James Kass via Unicode wrote: > Rebecca Bettencourt wrote, > > > Why don't we just get Blissymbolics encoded as it is? > > The Pipeline still has the Everson proposal from 1998, but Blissymbols > are still in the Pipeline. > > Scripts Encoding Initiative > ( http://linguistics.berkeley.edu/sei/ ) > page, > http://linguistics.berkeley.edu/sei/scripts-not-encoded.html > shows Blissymbols and links the same proposal. > > Blissymbolics Communication International, > http://www.blissymbolics.org/ > will likely produce the next proposal. > > Both Scripts Encoding Initiative and Blissymbolics Communication > International depend upon funding. > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: diagram.png Type: image/png Size: 52833 bytes Desc: not available URL: From unicode at unicode.org Thu Aug 23 20:58:11 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Thu, 23 Aug 2018 21:58:11 -0400 Subject: Aw: Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: <4000079c-7d37-7526-dad0-f955e50aaa76@kli.org> Still, pronouns may be universal, but their features aren't... Pronouns in Japanese are not a closed class, and it is not uncommon to use a person's name/title instead of "you".? Happens in English and other languages too, with extremely formal speech, even down to conjugating with 3rd-person verb forms.? (it's really cool to see the mid-sentence back-and-forth shifting in Biblical Hebrew, e.g. Genesis chapter 44.)? 
All of which is to say, as Asmus did, that even "I" and "you" are not interchangeable pieces between languages, easily symbolized by a single "fits-all-languages" placeholder. ~mark On 08/23/2018 06:28 AM, "Jörg Knappen" via Unicode wrote: > Asmus, > I know your style of humor, but to keep it straight: > All known human languages, even Pirahã, have pronouns for "I" and "you". > --Jörg Knappen > *Sent:* Monday, 20 August 2018 at 16:20 > *From:* "Asmus Freytag via Unicode" > *To:* unicode at unicode.org > *Subject:* Re: Thoughts on working with the Emoji Subcommittee (was > Re: Thoughts on Emoji Selection Process) > > What about languages that don't have or don't use personal pronouns? > Their speakers might find their use odd or awkward. > > The same for many other grammatical concepts: they work reasonably > well if used by someone from a related language, or for linguists > trained in general concepts, but languages differ so much in what they > express explicitly that if any native speaker transcribes the features > that are exposed (and not implied) in their native language it may not > be what a reader used to a different language is expecting to see. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Aug 23 21:03:05 2018 From: unicode at unicode.org (Mark E.
Shoulson via Unicode) Date: Thu, 23 Aug 2018 22:03:05 -0400 Subject: Aw: Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: <7eac15eb-7342-1f97-e5a6-ef42d371423e@kli.org> On 08/23/2018 06:48 AM, Asmus Freytag (c) via Unicode wrote: > On 8/23/2018 3:28 AM, "Jörg Knappen" wrote: >> Asmus, >> I know your style of humor, but to keep it straight: >> All known human languages, even Pirahã, have pronouns for "I" and "you". > > And languages like Japanese, tend to use them - mostly not. > > Even if the concepts are known, and can be named, there are deep > differences across languages concerning the need or conventions for > demarcating them with words in any given context. > > Replacing words by symbols is not going to fix this - the only way to > get a 'universal' system of symbolic expression is to invent a new > language, with its own conventions for use of these symbols in any > given context. > It isn't like replacing words with symbols hasn't been tried... I think Francis Lodwick had a "universal symbology" like this in the works in the 1600s. ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Aug 24 03:01:14 2018 From: unicode at unicode.org (Janusz S.
=?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 24 Aug 2018 10:01:14 +0200 Subject: Private Use areas In-Reply-To: (Rebecca Bettencourt's message of "Thu, 23 Aug 2018 11:49:31 -0700") References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> Message-ID: <864lfk2nxx.fsf@mimuw.edu.pl> On Thu, Aug 23 2018 at 11:49 -0700, beckiergb at gmail.com writes: > On Thu, Aug 23, 2018 at 5:10 AM, Janusz S. Bień wrote: > > > I already provide this myself for my uses of the PUA as well as the > > CSUR and any vendor-specific agreements I can find: > > > > http://www.kreativekorp.com/charset/PUADATA/ > > I would prefer to see the data in a repository, so others can > comment and contribute. > > That is actually my intent for the future. Though it's not quite ready yet: > > https://github.com/kreativekorp/charset/tree/master/puadata Great! > > That's the data in a "pre-compiled" form; it's turned into a "proper" > PUADATA directory using this script: > > https://github.com/kreativekorp/charset/blob/master/bin/build-public.py > > As for "any vendor-specific agreements", do MUFI and LINCUA qualify? > > I certainly do want to see MUFI and LINCUA provided in this form, but > I put them in a different category along with CSUR. I basically have > three categories of PUA agreements: > > Fonts - PUA assignments specific to a font family, e.g. Constructium, Fairfax, Nishiki-teki, Quivira, Junicode, etc. You are probably aware that Junicode 1.000, released in September 2017, fully supports MUFI 4.0 (released in December 2015). I don't know whether Junicode now contains any PUA characters which are not in MUFI. > > Public - PUA agreements meant to be widely used, e.g. CSUR, UCSUR, > MUFI, LINCUA, etc. > > Vendors - PUA assignments meant to be used by a single vendor or > platform, e.g.
Adobe, Apple, etc. but also Linux, MirOS, etc. > > Thank you for those links by the way. I had tried to find charts for > MUFI in the past but had somehow been unsuccessful. Similar files, for a different purpose, have been created by Mikkel Eide Eriksen: https://github.com/mikkelee/mufi-latex An earlier version of MUFI was incorporated in the ENRICH Gaiji bank: http://v2.manuscriptorium.com/apps/gbank/ You can download the source but it doesn't seem useful. A version of MUFI is also available as a searchable character database created by the present single-person MUFI board, i.e. Tarrin Wills, as a part of the beta version of a new MUFI site: http://skaldic.abdn.ac.uk/m.php?p=mufi Some time ago I wrote on the mufi-fonts list: --8<---------------cut here---------------start------------->8--- On Sun, Dec 03 2017 at 6:55 +0100, jsbien at mimuw.edu.pl writes: [...] > I wanted the file quickly to get an overview of the recently released > corpus of 16th century Polish, and it seemed to me that the simplest > and fastest way is to convert the PDF recommendation in a semi-automatic > way. It was more cumbersome than I expected, but thanks to this approach > I've discovered a typo in the recommendation: letter I instead of digit > 1 in EAFI, the code for LATIN ENLARGED LETTER SMALL LIGATURE AE (p. 93 > in the code chart order version). > > For the planned extension of the program I need more info on MUFI > characters, preferably in the format of UnicodeData.txt. This time > however I intend to make haste slowly, so I have a question: > > Is it possible to make publicly available for download the database > underlying http://skaldic.abdn.ac.uk/db.php?if=mufi&table=mufi_char? --8<---------------cut here---------------end--------------->8--- Unfortunately I got no answer to the question. > > Of course there is no way to get software to use this information. > > What kind of software do you have in mind? > > Unicode-related utilities, text editors to start with.
You pretty much > hit the nail on the head with uniname and emacs as examples. :) Thanks! As for uniname by Bill Poser, I exchanged mails with him in 2011: --8<---------------cut here---------------start------------->8--- On Sun, Aug 28 2011 at 12:01 +0200, jsbien at mimuw.edu.pl writes: [...] > A student of mine wrote an alternative program according to my > specification. The program is GPLed and available with > > git clone http://students.mimuw.edu.pl/~findepi/unihistext unihistext Now https://bitbucket.org/jsbien/unihistext > > The source is ready for Debian packaging. > > I think the program is worth better distribution, but its author is no > longer interested in it. Would you be so kind as to consider including > either the program itself in your uniutils or extending your unidesc with > its features? > > Best regards > > Janusz On Sun, Aug 28 2011 at 16:03 -0700, billposer2 at gmail.com writes: > In principle, sure. I'll have a look at it. --8<---------------cut here---------------end--------------->8--- Unfortunately nothing happened, and I thought I should not press the point. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Fri Aug 24 08:12:15 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 24 Aug 2018 16:12:15 +0300 Subject: Private Use areas In-Reply-To: <86va80987c.fsf@mimuw.edu.pl> (jsbien@mimuw.edu.pl) References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> <20180823171105.058ac317@JRWUBU2> <86lg8x9bqb.fsf@mimuw.edu.pl> <83r2ioao5g.fsf@gnu.org> <86va80987c.fsf@mimuw.edu.pl> Message-ID: <83d0u7aoy8.fsf@gnu.org> > From: jsbien at mimuw.edu.pl (Janusz S. Bień)
> Cc: unicode at unicode.org, richard.wordingham at ntlworld.com > Date: Thu, 23 Aug 2018 21:47:03 +0200 > > I'm very glad you join the discussion. I'm sorry for not joining sooner. In my defense, I missed the reference to Emacs, and the rest of the discussion is not really interesting for me, as using PUA for new characters is not something I have interest in or experience with. > My needs are very simple, for example C-x 8 Return LATIN CAPITAL LETTER > A WITH MACRON AND BREVE [MUFI] should yield the character with the code > E010. I can provide the list of names and codes. So you'd like to extend "C-x 8 RET" to recognize names of additional characters and associate them with codepoints in the PUA area? That shouldn't be hard to add. But is that all? won't you also want to tell Emacs about the properties of those characters? or be able to set up fonts for displaying them? IOW, would it be okay to have these characters be "second-class citizens" in Emacs? > > It is true that the Unicode related data is produced at build time, > > but only some of that is actually recorded in the Emacs binary, the > > rest is loaded upon demand. But all the data is stored in data > > structures that are mutable, given some Lisp programming. > > I never was fluent in Lisp programming and by now I forgot almost > everything I knew, so it's not a task for me. I was thinking about > submitting a feature request, but I forgot also the proper procedures to > do it. The proper procedure is to type "M-x report-emacs-bug RET" and then describe the feature(s) you'd like to see added/improved. > Moreover I had the impression that I'm the only person who needs > it... That shouldn't stop you. Many a feature in Emacs started as a request from a single individual. > > (It is not clear to me which part of the Unicode data you would like > > to change; are you talking about adding characters to the list of > > those defined by Unicode? 
If you are using the PUA codepoints, it's > > possible that you will need to update Emacs's notion of PUA as well.) > > Yes, I would like the PUA codepoints to be handled analogically as the > proper ones. What do you mean by Emacs's notion of PUA? Emacs knows about the PUA regions of the Unicode code-space, and treats those codepoints specially. The features you request will probably need to affect the PUA region as well, because the codepoints you use should no longer be treated as PUA. From unicode at unicode.org Fri Aug 24 09:05:34 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 24 Aug 2018 17:05:34 +0300 Subject: Emacs Verbose Character Entry (was Private Use Areas) In-Reply-To: <20180823221510.54c6c43f@JRWUBU2> (message from Richard Wordingham via Unicode on Thu, 23 Aug 2018 22:15:10 +0100) References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> <20180823171105.058ac317@JRWUBU2> <86lg8x9bqb.fsf@mimuw.edu.pl> <83r2ioao5g.fsf@gnu.org> <86va80987c.fsf@mimuw.edu.pl> <20180823221510.54c6c43f@JRWUBU2> Message-ID: <83a7pbamhd.fsf@gnu.org> > Date: Thu, 23 Aug 2018 22:15:10 +0100 > From: Richard Wordingham via Unicode > > On Thu, 23 Aug 2018 21:47:03 +0200 > "Janusz S. Bień via Unicode" wrote: > > > My needs are very simple, for example C-x 8 Return LATIN CAPITAL > > LETTER A WITH MACRON AND BREVE [MUFI] should yield the character with > > the code E010. I can provide the list of names and codes. > > While it should obviously yield, if anything, or > for 'LATIN CAPITAL LETTER A WITH MACRON AND > BREVE', it would probably be more important to recognise formal > aliases, such as 'LAO LETTER LO' for the input of the Lao letter lo > ling (U+0EA5 LAO LETTER LO LOOT), not to be confused with the Lao > letter lo lot (a.k.a. ro rot), U+0EA3 LAO LETTER LO LING.
> > For , I prefer to type "A\_M_X", but then I learnt > XSAMPA. The Emacs command "C-x 8 RET" expects the name of a single codepoint. It should be possible to extend it (or perhaps provide a separate command) to produce named sequences of codepoints, such as those in the above examples, but there's no such feature as of now. If this would be a useful addition, please suggest that on the Emacs issue tracker (using "M-x report-emacs-bug"), and please include with your request the sources where we could find such named sequences to support. Thanks. From unicode at unicode.org Fri Aug 24 10:09:02 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Fri, 24 Aug 2018 16:09:02 +0100 (BST) Subject: Private Use areas In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> Message-ID: <17627212.30661.1535123342455.JavaMail.defaultUser@defaultHost> Hi An approach that you might like to consider in relation to fonts is that it is possible to have in a font a Description field that consists of plain text. It is stored twice in the font, in two different ways, one of which is just plain text, possibly just ASCII. So if you had text such as $$$PUAB and so on in that Description field then a software application could search for all occurrences of $$$ and gather information for each set of data in that way, without needing separate OpenType tables. As an example of how information can be stored in the Description field here is a link to a font that I made years ago. If you download the font and open it in WordPad, the text can be read. The direct link is as follows. www.users.globalnet.co.uk/~ngo/SPANGBLU.TTF The font is also linked from the following web page, about a quarter of the way down the page.
http://www.users.globalnet.co.uk/~ngo/fonts.htm The web pages encoded in the font are for three of the songs linked from the following page. http://www.users.globalnet.co.uk/~ngo/song0001.htm Best regards, William Overington Friday 24 August 2018 ----Original message---- >From : unicode at unicode.org Date : 2018/08/21 - 19:23 (GMTDT) To : unicode at unicode.org Subject : Re: Private Use areas On Tue, Aug 21, 2018 at 10:21 AM, Janusz S. Bie? via Unicode wrote: I think PUA users should provide the properties of the characters used in a form analogical to the Unicode itself, and the software should be able to use this additional information. I already provide this myself for my uses of the PUA as well as the CSUR and any vendor-specific agreements I can find: http://www.kreativekorp.com/charset/PUADATA/ Of course there is no way to get software to use this information. I have entertained the idea of being able to embed this information into the font itself as OpenType tables, e.g.: PUAB -> Blocks.txt PUAC -> CaseFolding.txt PUAW -> EastAsianWidth.txt PUAL -> LineBreak.txt PUAD -> UnicodeData.txt I've actually invented table names for the majority of UCD files, but those are probably the most relevant. The table names for the more obscure files get rather... creative, e.g.: PUA[ -> BidiBrackets.txt PUA] -> BidiMirroring.txt That alone may get some people to think twice about this idea. :P -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Aug 24 11:40:07 2018 From: unicode at unicode.org (Janusz S. 
=?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 24 Aug 2018 18:40:07 +0200 Subject: Private Use areas In-Reply-To: <83d0u7aoy8.fsf@gnu.org> (Eli Zaretskii's message of "Fri, 24 Aug 2018 16:12:15 +0300") References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> <20180823171105.058ac317@JRWUBU2> <86lg8x9bqb.fsf@mimuw.edu.pl> <83r2ioao5g.fsf@gnu.org> <86va80987c.fsf@mimuw.edu.pl> <83d0u7aoy8.fsf@gnu.org> Message-ID: <86va7z90rc.fsf@mimuw.edu.pl> On Fri, Aug 24 2018 at 16:12 +0300, eliz at gnu.org writes: >> From: jsbien at mimuw.edu.pl (Janusz S. Bie?) >> Cc: unicode at unicode.org, richard.wordingham at ntlworld.com >> Date: Thu, 23 Aug 2018 21:47:03 +0200 >> >> I'm very glad you join the discussion. > > I'm sorry for not joining sooner. In my defense, I missed the > reference to Emacs, and the rest of the discussion is not really > interesting for me, as using PUA for new characters is not something I > have interest in or experience with. I don't think you missed anything important. > >> My needs are very simple, for example C-x 8 Return LATIN CAPITAL LETTER >> A WITH MACRON AND BREVE [MUFI] should yield the character with the code >> E010. I can provide the list of names and codes. > > So you'd like to extend "C-x 8 RET" to recognize names of additional > characters and associate them with codepoints in the PUA area? That > shouldn't be hard to add. I would prefer extensibility over efficiency, I don't mind loading PUA information from a source declared somehow in .emacs.d., so I can change/expand the list of characters from time to time. > But is that all? won't you also want to tell Emacs about the > properties of those characters? 
Personally I would like additionally to be able to change the case of a letter or string, and I am willing to prepare the necessary information for MUFI characters. Displaying other properties would be nice, but for me this is not crucial. Moreover, somebody has to prepare the data... > or be able to set up fonts for displaying them? It would be nice. I haven't asked for it because I typeset my texts with XeTeX or LuaTeX and the input is more important for me than rendering. > IOW, would it be okay to have these > characters be "second-class citizens" in Emacs? For me it would be acceptable. BTW, I just had a perhaps crazy idea: what about treating a PUA declaration (as you probably noticed, there may be conflicting ones) as a separate coding system? Of course some mechanism for escaping the standard PUA interpretation would be needed. > >> > It is true that the Unicode related data is produced at build time, >> > but only some of that is actually recorded in the Emacs binary, the >> > rest is loaded upon demand. But all the data is stored in data >> > structures that are mutable, given some Lisp programming. >> >> I never was fluent in Lisp programming and by now I forgot almost >> everything I knew, so it's not a task for me. I was thinking about >> submitting a feature request, but I forgot also the proper procedures to >> do it. > > The proper procedure is to type "M-x report-emacs-bug RET" and then > describe the feature(s) you'd like to see added/improved. I will definitely remember now :-) > >> Moreover I had the impression that I'm the only person who needs >> it... > > That shouldn't stop you. Many a feature in Emacs started as a request > from a single individual. > >> > (It is not clear to me which part of the Unicode data you would like >> > to change; are you talking about adding characters to the list of >> > those defined by Unicode? If you are using the PUA codepoints, it's
>> >> Yes, I would like the PUA codepoints to be handled analogically as the >> proper ones. What do you mean by Emacs's notion of PUA? > > Emacs knows about the PUA regions of the Unicode code-space, and > treats those codepoints specially. The features you request will > probably need to affect the PUA region as well, because the codepoints > you use should no longer be treated as PUA. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Fri Aug 24 12:10:02 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 24 Aug 2018 19:10:02 +0200 Subject: Emacs Verbose Character Entry (was Private Use Areas) In-Reply-To: <83a7pbamhd.fsf@gnu.org> (Eli Zaretskii via Unicode's message of "Fri, 24 Aug 2018 17:05:34 +0300") References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> <20180823171105.058ac317@JRWUBU2> <86lg8x9bqb.fsf@mimuw.edu.pl> <83r2ioao5g.fsf@gnu.org> <86va80987c.fsf@mimuw.edu.pl> <20180823221510.54c6c43f@JRWUBU2> <83a7pbamhd.fsf@gnu.org> Message-ID: <86in3z8zdh.fsf@mimuw.edu.pl> On Thu, Aug 23 2018 at 22:15 +0100, unicode at unicode.org writes: > On Thu, 23 Aug 2018 21:47:03 +0200 > "Janusz S. Bie? via Unicode" wrote: > >> My needs are very simple, for example C-x 8 Return LATIN CAPITAL >> LETTER A WITH MACRON AND BREVE [MUFI] should yield the character with >> the code E010. I can provide the list of names and codes. > > While it should obviously yield, if anything, or > for 'LATIN CAPITAL LETTER A WITH MACRON AND > BREVE', In my opinion there is no question what 'LATIN CAPITAL LETTER A WITH MACRON AND BREVE' should yield, because the name should be absent on the name list. 
My example concerns names like 'LATIN CAPITAL LETTER A WITH MACRON AND BREVE [MUFI]' 'COMBINING ABBREVIATION MARK SUPERSCRIPT UR ROUND R FORM [MUFI]' etc. [...] > The Emacs command "C-x 8 RET" expects the name of a single codepoint. It's OK and in my opinion it should stay this way. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Fri Aug 24 14:09:37 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Fri, 24 Aug 2018 20:09:37 +0100 (BST) Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: <17939746.41561.1535136460751.JavaMail.root@webmail17.bt.ext.cpcloud.co.uk> References: <17939746.41561.1535136460751.JavaMail.root@webmail17.bt.ext.cpcloud.co.uk> Message-ID: <5778280.42338.1535137777588.JavaMail.defaultUser@defaultHost> Julian Bradfield wrote: > Not that I want to hear any more about William's unmentionables; I just wish emoji were equally unmentionable. Well, as you mention them perhaps the moderator will allow the following, particularly as it relates to Japanese and Japanese has been mentioned elsewhere in this thread. In Chapter 34 of my novel there is a poem and it is at one time described as being performed in Japanese. http://www.users.globalnet.co.uk/~ngo/localizable_sentences_the_novel_chapter_034.pdf I know almost nothing about Japanese, yet as Japanese script is so very different from Latin script I feel that it provides a good test to include in my research. I am trying to learn more about Japanese so replies to this post are welcome please. I wondered about round-tripping the poem from English to Japanese and back to English. So I tried two experiments, designed so that the round-tripping was specifically not using the same translation method in each of the two directions. Experiment one. 
English to Japanese in Bing Translate and then copy and paste so as to translate from Japanese to English using Google Translate. Experiment two. English to Japanese in Google Translate and then copy and paste so as to translate from Japanese to English using Bing Translate. These worked well. Experiment two had the additional benefit of a lady reading out the poem. I am wondering if Chapter 34 could be the basis for a short play as part of the evening entertainment at the Internationalization & Unicode® Conference (IUC) 42, with the parts played by various delegates to the conference. That could be great and maybe a video could be made of the performance and the video published. The performance of the poem in Japanese could be spectacular. Clearly, expert translation would be needed so as to have a good show. William Overington Friday 24 August 2018 From unicode at unicode.org Sun Aug 26 18:10:23 2018 From: unicode at unicode.org (WORDINGHAM RICHARD via Unicode) Date: Mon, 27 Aug 2018 00:10:23 +0100 (BST) Subject: Private Use areas In-Reply-To: <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> Message-ID: <792311568.784627.1535325024023@mail2.virginmedia.com> > On 21 August 2018 at 01:04 "Mark E. Shoulson via Unicode" wrote: > > It is kind of a bummer, though, that you can't experiment (easily? or at all?) in the PUA with scripts that have complex behavior, or even not-so-complex behavior like accents & combining marks, or RTL direction (here, also, am I speaking true? Is there a block of RTL PUA also? I guess there's always RLO, but meh.) Still, maybe it doesn't really matter much: your special-purpose font can treat any codepoint any way it likes, right?
> > ~mark > > Back in 2006, I was typing the Tai Tham script (then being proposed as the Lanna script) using the PUA and exploring the issue of selecting between what are now and based on the preceding character and between what are now and based on the preceding base character and its subscripts. I was also looking at using variation selectors to override the rules. I was using SIL Graphite fonts when they were getting intermittent support in OpenOffice and Firefox - my main display engine was WorldPad. Nowadays, SIL Graphite seems to be securely supported in LibreOffice and Firefox. Now, back then, Graphite was at least attempting to support RTL; I would expect the RTL support to work well by now. On the other hand, experimenting with OpenType is much harder. The best I've found is transcoding to a Latin range and using an ssxx feature to convert the Latin glyphs back to those for the complex script. I do that to render Tai Tham in Internet Explorer 11 on Windows 7; this complex scheme is a fallback for when the rendering engine fails. Richard. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 27 09:22:15 2018 From: unicode at unicode.org (Peter Constable via Unicode) Date: Mon, 27 Aug 2018 14:22:15 +0000 Subject: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) In-Reply-To: <20180821110156.453c129a@JRWUBU2> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> Message-ID: Layout engines that support CJK vertical layout do not rely on the 'vert' feature to rotate glyphs for CJK ideographs, but rather rotate the glyph 90° and switch to using vertical glyph metrics.
The 'vert' feature is used to substitute vertical alternate glyphs as needed, such as for punctuation that isn't automatically rotated (and would probably need a differently-positioned alternate in any case). Cf. UAX 50.

Peter

-----Original Message-----
From: Unicode On Behalf Of Richard Wordingham via Unicode
Sent: Tuesday, August 21, 2018 3:02 AM
To: unicode at unicode.org
Subject: Re: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))

On Tue, 21 Aug 2018 08:53:18 +0800 via Unicode wrote:

> On 2018-08-21 08:04, Mark E. Shoulson via Unicode wrote:
> > Still, maybe it doesn't really matter much: your special-purpose font can treat any codepoint any way it likes, right?
> Not all properties come from the font. For example a Zhuang character PUA font, which supplements CJK ideographs, does not rotate characters 90 degrees when changing from RTL to vertical display of text.

Isn't that supposed to be handled by an OpenType feature such as 'vert'? Or does the rendering stack get in the way? However, one might need reflowing text to be about 40% WJ.

Richard.

From unicode at unicode.org Mon Aug 27 03:59:43 2018
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Mon, 27 Aug 2018 09:59:43 +0100 (BST)
Subject: Private Use areas
In-Reply-To: <20755305.5008.1535359804198.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk>
References: <20755305.5008.1535359804198.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk>
Message-ID: <7994272.5428.1535360383556.JavaMail.defaultUser@defaultHost>

Hi

How about the following method.

In a text file that contains text that uses Private Use Area characters, start the file with a sequence of Enclosed Alphanumeric characters from regular Unicode, that sequence containing the metadata relating to those Private Use Area characters as used in their present context.

http://www.unicode.org/charts/PDF/U2460.pdf

Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters U+24B6 ..
U+24E9. Use U+2473 as if it were a circled space. The use of 20 to mean a space often occurs in web addresses. I know that there it is hexadecimal and here it is decimal but it has the same look of being an encoded space and so that is why I am suggesting using it. Start the sequence with PUAINFO encoded using seven circled Latin letters and any character other than a carriage return or a line feed shows that the sequence has ended. The use of PUAINFO encoded using seven circled Latin letters at the start of the sequence is so that text using enclosed alphanumeric characters for another purpose would not become disrupted. Then a suitable software application can read the text file and then, either automatically or after the clicking of a button, extract metadata information from the sequence of enclosed alphanumeric characters and not display the sequence of enclosed alphanumeric characters. Maybe other circled numbers in the range 10 through to 19 would have special meanings. This method would keep everything within plane zero. William Overington Monday 27 August 2018 ----Original message---- >From : unicode at unicode.org Date : 2018/08/21 - 23:23 (GMTDT) To : doug at ewellic.org Cc : unicode at unicode.org Subject : Re: Private Use areas On Tue, Aug 21, 2018 at 3:02 PM Doug Ewell via Unicode wrote: Ken Whistler wrote: > The way forward for folks who want to do this kind thing is: > > 1. Define a *protocol* for reliable interchange of custom character > property information about PUA code points. I've often thought that would be a great idea. You can't get to steps 2 and 3 without step 1. I'd gladly participate in such a project. As would I. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Mon Aug 27 15:20:31 2018 From: unicode at unicode.org (Peter Constable via Unicode) Date: Mon, 27 Aug 2018 20:20:31 +0000 Subject: Private Use areas In-Reply-To: References: <20755305.5008.1535359804198.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk> <7994272.5428.1535360383556.JavaMail.defaultUser@defaultHost> Message-ID: This was meant to go to the list. From: Peter Constable Sent: Monday, August 27, 2018 12:33 PM To: wjgo_10009 at btinternet.com; jameskasskrv at gmail.com; richard.wordingham at ntlworld.com; mark at kli.org; beckiergb at gmail.com; verdy_p at wanadoo.fr Subject: RE: Private Use areas That sounds like a non-conformant use of characters in the U+24xx block. Peter From: Unicode > On Behalf Of William_J_G Overington via Unicode Sent: Monday, August 27, 2018 2:00 AM To: jameskasskrv at gmail.com; richard.wordingham at ntlworld.com; mark at kli.org; beckiergb at gmail.com; verdy_p at wanadoo.fr Cc: unicode at unicode.org Subject: Re: Private Use areas Hi How about the following method. In a text file that contains text that uses Private Use Area characters, start the file with a sequence of Enclosed Alphanumeric characters from regular Unicode, that sequence containing the metadata relating to those Private Use Area characters as used in their present context. http://www.unicode.org/charts/PDF/U2460.pdf Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters U+24B6 .. U+24E9. Use U+2473 as if it were a circled space. The use of 20 to mean a space often occurs in web addresses. I know that there it is hexadecimal and here it is decimal but it has the same look of being an encoded space and so that is why I am suggesting using it. Start the sequence with PUAINFO encoded using seven circled Latin letters and any character other than a carriage return or a line feed shows that the sequence has ended. 
The use of PUAINFO encoded using seven circled Latin letters at the start of the sequence is so that text using enclosed alphanumeric characters for another purpose would not become disrupted. Then a suitable software application can read the text file and then, either automatically or after the clicking of a button, extract metadata information from the sequence of enclosed alphanumeric characters and not display the sequence of enclosed alphanumeric characters. Maybe other circled numbers in the range 10 through to 19 would have special meanings. This method would keep everything within plane zero. William Overington Monday 27 August 2018 ----Original message---- From : unicode at unicode.org Date : 2018/08/21 - 23:23 (GMTDT) To : doug at ewellic.org Cc : unicode at unicode.org Subject : Re: Private Use areas On Tue, Aug 21, 2018 at 3:02 PM Doug Ewell via Unicode > wrote: Ken Whistler wrote: > The way forward for folks who want to do this kind thing is: > > 1. Define a *protocol* for reliable interchange of custom character > property information about PUA code points. I've often thought that would be a great idea. You can't get to steps 2 and 3 without step 1. I'd gladly participate in such a project. As would I. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 27 15:31:08 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 27 Aug 2018 12:31:08 -0800 Subject: Private Use areas In-Reply-To: References: <20755305.5008.1535359804198.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk> <7994272.5428.1535360383556.JavaMail.defaultUser@defaultHost> Message-ID: Peter Constable wrote, > That sounds like a non-conformant use of characters in the U+24xx block. Non-conformant? Well, it's probably overkill anyway. 
A simpler method of identifying which PUA convention is being used for a file would be to either have the first line of the file being something like [PUA00001] or to have the file name be something like MYFILE.TXTPUA00001. Where "PUA00001" equals the CSUR. Other numbers (PUA00002, PUA00003, etc.) for other PUA conventions. If a user has thousands of files using PUA characters, and all the files are using the same PUA convention, why would each file need to contain metadata for each PUA character used within? (Rhetorical) The "prior agreement" part about PUA usage means the user would know in advance how to display the text properly. From unicode at unicode.org Mon Aug 27 15:44:39 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Mon, 27 Aug 2018 21:44:39 +0100 (BST) Subject: Private Use areas In-Reply-To: <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> References: <2869165.40702.1535399404894.JavaMail.root@webmail27.bt.ext.cpcloud.co.uk> <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> Message-ID: <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost> Here is the reply that I sent to Peter Constable and to the other people to whom he wrote. Unlike for Mr Constable and for many other people, all of my posts have to be passed by the moderator, and I know why that is the situation. Though that situation was not imposed by a named official of Unicode Inc. acting in a stated official capacity. So my opportunities to defend my ideas are conditional. William Overington Monday 27 August 2018 ----Original message---- >From : wjgo_10009 at btinternet.com Date : 2018/08/27 - 21:18 (GMTDT) To : beckiergb at gmail.com, verdy_p at wanadoo.fr, petercon at microsoft.com, wjgo_10009 at btinternet.com, mark at kli.org, kenwhistler at att.net, richard.wordingham at ntlworld.com, jameskasskrv at gmail.com Subject : Re: Private Use areas Well, it is a pity that you did not send your reply to the Unicode mailing list. 
> That sounds like a non-conformant use of characters in the U+24xx block.

Well, you are an expert on these things and I do not understand with what it would be non-conformant.

It seems to me that for many years some people have wanted a way to convey information about the meaning of Private Use Area characters used in a document in an unobtrusive way within the document. The format that I am suggesting could be the basis of a way to do that. I really do not understand the problem.

Ken Whistler wrote:

>>> > 1. Define a *protocol* for reliable interchange of custom character property information about PUA code points.

Some people use XML for things where two characters are used in a different manner.

A quick downbeat quip comment about my ideas with no explanation is not helpful and might, because of your standing, cause some people not to consider the idea even-handedly for fear of offending you.

I am reminded of a British film from 1955 called The Colditz Story. It used to be one of the regular films on the television years ago. I do not know whether it was ever shown in America, maybe, or maybe it is just a British thing.

https://www.youtube.com/results?search_query=The+Colditz+story

https://en.wikipedia.org/wiki/The_Colditz_Story

The reason why I am reminded of that film is that one of the British prisoners devises a plan for a group of British prisoners to escape from Colditz disguised as German officers and just walk out of the gate. This is ridiculed as impossible because it has been tried before at various prisoner of war camps and the people have always been detected as British prisoners. The man suggesting the scheme then points out that the detection happens because there is clearly something questionable about the direction from which the disguised prisoners arrive, such as from a prisoners' hut; that is the problem, not the quality of the disguises or the basic soundness of the idea.
The man then suggests that they walk out of the German Officers' mess building. Please bear in mind that walking out of the door of the mess building does not mean actually being in the mess, it is a matter of going down the flight of stairs from a storage area, (the stairs having been accessed from under the stage of the castle theatre) walking past the entrance to the dining room and then out of the door, supposedly on their way back, after dinner, to their billets in the village. This done while a concert put on by some others of the prisoners, and attended by the senior German officers, is going on in the castle theatre. So, it is the bit about an idea coming from the wrong direction that reminds me of the film. https://www.youtube.com/watch?v=0eeSYvxVFUw https://www.youtube.com/watch?v=iY8jMkIbwDM https://www.youtube.com/watch?v=QxHsElyFsTI William Overington Monday 27 August 2018 ----Original message---- >From : petercon at microsoft.com Date : 2018/08/27 - 20:33 (GMTDT) To : wjgo_10009 at btinternet.com, jameskasskrv at gmail.com, richard.wordingham at ntlworld.com, mark at kli.org, beckiergb at gmail.com, verdy_p at wanadoo.fr Subject : RE: Private Use areas That sounds like a non-conformant use of characters in the U+24xx block. Peter From: Unicode On Behalf Of William_J_G Overington via Unicode Sent: Monday, August 27, 2018 2:00 AM To: jameskasskrv at gmail.com; richard.wordingham at ntlworld.com; mark at kli.org; beckiergb at gmail.com; verdy_p at wanadoo.fr Cc: unicode at unicode.org Subject: Re: Private Use areas Hi How about the following method. In a text file that contains text that uses Private Use Area characters, start the file with a sequence of Enclosed Alphanumeric characters from regular Unicode, that sequence containing the metadata relating to those Private Use Area characters as used in their present context. http://www.unicode.org/charts/PDF/U2460.pdf Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters U+24B6 .. U+24E9. 
Use U+2473 as if it were a circled space. The use of 20 to mean a space often occurs in web addresses. I know that there it is hexadecimal and here it is decimal but it has the same look of being an encoded space and so that is why I am suggesting using it. Start the sequence with PUAINFO encoded using seven circled Latin letters and any character other than a carriage return or a line feed shows that the sequence has ended. The use of PUAINFO encoded using seven circled Latin letters at the start of the sequence is so that text using enclosed alphanumeric characters for another purpose would not become disrupted. Then a suitable software application can read the text file and then, either automatically or after the clicking of a button, extract metadata information from the sequence of enclosed alphanumeric characters and not display the sequence of enclosed alphanumeric characters. Maybe other circled numbers in the range 10 through to 19 would have special meanings. This method would keep everything within plane zero. William Overington Monday 27 August 2018 ----Original message---- >From : unicode at unicode.org Date : 2018/08/21 - 23:23 (GMTDT) To : doug at ewellic.org Cc : unicode at unicode.org Subject : Re: Private Use areas On Tue, Aug 21, 2018 at 3:02 PM Doug Ewell via Unicode wrote: Ken Whistler wrote: > The way forward for folks who want to do this kind thing is: > > 1. Define a *protocol* for reliable interchange of custom character > property information about PUA code points. I've often thought that would be a great idea. You can't get to steps 2 and 3 without step 1. I'd gladly participate in such a project. As would I. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Mon Aug 27 16:18:31 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 27 Aug 2018 13:18:31 -0800 Subject: Private Use areas In-Reply-To: <7994272.5428.1535360383556.JavaMail.defaultUser@defaultHost> References: <20755305.5008.1535359804198.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk> <7994272.5428.1535360383556.JavaMail.defaultUser@defaultHost> Message-ID: William Overington wrote, On Mon, Aug 27, 2018 at 12:59 AM, William_J_G Overington wrote: > Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters > U+24B6 .. U+24E9. > > Use U+2473 as if it were a circled space. ?????????????????????????????? ?????????????????????? From unicode at unicode.org Mon Aug 27 16:20:26 2018 From: unicode at unicode.org (Rebecca Bettencourt via Unicode) Date: Mon, 27 Aug 2018 14:20:26 -0700 Subject: Private Use areas In-Reply-To: <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost> References: <2869165.40702.1535399404894.JavaMail.root@webmail27.bt.ext.cpcloud.co.uk> <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost> Message-ID: > > > That sounds like a non-conformant use of characters in the U+24xx block. > > Well, you are an expert on these things and I do not understand as to with > what it would be non-conformant. > > A conformant process must interpret ??????? as the characters ??????? and not as a signal to process what follows as anything other than plain text. What you are proposing is a higher-level protocol, whether you realize it or not. Unfortunately your higher-level protocol has a serious flaw in that it cannot represent the string "???????". Also, seeing a bunch of circled alphanumeric characters in a document ???????????????????????. 
There are plenty of already-existing higher-level protocols (you mentioned one: XML) that could be used to provide information about PUA characters, and they are all much better suited to that purpose than what you are proposing. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 27 16:26:14 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Mon, 27 Aug 2018 22:26:14 +0100 (BST) Subject: Private Use areas In-Reply-To: References: <20755305.5008.1535359804198.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk> <7994272.5428.1535360383556.JavaMail.defaultUser@defaultHost> Message-ID: <20519295.43788.1535405174739.JavaMail.defaultUser@defaultHost> James Kass wrote: > If a user has thousands of files using PUA characters, and all the files are using the same PUA convention, why would each file need to contain metadata for each PUA character used within? (Rhetorical) Because each such file would then be self-contained and free-standing. Such metadata need not necessarily be a huge quantity of data. William Overington Monday 27 August 2018 From unicode at unicode.org Mon Aug 27 19:09:17 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Mon, 27 Aug 2018 20:09:17 -0400 Subject: Private Use areas In-Reply-To: References: <20755305.5008.1535359804198.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk> <7994272.5428.1535360383556.JavaMail.defaultUser@defaultHost> Message-ID: <6a83a5f9-5127-cfe4-9ca2-dc4f25d9b1dd@kli.org> On 08/27/2018 05:18 PM, James Kass via Unicode wrote: > William Overington wrote, > > > > On Mon, Aug 27, 2018 at 12:59 AM, William_J_G Overington > wrote: > >> Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters >> U+24B6 .. U+24E9. >> >> Use U+2473 as if it were a circled space. > ?????????????????????????????? > ?????????????????????? And what's wrong with the ASCII digits? 
~mark

From unicode at unicode.org Mon Aug 27 19:44:57 2018
From: unicode at unicode.org (Mark E. Shoulson via Unicode)
Date: Mon, 27 Aug 2018 20:44:57 -0400
Subject: Private Use areas
In-Reply-To: References: <2869165.40702.1535399404894.JavaMail.root@webmail27.bt.ext.cpcloud.co.uk> <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost>
Message-ID: 

But there's nothing wrong with proposing a higher-level protocol; indeed, that's what Ken Whistler was saying: you need a protocol to transmit this information. It's metadata, so it will perforce be a higher-level protocol of some kind, whether transmitting actually out-of-band or reserving a piece of the file for metadata. That's fine. I'm not sure what the advantage is of using circled characters instead of plain old ascii. You have to set off your reserved area somehow, and I don't think using circled chars is the least obtrusive way to do it. You could use XML; that would be pretty well-suited to the task, but maybe it's overkill. If all you need is to reference some "standard" PUA interpretation (per James Kass' take on this, not William Overington's), then just a header like "[PUA00001]" would work just fine. (Compare emacs with things like "-*- encoding: utf-8 -*-" or whatever.)

For larger chunks of meta-info, XML might be a good choice, but even then, it could be an XML *header* to an otherwise ordinary text file. Yes, you'd have to delimit it somehow, and probably have a top header (a "magic number") to signal the protocol, but that's doable. For applications not supporting this protocol, such a setup is probably easier for the eye to skip past (even if it's long) than a bunch of circled letters.

A protocol like that is outside of Unicode's scope (just like XML is), but it's certainly something you could write up and try to standardize and get used, with or without the support of ISO.
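[The first-line "magic number" idea discussed in this thread can be sketched in a few lines. The "[PUAnnnnn]" tag format and the function name below are hypothetical illustrations, not anything defined by the thread or by Unicode: a reader peels the convention identifier off the first line and treats everything else as ordinary plain text.]

```python
import re

# Hypothetical first-line tag: "[PUA" followed by five digits and "]",
# optionally terminated by the first line break.
PUA_TAG = re.compile(r"^\[PUA(\d{5})\]\r?\n?")

def split_pua_convention(text):
    """Return (convention_number, remaining_text).

    convention_number is None when the file carries no tag, in which
    case the text is returned untouched. An application that does not
    know the protocol simply sees one odd-looking first line.
    """
    match = PUA_TAG.match(text)
    if match is None:
        return None, text
    return int(match.group(1)), text[match.end():]
```

[A file beginning "[PUA00001]" would thus map to registry entry 1 (e.g. the CSUR, in James Kass's numbering above), while untagged files pass through unchanged.]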
People are coming up with file formats all the time (and if you really want to use circled characters, go ahead. That's something for you to consider in the design phase of the project).

~mark

On 08/27/2018 05:20 PM, Rebecca Bettencourt via Unicode wrote:
> > That sounds like a non-conformant use of characters in the U+24xx block.
>
> Well, you are an expert on these things and I do not understand with what it would be non-conformant.
>
> A conformant process must interpret ??????? as the characters ??????? and not as a signal to process what follows as anything other than plain text.
>
> What you are proposing is a higher-level protocol, whether you realize it or not. Unfortunately your higher-level protocol has a serious flaw in that it cannot represent the string "???????". Also, seeing a bunch of circled alphanumeric characters in a document ???????????????????????.
>
> There are plenty of already-existing higher-level protocols (you mentioned one: XML) that could be used to provide information about PUA characters, and they are all much better suited to that purpose than what you are proposing.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Tue Aug 28 05:27:28 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Tue, 28 Aug 2018 03:27:28 -0700
Subject: Private Use areas
In-Reply-To: References: <2869165.40702.1535399404894.JavaMail.root@webmail27.bt.ext.cpcloud.co.uk> <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost>
Message-ID: 

An HTML attachment was scrubbed...
URL: From unicode at unicode.org Tue Aug 28 05:44:58 2018 From: unicode at unicode.org (Cosmin Apreutesei via Unicode) Date: Tue, 28 Aug 2018 13:44:58 +0300 Subject: Line wrapping of mixed LTR/RTL text Message-ID: Hello everyone, I'm having a bit of trouble implementing line wrapping with bidi and I would like to ask for some advice or hints on what is the proper way to do this. UAX#9 section 3.4 says that bidi reordering should be done after line wrapping. But in order to do line wrapping correctly I need to be able to visually ignore some whitespace, and I'm not sure exactly which whitespace must be ignored. There is this sentence in UAX#9 which provides a clue: "[...] trailing whitespace will appear at the visual end of the line (in the paragraph direction).". I'm not sure what that means, but by doing some tests with fribidi and libunibreak I noticed that the whitespace always sticks to the logical end of the word (so visually to the right for LTR runs and to the left for RTL runs), regardless of the base paragraph direction. Is it safe to use this assumption and always remove the whitespace at the logical end of the last word of the line? Or is it more complicated than that? Quick example showing the problem. The following text: ??????? ABC DEF with RTL base direction would wrap (for a certain line width) as: ABC ??????? DEF with two spaces between the Latin and Arabic text, one from the Latin text and one from the Arabic text. Since the line logically ends with the "C" and LTR direction, I should have to probably remove the space after the "C" (and, as a rule, just remove the whitespace at the logical end of the word, regardless of paragraph's direction or word's direction). Is this the right way to do it? Screenshots attached. Thanks! -------------- next part -------------- A non-text attachment was scrubbed... Name: 1.png Type: image/png Size: 12005 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 2.png Type: image/png Size: 14359 bytes Desc: not available URL: From unicode at unicode.org Tue Aug 28 03:26:12 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Tue, 28 Aug 2018 09:26:12 +0100 (BST) Subject: Private Use areas In-Reply-To: <4826651.5138.1535444498189.JavaMail.root@webmail11.bt.ext.cpcloud.co.uk> References: <4826651.5138.1535444498189.JavaMail.root@webmail11.bt.ext.cpcloud.co.uk> Message-ID: <19054743.5414.1535444772290.JavaMail.defaultUser@defaultHost> Hi Mark E. Shoulson wrote: > I'm not sure what the advantage is of using circled characters instead of plain old ascii. My thinking is that "plain old ascii" might be used in the text encoded in the file. Sometimes a file containing Private Use Area characters is a mix of regular Unicode Latin characters with just a few Private Use Area characters mixed in with them. So my suggestion of using circled characters is for disambiguation purposes. The circled characters in the PUAINFO sequence would not be displayed if a special software program were being used to read in the text file, then act upon the information that is encoded using the circled characters. My thinking is that using this method just adds some encoded information at the start of the text file and does not require the whole document to become designated as a file conformant to a particular markup format. William Overington Tuesday 28 August 2018 From unicode at unicode.org Tue Aug 28 10:24:28 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Tue, 28 Aug 2018 16:24:28 +0100 (BST) Subject: Private Use areas In-Reply-To: References: <20755305.5008.1535359804198.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk> <7994272.5428.1535360383556.JavaMail.defaultUser@defaultHost> Message-ID: <31723478.26849.1535469868137.JavaMail.defaultUser@defaultHost> James Kass wrote: > Non-conformant? Well, it's probably overkill anyway. 
A simpler method of identifying which PUA convention is being used for a file would be to either have the first line of the file being something like [PUA00001] or to have the file name be something like MYFILE.TXTPUA00001. Where "PUA00001" equals the CSUR. Other numbers (PUA00002, PUA00003, etc.) for other PUA conventions. The problem that then arises is that a registry is needed for what those numbers mean, such as PUA01728. So what if someone writes explaining his designs for glyphs for the language of the people who live in the northern part of the fifth planet from the sun in the science fiction novel he is writing? Is registration granted instantly upon request or is there a threshold of some sort? What if lots of people do that, including some people wanting a registry code number for the various emoji that they want? If there is a threshold of proving usage and so on, or of showing that the designs have been produced AT a business or AT a college or whatever, then the system will only work for some users of the Private Use Areas. My opinion is that the system needs to be free-standing, with each usage possibly self-contained or with an external reference to a document that is available. Care would need to be taken to send a copy of any such document to deposit libraries such as The British Library so as to ensure long-term conservation. 
William Overington Tuesday 28 August 2018 From unicode at unicode.org Tue Aug 28 10:58:54 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Tue, 28 Aug 2018 16:58:54 +0100 (BST) Subject: Private Use areas In-Reply-To: References: <2869165.40702.1535399404894.JavaMail.root@webmail27.bt.ext.cpcloud.co.uk> <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost> Message-ID: <17197721.29445.1535471934478.JavaMail.defaultUser@defaultHost> Asmus Freytag wrote: > There are situations where an ad-hoc markup language seems to fulfill a need that is not well served by the existing full-fledged markup languages. You find them in internet "bulletin boards" or services like GitHub, where pure plain text is too restrictive but the required text styles purposefully limited - which makes the syntactic overhead of a full-featured mark-up language burdensome. I am thinking of such an ad-hoc special purpose markup language. I am thinking of something like a special purpose version of the FORTH computer language being used but with no user definitions, no comparison operations and no loops and no compiler. Just a straight run through as if someone were typing commands into FORTH in interactive mode at a keyboard. Maybe no need for spaces between commands. For example, circled R might mean use Right-to-left text display. I am thinking that there could be three stacks, one for code points and one for numbers and one for external reference strings such as for accessing a web page or a PDF (Portable Document Format) document or listing an International Standard Book Number and so on. Code points could be entered by circled H followed by circled hexadecimal characters followed by a circled character to indicate Push onto the code point stack. Numbers could be entered in base 10, followed by a circled character to mean Push onto the number stack. 
A later circled character could mean to take a certain number of code points (maybe just 1, or maybe 0) from the character stack and a certain number of numbers (maybe just 1, or maybe just 0) from the number stack and use them to set some property. It could all be very lightweight software-wise, just reading the characters of the sequence of circled characters and obeying them one by one just one time only on a single run through, with just a few, such as the circled digits, each having its meaning dependent upon a state variable such as, for a circled digit, whether data entry is currently hexadecimal or base 10. I am wondering how many PUA property variables there would need to be set for the system to be useful. The sequence could start with all of those PUA property values set at their default values so only those that needed changing need be explicitly set, though others could be explicitly set to the default values if a record were desired. William Overington Tuesday 28 August 2018 From unicode at unicode.org Tue Aug 28 11:28:25 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Tue, 28 Aug 2018 19:28:25 +0300 Subject: Line wrapping of mixed LTR/RTL text In-Reply-To: (message from Cosmin Apreutesei via Unicode on Tue, 28 Aug 2018 13:44:58 +0300) References: Message-ID: <834lfe4frq.fsf@gnu.org> > Date: Tue, 28 Aug 2018 13:44:58 +0300 > From: Cosmin Apreutesei via Unicode > > There is this sentence in UAX#9 which provides a clue: "[...] trailing > whitespace will appear at the visual end of the line (in the paragraph > direction).". I'm not sure what that means, but by doing some tests > with fribidi and libunibreak I noticed that the whitespace always > sticks to the logical end of the word (so visually to the right for > LTR runs and to the left for RTL runs), regardless of the base > paragraph direction. 
That is not so if the line ends after the whitespace: in that case the whitespace is trailing, and will appear at the visual end of the line. Only if you add some character after the whitespace will the whitespace "jump" to the other side of the word.

> Quick example showing the problem. The following text:
>
> ??????? ABC DEF
>
> with RTL base direction would wrap (for a certain line width) as:
>
> ABC ???????
> DEF
>
> with two spaces between the Latin and Arabic text, one from the Latin text and one from the Arabic text.

No, it should show the space after ABC to the left of ABC, i.e. immediately before the line end. What UAX#9 tells you is that you need to decide that the line will wrap after the space that follows "ABC", then reorder the line as if it ended after that space, which will produce this:

??????? ABC 

(with the trailing space to the left of "ABC"). Then you should display "DEF" on the next line.

IOW, the correct order is:

. find levels
. wrap in logical order
. reorder wrapped lines

From unicode at unicode.org Tue Aug 28 11:43:01 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Tue, 28 Aug 2018 09:43:01 -0700
Subject: Private Use areas
Message-ID: <20180828094301.665a7a7059d7ee80bb4d670165c8327d.32c1b975e2.wbe@email03.godaddy.com>

On August 23, 2011, Asmus Freytag wrote:

> On 8/23/2011 7:22 AM, Doug Ewell wrote:
>> Of all applications, a word processor or DTP application would want
>> to know more about the properties of characters than just whether
>> they are RTL. Line breaking, word breaking, and case mapping come to
>> mind.
>>
>> I would think the format used by standard UCD files, or the XML
>> equivalent, would be preferable to making one up:
>
> The right answer would follow the XML format of the UCD.
>
> That's the only format that allows all necessary information contained
> in one file, and it would leverage any effort that users of the
> main UCD have made in parsing the XML format.
> > An XML format should also be flexible in that you can add/remove not > just characters, but properties as needed. > > The worst thing to do, other than designing something from scratch, > would be to replicate the UnicodeData.txt layout with its random, but > fixed collection of properties and insanely many semi-colons. None of > the existing UCD txt files carries all the needed data in a single > file. I don't know if or how I responded 7 years ago, but at least today, I think this is an excellent suggestion. If the goal is to encourage vendors to support PUA assignments, using an exceedingly well-defined format (UAX #42) sitting atop one of the most widely used base formats ever (XML), with all property information in a single repository (per PUA scheme), would be great encouragement. I've devised lots of novel file formats, and I think this is one use case where doing that would be a real hindrance. Storing this information in a font, by hook or crook, would lock users of those PUA characters into that font. At that rate, you might as well use ASCII-hacked fonts, as we did 25 years ago. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue Aug 28 12:07:51 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 28 Aug 2018 19:07:51 +0200 Subject: Line wrapping of mixed LTR/RTL text In-Reply-To: References: Message-ID: The space encoded just before the logical end of line or linewrap (in the middle of the displayed line) has to be moved to the end of the physical line (in the paragraph direction); it should not be kept in the middle.
If you need to force a linewrap on a non-breaking space (because there's no other break opportunity to wrap the line elsewhere), then treat that non-breaking space as a regular breaking space which will also be moved to the end of the row (after the margin on the ending side of the paragraph), and choose the last non-breaking space on the row; usually, all spaces present at linewraps (including non-breaking spaces) are compacted. But there are other style policies that will force the linewrap preferably after a trailing punctuation or a separator punctuation, or before a leading punctuation, or just after the last unbreakable cluster that can fit the row (including in the middle of words at arbitrary positions if there's no hyphenation process or the script does not support hyphenation, such as sinograms and kanas). Where to insert linewraps is very fuzzy and depends on the rendering context and capabilities of the target device (you cannot scroll a piece of printed paper, but you can scroll a display with a scrollbar or using navigation cursors in a width-restricted input field). On Tue, 28 Aug 2018 at 16:34, Cosmin Apreutesei via Unicode < unicode at unicode.org> wrote: > Hello everyone, > > I'm having a bit of trouble implementing line wrapping with bidi and I > would like to ask for some advice or hints on what is the proper way > to do this. > > UAX#9 section 3.4 says that bidi reordering should be done after line > wrapping. But in order to do line wrapping correctly I need to be able > to visually ignore some whitespace, and I'm not sure exactly which > whitespace must be ignored. > > There is this sentence in UAX#9 which provides a clue: "[...] trailing > whitespace will appear at the visual end of the line (in the paragraph > direction).".
I'm not sure what that means, but by doing some tests > with fribidi and libunibreak I noticed that the whitespace always > sticks to the logical end of the word (so visually to the right for > LTR runs and to the left for RTL runs), regardless of the base > paragraph direction. Is it safe to use this assumption and always > remove the whitespace at the logical end of the last word of the line? > Or is it more complicated than that? > > Quick example showing the problem. The following text: > > ??????? ABC DEF > > with RTL base direction would wrap (for a certain line width) as: > > ABC ??????? > DEF > > with two spaces between the Latin and Arabic text, one from the Latin > text and one from the Arabic text. Since the line logically ends with > the "C" and LTR direction, I should probably remove the space > after the "C" (and, as a rule, just remove the whitespace at the > logical end of the word, regardless of paragraph's direction or word's > direction). Is this the right way to do it? > > Screenshots attached. > > Thanks! -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 28 12:13:49 2018 From: unicode at unicode.org (WORDINGHAM RICHARD via Unicode) Date: Tue, 28 Aug 2018 18:13:49 +0100 (BST) Subject: Private Use areas - Vertical Text In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> Message-ID: <1421005745.806742.1535476429135@mail2.virginmedia.com> > > On 27 August 2018 at 15:22 Peter Constable via Unicode wrote: > > Layout engines that support CJK vertical layout do not rely on the 'vert' feature to rotate glyphs for CJK ideographs, but rather rotate the glyph 90° and switch to using vertical glyph metrics.
The 'vert' feature is used to substitute vertical alternate glyphs as needed, such as for punctuation that isn't automatically rotated (and would probably need a differently-positioned alternate in any case). > > Cf. UAX 50. > There have been some pretty confused statements. I believe the observed problem is that PUA characters for Zhuang CJK ideographs get rotated when displayed vertically rather than left-to-right. Unicode is doing what it can in this matter: (a) Zhuang PUA characters are being made individually obsolete. (b) By default, PUA characters have the value of Vertical_orientation=upright as do CJK ideographs. For CJK ideographs, it is not clear to me when the vert feature (if present) would be applied. Is it only for some codepoints (vo=tu), or is it for all that the engine expects to be displayed 'upright' in vertical text? The vrtr feature (if present) would be applied when glyphs are to be rotated. Is it for all such glyphs, or only those for which rotation is expected to be inadequate (vo=tr)? It seems that feature vrt2 is to be applied to all glyphs; perhaps rotation is the default behaviour when there is no look-up value for a glyph that the engine expects to be rotated. The truly difficult case would be when there is no attempt to apply a look-up, possibly vrtr would not apply to \p{vo=R}. I would expect that defining the lookup vrt2 or vrtr to map Zhuang glyphs to themselves (or something prerotated) would cure the problem. This would not work for sequences of Zhuang ideographs treated as RTL text - but that is unlikely to happen. Richard. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 28 13:28:58 2018 From: unicode at unicode.org (Cosmin Apreutesei via Unicode) Date: Tue, 28 Aug 2018 21:28:58 +0300 Subject: Line wrapping of mixed LTR/RTL text In-Reply-To: <834lfe4frq.fsf@gnu.org> References: <834lfe4frq.fsf@gnu.org> Message-ID: Hi Eli, thanks for answering!
I think I'm getting closer. Just a few more clarifications if you please. > That is not so if the line ends after the whitespace: in that case the > whitespace is trailing, and will appear at the visual end of the > line. So only if it's a soft break should I remove the last logical space; if it's before a hard break, I should leave it alone. > Only if you add some character after the whitespace will the > whitespace "jump" to the other side of the word. ... because the hard break just turned into a soft break and the newly typed character will appear on the next line with a hard line break after it, right? > No, it should show the space after ABC to the left of ABC, > i.e. immediately before the line end. Just to make sure, this moving of the last space to the visual end of the line can only be experienced with a moving cursor, right? I mean as far as displaying goes (and as far as line width computation for the purposes of line wrapping goes), that space is just removed, right? I'm trying to infer the purpose of moving that space to the end of the line instead of just removing it: is the idea to always provide a cursor at the visual end of the line so that typing can continue there or is there more to it? > What UAX#9 tells you is that you need to decide that the line will > wrap after the space that follows "ABC" ... but when computing the line width I should not include the width of that space, right? since it will not take space in the box in the end. >, then reorder the line as if it > ended after that space, which will produce this: > > ??????? ABC > > (with the trailing space to the left of "ABC"). Then you should > display "DEF" on the next line. You mean it will produce this: " ABC ???????"
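The wrap-then-reorder order discussed above (find levels, wrap in logical order, reorder wrapped lines) can be sketched in a few lines of Python. This is a minimal illustration, not a conformant UAX #9 implementation: the embedding levels are assumed to be already resolved, and as a stand-in for real bidi resolution, lowercase letters here represent RTL characters (level 1) and uppercase letters LTR characters (level 2). Rule L1 (reset trailing whitespace to the paragraph embedding level) is exactly what makes the trailing space "jump" to the visual end of the line; rule L2 then reverses the level runs.

```python
# Toy reordering of one wrapped line per UAX #9 rules L1 and L2.
# Assumes levels were resolved earlier; this only shows the reorder step.

def reorder_line(chars, levels, para_level):
    chars, levels = list(chars), list(levels)
    # Rule L1: trailing whitespace takes the paragraph embedding level.
    i = len(chars) - 1
    while i >= 0 and chars[i] == " ":
        levels[i] = para_level
        i -= 1
    # Rule L2: from the highest level down to the lowest odd level,
    # reverse every contiguous run of characters at that level or higher.
    for level in range(max(levels), 0, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = chars[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(chars)

# The wrapped first line of the example, in logical order: an RTL word
# ("www" standing in for the Arabic), a space, "ABC", and the trailing
# space after "ABC" where the line wraps.  Paragraph direction is RTL.
line = "www ABC "
levels = [1, 1, 1, 1, 2, 2, 2, 2]
print(reorder_line(line, levels, para_level=1))  # -> " ABC www"
```

The trailing space ends up at the far left, which is the visual end of an RTL line, matching the behaviour Eli describes; without the L1 reset it would stay glued to "ABC" in the middle of the line.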
From unicode at unicode.org Tue Aug 28 13:33:14 2018 From: unicode at unicode.org (Cosmin Apreutesei via Unicode) Date: Tue, 28 Aug 2018 21:33:14 +0300 Subject: Line wrapping of mixed LTR/RTL text In-Reply-To: References: Message-ID: Hi Philippe, > The space encoded just before the logical end of line or linewrap (in the middle of the displayed line) has to be moved at end of the physical line (in the paragraph direction), it should not be kept in the middle. OK, that seems to confirm what Eli is saying and it clarifies that sentence from UAX#9. Thanks! From unicode at unicode.org Tue Aug 28 13:48:10 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Tue, 28 Aug 2018 21:48:10 +0300 Subject: Line wrapping of mixed LTR/RTL text In-Reply-To: (message from Cosmin Apreutesei on Tue, 28 Aug 2018 21:28:58 +0300) References: <834lfe4frq.fsf@gnu.org> Message-ID: <83tvne2uqd.fsf@gnu.org> > From: Cosmin Apreutesei > Date: Tue, 28 Aug 2018 21:28:58 +0300 > Cc: unicode at unicode.org > > > That is not so if the line ends after the whitespace: in that case the > > whitespace is trailing, and will appear at the visual end of the > > line. > > So only if it's a soft break I should indeed remove the last logical > space, if it's before a hard break then leave it alone. Actually, you don't have to remove it; you could leave it. It's only an aesthetic issue. > > No, it should show the space after ABC to the left of ABC, > > i.e. immediately before the line end. > > Just to make sure, this moving of the last space at the visual end of > the line can only be experienced with a moving cursor, right? I mean > as far as displaying goes (and as far as line width computation for > the purposes of line wrapping goes), that space is just removed, > right? As I said, not necessarily. But it is definitely there when you reorder characters for display.
> I'm trying to infer the purpose of moving that space to the > end of the line instead of just removing it If you remove trailing space, then you need to see it being trailing before you remove it. That is the purpose of moving it. > > What UAX#9 tells you is that you need to decide that the line will > > wrap after the space that follows "ABC" > > ... but when computing the line width I should not include the width > of that space, right? since it will not take space in the box in the > end. If you will remove the space, then yes. > You mean it will produce this: > > " ABC ???????" Yes. From unicode at unicode.org Tue Aug 28 23:04:31 2018 From: unicode at unicode.org (via Unicode) Date: Wed, 29 Aug 2018 12:04:31 +0800 Subject: Private Use areas - Vertical Text In-Reply-To: <1421005745.806742.1535476429135@mail2.virginmedia.com> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> <1421005745.806742.1535476429135@mail2.virginmedia.com> Message-ID: <970787d82640279efdd541f02e39a1bd@koremail.com> Dear Richard and Peter, apologies for the lack of clarity. Let me try to explain below. On 2018-08-29 01:13, WORDINGHAM RICHARD via Unicode wrote: >> On 27 August 2018 at 15:22 Peter Constable via Unicode >> wrote: >> >> Layout engines that support CJK vertical layout do not rely on the >> 'vert' feature to rotate glyphs for CJK ideographs, but rather >> rotate the glyph 90? and switch to using vertical glyph metrics. >> The 'vert' feature is used to substitute vertical alternate glyphs >> as needed, such as for punctuation that isn't automatically rotated >> (and would probably need a differently-positioned alternate in any >> case). >> >> Cf. UAX 50. > > There have been some pretty confused statements. 
I believe the > observed problem is that PUA characters for Zhuang CJK ideographs get > rotated when displayed vertically rather than left-to-right. > Yes: as Richard says, when CJK Zhuang text is displayed vertically, the Zhuang characters encoded in Unicode remain upright, but those with PUA code points are rotated 90°. This is because the PUA characters are treated like English text, which is correctly rotated 90°. The orientation of the CJK characters in this case appears to depend on which block they belong to. As Peter points out, this does not seem to match UAX 50. > Unicode is doing what it can in this matter: > > (a) Zhuang PUA characters are being made individually obsolete. > Yes and no. Whilst a thousand Zhuang characters have been encoded and two thousand have been submitted via the IRG, the number of PUA Zhuang characters is about the same or increasing. In 2006, when this started, just under 6k PUA code points were used; presently there are over 8k, over 6k of which have not been submitted, and the earliest any future submissions can be encoded is 2026. That being said, the number of more common Zhuang characters needing PUA support is coming down. So whilst individual characters are being resolved, the need for PUA Zhuang characters remains, and will do so for decades to come. > (b) By default, PUA characters have the value of > Vertical_orientation=upright as do CJK ideographs. > Noted above. Regards John > For CJK ideographs, it is not clear to me when the vert feature (if > present) would be applied. Is it only for some codepoints (vo=tu), or > is it for all that the engine expects to be displayed 'upright' in > vertical text? The vrtr feature (if present) would be applied when > glyphs are to be rotated. Is it for all such glyphs, or only those > for which rotation is expected to be inadequate (vo=tr)?
It seems > that feature vrt2 is to be applied to all glyphs; perhaps rotation is > the default behaviour when there is no look-up value for a glyph that > the engine expects to be rotated. The truly difficult case would be > when there is no attempt to apply a look-up - possibly vrtr would not > apply to /p{vo=r}. > > I would expect that defining the lookup vrt2 or vrtr to map Zhuang > glyphs to themselves (or something prerotated) would cure the problem. > This would not work for sequences of Zhuang ideographs treated as RTL > text - but that is unlikely to happen. > > Richard. From unicode at unicode.org Wed Aug 29 00:47:47 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Wed, 29 Aug 2018 07:47:47 +0200 Subject: Private Use areas References: <20180828094301.665a7a7059d7ee80bb4d670165c8327d.32c1b975e2.wbe@email03.godaddy.com> Message-ID: <86h8jdaflo.fsf@mimuw.edu.pl> On Tue, Aug 28 2018 at 9:43 -0700, unicode at unicode.org writes: > On August 23, 2011, Asmus Freytag wrote: > >> On 8/23/2011 7:22 AM, Doug Ewell wrote: >>> Of all applications, a word processor or DTP application would want >>> to know more about the properties of characters than just whether >>> they are RTL. Line breaking, word breaking, and case mapping come to >>> mind. >>> >>> I would think the format used by standard UCD files, or the XML >>> equivalent, would be preferable to making one up: Right. I was not so quick to state this so early, but 2 years ago I wrote to the MUFI list: --8<---------------cut here---------------start------------->8--- On Sat, Jan 02 2016 at 12:35 CET, odd.haugen at uib.no writes: [...] > Note the permanent URI at the University Library in Bergen. This will > in all likelihood be the last recommendation of its kind (and > certainly the last edited by the undersigned), so please look out for > new solutions (databases or the like) on the MUFI web site! 
I think that one of the forms, perhaps even the primary one, should follow the original Unicode Character Database and the output of Unibook (http://www.unicode.org/unibook/). The idea can be tested by converting the present recommendation to this form. Unfortunately I'm unable to contribute myself to this task. One of the advantages would be that the various character browsers can be adapted relatively easily to provide info about the MUFI characters. A simpler variant of this idea is to use a Unibook-like format to document fonts. A quick-and-dirty tool for this purpose has been prepared by a student of mine: https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/ https://bitbucket.org/jsbien/unicode-ucd-parser A sample output of the tools is available at https://bitbucket.org/jsbien/parkosz-font/downloads/Parkosz1907draft.pdf (the font is also quick-and-dirty and unfinished work). --8<---------------cut here---------------end--------------->8--- Unfortunately there was no reaction. >> >> The right answer would follow the XML format of the UCD. >> >> That's the only format that allows all necessary information contained >> in one file, For me, the comments and cross-references contained in NamesList.txt are also necessary. Do I understand correctly that only "ISO Comment properties" are included in the file? >> and it would leverage any effort that users of the >> main UCD have made in parsing the XML format. >> >> An XML format should also be flexible in that you can add/remove not >> just characters, but properties as needed. >> >> The worst thing to do, other than designing something from scratch, >> would be to replicate the UnicodeData.txt layout with its random, but >> fixed collection of properties and insanely many semi-colons. None of >> the existing UCD txt files carries all the needed data in a single >> file. > > I don't know if or how I responded 7 years ago, but at least today, I > think this is an excellent suggestion.
> > If the goal is to encourage vendors to support PUA assignments, using an > exceedingly well-defined format (UAX #42) sitting atop one of the most > widely used base formats ever (XML), with all property information in a > single repository (per PUA scheme), would be great encouragement. I think we also need the data in a format acceptable to UniBook. > I've devised lots of novel file formats and I think this is one use > case where that would be a real hindrance. > Storing this information in a font, by hook or crook, would lock users > of those PUA characters into that font. At that rate, you might as well > use ASCII-hacked fonts, as we did 25 years ago. Storing the information in a font is inappropriate not only for technical reasons, as I wrote recently (on Thu, Aug 23 2018): > Fonts are for *rendering*, new characters and variants are more and > more often needed for *input* of real life old texts with sufficient > precision. Best regards Janusz -- , Janusz S. Bień emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Wed Aug 29 03:06:36 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 29 Aug 2018 00:06:36 -0800 Subject: Private Use areas - Vertical Text In-Reply-To: <970787d82640279efdd541f02e39a1bd@koremail.com> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> <1421005745.806742.1535476429135@mail2.virginmedia.com> <970787d82640279efdd541f02e39a1bd@koremail.com> Message-ID: John Knightley wrote, > Yes, as Richard says when CJK Zhuang text is displayed > vertically whilst the Zhuang characters in Unicode remain > upright, but those with PUA codepoints are rotated 90°. > This is because the PUA characters are treated like English > text, which are correctly rotated 90°. ... > > ... > ...
the need for PUA Zhuang characters remains, and will > so for decades to come. A possible work-around would be to have two fonts for PUA Zhuang, one for horizontal text and one for vertical. The one for the vertical text would have the glyphs in the font pre-rotated 90° anti-clockwise. This would require font switching when switching from horizontal to vertical layout, of course. From unicode at unicode.org Wed Aug 29 03:25:43 2018 From: unicode at unicode.org (Andrew West via Unicode) Date: Wed, 29 Aug 2018 09:25:43 +0100 Subject: Private Use areas - Vertical Text In-Reply-To: <1421005745.806742.1535476429135@mail2.virginmedia.com> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> <1421005745.806742.1535476429135@mail2.virginmedia.com> Message-ID: On Tue, 28 Aug 2018 at 18:15, WORDINGHAM RICHARD via Unicode wrote: > > Unicode is doing what it can in this matter: > > (a) Zhuang PUA characters are being made individually obsolete. Not by a nebulous entity called "Unicode", or even by the Unicode Consortium per se, but by the hard work over many years of individual experts such as John Knightley.
Andrew From unicode at unicode.org Wed Aug 29 03:32:57 2018 From: unicode at unicode.org (Andrew West via Unicode) Date: Wed, 29 Aug 2018 09:32:57 +0100 Subject: Private Use areas - Vertical Text In-Reply-To: <970787d82640279efdd541f02e39a1bd@koremail.com> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> <1421005745.806742.1535476429135@mail2.virginmedia.com> <970787d82640279efdd541f02e39a1bd@koremail.com> Message-ID: On Wed, 29 Aug 2018 at 05:07, via Unicode wrote: > > Yes, as Richard says when CJK Zhuang text is displayed vertically whilst > the Zhuang characters in Unicode remain upright, but those with PUA > codepoints are rotated 90?. John, you did not explain by what mechanism you were trying to display vertical PUA Zhuang text. I can display vertically-oriented PUA-encoded CJKVZ ideographs in vertical layout in web pages using CSS, as demonstrated in this test page: http://www.babelstone.co.uk/Fonts/PUA_Vertical_Test.html The PUA characters display with correct orientation under Windows 10 on the Edge, Chrome and Firefox browsers. The test page only fails under IE, but we are not meant to use IE anymore anyway. Andrew From unicode at unicode.org Wed Aug 29 05:18:19 2018 From: unicode at unicode.org (via Unicode) Date: Wed, 29 Aug 2018 18:18:19 +0800 Subject: Private Use areas - Vertical Text In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> <1421005745.806742.1535476429135@mail2.virginmedia.com> <970787d82640279efdd541f02e39a1bd@koremail.com> Message-ID: Dear Andrew, I was using a change horizontal to vertical text feature in office, the PUA characters being from plane 15. 
Regards John On 2018-08-29 16:32, Andrew West via Unicode wrote: > On Wed, 29 Aug 2018 at 05:07, via Unicode wrote: >> >> Yes, as Richard says when CJK Zhuang text is displayed vertically >> whilst >> the Zhuang characters in Unicode remain upright, but those with PUA >> codepoints are rotated 90?. > > John, you did not explain by what mechanism you were trying to display > vertical PUA Zhuang text. > > I can display vertically-oriented PUA-encoded CJKVZ ideographs in > vertical layout in web pages using CSS, as demonstrated in this test > page: > > http://www.babelstone.co.uk/Fonts/PUA_Vertical_Test.html > > The PUA characters display with correct orientation under Windows 10 > on the Edge, Chrome and Firefox browsers. The test page only fails > under IE, but we are not meant to use IE anymore anyway. > > Andrew From unicode at unicode.org Wed Aug 29 07:05:31 2018 From: unicode at unicode.org (Andrew West via Unicode) Date: Wed, 29 Aug 2018 13:05:31 +0100 Subject: Private Use areas - Vertical Text In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> <1421005745.806742.1535476429135@mail2.virginmedia.com> <970787d82640279efdd541f02e39a1bd@koremail.com> Message-ID: On Wed, 29 Aug 2018 at 11:18, wrote: > > I was using a change horizontal to vertical text feature in office, the > PUA characters being from plane 15. I tested with Word 2007, and normal PUA characters from my font were displayed with vertical orientation in a vertical text box, but Plane 15 PUA characters were rotated. 
I also tested with Word 2016, and both normal PUA characters and Plane 15 PUA characters were displayed with vertical orientation in a vertical text box, as you want, although there were vertical spacing issues with the Plane 15 PUA characters which suggest that the vertical metrics tables (vhea and vmtx) in the font are not being applied for Plane 15 characters (or it could be a problem with my font). Andrew From unicode at unicode.org Wed Aug 29 15:33:18 2018 From: unicode at unicode.org (WORDINGHAM RICHARD via Unicode) Date: Wed, 29 Aug 2018 21:33:18 +0100 (BST) Subject: Private Use areas - Vertical Text In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> <1421005745.806742.1535476429135@mail2.virginmedia.com> <970787d82640279efdd541f02e39a1bd@koremail.com> Message-ID: <910040764.839307.1535574798093@mail2.virginmedia.com> > > On 29 August 2018 at 13:05 Andrew West via Unicode wrote: > > I tested with Word 2007, and normal PUA characters from my font were > > displayed with vertical orientation in a vertical text box, but Plane > 15 PUA characters were rotated. > And then the original question is whether a font can suppress this rotation. For example, it is entirely possible that the rotation could be eliminated by the vrt2 OpenType feature mapping a Zhuang PUA glyph to an identical glyph. Richard. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Aug 29 16:42:57 2018 From: unicode at unicode.org (Andrew Glass via Unicode) Date: Wed, 29 Aug 2018 21:42:57 +0000 Subject: Tamil Brahmi Short Mid Vowels In-Reply-To: <20180721085026.6aa07876@JRWUBU2> References: <20180721020131.4b22887b@JRWUBU2> <20180721085026.6aa07876@JRWUBU2> Message-ID: Thank you Richard and Shriramana for bringing up this interesting problem. 
I agree we need to fix this. I don't want to fix this with a font hack or change to USE cluster rules or properties. I think the right place to fix this is in the encoding. This might be either a new character for Tamil Brahmi Puḷḷi (as Shriramana has proposed in L2/12-226) or separate characters for Tamil Brahmi Short E and Tamil Brahmi Short O in independent and dependent forms (4 characters total). I'm inclined to think that a visible virama, Tamil Brahmi Puḷḷi, is the right approach. Cheers, Andrew -----Original Message----- From: Unicode On Behalf Of Richard Wordingham via Unicode Sent: Saturday, July 21, 2018 12:50 AM To: unicode at unicode.org Subject: Re: Tamil Brahmi Short Mid Vowels On Sat, 21 Jul 2018 07:55:51 +0530 Shriramana Sharma via Unicode > wrote: > This is a unique problem because this is probably the only case where > the same script produces conjuncts for one language and not for > another. There are and have been similar cases. Reformed (a.k.a. 'typewriter') Malayalam v. traditional Malayalam comes immediately to mind. Pre-5.0 Myanmar script was similar, with Pali stacking and Burmese mostly not, though that gives you the precedent of disunifying the invisible stacker and the vowel killer, which I've always considered a bad unification inherited from ISCII. 'Pure' Tai and Pali use stacking quite differently in the Tai Tham script, but some Tai languages use a lot of Pali-style spellings. > I had asked for a separate Tamil Brahmi virama to be encoded which > would obviate this problem but that was shot down. Maybe that case > should be reopened? Could be messy. Are you saying that people are relying on fonts being free of conjuncts? One could use a keyboard with a 'pulli' key that produced - I don't know if people do. Richard. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Wed Aug 29 19:27:33 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST) Subject: Private Use areas Message-ID: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25> On 29/08/18 07:55, Janusz S. Bień via Unicode wrote: > > On Tue, Aug 28 2018 at 9:43 -0700, unicode at unicode.org writes: > > On August 23, 2011, Asmus Freytag wrote: > > > >> On 8/23/2011 7:22 AM, Doug Ewell wrote: > >>> Of all applications, a word processor or DTP application would want > >>> to know more about the properties of characters than just whether > >>> they are RTL. Line breaking, word breaking, and case mapping come to > >>> mind. > >>> > >>> I would think the format used by standard UCD files, or the XML > >>> equivalent, would be preferable to making one up: [...] > >> > >> The right answer would follow the XML format of the UCD. > >> > >> That's the only format that allows all necessary information contained > >> in one file, > > For me necessary are also comments and crossreferences contained in > NamesList.txt. Do I understand correctly that only "ISO Comment > properties" are included in the file? Even that comment field is obsoleted. But it's unclear to me what exactly it was providing from ISO. > > >> and it would leverage any effort that users of the > >> main UCD have made in parsing the XML format. > >> > >> An XML format should also be flexible in that you can add/remove not > >> just characters, but properties as needed. > >> > >> The worst thing to do, other than designing something from scratch, > >> would be to replicate the UnicodeData.txt layout with its random, but > >> fixed collection of properties and insanely many semi-colons. None of > >> the existing UCD txt files carries all the needed data in a single > >> file. Curiously, UnicodeData.txt is lacking the header line. That makes it inflexible.
I never wondered why the header line is missing, probably because compared to the other UCD files, the file looks really odd without a file header showing at least the version number and datestamp. It's like the file was made up for dumb parsers unable to handle comment delimiters, and never to be upgraded to do so. But I like the format, and that's why at some point I submitted feedback asking for an extension. Indeed we could use more information than what is yielded by UCD \setminus NamesList.txt (that we may not parse, as per file header). Given NamesList.txt / Code Charts comments are kept minimal by design, one couldn't simply pop them into XML or whatever, as the result would be disappointing and call for completion in the aftermath. Yet another task competing with the CLDR survey. Reviewing CLDR data is IMO top priority. There are many flaws to be fixed in many languages, including in English. A lot of useful digest charts are extracted from XML there, and we really need to go through the data and correct the many, many errors, please. Unlike XML, human readability of CSV may not be immediate. Yes, you simply cannot always count the semicolons and remember the property name from the value position if it isn't obvious by itself. But we use spreadsheets. At least some people do. That's where the magic works. Looking up things in a spreadsheet is a good way to find out about wrong property values. Looks like handling files only programmatically gets everything screwed up. Marcel From unicode at unicode.org Thu Aug 30 13:27:30 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 30 Aug 2018 12:27:30 -0600 Subject: Unicode Digest, Vol 56, Issue 20 In-Reply-To: Message-ID: <201808301827.w7UIRbqF028462@unicode.org> UnicodeData.txt was devised long before any of the other UCD data files.
Though it might seem like a simple enhancement to us, adding a header block, or even a single line, would break a lot of existing processes that were built long ago to parse this file. So Unicode can't add a header to this file, and that is the reason the format can never be changed (e.g. with more columns). That is why new files keep getting created instead. The XML format could indeed be expanded with more attributes and more subsections. Any process that can parse XML can handle unknown stuff like this without misinterpreting the stuff it does know. That's why the only two reasonable options for getting UCD data are to read all the tab- and semicolon-delimited files, and be ready for new files, or just read the XML. Asking for changes to existing UCD file formats is kind of a non-starter, given these two alternatives. --Doug Ewell | Thornton, CO, US | ewellic.org

-------- Original message -------- Message: 3 Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST) From: Marcel Schneider via Unicode

Curiously, UnicodeData.txt is lacking the header line. That makes it inflexible. I always wondered why the header line is missing, probably because, compared to the other UCD files, the file looks really odd without a file header showing at least the version number and datestamp. It’s as if the file had been made up for dumb parsers unable to handle comment delimiters, and were never to be upgraded to do so. But I like the format, and that’s why at some point I submitted feedback asking for an extension. [...]

-------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Thu Aug 30 16:26:36 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 30 Aug 2018 23:26:36 +0200 Subject: Unicode Digest, Vol 56, Issue 20 In-Reply-To: <201808301827.w7UIRbqF028462@unicode.org> References: <201808301827.w7UIRbqF028462@unicode.org> Message-ID: Well, an alternative to XML is JSON, which is more compact and faster/simpler to process; however, JSON has no explicit schema unless the schema is made part of the data itself, complicating its structure (with many levels of arrays of arrays, in which case it becomes less easy for humans to read, but better adapted to automated processes for fast processing). I'd say that the XML alone is enough to generate any JSON-derived dataset that will conform to the schema an application expects to process fast (and with just the data it can process, excluding various extensions still not implemented). But the fastest implementations are also based on data tables encoded in code (such as DLLs or Java classes), or custom database formats (such as Berkeley DB), also generated automatically from the XML, without the processing cost of decompression schemes and parsers. Still today, even if XML is not the usual format used by applications, it is still the most interoperable format that allows building all sorts of applications in all sorts of languages: the cost of parsing is left to an application builder/compiler. Some apps embed the compilers themselves and use a stored cache for faster processing: this approach allows easy updates by detecting changes in the XML source and then downloading them.
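Philippe's point that the XML alone suffices to derive compact application datasets can be sketched in a few lines of Python (the toy fragment and the two-column [cp, gc] schema are hypothetical, chosen only for illustration):

```python
import json
import xml.etree.ElementTree as ET

# Parse the interoperable XML once, then emit a compact JSON dataset
# holding only what this (hypothetical) application needs: the general
# category per code point, under the implicit schema [cp, gc].
fragment = (
    '<repertoire>'
    '<char cp="0024" na="DOLLAR SIGN" gc="Sc"/>'
    '<char cp="0020" na="SPACE" gc="Zs"/>'
    '</repertoire>'
)
root = ET.fromstring(fragment)
dataset = [[c.get("cp"), c.get("gc")] for c in root.iter("char")]
print(json.dumps(dataset))  # [["0024", "Sc"], ["0020", "Zs"]]
```

As noted above, the JSON side carries no schema of its own; the meaning of the two columns lives entirely in the generating code.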
But in CLDR such updates are generally not automated: the general scheme evolves over time, and there are complex dependencies to check before some data becomes usable. Frequently you need to implement new algorithms to follow the processing rules documented in CLDR, or to use data that is not completely validated, or to allow applications to provide their own overrides for insufficiently complete datasets in CLDR. Even though CLDR provides a root locale, and applications are supposed to follow the BCP 47 fallback resolution rules, applications also have their own needs regarding which language codes they use, and CLDR provides many locales that many applications are still not prepared to render correctly; many application users complain if an application is partly translated and contains too many fallbacks to another language, or worse, to another script. On Thu, 30 Aug 2018 at 20:38, Doug Ewell via Unicode wrote: > UnicodeData.txt was devised long before any of the other UCD data files. > Though it might seem like a simple enhancement to us, adding a header > block, or even a single line, would break a lot of existing processes that > were built long ago to parse this file. > > So Unicode can't add a header to this file, and that is the reason the > format can never be changed (e.g. with more columns). That is why new files > keep getting created instead. > > The XML format could indeed be expanded with more attributes and more > subsections. Any process that can parse XML can handle unknown stuff like > this without misinterpreting the stuff it does know. > > That's why the only two reasonable options for getting UCD data are to > read all the tab- and semicolon-delimited files, and be ready for new > files, or just read the XML. Asking for changes to existing UCD file > formats is kind of a non-starter, given these two alternatives.
> > > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------- Original message -------- > Message: 3 > Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST) > From: Marcel Schneider via Unicode > > Curiously, UnicodeData.txt is lacking the header line. That makes it > inflexible. > I always wondered why the header line is missing, probably because, compared > to the other UCD files, the file looks really odd without a file header > showing at least the version number and datestamp. It’s as if the file had been made up > for dumb parsers unable to handle comment delimiters, and were never to be upgraded > to do so. > > But I like the format, and that’s why at some point I submitted feedback > asking for an extension. [...] > > -------------- next part -------------- An HTML attachment was scrubbed... URL:

From unicode at unicode.org Thu Aug 30 17:33:42 2018 From: unicode at unicode.org (Wordingham Richard via Unicode) Date: Thu, 30 Aug 2018 23:33:42 +0100 (BST) Subject: Private Use areas In-Reply-To: <86h8jdaflo.fsf@mimuw.edu.pl> References: <20180828094301.665a7a7059d7ee80bb4d670165c8327d.32c1b975e2.wbe@email03.godaddy.com> <86h8jdaflo.fsf@mimuw.edu.pl> Message-ID: <1730684724.866931.1535668422391@mail2.virginmedia.com> > > On 29 August 2018 at 06:47 "Janusz S. Bień via Unicode" wrote: > > > > > > Storing this information in a font, by hook or crook, would lock users > > of those PUA characters into that font. At that rate, you might as well > > use ASCII-hacked fonts, as we did 25 years ago. > > > > > I don't see that at all. The obvious way in the sfnt format, used by OpenType, is as a table consisting entirely of the XML file. It is quite easy to add a table to an unsigned sfnt font, and even easier to extract a table consisting entirely of UTF-8 text (ASCII would be easier still) from a font file.
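As a rough illustration of how little machinery the sfnt route needs, here is a sketch; the 'PUAp' table tag and its XML payload are invented for this example, not a registered table:

```python
import struct

# Build a minimal one-table sfnt container in memory, then locate and
# extract the table again by walking the table directory.
def make_sfnt(tag, payload):
    # sfnt header: version 1.0, numTables=1, searchRange/entrySelector/rangeShift
    header = struct.pack(">IHHHH", 0x00010000, 1, 16, 0, 0)
    offset = len(header) + 16                      # one 16-byte directory entry
    entry = struct.pack(">4sIII", tag, 0, offset, len(payload))
    return header + entry + payload

def read_table(data, wanted):
    num_tables = struct.unpack(">H", data[4:6])[0]
    for i in range(num_tables):
        tag, _checksum, off, length = struct.unpack_from(">4sIII", data, 12 + 16 * i)
        if tag == wanted:
            return data[off:off + length]
    return None

font = make_sfnt(b"PUAp", b"<pua>example</pua>")   # UTF-8 text as table payload
print(read_table(font, b"PUAp").decode("utf-8"))   # <pua>example</pua>
```

A real font would of course also carry the required glyf/cmap/name tables and correct checksums; the table-directory walk is the point here.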
> > Storing the information in a font is inappropriate not only for the technical reasons, as I wrote recently (on Thu, Aug 23 2018) > > > > > > Fonts are for *rendering*; new characters and variants are more and > > more often needed for *input* of real-life old texts with sufficient > > precision. > > > > > 1. There are existing methods of associating a font with a text. Not using a font needs a new scheme for associating a set of PUA properties with a portion of a file. The font also serves as a code chart. It can also hold information on how characters combine, which is notoriously beyond the capability of code charts. 2. Registries can vanish. 3. In practice, a file needs to retain an association with a specialist font. Preserving the font should preserve its content, but there are pruning techniques (e.g. WOFF2) that may remove this content. Richard. -------------- next part -------------- An HTML attachment was scrubbed... URL:

From unicode at unicode.org Thu Aug 30 18:14:41 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 31 Aug 2018 01:14:41 +0200 (CEST) Subject: Unicode Digest, Vol 56, Issue 20 Message-ID: <957858186.11079.1535670881197.JavaMail.www@wwinf1d25> Thank you for looking into this. First, I’m unable to retrieve the publication you are citing, but a February thread had nearly the same subject, referring to Vol. 50. How did you compute these figures? Is that a code phrase to say: “The same questions over and over again; let’s settle this on the record, as a reference for later inquiries”? Also, "unicode-request at unicode.org" doesn’t appear to be a valid e-mail address. Does that mean that I’d better send a proposal with an enhancement request to docsubmit at unicode.org, rather than contribute to the topic while it is being discussed on the Unicode Public Mail List?
OK, I’ll try to get something out of this, because many people really want things to get better: On 30/08/18 20:37 Doug Ewell via Unicode wrote: > > UnicodeData.txt was devised long before any of the other UCD data files. I can’t think of any era in the computer age when file headers were uncommon, or when a parser able to process semicolons couldn’t be directed to make sense of crosshatches. If releasing a headerless file was ever a mistake, implementers would have been able to anticipate that it might be corrected at some point. Implementations are to be updated at every single Unicode release; that is what I am able to tell, while ignoring the arcana of frozen APIs. > Though it might seem like a simple enhancement to us, adding a header block, or even a single line, > would break a lot of existing processes that were built long ago to parse this file. They are hopelessly outdated anyway, and most of them would have been replaced with something better long since. The remainder might not be worth bothering the rest of the world with headerless files. > So Unicode can't add a header to this file, and that is the reason the format can never be changed > (e.g. with more columns). That is why new files keep getting created instead. I figured out something like that rationale, and I can also understand that Unicode isn’t going to keep releasing headerless files while waiting for somebody to tell them not to do so, and then suddenly add the missing header. Also, I didn’t really ask for that, but suggested adding yet another *new* file, not changing the data structure of the existing UnicodeData.txt. As for the reference, a Google search for "unicodedataextended.txt" just brought it up: http://www.unicode.org/review/pri297/ Having said that, I still think that while not parsing a header line is a reasonable position when the field structure is known to be stable, not being able to *skip* a header is sort of odd.
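To make the point concrete, here is a minimal sketch (with made-up sample records) of a UnicodeData.txt reader for which a comment header, present or absent, is a non-event: skipping it costs one guard clause.

```python
# A tolerant reader for UnicodeData.txt-style records: split on
# semicolons, and skip blank lines and '#' comment lines, so a future
# header would be ignored rather than break the parse.
def parse_ucd_lines(lines):
    records = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):   # the one-line header guard
            continue
        fields = line.split(";")
        records[fields[0]] = fields[1:]        # code point -> remaining fields
    return records

sample = [
    "# Hypothetical header a dumb parser would choke on",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
]
data = parse_ucd_lines(sample)
print(data["0041"][0])  # LATIN CAPITAL LETTER A
```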
> The XML format could indeed be expanded with more attributes and more subsections. > Any process that can parse XML can handle unknown stuff like this without misinterpreting > the stuff it does know. Agreed. I’m not questioning XML. But I’m using spreadsheets. I don’t know how many computer scientists use spreadsheets. Perhaps not many of us look up UnicodeData.txt that way (I use it in raw text, too, and I look up ucd.nounihan.flat.xml). Generating code in a spreadsheet is considered quick-and-dirty. I don’t agree that it’s dirty, but it’s quick. And above all, it appears that doing certain research in spreadsheets is the most efficient way to check whether character properties match character identity. Using spreadsheet software is trivial, so it might be disregarded and left to non-scientists, while it is closer to human experience and allows one to do research in nearly no time, by adding columns, filters and formulae that one would probably spend weeks coding in C, Lisp, Perl or Python (which I cannot do, so I’m biased). > That's why the only two reasonable options for getting UCD data are to read all the tab- and semicolon-delimited files, > and be ready for new files, or just read the XML. Asking for changes to existing UCD file formats is kind of a non-starter, > given these two alternatives. Given the above, one can easily understand why I do not agree with being limited to these two alternatives. Given that a process must be updatable to grab a newly added small file from the UCD, it can just as well be updated to skip file comments, and even to parse a new *large* file from the UCD. On the other hand, given that Unicode is ready to add new small semicolon-delimited files, it might as well add a new *large* semicolon-delimited file to the UCD. That large file would have a file header and a header line, and would be specified as being flexible. It might have one hundred fields delimited by 99 semicolons.
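Such a wide, headered, semicolon-delimited file is exactly what stock CSV tooling and spreadsheets consume directly. A sketch with hypothetical column names (the field inventory of the proposed file is not specified here):

```python
import csv
import io

# With a header line, columns can be addressed by property name rather
# than by counting semicolons; a spreadsheet would open the same file as-is.
text = io.StringIO(
    "cp;na;gc;sc;Dash;WSpace\n"
    "002D;HYPHEN-MINUS;Pd;Zyyy;Y;N\n"
    "0020;SPACE;Zs;Zyyy;N;Y\n"
)
rows = list(csv.DictReader(text, delimiter=";"))
dashes = [r["cp"] for r in rows if r["Dash"] == "Y"]
print(dashes)  # ['002D']
```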
These 5 million semicolons would still be more lightweight than 5 million attribute names plus the XML tags. The added value is that people using spreadsheets would have a handy file to import, rather than each individual having to convert a large XML file to a large CSV file, for lack of the latter being readily provided by Unicode. If this discussion has a positive echo, I or somebody else may submit an appropriate proposal. But I’d prefer not to repeat the mistake of not discussing a topic on Unicode Public prior to submitting a proposal that is then kindly put on the agenda, but discussed in disfavor and dismissed in disgrace twice at UTC meetings. And why didn’t I wish for upstream discussion here? Because I was naively afraid that the unveiled mistakes could reflect badly on some people. It turned out that nothing reflects badly on anybody. (So UnicodeData.txt could as well get its missing header, BTW.) Regards, Marcel

From unicode at unicode.org Thu Aug 30 23:58:37 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Subject: UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20) In-Reply-To: References: <201808301827.w7UIRbqF028462@unicode.org> Message-ID: <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> On 30/08/18 23:34 Philippe Verdy via Unicode wrote: > > Well, an alternative to XML is JSON, which is more compact and faster/simpler to process; Thanks for pointing out the problem and the solution alike. Indeed, the main drawback of the XML format of the UCD is that it results in an “insane” filesize. “Insane” was applied to the number of semicolons in UnicodeData.txt, but that is irrelevant. What is really insane is the filesize of the XML versions of the UCD. Even without Unihan, it may take up to a minute or so to load in a text editor.
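Loading time aside, the flat XML never has to be held in memory whole: a streaming parse visits one <char> element at a time. A rough Python sketch, shown on an inline fragment (the namespace URI is the one the published UCD XML files declare):

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"
fragment = (
    '<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0"><repertoire>'
    '<char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu"/>'
    '<char cp="0042" na="LATIN CAPITAL LETTER B" gc="Lu"/>'
    '</repertoire></ucd>'
)
names = {}
# iterparse streams the document; clearing each element after use keeps
# memory flat even for a very large file.
for _event, elem in ET.iterparse(io.StringIO(fragment), events=("end",)):
    if elem.tag == NS + "char":
        names[elem.get("cp")] = elem.get("na")
        elem.clear()
print(names["0041"])  # LATIN CAPITAL LETTER A
```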
> however JSON has no explicit schema, unless the schema is made part of the data itself, > complicating its structure (with many levels of arrays of arrays, in which case it becomes > less easy for humans to read, but better adapted to automated processes for fast processing). > > I'd say that the XML alone is enough to generate any JSON-derived dataset that will conform > to the schema an application expects to process fast > (and with just the data it can process, excluding various extensions still not implemented). > But the fastest implementations are also based on data tables encoded in code > (such as DLLs or Java classes), or custom database formats (such as Berkeley DB), > also generated automatically from the XML, without the processing cost of decompression schemes > and parsers. > > Still today, even if XML is not the usual format used by applications, it is still > the most interoperable format that allows building all sorts of applications > in all sorts of languages: the cost of parsing is left to an application builder/compiler. I’ve tried an online tool to convert ucd.nounihan.flat.xml to CSV. The tool is great and offers a lot of options, but given the “insane” file size, my browser was in trouble for over two hours until I shut down the computer manually. From what I could see in the result field, there are many bogus values, meaning that their presence in the tags of most characters is useless. And while many attributes have cryptic names in order to keep the file size minimal, some attributes have overlong values, i.e. the design is inconsistent. E.g. in every character we read: jg="No_Joining_Group" That is bogus. One would need to take it off the tags of most characters, and even for the characters where it is relevant, the value could simply be "No". What’s the use of abbreviating "Joining Group" to "jg" in the attribute name if in the value it is written out? And I’m quoting from U+0000.
Further, many values are set to a crosshatch instead of simply being removed from the characters where they are empty. Then there are the many instances of "undetermined script", resulting in *two* attributes with the "Zyyy" value. Then, in almost every character, we’re told that it is not a whitespace, not a dash, not a hyphen, and not a quotation mark: Dash="N" WSpace="N" Hyphen="N" QMark="N" One couldn’t tell that the UCD actually benefits from the flexibility of XML, given that many attributes are systematically present even where they are useless. Perhaps the ucd-*.xml files would be two thirds, half, or one third their actual size if they were properly designed. > Some apps embed the compilers themselves and use a stored cache for faster processing: > this approach allows easy updates by detecting changes in the XML source, and then > downloading them. > > But in CLDR such updates are generally not automated: the general scheme evolves over time > and there are complex dependencies to check so that some data becomes usable Should probably read *un*usable. > (frequently you need to implement some new algorithms to follow the processing rules > documented in CLDR, or to use data not completely validated, or to allow applications > to provide their overrides from insufficiently complete datasets in CLDR, > even if CLDR provides a root locale and applications are supposed to follow the BCP47 > fallback resolution rules; > applications also have their own needs about which language codes they use or need, > and CLDR provides many locales that many applications are still not prepared to render correctly, > and many application users complain if an application is partly translated and contains > too many fallbacks to another language, or worse to another script). So the case is even worse than what I could see when looking into CLDR.
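Marcel's size estimate above can be tried mechanically: drop every attribute that merely restates an assumed default and compare serialized sizes. A rough sketch (the default table below is illustrative, not the UCD's actual defaulting rules):

```python
import xml.etree.ElementTree as ET

# Attributes that, in this sketch, only restate an assumed default value.
DEFAULTS = {"Dash": "N", "WSpace": "N", "Hyphen": "N", "QMark": "N",
            "jg": "No_Joining_Group"}

def strip_defaults(root):
    for char in root.iter("char"):
        for key, default in DEFAULTS.items():
            if char.get(key) == default:
                del char.attrib[key]   # omit the attribute; readers assume the default
    return root

verbose = ET.fromstring(
    '<repertoire><char cp="0041" gc="Lu" Dash="N" WSpace="N" Hyphen="N" '
    'QMark="N" jg="No_Joining_Group"/></repertoire>'
)
before = len(ET.tostring(verbose))
after = len(ET.tostring(strip_defaults(verbose)))
print(before, after)  # the stripped serialization is markedly smaller
```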
Many countries, including France, don’t care about the data of their own locale in CLDR, but I’m not going to vent about that on Unicode Public, because it involves language offices and authorities, and would have political entanglements. Staying technical: about the file header of UnicodeData.txt, I can see zero technical reasons not to add it. Processes using the file to generate an overview of Unicode also use other files and are thus able to process comments correctly, whereas processes using UnicodeData to look up character properties provided in the file would simply start searching at the code point. (Perhaps there are compilers building DLLs from the file.) On Thu, 30 Aug 2018 at 20:38, Doug Ewell via Unicode wrote: > UnicodeData.txt was devised long before any of the other UCD data files. Though it might seem like a simple enhancement to us, adding a header block, or even a single line, would break a lot of existing processes that were built long ago to parse this file. > So Unicode can't add a header to this file, and that is the reason the format can never be changed (e.g. with more columns). That is why new files keep getting created instead. > The XML format could indeed be expanded with more attributes and more subsections. Any process that can parse XML can handle unknown stuff like this without misinterpreting the stuff it does know. > That's why the only two reasonable options for getting UCD data are to read all the tab- and semicolon-delimited files, and be ready for new files, or just read the XML. Asking for changes to existing UCD file formats is kind of a non-starter, given these two alternatives. > > > -- Doug Ewell | Thornton, CO, US | ewellic.org > -------- Original message -------- Message: 3 Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST) > From: Marcel Schneider via Unicode > > Curiously, UnicodeData.txt is lacking the header line. That makes it inflexible.
I always wondered why the header line is missing, probably because, compared to the other UCD files, the file looks really odd without a file header showing at least the version number and datestamp. It’s as if the file had been made up for dumb parsers unable to handle comment delimiters, and were never to be upgraded to do so. But I like the format, and that’s why at some point I submitted feedback asking for an extension. [...]

From unicode at unicode.org Fri Aug 31 00:20:34 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 31 Aug 2018 07:20:34 +0200 Subject: CLDR (was: Private Use areas) In-Reply-To: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25> (Marcel Schneider via Unicode's message of "Thu, 30 Aug 2018 02:27:33 +0200 (CEST)") References: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25> Message-ID: <86sh2v3ye5.fsf@mimuw.edu.pl> On Thu, Aug 30 2018 at 2:27 +0200, unicode at unicode.org writes: [...] > Given NamesList.txt / Code Charts comments are kept minimal by design, > one couldn’t simply pop them into XML or whatever, as the result would be > disappointing and call for completion in the aftermath. Yet another task > competing with CLDR survey. Please elaborate. It's not clear to me what you mean. > Reviewing CLDR data is IMO top priority. > There are many flaws to be fixed in many languages including in English. > A lot of useful digest charts are extracted from XML there, Which XML? Where? > and we really > need to go through the data and correct the many many errors, please. Some time ago I tried to have a close look at the Polish locale and found the CLDR site prohibitively confusing. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien

From unicode at unicode.org Fri Aug 31 01:19:53 2018 From: unicode at unicode.org (Marius Spix via Unicode) Date: Fri, 31 Aug 2018 08:19:53 +0200 Subject: UCD in XML or in CSV?
(was: Re: Unicode Digest, Vol 56, Issue 20) In-Reply-To: <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> References: <201808301827.w7UIRbqF028462@unicode.org> <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> Message-ID: <20180831081953.68476d36@spixxi> A good compromise between human readability, machine processability and filesize would be YAML. Unlike JSON, YAML supports comments, anchors and references, multiple documents in a file, and several other features. Regards, Marius Spix On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via Unicode wrote: > On 30/08/18 23:34 Philippe Verdy via Unicode wrote: > > > > Well, an alternative to XML is JSON, which is more compact and > > faster/simpler to process; > > Thanks for pointing out the problem and the solution alike. Indeed, the > main drawback of the XML format of the UCD is that it results in an > “insane” filesize. “Insane” was applied to the number of semicolons > in UnicodeData.txt, but that is irrelevant. What is really insane is > the filesize of the XML versions of the UCD. Even without Unihan, it > may take up to a minute or so to load in a text editor. > > > however JSON has no explicit schema, unless the schema is > > made part of the data itself, complicating its structure (with many > > levels of arrays of arrays, in which case it becomes less easy for > > humans to read, but better adapted to automated processes for fast > > processing). > > > > I'd say that the XML alone is enough to generate any JSON-derived > > dataset that will conform to the schema an application expects to > > process fast (and with just the data it can process, excluding > > various extensions still not implemented). But the fastest > > implementations are also based on data tables encoded in code (such > > as DLLs or Java classes), or custom database formats (such as > > Berkeley DB), also generated automatically from the XML, without the > > processing cost of decompression schemes and parsers.
> [...]

-------------- next part -------------- A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 488 bytes Desc: OpenPGP digital signature URL:

From unicode at unicode.org Fri Aug 31 03:27:12 2018 From: unicode at unicode.org (Manuel Strehl via Unicode) Date: Fri, 31 Aug 2018 10:27:12 +0200 Subject: CLDR (was: Private Use areas) In-Reply-To: <86sh2v3ye5.fsf@mimuw.edu.pl> References: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25> <86sh2v3ye5.fsf@mimuw.edu.pl> Message-ID: The XML files in these folders: https://unicode.org/repos/cldr/tags/latest/common/ But I agree. I spent an extreme amount of time getting somewhat used to cldr.unicode.org and the data repo, and still I have no clue where to find a concrete piece of information without digging into the site. On Fri, 31 Aug 2018 at 07:22, Janusz S. Bień via Unicode wrote: > > On Thu, Aug 30 2018 at 2:27 +0200, unicode at unicode.org writes: > > [...] > > > Given NamesList.txt / Code Charts comments are kept minimal by design, > > one couldn’t simply pop them into XML or whatever, as the result would be > > disappointing and call for completion in the aftermath. Yet another task > > competing with CLDR survey. > > Please elaborate. It's not clear to me what you mean. > > > Reviewing CLDR data is IMO top priority. > > There are many flaws to be fixed in many languages including in English. > > A lot of useful digest charts are extracted from XML there, > > Which XML? Where? > > > and we really > > need to go through the data and correct the many many errors, please. > > Some time ago I tried to have a close look at the Polish locale and > found the CLDR site prohibitively confusing. > > Best regards > > Janusz > > -- > , > Janusz S. Bien > emeryt (emeritus) > https://sites.google.com/view/jsbien >

From unicode at unicode.org Fri Aug 31 03:36:45 2018 From: unicode at unicode.org (Manuel Strehl via Unicode) Date: Fri, 31 Aug 2018 10:36:45 +0200 Subject: UCD in XML or in CSV?
(was: Re: Unicode Digest, Vol 56, Issue 20) In-Reply-To: <20180831081953.68476d36@spixxi> References: <201808301827.w7UIRbqF028462@unicode.org> <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> <20180831081953.68476d36@spixxi> Message-ID: To handle the UCD XML file, a streaming parser like Expat is necessary. For codepoints.net I use that data to stuff everything into a MySQL database. If anyone is interested, the code for that is open source: https://github.com/Codepoints/unicode2mysql/ The example for handling the large XML file can be found here: https://github.com/Codepoints/unicode2mysql/blob/master/bin/ucd_to_sql.py For me it's currently much easier to have all the data in a single place, e.g. a large XML file, than spread over a multitude of files _with different ad-hoc syntaxes_. The situation would possibly be different, though, if the UCD data were split into several files of the same format. (Be it JSON, CSV, YAML, XML, TOML, whatever. Just be consistent.) Nota bene: that is also true for the emoji data, which as of now consists of five plain text files with similar but not identical formats. Cheers, Manuel On Fri, 31 Aug 2018 at 08:19, Marius Spix via Unicode wrote: > > A good compromise between human readability, machine processability and > filesize would be YAML. > > Unlike JSON, YAML supports comments, anchors and references, multiple > documents in a file and several other features. > > Regards, > > Marius Spix > > > On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via Unicode > wrote: > > > On 30/08/18 23:34 Philippe Verdy via Unicode wrote: > > > > > > Well, an alternative to XML is JSON, which is more compact and > > > faster/simpler to process; > > > > Thanks for pointing out the problem and the solution alike. Indeed, the > > main drawback of the XML format of the UCD is that it results in an > > “insane” filesize. “Insane” was applied to the number of semicolons > > in UnicodeData.txt, but that is irrelevant.
What is really insane is > > the filesize of the XML versions of the UCD. Even without Unihan, it > > may take up to a minute or so to load in a text editor. > > > > > however JSON has no explicit schema, unless the schema is being > > > made part of the data itself, complicating its structure (with many > > > levels of arrays of arrays, in which case it becomes less easy to > > > read by humans, but more adapted to automated processes for fast > > > processing). > > > > > > I'd say that the XML alone is enough to generate any JSON-derived > > > dataset that will conform to the schema an application expects to > > > process fast (and with just the data it can process, excluding > > > various extensions still not implemented). But the fastest > > > implementations are also based on data tables encoded in code (such > > > as DLLs or Java classes), or custom database formats (such as > > > Berkeley DB) also generated automatically from the XML, without the > > > processing cost of decompression schemes and parsers. > > > > > > Still today, even if XML is not the usual format used by > > > applications, it is still the most interoperable format that allows > > > building all sorts of applications in all sorts of languages: the > > > cost of parsing is left to an application builder/compiler. > > > > I've tried an online tool to get ucd.nounihan.flat.xml converted to > > CSV. The tool is great and offers a lot of options, but given the > > "insane" file size, my browser was up for over two hours of trouble > > until I shut down the computer manually. From what I could see in the > > result field, there are many bogus values, meaning that their > > presence is useless in the tags of most characters. And while many > > attributes have cryptic names in order to keep the file size minimal, > > some attributes have overlong values, i.e. the design is inconsistent. > > E.g. in every character we read: jg="No_Joining_Group" That is bogus.
> > One would need to take them off the tags of most characters, and even > > in the characters where they are relevant, the value would be simply > > "No". What's the use of abbreviating "Joining Group" to "jg" in the > > attribute name if in the value it is written out? And I'm quoting from > > U+0000. Further, many values are set to a crosshatch, instead of > > simply being removed from the characters where they are empty. Then > > the many instances of "undetermined script", resulting in *two* > > attributes with the "Zyyy" value. Then in almost each character we're told > > that it is not a whitespace, not a dash, not a hyphen, and not a > > quotation mark: Dash="N" WSpace="N" Hyphen="N" QMark="N" One couldn't > > tell that the UCD actually benefits from the flexibility of XML, > > given that many attributes are systematically present even where they > > are useless. Perhaps ucd-*.xml would be two thirds, half, or one > > third their actual size if they were properly designed.
> > > > (frequently you need to implement some new algorithms to follow the > > > processing rules documented in CLDR, or to use data not completely > > > validated, or to allow applications to provide their overrides from > > > insufficiently complete datasets in CLDR, even if CLDR provides a > > > root locale and applications are supposed to follow the BCP47 > > > fallback resolution rules; applications also have their own needs > > > regarding which language codes they use or need, and CLDR provides many > > > locales that many applications are still not prepared to render > > > correctly, and many application users complain if an application is > > > partly translated and contains too many fallbacks to another > > > language, or worse, to another script). > > > > So the case is even worse than what I could see when looking into > > CLDR. Many countries, including France, don't care about the data of > > their own locale in CLDR, but I'm not going to vent about that on > > Unicode Public, because that involves language offices and > > authorities, and would have political entanglements. > > > > Staying technical, I can tell so far about the file header of > > UnicodeData.txt that I can see zero technical reasons not to add it. > > Processes using the file to generate an overview of Unicode also use > > other files and are thus able to process comments correctly, whereas > > those processes using UnicodeData to look up character properties > > provided in the file would start by searching for the code point. (Perhaps > > there are compilers building DLLs from the file.) > > > > On Thu, 30 Aug 2018 at 20:38, Doug Ewell via Unicode wrote: > > > > > > > > > UnicodeData.txt was devised long before any of the other UCD data > > files. Though it might seem like a simple enhancement to us, adding a > > header block, or even a single line, would break a lot of existing > > processes that were built long ago to parse this file.
> > > > > So Unicode can't add a header to this file, and that is the reason > > the format can never be changed (e.g. with more columns). That is why > > new files keep getting created instead. > > > > > The XML format could indeed be expanded with more attributes and more > > subsections. Any process that can parse XML can handle unknown stuff > > like this without misinterpreting the stuff it does know. > > > > > That's why the only two reasonable options for getting UCD data are > > to read all the tab- and semicolon-delimited files, and be ready for > > new files, or just read the XML. Asking for changes to existing UCD > > file formats is kind of a non-starter, given these two alternatives. > > > > -- > > Doug Ewell | Thornton, CO, US | ewellic.org > > > > -------- Original message -------- > > Message: 3 > > Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST) > > From: Marcel Schneider via Unicode > > > > Curiously, UnicodeData.txt is lacking the header line. That makes it > > inflexible. I never wondered why the header line is missing, probably > > because compared to the other UCD files, the file looks really odd > > without a file header showing at least the version number and > > datestamp. It's like the file was made up for dumb parsers unable to > > handle comment delimiters, and never to be upgraded to do so. > > > > But I like the format, and that's why at some point I submitted > > feedback asking for an extension. [...] > From unicode at unicode.org Fri Aug 31 05:17:41 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 31 Aug 2018 12:17:41 +0200 (CEST) Subject: CLDR (was: Private Use areas) In-Reply-To: <86sh2v3ye5.fsf@mimuw.edu.pl> References: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25> <86sh2v3ye5.fsf@mimuw.edu.pl> Message-ID: <1825242108.55692.1535710661909.JavaMail.www@wwinf1d37> On 31/08/18 07:27 Janusz S. Bień
via Unicode wrote: [...] > > Given NamesList.txt / Code Charts comments are kept minimal by design, > > one couldn't simply pop them into XML or whatever, as the result would be > > disappointing and call for completion in the aftermath. Yet another task > > competing with the CLDR survey. > > Please elaborate. It's not clear to me what you mean. These comments are designed for the Code Charts and as such must not be disproportionate in exhaustivity. E.g. we have lists of related languages ending in an ellipsis. Once this is popped into XML, i.e. extracted from NamesList.txt to be fed into an extensible and unconstrained format (without any constraint as to available space, number and length of comments, and so on), any lack is felt as a discriminating neglect, and there will be a huge rush to add data. Yet Unicode hasn't set up products where that data could be published: not in the Code Charts (for the abovementioned reason), not in ICU insofar as the additional information involved does not match a known demand on the user side (localizing software does not mean providing scholarly exhaustive information about supported characters). The use will be in character pickers providing all available information about a given character. That is why Unicode is to prioritize CLDR for CLDR users, rather than extra information for the web. > > > Reviewing CLDR data is IMO top priority. > > There are many flaws to be fixed in many languages including in English. > > A lot of useful digest charts are extracted from XML there, > > Which XML? Where? More precisely it is LDML, the CLDR-specific XML. What I called "digest charts" are the charts found here: http://www.unicode.org/cldr/charts/34/ The access is via this page: http://cldr.unicode.org/index/downloads where the charts are in the Charts column, while the raw data is under SVN Tag. > > > and we really > > need to go through the data and correct the many many errors, please.
> > Some time ago I tried to have a close look at the Polish locale and > found the CLDR site prohibitively confusing. I experienced some trouble too, mainly because "SVN Tag" is counter-intuitive as the access point to the XML data (except when knowing about SubVersioN). Polish data is found here: https://www.unicode.org/cldr/charts/34/summary/pl.html The access is via the top of the "Summary" index page (showing root data): https://www.unicode.org/cldr/charts/34/summary/root.html You may wish to particularly check the By-Type charts: https://www.unicode.org/cldr/charts/34/by_type/index.html Here I'd suggest first focusing on alphabetic information and on punctuation. https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html Under Latin (table caption, without anchor) we find out what punctuation Polish has compared to other locales using the same script. The exact character appears when hovering over the header row. E.g. U+2011 NON-BREAKING HYPHEN is systematically missing, which is an error in almost every locale using the hyphen. The TC is about to correct that. Further you will see that while Polish uses the apostrophe https://slowodnia.tumblr.com/post/136492530255/the-use-of-apostrophe-in-polish CLDR does not have the correct apostrophe for Polish, as opposed e.g. to French. You may wish to note that from now on, both U+0027 APOSTROPHE and U+0022 QUOTATION MARK are ruled out in almost all locales, given the preferred characters in publishing are U+2019 and, for Polish, the U+201E and U+201D that are already found in CLDR pl. Note however that according to the information provided by English Wikipedia: https://en.wikipedia.org/wiki/Quotation_mark#Polish Polish also uses single quotes, which by contrast are still missing in CLDR. Now you might understand what I meant when pointing out that there are still many errors in many languages in CLDR, including in English. Best regards, Marcel > > Best regards > > Janusz > > -- > Janusz S.
Bien > emeryt (emeritus) > https://sites.google.com/view/jsbien > > From unicode at unicode.org Fri Aug 31 12:50:08 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Fri, 31 Aug 2018 10:50:08 -0700 Subject: UCD in XML or in CSV? In-Reply-To: References: <201808301827.w7UIRbqF028462@unicode.org> <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> <20180831081953.68476d36@spixxi> Message-ID: <76cea59f-f676-8bb8-0380-bb52bdb081fd@att.net> On 8/31/2018 1:36 AM, Manuel Strehl via Unicode wrote: > For codepoints.net I use that data to stuff everything in a MySQL > database. Well, for some sense of "everything", anyway. ;-) People having this discussion should keep in mind a few significant points. First, the UCD proper isn't "everything", extensive as it is. There are also other significant sets of data that the UTC maintains about characters in other formats, as well, including the data files associated with UTS #46 (IDNA-related), UTS #39 (confusables mapping, etc.), UTS #10 (collation), UTR #25 (a set of math-related property values), and UTS #51 (emoji-related). The emoji-related data has now strayed into the CLDR space, so a significant amount of the information about emoji characters is now carried as CLDR tags. And then there is various other information about individual characters (or small sets of characters) scattered in the core spec -- some in tables, some not, as well as mappings to dozens of external standards. There is no actual definition anywhere of what "everything" actually is. Further, it is a mistake to assume that every character property just associates a simple attribute with a code point. There are multiple types of mappings, complex relational and set properties, and so forth. 
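[Even the "simple" semicolon-delimited UCD files illustrate the point that properties are not just one value per code point: a single record may name one code point or a whole range like 0041..005A. The following is a minimal sketch, assuming the Scripts.txt layout, and is not the UTC's own tooling.]

```python
# Minimal sketch (not the UTC's tooling): parse a UCD-style
# semicolon-delimited property file such as Scripts.txt, where each
# record covers either a single code point or a range "XXXX..YYYY",
# and "#" starts a comment.
def parse_ucd_lines(lines):
    """Yield (first, last, value) tuples for each data record."""
    for line in lines:
        line = line.split('#', 1)[0].strip()   # strip comments and blanks
        if not line:
            continue
        fields = [f.strip() for f in line.split(';')]
        cp, value = fields[0], fields[1]
        if '..' in cp:
            first, last = (int(c, 16) for c in cp.split('..'))
        else:
            first = last = int(cp, 16)
        yield first, last, value

sample = [
    "# Scripts-11.0.0.txt (excerpt)",
    "0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..Z",
    "00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR",
]
for first, last, value in parse_ucd_lines(sample):
    print(f"{first:04X}..{last:04X} -> {value}")
```

Note that this handles only the single-file, single-value case; properties with multiple values per code point, or mappings between code points, need a different record shape, which is exactly Ken's point.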
The UTC attempts to keep a fairly clear line around what constitutes the "UCD proper" (including Unihan.zip), in part so that it is actually possible to run the tools that create the XML version of the UCD, for folks who want to consume a more consistent, single-file format version of the data. But be aware that that isn't everything -- nor would there be much sense in trying to keep expanding the UCD proper to actually represent "everything" in one giant DTD. Second, one of the main obligations of a standards organization is *stability*. People may well object to the ad hoc nature of the UCD data files that have been added over the years -- but it is a *stable* ad-hockery. The worst thing the UTC could do, IMO, would be to keep tweaking formats of data files to meet complaints about one particular parsing inconvenience or another. That would create multiple points of discontinuity between versions -- worse than just having to deal with the ongoing growth in the number of assigned characters and the occasional addition of new data files and properties to the UCD. Keep in mind that there is more to processing the UCD than just "latest". People who just focus on grabbing the very latest version of the UCD and updating whatever application they have are missing half the problem. There are multiple tools out there that parse and use multiple *versions* of the UCD. That includes the tooling that is used to maintain the UCD (which parses *all* versions), and the tooling that creates UCD in XML, which also parses all versions. Then there is tooling like unibook, to produce code charts, which also has to adapt to multiple versions, and bidi reference code, which also reads multiple versions of UCD data files. Those are just examples I know off the top of my head. I am sure there are many other instances out there that fit this profile. 
And none of the applications already built to handle multiple versions would welcome having to permanently build in tracking particular format anomalies between specific versions of the UCD. Third, please remember that folks who come here complaining about the complications of parsing the UCD are a very small percentage of a very small percentage of a very small percentage of interested parties. Nearly everybody who needs UCD data should be consuming it as a secondary source (e.g. for reference via codepoints.net), or as a tertiary source (behind specialized API's, regex, etc.), or as an end user (just getting behavior they expect for characters in applications). Programmers who actually *need* to consume the raw UCD data files and write parsers for them directly should actually be able to deal with the format complexity -- and, if anything, slowing them down to make them think about the reasons for the format complexity might be a good thing, as it tends to put the lie to the easy initial assumption that the UCD is nothing more than a bunch of simple attributes for all the code points. --Ken From unicode at unicode.org Fri Aug 31 15:02:27 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Fri, 31 Aug 2018 16:02:27 -0400 Subject: Private Use areas In-Reply-To: <17197721.29445.1535471934478.JavaMail.defaultUser@defaultHost> References: <2869165.40702.1535399404894.JavaMail.root@webmail27.bt.ext.cpcloud.co.uk> <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost> <17197721.29445.1535471934478.JavaMail.defaultUser@defaultHost> Message-ID: <8426ec70-6548-15c4-de89-ab67b044c3be@kli.org> On 08/28/2018 11:58 AM, William_J_G Overington via Unicode wrote: > Asmus Freytag wrote: > >> There are situations where an ad-hoc markup language seems to fulfill a need that is not well served by the existing full-fledged markup languages. 
You find them in internet "bulletin boards" or services like GitHub, where pure plain text is too restrictive but the required text styles purposefully limited - which makes the syntactic overhead of a full-featured mark-up language burdensome. > I am thinking of such an ad-hoc special purpose markup language. > > I am thinking of something like a special purpose version of the FORTH computer language being used but with no user definitions, no comparison operations and no loops and no compiler. Just a straight run through as if someone were typing commands into FORTH in interactive mode at a keyboard. Maybe no need for spaces between commands. For example, circled R might mean use Right-to-left text display. That starts to sound no longer "ad-hoc", but that is not a well-defined term anyway. You're essentially describing a special-purpose markup language or protocol, or perhaps even a programming language. Which is quite reasonable; you should (find some other interested people and) work out some of the details and start writing up parsers and such. > I am thinking that there could be three stacks, one for code points and one for numbers and one for external reference strings such as for accessing a web page or a PDF (Portable Document Format) document or listing an International Standard Book Number and so on. Code points could be entered by circled H followed by circled hexadecimal characters followed by a circled character to indicate Push onto the code point stack. Numbers could be entered in base 10, followed by a circled character to mean Push onto the number stack. A later circled character could mean to take a certain number of code points (maybe just 1, or maybe 0) from the character stack and a certain number of numbers (maybe just 1, or maybe just 0) from the number stack and use them to set some property.
> > It could all be very lightweight software-wise, just reading the characters of the sequence of circled characters and obeying them one by one just one time only on a single run through, with just a few, such as the circled digits, each having its meaning dependent upon a state variable such as, for a circled digit, whether data entry is currently hexadecimal or base 10. I still don't see why you're fixated on using circled characters. You're already dealing with a markup-language type setup, why not do what other markup schemes do? You reserve three or four characters and use them to designate when other characters are not being used in their normal sense but are being used as markup. In XML, when characters are inside '<>' tags, they are not "plain text" of the document, but they mean other things - perhaps things like "right-to-left" or "reference this web page" and so forth, which are exactly the kinds of things you're talking about here. If you don't want to use plain ascii characters because then you couldn't express plain ascii in your text, you're left with exactly the same problem with circled characters: you can't express circled characters in your text. While that is a smaller problem, it can be eliminated altogether by various schemes used by XML or RTF or lightweight markup languages. Reserve a few special characters to give meanings to the others, and arrange for ways to escape your handful of reserved characters so you can express them. More straightforward to say "you have to escape <, >, and & characters" than to say "you have to escape all circled characters." Anyway, this is clearly a whole new high-level protocol you need (or want) to work out, which would *use* Unicode (just like XML and JSON do), but doesn't really affect or involve it (Unicode is all about the "plain text"). Kind of getting off-topic, but get some people interested and start a mailing list to discuss it. Good luck!
~mark From unicode at unicode.org Fri Aug 31 15:11:44 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Fri, 31 Aug 2018 16:11:44 -0400 Subject: Private Use areas In-Reply-To: <19054743.5414.1535444772290.JavaMail.defaultUser@defaultHost> References: <4826651.5138.1535444498189.JavaMail.root@webmail11.bt.ext.cpcloud.co.uk> <19054743.5414.1535444772290.JavaMail.defaultUser@defaultHost> Message-ID: On 08/28/2018 04:26 AM, William_J_G Overington via Unicode wrote: > Hi > > Mark E. Shoulson wrote: > >> I'm not sure what the advantage is of using circled characters instead of plain old ascii. > > My thinking is that "plain old ascii" might be used in the text encoded in the file. Sometimes a file containing Private Use Area characters is a mix of regular Unicode Latin characters with just a few Private Use Area characters mixed in with them. So my suggestion of using circled characters is for disambiguation purposes. The circled characters in the PUAINFO sequence would not be displayed if a special software program were being used to read in the text file, then act upon the information that is encoded using the circled characters. What if circled characters are used in the text encoded in the file? They're characters too, people use them and all. Whenever you designate some characters to be used in a way outside their normal meaning, you have the problem of how to use them *with* their normal meaning. So there are various escaping schemes and all. So in XML, all characters have their normal meanings - except <, >, and &, which mean something special and change the interpretations of other nearby characters (so "bold" is a word in English that appears in the text, but "<b>" is part of an instruction to the renderer that doesn't appear in the text.) And the price is that those three characters have to be expressed differently (&lt;, &gt;, and &amp;).
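[The reserve-and-escape approach Mark describes can be sketched with Python's standard library, which escapes exactly those three characters; a minimal illustration, not part of any proposal in this thread.]

```python
# Sketch of the escaping trade-off: reserve a few characters (<, >, &)
# and escape them, instead of reserving a large block like the circled
# characters. Uses only the Python standard library.
from xml.sax.saxutils import escape, unescape

text = 'Set direction with <rtl> & friends'
markup_safe = escape(text)
print(markup_safe)   # Set direction with &lt;rtl&gt; &amp; friends
assert unescape(markup_safe) == text   # the escaping is reversible
```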
I don't really see what you gain by branding some large swath of Unicode ("circled characters") as "special" and not meaning their usual selves, and for that matter making these hard-to-type characters *necessary* for using your scheme, when you could do something like what XML does, and say "everything between < and > is to be interpreted specially, and there, these characters have the following meanings" and then have some other way of expressing those two reserved characters. (Not saying you need to do it XML's way, but something like that: reserve a small number of characters that have to be escaped, not some huge chunk.) > > My thinking is that using this method just adds some encoded information at the start of the text file and does not require the whole document to become designated as a file conformant to a particular markup format. That's another way of saying that this is a markup format which accepts a large variety of plain texts. Because you ARE talking about making a "particular markup format," just a different and new one. I guess there's not even any reason for me to argue the point, though, since it is up to you how to design your markup language, and you can take advice (or not) from anyone you like. Draw up some design, find some interested people, start a discussion, and work it out. (But not here; this list is for discussing Unicode.)
~mark From unicode at unicode.org Fri Aug 31 15:43:49 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Fri, 31 Aug 2018 21:43:49 +0100 (BST) Subject: Private Use areas In-Reply-To: <8426ec70-6548-15c4-de89-ab67b044c3be@kli.org> References: <2869165.40702.1535399404894.JavaMail.root@webmail27.bt.ext.cpcloud.co.uk> <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost> <17197721.29445.1535471934478.JavaMail.defaultUser@defaultHost> <8426ec70-6548-15c4-de89-ab67b044c3be@kli.org> Message-ID: <21973754.43151.1535748229258.JavaMail.defaultUser@defaultHost> Hi Thank you for your posts from earlier today. Actually I learned about JSON yesterday and I am thinking that using JSON could well be a good idea. I found a helpful page with diagrams. http://www.json.org/ Although I hope that a format of recording information about the properties of particular uses of Private Use Area characters will become implemented as a practicality, and that that format can be applied in practice where desired, and indeed I would be happy to participate in a group project, I do not know enough about Unicode properties to play a major role or to lead such a project. 
William Overington Friday 31 August 2018 From unicode at unicode.org Fri Aug 31 15:59:06 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Fri, 31 Aug 2018 21:59:06 +0100 (BST) Subject: Private Use areas In-Reply-To: <21973754.43151.1535748229258.JavaMail.defaultUser@defaultHost> References: <2869165.40702.1535399404894.JavaMail.root@webmail27.bt.ext.cpcloud.co.uk> <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost> <17197721.29445.1535471934478.JavaMail.defaultUser@defaultHost> <8426ec70-6548-15c4-de89-ab67b044c3be@kli.org> <21973754.43151.1535748229258.JavaMail.defaultUser@defaultHost> Message-ID: <16515804.43443.1535749146105.JavaMail.defaultUser@defaultHost> Hi I have now found the following document. http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf William Overington Friday 31 August 2018 From unicode at unicode.org Fri Aug 31 23:18:32 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 1 Sep 2018 06:18:32 +0200 (CEST) Subject: UCD in XML or in CSV?
In-Reply-To: <76cea59f-f676-8bb8-0380-bb52bdb081fd@att.net> References: <201808301827.w7UIRbqF028462@unicode.org> <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> <20180831081953.68476d36@spixxi> <76cea59f-f676-8bb8-0380-bb52bdb081fd@att.net> Message-ID: <1369576173.49.1535775512871.JavaMail.www@wwinf1d31> On 31/08/18 19:59 Ken Whistler via Unicode wrote: [...] > Second, one of the main obligations of a standards organization is > *stability*. People may well object to the ad hoc nature of the UCD data > files that have been added over the years -- but it is a *stable* > ad-hockery. The worst thing the UTC could do, IMO, would be to keep > tweaking formats of data files to meet complaints about one particular > parsing inconvenience or another. That would create multiple points of > discontinuity between versions -- worse than just having to deal with > the ongoing growth in the number of assigned characters and the > occasional addition of new data files and properties to the UCD. I did not want to make trouble asking for moving conventions back and forth. I'd like to learn why UnicodeData.txt was released as a draft without a header or anything, given Unicode knew well in advance that the scheme adopted at first release would be kept stable for decades or forever. Then I'd like to learn how Unicode came to not devise a consistent scheme for all the UCD files, if any such could be devised, so that people could assess whether complaints about inconsistencies are well-founded or not. It is not enough for me that a given ad-hockery is stable; IMO it should also be well-designed, as a standards body is answerable to history. That is not what one can say about UnicodeData.txt, although it is the only effectively formatted file in the UCD for streamlined processing. Was there not enough time to think about a header line and a file header?
With the header line it would be flexible, and all the problems would be solved by specifying that parsers should start by counting the number of fields before creating storage arrays. We are lacking a real history of Unicode, explaining why everybody was in a hurry. "Authors falling like flies" is the only hint that comes to mind. And given that Unicode appears to have missed that opportunity, I'd like to discuss whether it would be time to add a more accomplished file for better usability. > > Keep in mind that there is more to processing the UCD than just > "latest". People who just focus on grabbing the very latest version of > the UCD and updating whatever application they have are missing half the > problem. There are multiple tools out there that parse and use multiple > *versions* of the UCD. That includes the tooling that is used to > maintain the UCD (which parses *all* versions), and the tooling that > creates UCD in XML, which also parses all versions. Then there is > tooling like unibook, to produce code charts, which also has to adapt to > multiple versions, and bidi reference code, which also reads multiple > versions of UCD data files. Those are just examples I know off the top > of my head. I am sure there are many other instances out there that fit > this profile. And none of the applications already built to handle > multiple versions would welcome having to permanently build in tracking > particular format anomalies between specific versions of the UCD. That point is clear to me, and even when suggesting to make changes to BidiMirrored.txt, I had alternatives with a stable existing file and a new enhanced file. But what is totally unclear to me is what role old versions play in compiling the latest data. Deltas are OK, research on a particular topic in old data is OK, but what does it mean to need to parse *all* versions to get the newest products?
> > Third, please remember that folks who come here complaining about the > complications of parsing the UCD are a very small percentage of a very > small percentage of a very small percentage of interested parties. > Nearly everybody who needs UCD data should be consuming it as a > secondary source (e.g. for reference via codepoints.net), or as a > tertiary source (behind specialized API's, regex, etc.), or as an end > user (just getting behavior they expect for characters in applications). > Programmers who actually *need* to consume the raw UCD data files and > write parsers for them directly should actually be able to deal with the > format complexity -- and, if anything, slowing them down to make them > think about the reasons for the format complexity might be a good thing, > as it tends to put the lie to the easy initial assumption that the UCD > is nothing more than a bunch of simple attributes for all the code points. That makes no sense to me. UCD raw data is and remains a primary source; I see no way to consume it as a secondary source or as a tertiary source. Do you mean to consume it via secondary or tertiary sources? Then we actually appear to consume those sources instead of the UCD raw data. These sources are fine for the purpose of getting information about some particular code points, but most of the tools I remember don't allow filtering values and computing overviews, nor adding data, as we can do in spreadsheet software. Honestly, are we so few people using Excel for Unicode data? Even Excel Starter, which I have, is a great tool helping to perform tasks I fail to get done with other tools, even spreadsheet software. So I beg you to please spare me the tedious conversion from the clumsy UCD XML file to handy CSV. BTW the former could be cleaned up. As already said, if this discussion has a positive outcome, a request in due form may follow. But I have no time to work out any more papers if there is no point. Regards, Marcel
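[A streaming XML-to-CSV conversion of the kind discussed in this thread can be sketched in a few lines of Python. The element and attribute names ("char", "cp", "na", "gc", "sc") and the namespace follow UAX #42; the file paths and the choice of columns are assumptions for illustration.]

```python
# Hedged sketch: stream ucd.nounihan.flat.xml with
# ElementTree.iterparse so the whole file never sits in memory,
# and emit a few chosen attributes as semicolon-delimited CSV.
import csv
import xml.etree.ElementTree as ET

# Namespace of the UCD in XML, per UAX #42.
NS = '{http://www.unicode.org/ns/2003/ucd/1.0}'

def ucd_to_csv(xml_path, csv_path, attrs=('cp', 'na', 'gc', 'sc')):
    with open(csv_path, 'w', newline='') as out:
        writer = csv.writer(out, delimiter=';')
        writer.writerow(attrs)                      # header line
        for _, elem in ET.iterparse(xml_path):      # 'end' events only
            # Range records use first-cp/last-cp instead of cp;
            # this sketch keeps only single-code-point records.
            if elem.tag == NS + 'char' and 'cp' in elem.attrib:
                writer.writerow(elem.get(a, '') for a in attrs)
                elem.clear()                        # free memory as we go
```

Called as, e.g., ucd_to_csv('ucd.nounihan.flat.xml', 'ucd.csv'), this produces a header line followed by one record per code point, which a spreadsheet can filter and sort.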