From nobody_uses at outlook.com  Thu Sep  1 23:39:53 2016
From: nobody_uses at outlook.com (eduardo marin)
Date: Fri, 2 Sep 2016 04:39:53 +0000
Subject: Missing block element characters
Message-ID:

It has come to my attention that four semi-graphics characters from the ZX-81 character set are currently unencoded:

https://en.wikipedia.org/wiki/File:ZX81.chars.00-0A.80-8A.png

They are the last four characters on the right, for which I propose the following names: UPPER HALF BLOCK MEDIUM SHADE, LOWER HALF BLOCK MEDIUM SHADE, FULL BLOCK-UPPER HALF MEDIUM SHADE and FULL BLOCK-LOWER HALF MEDIUM SHADE. I recommend encoding them in the Miscellaneous Symbols and Arrows block or in the Geometric Shapes Extended block.

While one compelling reason for encoding is completing this obsolete character set (allowing for emulation), a much more convincing case (in my opinion) is that it allows greater artistic freedom for anybody decorating their text with Unicode or even creating illustrations:

https://www.google.com.mx/search?q=utf-8+art&source=lnms&tbm=isch&sa=X&ved=0ahUKEwi406KI4-_OAhVL7SYKHdxfAUsQ_AUICCgB#imgrc=X9T-ssHdaBoJoM%3A

The second argument also implies encoding the vertical counterparts of these characters and many more variants of block elements with half shading, but a proposal for just these four is a great start before considering and measuring the artistic implications.

Another set of missing characters, needed for Atari ST emulation, is ATARI LOGO LEFT HALF, ATARI LOGO RIGHT HALF, SMILING MAN WITH PIPE UPPER LEFT, SMILING MAN WITH PIPE UPPER RIGHT, SMILING MAN WITH PIPE LOWER LEFT and SMILING MAN WITH PIPE LOWER RIGHT.
These characters are much more objectionable due to their specificity and possible (although unlikely) trademark issues. The set of digits drawn as if on a seven-segment display could be considered a font variation. There is also what appears to be a negative diagonal and a lozenge, and those are even weirder:

https://en.wikipedia.org/wiki/Atari_ST_character_set

From christoph.paeper at crissov.de  Fri Sep  2 05:08:09 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Fri, 2 Sep 2016 12:08:09 +0200
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and Male/Mars Signs
In-Reply-To: <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de>
References: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de> <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de>
Message-ID: <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>

Christoph Päper:
>
> I just learned that recent Samsung phones already contain emoji representations for many of these symbols.

And I’ve finally seen . A pointer would have been nice.

From mark at macchiato.com  Fri Sep  2 05:42:44 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Fri, 2 Sep 2016 12:42:44 +0200
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and Male/Mars Signs
In-Reply-To: <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>
References: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de> <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de> <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>
Message-ID:

In order to understand the status of any document in the registry, you need to also look at the minutes of the meeting where they are discussed, in this case:

http://www.unicode.org/L2/L2016/16121.htm

What you see there is:

B.14.3 Provisional value for Emoji property [Emoji SC/Edberg, L2/16-087]
B.14.3.1 Characters Proposed for Emoji=Provisional [Emoji SC/Edberg, L2/16-088]
Discussion. UTC took no action at this time.

"Took no action" generally means "rejected".

Mark

On Fri, Sep 2, 2016 at 12:08 PM, Christoph Päper <christoph.paeper at crissov.de> wrote:

> Christoph Päper:
> >
> > I just learned that recent Samsung phones already contain emoji
> > representations for many of these symbols.
>
> And I’ve finally seen L2016/16087-provisional-value-for-emoji.pdf>. A pointer would have been
> nice.

From christoph.paeper at crissov.de  Fri Sep  2 06:01:46 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Fri, 2 Sep 2016 13:01:46 +0200
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To: <20160831092504.665a7a7059d7ee80bb4d670165c8327d.2c28aa3363.wbe@email03.godaddy.com>
References: <20160831092504.665a7a7059d7ee80bb4d670165c8327d.2c28aa3363.wbe@email03.godaddy.com>
Message-ID:

Doug Ewell:
> [FAZ:]
>> ("The Unicode Consortium appears like a reissue of Orwell's Ministry
>> of Truth, which replaced the English language by a new one, sweeped
>> clean from harmful terms, and which removed "unorthodox" connotations
>> from the rest of the words.")
>
> I can imagine people with time on their hands criticizing Apple for
> changing the glyph, but how did the Unicode Consortium itself get
> dragged into this? What obvious thing am I missing?

From the FAZ article:
>> Doch nach Darstellung des Portals „Buzzfeed“ intervenierten Apple und Microsoft bei dem für die Standardisierung von Emojis zuständigen Konsortium und verhinderten so die Aufnahme des Gewehrsymbols in den Unicode. Waffenkontrolle funktioniert heute also mit ein paar Code-Änderungen.

My translation:
>> According to the news portal “Buzzfeed”, Apple and Microsoft intervened at the consortium responsible for standardizing emojis and thereby prevented the addition of the Rifle symbol into Unicode. Gun control hence works by changing a bit of code nowadays.

Newspapers, especially big and traditional ones, notoriously don’t hyperlink their sources. The author, Adrian Lobe, probably references the first of the following Buzzfeed articles by Charlie Warzel (I remembered the Ars Technica article which is similar in tone):

- - -

The finer details of inclusion in Unicode, `Emoji=Yes`, `Emoji_Presentation=Yes`, inclusion in emoji picker GUIs or default fonts don’t really matter to the general public.
Two characters were added in TUS9 to a block otherwise only used for (non-compatibility) emojis without making them proper emojis. That’s fishy, even to the halfwits. UTC would have been better off not encoding U+1F946 Rifle and U+1F93B Modern Pentathlon at all. The current situation is a half-assed compromise that pleases nobody, and it’s really no wonder it gets mixed up with the more recent controversy about the default glyph for 🔫 U+1F52B Pistol in current beta versions of Apple’s OSs.

From leoboiko at namakajiri.net  Fri Sep  2 08:58:49 2016
From: leoboiko at namakajiri.net (Leonardo Boiko)
Date: Fri, 2 Sep 2016 10:58:49 -0300
Subject: Emoji semantic drift
Message-ID:

This isn't news, but I find it interesting how some emoji are being used in ways that differ from their Unicode names, reflecting alternative interpretations of common glyphs. I'll compare data from the Unicode chart with interpretations taken from Emojipedia, which I think do reflect real-world usage:

U+1F617 KISSING FACE 😗
Current keywords: face | kiss
→ whistle (= nonchalance; happiness)
http://emojipedia.org/kissing-face/

U+1F481 INFORMATION DESK PERSON 💁 ≊ person tipping hand
Keywords: hand | help | information | sassy | tipping
→ sassy; hair flick
http://emojipedia.org/information-desk-person/

U+1F601 GRINNING FACE WITH SMILING EYES 😁
Keywords: face | grin
→ grimace (discomfort, pain)
http://emojipedia.org/grinning-face-with-smiling-eyes/

U+1F624 FACE WITH LOOK OF TRIUMPH 😤 ≊ face with steam from nose
Keywords: face | triumph | won
→ angry; frustration; contemptuous
http://emojipedia.org/face-with-look-of-triumph/

I see that *some* of those alternative readings are registered in the Unicode table as ≊ , while others are present in keywords, and still others are absent. Are there any criteria for that? Is someone trying to keep track of emoji in use?
I think distributional methods are promising, as shown by Thomas Dimson:

http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji

By this we find that, for example, U+1F46F WOMAN WITH BUNNY EARS, also marked as *≊ people partying*, has additional connotations of sisterhood, specifically female friendship and loyalty (#sistasista, #sistersforlife, #sistersister, #bestiesforlife, yearsoffriendship, #sisterfromanothermister, #morelikesisters, #bffl, #bestiesfortheresties, #bestfriendsforever).

U+1F647 PERSON BOWING DEEPLY 🙇 is seeing use as a marker of worry or shame ("late night thoughts", "deleting later", "in my feelings", "laughing but very serious"); probably due to the emotion lines drawn on most fontsets.

From irgendeinbenutzername at gmail.com  Fri Sep  2 08:53:38 2016
From: irgendeinbenutzername at gmail.com (Charlotte Buff)
Date: Fri, 2 Sep 2016 15:53:38 +0200
Subject: Missing block element characters
Message-ID:

I'd argue there's a fifth character missing from the ZX-81 set. Notice how there are two separate MEDIUM SHADEs, one the inverse of the other. For complete compatibility Unicode would also somehow need a second one of those.

From christoph.paeper at crissov.de  Fri Sep  2 14:03:03 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Fri, 2 Sep 2016 21:03:03 +0200
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and Male/Mars Signs
In-Reply-To:
References: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de> <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de> <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>
Message-ID:

Mark Davis ☕️
wrote:
>
> In order to understand the status of any document in the registry, you need to also look at the minutes of the meeting where they are discussed,

I’ve said it before: Unicode is lacking a proper public issue tracker.

From christoph.paeper at crissov.de  Sat Sep  3 00:08:39 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Sat, 3 Sep 2016 07:08:39 +0200
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and Male/Mars Signs
In-Reply-To:
References: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
Message-ID:

An HTML attachment was scrubbed...

From mark at macchiato.com  Sat Sep  3 00:29:42 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Sat, 3 Sep 2016 07:29:42 +0200
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and Male/Mars Signs
In-Reply-To:
References: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de> <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de> <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>
Message-ID:

There are three main Unicode technical committees: we use an issue tracker for two of them, but not the UTC (which established its processes earlier). Personally, I'd favor an issue tracker for the UTC as well, but decisions as to the process are determined by the committee.

(Note again that nothing said on this list will be taken up by the UTC unless someone submits a proposal to the UTC.)

Mark

On Fri, Sep 2, 2016 at 9:03 PM, Christoph Päper wrote:

> Mark Davis ☕️:
> >
> > In order to understand the status of any document in the registry, you
> > need to also look at the minutes of the meeting where they are discussed,
>
> I’ve said it before: Unicode is lacking a proper public issue tracker.

From mark at macchiato.com  Sat Sep  3 01:17:08 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Sat, 3 Sep 2016 08:17:08 +0200
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and Male/Mars Signs
In-Reply-To: <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
References: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
Message-ID:

As to your points below.

There is great demand for the choice between female and male, and there is a specific proposal in E4.0. I have no doubt that it will be accepted.

The addition of emoji is iterative: the acceptance of male and female forms doesn't preclude a neutral option in a later version. As I've said, people are actively working on that, and we'll see what they come up with.

I strongly doubt that anyone would be receptive to your ~3,000 characters. Nor do I think that that many characters are required to "satisfy user expectations". There were earlier proposals to look at somewhat broadening the number of existing emoji. There is a very real cost to supporting more emoji characters, and the committee is being prudent about the number of new emoji it supports. Some vendors already give additional Unicode characters an emoji appearance; that is not forbidden.

You aren't wasting your time if you present a grounded proposal for reasonably-sized additional sets of characters, based on expected usage and other criteria we've outlined, not just "is related". This applies whether you are adding new characters or making existing characters Emoji: for example, a set of body parts as you mention. If you look at successful proposals from the past, such as for additional sports symbols, those did not try to propose the addition of all possible emoji of that type (e.g. all human sports activities, or all species of animals), but rather looked at the most popular sports.

The UTC is not trying to scare away input.
We have accepted many proposals from "small and independent parties". It *does* mean that such independent parties have to provide a good argument for their proposals, based on usage and other criteria.

Mark

On Thu, Aug 25, 2016 at 4:52 PM, Christoph Päper <christoph.paeper at crissov.de> wrote:

> TL;DR: Unicode properties should reflect user expectations, not vendor
> choices.
>
> Mark Davis ☕️:
> > On Mon, Aug 22, 2016 at 11:26 PM, Christoph Päper <christoph.paeper at crissov.de> wrote:
> >> 1. it’s incomplete without an explicit neutral/ambiguous alternative and
> >
> > As I said, people are actively investigating what to do about such
> > cases. It may be that the solution is to add ⚲ U+26B2 Neuter, but maybe
> > not. We'll see as they develop further.
>
> Natively speaking a language which can explicitly mark any actor noun with
> a morpheme as female/feminine, but neither as neutral nor as male/masculine
> – a generic version of English “actor/actress”, “waiter/waitress”,
> “prince/princess” – and having intensely dealt with guidelines for
> corporate languages and public speech, I’ll assure you that a feminism/LGBT
> shitstorm will be heading for UTC and vendors if binary gender became
> mandatory for profession emojis. You should not approve Google’s and
> Apple’s ZWJ sequences without a neutral option.

[snip]

> Sorry, this got long.
>

yes. shorter is better

From charupdate at orange.fr  Sat Sep  3 11:30:38 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 3 Sep 2016 18:30:38 +0200 (CEST)
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
Message-ID: <1408976176.8391.1472920238977.JavaMail.www@wwinf1c26>

On Wed, 31 Aug 2016 09:25:05 -0700, Doug Ewell wrote:

>> […]
>
> So I took another look and saw that:
>
> (1) U+1F946 RIFLE has the following cross-reference in NamesList.txt:
>
> = marksmanship, shooting, hunting
>
> which does not include any mention of squirt guns or water pistols, or
> generally bowdlerizing the image or changing the intent of this code
> point;
>
> (2) Section 22.9 "Miscellaneous Symbols" in TUS 9.0 does not make any
> mention of modifying the RIFLE glyph, or symbol glyphs in general, so as
> to alter their meaning;
>
> (3) the code chart at http://www.unicode.org/charts/PDF/U1F900.pdf
> clearly shows a rifle, and not any other type of gun or non-gun.
>
> I can imagine people with time on their hands criticizing Apple for
> changing the glyph, but how did the Unicode Consortium itself get
> dragged into this? What obvious thing am I missing?

Don’t mind. That’s just another example of “Unicode bashing” that is sometimes found in European papers since the beginning of the public existence of the Standard. That is typically issued by people who didn’t learn much about the topic they’re writing on (well, like I didn’t when I started mailing here…). The underlying spirit is IMHO found also in the first ISO 10646 chief editor’s attitude when he enforced bad (wrong, inconsistent or useless) character names just to make for Europe’s superiority (forgetting BTW that “CARON” was originally US-American internal use standardese).

Now that Christoph Päper found out [1] that the FAZ authors most probably *did* read the BuzzFeed article I found with Bing and posted on this Mailing List [2], the issue is complicated in that there is obviously some dishonest handling of core information by the FAZ authors, except in the case that they were unable to understand the difference between a character encoding refusal and an emoji property value change, or – as of the PISTOL emoji – the difference between a character and a glyph.
Apple could have made use of the possibility to shift the meaning of an emoji – a not uncommon phenomenon, according to Leonardo Boiko’s last findings [3]. Actually they didn’t have much choice, being urged to hide from the public area as far as possible the meaning of a fire weapon. The really troublesome thing is that German newspaper journalists are eager to promote guns, rifles and other pistols for interpersonal messaging. As I already said, much of the latter is performed by children. That’s the biggest reason why I find it OK that no effective pistols be provided in image. It seems that this FAZ article was written by some unmarried, unresponsive beginners.

However, since they talk of the RIFLE character as if it didn’t exist in Unicode (and not only were “missing” amidst the iOS emoji), it’s hard for me to make any sense except by considering those utterings as a kind of neonazi propaganda junk (despite of the renown of the newspaper itself) due most probably to the fact that the responsible chief editor was on holidays.

So as I said: Don’t mind.

Marcel

[1] http://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0003.html
[2] http://www.unicode.org/mail-arch/unicode-ml/y2016-m08/0091.html
[3] http://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0004.html

From asmusf at ix.netcom.com  Sat Sep  3 16:21:22 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Sat, 3 Sep 2016 14:21:22 -0700
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To: <1408976176.8391.1472920238977.JavaMail.www@wwinf1c26>
References: <1408976176.8391.1472920238977.JavaMail.www@wwinf1c26>
Message-ID: <59355a68-f064-bd3a-a800-d6c81f90f972@ix.netcom.com>

I don't think that there should be a place on this list for accusing people of dishonesty and / or spreading "neo-nazi junk"; and I don't know what the marriage status of the editors has to do with anything.

The central concern of the FAZ article appears to be the role that private entities play as gate-keepers of modern communication. That's actually a valid concern (see issues like net-neutrality, algorithm based search returns and news-feeds and the like). The fact that fine distinctions of a technical nature may have been handled with less precision than insiders would prefer, is perhaps sloppy, but pretty typical for journalism in general.

None of that warrants the kind of loaded language used here.

A./

PS: must admit, I haven't followed the FAZ in a while, so I have no personal knowledge of any changes that may have happened in recent years, but in earlier times the Feuilleton (the section that this article appeared in) used to be fairly liberal in outlook, certainly not given to the extremist views that they are accused of here. And I can detect no evidence that the charges below have any merit.

On 9/3/2016 9:30 AM, Marcel Schneider wrote:
> ...t there is obviously
> some dishonest handling of core information by the FAZ authors, except
> in the case that they were unable to understand the difference between
> a character encoding refusal and an emoji property value change, or – as
> of the PISTOL emoji – the difference between a character and a glyph.
> ... It seems that this FAZ article was written by some
> unmarried, unresponsive beginners.
>
> However, since they talk of the RIFLE character as if it didn’t exist in
> Unicode (and not only were “missing” amidst the iOS emoji), it’s hard
> for me to make any sense except by considering those utterings as a kind
> of neonazi propaganda junk (despite of the renown of the newspaper itself)
> due most probably to the fact that the responsible chief editor was on
> holidays.

From charupdate at orange.fr  Sat Sep  3 20:10:58 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 4 Sep 2016 03:10:58 +0200 (CEST)
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To: <59355a68-f064-bd3a-a800-d6c81f90f972@ix.netcom.com>
References: <1408976176.8391.1472920238977.JavaMail.www@wwinf1c26> <59355a68-f064-bd3a-a800-d6c81f90f972@ix.netcom.com>
Message-ID: <112979824.3.1472951458962.JavaMail.www@wwinf1c02>

On Sat, 3 Sep 2016 14:21:22 -0700, Asmus Freytag (c) wrote:

> I don't think that there should be a place on this list for accusing
> people of dishonesty and / or spreading "neo-nazi junk"; and I don't
> know what the marriage status of the editors has to do with anything.
>
> The central concern of the FAZ article appears to be the role that
> private entities play as gate-keepers of modern communication. That's
> actually a valid concern (see issues like net-neutrality, algorithm
> based search returns and news-feeds and the like). The fact that fine
> distinctions of a technical nature may have been handled with less
> precision than insiders would prefer, is perhaps sloppy, but pretty
> typical for journalism in general.
>
> None of that warrants the kind of loaded language used here.
>
> A./
>
> PS: must admit, I haven't followed the FAZ in a while, so I have no
> personal knowledge of any changes that may have happened in recent
> years, but in earlier times the Feuilleton (the section that this
> article appeared in) used to be fairly liberal in outlook, certainly not
> given to the extremist views that they are accused of here. And I can
> detect no evidence that the charges below have any merit.

I admit that I mistook my language level. In front of the long discussion on the Unicode List triggered by that FAZ paper, I’ve ended up bursting out.

The main concern, as Doug Ewell’s last question underscores it, is whether the attack against the Unicode Consortium is justified in any way, or is mere calumny.

Further research brings up that the author of the paper is a very young freelance journalist.[1] That confirms my suspicion that having no children in age of using an iPhone, he doesn’t feel concerned with Apple’s choice. However he is right in that, at the very end of his article he points out the risk of the waterpistol emoji being intended as such and received on Android:

„Die in Codes formulierte Entwaffnungspolitik kehrt sich in ihr Gegenteil, wenn ein iPhone-Nutzer seine Freunde zu einer Wasserschlacht einlädt und ihnen per SMS ein Wasserpistolen-Emoji schickt: Dann erscheint auf dem Samsung-Gerät keine Wasserpistole, sondern ein Revolver. Und das könnten die Empfänger womöglich missverstehen.“

[revised Google translation: “The disarmament policy that is formulated in codes is reversed into its opposite when an iPhone user invites his friends to a water fight and sends them a water pistol emoji by SMS: Then the Samsung device does not display a water pistol, but a revolver. Something that the receivers could possibly misunderstand.”]

Arguing by this very rare case is consistent with the facts-twisting used in other parts of the article. This casts a crude twilight on the author’s approach. Harsh wording such as „doppelzüngig“ (deceitful, speaking of Apple and Microsoft); „schleift das Recht auf freie Meinungsäußerung“ (grinds the right on Free Speech, as of Apple), pointed as a „Skandal“; „zeugt von einer verqueren Sicht der Dinge“ (brings evidence of an awry/askew/screwy point of view) contrasts with an obvious lack of knowledge when talking of Unicode as both proposing and accepting emoji…

As is outlined by a reader’s comment, while emoji are formally in the first place, the demonstration is biased with a mix-up involving speech, then applied to emoji to make the reader believe that the Orwell-reminiscence is really well-placed.

Unicode and big tech companies are always patient targets for attacks of that kind. As pointed by another commenting reader: A tempest in a teapot.

I remember the FAZ feuilleton over a decade ago too, appearing to me always as high-quality journalism. A quick look at the last article from the same author[2] makes me believe that truth and accuracy still conform to the standard. In return, I’m left back with the troublesome question: Why do they hate Unicode, Apple and Microsoft?

Marcel

[1] https://www.linkedin.com/in/adrian-lobe-aa3057b7
[2] http://www.faz.net/aktuell/feuilleton/debatten/wer-haftet-fuer-luegen-die-algorithmen-verbreiten-14416032.html

From christoph.paeper at crissov.de  Sun Sep  4 01:08:55 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Sun, 4 Sep 2016 08:08:55 +0200
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To: <59355a68-f064-bd3a-a800-d6c81f90f972@ix.netcom.com>
References: <1408976176.8391.1472920238977.JavaMail.www@wwinf1c26> <59355a68-f064-bd3a-a800-d6c81f90f972@ix.netcom.com>
Message-ID: <1877D788-19A5-4ABD-B594-35E62245FE04@crissov.de>

Asmus Freytag (c):
>
> The central concern of the FAZ article appears to be the role that private entities play as gate-keepers of modern communication. That's actually a valid concern (…). The fact that fine distinctions of a technical nature may have been handled with less precision than insiders would prefer, is perhaps sloppy, but pretty typical for journalism in general.

Exactly. Anybody who becomes aware of being considered a gatekeeper (i.e. a mild version of a “censor”) by the general public should not react by dismissal but reflection!

The FAZ is generally considered conservative by German/European standards, but would still be considered rather liberal in the US. Be assured that it takes a rather restrictive stand when it comes to *actual* gun control (at least by international standards). Within the spectrum of traditional German newspapers, it usually is quite on the pro side of capitalism and trans-Atlantic friendship (i.e. its policy is not “anti-American”). The fear of being controlled or restricted by big (US-based) corporations or faceless bureaucrats, however, is shared by many left and right-wing authors. The state itself – unlike in “1984” or NRA propaganda – is generally not seen as the enemy in German media. The reference to Orwell’s dystopia was hence badly chosen, but it is probably the one best known – besides “Brave New World” – among the readership.

From c933103 at gmail.com  Sun Sep  4 05:48:04 2016
From: c933103 at gmail.com (gfb hjjhjh)
Date: Sun, 4 Sep 2016 18:48:04 +0800
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To: <112979824.3.1472951458962.JavaMail.www@wwinf1c02>
References: <1408976176.8391.1472920238977.JavaMail.www@wwinf1c26> <59355a68-f064-bd3a-a800-d6c81f90f972@ix.netcom.com> <112979824.3.1472951458962.JavaMail.www@wwinf1c02>
Message-ID:

Even after you admitted that you mistook your language level, your reply is still full of ad hominem. Being very young does not mean having no children old enough to use an iPhone, and even if he has no such children, that does not mean he is unconcerned about children possibly being exposed to hate comments. In my opinion, concerns about what children might be exposed to should be dealt with through parental controls, not through something that affects every user. In some war-torn countries, guns are unfortunately still part of daily life for some people, and they are Unicode users too. They use emoji too. Please at least try to pretend to be less US-centric.

Calling an interoperability problem a very rare case is just pretending the problem does not exist; think about how the yellow heart looked in Android 4.4. The words used in the article are in my opinion fairly mild and indirect; at least the author is not writing like some other reports I have read earlier, which claim it is yet another example of how America dictates the online world and uses its technological advantage to force the rest of the world to sacrifice for it. In the Google-translated text of the report I don't see any particularly big problem with the author's understanding of Unicode's proposal and acceptance procedure, and I am afraid you might have misunderstood something between the lines. Also, the author's point is not about problems in the proposal/acceptance procedure; instead he's talking about how such a use of the existing procedure could lead to undesirable effects.

You cannot deny that emoji are used as part of speech, even when they are mostly informal. For instance, in an election taking place in my city today, some candidates are forced to use emoji in their political campaigns to avoid censorship. It is likely that if people can't find their desired emoji in the emoji list, they will turn to other, less fitting emoji and thus avoid the expression they originally intended to use. Determining what does or does not appear in the emoji list for non-technical reasons surely opens up a new path for governmental bodies around the world to control people's expression. For instance, if you set your locale to China, the flag for Taiwan disappears from your keyboard if you're using the native keyboard on phones from some brands. Or think about a hypothetical situation in which a certain Islamic country requires that all phones sold in it show a hijab on all the women emoji.

If you think such a rational article was written with hate, then maybe you have already assumed that any negative opinion equals hate.

Ckyu.

On 4 Sep 2016 at 09:13, "Marcel Schneider" wrote:
> On Sat, 3 Sep 2016 14:21:22 -0700, Asmus Freytag (c) wrote: > > > I don't think that there should be a place on this list for accusing > > people of dishonesty and / or spreading "neo-nazi junk"; and I don't > > know what the marriage status of the editors has to do with anything. > > > > The central concern of the FAZ article appears to be the role that > > private entities play as gate-keepers of modern communication. That's > > actually a valid concern (see issues like net-neutrality, algorithm > > based search returns and news-feeds and the like). The fact that fine > > distinctions of a technical nature may have been handled with less > > precision than insiders would prefer, is perhaps sloppy, but pretty > > typical for journalism in general. > > > > None of that warrants the kind of loaded language used here. > > > > A./ > > > > PS: must admit, I haven't followed the FAZ in a while, so I have no > > personal knowledge of any changes that may have happened in recent > > years, but in earlier times the Feuilleton (the section that this > > article appeared in) used to be fairly liberal in outlook, certainly not > > given to the extremist views that they are accused of here. And I can > > detect no evidence that the charges below have any merit. > > I admit that I mistook my language level. In front of the long discussion > on > the Unicode List triggered by that FAZ paper, I?ve ended up bursting out. > > The main concern as Doug Ewell?s last question underscores it, is whether > the > attack against the Unicode Consortium is justified in any way, or is mere > calumny. > > Further research brings up that the author of the paper is a very young > freelance > journalist.[1] That confirms my suspicion that having no children in age > of using > an iPhone, he doesn?t feel concerned with Apple?s choice. 
However he is right in that, at the very end of his article he points out the risk of the waterpistol emoji being intended as such and received on Android: > > “Die in Codes formulierte Entwaffnungspolitik kehrt sich in ihr Gegenteil, wenn ein iPhone-Nutzer seine Freunde zu einer Wasserschlacht einlädt und ihnen per SMS ein Wasserpistolen-Emoji schickt: Dann erscheint auf dem Samsung-Gerät keine Wasserpistole, sondern ein Revolver. Und das könnten die Empfänger womöglich missverstehen.” > > [revised Google translation: “The disarmament policy that is formulated in codes is reversed into its opposite when an iPhone user invites his friends to a water fight and sends them a water pistol emoji by SMS: Then the Samsung device does not display a water pistol, but a revolver. Something that the receivers could possibly misunderstand.”] > > Arguing by this very rare case is consistent with the facts-twisting used in other parts of the article. This casts a crude twilight on the author’s approach. Harsh wording such as “doppelzüngig” (deceitful, speaking of Apple and Microsoft); “schleift das Recht auf freie Meinungsäußerung” (grinds the right on Free Speech, as of Apple), pointed as a “Skandal”; “zeugt von einer verqueren Sicht der Dinge” (brings evidence of an awry/askew/screwy point of view) contrasts with an obvious lack of knowledge when talking of Unicode as both proposing and accepting emoji… > > As is outlined by a reader’s comment, while emoji are formally in the first place, the demonstration is biased with a mix-up involving speech, then applied to emoji to make the reader believe that the Orwell-reminiscence is really well-placed. > > Unicode and big tech companies are always patient targets for attacks of that kind. As pointed by another commenting reader: A tempest in a teapot. > > I remember the FAZ feuilleton over a decade ago too, appearing to me always as high-quality journalism.
A quick look at the last article from the same author[2] makes me believe that truth and accuracy still conform to the standard. > > In return, I’m left back with the troublesome question: Why do they hate Unicode, Apple and Microsoft? > > Marcel > > [1] https://www.linkedin.com/in/adrian-lobe-aa3057b7 > [2] http://www.faz.net/aktuell/feuilleton/debatten/wer-haftet-fuer-luegen-die-algorithmen-verbreiten-14416032.html From doug at ewellic.org Mon Sep 5 11:51:03 2016 From: doug at ewellic.org (Doug Ewell) Date: Mon, 5 Sep 2016 10:51:03 -0600 Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech In-Reply-To: References: Message-ID: <84059D99CFD34648A4A4FC17D74047A4@DougEwell> Marcel Schneider wrote: > The main concern as Doug Ewell's last question underscores it, is whether the attack against the Unicode Consortium is justified in any way, or is mere calumny. I didn't accuse FAZ or anyone else of calumny, or any sort of malicious intent. Hanlon's Razor applies here. -- Doug Ewell | Thornton, CO, US | ewellic.org From charupdate at orange.fr Mon Sep 5 16:49:41 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 5 Sep 2016 23:49:41 +0200 (CEST) Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech In-Reply-To: <84059D99CFD34648A4A4FC17D74047A4@DougEwell> References: <84059D99CFD34648A4A4FC17D74047A4@DougEwell> Message-ID: <931602732.19273.1473112181379.JavaMail.www@wwinf1e21> On Mon, 5 Sep 2016 10:51:03 -0600, Doug Ewell wrote: > Marcel Schneider wrote: > >> The main concern as Doug Ewell's last question underscores it, is whether the attack against the Unicode Consortium is justified in any way, or is mere calumny. > > I didn't accuse FAZ or anyone else of calumny, or any sort of malicious intent.
Hanlon's Razor applies here. Yep, I wasn’t aware! With this in mind, there will be much less hassle. Thanks! Marcel From nobody_uses at outlook.com Mon Sep 5 18:53:50 2016 From: nobody_uses at outlook.com (eduardo marin) Date: Mon, 5 Sep 2016 23:53:50 +0000 Subject: named character sequences foor tally marks Message-ID: I love the proposal to add western tally marks because it only occupies two code points for a technically equivalent solution: http://www.unicode.org/L2/L2016/16065-tally-marks.pdf L2/16-065 (Proposal to encode two Western-style tally marks; source: Ken Lunde (Adobe) & Daisuke MIURA) However, this proposal isn't complete unless we can identify tally marks 2, 3 and 4 easily, and the simplest way is to add named character sequences in which we just repeat tally mark one n times. From irgendeinbenutzername at gmail.com Mon Sep 5 19:34:38 2016 From: irgendeinbenutzername at gmail.com (Charlotte Buff) Date: Tue, 6 Sep 2016 02:34:38 +0200 Subject: Why isn't MUSICAL SYMBOL NULL NOTEHEAD default ignorable? Message-ID: It has just come to my attention that U+1D159 MUSICAL SYMBOL NULL NOTEHEAD is not default ignorable, even though it has no visible glyph appearance and no advance width in text, just like the various Hangul jamo fillers that *are* default ignorable. Is there a technical reason for this or is it just an oversight?
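For readers who want to check the observation about U+1D159 themselves, here is a minimal Python sketch. It is only an illustration under stated assumptions: Python's unicodedata module does not expose the Default_Ignorable_Code_Point property, so the table below is a small hand-copied subset of ranges (an assumption, to be verified against DerivedCoreProperties.txt), and the helper names are hypothetical.

```python
import unicodedata

# Assumption: a hand-copied SUBSET of Default_Ignorable_Code_Point ranges.
# The authoritative list is in the UCD file DerivedCoreProperties.txt.
DEFAULT_IGNORABLE_SUBSET = [
    (0x00AD, 0x00AD),    # SOFT HYPHEN
    (0x115F, 0x1160),    # HANGUL CHOSEONG FILLER, HANGUL JUNGSEONG FILLER
    (0x3164, 0x3164),    # HANGUL FILLER
    (0xFE00, 0xFE0F),    # VARIATION SELECTOR-1 .. VARIATION SELECTOR-16
    (0xFFA0, 0xFFA0),    # HALFWIDTH HANGUL FILLER
    (0x1D173, 0x1D17A),  # musical format controls (BEGIN BEAM .. END PHRASE)
]


def is_default_ignorable(cp: int) -> bool:
    """Return True if cp falls in one of the ranges of the subset above."""
    return any(lo <= cp <= hi for lo, hi in DEFAULT_IGNORABLE_SUBSET)


def describe(cp: int) -> str:
    """Format code point, character name, general category and the DI flag."""
    ch = chr(cp)
    name = unicodedata.name(ch, "<unnamed>")
    return (f"U+{cp:04X} {name} gc={unicodedata.category(ch)} "
            f"DI={is_default_ignorable(cp)}")


if __name__ == "__main__":
    # The Hangul fillers are in the table; the null notehead is not.
    for cp in (0x115F, 0x1160, 0x3164, 0x1D159):
        print(describe(cp))
```

With a full copy of the property data in place of the subset, the same lookup would show whether any other characters without a visible glyph or advance width share the situation described in the message.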
From nobody_uses at outlook.com Tue Sep 6 00:18:35 2016 From: nobody_uses at outlook.com (eduardo marin) Date: Tue, 6 Sep 2016 05:18:35 +0000 Subject: Enconding a flammable sign and others Message-ID: I'm really surprised this isn't already encoded, but while we are at it let's also encode symbols for non-ionizing radiation, laser hazard, explosion hazard, strong magnetic field, choking hazard, corrosion, slippery floor, oxidising, carcinogen and chemical weapon, just to name the most relevant. Other ones would include UV light, freezing hazard, hand in the middle of cogs, foot or hand under machinery, battery hazard, washing hands, fragile, crane hazard, suffocation, high temperatures and probably some I missed. From verdy_p at wanadoo.fr Tue Sep 6 07:03:28 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Sep 2016 14:03:28 +0200 Subject: named character sequences foor tally marks In-Reply-To: References: Message-ID: Isn't the proposal showing only the significant cases, for numbers 1-5? (The others repeat the glyphs, separated by their side bearings.) Digits 1-4 of the first variant (the one with an overstriking slash for 5) are also highly confusable with existing vertical bars and Devanagari punctuation; the significant difference is their side bearings, which may not be distinguishable in monospaced fonts. The second variant (alternating horizontals and verticals) has the same confusable glyph for 1, and for 2 it may be confusable with some Hangul. However, for a coherent presentation (to avoid mixing fonts with different metrics), it still seems sensible to encode 1-5 as a single set (the addition of 10-20 does not seem necessary, unless there are other variants of tally marks).
Other variants:
- 1. Many people just keep adding vertical bars and do not use any slanted slash (5 is represented just as " ||||| ").
- 2.
Another common variant successively draws each side of a square and adds a diagonal for 5.
- 3. I've seen at least one variation on the first variant that uses a horizontal bar for 6 (two for 7, three for 8, four for 9) and represents 10 with two orthogonal slashes (and, when counting by tens directly, uses a square or just an X cross, also similar to the Roman digit X).
Tally marks are basically used to count events or objects progressively, as they come, by adding a single stroke to the counter without erasing anything. These marks are not necessarily drawn with a pen: they may be engraved by some cutting tool, or they may be holes through a thin surface. Their presentation depends on the material used and on whether they must resist mechanically for extended periods of time (ink on paper is not always usable, notably when exposed directly to water or humidity, or when the marks are on objects that are handled frequently). Their presentation will therefore vary widely, but for use in plain text documents that will be printed on paper or displayed on screen (possibly along with other text), it seems that the first variant (vertical bars and an overstriking slash for 5) is the most representative. I don't think this proposal really justifies the addition of numbers 6-20. In fact, for digits 1-5 it also looks very similar to variants of the Roman digits (where 4 may also be represented by " IIII " instead of " IV "). Roman digits were initially simple tally marks (just like in the older Greek or Phoenician scripts and many other scripts of the world), but they then came to reuse the Latin letters of the alphabet (with additional letters borrowed for multiples of powers of ten) and later also used lowercase letters and ligated glyphs. The proposal you cite suggests reusing the same code for two very different variants, but I wonder why. The first variant is the most common and matches what is used in many scripts.
The second variant is probably very specific to use with Asian scripts. The variant with the sides of a square is probably better known and is used across cultures. However, the order in which these sides are drawn is not significant, even if they are generally drawn by connecting the sides in a circular sequence. Which diagonal is drawn is usually not significant either: right-handed people usually use a "/" slash, left-handed people frequently use a "\" backslash, but people may also alternate the slashes between the first group of five and the second group that completes ten, for easier visualization of the total. Another common usage is to add a second horizontal stroke over two successive groups of five tally marks, to create complete groups of 10. If the surface is easily erasable or discardable, another tally line may be used to count complete groups of ten (or of some other significant size such as 12, depending on the context, notably in games), erasing or discarding the first line for units as soon as it is complete. There are also games using groups of 3 units, where the tally marks are the sides of a triangle, or groups of 6 units, where the marks are the sides of a square and its two diagonals. Finally, there are also well-known games where the tally marks draw a hanged man, usually with 10 strokes (including straight strokes for the horizontal base, the vertical mast, the horizontal support at the top and the hanging cord, a circle for the head, and a single stroke for the body+neck, the arms, and the legs). The exact layout may vary; I've also seen games drawing a basic house, with a triangular roof and a door. However, these tallying schemes in games are not perceived as digits or numbers. --- Notes about variant (1.)
above: This variant using only vertical bars is well known in France, where it is used when opening and counting votes in all official elections and referenda (similar methods may be used for other elections or polls in organizations, such as professional elections, when there are large numbers of voters with secret votes and no agreed electronic voting). Attempts to use electronic voting for official elections have always been opposed (and it brings no advantage for the secrecy of votes or in terms of cost and speed of operations). The position where each vertical bar is to be drawn is preprinted on the tally sheets as a small dot or horizontal dash. The count is made publicly, immediately after the closure of the vote, by volunteers (assessors) who vote in that bureau and whose identity is checked. A sworn public officer may also be present to check and attest the regularity of the operations (when the empty urn is sealed before the vote opens, and from the moment the seals are broken until the votes are fully counted); this happens in some municipalities where candidates are in strong opposition and the majority is likely to be contested. Most frequently, though, candidates have their own representative present in each bureau (within the public, which is still kept separate from the tally tables). Generally there are 3-4 tables per bureau for this operation (there may be only one table if not enough volunteers are present), plus the president of the bureau (an elected member of the municipality) and one or two representatives of the opposition, or some administrative personnel of the municipality; the police may also be present to control the public and secure the operations. As long as there are not enough people present to start the count, the sealed transparent urns cannot be opened (they may be brought by the police to another bureau that will open them publicly, but before that, they must remain visible to anyone).
The actual tallying process takes place after the urn has been opened and the individual envelopes have been counted into groups of 100: these groups are put in sealed envelopes that will be opened separately on the tally tables. On each tally table, two assessors count in parallel on separate sheets, another assessor opens the envelopes, and a fourth announces each vote orally and sorts the ballots (all 4 assessors check the regularity of the vote and sign the null votes or empty envelopes). It may happen that a group of 100 has one envelope too many or one missing: this is reported but does not cancel the group, and the tally sheet will give the actual number of valid and invalid votes. All valid, null/canceled and blank votes are counted with these tally marks. Only the empty envelopes that contained the secret votes do not need to be kept with the tally sheets (but envelopes containing invalid/null/blank votes are signed by the assessors and kept for later checking, if needed). At the end of counting a group of 100 votes, the total number is also entered in a dedicated column, using standard digits "0-9", and a new tally sheet is used for the next group (each tally sheet is then signed by each of the 4 persons). If the two tally sheets do not have matching totals at this step (and because all opened envelopes are kept on the table), the full group is recounted on new sheets and the assessors sign the cancellation of the discarded tally sheets. If the public sees irregularities in the operations at a tally table, the group of 100 is recounted at another table (groups of 100 votes are never mixed on the same table). At the end of operations, another sheet is used by the bureau to total all its votes, and the results are immediately announced publicly in the bureau and displayed outside for several days.
This totalling sheet just uses normal "0-9" digits. All signed sheets of the bureau are then sent to the central bureau of the municipality, where they are totaled and announced publicly, and then sent to the prefecture (representing the national authority), electronically at once and by secure postal mail. Finally these totals are compared to the registry of participants (each voter signs this registry before inserting their secret vote in the urn), which is checked separately and publicly, with its own totals, while the votes from the urns are being counted by the assessors. The whole process lasts about one hour (more or less, depending on the number of tables). These well-known public operations in the bureau are rarely contested (and most people feel that they are more secure than any form of electronic voting, which is also felt to intrude on secrecy). Contestations generally come from what happens outside the bureau (such as illegal campaigning on the day of voting, irregularities in the registry of voters, or security problems for voters' access to the bureau), or from opening the vote before the official scheduled time (before the public can officially be present) or keeping it open too late (when no more voters were arriving at the bureau in regular time and waiting their turn to enter the voting booths, sign the registry and insert their vote in the urn). There is a small tolerance for closing the vote one or two minutes late, and at this time the public is generally present. However, at this time results or estimates may already be published and could influence last-minute voters; this could be reported as an irregularity, possibly leading a court to invalidate all results of the bureau. If too many results are canceled by a court, changing the final results significantly, a new vote would have to be organized later, which has a significant cost for municipalities.
So the vertical tally marks on the sheets are only used temporarily (they are really needed for a few minutes) but may still be checked later (along with all the envelopes and vote sheets, which are kept together in a large sealed envelope containing the 100 votes, closed and signed by the assessors and the president of the bureau). As this tallying method is well known, it is also used informally (notably by children playing games). But tally marks using the sides of a square and a diagonal are also common in popular games. From doug at ewellic.org Tue Sep 6 09:56:27 2016 From: doug at ewellic.org (Doug Ewell) Date: Tue, 06 Sep 2016 07:56:27 -0700 Subject: Enconding a flammable sign and others (et al.) Message-ID: <20160906075627.665a7a7059d7ee80bb4d670165c8327d.21ef170ccd.wbe@email03.godaddy.com> As a reminder, "let's encode X" on the public mailing list doesn't constitute a proposal, even for emoji. -- Doug Ewell | Thornton, CO, US | ewellic.org From asmusf at ix.netcom.com Tue Sep 6 11:12:48 2016 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Tue, 6 Sep 2016 09:12:48 -0700 Subject: Enconding a flammable sign and others (et al.)
In-Reply-To: <20160906075627.665a7a7059d7ee80bb4d670165c8327d.21ef170ccd.wbe@email03.godaddy.com> References: <20160906075627.665a7a7059d7ee80bb4d670165c8327d.21ef170ccd.wbe@email03.godaddy.com> Message-ID: <4a986c1e-1460-ecab-0427-0fe1b8c50fb9@ix.netcom.com> An HTML attachment was scrubbed... URL: From everson at evertype.com Tue Sep 6 11:17:44 2016 From: everson at evertype.com (Michael Everson) Date: Tue, 6 Sep 2016 17:17:44 +0100 Subject: Enconding a flammable sign and others (et al.) In-Reply-To: <4a986c1e-1460-ecab-0427-0fe1b8c50fb9@ix.netcom.com> References: <20160906075627.665a7a7059d7ee80bb4d670165c8327d.21ef170ccd.wbe@email03.godaddy.com> <4a986c1e-1460-ecab-0427-0fe1b8c50fb9@ix.netcom.com> Message-ID: <2F64DAE3-5983-4A36-8E4D-744CBFC0CFB1@evertype.com> On 6 Sep 2016, at 17:12, Asmus Freytag (c) wrote: > As a reminder, "X" is already encoded, so no proposal for encoding it would be accepted. My Latin chi was accepted ;-) M From doug at ewellic.org Tue Sep 6 12:01:12 2016 From: doug at ewellic.org (Doug Ewell) Date: Tue, 06 Sep 2016 10:01:12 -0700 Subject: Enconding a flammable sign and others (et al.) Message-ID: <20160906100112.665a7a7059d7ee80bb4d670165c8327d.2bdbbb8be8.wbe@email03.godaddy.com> Asmus Freytag (c) wrote: >> As a reminder, "let's encode X" on the public mailing list doesn't >> constitute a proposal, even for emoji. > > As a reminder, "X" is already encoded, so no proposal for encoding it > would be accepted. Ha ha! It is to laugh! Let me try again. Suggestions on the public mailing list to encode new characters, including but not limited to the recent ones quoted below, do not constitute proposals, even for emoji. 
> It is well known that the southern song style of counting rods, had > different forms for the digits 4, 5 and 9 > https://en.wikipedia.org/wiki/Counting_rods , however currently there > is no way to represent such forms, a proposal to add them would only > occupy five code points, since number four is identical both vertical > and horizontally. > the last four characters on the right of which I propose the following > names: UPPER HALF BLOCK MEDIUM SHADE, LOWER HALF BLOCK MEDIUM SHADE, > FULL BLOCK-UPPER HALF MEDIUM SHADE and FULL BLOCK-LOWER HALF MEDIUM > SHADE. I recommend encoding them in the Miscellaneous Symbols and > Arrows block or in the Geometric Shapes Extended block. > I'd argue there's a fifth character missing from the ZX-81 set. Notice > how there are two separate MEDIUM SHADEs, one the inverse of the > other. For complete compatibility Unicode would also somehow need a > second one of those. > However this proposal isn't complete unless we can identify tally > marks 2, 3 and 4 easily and the simplest way is to add named character > sequences, where we just repeat tally mark one an n number of times. > I'm really surprised this isn't already encoded but while we are at it > let's also encode a symbol for non-ioninzing radiation, laser hazard, > explosion hazard, strong magnetic field, chocking hazard, corrosion, > slippery floor, oxidising, carcinogen and chemical weapon symbols just > to name the most relevant. Other ones would include uv light, frezzing > hazard, hand in the middle of cogs, foot or hand under machinery, > battery hazard, washing hands, fragile symbol, crane hazard, > suffocation, high temeperatures and probably some I missed. -- Doug Ewell | Thornton, CO, US | ewellic.org From asmusf at ix.netcom.com Tue Sep 6 14:12:20 2016 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Tue, 6 Sep 2016 12:12:20 -0700 Subject: Enconding a flammable sign and others (et al.) 
In-Reply-To: <2F64DAE3-5983-4A36-8E4D-744CBFC0CFB1@evertype.com> References: <20160906075627.665a7a7059d7ee80bb4d670165c8327d.21ef170ccd.wbe@email03.godaddy.com> <4a986c1e-1460-ecab-0427-0fe1b8c50fb9@ix.netcom.com> <2F64DAE3-5983-4A36-8E4D-744CBFC0CFB1@evertype.com> Message-ID: <1e79542b-d7f4-17d8-c432-595da6e9b774@ix.netcom.com> An HTML attachment was scrubbed... URL: From mpsuzuki at hiroshima-u.ac.jp Fri Sep 9 08:58:24 2016 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Fri, 09 Sep 2016 22:58:24 +0900 Subject: how to evaluate the "emoji support level" in given font? Message-ID: <57D2C000.5070902@hiroshima-u.ac.jp> Hi, Recently, fontconfig developers have been discussing how to evaluate the question "does this font support the 'emoji' set sufficiently?". Is it possible to design a subset of emoji that serves the common use of emoji? For details of the fontconfig developers' discussion, please refer to the thread starting at: https://lists.freedesktop.org/archives/fontconfig/2016-September/005830.html * about fontconfig fontconfig is a library that is widely used on Unix-like operating systems to locate a font file (its pathname) from a query giving a typographic category (serif/sans-serif/monospace etc.), script, and language. fontconfig crawls the font files on the system and builds a database to answer such queries. To guess the supported scripts and languages, fontconfig basically checks which codepoints have relevant glyph data. This coverage is compared with the orthography database: in the case of CJK scripts, the coverage is compared with GB 2312, Big5, HKSCS, JIS X 0208, KS X 1001 etc. * emoji and fontconfig At present, fontconfig developers are wondering how they can list the codepoints needed to evaluate the query "does this font support emoji?".
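The coverage comparison described above can be sketched roughly as follows. This is a minimal Python illustration, not fontconfig's actual code or API: the function names, the toy codepoint sets and the threshold parameter are all assumptions made for the example.

```python
def coverage(font_codepoints: set, required: set) -> float:
    """Fraction of the required codepoints for which the font has a glyph."""
    if not required:
        return 1.0  # nothing is required, so trivially covered
    return len(required & font_codepoints) / len(required)


def supports(font_codepoints: set, required: set,
             threshold: float = 1.0) -> bool:
    """Hypothetical policy: 'supported' if coverage reaches the threshold."""
    return coverage(font_codepoints, required) >= threshold


# Toy data (assumption): a four-character stand-in for an emoji subset
# and the codepoints of an imaginary font.
EMOJI_SAMPLE = {0x1F600, 0x1F601, 0x2764, 0x1F44D}
font_a = {0x41, 0x42, 0x1F600, 0x2764}

if __name__ == "__main__":
    print(coverage(font_a, EMOJI_SAMPLE))                # 0.5
    print(supports(font_a, EMOJI_SAMPLE))                # False
    print(supports(font_a, EMOJI_SAMPLE, threshold=0.5)) # True
```

The open question in this thread is then which set to pass as `required` (a legacy carrier repertoire, the full emoji data list, or something in between) and whether a threshold below 100% is acceptable.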
The stable subset of emoji would be the repertoire used by Japanese legacy cellular phones, but (personally) I don't think that subset is still respected when emoji fonts are designed, unless the developer cares about legacy cellular phone users. Is it possible to design a subset of emoji that serves the common use of emoji? Or, if such an attempt (evaluating the support level of emoji by checking some codepoints) is wrong, is there any good method to evaluate the emoji support level of a given font? Regards, mpsuzuki From mpsuzuki at hiroshima-u.ac.jp Fri Sep 9 09:20:43 2016 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Fri, 09 Sep 2016 23:20:43 +0900 Subject: [Unicode] how to evaluate the "emoji support level" in given font? In-Reply-To: <6fb50b79eec84f8993fce175508f5f58@KL1PR04MB1637.apcprd04.prod.outlook.com> References: <6fb50b79eec84f8993fce175508f5f58@KL1PR04MB1637.apcprd04.prod.outlook.com> Message-ID: <57D2C53B.1050007@hiroshima-u.ac.jp> Oh, I should say more about why I wrote "subset". There is a full list of emoji defined by Unicode: http://unicode.org/Public/emoji/3.0/emoji-data.txt But I doubt that most emoji font developers are trying to fill all of this list. For example, to check the support level for zh-CN, fontconfig does not check all G-source characters of the CJK Unified Ideographs, because there are so many Chinese fonts that cover GB 2312 but do not cover GB 18030. I guess the situation is similar for emoji fonts... Regards, mpsuzuki suzuki toshiya wrote: > Hi, > > Recently, fontconfig developers are discussing how to evaluate > "is this font supporting 'emoji' set sufficiently?". Is it possible > to design a subset of emoji to serve common use of emoji?
> > For detail about the discussion of fontconfig developers, please > refer the thread from: > https://lists.freedesktop.org/archives/fontconfig/2016-September/005830.html > > * about fontconfig > fontconfig is a library which is widely used by Unix-like operating > systems to locate a (pathname of) font file, by the query with a few > typographic category (serif/sans-serif/monospace etc), script, and > language. fontconfig crawls the font files on the systems, and make > a database to respond such query. To guess the supported script and > language, basically fontconfig checks the coverage of the codepoints > with relevant glyph data. The coverage is compared with the orthography > database: for the case of CJK script, the coverage is compared with > GB 2312, Big5, HKSCS, JIS X 0208, KS X 1001 etc. > > * emoji and fontconfig > At present, fontconfig developers are wondering how they can list the > codepoints to evaluate the query "this font support emoji?". The stable > subset of emoji would be the repertoire used by Japanese legacy cellular > phones, but (personally) I don't think it is still respected to design > some emoji fonts, as far as the developer is careful about the legacy > cellular phone users. > > Is it possible to design a subset of emoji to serve common use of emoji? > Or, if such attempt (evaluate the support level of emoji by checking > some codepoints) is wrong, is there any good method to evaluate the > support level of emoji in given font? > > Regards, > mpsuzuki > > From pedberg at apple.com Fri Sep 9 19:23:55 2016 From: pedberg at apple.com (Peter Edberg) Date: Fri, 09 Sep 2016 17:23:55 -0700 Subject: CLDR Version 30 alpha available Message-ID: <5D20A154-190C-48DD-A92D-041E9009D6A5@apple.com> Dear Unicode list members, The alpha draft version of Unicode CLDR v30 is available for testing. The main improvements include: • New format and preference structure has been added to support week designations such as “the week of August 10”
or “week 3 of March”. • New data items have been added to support relative times such as “3 Fridays ago” or “this hour”. • New data can be used to generate labels for groups of related characters in character pickers. • The structure for emoji annotations has been revised, and the data has been significantly updated. • Unicode support is updated to 9.0, including updated Unihan readings for the pinyin collation and Han-Latin transforms, and support for new script codes and number systems. Support is also added for region codes EZ, UN. • The set of language codes for translation has been updated, with a significant increase in the total number of translated language names. • The CLDR 30 Survey Tool data collection and additional bug fixing resulted in a net increase in data items of about 8.6%, with an additional 5.6% of items changed. Draft release note: http://cldr.unicode.org/index/downloads/cldr-30 Draft charts: http://www.unicode.org/cldr/charts/dev/ Draft data tag: http://www.unicode.org/repos/cldr/tags/release-30-d01 The final release of CLDR 30 is targeted for the end of September. Please provide any feedback on the alpha draft version by filing a ticket as described here: http://cldr.unicode.org/index/bug-reports Best regards, Peter Edberg for the CLDR Project From doug at ewellic.org Sat Sep 10 17:25:29 2016 From: doug at ewellic.org (Doug Ewell) Date: Sat, 10 Sep 2016 16:25:29 -0600 Subject: how to evaluate the "emoji support level" in given font? Message-ID: <22B1CFFCA9DC423DA4874F4EED0AA274@DougEwell> suzuki toshiya wrote: > Is it possible to design a subset of emoji to serve common use of emoji? Or, if such attempt (evaluate the support level of emoji by checking some codepoints) is wrong, is there any good method to evaluate the support level of emoji in given font? This could be a more complex question for emoji than for writing systems.
Users might have different expectations of what constitutes "support" for emoji as compared to, say, Latin or Kanji. They might expect to be able to select text or emoji rendering via U+FE0E and U+FE0F, or might expect support for ZWJ sequences. fontconfig may or may not be wired to take this into account. On the other hand, as far as full or less-than-full coverage is concerned, does fontconfig currently insist on 100% coverage of all characters in a script? Nearly all fonts offer Basic Latin (ASCII) support, while relatively few have glyphs for the Latin Extended-E block. Is the latter required in order to claim "Latin script support," and if not, would similar criteria apply when determining "emoji support"? Not trying to be critical, just trying to understand the expectations. -- Doug Ewell | Thornton, CO, US | ewellic.org From christoph.paeper at crissov.de Sun Sep 11 07:40:55 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Sun, 11 Sep 2016 14:40:55 +0200 Subject: Additional Emoji selection factor: Support by "Major Vendors" In-Reply-To: References: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de> <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de> <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de> Message-ID: Mark Davis: > > In order to understand the status of any document in the registry, you need to also look at the minutes of the meeting where they are discussed, in this case: http://www.unicode.org/L2/L2016/16121.htm > >> B.14.3 Provisional value for Emoji property [Emoji SC/Edberg, L2/16-087] >> >> B.14.3.1 Characters Proposed for Emoji=Provisional [Emoji SC/Edberg, L2/16-088] >> >> Discussion. UTC took no action at this time. > > "Took no action" generally means "rejected". Can anyone explain then, why [L2/16-128] seems to have been “rejected” and still made it into selection.html?
Same minutes as above: > E.1.11 Additional Emoji selection factor [Emoji SC/Edberg, L2/16-128] > > Discussion. UTC took no action at this time. [L2/16-128]: http://www.unicode.org/L2/L2016/16128-additional-emoji-selection-factor.pdf This was the proposed text to be added: > The following is a criterion for adding characters into a release of Unicode. It is not a selection factor that proposals need to address, but rather a consideration that the UTC takes into account before approving a character as a candidate for inclusion in a future release. > > Compared to most other characters in Unicode, there is greater public awareness of new emoji characters, and a high expectation of support for them from major vendors. However, the cost to such vendors of supporting new emoji characters is also much higher than for most other Unicode characters, especially on devices with limited memory. > > Thus in addition to these selection factors, before approving a new emoji character the Unicode Technical Committee needs to expect wide deployment: that major vendors would plan to include the proposed emoji character into very widely deployed fonts and input methods (keyboards / palettes / speech). In the currently public version of “Submitting Emoji Character Proposals” (dated 4 August 2016) we find most of it unchanged. http://www.unicode.org/emoji/selection.html#selection_factors > Before approving as candidates or adding to a release of Unicode, other considerations are taken into account. See UTC Consideration. http://www.unicode.org/emoji/selection.html#utc_consideration > 1. Compared to most other characters in Unicode, there is greater public awareness of new emoji characters, and a high expectation of support for them from major vendors. However, the cost to such vendors of supporting new emoji characters is also much higher than for most other Unicode characters, especially on devices with limited memory. > > 2.
> Thus in addition to the selection factors, before approving a new emoji character the Unicode Technical Committee needs to expect wide deployment: that major vendors would plan to include the proposed emoji character into very widely deployed fonts and input methods (keyboards / palettes / speech).
>
> 3. The committee may balance the choices of emoji in a given set of candidates or release. For example, rather than 15 different breeds of dogs, the committee might choose to have some faces, some clothing, other animals, food items, transport items, and sports.

None of that was present in April 2016.

I haven’t been able to find out what constitutes a “major vendor”. Apple, Microsoft and Google are certainly ones (and Unicode Full Members), but what about, for instance, Samsung, LG, Sony, Twitter/Twemoji, Facebook, Whatsapp or widely-used platform-independent ones like Emojione (mostly Associate Members or not Unicode members at all)?

From verdy_p at wanadoo.fr Sun Sep 11 09:21:27 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 11 Sep 2016 16:21:27 +0200
Subject: Additional Emoji selection factor: Support by "Major Vendors"
In-Reply-To:
References: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de> <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de> <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>
Message-ID:

2016-09-11 14:40 GMT+02:00 Christoph Päper:

> I haven’t been able to find out what constitutes a “major vendor”. Apple, Microsoft and Google are certainly ones (and Unicode Full Members), but what about, for instance, Samsung, LG, Sony,

They are vendors yes, but for hardware devices using common platforms (Windows and Android most often, or derived directly from Android). The hardware capabilities of devices made by them are mostly the same, as they compete at the same time on the same global market with similar features.
The good question however is how they support these devices for the long term: sales of new mobile devices are slowing down, as more and more people feel that they are too expensive and prefer to keep their existing devices functional, and there's now a lively secondary market for repairing or upgrading them, or changing their batteries instead of buying a new device (additionally, people now often already have a secondary phone, possibly older, that can be used as a backup when necessary).

> Twitter/Twemoji, Facebook, Whatsapp or widely-used platform-independent ones like Emojione (mostly Associate Members or not Unicode members at all)?

These are connected web apps, and web apps have no difficulty supporting and upgrading their fonts or collections of icons. Users don't need to upgrade; this occurs almost transparently. If these are mobile apps, they are updated extremely frequently from their publication store (Windows Store, Google Play, Apple Store), using simple procedures. However, mobile apps tend to grow in size at each update, forcing users to select which ones they'll keep, or forcing them to uninstall one to install another, then switch back, if they have old models with low storage capacity: 8GB smartphones are now being deprecated in favor of 16GB ones. These app vendors should offer several versions of their apps for users with limited storage, but they fail to monitor the market and to see that this growth of app sizes is becoming a problem when many features are added but not used. Most resident features would be better installed on demand in a cache that cleans up automatically if storage becomes too low; ideally most apps should be online and should use minimal local code.

Besides these vendors, there remain some niche markets for mobile devices that have their own OS, not compatible with and not supported by these well-known app vendors.
But most of them are built on a Linux core (just like Android itself), and use well-known platforms (Java, .NET, or simply the HTML/CSS support of the built-in browser), and apps are not complicated to port. The real complication is in the default support for input interfaces (i.e. virtual keyboards) that these apps need, and adapting them for the local markets (languages). Emoji input however is mostly independent and can be developed and supported across languages in the same input panels. The necessary sets of fonts or icons may also be installed transparently using web queries and the standard browser cache. These should be mostly independent of the target app platform or OS. Vendors may develop their own common subsets (with personalized style), but there will be plenty of alternative offers (just as there have long been many providers of emoticons, long before many of them were encoded). I don't think this is a major problem.

However the complexity now is not in the encoding of emojis but in the recent development of complex encodings that require large ligature tables to work properly, and/or use variation selectors. The most common combinations should be better documented in the standard (it is possible to encode them in the list of "named sequences", or in a separate list specifically for emojis).

But I wonder whether it is productive to show all styles for emojis: why not return to displaying a single representative glyph, with basic "flat" designs, but probably with some colors to help distinguish them? But which version from which vendor should the standard use for that glyph? Emoji doc pages and reports on the Unicode site tend to become very large, and it is becoming much harder to initiate new sets in new styles. Google, Apple, Microsoft, Facebook can develop their own sets, just like there are standard sets from Japanese telcos.
From christoph.paeper at crissov.de Sun Sep 11 16:26:47 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Sun, 11 Sep 2016 23:26:47 +0200
Subject: Additional Emoji selection factor: Support by "Major Vendors"
In-Reply-To:
References: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de> <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de> <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>
Message-ID:

Philippe Verdy:
> 2016-09-11 14:40 GMT+02:00 Christoph Päper:
>
>> I haven’t been able to find out what constitutes a “major vendor”. Apple, Microsoft and Google are certainly ones (…), but what about, for instance, Samsung, LG, Sony,
>
> They are vendors yes, but for hardware devices using common platforms (…).

The important point here is that at least Samsung and LG are selling millions of devices annually with custom emoji fonts installed on them.

> The good question however is how they support these devices for the long term:

Many vendors are indeed really, really bad at developing, maintaining and rolling out OS updates for much of their product line-up. Even Apple supports their iOS devices for only ca. 5 years (from launch, not purchase)^, although the update infrastructure is already set up and new fonts by themselves would be rather small and simple fixes. Current emoji font files for mobile operating systems are somewhere between 3 MB (vector-based) and almost 40 MB (bitmap-based).

^ Apple released iOS 5, the first version with non-PUA emojis and iMessages (IIRC), and the iPhone 4s in late 2011. The OS update being deployed this week (i.e. iOS 10) will no longer support that device which was sold in some places until earlier this year. That means, there are now iOS devices which were capable of handling Unicode emojis at their launch date, but will be permanently incapable of displaying new ones from Unicode 9 or later.
>> Twitter/Twemoji, Facebook, Whatsapp or widely-used platform-independent ones like Emojione (…)?
>
> These are connected web apps, and web apps have no difficulty supporting and upgrading their fonts or collections of icons. Users don't need to upgrade; this occurs almost transparently. If these are mobile apps, they are updated extremely frequently from their publication store

That’s quite true, but doesn’t say anything about whether they’re “major vendors” when it comes to standardizing new emoji characters in Unicode.

> The real complication is in the default support for input interfaces (i.e. virtual keyboards) that these apps need, and adapting them for the local markets (languages). Emoji input however is mostly independent and can be developed and supported across languages in the same input panels.

That’s a general assumption, but I’m not sure it would hold against a user test. Observe, for instance, how on 11 September (or 4 July or during the Olympics) people complain on Twitter and elsewhere that they cannot find their “American flag emoji” on the top of the list, or how they confuse it with the Liberian or Malaysian regional indicator (cf. Texas vs. Chile). That’s why there are customized panels for frequently or recently used emojis, or auto-replace and suggest-as-you-type algorithms.

> However the complexity now is not in the encoding of emojis but in the recent development of complex encodings that require large ligature tables to work properly, and/or use variation selectors.

Those large tables can be generated by rather short algorithms, which perhaps could be simpler if emoji properties were more systematic.

From duerst at it.aoyama.ac.jp Tue Sep 13 03:03:24 2016
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=)
Date: Tue, 13 Sep 2016 17:03:24 +0900
Subject: [Unicode] how to evaluate the "emoji support level" in given font?
In-Reply-To: <57D2C53B.1050007@hiroshima-u.ac.jp>
References: <6fb50b79eec84f8993fce175508f5f58@KL1PR04MB1637.apcprd04.prod.outlook.com> <57D2C53B.1050007@hiroshima-u.ac.jp>
Message-ID: <5fc34456-744b-7e8c-76a0-bd7368c58739@it.aoyama.ac.jp>

I think the first and most obvious way to check would be according to Unicode Version, i.e. check for some Emoji introduced in version 6, in version 7, and so on. For very old sets, checking for emoji present in the NTT Docomo set but not in the Softbank set,... might also make sense.

Regards, Martin.

On 2016/09/09 23:20, suzuki toshiya wrote:
> oh, I should add more words why I wrote "subset". There is a full list of emoji defined by Unicode;
> http://unicode.org/Public/emoji/3.0/emoji-data.txt
> But I doubt whether most emoji font developers are trying to fill all of this list.
>
> For example, to check the support level for zh-CN, fontconfig does not check all G-source characters of CJK Unified Ideograph - because, there are so many Chinese fonts covering GB 2312 but not covering GB 18030. I guess similar situation in emoji fonts...
>
> Regards,
> mpsuzuki
>
> suzuki toshiya wrote:
>> Hi,
>>
>> Recently, fontconfig developers are discussing how to evaluate "is this font supporting 'emoji' set sufficiently?". Is it possible to design a subset of emoji to serve common use of emoji?
>>
>> For detail about the discussion of fontconfig developers, please refer to the thread starting at:
>> https://lists.freedesktop.org/archives/fontconfig/2016-September/005830.html
>>
>> * about fontconfig
>> fontconfig is a library which is widely used by Unix-like operating systems to locate a (pathname of) font file, by a query with a few typographic categories (serif/sans-serif/monospace etc), script, and language. fontconfig crawls the font files on the system, and makes a database to respond to such queries.
>> To guess the supported script and language, basically fontconfig checks the coverage of the codepoints with relevant glyph data. The coverage is compared with the orthography database: for the case of CJK script, the coverage is compared with GB 2312, Big5, HKSCS, JIS X 0208, KS X 1001 etc.
>>
>> * emoji and fontconfig
>> At present, fontconfig developers are wondering how they can list the codepoints to evaluate the query "does this font support emoji?". The stable subset of emoji would be the repertoire used by Japanese legacy cellular phones, but (personally) I don't think it is still respected when designing emoji fonts, unless the developer cares about the legacy cellular phone users.
>>
>> Is it possible to design a subset of emoji to serve common use of emoji? Or, if such an attempt (evaluating the support level of emoji by checking some codepoints) is wrong, is there any good method to evaluate the support level of emoji in a given font?
>>
>> Regards,
>> mpsuzuki

--
Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan

From costello at mitre.org Thu Sep 15 06:14:58 2016
From: costello at mitre.org (Costello, Roger L.)
Date: Thu, 15 Sep 2016 11:14:58 +0000
Subject: Default character encoding for each operating system?
Message-ID:

Hi Folks,

In a book that I am reading [1] the author mentions "the default character encoding for the operating system." What is the default character encoding of:

- Windows 10
- Mac OS
- Linux

/Roger

[1] Practical Common Lisp by Peter Seibel, p. 165 (footnote 2).

From prosfilaes at gmail.com Thu Sep 15 06:43:48 2016
From: prosfilaes at gmail.com (David Starner)
Date: Thu, 15 Sep 2016 11:43:48 +0000
Subject: Default character encoding for each operating system?
In-Reply-To:
References:
Message-ID:

Linux is far less specific than Windows 10. In all recent versions of Debian GNU/Linux, UTF-8 is the most common character encoding, but using ISO-8859-x or, I believe, even something like EUC-JP is still supported. Other distributions may enforce UTF-8 or in rare cases ISO 8859-1 or even something else.

On Thu, Sep 15, 2016, 4:18 AM Costello, Roger L. wrote:
> Hi Folks,
>
> In a book that I am reading [1] the author mentions “the default character encoding for the operating system.” What is the default character encoding of:
>
> - Windows 10
> - Mac OS
> - Linux
>
> /Roger
>
> [1] *Practical Common Lisp* by Peter Seibel, p. 165 (footnote 2).

From verdy_p at wanadoo.fr Thu Sep 15 08:19:38 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 15 Sep 2016 15:19:38 +0200
Subject: Default character encoding for each operating system?
In-Reply-To:
References:
Message-ID:

A better question is what is the default character encoding for the **installed** operating system.

Unfortunately it has no single answer, because there are several default encodings for several parts of the OS. An OS has lots of components, and many of them are not transparent to the encoding they use.

All three OSes you cite support several default character encodings, and in addition they support them in several encoding forms. All three support Unicode internally, but not in all software components, which may run with one encoding or another.

And defaults will change according to your distribution or OS configuration options, and to your own current user settings.

2016-09-15 13:14 GMT+02:00 Costello, Roger L.:
> Hi Folks,
>
> In a book that I am reading [1] the author mentions “the default character encoding for the operating system.”
> What is the default character encoding of:
>
> - Windows 10
> - Mac OS
> - Linux
>
> /Roger
>
> [1] *Practical Common Lisp* by Peter Seibel, p. 165 (footnote 2).

From john.w.kennedy at gmail.com Thu Sep 15 09:36:39 2016
From: john.w.kennedy at gmail.com (John W Kennedy)
Date: Thu, 15 Sep 2016 10:36:39 -0400
Subject: Default character encoding for each operating system?
In-Reply-To:
References:
Message-ID: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>

macOS, and its offspring, iOS, watchOS, and tvOS, use UTF-16LE for all internals, but readily import and export all versions of Unicode and a good many historic 8-bit and mixed-length codings.

In the new Swift programming language, which is white-hot in the Apple community, Apple is moving toward a model of a transparent, generic Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary, but in which a “character” contains however many code points it needs (“e” with a stacked macron, acute accent, and dieresis is algorithmically one “character” in Swift). Moreover, e-with-an-acute-accent and e followed by a combining acute accent, for example, compare as equal. At present, the underlying code is still UTF-16LE.

--
SKen Software, LLC
Coming soon to an iPhone near you

> On Sep 15, 2016, at 9:19 AM, Philippe Verdy wrote:
>
> A better question is what is the default character encoding for the **installed** operating system.
>
> Unfortunately it has no single answer, because there are several default encodings for several parts of the OS. An OS has lots of components, and many of them are not transparent to the encoding they use.
>
> All three OSes you cite support several default character encodings, and in addition they support them in several encoding forms. All three support Unicode internally, but not in all software components, which may run with one encoding or another.
>
> And defaults will change according to your distribution or OS configuration options, and to your own current user settings
>
> 2016-09-15 13:14 GMT+02:00 Costello, Roger L.:
>> Hi Folks,
>>
>> In a book that I am reading [1] the author mentions “the default character encoding for the operating system.” What is the default character encoding of:
>>
>> - Windows 10
>> - Mac OS
>> - Linux
>>
>> /Roger
>>
>> [1] Practical Common Lisp by Peter Seibel, p. 165 (footnote 2).

From verdy_p at wanadoo.fr Thu Sep 15 10:25:17 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 15 Sep 2016 17:25:17 +0200
Subject: Default character encoding for each operating system?
In-Reply-To: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
Message-ID:

Not all internals. Many kernel drivers (notably bus drivers) still use an OEM 8-bit encoding in their debugging logs (most often based on a US English locale even if the installed version is localized to another language; but I've seen CP850 still used; and you can see some samples in the Event Viewer). Those messages in fact are not localized at all and are intended only for debugging or analysis by developers, or for display on a Windows console. Many console tools on Windows still use the default 8-bit OEM charset and won't display any Unicode output, even when the console is set to use a Unicode code page (I can still see some mojibake, even on Windows 10). When those output messages are read from other UI tools, they won't be interpreted in their own code page but in the default "ANSI" code page (such as Windows-1252).

Filesystems still use legacy charsets in their basic directory structure (e.g.
when inserting a FAT or FAT32 volume formatted without the LFN extensions for Windows, which also store filenames in UTF-16, such as an SD card formatted on a digital camera; as the directories and filenames created on those devices only use ASCII and uninformative names such as IMG00001.JPG, this generally does not cause a problem, but no Unicode name is stored; I've seen however some digital cameras storing some filenames in a legacy Chinese or Japanese charset, incorrectly rendered when viewing their content on a non-Japanese/Chinese system).

2016-09-15 16:36 GMT+02:00 John W Kennedy:
> macOS, and its offspring, iOS, watchOS, and tvOS, use UTF-16LE for all internals, but readily import and export all versions of Unicode and a good many historic 8-bit and mixed-length codings.
>
> In the new Swift programming language, which is white-hot in the Apple community, Apple is moving toward a model of a transparent, generic Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary, but in which a “character” contains however many code points it needs (“e” with a stacked macron, acute accent, and dieresis is algorithmically one “character” in Swift). Moreover, e-with-an-acute-accent and e followed by a combining acute accent, for example, compare as equal. At present, the underlying code is still UTF-16LE.
>
> --
> SKen Software, LLC
> Coming soon to an iPhone near you
>
> On Sep 15, 2016, at 9:19 AM, Philippe Verdy wrote:
>
> A better question is what is the default character encoding for the **installed** operating system.
>
> Unfortunately it has no single answer, because there are several default encodings for several parts of the OS. An OS has lots of components, and many of them are not transparent to the encoding they use.
>
> All three OSes you cite support several default character encodings, and in addition they support them in several encoding forms. All three support Unicode internally, but not in all software components.
> that will run with one or the other.
>
> And defaults will change according to your distribution or OS configuration options, and to your own current user settings
>
> 2016-09-15 13:14 GMT+02:00 Costello, Roger L.:
>
>> Hi Folks,
>>
>> In a book that I am reading [1] the author mentions “the default character encoding for the operating system.” What is the default character encoding of:
>>
>> - Windows 10
>> - Mac OS
>> - Linux
>>
>> /Roger
>>
>> [1] *Practical Common Lisp* by Peter Seibel, p. 165 (footnote 2).

From jsbien at mimuw.edu.pl Thu Sep 15 14:12:53 2016
From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=)
Date: Thu, 15 Sep 2016 21:12:53 +0200
Subject: "textels" (was: Default character encoding for each operating system?)
In-Reply-To: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> (John W. Kennedy's message of "Thu, 15 Sep 2016 10:36:39 -0400")
References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
Message-ID: <86poo5f03u.fsf_-_@mimuw.edu.pl>

On Thu, Sep 15 2016 at 16:36 CEST, john.w.kennedy at gmail.com writes:

[...]

> In the new Swift programming language, which is white-hot in the Apple community, Apple is moving toward a model of a transparent, generic Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary, but in which a “character” contains however many code points it needs (“e” with a stacked macron, acute accent, and dieresis is algorithmically one “character” in Swift). Moreover, e-with-an-acute-accent and e followed by a combining acute accent, for example, compare as equal. At present, the underlying code is still UTF-16LE.

For several years I have used the name "textel" (text element, in Polish "tekstel") for such objects. I do it mostly orally in my presentations for my students, but I have also used it in writing, e.g. in http://bc.klf.uw.edu.pl/118/, unfortunately without a proper definition.
A rudimentary definition was provided by me only in my recent paper in Polish: http://bc.klf.uw.edu.pl/480/. It states simply (on p. 69) "an elementary text element independently of its Unicode representation" (meaning in particular composed vs precomposed).

I still hope to formulate sooner or later a more satisfactory definition :-)

I think Swift confirms that such a notion is really needed.

Best regards

Janusz

--
,
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From eliz at gnu.org Thu Sep 15 14:27:14 2016
From: eliz at gnu.org (Eli Zaretskii)
Date: Thu, 15 Sep 2016 22:27:14 +0300
Subject: "textels" (was: Default character encoding for each operating system?)
In-Reply-To: <86poo5f03u.fsf_-_@mimuw.edu.pl> (jsbien@mimuw.edu.pl)
References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl>
Message-ID: <83r38l55gt.fsf@gnu.org>

> From: jsbien at mimuw.edu.pl (Janusz S. Bień)
> Date: Thu, 15 Sep 2016 21:12:53 +0200
> Cc: mufi-fonts
>
> On Thu, Sep 15 2016 at 16:36 CEST, john.w.kennedy at gmail.com writes:
>
> [...]
>
> > In the new Swift programming language, which is white-hot in the Apple community, Apple is moving toward a model of a transparent, generic Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary, but in which a “character” contains however many code points it needs (“e” with a stacked macron, acute accent, and dieresis is algorithmically one “character” in Swift). Moreover, e-with-an-acute-accent and e followed by a combining acute accent, for example, compare as equal. At present, the underlying code is still UTF-16LE.
>
> For several years I have used the name "textel" (text element, in Polish "tekstel") for such objects.
> I do it mostly orally in my presentations for my students, but I have also used it in writing, e.g. in http://bc.klf.uw.edu.pl/118/, unfortunately without a proper definition.

Isn't "grapheme cluster" the definition you are looking for?

From jsbien at mimuw.edu.pl Thu Sep 15 14:56:32 2016
From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=)
Date: Thu, 15 Sep 2016 21:56:32 +0200
Subject: "textels"
In-Reply-To: <83r38l55gt.fsf@gnu.org> (Eli Zaretskii's message of "Thu, 15 Sep 2016 22:27:14 +0300")
References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
Message-ID: <86r38llyxb.fsf@mimuw.edu.pl>

On Thu, Sep 15 2016 at 21:27 CEST, eliz at gnu.org writes:

[...]

> Isn't "grapheme cluster" the definition you are looking for?

I don't think so.

On Thu, Sep 15 2016 at 21:27 CEST, leoboiko at namakajiri.net writes:

> Isn't the Swift "character" and the "textel" merely the same thing as what Unicode already named "grapheme clusters"? (Well, technically UAX #29[1] defines them as "user-perceived characters", but then says grapheme clusters approximate user-perceived characters algorithmically).
>
> And, indeed, Swift "Characters" are explicitly defined as "extended grapheme clusters" (also from UAX #29):
>
> https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
>
> Such a notion is indeed needed, but it has been always there.
>
> [1] http://unicode.org/reports/tr29/

Perhaps I don't understand properly the rather obscure definitions, like

    An extended grapheme cluster is the same as a legacy grapheme cluster, with the addition of some other characters.

However:

1. Graphemes, if I understand correctly, are language dependent, textels are not.

2. Textel "ń" means both U+0144 and <U+006E, U+0301>, so it is a notion on a higher abstraction level than a grapheme cluster.
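To make the U+0144 versus <U+006E, U+0301> point concrete, here is a minimal sketch in Python using only the standard `unicodedata` module. It illustrates canonical equivalence as Unicode defines it, not the "textel" notion itself; the helper name `canonically_equal` is made up for this example.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Return True if a and b are equal after canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u0144"   # LATIN SMALL LETTER N WITH ACUTE
decomposed = "n\u0301"   # LATIN SMALL LETTER N + COMBINING ACUTE ACCENT

# Plain string comparison works on code points, so the two spellings differ.
print(precomposed == decomposed)
# After normalization both spellings collapse to the same sequence.
print(canonically_equal(precomposed, decomposed))
# The two spellings use a different number of code points.
print(len(precomposed), len(decomposed))
```

Comparing normalized forms is roughly what the Swift behaviour quoted earlier amounts to for this pair; whether that is enough to define a "textel" is exactly the open question in the thread.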
Moreover I don't want to call (LATIN SMALL LETTER N, COMBINING ACUTE ACCENT) an extended grapheme cluster for at least 2 reasons:

1. there is nothing extended in it

2. U+0301 is not a grapheme according to Polish linguistics terminology

Regards

Janusz

--
,
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From kenwhistler at att.net Thu Sep 15 19:27:24 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Thu, 15 Sep 2016 17:27:24 -0700
Subject: Additional Emoji selection factor: Support by "Major Vendors"
In-Reply-To:
References: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de> <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de> <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>
Message-ID: <65f8b1a5-0755-c9f5-d917-59e6c17e16a1@att.net>

On 9/11/2016 5:40 AM, Christoph Päper wrote:
>> "Took no action" generally means "rejected".
> Can anyone explain then, why [L2/16-128] seems to have been “rejected” and still made it into selection.html?

Not all documents in the UTC document register are born equal.

If a document in the register is explicitly a *proposal* to encode X at code point Y in version Z of the Unicode Standard, then that requires a recorded decision by the UTC. If the UTC takes up such a document, and the minutes for the agenda item in question note only "UTC took no action at this time", that clearly indicates that as of that date the UTC had not *accepted* the proposal. It *might* mean that the proposal was rejected, but a rejection is often then also indicated with some action item to follow up with the proposal author. If the proposal author is in the room for the discussion, they might simply take notes about some possible future revision of the proposal, and no action need be formally minuted.
In only a few instances would a rejection be minuted as a formal decision -- that case is generally limited to some encoding proposals that are objectionable in ways that the UTC determines are unlikely to be fixable, and which thus should not be re-discussed in future meetings.

Other kinds of documents in the register (and associated agenda items to discuss them) may not require minuting of formal decisions by the UTC at all.

> Same minutes as above:
>
>> E.1.11 Additional Emoji selection factor [Emoji SC/Edberg, L2/16-128]
>>
>> Discussion. UTC took no action at this time.
>
> [L2/16-128]: http://www.unicode.org/L2/L2016/16128-additional-emoji-selection-factor.pdf
>
> This was the proposed text to be added:

In a case like that, the UTC doesn't necessarily control the exact text of a web page. The emoji selection factors are not a formal specification or a published standard. They are guidelines that the Emoji Subcommittee uses to help organize and rationalize its consideration of all the various proposals that get submitted for encoding more emoji characters. That helps the Emoji Subcommittee assemble better summarized proposals to bring to the UTC when it is time to standardize some selected set of new emoji and assign code points for them for a new version of the Unicode Standard.

L2/16-128 was brought to the attention of the UTC by the Emoji Subcommittee to let the UTC know they were considering another selection factor, and to allow discussion and let people raise objections or make other suggestions. Once the Emoji Subcommittee gets that feedback, they could then go back and update the relevant web page regarding selection factors. No UTC decision is required for something like that.

People who have a problem with one or another of the selection criteria that the Emoji Subcommittee has been using can always submit feedback, if they wish, and I'm sure the Emoji Subcommittee would take such feedback under advisement.
In general, I would advise people who are interested in the UTC and UTC process to not treat the UTC minutes as legal documents that require their wording to be litigated line by line. Minutes of standards organizations function primarily as their institutional memory about decisions taken and associated actions to follow up on decisions. The wording of such minutes tends to be brief and telegraphic, because a lot of topics are taken up, and a lot of decisions and actions have to be recorded quickly -- and their wording is usually aimed at being clear to the people doing the actual maintenance of the standard(s) or other specifications. They are not meeting transcripts, and they do not attempt to recapitulate discussions nor do they provide detailed rationales for every decision taken by the committee.

If something is unclear about some decision taken by the UTC, or the outcome of the discussion of some particular topic is unclear and you desire elucidation, the best course is often simply to ask somebody who attended the meeting about it. Many participants in the UTC meetings *do* monitor this discussion list, for example.

--Ken

From kenwhistler at att.net Thu Sep 15 20:13:48 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Thu, 15 Sep 2016 18:13:48 -0700
Subject: Why isn't MUSICAL SYMBOL NULL NOTEHEAD default ignorable?
In-Reply-To:
References:
Message-ID: <36a257b1-2ee7-a972-f0ba-60bfdf92fdda@att.net>

On 9/5/2016 5:34 PM, Charlotte Buff wrote:
> It has just come to my attention that U+1D159 MUSICAL SYMBOL NULL NOTEHEAD is not default ignorable, even though it has no visible glyph appearance and no advance width in text, just like the various Hangul jamo fillers that *are* default ignorable. Is there a technical reason for this or is it just an oversight?
Well, the proximate reason is that it is General_Category=So, so that unless it were special-cased for the derivation of the Default_Ignorable property, it will end up Default_Ignorable=No in the UCD. As to why it wouldn't be special-cased to force it to end up Default_Ignorable=Yes, I don't think there was a whole lot of special thinking that went into this when the musical symbols were first added in Unicode 3.1 way back in 2001. Default_Ignorable was not even a formal property as of Unicode 3.1. That property was added (and rationalized) rather later. As to why Default_Ignorable=No is probably the correct value for U+1D159 anyway, think of it this way. The null notehead is essentially a musical notation specialized version of a non-breaking space -- it is essentially just a base for applying the various combining stems and flags for a display without showing a particular notehead, analogous to applying a generic combining mark to a NBSP to show that combining mark in isolation. It isn't clear that the null notehead should have no advance width, and in general, if you don't have a rendering system that displays such combinations correctly in context, it would arguably be better to show that there is some *thing* there, rather than to just omit any visible display at all. Such a situation is also roughly akin to the various synthetic virama characters in the standard, e.g., U+17D2 KHMER SIGN COENG, which is essentially a subscript consonant stacker. But if you can't display Khmer conjuncts correctly, it would be better to display a visible glyph at that point than to just ignore it for display altogether. So U+17D2 is also not Default_Ignorable, even though it has no well-defined glyph of its own (hence the dotted box shape shown in the code charts). And in the case of U+17D2, when correctly rendered, it definitely would *not* have its own advance width, yet it is still not Default_Ignorable=Yes. 
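[Illustration, not part of the original message.] The General_Category values mentioned above can be inspected with Python's standard `unicodedata` module. This is only a sketch: as far as I know the module does not expose the derived Default_Ignorable_Code_Point property, so the code shows the categories and names only. The helper name `describe` is made up for this example.

```python
import unicodedata
from typing import Tuple


def describe(cp: int) -> Tuple[str, str]:
    """Return (character name, General_Category) for a code point."""
    ch = chr(cp)
    return unicodedata.name(ch), unicodedata.category(ch)


# U+1D159 null notehead, U+17D2 Khmer coeng, U+00A0 NBSP,
# U+115F Hangul choseong filler (one of the jamo fillers mentioned above).
for cp in (0x1D159, 0x17D2, 0x00A0, 0x115F):
    name, gc = describe(cp)
    print(f"U+{cp:04X} {name}: gc={gc}")
```

The null notehead reports `So`, which is the "proximate reason" given above. Whether a code point is actually Default_Ignorable has to be checked against the UCD's DerivedCoreProperties.txt, not derived from this output.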
--Ken

From verdy_p at wanadoo.fr Thu Sep 15 22:41:23 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 16 Sep 2016 05:41:23 +0200
Subject: "textels"
In-Reply-To: <86r38llyxb.fsf@mimuw.edu.pl>
References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl>
Message-ID:

2016-09-15 21:56 GMT+02:00 Janusz S. Bień:

> On Thu, Sep 15 2016 at 21:27 CEST, eliz at gnu.org writes:
>
> [...]
>
> > Isn't "grapheme cluster" the definition you are looking for?
>
> I don't think so.
>
> However:
>
> 1. Graphemes, if I understand correctly, are language dependent, textels
> are not.
>

Your definition of textels is also language dependent, as you are reading it from a Polish point of view. However, you are confusing "graphemes" with "grapheme clusters" here. Your (Polish) textels are in fact the same as the (Polish) grapheme clusters.

Unicode also defines "default grapheme clusters", which are "grapheme clusters" not tailored for a particular language. A "default grapheme cluster" is the minimum unbreakable unit that can be seen as a valid "grapheme cluster" in most languages (or at least in most languages using the same base script if the script is used in that language; in other scripts, it just provides a minimum compatibility level to allow insertion of foreign texts in a multilingual document).
The grapheme clusters can then be used to parse text and apply various processes such as:

- normalization: grapheme clusters are not broken by it and can be compared for canonical equivalence (but you can compare smaller units using only the combining class property, by breaking text on characters with CC=0 and handling the special algorithmic case of modern Hangul syllables; see the Unicode standard about normalization)
- BiDi layout
- line breaking
- word breaking
- most standard text transforms (such as case folding)
- transliteration

Rendering text, however, often requires larger units, as successive grapheme clusters (if not split by a line break or by BiDi reordering) will interact visually to create more complex layouts (notably in Indic scripts), glued together by some controls (notably joining controls); things are also more complex in some cases where combining classes alone cannot properly represent these interactions.

Additionally, in a few cases the visual order is used for encoding text instead of the standard model using the logical order: this was done to preserve round-trip compatibility between Unicode and widely used legacy encodings (notably for the Thai script). However, this has a known caveat (which already existed before Unicode) for some algorithms such as word breaking (implementations need a lookup dictionary, but in Thai this dictionary is not very large) and line breaking (if we don't want to break inside words or in the middle of syllables). The default grapheme clusters, however, will correctly break the text to allow Thai text (encoded in visual order) to be rendered correctly.

In summary, the concept of "grapheme clusters" must be read and understood in the Unicode standard only as Unicode terminology used to describe all the other algorithms described in the standard.
They are not bound to a particular language unless this language is explicitly specified with the term; in that case we are no longer handling the "default grapheme clusters" rules but the additional rules tailoring the basic rules used to define the default grapheme clusters.

The "extended grapheme clusters" are used in contexts requiring more complex algorithms that need to group several grapheme clusters in an ordered sequence. These algorithms require some text buffering, and parsing from a random position in text may require looking backward over larger lengths to determine the context. Parsing text sequentially also requires keeping some additional context variables. Plain text searches based on "extended grapheme clusters" are also much more challenging than searches on "default grapheme clusters". For these reasons, the "extended grapheme clusters" are not defined in "default grapheme clusters" but will be needed for matching user expectations in particular languages or scripts. You normally don't need any "extended grapheme clusters" in Polish, except in multilingual documents that embed some non-Latin scripts, or some technical notations.

> 2. Textel "ń" means both U+0144 and , so it is a notion
> on a higher abstraction level than a grapheme cluster.
>
> Moreover I don't want to call (LATIN SMALL LETTER N,
> COMBINING ACUTE ACCENT) an extended grapheme cluster for at least 2
> reasons:
>
> 1. there is nothing extended in it
>

This combination is first a "grapheme cluster", before also being an "extended grapheme cluster" in Unicode terminology. The term "extended" comes from an extension added not for the case of combining characters encoded after base characters (or combined with them in a canonically equivalent string), but for other extensions, notably for complex syllabic constructs: every "grapheme cluster" may also be an "extended grapheme cluster", but the reverse is NOT true.
You have to read the standard about the various kinds of text breaking processes.

> 2. U+0301 is not a grapheme according to Polish linguistics terminology
>

Polish linguistics uses its own Polish term, not "grapheme", which in the standard is what is defined there in English, and only as the base of other definitions needed for parsing texts in various languages.

In Unicode U+0301 would be a grapheme, but if used in isolation it would not form a complete grapheme cluster; it would form a defective grapheme cluster, as it lacks the base with which it should be associated and which should be encoded before it (that base cannot be a non-character or a control, even if these are blockers against reordering for normalization processes and canonical equivalences, and cannot be another combining character).

From eliz at gnu.org Fri Sep 16 08:15:37 2016
From: eliz at gnu.org (Eli Zaretskii)
Date: Fri, 16 Sep 2016 16:15:37 +0300
Subject: "textels"
In-Reply-To: <10038497.19619.1474017953799.JavaMail.defaultUser@defaultHost> (message from William_J_G Overington on Fri, 16 Sep 2016 10:25:53 +0100 (BST))
References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <10038497.19619.1474017953799.JavaMail.defaultUser@defaultHost>
Message-ID: <83y42s3s06.fsf@gnu.org>

> Date: Fri, 16 Sep 2016 10:25:53 +0100 (BST)
> From: William_J_G Overington
>
> jsbien at mimuw.edu.pl wrote:
>
> > On Thu, Sep 15 2016 at 21:27 CEST, eliz at gnu.org writes:
>
> [...]
>
> >> Isn't "grapheme cluster" the definition you are looking for?
>
> > I don't think so.
>
> Is an example of a textel that would definitely not be a grapheme cluster be when a character is expressed as a BASE CHARACTER character followed by one or more TAG CHARACTER characters.
Since no formal definition of a "textel" was presented, except via an example, it's not clear to me whether what you propose can be a textel. (I also don't quite understand the semantics of a base character followed by tag characters, to say the truth.) From jsbien at mimuw.edu.pl Fri Sep 16 08:52:26 2016 From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=) Date: Fri, 16 Sep 2016 15:52:26 +0200 Subject: "textels" In-Reply-To: <86r38llyxb.fsf@mimuw.edu.pl> ("Janusz S. =?utf-8?Q?Bie=C5=84?= =?utf-8?Q?=22's?= message of "Thu, 15 Sep 2016 21:56:32 +0200") References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> Message-ID: <864m5grlyd.fsf@mimuw.edu.pl> On Thu, Sep 15 2016 at 21:56 CEST, jsbien at mimuw.edu.pl writes: [...] > 1. Graphemes, if I understand correctly, are language dependent, textels > are not. > > 2. Textel "?" means both U+0144 and , so it is a notion > on a higher abstraction level then a grapheme cluster. In other words, textels are equivalence classes of some set of Unicode characters strings by an equivalence relation which at the moment is open to the discussion but is very close to the official Unicode canonical equivalence (when working on a corpus of historical Polish we noticed some cases where standard Unicode equivalence was not convenient). [...] On Thu, Sep 15 2016 at 21:27 CEST, leoboiko at namakajiri.net writes: > Isn't the Swift "character" and the "textel" merely the same thing as > what Unicode already named "grapheme clusters"? As for the Swift "character", perhaps someone fluent in Swift will answer the question? > (Well, technically UAX > #29[1] defines them as "user-perceived characters", but then says > grapheme clusters approximate user-perceived characters > algorithmically). 
>
> And, indeed, Swift "Characters" are explicitly defined as "extended
> grapheme clusters" (also from UAX #29):
>
> https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html

Thank you very much for the link. Let me quote the relevant fragment:

--8<---------------cut here---------------start------------->8---
Extended Grapheme Clusters

Every instance of Swift's Character type represents a single extended grapheme cluster. An extended grapheme cluster is a sequence of one or more Unicode scalars that (when combined) produce a single human-readable character.

Here's an example. The letter é can be represented as the single Unicode scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same letter can also be represented as a pair of scalars -- a standard letter e (LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE ACCENT scalar (U+0301). The COMBINING ACUTE ACCENT scalar is graphically applied to the scalar that precedes it, turning an e into an é when it is rendered by a Unicode-aware text-rendering system.

In both cases, the letter é is represented as a single Swift Character value that represents an extended grapheme cluster. In the first case, the cluster contains a single scalar; in the second case, it is a cluster of two scalars:

[...]

*Two String values (or two Character values) are considered equal if their extended grapheme clusters are canonically equivalent.*
--8<---------------cut here---------------end--------------->8---

For me it means that Swift's characters are equivalence classes of the set of extended grapheme clusters under the canonical equivalence relation.

> Such a notion is indeed needed, but it has been always there.
>
> [1] http://unicode.org/reports/tr29/

I don't see a notion of such equivalence classes there.

On Thu, Sep 15 2016 at 16:36 CEST, john.w.kennedy at gmail.com writes:

[...]
> In the new Swift programming language, which is white-hot in the Apple
> community, Apple is moving toward a model of a transparent, generic
> Unicode that can be "viewed" as UTF-8, UTF-16, or UTF-32 if necessary,
> but in which a "character" contains however many code points it needs
> ("e" with a stacked macron, acute accent, and dieresis is
> algorithmically one "character" in Swift). Moreover,
> e-with-an-acute-accent and e followed by a combining acute accent, for
> example, compare as equal. At present, the underlying code is still
> UTF-16LE.

If you insist that Swift's "characters" are just grapheme clusters, then you add a different, although related, meaning to the term "grapheme cluster". I think the notion deserves a term of its own.

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From eric.muller at efele.net Fri Sep 16 10:03:54 2016
From: eric.muller at efele.net (Eric Muller)
Date: Fri, 16 Sep 2016 08:03:54 -0700
Subject: "textels"
In-Reply-To: <864m5grlyd.fsf@mimuw.edu.pl>
References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <864m5grlyd.fsf@mimuw.edu.pl>
Message-ID: <7f3e554b-a060-152c-f0f0-ae9908d857b8@efele.net>

On 9/16/2016 6:52 AM, Janusz S. Bień wrote:

> (when working on a corpus of historical Polish we
> noticed some cases where standard Unicode equivalence was not
> convenient).

I'm very interested to know more about those cases.

Thanks,
Eric.
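[Illustration, not part of the original messages.] The "equivalence class under canonical equivalence" idea discussed in this thread can be sketched in Python with the standard `unicodedata` module. This is a minimal sketch under stated assumptions: two strings are treated as canonically equivalent when their NFD forms are identical, and the NFC form is picked as the class representative. The names `canonically_equivalent` and `textel_key` are hypothetical; they come neither from Swift nor from any proposal in the thread, and the sketch ignores grapheme cluster segmentation entirely.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same NFD form (canonical equivalence)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def textel_key(s: str) -> str:
    """Hypothetical helper: use the NFC form as the representative
    of the canonical-equivalence class that s belongs to."""
    return unicodedata.normalize("NFC", s)


precomposed = "\u0144"    # LATIN SMALL LETTER N WITH ACUTE
decomposed = "n\u0301"    # LATIN SMALL LETTER N + COMBINING ACUTE ACCENT

print(precomposed == decomposed)                        # code point comparison
print(canonically_equivalent(precomposed, decomposed))  # canonical comparison
print(textel_key(decomposed) == precomposed)            # same representative
```

Comparing the raw strings distinguishes the two spellings, while comparing normalized forms does not, which is roughly the behaviour the quoted Swift documentation describes for `Character` equality.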
From wjgo_10009 at btinternet.com Fri Sep 16 04:25:53 2016 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 16 Sep 2016 10:25:53 +0100 (BST) Subject: "textels" In-Reply-To: <86r38llyxb.fsf@mimuw.edu.pl> References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> Message-ID: <10038497.19619.1474017953799.JavaMail.defaultUser@defaultHost> jsbien at mimuw.edu.pl wrote: > On Thu, Sep 15 2016 at 21:27 CEST, eliz at gnu.org writes: [...] >> Isn't "grapheme cluster" the definition you are looking for? > I don't think so. Is an example of a textel that would definitely not be a grapheme cluster be when a character is expressed as a BASE CHARACTER character followed by one or more TAG CHARACTER characters. Such a construct was first suggested for some flag characters. William Overington 16 September 2016 From wjgo_10009 at btinternet.com Fri Sep 16 09:07:41 2016 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 16 Sep 2016 15:07:41 +0100 (BST) Subject: "textels" Message-ID: <3731612.52242.1474034861874.JavaMail.defaultUser@defaultHost> >(I also don't quite understand the semantics of a base character followed by tag characters, to say the truth.) Page 2 of the following document is where the idea was introduced. http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf The document is linked from the following page. http://www.unicode.org/L2/L2015/Register-2015.html William Overington 16 September 2016 From jsbien at mimuw.edu.pl Fri Sep 16 10:30:48 2016 From: jsbien at mimuw.edu.pl (Janusz S. 
Bien) Date: Fri, 16 Sep 2016 17:30:48 +0200 Subject: "textels" In-Reply-To: <7f3e554b-a060-152c-f0f0-ae9908d857b8@efele.net> References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <864m5grlyd.fsf@mimuw.edu.pl> <7f3e554b-a060-152c-f0f0-ae9908d857b8@efele.net> Message-ID: <20160916173048.183359euk7i1fktk@mail.mimuw.edu.pl> Quote/Cytat - Eric Muller (pi?, 16 wrz 2016, 17:03:54): > On 9/16/2016 6:52 AM, Janusz S. Bie? wrote: >> (when working on a corpus of historical Polish we >> noticed some cases where standard Unicode equivalence was not >> convenient). > > I'm very interested to know more about those cases. For our search engine we were unable to use compatibility equivalence "out of the box" for splitting the ligature because it also converted long s to short s while we wanted to preserve the distinction. Regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From eric.muller at efele.net Fri Sep 16 10:47:27 2016 From: eric.muller at efele.net (Eric Muller) Date: Fri, 16 Sep 2016 08:47:27 -0700 Subject: "textels" In-Reply-To: <20160916173048.183359euk7i1fktk@mail.mimuw.edu.pl> References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <864m5grlyd.fsf@mimuw.edu.pl> <7f3e554b-a060-152c-f0f0-ae9908d857b8@efele.net> <20160916173048.183359euk7i1fktk@mail.mimuw.edu.pl> Message-ID: <8934bf10-6fd2-2e87-0260-4706e3f22119@efele.net> On 9/16/2016 8:30 AM, Janusz S. Bien wrote: > Quote/Cytat - Eric Muller (pi?, 16 wrz 2016, > 17:03:54): > >> On 9/16/2016 6:52 AM, Janusz S. Bie? 
wrote: >>> (when working on a corpus of historical Polish we >>> noticed some cases where standard Unicode equivalence was not >>> convenient). >> >> I'm very interested to know more about those cases. > > For our search engine we were unable to use compatibility equivalence > "out of the box" for splitting the ligature because it also converted > long s to short s while we wanted to preserve the distinction. I am interested in the problems with *canonical* equivalence. I thought that you were talking about those before. Compatibility equivalence is a completely different beast. It is, IMHO, too coarse a tool and best forgotten. For any particular task, it's typically doing too much (e.g. long/short s folding in your case) and too little (not everything you need). There was an attempt at improving the situation, by providing a whole bunch of fine grained, targeted transformations (http://www.unicode.org/reports/tr30/), but that did not pan out. Eric. Thanks, Eric. From jsbien at mimuw.edu.pl Fri Sep 16 10:57:44 2016 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Fri, 16 Sep 2016 17:57:44 +0200 Subject: "textels" In-Reply-To: <8934bf10-6fd2-2e87-0260-4706e3f22119@efele.net> References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <864m5grlyd.fsf@mimuw.edu.pl> <7f3e554b-a060-152c-f0f0-ae9908d857b8@efele.net> <20160916173048.183359euk7i1fktk@mail.mimuw.edu.pl> <8934bf10-6fd2-2e87-0260-4706e3f22119@efele.net> Message-ID: <20160916175744.11941cx23il9zp14@mail.mimuw.edu.pl> Quote/Cytat - Eric Muller (pi?, 16 wrz 2016, 17:47:27): > On 9/16/2016 8:30 AM, Janusz S. Bien wrote: >> Quote/Cytat - Eric Muller (pi?, 16 wrz >> 2016, 17:03:54): >> >>> On 9/16/2016 6:52 AM, Janusz S. Bie? wrote: >>>> (when working on a corpus of historical Polish we >>>> noticed some cases where standard Unicode equivalence was not >>>> convenient). 
>>> >>> I'm very interested to know more about those cases. >> >> For our search engine we were unable to use compatibility >> equivalence "out of the box" for splitting the ligature because it >> also converted long s to short s while we wanted to preserve the >> distinction. > > I am interested in the problems with *canonical* equivalence. I > thought that you were talking about those before. I apologize for the confusion, that was my fault. I tend to answer too quickly and not precisely enough :-( On the other hand I'm not sure canonical equivalence is always what I want and expect, but I don't have specific examples at hand. Regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From christoph.paeper at crissov.de Fri Sep 16 16:51:38 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Fri, 16 Sep 2016 23:51:38 +0200 Subject: "textels" In-Reply-To: <86r38llyxb.fsf@mimuw.edu.pl> References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> Message-ID: <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> Janusz S. Bie? : > > 1. Graphemes, if I understand correctly, are language dependent, ? That?s true in linguistic terminology ? well, at least within the more popular schools of thought ?, but not in technical (i.e. Unicode) jargon. From mats.gbproject at gmail.com Sat Sep 17 04:19:59 2016 From: mats.gbproject at gmail.com (Mats Blakstad) Date: Sat, 17 Sep 2016 11:19:59 +0200 Subject: Dataset for all ISO639 code sorted by country/territory? Message-ID: Hi Is there any dataset that contains all languages in the world sorted by country/territory? 
I found this at Unicode, but seems like only containing the most spoken languages in each country and not the smaller once: http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html Thanks in advance for help. Best regards Mats Blakstad -------------- next part -------------- An HTML attachment was scrubbed... URL: From otto.stolz at uni-konstanz.de Sat Sep 17 06:27:02 2016 From: otto.stolz at uni-konstanz.de (Otto Stolz) Date: Sat, 17 Sep 2016 13:27:02 +0200 Subject: Dataset for all ISO639 code sorted by country/territory? In-Reply-To: References: Message-ID: <57DD2886.2060408@uni-konstanz.de> Hello, am 2016-09-17 um 11:19 Uhr hat Mats Blakstad geschrieben: > Is there any dataset that contains all languages in the world sorted by > country/territory? Have you tried , already? Also, and may provide partial answers. Best wishes, Otto Stolz From verdy_p at wanadoo.fr Sat Sep 17 06:35:20 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 17 Sep 2016 13:35:20 +0200 Subject: Dataset for all ISO639 code sorted by country/territory? In-Reply-To: References: Message-ID: Not all languages are sorted, only those for which there are released data in CLDR. And languages frequently belong to several countries/territories at the same time, with different official or recognized status (itself independant of the number of actual speakers, which is very frequently roughly estimated). Some countries are giving official statistics about their national or regional languages, but frequently these stats are old, or underestimated or overestimated for political reasons, or some languages are mixed as if they were only one, or simply discarded if it is considered locally as a secondary language, even if the official language is superficially understood but taken as a primary one. Statistics are also forgetting native speakers living abroad in a diaspora, or secondary learners of a language taught in foreign countries. 
2016-09-17 11:19 GMT+02:00 Mats Blakstad : > Hi > > Is there any dataset that contains all languages in the world sorted by > country/territory? > > I found this at Unicode, but seems like only containing the most spoken > languages in each country and not the smaller once: > http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_ > information.html > > Thanks in advance for help. > > Best regards > Mats Blakstad > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mats.gbproject at gmail.com Sat Sep 17 07:10:26 2016 From: mats.gbproject at gmail.com (Mats Blakstad) Date: Sat, 17 Sep 2016 14:10:26 +0200 Subject: Dataset for all ISO639 code sorted by country/territory? In-Reply-To: References: Message-ID: I manage to find a dataset on the website of Ethnologue, though it doesn't look like open source, need to check with them exactly how I'm allowed to use it: http://www.ethnologue.com/codes/download-code-tables Thanks for the explanation Phillippe. I know it is not an easy issue. Look for different resources on the web, any specific links or feedbacks would be helpful. On 17 September 2016 at 13:35, Philippe Verdy wrote: > Not all languages are sorted, only those for which there are released data > in CLDR. > And languages frequently belong to several countries/territories at the > same time, with different official or recognized status (itself independant > of the number of actual speakers, which is very frequently roughly > estimated). > Some countries are giving official statistics about their national or > regional languages, but frequently these stats are old, or underestimated > or overestimated for political reasons, or some languages are mixed as if > they were only one, or simply discarded if it is considered locally as a > secondary language, even if the official language is superficially > understood but taken as a primary one. 
> Statistics are also forgetting native speakers living abroad in a > diaspora, or secondary learners of a language taught in foreign countries. > > > 2016-09-17 11:19 GMT+02:00 Mats Blakstad : > >> Hi >> >> Is there any dataset that contains all languages in the world sorted by >> country/territory? >> >> I found this at Unicode, but seems like only containing the most spoken >> languages in each country and not the smaller once: >> http://www.unicode.org/cldr/charts/latest/supplemental/terri >> tory_language_information.html >> >> Thanks in advance for help. >> >> Best regards >> Mats Blakstad >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From deepak.jois at gmail.com Sat Sep 17 06:31:10 2016 From: deepak.jois at gmail.com (Deepak Jois) Date: Sat, 17 Sep 2016 17:01:10 +0530 Subject: =?UTF-8?Q?Unicode_Bidi_Algorithm_=E2=80=93_Java_reference_implementa?= =?UTF-8?Q?tion?= Message-ID: Hi It seems that the Java reference implementation for the Unicode Bidi algorithm that I downloaded from the unicode.org site fails against some test cases in the BidiCharacterTest.txt file ? the ones that are specifically meant to test for changes in Unicode 8.0. Has the reference implementation been updated, and does anyone have a copy they can share? Is there a reference implementation in some other language that I could look at, which has been updated? Thank you Deepak From khaledhosny at eglug.org Sat Sep 17 11:23:51 2016 From: khaledhosny at eglug.org (Khaled Hosny) Date: Sat, 17 Sep 2016 18:23:51 +0200 Subject: Unicode Bidi Algorithm =?utf-8?B?4oCT?= =?utf-8?Q?_Java?= reference implementation In-Reply-To: References: Message-ID: <20160917162351.GB1339@macbook> On Sat, Sep 17, 2016 at 05:01:10PM +0530, Deepak Jois wrote: > Hi > > It seems that the Java reference implementation for the Unicode Bidi > algorithm that I downloaded from the unicode.org site fails against > some test cases in the BidiCharacterTest.txt file ? 
the ones that are > specifically meant to test for changes in Unicode 8.0. > > Has the reference implementation been updated, and does anyone have a > copy they can share? Is there a reference implementation in some other > language that I could look at, which has been updated? I think there is a C implementation that is kept up to date, and there is also a Python implementation that should pass the tests: https://github.com/behdad/pybyedie Regards, Khaled From deepak.jois at gmail.com Sat Sep 17 12:26:55 2016 From: deepak.jois at gmail.com (Deepak Jois) Date: Sat, 17 Sep 2016 22:56:55 +0530 Subject: =?UTF-8?Q?Re=3A_Unicode_Bidi_Algorithm_=E2=80=93_Java_reference_implem?= =?UTF-8?Q?entation?= In-Reply-To: <20160917162351.GB1339@macbook> References: <20160917162351.GB1339@macbook> Message-ID: On Sat, Sep 17, 2016 at 9:53 PM, Khaled Hosny wrote: > I think there is a C implementation that is kept up to date, Yes, I found that one after I posted. FWIW, here are the changes for the latest version: https://gist.github.com/deepakjois/5a3ae81a105abd3523ed0efe2e52f52e/revisions > is also a Python implementation that should pass the tests That implementation looks very different from the C and Java versions. I can?t tell by looking at a glance if it has been updated for the changes in Unicode 8.0. But it definitely will not pass the tests in BidiCharacter.txt because it lacks support for paired brackets. I just finished writing a reference implementation in Lua[1] which is a line by line port of the Java reference implementation and passes nearly all tests in BidiCharacter.txt. I now need to make the updates to support the changes in Unicode 8.0, and I am finding it a bit hard to grok the changes in C at a glance. Deepak [1]: https://github.com/deepakjois/luabidi/blob/master/src/bidi.lua From jsbien at mimuw.edu.pl Sun Sep 18 05:26:26 2016 From: jsbien at mimuw.edu.pl (Janusz S. 
Bien) Date: Sun, 18 Sep 2016 12:26:26 +0200 Subject: "textels" In-Reply-To: <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> Message-ID: <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl> Quote/Cytat - Christoph P?per (pi?, 16 wrz 2016, 23:51:38): > Janusz S. Bie? : >> >> 1. Graphemes, if I understand correctly, are language dependent, ? > > That?s true in linguistic terminology ? well, at least within the > more popular schools of thought ?, but not in technical (i.e. > Unicode) jargon. From the Unicode glossary: Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system.[...] (2) What a user thinks of as a character. As for (2), cf. User-Perceived Character. What everyone thinks of as a character in their script. So we have "a user" versus "everyone...in their script" - is the difference intentional? Probably not. Anyway the definitions are language/locale dependent. Regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From christoph.paeper at crissov.de Sun Sep 18 14:40:21 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Sun, 18 Sep 2016 21:40:21 +0200 Subject: "textels" In-Reply-To: <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl> References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl> Message-ID: <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de> Janusz S. 
Bien : > > From the Unicode glossary: > >> Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system.[...] (2) What a user thinks of as a character. > >> User-Perceived Character. What everyone thinks of as a character in their script. > > [?] the definitions are language/locale dependent. A writing system is (usually) language-dependent, a script is not, although some scripts have been used exclusively (or prominently) in a single writing system with a single language. So definition (1) of ?grapheme? would be appropriate for linguistics, (2) maybe for typography and computer science, but it?? extremely vague. From asmusf at ix.netcom.com Sun Sep 18 15:02:01 2016 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Sun, 18 Sep 2016 13:02:01 -0700 Subject: "textels" In-Reply-To: <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl> References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl> Message-ID: On 9/18/2016 3:26 AM, Janusz S. Bien wrote: > Quote/Cytat - Christoph P?per (pi?, 16 > wrz 2016, 23:51:38): > >> Janusz S. Bie? : >>> >>> 1. Graphemes, if I understand correctly, are language dependent, ? >> >> That?s true in linguistic terminology ? well, at least within the >> more popular schools of thought ?, but not in technical (i.e. >> Unicode) jargon. > > From the Unicode glossary: > > Grapheme. (1) A minimally distinctive unit of writing in the context > of a particular writing system.[...] (2) What a user thinks of as a > character. "writing system" is vague enough to cover variations that might be regional or language dependent. > > As for (2), cf. > > User-Perceived Character. What everyone thinks of as a character in > their script. > > So we have "a user" versus "everyone...in their script" - is the > difference intentional? 
Probably not. Anyway the definitions are
> language/locale dependent.

The "everyone" here aims at a shared understanding.

This becomes tricky in the case of Abugidas. There's certainly a shared understanding that the "unit of writing" is the syllable, rather than an individual mark, but the latter do have well-understood identities, not least for teaching. That's perhaps the reason why there's the handwaving about "minimally distinctive".

In some scripts like that, users can enter multiple sequences of characters that resolve (for all practical purposes) into the same syllable. (A big part of that in some scripts is that Unicode does not always provide a means to normalize the order of subsidiary signs and marks, typically combining marks.)

For some tasks it would be great to have only well-formed syllables; but to do that, you would need to add additional interpretation on top of the Unicode definitions of a grapheme cluster.

If you just wrap the raw combining sequences into textels, then some tasks might not actually get simpler. Instead of a simple rule that determines which alternate orderings of marks are equivalent (to account for users not typing them in the preferred order) you would have to exhaustively list all combinations and set up equivalence tables.

A./

From kenwhistler at att.net Sun Sep 18 19:16:50 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Sun, 18 Sep 2016 17:16:50 -0700
Subject: Re: Unicode Bidi Algorithm – Java reference implementation
In-Reply-To:
References: <20160917162351.GB1339@macbook>
Message-ID: <61882234-899a-84bf-3fac-017d27af553a@att.net>

On 9/17/2016 10:26 AM, Deepak Jois wrote:
> I now need to make the updates to support the changes in Unicode 8.0,
> and I am finding it a bit hard to grok the changes in C at a glance.
>

The UBA 7.0 --> UBA 8.0 changes were rather subtle.
They did not change much about the gross behavior of the algorithm, but there were some fixes for edge cases in a couple rules. Also, the specification of behavior on stack overflow became exact, rather than implementation-defined.

The C bidi reference code is a bit complicated, because it supports *all* UBA versions from 6.2 through 8.0, which means it has to special case rule processing by versions when the specification itself changes.

If you diff the 7.0 version of brrule.c and the 8.0 version of brrule.c you'll find the heart of the differences there, along with explanations in comments for the changes. The new function br_SetBracketPairBC handles an edge case for combining marks following a bracket. The code using a new flag testONisNotRequired deals with an edge case for the current Bidi_Class of brackets being tested for pairing. Changes in br_PushBracketStack are involved in the need to keep the pre-8.0 behavior as it was for earlier versions of bidiref, but allowing for explicit behavior for stack overflow for 8.0.

It may also help to compare the 7.0 and 8.0 versions of UAX #9 itself, so you can see the textual changes in the specification of the rules. Try diffing:

http://www.unicode.org/reports/tr9/tr9-31.html (7.0)
http://www.unicode.org/reports/tr9/tr9-33.html (8.0)

The significant changes there are in BD11, BD14, BD15, BD16, and in rules X5a, X5b, X6a, and N0. (The rest of the changes in the updated document are cosmetic.)

--Ken

From jsbien at mimuw.edu.pl Mon Sep 19 01:23:53 2016
From: jsbien at mimuw.edu.pl (Janusz S.
Bień)
Date: Mon, 19 Sep 2016 08:23:53 +0200
Subject: User-perceived character (was: "textels")
In-Reply-To: (Asmus Freytag's message of "Sun, 18 Sep 2016 13:02:01 -0700")
References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
Message-ID: <86h99c5rwm.fsf_-_@mimuw.edu.pl>

On Sun, Sep 18 2016 at 22:02 CEST, asmusf at ix.netcom.com writes:

> On 9/18/2016 3:26 AM, Janusz S. Bien wrote:

[...]

>> From the Unicode glossary:
>>
>> Grapheme. (1) A minimally distinctive unit of writing in the context
>> of a particular writing system.[...] (2) What a user thinks of as a
>> character.
>
> "writing system" is vague enough to cover variations that might be
> regional or language dependent.

That is obvious to me.

>>
>> As for (2), cf.
>>
>> User-Perceived Character. What everyone thinks of as a character in
>> their script.
>>
>> So we have "a user" versus "everyone...in their script" - is the
>> difference intentional? Probably not. Anyway the definitions are
>> language/locale dependent.
>
> The "everyone" here aims at a shared understanding.

That's also quite obvious to me. "A user" in grapheme (2) is at least strange.

>
> This becomes tricky in the case of Abugidas. There's certainly a
> shared understanding that the "unit of writing" is the syllable,
> rather than an individual mark, but the latter do have well-understood
> identities, not least for teaching. That's perhaps the reason why
> there's the handwaving about "minimally distinctive".
>
> In some scripts like that, users can enter multiple sequences of
> characters that resolve (for all practical purposes) into the same
> syllable.
(A big part of that in some scripts is that Unicode does not
> always provide a means to normalize the order of subsidiary signs and
> marks, typically combining marks)
>
> For some tasks it would be great to have only well-formed syllables;
> but to do that, you would need to add additional interpretation on top
> of the Unicode definitions of a grapheme cluster.
>
> If you just wrap the raw combining sequences into textels, then some
> tasks might not actually get simpler. Instead of a simple rule that
> determines which alternate orderings of marks are equivalent (to
> account for users not typing them in the preferred order) you would
> have to exhaustively list all combinations and set up equivalent
> tables.

I would like to know how Swift is handling this. I still have a feeling that the Swift characters are almost exactly my textels.

Best regards

Janusz

-- 
,
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From verdy_p at wanadoo.fr Mon Sep 19 01:29:12 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 19 Sep 2016 08:29:12 +0200
Subject: Re: Unicode Bidi Algorithm – Java reference implementation
In-Reply-To: <61882234-899a-84bf-3fac-017d27af553a@att.net>
References: <20160917162351.GB1339@macbook> <61882234-899a-84bf-3fac-017d27af553a@att.net>
Message-ID:

I note that there's a confusion in the introduction of UAX#9:

"On web pages, the explicit directional formatting characters (of all types – embedding, override, and isolate) should be replaced by using the dir attribute and the elements BDI and BDO."

The suggested replacements do not match the order of the listed types.

- embedding (with LRE/PDF or RLE/PDF) just uses the dir="ltr/rtl" attribute on any element (except BDI and BDO)
- override (with LRO/PDF or RLO/PDF) uses BDO with the dir="ltr/rtl" attribute
- explicit isolate (with LRI/PDI or RLI/PDI) uses BDI with the dir="ltr/rtl" attribute
- "automatic" isolate (with FSI/PDI) uses BDI without any dir attribute

Two implicit directional characters (LRM or RLM) are also convertible to overrides as an empty BDO element with dir="ltr/rtl". Only ALM has no equivalent.

----

But for most cases, HTML documents should simply not use embedding or override at all; isolates with BDI are much preferred and are in fact simpler to manage than what section 6.4 suggests. That suggestion, using RLM or LRM before the separating punctuation, does not work reliably, as it implies that you can predict the implicit reading direction of the whole list, whose ordering normally depends on the context or the document containing the list. It is much simpler to isolate each list element and then pack the list using the unmarked punctuation.

An example of this is found on international wikis that must display an inter-language bar to navigate to other translated versions of the same page: the same template will be used on all pages, and the list of languages cannot be predicted and may evolve over time, containing LTR or RTL language names at unpredictable positions anywhere in the list, formatted with the same separator within a single inline span in a paragraph starting with a translatable introduction heading, and you cannot predict which language name will occur after that separator.
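
For plain text, the per-item isolation recommended in this message might be sketched in Python as below. This is only an illustration under stated assumptions: the helper names `isolate` and `join_isolated`, the sample language names, and the " | " separator are all hypothetical; only the code points U+2068 (FSI) and U+2069 (PDI) come from the Unicode Standard.

```python
# Hypothetical sketch: wrap each list item in FSI ... PDI so that its
# direction is resolved on its own, then join with an unmarked separator.
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(item: str) -> str:
    """Return the item wrapped in an 'automatic' (first-strong) isolate."""
    return f"{FSI}{item}{PDI}"


def join_isolated(items, separator=" | "):
    """Join items with a plain separator, isolating each item individually."""
    return separator.join(isolate(item) for item in items)


# Assumed sample data: a mix of LTR and RTL language names.
names = ["English", "\u0627\u0644\u0639\u0631\u0628\u064a\u0629",
         "\u05e2\u05d1\u05e8\u05d9\u05ea", "Fran\u00e7ais"]
bar = join_isolated(names)
```

No RLM or LRM is needed before the separators here, and the first item is treated exactly like the others, which is the point made about the start of the list.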

Using BDI (without even needing any dir="rtl/ltr") or FSI/PDI to isolate each language name will work much better than unconditionally using some static RLM or LRM before the separating punctuation (note that there's no such punctuation at the start of the list, so the ordering of the first element is not set correctly unless there's an RLM or LRM also before that first element, which may then render incorrectly).

The best and most flexible solution is to use "automatic" isolates for each list item (with FSI/PDI in plain-text documents, or BDI elements without any dir attribute in HTML documents).

The same is also true when inserting quotations (including when giving the title of another document, or the name of an author) or for formatting translatable text containing "placeholder variables" whose content will be generated separately.

BDI elements without any dir attribute can efficiently replace SPAN elements, and can still have their own optional formatting styles (colors, font families, font size, line height, font styles and weight, visual effects...), or title attributes (to give hints to readers about what the isolate value will be used for), or identifier (useful to generate stable anchors that work across all translations of the document).

There are also CSS styles using unicode-bidi properties, but they should be completely avoided in HTML (these styles are better inferred from BDI elements).

2016-09-19 2:16 GMT+02:00 Ken Whistler :

>
> On 9/17/2016 10:26 AM, Deepak Jois wrote:
>
>> I now need to make the updates to support the changes in Unicode 8.0,
>> and I am finding it a bit hard to grok the changes in C at a glance.
>>
>>
> The UBA 7.0 --> UBA 8.0 changes were rather subtle. They did not change
> much about the gross behavior of the algorithm, but there were some fixes
> for edge cases in a couple rules. Also, the specification of behavior on
> stack overflow became exact, rather than implementation-defined.
>
> The C bidi reference code is a bit complicated, because it supports *all*
> UBA versions from 6.2 through 8.0, which means it has to special case rule
> processing by versions when the specification itself changes.
>
> If you diff the 7.0 version of brrule.c and the 8.0 version of brrule.c
> you'll find the heart of the differences there, along with explanations in
> comments for the changes. The new function br_SetBracketPairBC handles an
> edge case for combining marks following a bracket. The code using a new
> flag testONisNotRequired deals with an edge case for the current Bidi_Class
> of brackets being tested for pairing. Changes in br_PushBracketStack are
> involved in the need to keep the pre-8.0 behavior as it was for earlier
> versions of bidiref, but allowing for explicit behavior for stack overflow
> for 8.0.
>
> It may also help to compare the 7.0 and 8.0 versions of UAX #9 itself, so
> you can see the textual changes in the specification of the rules. Try
> diffing:
>
> http://www.unicode.org/reports/tr9/tr9-31.html (7.0)
> http://www.unicode.org/reports/tr9/tr9-33.html (8.0)
>
> The significant changes there are in BD11, BD14, BD15, BD16, and in rules
> X5a, X5b, X6a, and N0. (The rest of the changes in the updated document are
> cosmetic.)
>
> --Ken
>
>

From jsbien at mimuw.edu.pl Mon Sep 19 01:40:05 2016
From: jsbien at mimuw.edu.pl (Janusz S.
Bień)
Date: Mon, 19 Sep 2016 08:40:05 +0200
Subject: graphemes (was: "textels")
In-Reply-To: <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de> ("Christoph Päper"'s message of "Sun, 18 Sep 2016 21:40:21 +0200")
References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl> <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de>
Message-ID: <86d1k05r5m.fsf_-_@mimuw.edu.pl>

On Sun, Sep 18 2016 at 21:40 CEST, christoph.paeper at crissov.de writes:

> Janusz S. Bien :
>>
>> From the Unicode glossary:
>>
>>> Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system.[...] (2) What a user thinks of as a character.
>>
>>> User-Perceived Character. What everyone thinks of as a character in their script.
>>
>> […] the definitions are language/locale dependent.
>
> A writing system is (usually) language-dependent, a script is not,
> although some scripts have been used exclusively (or prominently) in a
> single writing system with a single language.

It depends of course on what exactly you mean by script, and which meaning of the term is intended in the definition of User-Perceived Character. But "a user" is definitely language/locale dependent :-)

> So definition (1) of “grapheme” would be appropriate for linguistics,
> (2) maybe for typography and computer science, but it’s extremely
> vague.

I think that 'grapheme' (2) in the present wording is simply incorrect. I suspect it is not used in the standard at all. Searching the Unicode site I found only one use of 'grapheme' alone:

http://www.unicode.org/L2/L2000/00274-N2236-grapheme-joiner.htm

Graphemes are sequences of one or more encoded characters that correspond to what users think of as characters.

I guess the intention of 'grapheme' (2) was to describe it without any reference to computer encoding, which is definitely an extremely difficult task.

Best regards

Janusz

-- 
,
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From Mark.Dalley at swcsu.nhs.uk Mon Sep 19 03:45:56 2016
From: Mark.Dalley at swcsu.nhs.uk (Dalley Mark (South West Commissioning Support))
Date: Mon, 19 Sep 2016 08:45:56 +0000
Subject: graphemes (was: "textels")
In-Reply-To: <86d1k05r5m.fsf_-_@mimuw.edu.pl>
References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl> <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de> <86d1k05r5m.fsf_-_@mimuw.edu.pl>
Message-ID:

I think the key phrase is "user-perceived". And you don't need to involve complex scripts either.

For instance as an English-speaking person, I would perceive the "æ" in "encyclopædia" as being two characters (albeit shoved together somewhat). The argument for this is that the word can equally well be rendered as "encyclopaedia".

A Danish or Norwegian speaker, on the other hand, would perceive "æ" (as in "ære" or "æsj!") as being a single indivisible character.

Mark Dalley

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Janusz S. Bien
Sent: 19 September 2016 07:40
To: Christoph Päper
Cc: unicode Unicode Discussion
Subject: graphemes (was: "textels")

On Sun, Sep 18 2016 at 21:40 CEST, christoph.paeper at crissov.de writes:

> Janusz S. Bien :
>>
>> From the Unicode glossary:
>>
>>> Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system.[...]
(2) What a user thinks of as a character.
>>
>>> User-Perceived Character. What everyone thinks of as a character in their script.
>>
>> […] the definitions are language/locale dependent.
>
> A writing system is (usually) language-dependent, a script is not,
> although some scripts have been used exclusively (or prominently) in a
> single writing system with a single language.

It depends of course on what exactly you mean by script, and which meaning of the term is intended in the definition of User-Perceived Character. But "a user" is definitely language/locale dependent :-)

> So definition (1) of “grapheme” would be appropriate for linguistics,
> (2) maybe for typography and computer science, but it’s extremely
> vague.

I think that 'grapheme' (2) in the present wording is simply incorrect. I suspect it is not used in the standard at all. Searching the Unicode site I found only one use of 'grapheme' alone:

http://www.unicode.org/L2/L2000/00274-N2236-grapheme-joiner.htm

Graphemes are sequences of one or more encoded characters that correspond to what users think of as characters.

I guess the intention of 'grapheme' (2) was to describe it without any reference to computer encoding, which is definitely an extremely difficult task.

Best regards

Janusz

-- 
,
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S.
Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From christoph.paeper at crissov.de Mon Sep 19 14:16:50 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Mon, 19 Sep 2016 21:16:50 +0200
Subject: graphemes (was: "textels")
In-Reply-To:
References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl> <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de> <86d1k05r5m.fsf_-_@mimuw.edu.pl>
Message-ID: <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de>

Dalley Mark (South West Commissioning Support) :
>
> I think the key phrase is "user-perceived". And you don't need to involve complex scripts either.
>
> For instance as an English-speaking person, I would perceive the "æ" in "encyclopædia" as being two characters (albeit shoved together somewhat). The argument for this is that the word can equally well be rendered as "encyclopaedia".

If

- encyclopedia
- encyclopædia
- encyclopaedia

are all legal spellings of the same word in a writing system, a useful linguistic definition of grapheme should ensure that all three variants have the same number of graphemes.

Although linguists often prefer minimal pair analysis, there are some rules of thumb for what is a grapheme:

- … whatever goes into a single box in a crossword puzzle.
- … whatever gets transposed if you reverse a word or generate an anagram.
- … whatever gets capitalized together in the beginning of a word. (Some argue that capitalization operates on characters, not graphemes, though.)
- … whatever can never be split up by hyphenation.
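
As a rough illustration of how the spellings and the "reverse a word" rule look at the encoding level, here is a minimal Python sketch. It is not taken from the thread: the function name `reverse_keeping_marks` and the test word are assumptions, and grouping combining marks with the preceding base character is a deliberate simplification, not the full grapheme cluster segmentation of UAX #29.

```python
import unicodedata

# The three spellings discussed above; U+00E6 is LATIN SMALL LETTER AE.
spellings = ["encyclopedia", "encyclop\u00e6dia", "encyclopaedia"]

# len() on a Python str counts code points, not graphemes in any sense.
code_point_counts = [len(word) for word in spellings]

# U+00E6 has no decomposition mapping, so normalization never turns it
# into the two letters "ae".
ae_is_atomic = unicodedata.normalize("NFKD", "\u00e6") == "\u00e6"


def reverse_keeping_marks(word: str) -> str:
    """Reverse a word while keeping combining marks after their base.

    Simplified: a character with a non-zero canonical combining class
    is attached to the preceding character.
    """
    clusters = []
    for ch in word:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


word = "re\u0301"                      # "r", "e", COMBINING ACUTE ACCENT
naive = word[::-1]                     # accent now precedes the "e"
careful = reverse_keeping_marks(word)  # accent stays on the "e"
```

At the code point level the first two spellings have the same length and the third is one longer, so code points alone cannot give all three variants the same count.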

From jcb+unicode at inf.ed.ac.uk Tue Sep 20 02:30:12 2016
From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield)
Date: Tue, 20 Sep 2016 08:30:12 +0100 (BST)
Subject: graphemes (was: "textels")
References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl> <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de> <86d1k05r5m.fsf_-_@mimuw.edu.pl> <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de>
Message-ID:

On 2016-09-19, Christoph Päper wrote:
> If
>
> - encyclopedia
> - encyclopædia
> - encyclopaedia
>
> are all legal spellings of the same word in a writing system, a useful linguistic definition of grapheme should ensure that all three variants have the same number of graphemes.

Such a bizarre definition, which would also entail "color/colour", "fulfill/fulfil", "sulfur/sulphur" having the same number of graphemes, would break the first three of your rules of thumb:

> - … whatever goes into a single box in a crossword puzzle.
> - … whatever gets transposed if you reverse a word or generate an anagram.
> - … whatever gets capitalized together in the beginning of a word.

and the fourth is pretty dodgy, as it usually contradicts the others

> - … whatever can never be split up by hyphenation.

-- 
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

From christoph.paeper at crissov.de Tue Sep 20 03:57:57 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Tue, 20 Sep 2016 10:57:57 +0200
Subject: graphemes (was: "textels")
In-Reply-To:
References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl> <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de> <86d1k05r5m.fsf_-_@mimuw.edu.pl> <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de>
Message-ID: <657667FF-A230-4ABC-8F97-16EC1BC77D5E@crissov.de>

Julian Bradfield :
> On 2016-09-19, Christoph Päper wrote:
>> If _encyclopedia, encyclopædia, encyclopaedia_ are all legal spellings of the same word in a writing system, a useful linguistic definition of grapheme should ensure that all three variants have the same number of graphemes.
>
> Such a bizarre definition, which would also entail "color/colour",
> "fulfill/fulfil", "sulfur/sulphur" having the same number of
> graphemes,

It’s not a bizarre definition at all, but one could also assume two or three different writing systems.

> would break the first three of your rules of thumb:

It would, at least partially.

> and the fourth is pretty dodgy, as it usually contradicts the others
>
>> - … whatever can never be split up by hyphenation.

It’s not phrased well and it does contradict the other rules of thumb sometimes indeed, but together they often work reasonably well to separate clear cases from questionable ones which are likely to be treated differently by different scholars.

From kenwhistler at att.net Tue Sep 20 09:37:30 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Tue, 20 Sep 2016 07:37:30 -0700
Subject: graphemes
In-Reply-To:
References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl> <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de> <86d1k05r5m.fsf_-_@mimuw.edu.pl> <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de>
Message-ID:

On 9/20/2016 12:30 AM, Julian Bradfield wrote:
>> are all legal spellings of the same word in a writing system, a useful linguistic definition of grapheme should ensure that all three variants have the same number of graphemes.
> Such a bizarre definition, which would also entail "color/colour",
> "fulfill/fulfil", "sulfur/sulphur" having the same number of
> graphemes, would break the first three of your rules of thumb:
>

I agree with Julian here. Consider also similar common alternations as night/nite, light/lite which are widespread *within* American English spelling conventions and don't even raise questions of locale differences. Or you/u, your/ur, which vary on another dimension. If every variation in spelling is taken to constitute a distinct writing system, simply to preserve the concept of a "grapheme", we would be led to conclude that American English has millions of writing systems, because of the combinatorics involved.

And the caveat that it is a "legal" spelling is a hinky dodge, particularly in the case of English. There isn't any recognized legal framework for English spelling. English, she is spelled how people decide to spell her -- or perhaps mostly how 2nd grade English teachers decide she is spelled.
Even where legal or academic frameworks exist to formally control the spelling rules of a language, one should be leery that such rules somehow instantiate the identity of graphemes, which are unlikely to be the principal matter of concern for those trying to establish the spelling rules in the first place.

--Ken

From doug at ewellic.org Tue Sep 20 11:09:22 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 20 Sep 2016 09:09:22 -0700
Subject: "textels"
Message-ID: <20160920090922.665a7a7059d7ee80bb4d670165c8327d.abac7df05c.wbe@email03.godaddy.com>

Janusz Bień wrote:

> For me it means that Swift's characters are equivalence classes of the
> set of extended grapheme clusters by canonical equivalence relation.

I still hope we can come to some conclusion on the correct Unicode name for this concept. I don't think non-Unicode interpretations of terms like "grapheme" are grounds for throwing out "grapheme cluster," but I can see that the equivalence class itself is lacking a name.

Note that the Swift definition doesn't say that <00E9> and <0065 0301> are identical entities, only that the language compares them as equal.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From doug at ewellic.org Tue Sep 20 11:34:25 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 20 Sep 2016 09:34:25 -0700
Subject: Dataset for all ISO639 code sorted by country/territory?
Message-ID: <20160920093425.665a7a7059d7ee80bb4d670165c8327d.219e1cf756.wbe@email03.godaddy.com>

Mats Blakstad wrote:

> Is there any dataset that contains all languages in the world sorted
> by country/territory?

As others have pointed out, be careful about how slippery this slope can get. Everyone has his or her own opinion about how many speakers of Language X in country Y need to be identified, estimated, or conjectured in order to say that "language X is spoken in country Y."

> I manage to find a dataset on the website of Ethnologue, though it
> doesn't look like open source, need to check with them exactly how I'm
> allowed to use it:
> http://www.ethnologue.com/codes/download-code-tables

The readme file included in the downloadable zip file makes SIL's terms very clear. Basically you need to credit SIL as the source of the data, not change it, and not make the data directly available for others to download. It's best not to get caught up in "open source" as if any other terms would make the data totally unusable.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From jsbien at mimuw.edu.pl Tue Sep 20 23:44:08 2016
From: jsbien at mimuw.edu.pl (Janusz S. Bień)
Date: Wed, 21 Sep 2016 06:44:08 +0200
Subject: "textels"
In-Reply-To: <20160920090922.665a7a7059d7ee80bb4d670165c8327d.abac7df05c.wbe@email03.godaddy.com> (Doug Ewell's message of "Tue, 20 Sep 2016 09:09:22 -0700")
References: <20160920090922.665a7a7059d7ee80bb4d670165c8327d.abac7df05c.wbe@email03.godaddy.com>
Message-ID: <86lgylvp47.fsf@mimuw.edu.pl>

On Tue, Sep 20 2016 at 18:09 CEST, doug at ewellic.org writes:

> Janusz Bień wrote:
>
>> For me it means that Swift's characters are equivalence classes of the
>> set of extended grapheme clusters by canonical equivalence relation.
>
> I still hope we can come to some conclusion on the correct Unicode name
> for this concept. I don't think non-Unicode interpretations of terms
> like "grapheme" are grounds for throwing out "grapheme cluster,"

I agree.

> but I can see that the equivalence class itself is lacking a name.

I'm glad.

>
> Note that the Swift definition doesn't say that <00E9> and <0065 0301>
> are identical entities, only that the language compares them as equal.

I'm fully aware of this.

Best regards

Janusz

-- 
,
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S.
Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From jsbien at mimuw.edu.pl Wed Sep 21 00:09:41 2016
From: jsbien at mimuw.edu.pl (Janusz S. Bień)
Date: Wed, 21 Sep 2016 07:09:41 +0200
Subject: graphemes
In-Reply-To: <657667FF-A230-4ABC-8F97-16EC1BC77D5E@crissov.de> ("Christoph Päper"'s message of "Tue, 20 Sep 2016 10:57:57 +0200")
References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl> <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de> <86d1k05r5m.fsf_-_@mimuw.edu.pl> <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de> <657667FF-A230-4ABC-8F97-16EC1BC77D5E@crissov.de>
Message-ID: <8660ppvnxm.fsf@mimuw.edu.pl>

On Tue, Sep 20 2016 at 10:57 CEST, christoph.paeper at crissov.de writes:

> Julian Bradfield :
>> On 2016-09-19, Christoph Päper wrote:
>>> If _encyclopedia, encyclopædia, encyclopaedia_ are all legal
>>> spellings of the same word in a writing system, a useful linguistic
>>> definition of grapheme should ensure that all three variants have
>>> the same number of graphemes.
>>
>> Such a bizarre definition, which would also entail "color/colour",
>> "fulfill/fulfil", "sulfur/sulphur" having the same number of
>> graphemes,
>
> It’s not a bizarre definition at all, but one could also assume two or three different writing systems.
>
>> would break the first three of your rules of thumb:
>
> It would, at least partially.
>
>> and the fourth is pretty dodgy, as it usually contradicts the others
>>
>>> - … whatever can never be split up by hyphenation.
>
> It’s not phrased well and it does contradict the other rules of thumb
> sometimes indeed, but together they often work reasonably well to
> separate clear cases from questionable ones which are likely to be
> treated differently by different scholars.

Let me recall the issues which started the thread:

On Sun, Sep 18 2016 at 12:26 CEST, jsbien at mimuw.edu.pl writes:

> Quote/Cytat - Christoph Päper (pią, 16
> wrz 2016, 23:51:38):
>
>> Janusz S. Bień :
>>>
>>> 1. Graphemes, if I understand correctly, are language dependent, …
>>
>> That’s true in linguistic terminology – well, at least within the
>> more popular schools of thought –, but not in technical (i.e.
>> Unicode) jargon.

And what is "grapheme" in "technical (i.e. Unicode) jargon"?

>
> From the Unicode glossary:
>
> Grapheme. (1) A minimally distinctive unit of writing in the context
> of a particular writing system.[...] (2) What a user thinks of as a
> character.
>
> As for (2), cf.
>
> User-Perceived Character. What everyone thinks of as a character in
> their script.
>
> So we have "a user" versus "everyone...in their script" - is the
> difference intentional? Probably not. Anyway the definitions are
> language/locale dependent.

Does 'Grapheme' (2) make sense with "a (single?) user"?

BTW, it is rather well known that the term "phoneme" was first proposed by the Polish linguist Jan Niecisław Ignacy Baudouin de Courtenay (13 March 1845 – 3 November 1929), cf. e.g. https://en.wikipedia.org/wiki/Jan_Baudouin_de_Courtenay. It is much less well known that he also proposed the term "grapheme". Let me quote Alexander Berg's "English Historical Linguistics vol. I" page 230 from Google Books:

Since the introduction of the term grapheme by Baudouin de Courtenay in 1901 (Ruszkiewicz 1976:24-37, 1981 [1978], 20-34), it has been defined in various ways: [...]
As can be seen from these quotations, the available definitions can be divided into two groups, corresponding to two main senses, and reflecting "conflicting linguistics views of the status of writing" (Henderson 1985:142):

1. a letter or cluster of letters referring to or corresponding with a single phoneme;

2. the minimal distinctive unit of a writing system.

For me the first meaning (not mentioned at all in English Wikipedia) is the primary, i.e. more useful, meaning, as it has some practical applications, e.g. for describing Polish hyphenation rules.

Best regards

Janusz

-- ,
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From jameson.quinn at gmail.com Thu Sep 22 00:47:40 2016
From: jameson.quinn at gmail.com (Jameson Quinn)
Date: Thu, 22 Sep 2016 01:47:40 -0400
Subject: Draft proposal for Mayan numerals
Message-ID:

Attached is my draft of a proposal for including Mayan numerals in Unicode. I intend to finish and submit this proposal before October 1. Comments are welcome.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mayan numerals.nobills.odt
Type: application/vnd.oasis.opendocument.text
Size: 90034 bytes
Desc: not available
URL:

From lang.support at gmail.com Mon Sep 26 01:23:01 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Mon, 26 Sep 2016 16:23:01 +1000
Subject: Myanmar Scripts and Languages FAQ
In-Reply-To:
References:
Message-ID:

Hi,

I just finished looking at the Myanmar Scripts and Languages FAQ. A few comments.

Most of the questions and answers are specific to the Myanmar (Burmese) language.
When discussing the ad hoc fonts, it would be useful to indicate that the ones already mentioned are Burmese specific, and that each of the major languages has its own ad hoc font(s). Mon, Shan and Sgaw Karen & Western Pwo Karen have their own specific fonts. It is also worth warning that most detectors and convertors are language specific. If your data has content in a range of Myanmar script languages, the results from such detectors and converters will be less than ideal. Andrew -------------- next part -------------- An HTML attachment was scrubbed... URL: From christoph.paeper at crissov.de Tue Sep 27 09:28:15 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Tue, 27 Sep 2016 16:28:15 +0200 Subject: graphemes In-Reply-To: <8660ppvnxm.fsf@mimuw.edu.pl> References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl> <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de> <86d1k05r5m.fsf_-_@mimuw.edu.pl> <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de> <657667FF-A230-4ABC-8F97-16EC1BC77D5E@crissov.de> <8660ppvnxm.fsf@mimuw.edu.pl> Message-ID: An HTML attachment was scrubbed... URL: From jsbien at mimuw.edu.pl Tue Sep 27 23:59:24 2016 From: jsbien at mimuw.edu.pl (Janusz S. 
=?utf-8?Q?Bie=C5=84?=) Date: Wed, 28 Sep 2016 06:59:24 +0200 Subject: graphemes In-Reply-To: ("Christoph =?utf-8?Q?P=C3=A4per=22's?= message of "Tue, 27 Sep 2016 16:28:15 +0200") References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl> <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de> <86d1k05r5m.fsf_-_@mimuw.edu.pl> <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de> <657667FF-A230-4ABC-8F97-16EC1BC77D5E@crissov.de> <8660ppvnxm.fsf@mimuw.edu.pl> Message-ID: <861t04wrf7.fsf@mimuw.edu.pl> I wrote already On Mon, Sep 19 2016 at 8:40 CEST, jsbien at mimuw.edu.pl writes: [...] > Searching the Unicode site I found only one use of 'grapheme' alone: > > http://www.unicode.org/L2/L2000/00274-N2236-grapheme-joiner.htm Anybody is aware of any other occurences? On Tue, Sep 27 2016 at 16:28 CEST, christoph.paeper at crissov.de writes: > Janusz S. Bie? : > > On Sun, Sep 18 2016 at 12:26 CEST, jsbien at mimuw.edu.pl writes: > > Quote/Cytat - Christoph P?per (pi?, > 16 > wrz 2016, 23:51:38): > > Janusz S. Bie? : > > > 1. Graphemes, if I understand correctly, are language > dependent, ? > > > That?s true in linguistic terminology ? ? ?, but not in > technical (i.e. > Unicode) jargon. > > > And what is "grapheme" in "technical (i.e. Unicode) jargon"? > > > It depends on the script (hence Unicode block), but not the writing > system or language. The line is not always drawn consistently. Please prove this claim by explicit quotations from the standard. In my opinion there is no such thing as "grapheme" in "technical (i.e. Unicode) jargon". > > From the Unicode glossary: > > Grapheme. [?] (2) What a user thinks of as a character. > > User-Perceived Character. What everyone thinks of as a > character in their script. > > > Does 'Grapheme' (2) make sense with "a (single?) user"? 
> > > No linguistic term makes sense with only a *single* user > (?Privatsprache?). That's obvious. > It?s a very vague definition, but not quite > incorrect for ?a typical user?. Exactly - "a typical user" is quite different from "a user". Do we agree that the wording of "grapheme" (2) should be corrected? > > BTW, it is rather well know that the term "phoneme" was proposed > first by a Polish linguist Jan Niecis?aw Ignacy Baudouin de > Courtenay (?). It is much less know that he proposed also the term > "grapheme". > > > Yes, he introduced both terms, but the definitions have changed quite > a bit through history and among schools. Entire books have been > published about that, e.g. (in German) Manfred Kohrt (1985): > ?Problemgeschichte des Graphembegriffs und des fru?hen Phonembegriffs? > (ISBN 3-484-31061-8) ? I wish I knew a more recent one. > The question is whether all these linguistic discussions are relevant to Unicode. > Alexander Berg's "English Historical Linguistics vol. I" page 230 > [?]: > > [?] the available definitions [of ?grapheme?] > can be divided into two groups, corresponding to two main senses, > and reflecting "conflicting linguistics views of the status of > writing" (Henderson 1985:142): > > 1. a letter or cluster of letters referring to or corresponding > with a > single phoneme; > > 2. the minimal distinctive unit of a writing system. > > For me the first meaning (?) is the primary, i.e. more useful, > meaning, as is has some practical applications e.g. for describing > Polish hyphenation rules. > > > Type 1 has also been called ?phono-graphemes? (with or without the > hyphen). Seems a good term, I was not aware of it. Do you happen to remember who introduced it? > > The conflicting views quoted from the 30 years old work by Henderson > still exist. There is no doubt about it. > Many scholars ? yourself included, it seems ? infer a > structural primacy of spoken language over written language from its > historic primacy. 
I do not, but it is completely irrelevant to the problem of the Unicode use of the "grapheme" term. Best regards Janusz -- , Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From christoph.paeper at crissov.de Wed Sep 28 03:24:34 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Wed, 28 Sep 2016 10:24:34 +0200 Subject: graphemes In-Reply-To: <861t04wrf7.fsf@mimuw.edu.pl> References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl> <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de> <86d1k05r5m.fsf_-_@mimuw.edu.pl> <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de> <657667FF-A230-4ABC-8F97-16EC1BC77D5E@crissov.de> <8660ppvnxm.fsf@mimuw.edu.pl> <861t04wrf7.fsf@mimuw.edu.pl> Message-ID: <314E5730-D6C1-4210-821D-1D7D4C2028BD@crissov.de> Janusz S. Bie? : > On Tue, Sep 27 2016 at 16:28 CEST, christoph.paeper at crissov.de writes: >>> And what is "grapheme" in "technical (i.e. Unicode) jargon"? >> >> It depends on the script (hence Unicode block), but not the writing >> system or language. The line is not always drawn consistently. > > Please prove this claim by explicit quotations from the standard. I?ll try another day. > In my opinion there is no such thing as "grapheme" in "technical > (i.e. Unicode) jargon". Even if it?s not used explicitly, it?s still there implicitly in compounds like ?grapheme joiner? or ?grapheme cluster?. > Do we agree that the wording of "grapheme" (2) should be corrected? We do. > The question is whether all these linguistic discussions are relevant to > Unicode. Probably not worth it at this stage with all the legacy baggage, e.g. 
regarding ?ideographs?, but a sound linguistic foundation would have been nice, even if it?s primarily a technical standard. Alas, since there is so much disagreement among scholars, e.g. regarding ?alphasyllabaries?, stuff would probably never have gotten done. Engineers are usually better at this than scientists (or politicians). >> Type 1 has also been called ?phono-graphemes? (?). > > Seems a good term, I was not aware of it. Do you happen to remember who > introduced it? My oldest quote is from Heller 1980, but I think it was introduced earlier (maybe by Gelb). McLaughlin 1963 proposes ?graphoneme?. The terms are not very common, probably because everyone just uses their definition of ?grapheme?. JFTR, Daniels/Bright 1999 state with resignation: > *grapheme* > term intended to designate a unit of a writing system, parallel to phoneme and morpheme, > but in practice used as a synonym for letter, diacritic, character (2), or sign (2) From verdy_p at wanadoo.fr Wed Sep 28 05:41:07 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 28 Sep 2016 12:41:07 +0200 Subject: graphemes In-Reply-To: <314E5730-D6C1-4210-821D-1D7D4C2028BD@crissov.de> References: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl> <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de> <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl> <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de> <86d1k05r5m.fsf_-_@mimuw.edu.pl> <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de> <657667FF-A230-4ABC-8F97-16EC1BC77D5E@crissov.de> <8660ppvnxm.fsf@mimuw.edu.pl> <861t04wrf7.fsf@mimuw.edu.pl> <314E5730-D6C1-4210-821D-1D7D4C2028BD@crissov.de> Message-ID: 2016-09-28 10:24 GMT+02:00 Christoph P?per : > > My oldest quote is from Heller 1980, but I think it was introduced earlier > (maybe by Gelb). McLaughlin 1963 proposes ?graphoneme?. 
> The terms are not very common, probably because everyone just uses their definition of "grapheme".
>
> > *grapheme*
> > term intended to designate a unit of a writing system, parallel to phoneme and morpheme,
> > but in practice used as a synonym for letter, diacritic, character (2), or sign (2)

IMHO, the term grapheme only applies (traditionally) to the written **form**, i.e. the **graphic** item which can be clearly separated from others (even if there's some joining). So a grapheme may as well represent several logical letters (as they are spelled orally). Some ligatures are mandatory in the written form of a script, and the grapheme represents the set of graphical variations that will be read the same in a language (in fact what Unicode may also designate as "confusable characters").

So the grapheme for A does not really differentiate the Latin, Greek and Cyrillic versions, even if, when analyzing them in a linguistic context, these letters are read differently ("a" vs. "alpha", which is in fact not really a distinction of the script but of the linguistic tradition of alphabets as spelled in the spoken language), and the graphemes do not have any case pairings, which are part of the semantics of the script as used for the orthography of a given language. But in the spoken language the case distinctions are almost never relevant. The written form adds some distinctions while still carrying the initial semantics of the language.

This makes scripts (or more exactly writing systems) more complex to map within a unified universal encoding. Graphemes are then weaker definitions of what Unicode encodes as abstract characters (to map on them additional properties that are not relevant at the grapheme level but useful to parse the semantics of a complete text).
The abstract characters in Unicode do not distinguish some letter forms even if traditionally the scripts and their associated writing systems for a language make clear distinctions: a "Fraktur Latin" letter A is a distinct "grapheme" from the modern cursive letter A even if they map to the same Unicode abstract character (as a result of unification), but the grapheme for the modern cursive letter A is the same between Latin, Cyrillic and Greek scripts.

There are however significant differences when handling diacritics (e.g. the diaeresis in German works very differently as an umlaut in the Fraktur script than in the current modern script and really acts as a plain distinct letter: the grapheme differences are exposed in this case even if the Unicode-encoded letters unify them; and even logically when spelling them vocally there's a clear difference between the diaeresis as used in French or English and the umlaut used in German and several other Central-European languages).

So I think that the term "grapheme" cannot be formally defined in Unicode: it does not match anything that is encoded. What is encoded is the possibility to represent "grapheme clusters" (the set of graphical forms which are minimally distinguished but not minimally separated in a specific language) and map them to a sequence of Unicode-encoded "abstract characters" (whose individual identity does not match exactly the traditional graphemes, and which are also detached from the perceived distinctions of writing systems in a specific language).

Unicode cannot then define formally what a "grapheme" is. It can only give a definition of "grapheme clusters", but it is mostly based on its own definitions of properties (which are also not sufficient to carry all distinctions for any given language in its writing systems).

"Grapheme clusters" in Unicode are also not required to have a significant graphic form; they exist purely at the semantic level, directly from their encoding, and can be used to generate other renderings (e.g. they can be rendered vocally, or used to derive some other semantics, such as values of numbers, word breaking...) or to infer some grammatical/orthographic rules to compose or generate other texts.

In summary, there's NO "grapheme" (in isolation) in Unicode and I think it should not be defined: it would break expectations on languages, and the universal repertoire does not encode specific languages and not even any specific writing system (the scripts in Unicode are NOT writing systems, which are always dependent on the language using them, and also dependent on the epoch and geographic area of use, for their working rules/conventions).

So the "grapheme" *may* be used (contextually) as a letter, a diacritic, a sign, or even a ligature (the ligature is not just contextual when it is mandated by the writing system and adds some semantic distinctions, depending on whether it is used or not; it's not just a question of "user preferences" or "font styles"), or any combination of these, up to the complete combination of what Unicode calls a "grapheme cluster" (the only thing really encodable with one or more abstract characters).

From a.lukyanov at yspu.org Wed Sep 28 02:59:05 2016
From: a.lukyanov at yspu.org (a.lukyanov)
Date: Wed, 28 Sep 2016 10:59:05 +0300
Subject: IJ with accent
Message-ID: <57EB7849.3070908@yspu.org>

Dutch language writing uses the ligature ĳ (U+0132, U+0133). When accented, it should take an accent on each component, like this:

If one uses two separate characters (i+j), one can put an accent on each character (íj́).

However, if monolithic ligature ĳ is used, how one can accent it correctly? Unicode standard does not answer this.
Probably one should use the sequence U+0133 U+301, with the accent doubling automatically, but this is not implemented (??). -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 220px-Bijna.png Type: image/png Size: 3608 bytes Desc: not available URL: From verdy_p at wanadoo.fr Wed Sep 28 11:16:27 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 28 Sep 2016 18:16:27 +0200 Subject: IJ with accent In-Reply-To: <57EB7849.3070908@yspu.org> References: <57EB7849.3070908@yspu.org> Message-ID: There's a double acute accent which you could use on the ij ligature. But it causes search problems when the ij ligature is separable, giving then (the double acute accent is not decomposable). My opinion is to put an accent on each letter and join them with a joiner, either as , or (which works with canonical equivalences, collations, and should work in rederings to instruct their ligature and the absence of syllable break between both letters, just like should render like to produce the same unbreakable ligature. 2016-09-28 9:59 GMT+02:00 a.lukyanov : > Dutch language writing uses the ligature ? (U+0132, U+0133). When > accented, it should take an accent on each component, like this: > > > > If one uses two separate characters (i+j), one can put an accent on each > character (?j?). > > However, if monolithic ligature ? is used, how one can accent it > correctly? Unicode standard does not answer this. > > Probably one should use the sequence U+0133 U+301, with the accent > doubling automatically, but this is not implemented (??). > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 220px-Bijna.png Type: image/png Size: 3608 bytes Desc: not available URL: From markus.icu at gmail.com Wed Sep 28 13:36:19 2016 From: markus.icu at gmail.com (Markus Scherer) Date: Wed, 28 Sep 2016 11:36:19 -0700 Subject: IJ with accent In-Reply-To: References: <57EB7849.3070908@yspu.org> Message-ID: On Wed, Sep 28, 2016 at 9:16 AM, Philippe Verdy wrote: > My opinion is to put an accent on each letter and join them with a joiner > I don't see a reason for the joiner. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Sep 28 13:55:33 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 28 Sep 2016 20:55:33 +0200 Subject: IJ with accent In-Reply-To: References: <57EB7849.3070908@yspu.org> Message-ID: Technically I see one, as b?j?na shound never break between ? and j?, and they should remain ligated (or their kerning kept), even if interletter spacing is enabled (that's whay the letter is frequently rendered also as "?". When converting to CAPITALS, they form a ligature looking more like ? (with the left arm broken). Adding the accents, this looks like "y" plus a double acute, or like "U" with double acute and the broken left bar, no additional spacing should be inserted). Without the joiner, there's nothing to prohibit the normal negative kerning to be removed and spacing to be inserted. When using monospaced fonts, both characters should also occupy the same cell (just like a "y" or "U"), not two (normal rendering without the joiner) 2016-09-28 20:36 GMT+02:00 Markus Scherer : > On Wed, Sep 28, 2016 at 9:16 AM, Philippe Verdy > wrote: > >> My opinion is to put an accent on each letter and join them with a joiner >> > > I don't see a reason for the joiner. > > markus > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From doug at ewellic.org Wed Sep 28 14:30:04 2016 From: doug at ewellic.org (Doug Ewell) Date: Wed, 28 Sep 2016 12:30:04 -0700 Subject: IJ with accent Message-ID: <20160928123004.665a7a7059d7ee80bb4d670165c8327d.09c24ce131.wbe@email03.godaddy.com> > Technically I see one, as b?j?na shound never break between ? and j?, These wor- ds should not bre- ak at the places wh- ere I have broken t- hem but they don't need embedded control characters to enforce that. -- Doug Ewell | Thornton, CO, US | ewellic.org From everson at evertype.com Wed Sep 28 14:33:23 2016 From: everson at evertype.com (Michael Everson) Date: Wed, 28 Sep 2016 12:33:23 -0700 Subject: IJ with accent In-Reply-To: <57EB7849.3070908@yspu.org> References: <57EB7849.3070908@yspu.org> Message-ID: <80540082-B179-4512-A635-DA86AAFDE4B3@evertype.com> The right way to do this is to follow the ligature (capital or small) with U+0301 and then have your font draw two acute accents on the ligature. > On 28 Sep 2016, at 00:59, a.lukyanov wrote: > > Dutch language writing uses the ligature ? (U+0132, U+0133). When accented, it should take an accent on each component, like this: > > <220px-Bijna.png> > > If one uses two separate characters (i+j), one can put an accent on each character (?j?). > > However, if monolithic ligature ? is used, how one can accent it correctly? Unicode standard does not answer this. > > Probably one should use the sequence U+0133 U+301, with the accent doubling automatically, but this is not implemented (??). > > > From ruland at luckymail.com Wed Sep 28 14:54:14 2016 From: ruland at luckymail.com (Charlie Ruland) Date: Wed, 28 Sep 2016 21:54:14 +0200 Subject: IJ with accent In-Reply-To: <80540082-B179-4512-A635-DA86AAFDE4B3@evertype.com> References: <57EB7849.3070908@yspu.org> <80540082-B179-4512-A635-DA86AAFDE4B3@evertype.com> Message-ID: <4d60bd08-da74-7d54-ced2-52777616f543@luckymail.com> Brill fonts (designed by John Hudson and ? by Koninklijke Brill NV) draw ?? and ?? 
with two acute accents. > The right way to do this is to follow the ligature (capital or small) with U+0301 and then have your font draw two acute accents on the ligature. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Wed Sep 28 15:48:14 2016 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 28 Sep 2016 21:48:14 +0100 Subject: IJ with accent In-Reply-To: <20160928123004.665a7a7059d7ee80bb4d670165c8327d.09c24ce131.wbe@email03.godaddy.com> References: <20160928123004.665a7a7059d7ee80bb4d670165c8327d.09c24ce131.wbe@email03.godaddy.com> Message-ID: <20160928214814.67bf3e87@JRWUBU2> On Wed, 28 Sep 2016 12:30:04 -0700 "Doug Ewell" wrote: > > Technically I see one, as b?j?na shound never break between ? and > > j?, > > These wor- > ds should not bre- > ak at the places wh- > ere I have broken t- > hem > > but they don't need embedded control characters to enforce that. Indeed, there aren't any control characters to control hyphenation. Indeed, CGJ between default grapheme clusters is often a very good place to hyphenate. Richard. From verdy_p at wanadoo.fr Wed Sep 28 16:22:34 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 28 Sep 2016 23:22:34 +0200 Subject: IJ with accent In-Reply-To: <20160928214814.67bf3e87@JRWUBU2> References: <20160928123004.665a7a7059d7ee80bb4d670165c8327d.09c24ce131.wbe@email03.godaddy.com> <20160928214814.67bf3e87@JRWUBU2> Message-ID: 2016-09-28 22:48 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Wed, 28 Sep 2016 12:30:04 -0700 > "Doug Ewell" wrote: > > > > Technically I see one, as b?j?na shound never break between ? and > > > j?, > > > > These wor- > > ds should not bre- > > ak at the places wh- > > ere I have broken t- > > hem > > > > but they don't need embedded control characters to enforce that. > > Indeed, there aren't any control characters to control hyphenation. 
> Indeed, CGJ between default grapheme clusters is often a very good
> place to hyphenate.

Who said anything about CGJ? But zero-width joiners should prevent such undesired breaking; the legacy ZWNBSP, however, does not suggest any ligature but instead will prevent it, by only gluing two grapheme clusters side by side (with just kerning enabled), without altering these glyphs (as happens in the capital IJ ligature, whose I is shortened and placed on top of the left arm of the J when using ligaturing joiners).

In South-East Asian scripts there are such cases to create complex clusters that also carry semantic distinctions and layout restrictions. The "default grapheme clusters" may not include these complex clusters, but the latter are needed. The rules about "default grapheme clusters" are only good for simpler cases where no ligaturing is involved and you don't really care about specific languages (even fonts contain specific data for specific languages, independently of the script represented).

From alex.plantema at xs4all.nl Wed Sep 28 17:12:52 2016
From: alex.plantema at xs4all.nl (Alex Plantema)
Date: Thu, 29 Sep 2016 00:12:52 +0200
Subject: IJ with accent
References: <57EB7849.3070908@yspu.org>
Message-ID:

On Wednesday, 28 September 2016 at 09:59, a.lukyanov wrote:

> Dutch language writing uses the ligature ĳ (U+0132, U+0133). When accented, it should take an accent on each component, like this:
>
> If one uses two separate characters (i+j), one can put an accent on each character (íj́).
> However, if monolithic ligature ĳ is used, how one can accent it correctly? Unicode standard does not answer this.
> Probably one should use the sequence U+0133 U+301, with the accent doubling automatically, but this is not implemented (ĳ́).

I've never seen an ij with an accent. You can safely assume it's never needed.

Alex.
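For readers following the thread, the two encodings under discussion (the ligature U+0133 followed by U+0301, versus i and j each carrying their own U+0301) can be compared with a short Python sketch. It is only an illustration using the standard `unicodedata` module; the constant and function names are made up for this example, and it says nothing about how a font or renderer would draw either sequence.

```python
import unicodedata

# Two encodings of an accented Dutch "ij" mentioned in this thread.
LIGATURE_ACCENTED = "\u0133\u0301"    # U+0133 followed by U+0301
SEPARATE_ACCENTED = "i\u0301j\u0301"  # i and j, each followed by U+0301


def describe(text):
    """List (code point, character name) pairs for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def compat_decompose(text):
    """Apply NFKD, which replaces U+0133 by its compatibility mapping 'i' + 'j'."""
    return unicodedata.normalize("NFKD", text)


if __name__ == "__main__":
    for label, text in (("ligature", LIGATURE_ACCENTED),
                        ("separate", SEPARATE_ACCENTED)):
        print(label, describe(text))
        print("  NFKD:", describe(compat_decompose(text)))
```

Under NFKD the ligature form becomes i, j, U+0301, so the accent ends up on the j only, while the separate form keeps an accent on each letter. Normalization alone therefore does not make the two spellings compare equal; matching them would need some tailoring on top.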
From everson at evertype.com Wed Sep 28 17:20:54 2016 From: everson at evertype.com (Michael Everson) Date: Wed, 28 Sep 2016 15:20:54 -0700 Subject: IJ with accent In-Reply-To: References: <57EB7849.3070908@yspu.org> Message-ID: <441A66B1-4D94-431C-8223-9F16097B1A5F@evertype.com> On 28 Sep 2016, at 15:12, Alex Plantema wrote: > I've never seen an ij with an accent. You can safely assume it's never needed. I?ve had people request that I add support for it to Everson Mono, so I safely assume that it?s sometimes needed. ;-) Michael From richard.wordingham at ntlworld.com Wed Sep 28 17:39:54 2016 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 28 Sep 2016 23:39:54 +0100 Subject: IJ with accent In-Reply-To: References: <20160928123004.665a7a7059d7ee80bb4d670165c8327d.09c24ce131.wbe@email03.godaddy.com> <20160928214814.67bf3e87@JRWUBU2> Message-ID: <20160928233954.2ba6b3de@JRWUBU2> On Wed, 28 Sep 2016 23:22:34 +0200 Philippe Verdy wrote: > 2016-09-28 22:48 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > On Wed, 28 Sep 2016 12:30:04 -0700 > > "Doug Ewell" wrote: > > > > > > Technically I see one, as b?j?na shound never break between ? > > > > and j?, > > > > > > These wor- > > > ds should not bre- > > > ak at the places wh- > > > ere I have broken t- > > > hem > > > > > > but they don't need embedded control characters to enforce that. > > > > Indeed, there aren't any control characters to control hyphenation. > > Indeed, CGJ between default grapheme clusters is often a very good > > place to hyphenate. > > > > Who told about CGJ ? 
> > But zero-width joiners should prevent such undesired breaking ; the > legacy ZWNBSP however does not suggest any ligature but instead will > prevent it, by only gluing two grapheme clusters side by side (with > just kerning enabled), but without altering these glyphs (like in the > capital IJ ligature whose I is shortened and placed on top of the > left arm of the J when using ligaturing joiners). If you could be bothered to read the Unicode standard annexes and the character database (UCD), you would note that ZWJ (let alone ZWNJ) has no effect on line-breaking, except with emoji and ideographs. In addition to the UCD, a statement to this effect can be found in TUS 23.2 'Layout Controls'. Indeed, the only character that is described as having an effect on a hyphenator, and that is only described as a convention (TR14 Line-Breaking, Section 5.4), is U+00AD SOFT HYPHEN. So far as Unicode is concerned, there is no other plain text control over hyphenators. > In South-Est Asian scripts there are such cases to create complex > clusters that also carry semantic distinctions and layout > restrictions. The only semantic distinction available is the forcing of word boundaries. Richard. From kent.karlsson14 at telia.com Wed Sep 28 17:59:23 2016 From: kent.karlsson14 at telia.com (Kent Karlsson) Date: Thu, 29 Sep 2016 00:59:23 +0200 Subject: IJ with accent In-Reply-To: Message-ID: Den 2016-09-29 00:12, skrev "Alex Plantema" : > Op woensdag 28 september 2016 09:59 schreef a.lukyanov: > >> Dutch language writing uses the ligature ? (U+0132, U+0133). When accented, >> it should take an accent on each component, like this: >> >> If one uses two separate characters (i+j), one can put an accent on each >> character (?j?). >> However, if monolithic ligature ? is used, how one can accent it correctly? >> Unicode standard does not answer this. >> Probably one should use the sequence U+0133 U+301, with the accent doubling >> automatically, but this is not implemented (??). 
> > I've never seen an ij with an accent. You can safely assume it's never needed. See https://nl.wikipedia.org/wiki/Accenttekens_in_de_Nederlandse_spelling#Klemto onteken /K > Alex. > From kent.karlsson14 at telia.com Wed Sep 28 18:12:43 2016 From: kent.karlsson14 at telia.com (Kent Karlsson) Date: Thu, 29 Sep 2016 01:12:43 +0200 Subject: IJ with accent In-Reply-To: <20160928214814.67bf3e87@JRWUBU2> Message-ID: Den 2016-09-28 22:48, skrev "Richard Wordingham" : > On Wed, 28 Sep 2016 12:30:04 -0700 > "Doug Ewell" wrote: > >>> Technically I see one, as b?j?na shound never break between ? and >>> j?, >> >> These wor- >> ds should not bre- >> ak at the places wh- >> ere I have broken t- >> hem >> >> but they don't need embedded control characters to enforce that. > > Indeed, there aren't any control characters to control hyphenation. Well, there is SOFT HYPHEN, as you yourself noted later. There is also 0083;;Cc;0;BN;;;;;N;NO BREAK HERE;;;; "NBH is used to indicate a point where a line break shall not occur when text is formatted." But that is in the C1 area, most of which nearly no-one implements... /K > Indeed, CGJ between default grapheme clusters is often a very good > place to hyphenate. > > Richard. > From junichi.chiba.bps at gmail.com Wed Sep 28 22:13:02 2016 From: junichi.chiba.bps at gmail.com (Junichi Chiba) Date: Thu, 29 Sep 2016 03:13:02 +0000 Subject: Dates in Japanese Era Names in Unicode Standard Message-ID: Dear all, Nice to e-meet you. I'm looking at the latest Unicode Standard [1] listing the dates for Japanese Era Names in Table 22-8. What I noticed is the begin and end dates for each era. They seem to have one day difference with the dates that are recognized publicly in Japan. For example, the current Heisei actually started January 8th, 1989, after Showa ended on 7th, 1989. 
However, the Unicode Standard says in Table 22-8:

U+337B square era name heisei 1989-01-07 to present day
U+337C square era name syouwa 1926-12-24 to 1989-01-06

Looking at Wikipedia in Japanese [2] and English [3], you can see exact dates for Syouwa end and Heisei start.

Could there be certain intentions to leave some difference in this description and official dates? Is the date counted according to GMT, instead of local date/time for some reason?

REFERENCE

[1] http://www.unicode.org/versions/Unicode9.0.0/UnicodeStandard-9.0.pdf

[2] https://ja.wikipedia.org/wiki/%E5%B9%B3%E6%88%90
> 1989????64??1?7????????????????????????????????????????????1989????64??1?7????????????????????????1?8???????????

[3] https://en.wikipedia.org/wiki/Heisei_period
> Thus, 1989 corresponds to Shōwa 64 until 7 January and Heisei 1 ... since 8 January.
> On 7 January 1989, at 07:55 JST, the Grand Steward of Japan's Imperial Household Agency, Shōichi Fujimori, announced Emperor Hirohito's death,...
> The Heisei era went into effect immediately upon the day after Emperor Akihito's succession to the throne on 7 January 1989.

From christoph.paeper at crissov.de Thu Sep 29 01:00:59 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Thu, 29 Sep 2016 08:00:59 +0200
Subject: IJ with accent
In-Reply-To: <57EB7849.3070908@yspu.org>
References: <57EB7849.3070908@yspu.org>
Message-ID:

a.lukyanov :
>
> Dutch language writing uses the ligature ĳ (U+0132, U+0133). When accented, it should take an accent on each component,
>
> However, if monolithic ligature ĳ is used, how one can accent it correctly?

JFTR:

- ĳ U+0133
- ĳ́ U+0133+0301
- ĳ̋ U+0133+030B
- y U+0079
- ý U+0079+0301
- ý U+00FD
- y̋ U+0079+030B
- ÿ U+00FF
- ÿ́ U+00FF+0301
- ÿ̋
U+00FF+030B From a.lukyanov at yspu.org Thu Sep 29 02:45:07 2016 From: a.lukyanov at yspu.org (a.lukyanov) Date: Thu, 29 Sep 2016 10:45:07 +0300 Subject: IJ with accent In-Reply-To: <80540082-B179-4512-A635-DA86AAFDE4B3@evertype.com> References: <57EB7849.3070908@yspu.org> <80540082-B179-4512-A635-DA86AAFDE4B3@evertype.com> Message-ID: <57ECC683.9090302@yspu.org> 28.09.2016 22:33, Michael Everson wrote: > The right way to do this is to follow the ligature (capital or small) with U+0301 and then have your font draw two acute accents on the ligature. > That seems good, still the Unicode standard says nothing about it. And doubling a diacritic is not quite self-evident. It would be nice to have an explicit description of this issue somewhere in the "Europe-I" section. From verdy_p at wanadoo.fr Thu Sep 29 05:06:22 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 29 Sep 2016 12:06:22 +0200 Subject: Dates in Japanese Era Names in Unicode Standard In-Reply-To: References: Message-ID: Is it possible that these eras start at midday instead of noon ? This could explain the date difference, if you do not set the time in your query (your query will assume a default time at 00:00 midnight) There's a similar issue with most calendars before the modern Gregorian, and even within historic documents still using date shifting at midday (and then naming the morning with the previous day). This practice survided for long as physical 24-hour clocks were rare. Still today, many English-speaking countries use AM/PM periods and 12-hour clocks are used for almost all non-electronic displays (even if some watches also include a small circle display a 24-clock, the 12-hour display is the most common and the easiest to read (it is a partial survival of the old Roman calendar that counted time negatively relative to the date defined clearly at midday, because midday is more more easily observable with a good precision than midnight). 

The recent introduction of daylight saving (and generalization of official times in large timezones) changed the perception of the clock, as it was no longer synchronized with observation of the Sun. Negative counting in dates and time has now almost disappeared (except in popular language for counting the last minutes relative to hours, a correct form of precision rounding). Dates are better understood to cover the whole working day (or rest day), except for religious purposes (e.g. within Judaism, whose reference is the variable time of sunset in the evening, or in Islam with also a variable reference time at sunrise as observed in a reference location determined by local or national communities).

Many people still count the second half of the night after midnight as part of the previous day (and so will say "Saturday evening"/"Saturday night" even if it's already the first hours of Sunday).

If you test dates and don't want to specify hours, it is highly recommended to set the default time at midday. For the Japanese eras, it's not clear at which time they really start, except for the last two eras since WW2, but setting time at midday should give the correct result. However there's no ambiguity during the day of era switch, if the era is correctly specified (and not just the year number in era).

From raymond at almanach.co.uk Thu Sep 29 05:23:17 2016
From: raymond at almanach.co.uk (Raymond Mercier)
Date: Thu, 29 Sep 2016 11:23:17 +0100
Subject: Dates in Japanese Era Names in Unicode Standard
In-Reply-To:
References:
Message-ID:

Philippe,

>>Is it possible that these eras start at midday instead of noon ?

I assume you mean midnight

RM
www.raymondm.co.uk

From duerst at it.aoyama.ac.jp Thu Sep 29 05:45:54 2016
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Thu, 29 Sep 2016 19:45:54 +0900
Subject: Dates in Japanese Era Names in Unicode Standard
In-Reply-To:
References:
Message-ID: <6c865cc7-8227-d72a-7794-e9fe9f3bc583@it.aoyama.ac.jp>

Just a few not very closely related comments:

On 2016/09/29 19:06, Philippe Verdy wrote:
> Is it possible that these eras start at midday instead of noon ? This could
> explain the date difference, if you do not set the time in your query (your
> query will assume a default time at 00:00 midnight)

It's extremely difficult to imagine this for Japan in this day and age.

I was in Japan when the era changed from Showa to Heisei. I remember the announcement very well, but I don't remember anything about the exact time of the cutover.

> Many people still count the second half of the night after midnight as part
> of the previous day (and so will say "Saturday evening"/"Saturday night"
> even if it's already the first hours of Sunday).

In Japan, that happens e.g. in displays of restaurants and bars, which may announce their opening hours as 17:30-27:00 (i.e. open until three in the morning the next day). But that's only a convention for convenience; everybody knows that it's already the next day on the calendar.

> If you test dates and don't want to specify hours, it is highly recommended
> to set the default time at midday. For the Japanese eras, it's not clear at
> which time they really start, except for the last two eras since WW2 but
> setting time at midday should give the correct result. However there's no
> ambiguity during the day of era switch, if the era is correctly specified
> (and not just the year number in era).

Yes indeed. These days, people just refer to 1989 (and any dates in it) as Heisei 1 (平成元年). This is all the easier because otherwise, an exception would be necessary for only 7 days.

On the other hand, I saw places that said Showa 64 as late as July (that was when I climbed Mt. Fuji; a placard put up the year before said "closed until July Showa 64"). I also got some money in February or so that year and had to sign a receipt that said Showa 64 because it was printed earlier.

The Japanese Wikipedia article, at the bottom of the 改元 (https://ja.wikipedia.org/wiki/??#.E6.94.B9.E5.85.83) section, says that in contrast to the two earlier changes in era, the change started on the next day, in order to give engineers time for the change. That next day was a Sunday, which meant that in effect, they had even more time, because most systems had to work with the new era only from Monday. But I guess it must have been a busy weekend for those involved, anyway.

To know all the details, the best thing to do would be to check the official government documents, which should be available online. But I wouldn't be surprised if they were not specifying things to the second.

Regards, Martin.

--
Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan

From doug at ewellic.org Thu Sep 29 11:02:29 2016
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 29 Sep 2016 09:02:29 -0700
Subject: IJ with accent
Message-ID: <20160929090229.665a7a7059d7ee80bb4d670165c8327d.527942b7de.wbe@email03.godaddy.com>

Kent Karlsson wrote:

>> I've never seen an ij with an accent. You can safely assume it's
>> never needed.
>
> See
> https://nl.wikipedia.org/wiki/Accenttekens_in_de_Nederlandse_spelling#Klemtoonteken

I note with amusement that this Wikipedia page, presumably written and edited by Dutch speakers who we often hear insist on the precomposed letters, contains more than 30 instances of IJ or ij (the separate Basic Latin letters) and zero instances of Ĳ or ĳ.

íj́, as others have observed, is trivially simple.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From everson at evertype.com Thu Sep 29 11:17:02 2016
From: everson at evertype.com (Michael Everson)
Date: Thu, 29 Sep 2016 09:17:02 -0700
Subject: IJ with accent
In-Reply-To:
References: <57EB7849.3070908@yspu.org>
Message-ID: <88B8CCD7-6151-4DD3-8D5B-DCB63D184DB0@evertype.com>

y is not an acceptable variant of ĳ though. “Byoux” is not correct; “bijoux” or “bĳoux” is…

> JFTR:
>
> - ĳ U+0133
> - ĳ́ U+0133+0301
> - ĳ̋ U+0133+030B
> - y U+0079
> - ý U+0079+0301
> - ý U+00FD
> - y̋ U+0079+030B
> - ÿ U+00FF
> - ÿ́ U+00FF+0301
> - ÿ̋ U+00FF+030B
>
>

From alex.plantema at xs4all.nl Thu Sep 29 11:25:24 2016
From: alex.plantema at xs4all.nl (Alex Plantema)
Date: Thu, 29 Sep 2016 18:25:24 +0200
Subject: IJ with accent
References: <20160929090229.665a7a7059d7ee80bb4d670165c8327d.527942b7de.wbe@email03.godaddy.com>
Message-ID:

On Thursday 29 September 2016 18:02, Doug Ewell wrote:

> Kent Karlsson wrote:
>
>>> I've never seen an ij with an accent. You can safely assume it's
>>> never needed.
>>
>> See
>> https://nl.wikipedia.org/wiki/Accenttekens_in_de_Nederlandse_spelling#Klemtoonteken
>
> I note with amusement that this Wikipedia page, presumably written and
> edited by Dutch speakers who we often hear insist on the precomposed
> letters, contains more than 30 instances of IJ or ij (the separate
> Basic Latin letters) and zero instances of Ĳ or ĳ.
>
> íj́, as others have observed, is trivially simple.

The precomposed version isn't recommended anymore. The ij evolved from ii, because ii is indistinguishable from ? in handwriting.

Alex.
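
The code point sequences listed in this thread can be checked against Unicode normalization. What follows is only a minimal Python sketch, not something posted to the list: it assumes nothing beyond the standard unicodedata module, and the helper name describe is made up for illustration. It suggests that U+0133 followed by U+0301 stays a two-code-point sequence under NFC (there is no precomposed accented ĳ), that the two-letter spelling composes only its first half, and that NFKC folds the ligature into plain i and j.

```python
import unicodedata


def describe(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The ligature followed by a single COMBINING ACUTE ACCENT,
# as suggested in the thread.
ligature = "\u0133\u0301"
# The two-letter spelling, with an acute accent on each letter.
digraph = "i\u0301j\u0301"

nfc_ligature = unicodedata.normalize("NFC", ligature)
nfc_digraph = unicodedata.normalize("NFC", digraph)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(describe(nfc_ligature))   # ['U+0133', 'U+0301']
print(describe(nfc_digraph))    # ['U+00ED', 'U+006A', 'U+0301']
print(describe(nfkc_ligature))  # ['U+0069', 'U+006A', 'U+0301']
```

Under these assumptions, whether the single accent on the ligature is drawn as two acute accents remains a matter for the font, as suggested earlier in the thread; normalization alone does not turn one spelling into the other.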

From verdy_p at wanadoo.fr Thu Sep 29 12:16:33 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 29 Sep 2016 19:16:33 +0200
Subject: IJ with accent
In-Reply-To: <88B8CCD7-6151-4DD3-8D5B-DCB63D184DB0@evertype.com>
References: <57EB7849.3070908@yspu.org> <88B8CCD7-6151-4DD3-8D5B-DCB63D184DB0@evertype.com>
Message-ID:

Actually your example is not contrived: it cites words in French, which makes no use at all of this Dutch digraph. French, however, treats "ÿ" as a valid letter of its alphabet and distinguishes it from "y" and "ij". But with the old Dutch way of writing ij, it would become ÿ (keeping the dots), not "y", so your incorrect example "bijou(x)" would appear as "Bÿou(x)", not "Byou(x)"... if only it was Dutch and if there was no syllable break between i and j, as there is in the actual French word "bi-jou(x)".

In capitals the dots would disappear and "BIJOU(X)" would become "BĲOU(X)" (with the ligature... if only it was Dutch). The normal French "ÿ" (which occurs in rare words), however, treats the dots as a diaeresis: there is a clear syllable break before it, as "ÿ" only occurs after another vowel, so that "ÿ" becomes a plain vowel /i/ with a leading glottal stop, and not the half-consonant /j/. "L'Haÿes-les-Roses" is clearly pronounced /la??i?l???oz/ (as if it was written "L'Hahi(es)-les-Roses"); without this diaeresis it would be read incorrectly as /laj?l???oz/ (as if it was written "L'A?l-les-Roses") (the "-es" termination is mute here).

The need for a diaeresis is very rare with "y" in French, where "y" is normally /j/ after a vowel (but not before a final mute "e"), or /i/ after a consonant, and the digrams "ay" and "oy" work like "ai" /ɛ/ and "oi" /wa/ when final, or before a consonant, or before other final mute letters.

Why there's a "y" and not an "i" here is historic: it was initially pronounced /la?ji?l???oz/ and could then have been rewritten as "L'Hayies-les-Roses", but possibly read incorrectly as /l??ji?l???oz/ (using the normal pronunciation of the "ay" digram, like "ai"). The diaeresis solved the reading problem: the "y" was kept but without any following "i", to make sure it is not turned into a half-consonant /j/ and remains a plain /i/ vowel, and the diaeresis implies the glottal-stop separation of syllables.

All this is not relevant for "bijou(x)" or "BIJOU(X)", and not relevant for Dutch, which treats the digram "ij" most often as a long form of the vowel /i/ alone (and not a pair with the vowel /i/ and a consonant /ʒ/ or /dʒ/ or /j/ when there's a syllable break between them). In French, long vowels are no longer distinguished phonetically and never orthographically; other languages use diacritics such as a macron (for Japanese romanization) or an acute accent over stressed/long vowels. I suppose that the need to add an acute accent to the Dutch digraph "ij" is not just to mark the length, but also the stress (accents are placed on both letters of the digraph, but it could as well have been a single macron, a very unusual diacritic in Dutch).

From duerst at it.aoyama.ac.jp Fri Sep 30 00:43:45 2016
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Fri, 30 Sep 2016 14:43:45 +0900
Subject: Dates in Japanese Era Names in Unicode Standard
In-Reply-To:
References: <6c865cc7-8227-d72a-7794-e9fe9f3bc583@it.aoyama.ac.jp>
Message-ID: <59642171-c152-0863-8165-ac48ace1d9a1@it.aoyama.ac.jp>

Hello Junichi,

Your analysis sounds very plausible. I suggest you send an official error report using http://www.unicode.org/reporting.html.

Regards, Martin.

On 2016/09/30 13:16, ?? ?? wrote:
>> Is it possible that these eras start at midday instead of noon ?
>> This could explain the date difference, if you do not set the time in your query
>> (your query will assume a default time at 00:00 midnight)
>
> The new era starts at 00:00 midnight local time.
> Together with the time zone difference, I assume that the cause was a simple chain of mistakes while drafting the Unicode document.
>
> My story:
>
> First, the author of Table 22-8 asks somebody to send a list of the dates.
> For the table to work, the accuracy of "day" should be enough, rather than time.
> The "day" value is thus recorded in YYYYMMDD format.
> It is then listed in a file format like a spreadsheet, which keeps the day value in "time" accuracy with a time zone marker.
> As there is no intention to keep it in "time" accuracy, let's suppose that a default marker such as UTC+0 is embedded automatically.
>
> The spreadsheet is then sent to the author and opened in a more "Western" time zone than the one it was recorded in.
> Upon opening the file, the dates were converted to the local time zone.
> Specifying a more "Western" time zone results in smaller date values.
> Thus the smaller values are picked up by the author for Table 22-8.
>
> Actually all of the day values in Table 22-8 are shifted one day earlier.
>
> Current values:
> U+337B square era name heisei 1989-01-07 to present day
> U+337C square era name syouwa 1926-12-24 to 1989-01-06
> U+337D square era name taisyou 1912-07-29 to 1926-12-23
> U+337E square era name meizi 1867 to 1912-07-28
>
> Suggested correction:
> U+337B square era name heisei 1989-01-08 to present day
> U+337C square era name syouwa 1926-12-25 to 1989-01-07
> U+337D square era name taisyou 1912-07-30 to 1926-12-24
> U+337E square era name meizi 1868 to 1912-07-29
>
>
> Here are some citations.
>
> I will cite from the most reliable source, the law database provided by the government (in Japanese).
> This is the actual law about when Heisei shall start:
> http://law.e-gov.go.jp/cgi-bin/idxselect.cgi?IDX_OPT=1&H_NAME=%8C%B3%8D%86%82%F0%89%FC%82%DF%82%E9%90%AD%97%DF&H_NAME_YOMI=%82%A0&H_NO_GENGO=H&H_NO_YEAR=&H_NO_TYPE=2&H_NO_NO=&H_FILE_NAME=S64SE001&H_RYAKU=1&H_CTG=1&H_YOMI_GUN=1&H_CTG_GUN=1
>
>> 昭和六十四年一月七日政令第一号
>> ...
>> 元号を平成に改める。
>> 附則
>> この政令は、公布の日の翌日から施行する。
>
> Translation:
>> Showa 64 January 7 Ordinance 1
>> ...
>> Era name shall be Heisei.
>> Appendix
>> This ordinance shall be effective from the day after promulgation.
>
> The release date was January 7.
> As Martin mentioned, Heisei started on the day after the announcement.
> Thus Showa lasted until the very end of January 7, at midnight, and Heisei started at the very beginning of January 8.
>
>> On the other hand, I saw places that said Showa 64 as late as July (that
>> was when I climbed Mt. Fuji; a placard put up the year before said
>> "closed until July Showa 64").
>
> I remember the same thing when I was a child.
> For about half a year, many things such as application forms and street signs still displayed dates in Showa. I saw passports and licenses showing expiration dates as Showa 70 or 80. Coins are minted and stocked before release, so there are Showa 64 coins in circulation.
>
> People often carry a conversion table like:
> 1986 : Showa 61
> 1987 : Showa 62
> 1988 : Showa 63
> 1989 : Showa 64 : Heisei 1
> 1990 : Showa 65 : Heisei 2
> 1991 : Showa 66 : Heisei 3
>
> I also cite the start of Showa. This is a citation from Wikisource, another reliable source for public documents.
> https://ja.wikisource.org/wiki/%E6%98%AD%E5%92%8C%E3%83%88%E6%94%B9%E5%85%83
>> ??????????????????????????????????????????????????????????
>> ????
>> ????????????
> Translation:
>> In the name of Emperor who is given inherited sovereignty to administer state affairs, We let Taisho 15 December 25 and forth be begin of Showa.
>> Signed by Emperor
>> Taisho 15 December 25
> As Martin mentioned, eras before Heisei were renewed in such a way that the announcement overwrites the old day.
>
>
> Here is the start of Taisho:
> https://ja.wikisource.org/wiki/%E6%98%8E%E6%B2%BB%E5%9B%9B%E5%8D%81%E4%BA%94%E5%B9%B4%E4%B8%83%E6%9C%88%E4%B8%89%E5%8D%81%E6%97%A5%E4%BB%A5%E5%BE%8C%E3%83%B2%E6%94%B9%E3%83%A1%E3%83%86%E5%A4%A7%E6%AD%A3%E5%85%83%E5%B9%B4%E3%83%88%E7%88%B2%E3%82%B9
>> ????????????????????????????
>> ??????????????????????????????????????
>> ????
>> ???????????
>
> Translation:
>> In the name of Emperor under inherited spirit of sovereignty to administer state affairs with virtue, We let, regarding ordinance enacted by the previous Emperor, Meiji 45 July 30 and forth be begin of Taisho.
>> Signed by Emperor
>> Meiji 45 July 30
>
> With this law, Meiji 45 July 30 is overwritten by Taisho 1 July 30.
>
>
> Lastly, here is the start of Meiji.
> https://ja.wikisource.org/wiki/%E4%BB%8A%E5%BE%8C%E5%B9%B4%E8%99%9F%E3%83%8F%E5%BE%A1%E4%B8%80%E4%BB%A3%E4%B8%80%E8%99%9F%E3%83%8B%E5%AE%9A%E3%83%A1%E6%85%B6%E6%87%89%E5%9B%9B%E5%B9%B4%E3%83%B2%E6%94%B9%E3%83%86%E6%98%8E%E6%B2%BB%E5%85%83%E5%B9%B4%E3%83%88%E7%88%B2%E3%82%B9%E5%8F%8A%E8%A9%94%E6%9B%B8
>> ??
>> ...?????????????????????????????
>> ????????
>
> Translation:
>> Imperial Edict
>> ... Keio 4 be renamed as Meiji 1 and since now the tradition of frequent renaming of Era be limited to one Era per Emperor.
>
> Since Meiji, the Era is less frequently renewed. It is more engineer friendly!
>
> In Table 22-8, the Meiji start day is omitted.
> The omission itself is reasonable. It can avoid controversy in writing the day according to the Lunar calendar used until Meiji 5 December 2 midnight. (The next day is Meiji 6 January 1.)
>
> The problem here is the year shown as 1867.
> The ordinance was released on Meiji 1 September 8 Lunar, which was 1868 October 23 Gregorian.
> Meiji 1 January 1 Lunar (and Keio 4 January 1 Lunar) is 1868 January 25 Gregorian.
> My best guess is that the author of Table 22-8 picked up the year value from a spreadsheet showing "1867-12-31" in local time, originally intended to show merely "1868-01".
>

--
Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan

From glorieul at coanda-deviation.info Fri Sep 30 04:57:15 2016
From: glorieul at coanda-deviation.info (Gael Lorieul)
Date: Fri, 30 Sep 2016 11:57:15 +0200
Subject: Why incomplete subscript/superscript alphabet ?
Message-ID: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>

Hello all,

I wonder why only a subset of the alphabet is available as subscript and/or superscript ?

This is well illustrated on the table in the following Wikipedia page:

https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts#Latin_and_Greek_tables

Is there a reason for this ?

I would love to have these characters available because I often use Unicode to write equations as comments in source code. For instance:

    class Term_diff_rotDivStressTensor_splitted
    /**
     * Computes:
     *
     * ? ??? ?1 ?
     * ?.?? + ??????u + ????.(?u + ?u?)????
     * ? ??? ?? ?
     */
    {
        [...] (class definition)
    }

or a more problematic example:

    /*
     * ?t???
     * q(t?) ? q(t?) +? rhs(q,t) dt + (t??? - t?????)
     * ?t?????
     */

Here "end" and "start" would have been better as subscripts, but I could not do so because letter "d" is not available as a subscript…

As you can see, having only some letters available as subscript (& superscript) is sometimes a pain…

Gaël Lorieul

PhD student in Computational Fluid Dynamics
at Université catholique de Louvain

From jkorpela at cs.tut.fi Fri Sep 30 10:07:29 2016
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Fri, 30 Sep 2016 18:07:29 +0300
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
Message-ID: <563c28fc-7772-59f6-01ae-ab99bcf64a39@cs.tut.fi>

30.9.2016, 12:57, Gael Lorieul wrote:

> I wonder why only a subset of the alphabet is available as subscript
> and/or superscript ?

This is explained in section 22.4 of the standard:
http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf#page=25

To put it briefly, in my interpretation, subscript and superscript characters have been encoded in Unicode only if they have specialized, defined meaning in some notations (e.g. superscript letters in phonetic notations) or if they exist in some legacy character encoding. Apart from specialized cases, the recommended approach is to use higher protocols (such as formatting or markup).

So instead of trying to find superscript letters for “end”, you should consider using rich text or a markup language so that the word written with normal letters “end” is formatted or marked up as a superscript.

Yucca

From jknappen at web.de Fri Sep 30 10:08:52 2016
From: jknappen at web.de ("Jörg Knappen")
Date: Fri, 30 Sep 2016 17:08:52 +0200
Subject: Aw: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
Message-ID:

An HTML attachment was scrubbed...
URL:

From verdy_p at wanadoo.fr Fri Sep 30 10:19:34 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 30 Sep 2016 17:19:34 +0200
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
Message-ID:

Your problem here is that "start" and "end" are not symbols/variables but actual English words. Why would this usage be restricted only to English ?
The same formula would really need to be translated into various languages and scripts, which would then require mapping all letters in Latin, Greek, Cyrillic, but also Arabic, Japanese, Chinese, Hindi...

This usage in plain text, as comments in source code, generally does not need a very friendly layout. Such comments can remain more symbolic, and you should not even need to split these formulas over multiple lines using broken characters (such as parentheses and square braces, whose presence in Unicode is justified only for mapping legacy characters used to render actual text on old monospace-only terminals).

Here your source code is intended for programmers and should better use a technical notation. If you want to include a conventional formula, include a URL going to an image or to an anchor in some document (HTML, PDF, Doc(x) file, or a reference to a page in a book).

So I suggest you use some notational convention such as TeX here if you want to be exact (this notation may be different from the actual implementation in the documented code).

The superscripts/subscripts in Unicode have been encoded mostly because they are needed for the orthography of some languages as distinct letters, but most often as modifiers; they are not intended to be used to compose separate words like "start" or "end" here.

Note also that many tools generating documentation from source code allow you to insert HTML comments, so you could as well use <sub> and <sup>, and then we don't need these additions (this would be an open door to re-encoding almost all letters in all scripts as subscripts/superscripts, with many new problems for their diacritics).

Just consider how you would translate your formula into French: "start" would become "début" (note the combining acute accent...). Here again, with a TeX notation or an HTML notation you solve the problem using <sub>début</sub> in the formula, or using a <math>...</math> HTML element to embed a complete MathML (TeX-like) formula.
Your source code documentation is not necessarily in English. English is used frequently in corporate code or in many open-source projects, but not always. There's even open-source code that is managed by teams speaking another language, for projects targeting mostly another language or an organization that wants or requires documentation in another language (notably for the public APIs; internal/private APIs are often excluded from doc generation tools, so programmers are free to use any language that is convenient to them, but they won't spend a lot of time tuning these comments so that they are perfectly readable, with all exact linguistic and scriptural features, and good looking for many readers).

Discussing these projects in English would exclude valuable contributions for the target users of the application, possibly using incorrect terms or very fuzzy translations to English when there are other requirements (notably with terms with legal meaning).

OK, the terms "end" and "start" are understood by all programmers, but not necessarily by all users of a public API (who may use it through other code generation helpers, templates, HTML/application input forms and so on).

2016-09-30 11:57 GMT+02:00 Gael Lorieul:

> Hello all,
>
> I wonder why only a subset of the alphabet is available as subscript
> and/or superscript ?
>
> This is well illustrated on the table in the following Wikipedia page:
>
> https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts#Latin_and_Greek_tables
>
> Is there a reason for this ?
>
> I would love to have these characters available because I often use
> Unicode to write equations as comments of a source code. For instance:
>
> class Term_diff_rotDivStressTensor_splitted
> /**
> * Computes:
> *
> * ? ??? ?1 ?
> * ?.?? + ??????u + ????.(?u + ?u?)????
> * ? ??? ?? ?
> */
> {
> [...] (class definition)
> }
>
>
> or a more problematic example:
>
> /*
> * ?t???
> * q(t?) ? q(t?) +? rhs(q,t) dt + (t??? - t?????)
> * ?t?????
> */
>
> Here "end" and "start" would have been better as subscripts, but I could
> not do so because letter "d" is not available as a subscript...
>
> As you can see, having only some letters available as subscript (&
> superscript) is sometimes a pain...
>
>
> Gaël Lorieul
>
> PhD student in Computational Fluid Dynamics
> at Université catholique de Louvain
>

From jkorpela at cs.tut.fi Fri Sep 30 10:54:27 2016
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Fri, 30 Sep 2016 18:54:27 +0300
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To:
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
Message-ID: <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi>

30.9.2016, 18:19, Philippe Verdy wrote:

> Note also that many tools generating documentation from source code
> allow you to insert HTML comments, so you could as well use <sub>...</sub>,

Yes, but there’s a serious typographic pitfall with this, as well as with using e.g. subscript or superscript formatting in a word processor. The problem is that the rendering is almost always simplistic: letters (or other characters) of the current font are used in reduced size and in lowered or raised position. The result is that the glyphs have reduced stroke width too, and the position change very often causes line spacing to be uneven.

The typographically correct implementation of such formatting or markup would use subscript or superscript glyphs from the font, designed by the font creator to match the style of the font. This is more difficult than the simplistic approach, and of course it is possible only when using a font that contains such glyphs.

Using HTML, for example, the way to achieve that at present would be to use markup like <span class=...>...</span> (to avoid the problems caused by the default formatting of <sub> and <sup>) and to use a CSS style sheet that sets font-family suitably and uses OpenType font feature settings to select subscript or superscript glyphs. In practice, you would need to use @font-face to embed a suitable OpenType font. So it’s doable, but not trivial like just slapping <sub> and <sup> around some text.

A practical conclusion is that if you need only e.g. 2 and 3 as superscripts (a rather common situation in general texts, where you just need m² or m³), it is much simpler to use the relevant Unicode superscript characters (instead of e.g. m<sup>2</sup>). This means using typographer-designed superscript glyphs in a simple and reliable way.

Yucca

From leoboiko at gmail.com Fri Sep 30 11:11:19 2016
From: leoboiko at gmail.com (Leonardo Boiko)
Date: Fri, 30 Sep 2016 13:11:19 -0300
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info> <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi>
Message-ID:

The Unicode codepoints are not intended as a place to store typographically variant glyphs (much like the Unicode "italic" characters aren't designed as a way of encoding italic faces). The correct thing here is that the markup and the font-rendering systems *should* automatically work together to choose the proper face -- as they already do with italics or optical sizes, and as they should do with true small-caps etc.

I agree that our current systems are typographically atrocious and an abomination before the God of good taste, and I don't blame anyone for resorting to Unicode tricks to work around that. But that's a crummy stopgap at best, and legitimizing it would be counterproductive in the long run -- not to mention ethnocentric (unless you want Unicode sub- and superscript codepoints for every single existing character ever, including the full Han set).
Rather, let's bug the authors of font rendering systems, user interface libraries, text editors, web browsers etc. for halfway decent typography.

From jkorpela at cs.tut.fi Fri Sep 30 11:31:58 2016
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Fri, 30 Sep 2016 19:31:58 +0300
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To:
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info> <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi>
Message-ID: <19524b6c-15d8-37e8-78a3-dee1d774c4a0@cs.tut.fi>

30.9.2016, 19:11, Leonardo Boiko wrote:

> The Unicode codepoints are not intended as a place to store
> typographically variant glyphs (much like the Unicode "italic"
> characters aren't designed as a way of encoding italic faces).

There is no disagreement on this. What I was pointing at was that when using rich text or markup, it is complicated or impossible to have typographically correct glyphs used (even when they exist), whereas the use of Unicode codepoints for subscript or superscript characters may do that in a much simpler way.

> The
> correct thing here is that the markup and the font-rendering systems
> *should* automatically work together to choose the proper face -- as they
> already do with italics or optical sizes, and as they should do with
> true small-caps etc.

While waiting for this, we may need interim solutions (for a few decades, for example).

By the way, font-rendering systems don’t even do italics the right way in all cases. They may silently use “fake italics” (algorithmically slanted letters). (I’m not suggesting the use of Unicode codepoints to deal with this.)

> I agree that our current systems are typographically atrocious and an
> abomination before the God of good taste, and I don't blame anyone for
> resorting to Unicode tricks to work around that.

I don’t think it’s a trick to use characters like SUPERSCRIPT TWO and SUPERSCRIPT THREE.
The practical problem is that at the point where you need other superscripts that cannot be (reliably) produced using similar codepoints, you will need to consider replacing SUPERSCRIPT TWO and SUPERSCRIPT THREE by DIGIT TWO and DIGIT THREE with suitable markup or formatting, to avoid stylistic mismatch. This isn’t as serious as it sounds. When that day comes, you can probably do a suitable global replace operation on your texts.

Yucca

From verdy_p at wanadoo.fr Fri Sep 30 11:36:22 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 30 Sep 2016 18:36:22 +0200
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info> <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi>
Message-ID:

2016-09-30 17:54 GMT+02:00 Jukka K. Korpela:

> Using HTML, for example, the way to achieve that at present would be to
> use markup like <span class=...>...</span> (to avoid the problems caused
> by the default formatting of <sub> and <sup>) and to use a CSS style sheet
> that sets font-family suitably and uses OpenType font feature settings to
> select subscript or superscript glyphs. In practice, you would need to use
> @font-face to embed a suitable OpenType font. So it’s doable, but not
> trivial like just slapping <sub> and <sup> around some text.
>

Not needed. The <sub> and <sup> elements in HTML can be styled directly as well (also with CSS), with clear implied semantics, without needing the creation of a custom class on a non-semantic element.

Here the intent in the formula was clearly to designate a subscript notation (as opposed to a superscript, whose meaning in formulas after the symbol of a variable is generally an exponent). Using superscripts after other symbols (such as a summation operator) generally designates something else (an upper bound). After some operators such as "C" it means the cardinal of a set from which all possible unordered combinations (distinct subsets) are counted.
In chemical formulas, superscripts and subscripts are used before or after an element to indicate some physical state (total charge, charge of the nucleus, total weight, 3D configuration for compound elements and crystalline forms, orientation, number of occurrences for subgroups in complex compounds...).

In formulas the superscripts and subscripts are parsed according to the context after which they occur (which will remap these superscripts or subscripts by assigning them a specific role), but alone they are just sub/super-scripts with no other semantics added (while still keeping all the semantics of their content).

For complex compounds, these subscripts/superscripts are not enough and specific layouts and symbols are needed, but you cannot use simple linear plain text to represent them without defining a specific notation convention and defining annotation terms inserted in the custom formula. Plain-text encoding will not solve the problem of representation at the character level: you'll need an upper protocol. There are infinitely many ways to define these protocols, but they are out of scope for Unicode, which will not encode them (the same way that it does not encode orthographic conventions or script conventions for specific languages: the conventions for technical notations create their own language).

From verdy_p at wanadoo.fr Fri Sep 30 11:53:43 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 30 Sep 2016 18:53:43 +0200
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <19524b6c-15d8-37e8-78a3-dee1d774c4a0@cs.tut.fi>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info> <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi> <19524b6c-15d8-37e8-78a3-dee1d774c4a0@cs.tut.fi>
Message-ID:

2016-09-30 18:31 GMT+02:00 Jukka K. Korpela:

> 30.9.2016, 19:11, Leonardo Boiko wrote:
>
>> The Unicode codepoints are not intended as a place to store
>> typographically variant glyphs (much like the Unicode "italic"
>> characters aren't designed as a way of encoding italic faces).
>>
>
> There is no disagreement on this. What I was pointing at was that when
> using rich text or markup, it is complicated or impossible to have
> typographically correct glyphs used (even when they exist), whereas the use
> of Unicode codepoints for subscript or superscript characters may do that
> in a much simpler way.

If things are simple with the few existing characters encoded in Unicode, they should also be simple with common markup or notation systems. If not, blame the authors of these systems for not implementing them correctly.

HTML, TeX or MathML have no problem representing these simple superscript/subscript notations. Use them! Including when commenting source code (you'll need these systems anyway when parsing the source code to generate readable documentation for your projects).

Such doc generating tools are now extremely common and used in a lot of common programming languages. It's high time to invest in them (most of them are integrated within code quality analysis tools and project management tools; they generate progress reports, help tune the APIs, help generate or check test code coverage, help track bugs, coordinate work teams, and communicate with final users or recipients of the software). Programmers should all know and use some of these tools (which can also work across multiple programming languages, as modern projects frequently use several, needed for the integration, deployment or interoperability of systems).

Unicode will certainly not favor a specific system, except for specific standards widely used internationally (e.g. the few additions requested for TeX that were needed for technical reasons, such as specific distinctions of symbols).
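
[Editorial illustration] The "global replace operation" Korpela mentions, moving from encoded superscript/subscript characters to ordinary characters plus markup, can be sketched in a few lines of Python. This is only a minimal sketch, not anyone's actual tool: the function names script_kind and to_markup are made up for illustration, and the only library call relied on is unicodedata.decomposition, which reports compatibility mappings tagged <super> or <sub>.

```python
import unicodedata


def script_kind(ch):
    """Return ("sup" | "sub", base_text) for an encoded superscript or
    subscript character, or None for any other character."""
    decomposition = unicodedata.decomposition(ch)
    for tag, kind in (("<super>", "sup"), ("<sub>", "sub")):
        if decomposition.startswith(tag + " "):
            code_points = decomposition.split()[1:]
            return kind, "".join(chr(int(cp, 16)) for cp in code_points)
    return None


def to_markup(text):
    """Replace runs of superscript/subscript characters by <sup>/<sub> markup."""
    out = []
    open_kind = None  # markup element currently open, if any
    for ch in text:
        info = script_kind(ch)
        kind, base = info if info else (None, ch)
        if kind != open_kind:
            if open_kind:
                out.append(f"</{open_kind}>")
            if kind:
                out.append(f"<{kind}>")
            open_kind = kind
        out.append(base)
    if open_kind:
        out.append(f"</{open_kind}>")
    return "".join(out)


print(to_markup("10 m² and 5 m³, H₂O"))
```

Note that this treats every character with a <super> or <sub> compatibility mapping alike, including ordinal indicators, phonetic modifier letters and the trade mark sign, so a real conversion would probably restrict itself to an explicit list of characters.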
From jkorpela at cs.tut.fi Fri Sep 30 11:54:39 2016
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Fri, 30 Sep 2016 19:54:39 +0300
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To:
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info> <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi>
Message-ID: <84551e20-7608-c32c-472a-ce0631827802@cs.tut.fi>

30.9.2016, 19:36, Philippe Verdy wrote:

> 2016-09-30 17:54 GMT+02:00 Jukka K. Korpela:
>
>     Using HTML, for example, the way to achieve that at present would be
>     to use markup like <span class=...>...</span> (to avoid the
>     problems caused by the default formatting of <sub> and <sup>) and to
>     use a CSS style sheet that sets font-family suitably and uses
>     OpenType font feature settings to select subscript or superscript
>     glyphs. In practice, you would need to use @font-face to embed a
>     suitable OpenType font. So it’s doable, but not trivial like just
>     slapping <sub> and <sup> around some text.
>
>
> Not needed. The <sub> and <sup> elements in HTML can be styled directly
> as well (also with CSS)

I didn’t want to go into details, but probably I now need to mention that some browsers, rather unpleasantly, interpret relative font sizes for <sub> and <sup> as relating to their default font size in that browser, against CSS specs. This is frustrating enough to ignore the “semantics” and use <span> instead. The semantics was never clear, actually; the descriptions and examples contain both essential superscripting (e.g. mathematical exponents) and stylistic superscripting (e.g. rendering “1st” with the letters as superscripts).

> For complex compounds, these subscript/superscripts are not enough and
> specific layouts and symbols are needed

Certainly. Thinking of a mathematical expression with a superscript that has a superscript should be enough to demonstrate this.
My point, however, has been that there are many situations, in general texts and even in some specialized texts, where Unicode code points for superscripts and subscripts are very useful. It is therefore natural to ask why they are such incomplete sets; but I think this question has been answered in this discussion.

Yucca

From verdy_p at wanadoo.fr Fri Sep 30 12:13:24 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 30 Sep 2016 19:13:24 +0200
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <84551e20-7608-c32c-472a-ce0631827802@cs.tut.fi>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info> <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi> <84551e20-7608-c32c-472a-ce0631827802@cs.tut.fi>
Message-ID:

2016-09-30 18:54 GMT+02:00 Jukka K. Korpela:

> 30.9.2016, 19:36, Philippe Verdy wrote:
>
>> 2016-09-30 17:54 GMT+02:00 Jukka K. Korpela:
>>
>> Using HTML, for example, the way to achieve that at present would be
>> to use markup like <span class=...>...</span> (to avoid the
>> problems caused by the default formatting of <sub> and <sup>) and to
>> use a CSS style sheet that sets font-family suitably and uses
>> OpenType font feature settings to select subscript or superscript
>> glyphs. In practice, you would need to use @font-face to embed a
>> suitable OpenType font. So it’s doable, but not trivial like just
>> slapping <sub> and <sup> around some text.
>>
>>
>> Not needed. The <sub> and <sup> elements in HTML can be styled directly
>> as well (also with CSS)
>>
>
> I didn’t want to go into details, but probably I now need to mention that
> some browsers, rather unpleasantly, interpret relative font sizes for <sub>
> and <sup> as relating to their default font size in that browser, against
> CSS specs.

Bug the authors of these browsers. But most probably those browsers are antique and no longer supported. So bug users of these old tools and ask them to switch.
I've not seen any decent modern browser that does not correctly respect the CSS styles you set for the relative size and positioning of superscripts/subscripts. It is very easy to do with a basic CSS stylesheet for your document (or website/application).

There are not a lot of modern browsers. The antique browsers no longer supported are those you find in embedded systems (in their limited firmware), but they are not the best systems to use to read technical documentation, and are generally not used by programmers; they all have a decent PC with a decent browser, or TeX tools, or decent word processors.

These bugs will be solved sooner or later IF there are people requesting their resolution and a demonstrated usage of these tools (or fonts, rendering libraries...), or if they pay developers for these needed corrections.

Unicode encodes characters for the long term, but not because some current tools may have some rendering bugs (that are already resolved in similar tools). Most of these tools already have several alternatives; some will disappear and new ones will be created offering better support.

The only thing that cannot be corrected is historic documents used as sources, for which there's a need to find an appropriate representation: if this can be done at the character level, maybe they will be encoded, provided that there's evidence that they require distinction.

But in your case there's no distinction: the "start" and "end" words in the formulas are regular English words that should better be encoded using normal Latin letters (the extra layout needed for the formula is not encodable). Using some encoded superscripts/subscripts to write them is really a hack; don't expect them to have a coherent layout or style matching what you expect in your formulas. They were mostly encoded for compatibility with old encoding standards (because old archived documents won't be reencoded/recreated for use with newer tools).
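
[Editorial illustration] How incomplete the encoded sets are can be checked directly against the Unicode Character Database. The following Python sketch is only an illustration under stated assumptions: the helper names script_forms and to_subscript are hypothetical, and the result depends on the Unicode version bundled with the Python interpreter in use. It collects every character whose compatibility decomposition is tagged <sub> or <super> and then tries to write the two words from the original question in subscript characters.

```python
import sys
import unicodedata


def script_forms(tag):
    """Map base characters to their encoded <sub> or <super> counterpart."""
    forms = {}
    for cp in range(sys.maxunicode + 1):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep only one-to-one mappings such as "<sub> 0032".
        if len(parts) == 2 and parts[0] == tag:
            forms.setdefault(chr(int(parts[1], 16)), chr(cp))
    return forms


SUBSCRIPTS = script_forms("<sub>")
SUPERSCRIPTS = script_forms("<super>")


def to_subscript(word):
    """Return the word in subscript characters, or None if any letter lacks one."""
    if all(ch in SUBSCRIPTS for ch in word):
        return "".join(SUBSCRIPTS[ch] for ch in word)
    return None


missing = [c for c in "abcdefghijklmnopqrstuvwxyz" if c not in SUBSCRIPTS]
print("Latin letters without a subscript form:", " ".join(missing))
print("start ->", to_subscript("start"))
print("end   ->", to_subscript("end"))
```

A word comes back as None as soon as a single letter has no subscript counterpart, which is the situation the original poster describes for "end".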
From KalvesmakiJ at doaks.org Fri Sep 30 13:00:08 2016
From: KalvesmakiJ at doaks.org (Kalvesmaki, Joel)
Date: Fri, 30 Sep 2016 18:00:08 +0000
Subject: Why incomplete subscript/superscript alphabet ?
Message-ID:

Newly proposed OpenType Variable Fonts may go a long way to rectifying those typographic pitfalls. The technology is some ways off, but is promising, as explained in a recent blog post by John Hudson:

https://medium.com/@tiro/https-medium-com-tiro-introducing-opentype-variable-fonts-12ba6cd2369#.ucum1whtl

jk
--
Joel Kalvesmaki
Editor in Byzantine Studies
Dumbarton Oaks
202 339 6435

On 9/30/16, 11:54 AM, "Unicode on behalf of Jukka K. Korpela" wrote:

    there’s a serious typographic pitfall with this

From everson at evertype.com Fri Sep 30 13:26:17 2016
From: everson at evertype.com (Michael Everson)
Date: Fri, 30 Sep 2016 11:26:17 -0700
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <563c28fc-7772-59f6-01ae-ab99bcf64a39@cs.tut.fi>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info> <563c28fc-7772-59f6-01ae-ab99bcf64a39@cs.tut.fi>
Message-ID: <99AC47C7-6BAC-4D76-A669-2D7743B00B69@evertype.com>

On 30 Sep 2016, at 08:07, Jukka K. Korpela wrote:

> Apart from specialized cases, the recommended approach is to use higher protocols (such as formatting or markup). So instead of trying to find superscript letters for “end”, you should consider using rich text or a markup language so that the word written with normal letters “end” is formatted or marked up as a superscript.

Even I don’t because I want stuff to be preserved in plain text.

Michael

From steve at swales.us Fri Sep 30 13:36:32 2016
From: steve at swales.us (Steve Swales)
Date: Fri, 30 Sep 2016 11:36:32 -0700
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <99AC47C7-6BAC-4D76-A669-2D7743B00B69@evertype.com>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info> <563c28fc-7772-59f6-01ae-ab99bcf64a39@cs.tut.fi> <99AC47C7-6BAC-4D76-A669-2D7743B00B69@evertype.com>
Message-ID:

I’m with Michael on this. The obvious use case is text messaging, which has no higher protocols to leverage.

-steve

> On Sep 30, 2016, at 11:26 AM, Michael Everson wrote:
>
> On 30 Sep 2016, at 08:07, Jukka K. Korpela wrote:
>
>> Apart from specialized cases, the recommended approach is to use higher protocols (such as formatting or markup). So instead of trying to find superscript letters for “end”, you should consider using rich text or a markup language so that the word written with normal letters “end” is formatted or marked up as a superscript.
>
> Even I don’t because I want stuff to be preserved in plain text.
>
> Michael

From asmusf at ix.netcom.com Fri Sep 30 15:18:56 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Fri, 30 Sep 2016 13:18:56 -0700
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <99AC47C7-6BAC-4D76-A669-2D7743B00B69@evertype.com>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info> <563c28fc-7772-59f6-01ae-ab99bcf64a39@cs.tut.fi> <99AC47C7-6BAC-4D76-A669-2D7743B00B69@evertype.com>
Message-ID: <328312cd-094c-5f9b-62fd-7803e51173f8@ix.netcom.com>

An HTML attachment was scrubbed...
URL: