From boldewyn at gmail.com Thu Dec 3 10:20:41 2015
From: boldewyn at gmail.com (Manuel Strehl)
Date: Thu, 3 Dec 2015 17:20:41 +0100
Subject: Sources for the B&W emoji samples in the PDFs

Hello,

I am wondering, if there is a list, which font / work is used to render which of the black & white emoji (and other symbols) in the code chart PDFs. Neither

http://www.unicode.org/charts/fonts.html

nor

http://unicode.org/emoji/images.html

nor the PDF itself have a sufficiently detailed answer.

The question also defied quick answering from @eevee [1] ("Dark Corners of Unicode"), Jeremy Burge [2] (Emojipedia) and myself (codepoints.net). So I thought the best would be to carry it over as close as possible to authoritative sources...

Cheers,
Manuel

[1] https://twitter.com/eevee/status/672393603948216320
[2] https://twitter.com/Emojipedia/status/672425863112204288

From doug at ewellic.org Sun Dec 6 09:47:34 2015
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 6 Dec 2015 08:47:34 -0700
Subject: Sources for the B&W emoji samples in the PDFs
Message-ID: <24EAB68332E54FF0B59E32F1AE3B8232@DougEwell>

Manuel Strehl wrote:

> I am wondering, if there is a list, which font / work is used to
> render which of the black & white emoji (and other symbols) in the
> code chart PDFs. Neither
>
> http://www.unicode.org/charts/fonts.html
>
> nor
>
> http://unicode.org/emoji/images.html
>
> nor the PDF itself have a sufficiently detailed answer.

In the absence of any other reply after three days...

Those references are likely the best you're going to get. Many original images came from Apple and Google, and some from Japanese telcos, but it's important to reiterate -- possibly to the Twitter group as well -- that there are no "standard" or "official" (by which I mean normative) glyphs for emoji or any other characters. Any rendering that maintains the basic identity of the character is fine.
--
Doug Ewell | http://ewellic.org | Thornton, CO

From simon at simon-cozens.org Sat Dec 5 18:08:59 2015
From: simon at simon-cozens.org (Simon Cozens)
Date: Sun, 6 Dec 2015 09:08:59 +0900
Subject: Line breaking status of emoji modifiers
Message-ID: <56637C9B.8080401@simon-cozens.org>

My renderer just got hit with an interesting, if possibly obscure, bug.

UTR#51 says "A supported emoji modifier sequence should be treated as a single grapheme cluster for editing purposes (cursor movement, deletion, etc.); word break, line break, etc." However, the modifier codepoints have line break category AL.

So you have an emoji (line break ID) and its modifier (line break AL), and ICU (quite correctly) inserts a line break opportunity between the two. This split the cluster, and then everything went downhill after that.

If you don't expect a line break here, wouldn't they be better as CM for line breaking purposes rather than AL?

From mark at macchiato.com Sun Dec 6 11:25:19 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Sun, 6 Dec 2015 18:25:19 +0100
Subject: Line breaking status of emoji modifiers
In-Reply-To: <56637C9B.8080401@simon-cozens.org>
References: <56637C9B.8080401@simon-cozens.org>

Yes. This was discussed at the last UTC, and for line break (and other segmentation, e.g. #29), there is an action to propose appropriate rules for 9.0.

There are three types of emoji sequences that need to be handled:

- flag sequences
- modifier sequences
- zwj sequences

In the meantime, people are customizing their implementations to deal with the emoji sequences. For now, it may be simpler for some to just use the complete list of current sequences as exceptions, and disallow breaking within them.

Mark

From boldewyn at gmail.com Sun Dec 6 13:35:54 2015
From: boldewyn at gmail.com (Manuel Strehl)
Date: Sun, 6 Dec 2015 20:35:54 +0100
Subject: Sources for the B&W emoji samples in the PDFs
In-Reply-To: <566472DC.8090805@unicode.org>
References: <24EAB68332E54FF0B59E32F1AE3B8232@DougEwell> <566472DC.8090805@unicode.org>
Message-ID: <56648E1A.3020304@gmail.com>

Thank you, Doug and Rick!

> If you can find the most recent version of the Symbola font updated
> for Unicode 8.0, it contains a huge number of symbols and b/w
> emoticons, etc.

Yes, that's kind of the go-to font for pan-emoji support. It's a pity that George won't continue development (although I am happy that he provides the current version at his website again). As I understood, his ambitions are towards technically more advanced typefaces, and I find his new Textfonts project quite interesting. (Also I love his Unidings font, a very nice and well-conceived alternative to the Last Resort font of Michael Everson. Unfortunately also discontinued beyond Unicode 8.0.)

But I digress. So, basically we found that Symbola and the pictures in the Standard are more often than not different, which was the incentive to find out where the ones in the standard might stem from.

>> Those references are likely the best you're going to get.
>> Many original images came from Apple and Google, and some from Japanese
>> telcos, but it's important to reiterate -- possibly to the Twitter
>> group as well -- that there are no "standard" or "official" (by
>> which I mean normative) glyphs for emoji or any other characters. Any
>> rendering that maintains the basic identity of the character is fine.

We are all three quite sure of the concept of emoji :-) But while many other blocks have a known set of fonts (like Phoreus for Cherokee, to quote a recent example), I figured that there'd be a list of known resources from which the samples in the standard are taken.

(In this regard: I monitor Twitter quite closely for the search term "Unicode", and time and again people complain there that emoji render differently on different devices, which seems to be indeed rather surprising for non-technical people.)

So, thanks again for the answers!

Cheers,
Manuel

From plug.gulp at gmail.com Tue Dec 8 21:24:39 2015
From: plug.gulp at gmail.com (Plug Gulp)
Date: Wed, 9 Dec 2015 03:24:39 +0000
Subject: Devanagari and Subscript and Superscript

Hi,

I am trying to understand if there is a way to use Devanagari characters (and grapheme clusters) as subscript and/or superscript in Unicode text. It would help if someone could please direct me to any document that explains how to achieve that. Is there a Unicode marker that will treat the next grapheme cluster in the text as super/subscript? For example, if one wants to represent "? raised to ???", how does one achieve that; is there a marker to represent it as follows:

? + SUP + ? + ? + ?

where SUP acts as a marker for superscripting the next grapheme cluster. Similarly for subscripting.

Sorry if this is not the right place to ask this question; in that case please could you direct me to the right forum?
Thanks and kind regards

~Plug

From duerst at it.aoyama.ac.jp Tue Dec 8 23:18:45 2015
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Wed, 9 Dec 2015 14:18:45 +0900
Subject: Devanagari and Subscript and Superscript
Message-ID: <5667B9B5.3010208@it.aoyama.ac.jp>

Hello Plug,

I suggest using HTML: ?? ??

Regards, Martin.

From richard.wordingham at ntlworld.com Wed Dec 9 01:41:17 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 9 Dec 2015 07:41:17 +0000
Subject: Devanagari and Subscript and Superscript
Message-ID: <20151209074117.4d5e7c54@JRWUBU2>

On Wed, 9 Dec 2015 03:24:39 +0000 Plug Gulp wrote:

> Hi,
>
> I am trying to understand if there is a way to use Devanagari
> characters (and grapheme clusters) as subscript and/or superscript in
> unicode text.

The view is that such would not be 'plain text', and therefore need not be catered for in Unicode. On the other hand, the desire for spacing raised and lowered characters is sufficient that markup to produce them is widely available, as Martin Dürst pointed out.
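
The markup route can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the Devanagari letters are arbitrary placeholders. It relies on the standard HTML sup and sub elements and on html.escape from the Python standard library.

```python
from html import escape


def superscript(base: str, raised: str) -> str:
    """Return an HTML fragment with `raised` set as a superscript after `base`."""
    return f"{escape(base)}<sup>{escape(raised)}</sup>"


def subscript(base: str, lowered: str) -> str:
    """Return an HTML fragment with `lowered` set as a subscript after `base`."""
    return f"{escape(base)}<sub>{escape(lowered)}</sub>"


# Placeholder text: DEVANAGARI LETTER KA, raised cluster KA + VIRAMA + SSA.
base = "\u0915"
cluster = "\u0915\u094d\u0937"

print(superscript(base, cluster))
print(subscript(base, cluster))
```

The whole grapheme cluster goes inside the element, so the rendering engine shapes the conjunct as usual and only then raises or lowers it.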
Non-spacing stacked characters are not common enough for general support to be available. In many Indic scripts, stacking is the normal arrangement, and is supplied via a script-specific special character that is overloaded with a vowel cancellation symbol. However, font-specific deviations from vertical stacking are arranged, and vowel marks are treated independently. There is no provision for vertical stacks to have horizontal offshoots. (Scripts written vertically are a different case.)

For characters stacked directly above and below not in the normal modern fashion of writing words, there can be special characters for special cases. For example, there are U+A8EE COMBINING DEVANAGARI LETTER PA in the Devanagari Extended block and U+0364 COMBINING LATIN SMALL LETTER E.

Other, clumsier scheme-specific techniques are available for other cases. See for example the writing of nuclides with an explicit atomic number in https://en.wikipedia.org/wiki/Nuclide. The notation needs a mass number at top left and an atomic number at bottom left.

A fairly general case is the annotation of kanji known as 'ruby'. Sometimes an application or mark-up scheme will support this directly.

Richard.

From brille1 at hotmail.com Wed Dec 9 06:25:34 2015
From: brille1 at hotmail.com (Hans Meiser)
Date: Wed, 9 Dec 2015 12:25:34 +0000
Subject: Proposal for German capital letter "ß"

Currently there is a vast problem trying to determine the lower case equivalent of a capitalized German word like "MASSE".

This is due to the fact that an orthographic rule exists to convert lower case letter "ß" to upper case letters "SS". So after converting a word from lower case to upper case one cannot unequivocally determine the original lower case word because the conversion is only surjective.

This issue exists because the letter "ß" originally was but a ligature of the small letters "sz" (using a legacy German font) which over time became a ligature of "ss".
After the German spelling reform in 1996, "ß" then became a letter of its own, and words containing the letter "ß" are no longer equivalent to words containing an "ss" combination instead of the "ß". So, for instance, "Maße" and "Masse" are not equal. In fact, "Maße" translates to "measurements" while "Masse" translates to "weight".

This is a particular problem in electronic data processing - like, for instance, SQL data queries. Given the above rule, "Maße" will become "MASSE", just like "Masse" becomes "MASSE" when converting a word to uppercase. But there is no way back to distinguish one from the other.

I read that the Unicode group is already striving for a solution to this problem and that they are searching for a capital letter equivalent of "ß".

My proposal is to introduce a capital letter equivalent of "ß" that resembles two capital "S" letters: "SS".

So the capital letter equivalent of "ß" would look like "SS" but would in fact be a separate code point. Converting words from lower case to upper case and back will then become bijective, auto correction will become easier and the (false) ANSI SQL stopgap of declaring "ß" and "ss" to be equal can be dropped.

Your feedback is appreciated.

Axel Dahmen - Germany

From albrecht.dreiheller at siemens.com Wed Dec 9 09:59:57 2015
From: albrecht.dreiheller at siemens.com (Dreiheller, Albrecht)
Date: Wed, 9 Dec 2015 15:59:57 +0000
Subject: AW: Proposal for German capital letter "ß"
Message-ID: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net>

Just have a look at

U+1E9E LATIN CAPITAL LETTER SHARP S

in the block Latin Extended Additional

http://www.unicode.org/charts/PDF/U1E00.pdf

Kind regards
From n.tranter at sheffield.ac.uk Wed Dec 9 09:55:59 2015
From: n.tranter at sheffield.ac.uk (Nicolas Tranter)
Date: Wed, 9 Dec 2015 15:55:59 +0000
Subject: Hentaigana proposal

I comment as a western Japanologist who teaches and researches using hentaigana. I have published with hentaigana using image files (resulting in two publisher errors) and will publish next year with hentaigana using the Koin Hentaigana font (Koin????????.tte), and anticipate typesetting problems.

I refer to the 2015 proposal L2/15-239 to include hentaigana, including the appended paper by Takada Tomokazu, Yada Tsutomu and Saito Tatsuya ('The past, present and future of Hentaigana Standardization for Information Interchange'). I also refer to Yada Tsutomu's support of the proposal ('About the inclusion of standardized codepoints for Hentaigana', L2/15-318). As the names and numbering of proposed characters are an issue I deal with below, I also refer to individual hentaigana in the proposal by their MJ-codes as used in the proposers' own websites (e.g. http://mojikiban.ipa.go.jp/xb164/).

SELECTION: The selection is good, consisting of 286 forms, although this would be realised as 299 characters. The earlier 2009 proposal referred to was based on the Mojikyo M113.ttf font, which has 213 hentaigana characters and includes a few major basic gaps. The Koin Hentaigana font has 549 characters, which excluding separate forms with voicing and 'half-voicing' diacritics consists of 330 hentaigana, but includes some very rare forms, including ones that do not occur in late period texts. The selection of 'academic' hentaigana is appropriate and lacks major gaps.
On the other hand, the Ministry of Justice hentaigana requirements are ones that were decided by the Ministry of Justice in 2004 for name registration purposes, and so, although one could argue easily with their 2004 decision (and I would), the fact that they are already official means it is pointless to argue with their inclusion in Unicode. It's been noted that a few hentaigana are almost identical to normal hiragana, especially *e* HENTAIGANA LETTER E VARIANT 4 = MJ090017 (cf. え), *shi* HENTAIGANA LETTER SI VARIANT 2 = MJ090072 (cf. HIRAGANA LETTER SI し) and *nu* HENTAIGANA LETTER NU VARIANT 2 = MJ090149 (cf. HIRAGANA LETTER NU ぬ): their differences are solely that the 'brush' is removed from the paper on a downward rather than a rightward flourish, reflecting vertical handwriting. Ordinarily I would argue against including them, but since the MoJ has recognised them as official variants they need to be included.

The decision to propose in most cases one codepoint for the hentaigana derived from a single Chinese character is sensible, as also is the decision to allow multiple codepoints in certain cases where manuscripts use side-by-side significantly distinct forms derived from the same Chinese character and with the same value. An example of the latter is HENTAIGANA LETTER KA VARIANT 3 = MJ090025 and KA VARIANT 4 = MJ090026, both pronounced *ka* and both derived from the Chinese character ?, but which are routinely both found in the same manuscript by the same hand as if they were separate graphemes from the Heian to the Meiji periods.

POLYPHONY: Several hentaigana are truly polyphonous (e.g. the ?-derived hentaigana = *ne* MJ090151 or MJ090059 *ko*, or the ?-derived hentaigana = *me* MJ090222 or *ma* MJ090205). In particular, those hentaigana derived from ? and associated with *n* (MJ090298, MJ090299) historically (also the source of HIRAGANA LETTER N ん) are also used for *mu* (MJ090214, MJ090215) and *mo* (MJ090224, MJ090223).
Diachronically, *n* in native Japanese words is usually derived from an earlier *mu*. Takada et al. include a list of 10 kanji sources that this applies to in the proposed repertoire. (Strictly, this affects 11 hentaigana, because the proposal has two forms for ?-derived characters.) The proposal's solution is to assign different identifiers, e.g. ? = HENTAIGANA LETTER NE VARIANT 1 and HENTAIGANA LETTER KO VARIANT 2, ? = HENTAIGANA LETTER ME VARIANT 3 and HENTAIGANA LETTER MA VARIANT 7, and the two derived from ? = HENTAIGANA LETTER N VARIANT 1, N VARIANT 2, MU VARIANT 1, MU VARIANT 2, MO VARIANT 1 and MO VARIANT 2. This means that there would be characters that are given more than one codepoint and identifier but are formally and etymologically identical, adding 13 unnecessary repetitions to the character set. I would favour Yada's naming system, where the polyphonous characters are given a single codepoint and identifier, e.g. ? = HENTAIGANA LETTER NE-KO, ? = HENTAIGANA LETTER ME-MA, and two ?-derived forms = HENTAIGANA LETTER N-MU-MO 1 and N-MU-MO 2.

STANDARD VARIATION: The suggestion that hentaigana be standard variation characters means that in the absence of appropriate font support they would be rendered as hiragana with the same value. (This appears to underlie the decision to propose different codepoints and names for the polyphonous hentaigana.) I do not support this. The two main uses of hentaigana are academic and by the MoJ. Academics will only use hentaigana if they specifically need them to be rendered as such rather than as hiragana, and because hentaigana as proposed for inclusion in Unicode and hiragana that are already encoded together constitute the same pre-1900 script, proofreading a text to spot incorrect renderings would be very difficult. It would be easier for academics if lack of font support rendered hentaigana simply as blanks.
Similarly, MoJ name registration normally involves recording the name both in registered spelling and in hiragana transcription, so having hentaigana show up as blanks would not cause a problem.

From frederic.grosshans at gmail.com Wed Dec 9 11:16:35 2015
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Wed, 9 Dec 2015 18:16:35 +0100
Subject: Re: AW: Proposal for German capital letter "ß"
In-Reply-To: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net>
References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net>
Message-ID: <566861F3.3010506@gmail.com>

For more information on the capital sharp s (ẞ) (converting Maße to MAẞE), you can also look at Wikipedia https://en.wikipedia.org/wiki/Capital_%E1%BA%9E (more details in the German version https://en.wikipedia.org/wiki/Capital_%E1%BA%9E ) and Andreas Stötzner's 2004 proposal to Unicode http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2888.pdf

Your proposal to have a character which looks exactly like SS is problematic on many grounds, and it could only have been introduced in Unicode as a legacy character if it had existed in character sets before the 1990s. Introducing it now would cause many more problems than it solves (e.g. allowing spoofing, making the encoding ambiguous, violating stability of the casing rules, etc.).

If you want to have reversible casing distinguishing ss↔SS and ß↔SS using ẞ, you can (in your software) bend the Unicode standard in one of the following ways:

* make a font where ẞ looks like SS (I'm not sure it is Unicode conformant)
* use your own casing rule and add a ZWNJ (zero width non joiner character) such that ss↔SS and ß↔S+ZWNJ+S. Both capital versions should look the same.
  But doing so, you violate Unicode casing, and you may have problems when ZWNJ is also used in German typography to prevent wrong ligatures (see https://en.wikipedia.org/wiki/Zero-width_non-joiner).

Fred

From gansmann at uni-bonn.de Wed Dec 9 11:52:07 2015
From: gansmann at uni-bonn.de (Gerrit Ansmann)
Date: Wed, 09 Dec 2015 18:52:07 +0100
Subject: Proposal for German capital letter "ß"

> My proposal is to introduce a capital letter equivalent of "ß" that's resembling two capital "S" letters: "SS".

Actually, the capital ß is already included in Unicode (ẞ) because it was and is used as a separate letter (not looking like SS), though only rarely. It is now realised as a proper distinguishable letter in many fonts, which is arguably the best solution. I have a keyboard with this letter. Moreover, the German authority on spelling (Rat für Rechtschreibung) stated that it will acknowledge an individual letter if it gets established in use.

Further reading:

* http://www.versaleszett.de/
* http://german.stackexchange.com/a/8960/2594
* http://j.mp/versaleszett
* http://www.typografie.info/3/page/wiki.html/_/fachbegriffe/grosses-eszett

> After the German spelling reform in 1996, "ß" then became a letter of its own, and words containing the letter "ß" are no longer equivalent to words containing an "ss" combination instead of the "ß". So, for instance, "Maße" and "Masse" are not equal.
> In fact, "Maße" translates to "measurements" while "Masse" translates to "weight".

Actually, you had the very same problem with "Masse" and "Maße" before the spelling reform.

From asmus-inc at ix.netcom.com Wed Dec 9 13:21:25 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Wed, 9 Dec 2015 11:21:25 -0800
Subject: Re: Proposal for German capital letter "ß"
Message-ID: <56687F35.5050103@ix.netcom.com>

An HTML attachment was scrubbed...

From khaledhosny at eglug.org Wed Dec 9 13:42:16 2015
From: khaledhosny at eglug.org (Khaled Hosny)
Date: Wed, 9 Dec 2015 23:42:16 +0400
Subject: AW: Proposal for German capital letter "ß"
In-Reply-To: <566861F3.3010506@gmail.com>
References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> <566861F3.3010506@gmail.com>
Message-ID: <20151209194216.GC12224@khaled-laptop>

On Wed, Dec 09, 2015 at 06:16:35PM +0100, Frédéric Grosshans wrote:

> * use your own casing rule and add a ZWNJ (zero width non joiner character)
> such that ss↔SS and ß↔S+ZWNJ + S.

Wouldn't ZWJ be a more logical choice, given that he wants to "join" both S's into a single character?

Regards,
Khaled

From brille1 at hotmail.com Wed Dec 9 13:55:24 2015
From: brille1 at hotmail.com (Hans Meiser)
Date: Wed, 9 Dec 2015 19:55:24 +0000
Subject: Re: Proposal for German capital letter "ß"
In-Reply-To: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net>
References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net>

I see.

Yet, U+1E9E doesn't quite look like two capital "S". So any program implementing a conversion conforming to Unicode will currently display/print a wrong result: "MAẞE" instead of the correctly converted result "MASSE".

Both would be correctly encoded as U+004D U+0041 U+1E9E U+0045. Yet, AFAIK, the current glyph would currently be considered an error.
Proposal: Shouldn't the glyph be amended to match the natural language?

Cheers,
Axel

From gansmann at uni-bonn.de Wed Dec 9 14:57:59 2015
From: gansmann at uni-bonn.de (Gerrit Ansmann)
Date: Wed, 09 Dec 2015 21:57:59 +0100
Subject: Proposal for German capital letter "ß"
References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net>

On Wed, 09 Dec 2015 20:55:24 +0100, Hans Meiser wrote:

> Yet, AFAIK, the current glyph would currently be considered an error.

See it like this: The point of spelling rules is to ease reading. However, the use of SS for capital ß is rather obtrusive, as it is not exactly frequent in everyday texts and if it is used, even professional designers and typesetters do it more often wrong than correct and produce something like FUßBALL. On the other hand, a well-designed capital ẞ is not even noticed by many readers. Finally, as I already said, the institution that decides about right and wrong in German orthography implicitly encourages you to use the capital ẞ if you prefer it.

> Proposal: Shouldn't the glyph be amended to match the natural language?

Nothing of this is really natural. If you go by what most people do, you would have to write FUßBALL.
Also, I hypothesise that languages which passed a certain level of alphabetisation do not exhibit natural spelling changes beyond the single-word level anymore, as spelling dogmatists get too dominant ? just look at the English orthography. After this point, you can only have centralised changes like the spelling reforms. From everson at evertype.com Wed Dec 9 15:11:18 2015 From: everson at evertype.com (Michael Everson) Date: Wed, 9 Dec 2015 21:11:18 +0000 Subject: =?utf-8?Q?Re=3A_Proposal_for_German_capital_letter_=22=C3=9F=22?= In-Reply-To: References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> Message-ID: On 9 Dec 2015, at 20:57, Gerrit Ansmann wrote: >> Proposal: Shouldn't the glyph be amended to match the natural language? > > Nothing of this is really natural. If you go by what most people do, you would have to write FU?BALL. In my new edition of the first German translation of ?Alice?s Adventures in Wonderland?, the editor and I made sure that the cakes said ?I? MICH!? and not ?I? MICH!?. :-) Michael Everson * http://www.evertype.com/ From richard.wordingham at ntlworld.com Wed Dec 9 15:45:19 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 9 Dec 2015 21:45:19 +0000 Subject: Proposal for German capital letter =?ISO-8859-1?B?It8i?= In-Reply-To: References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> Message-ID: <20151209214519.5a125dd0@JRWUBU2> On Wed, 9 Dec 2015 19:55:24 +0000 Hans Meiser wrote: > I see. > > Yet, the u+1E9E doesn't quite look like two capital "S". So any > program implementing a conversion conforming to Unicode will > currently display/print in a wrong result: "MA?E" instead of the > correctly converted result "MASSE". While the default simple uppercasing of "ma?e" will yield "MA?E", the default full uppercasing will yield "MASSE". I am not aware of a useful definition of 'conforming to Unicode' that applies to either transformation. 
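The difference between the two default mappings described here can be observed in Python, whose `str.upper()` applies the full case mappings. The following is only a sketch under stated assumptions: `approx_simple_upper` is a hypothetical helper that approximates simple (one-to-one) uppercasing by leaving alone any character whose full mapping expands to more than one character. That approximation holds for U+00DF, which has no simple uppercase mapping, but it is not exact for every character.

```python
def full_upper(text: str) -> str:
    """Default full uppercasing; Python's str.upper() uses the full mappings."""
    return text.upper()


def approx_simple_upper(text: str) -> str:
    """Approximate simple uppercasing (hypothetical helper, not a library API)."""
    out = []
    for ch in text:
        upper = ch.upper()
        # Keep the character unchanged when its full mapping is multi-character,
        # as is the case for U+00DF (LATIN SMALL LETTER SHARP S).
        out.append(upper if len(upper) == 1 else ch)
    return "".join(out)


with_sharp_s = "ma\u00dfe"  # the word spelled with U+00DF
with_double_s = "masse"

print(full_upper(with_sharp_s))           # MASSE
print(full_upper(with_double_s))          # MASSE (same result: the mapping is not injective)
print(approx_simple_upper(with_sharp_s))  # "MA", then U+00DF unchanged, then "E"
print(full_upper(with_sharp_s).lower())   # masse (the sharp s is not recovered)
print("MA\u1e9eE".lower() == with_sharp_s)  # True: U+1E9E lowercases to U+00DF
```

The last line shows why a text that already contains U+1E9E survives a round trip through lowercasing, while the default full uppercasing of U+00DF does not.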
> Both would be correctly encoded > as u+004D u+0041 u+1E9E u+0045. Yet, AFAIK, the current glyph would > currently be considered an error. > > Proposal: Shouldn't the glyph be amended to match the natural > language? No, the glyph corresponds to *a* natural form of German, as opposed to Standard German - which some would argue was not a natural language! Now, it may be argued that U+00DF has the same glyph as U+1E9E when next to a capital letter, but that is a font decision, not a Unicode decision. One could therefore define an uppercasing transformation that was a conformant Unicode process, and agreed with default uppercasing on NFD strings except for U+00DF, but differed by mapping U+00DF to U+1E9E. One might not notice any error in the printed output of this process, any more than one would notice U+006F LATIN SMALL LETTER O being transformed to U+041E CYRILLIC CAPITAL LETTER O. Richard. From verdy_p at wanadoo.fr Wed Dec 9 16:21:12 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 9 Dec 2015 23:21:12 +0100 Subject: =?UTF-8?Q?Re=3A_Proposal_for_German_capital_letter_=22=C3=9F=22?= In-Reply-To: <20151209214519.5a125dd0@JRWUBU2> References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> <20151209214519.5a125dd0@JRWUBU2> Message-ID: 2015-12-09 22:45 GMT+01:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Wed, 9 Dec 2015 19:55:24 +0000 > Hans Meiser wrote: > > > I see. > > > > Yet, the u+1E9E doesn't quite look like two capital "S". So any > > program implementing a conversion conforming to Unicode will > > currently display/print in a wrong result: "MA?E" instead of the > > correctly converted result "MASSE". > > While the default simple uppercasing of "ma?e" will yield "MA?E", the > default full uppercasing will yield "MASSE". 
> Full uppercasing rules are normally locale-sensitive, and thus there should exist a specific rule for German not yielding this result (see for example the rules for Turkish dotless i vs dotted i). I don't think these locale-sensitive rules are irrevocably stable, as more locales can be added at any time for languages needing specific pairs. The stabilized properties are for locale-neutral mappings only, in generic contexts where the language is not known (including for standard normalizations, or for the locale-neutral "root" collations and the associated DUCET). Even for the same language, these rules cannot be hardcoded in a stable way: orthographies evolve over time. Stability is only possible if you use a locale that identifies the orthographic rule precisely (with the associated rulesets checked and corrected to reach a stable consensus; if there is an evolution or there are variants, use another locale identifier) and if that specific orthography is entirely known. This is difficult for historic orthographies, or when no recognized language academy or national institution fixes the rule to use for some country or region; and even these institutions work within their own time and limit their scope to some applications, so they will not reform history. > I am not aware of a useful definition of 'conforming to Unicode' that > applies to either transformation. So if you look for an example, look at how this is done for Turkish. Basically this is just a matter of tailoring for specific locales. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From asmus-inc at ix.netcom.com Wed Dec 9 16:57:42 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Wed, 9 Dec 2015 14:57:42 -0800 Subject: =?UTF-8?Q?Re:_Proposal_for_German_capital_letter_=22=c3=9f=22?= In-Reply-To: References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> Message-ID: <5668B1E6.90602@ix.netcom.com> An HTML attachment was scrubbed... URL: From brille1 at hotmail.com Wed Dec 9 17:49:25 2015 From: brille1 at hotmail.com (Hans Meiser) Date: Wed, 9 Dec 2015 23:49:25 +0000 Subject: =?iso-8859-1?Q?Re:_Proposal_for_German_capital_letter_"=DF"?= In-Reply-To: References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> , Message-ID: Yes, they do it wrong because (1) they don't know better and (2) they let their software convert lower case text into upper case (a feature nearly every typographic software provides). Yet, if we let the majority of illiterate people decide what's right and what's wrong we could as easily decide to have 2 + 2 = 5. Here's an official text of the correct today's rules on how to write a capital "?" (it's in German): http://www.duden.de/sprachwissen/rechtschreibregeln/doppel-s-und-scharfes-s From everson at evertype.com Wed Dec 9 18:05:00 2015 From: everson at evertype.com (Michael Everson) Date: Thu, 10 Dec 2015 00:05:00 +0000 Subject: =?utf-8?Q?Re=3A_Proposal_for_German_capital_letter_=22=C3=9F=22?= In-Reply-To: <5668B1E6.90602@ix.netcom.com> References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> <5668B1E6.90602@ix.netcom.com> Message-ID: <10386FF9-39C1-4E7C-A9D0-575EB0AD7E16@evertype.com> On 9 Dec 2015, at 22:57, Asmus Freytag (t) wrote: > >> In my new edition of the first German translation of ?Alice?s Adventures in Wonderland?, the editor and I made sure that the cakes said ?I? MICH!? and not ?I? MICH!?. :-) > > And the correct spelling (modern) would have been "Iss mich" (or capitalized version as in your case). 
Well, we were updating from the 1869 Fraktur orthography to one suitable for the modern era. We did not use the Schlechtschreibung, in terms of our dissatisfaction with it, and in consideration of the timelessness of the Victorian text. Our choice of ?I? MICH!? as opposed to ?I? MICH!? or ?ISS MICH!? was based on good orthographic practice often found in Germany, regardless of whether it is official or not. Please note that ?official? and ?correct? are not the same things. It is OBVIOUS that if Ma?e and Masse are distinguished in lower-case then it is advantageous to users and their data if they upper-case to MA?E and MASSE. Michael Everson * http://www.evertype.com/ From mark at kli.org Wed Dec 9 18:30:21 2015 From: mark at kli.org (Mark E. Shoulson) Date: Wed, 09 Dec 2015 19:30:21 -0500 Subject: Proposal for German capital letter =?UTF-8?B?IsOfIg==?= In-Reply-To: References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> , Message-ID: <5668C79D.1050308@kli.org> On 12/09/2015 06:49 PM, Hans Meiser wrote: > Yes, they do it wrong because (1) they don't know better and (2) they let their software convert lower case text into upper case (a feature nearly every typographic software provides). > > Yet, if we let the majority of illiterate people decide what's right and what's wrong we could as easily decide to have 2 + 2 = 5. > > Here's an official text of the correct today's rules on how to write a capital "?" (it's in German): > > http://www.duden.de/sprachwissen/rechtschreibregeln/doppel-s-und-scharfes-s I remember when we went through all this the first time around, encoding ? in the first place. People were saying "But the Duden says no!!!" 
And someone then pointed out, "Please close your Duden and cast your gaze upon ITS FRONT COVER, where you will find written in inch-high capitals plain as day, "DER GRO?E DUDEN" (http://www.typografie.info/temp/GrosseDuden.jpg) So in terms of prescription vs description, the Duden pretty much torpedoes itself. ~mark From asmus-inc at ix.netcom.com Wed Dec 9 18:43:40 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Wed, 9 Dec 2015 16:43:40 -0800 Subject: =?UTF-8?Q?Re:_Proposal_for_German_capital_letter_=22=c3=9f=22?= In-Reply-To: References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> Message-ID: <5668CABC.4050804@ix.netcom.com> An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Wed Dec 9 22:32:00 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Thu, 10 Dec 2015 13:32:00 +0900 Subject: =?UTF-8?Q?Re:_Proposal_for_German_capital_letter_=22=c3=9f=22?= In-Reply-To: <5668C79D.1050308@kli.org> References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> <5668C79D.1050308@kli.org> Message-ID: <56690040.9090101@it.aoyama.ac.jp> On 2015/12/10 09:30, Mark E. Shoulson wrote: > I remember when we went through all this the first time around, encoding > ? in the first place. People were saying "But the Duden says no!!!" And > someone then pointed out, "Please close your Duden and cast your gaze > upon ITS FRONT COVER, where you will find written in inch-high capitals > plain as day, "DER GRO?E DUDEN" > (http://www.typografie.info/temp/GrosseDuden.jpg) So in terms of > prescription vs description, the Duden pretty much torpedoes itself. This is an interesting example of a phenomenon that turns up in many other contexts, too. A similar example is the use of accents on upper-case letters in French in France where 'officially', upper-case letters are written without accents. 
When working on internationalization, it's always good to keep eyes open and not just only follow the rules. However, the example is also somewhat misleading. The book in the picture is clearly quite old. The Duden that was cited is new. I checked with "Der Grosse Duden" on Amazon, but all the books I found had the officially correct spelling. On the other hand, I remember that when the upper-case sharp s came up for discussion in Unicode, source material showed that it was somewhat popular quite some time ago (possibly close in age with the old Duden picture). So we would have to go back and check the book in the picture to see what it says about ? to be able to claim that Duden was (at some point in time) inconsistent with itself. Regards, Martin. From marc.blanchet at viagenie.ca Wed Dec 9 23:35:35 2015 From: marc.blanchet at viagenie.ca (Marc Blanchet) Date: Thu, 10 Dec 2015 00:35:35 -0500 Subject: Proposal for German capital letter "=?utf-8?q?=C3=9F?=" In-Reply-To: <56690040.9090101@it.aoyama.ac.jp> References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> <5668C79D.1050308@kli.org> <56690040.9090101@it.aoyama.ac.jp> Message-ID: <4B81AE39-C1BC-445C-9EBB-4097CFCCCD6A@viagenie.ca> On 9 Dec 2015, at 23:32, Martin J. D?rst wrote: > On 2015/12/10 09:30, Mark E. Shoulson wrote: > >> I remember when we went through all this the first time around, >> encoding >> ? in the first place. People were saying "But the Duden says >> no!!!" And >> someone then pointed out, "Please close your Duden and cast your gaze >> upon ITS FRONT COVER, where you will find written in inch-high >> capitals >> plain as day, "DER GRO?E DUDEN" >> (http://www.typografie.info/temp/GrosseDuden.jpg) So in terms of >> prescription vs description, the Duden pretty much torpedoes itself. > > This is an interesting example of a phenomenon that turns up in many > other contexts, too. 
A similar example is the use of accents on > upper-case letters in French in France where 'officially', upper-case > letters are written without accents. while in Qu?bec, upper-case letters are written _with_ accents. l10n? Marc. > When working on internationalization, it's always good to keep eyes > open and not just only follow the rules. > > However, the example is also somewhat misleading. The book in the > picture is clearly quite old. The Duden that was cited is new. I > checked with "Der Grosse Duden" on Amazon, but all the books I found > had the officially correct spelling. On the other hand, I remember > that when the upper-case sharp s came up for discussion in Unicode, > source material showed that it was somewhat popular quite some time > ago (possibly close in age with the old Duden picture). So we would > have to go back and check the book in the picture to see what it says > about ? to be able to claim that Duden was (at some point in time) > inconsistent with itself. > > Regards, Martin. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jknappen at web.de Thu Dec 10 01:57:14 2015 From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=) Date: Thu, 10 Dec 2015 08:57:14 +0100 Subject: =?UTF-8?Q?Aw=3A_Re=3A_Proposal_for_German_capital_letter_=22=C3=9F=22?= In-Reply-To: <56690040.9090101@it.aoyama.ac.jp> References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> <5668C79D.1050308@kli.org>, <56690040.9090101@it.aoyama.ac.jp> Message-ID: An HTML attachment was scrubbed... 
URL: From as at signographie.de Thu Dec 10 02:26:40 2015 From: as at signographie.de (=?iso-8859-1?Q?Andreas_St=F6tzner?=) Date: Thu, 10 Dec 2015 09:26:40 +0100 Subject: =?iso-8859-1?Q?Proposal_for_German_capital_letter_=22=DF=22?= In-Reply-To: References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> <5668C79D.1050308@kli.org>, <56690040.9090101@it.aoyama.ac.jp> Message-ID: <3DECE5BF-524B-492A-BC90-E11876CB4DB9@signographie.de> Am 10.12.2015 um 08:57 schrieb J?rg Knappen: > The use of the capital sharp s in German is not only a historical artefact, it is recent and modern. some illustrations for that: https://www.facebook.com/versaleszett/?fref=ts Mit freundlichen Gr??en ? Andreas St?tzner _______________________________________________________________________________ Andreas St?tzner Gestaltung Signographie Fontentwicklung Haus des Buches Gerichtsweg 28, Raum 434 04103 Leipzig 0176-86823396 -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Thu Dec 10 02:41:13 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Thu, 10 Dec 2015 17:41:13 +0900 Subject: =?UTF-8?Q?Re:_Proposal_for_German_capital_letter_=22=c3=9f=22?= In-Reply-To: <4B81AE39-C1BC-445C-9EBB-4097CFCCCD6A@viagenie.ca> References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> <5668C79D.1050308@kli.org> <56690040.9090101@it.aoyama.ac.jp> <4B81AE39-C1BC-445C-9EBB-4097CFCCCD6A@viagenie.ca> Message-ID: <56693AA9.8040805@it.aoyama.ac.jp> Hello Marc, On 2015/12/10 14:35, Marc Blanchet wrote: >> This is an interesting example of a phenomenon that turns up in many >> other contexts, too. A similar example is the use of accents on >> upper-case letters in French in France where 'officially', upper-case >> letters are written without accents. > > while in Qu?bec, upper-case letters are written _with_ accents. l10n? 
They are written with accents also quite often in France, but the French just don't notice :-). Regards, Martin. From brille1 at hotmail.com Thu Dec 10 04:13:38 2015 From: brille1 at hotmail.com (Hans Meiser) Date: Thu, 10 Dec 2015 10:13:38 +0000 Subject: =?iso-8859-3?Q?Re:_Proposal_for_German_capital_letter_"=DF"?= In-Reply-To: References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> , Message-ID: Actually, MS Word offers an option to keep or drop accents when converting lower case to upper case in its spell checker options. I understand the Turkish example. They've got two different letters "i", one with and one without the dot ("ı"). But none of that points in the direction of what I'm after. I'm not suggesting to change the Unicode table. The table is fine. What I'm suggesting is to change the glyph (the rendered outcome) to something that resembles two capital letters "S". Here's a hyperlink to an image depicting what I'm suggesting: https://www.dropbox.com/s/l9zifh1imef0re9/SS.png So, even if the glyph changes, the rules and algorithms will be retained. It's quite like Richard (Wordingham) wrote yesterday: "It's a font decision, not a Unicode decision". Yet, Unicode needs to lead the way so font designers may then amend their fonts accordingly. From brille1 at hotmail.com Thu Dec 10 04:19:36 2015 From: brille1 at hotmail.com (Hans Meiser) Date: Thu, 10 Dec 2015 10:19:36 +0000 Subject: =?iso-8859-1?Q?Re:_Proposal_for_German_capital_letter_"=DF"?= In-Reply-To: References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> , , Message-ID: After all, the "ß" is just a ligature of "ss" (or, to be precise: a ligature of "sz", originating from old German fonts - see hyperlink below), so I suggest the rendered outcome of the capital "ß" to be just the same: A ligature of two capital "S".
Here's a hyperlink to an old German font (notice the lower case "s" and "z"): http://www.myfont.de/fonts/infos/5602-Koch-Fette-Deutsche-Schrift.html From frederic.grosshans at gmail.com Thu Dec 10 04:45:22 2015 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Thu, 10 Dec 2015 11:45:22 +0100 Subject: =?UTF-8?Q?Re:_Proposal_for_German_capital_letter_=22=c3=9f=22?= In-Reply-To: <56690040.9090101@it.aoyama.ac.jp> References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> <5668C79D.1050308@kli.org> <56690040.9090101@it.aoyama.ac.jp> Message-ID: <566957C2.1060302@gmail.com> Le 10/12/2015 05:32, Martin J. D?rst a ?crit : > A similar example is the use of accents on upper-case letters in > French in France where 'officially', upper-case letters are written > without accents. Actually, the official body in charge of this (Acad?mie Fran?aise) has always recommended upper-case letters with accents , but the school teachers teach the other way, and accents on capital letters was technically challenging (in printing, writing machines and keyboard), so many people think the official recommendation is to drop them, and that is anyway complicated. But I often get question from non technical people on how I type ?, ?, or ?, which shows that they are natural. (French language Wikipedia has more details on this https://fr.wikipedia.org/wiki/Usage_des_majuscules_en_fran%C3%A7ais , including the fact that the rules in Switzerland are different.) Fr?d?ric From gansmann at uni-bonn.de Thu Dec 10 05:47:20 2015 From: gansmann at uni-bonn.de (Gerrit Ansmann) Date: Thu, 10 Dec 2015 12:47:20 +0100 Subject: =?utf-8?Q?Proposal_for_German_capital_lette?= =?utf-8?Q?r_=22=C3=9F=22?= In-Reply-To: References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> Message-ID: On Thu, 10 Dec 2015 11:19:36 +0100, Hans Meiser wrote: > After all, the "?" 
is just a ligature of "ss" (or, to be precise: a ligature of "sz", originating from old German fonts - see hyperlink below), so I suggest the rendered outcome of the capital "ß" to be just the same: A ligature of two capital "S". It's not that simple. Briefly: ? The ß completed its transition from a ligature to a standalone letter at least a hundred years ago. For example, in fraktur typesetting (or more precisely, typesetting with a long s), one spelt ??zeni?ch? and ?la?ziv? ? not ??eni?ch? and ?la?iv?. ? History is not necessarily a good argument for how things should be done, otherwise we would have to be VVRITINC LIKE THIS. ? As already mentioned, from the point of view of readability, a properly designed capital ß is less obtrusive than SS. For more details, see the links in my first reply, in particular http://j.mp/versaleszett. From charupdate at orange.fr Thu Dec 10 06:05:28 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 10 Dec 2015 13:05:28 +0100 (CET) Subject: =?UTF-8?Q?Re:_Proposal_for_German_capital_letter_"=C3=9F"?= Message-ID: <1842555538.6168.1449749128410.JavaMail.www@wwinf2227> On Thu, 10 Dec 2015 11:45:22 +0100, Frédéric Grosshans wrote: >On 10/12/2015 05:32, Martin J. Dürst wrote: >> A similar example is the use of accents on upper-case letters in >> French in France where 'officially', upper-case letters are written >> without accents. We are welcome to look up the most official website of France: http://www.elysee.fr/ We learn that *actually* uppercase letters are diacriticized. But the footer shows that at one time diacritics were cut away. The change is ongoing, from "caps always undiacriticized" to "all-caps diacriticized and titlecase caps undiacriticized" and further to "always diacriticized" as recommended in one of the 'official' options.
>Actually, the official body in charge of this (Académie Française) has >always recommended upper-case letters with accents, but the school >teachers teach the other way, That is old school. Current school books teach to always diacriticize the diacriticized letters, stating that there is strictly *no* rule not to do so. But admittedly, switching from old school to new school isn't really straightforward. >and accents on capital letters was >technically challenging (in printing, writing machines and keyboard), Right, it was. Keyboard: This is why last year, the government placed an order for a complete computer keyboard layout with the French Standards body. Making such a keyboard layout easy to use, that's the challenge today. It has recently been addressed (but that's not yet official). >so many people think the official recommendation is to drop them, and that >is anyway complicated. But I often get question from non technical >people on how I type ?, ?, or ?, which shows that they are natural. Many people dislike accents on capitals, and they really avoid them. But they grow fewer and fewer. Most people like the accents and are eager to place them. (I guess I am one of them.) For everybody to see how to do it, and how important it is, here is one more fine website (in French): http://accentuez.mon.nom.free.fr/ Related to the thread's subject, there is a beta feedback item I sent at the time, but it was buried in a mass of other beta feedback. May I recall it here, to see whether some part could be useful? On this page: http://www.unicode.org/review/pri297/feedback.html, we find: There is a further point that I unfortunately did not become aware of sooner. It's about uppercasing of the German ß. Looking at the properties of U+00DF in ucdxml.nounihan.flat.xml, I found that uc="0053 0053" only. In the meantime, German usage begins to shift towards 1E9E, as I already reported and suggested updating the NamesList and Code Charts annotation for this character.
IMO there should be an application Settings checkbox: ?? ? as uppercase for ??. I don't know if it's already implemented. However, since U+1E9E is now a part of most current fonts and is on keyboards thanks to the new German standard layouts, defining uppercase as uc="1E9E" might seem appropriate to avoid losing the ß in text files. If the custom setting requires uppercasing U+00DF to double U+0053, the cf="0073 0073" value can be used to perform that. To understand the issue, it is necessary to remember that the uppercase Latin letter SZ was created and encoded on behalf of the German Standards body DIN to ensure that personal data are correctly stored and rendered. As in German, the ß is a distinctive part of orthography and is needed in names (if a person's name is Straßer or STRAẞER, writing STRASSER or STRASZER is false because these are other names, equally borne), not having an uppercase ß made much trouble and led to some confusion. Today, fortunately this time is past, and the char props may be updated. All that is needed is already in the UCD except the new uppercase as a value of the uc property for U+00DF. Therefore I suggest that Unicode takes advice from the German Standards body (DIN) whether to set this property to its new value. [/quote] Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Dec 10 06:40:02 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 10 Dec 2015 13:40:02 +0100 Subject: =?UTF-8?Q?Re=3A_Proposal_for_German_capital_letter_=22=C3=9F=22?= In-Reply-To: <56690040.9090101@it.aoyama.ac.jp> References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> <5668C79D.1050308@kli.org> <56690040.9090101@it.aoyama.ac.jp> Message-ID: 2015-12-10 5:32 GMT+01:00 Martin J. Dürst : > This is an interesting example of a phenomenon that turns up in many other > contexts, too.
A similar example is the use of accents on upper-case > letters in French in France where 'officially', upper-case letters are > written without accents. When working on internationalization, it's always > good to keep eyes open and not just only follow the rules. > Please define "officially". If you consider the official French Academy, capitals MUST carry their accents. And most official institutions strongly support accents (including the Imprimerie Nationale in its official typographic recommendations: it is the official printer of official publications for almost all national institutions, including all legal texts). Do you have a single example of capitals without accents? I know there are other recommendations by private or semi-private companies, but only for limited scopes of use: "La Poste" for addresses on envelopes (where you theoretically also must not use any punctuation, including hyphens, commas, abbreviating dots, but where you also have to use abbreviations in many cases for city names and street names). La Poste is not really an official linguistic institution; its needs there are only for printed address labels. And La Poste is no longer a monopoly in France for postal services; other private postal services have their own recommendations and don't care about the historic recommendations made by La Poste. There are other recommendations used in various databases (e.g. the FANTOIR database made by municipalities and the French cadastre for fiscal purposes), but the scope of use is not really for the French language itself, but for simple searches in that database. Here again there are no lowercase letters, and accents are frequently omitted.
This is in fact a legacy inherited after several decades of use of computers on systems that initially had no support for Unicode, and when many systems used various incompatible charsets, frequently undocumented: in those databases, basic ASCII still rules, but there are more modern formats adding other fields with more exact distinctions of case and accents. Even before computers, the French typewriters had capitals with accents. Accents started disappearing in the 1970s with modern computers, unfortunately using software made in the US and ignoring the French requirements. Accents are back today, but still not on French keyboards for PC, due to lack of support in default keyboard layouts (notably on Windows): they are present on virtual keyboards for smartphones, on keyboards for Mac, on layouts for Linux. Only Microsoft is very late in restoring accents in a supported layout for Windows (it would then convince keyboard manufacturers to restore the missing accents on the keycaps). -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Thu Dec 10 10:00:27 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Thu, 10 Dec 2015 08:00:27 -0800 Subject: =?UTF-8?Q?Re:_Aw:_Re:_Proposal_for_German_capital_letter_=22=c3=9f?= =?UTF-8?Q?=22?= In-Reply-To: References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> <5668C79D.1050308@kli.org> <56690040.9090101@it.aoyama.ac.jp> Message-ID: <5669A19B.9010500@ix.netcom.com> An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Thu Dec 10 10:37:49 2015 From: markus.icu at gmail.com (Markus Scherer) Date: Thu, 10 Dec 2015 08:37:49 -0800 Subject: Hentaigana proposal In-Reply-To: References: Message-ID: Dear Mr. Tranter, I can't tell whether you intend to start a discussion on this discussion mailing list, or intend to submit feedback on a proposal. Maybe you are looking for discussion before you formalize your feedback.
If you do intend to submit feedback, then, once you have formulated a position, please use http://www.unicode.org/reporting.html Please make it very clear in your feedback what documents you are referring to, what you think should be changed, and why. I suggest you put your important points first, background later. (I got a bit lost in your narrative about likes and dislikes; I don't think this narrative format would be successful as feedback to the time-constrained technical committee.) Best regards, markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Thu Dec 10 10:40:19 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Thu, 10 Dec 2015 08:40:19 -0800 Subject: =?UTF-8?Q?Re:_Aw:_Re:_Proposal_for_German_capital_letter_=22=c3=9f?= =?UTF-8?Q?=22?= In-Reply-To: <5669A19B.9010500@ix.netcom.com> References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> <5668C79D.1050308@kli.org> <56690040.9090101@it.aoyama.ac.jp> <5669A19B.9010500@ix.netcom.com> Message-ID: <5669AAF3.4050000@ix.netcom.com> An HTML attachment was scrubbed... URL: From leob at mailcom.com Thu Dec 10 12:56:50 2015 From: leob at mailcom.com (Leo Broukhis) Date: Thu, 10 Dec 2015 10:56:50 -0800 Subject: =?UTF-8?Q?Re=3A_Proposal_for_German_capital_letter_=22=C3=9F=22?= In-Reply-To: <4B81AE39-C1BC-445C-9EBB-4097CFCCCD6A@viagenie.ca> References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> <5668C79D.1050308@kli.org> <56690040.9090101@it.aoyama.ac.jp> <4B81AE39-C1BC-445C-9EBB-4097CFCCCD6A@viagenie.ca> Message-ID: This prompts a question: for case conversion bijectivity in fr_FR locale, should there be "invisible accents"? E.g. de?ja? -> DE(combining invisible acute accent)JA(combining invisible grave accent) -> de?ja? whereas in fr_CA locale, it is simply de?ja? -> DE?JA? -> de?ja? Leo On Wed, Dec 9, 2015 at 9:35 PM, Marc Blanchet wrote: > On 9 Dec 2015, at 23:32, Martin J. 
D?rst wrote: > > On 2015/12/10 09:30, Mark E. Shoulson wrote: > > I remember when we went through all this the first time around, encoding > ? in the first place. People were saying "But the Duden says no!!!" And > someone then pointed out, "Please close your Duden and cast your gaze > upon ITS FRONT COVER, where you will find written in inch-high capitals > plain as day, "DER GRO?E DUDEN" > (http://www.typografie.info/temp/GrosseDuden.jpg) So in terms of > prescription vs description, the Duden pretty much torpedoes itself. > > This is an interesting example of a phenomenon that turns up in many other > contexts, too. A similar example is the use of accents on upper-case letters > in French in France where 'officially', upper-case letters are written > without accents. > > while in Qu?bec, upper-case letters are written with accents. l10n? > > Marc. > > When working on internationalization, it's always good to keep eyes open and > not just only follow the rules. > > However, the example is also somewhat misleading. The book in the picture is > clearly quite old. The Duden that was cited is new. I checked with "Der > Grosse Duden" on Amazon, but all the books I found had the officially > correct spelling. On the other hand, I remember that when the upper-case > sharp s came up for discussion in Unicode, source material showed that it > was somewhat popular quite some time ago (possibly close in age with the old > Duden picture). So we would have to go back and check the book in the > picture to see what it says about ? to be able to claim that Duden was (at > some point in time) inconsistent with itself. > > Regards, Martin. 
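The round-trip question raised at the top of this message can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: `upper_drop_accents` is a hypothetical helper that models the "capitals without accents" convention by decomposing to NFD and removing combining marks, and it is not taken from any locale data.

```python
import unicodedata


def upper_keep_accents(text: str) -> str:
    """Uppercase and keep the diacritics (plain str.upper())."""
    return text.upper()


def upper_drop_accents(text: str) -> str:
    """Uppercase, then strip combining marks (hypothetical convention model)."""
    decomposed = unicodedata.normalize("NFD", text.upper())
    # unicodedata.combining() returns 0 for base characters.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


word = "d\u00e9j\u00e0"  # the French word with e-acute and a-grave

kept = upper_keep_accents(word)
dropped = upper_drop_accents(word)

print(kept.lower() == word)     # True: the accented capitals round-trip
print(dropped)                  # DEJA
print(dropped.lower() == word)  # False: the accents cannot be recovered
```

Once the accents are dropped, lowercasing has nothing left to restore, which is the loss that the "invisible accents" idea would have to compensate for.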
From lisam at us.ibm.com Thu Dec 10 15:17:41 2015 From: lisam at us.ibm.com (Lisa Moore) Date: Thu, 10 Dec 2015 13:17:41 -0800 Subject: In Memoriam--Michael Kaplan Message-ID: <201512102117.tBALHlPh016062@d01av04.pok.ibm.com> As was announced earlier on Unicode email lists, the many people associated with the Unicode Consortium were much saddened to hear of the passing of Michael Kaplan. Please find this posting on our website at: http://www.unicode.org/consortium/memoriam.html#Michael_S_Kaplan Lisa -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.muller at efele.net Fri Dec 11 01:20:05 2015 From: eric.muller at efele.net (Eric Muller) Date: Thu, 10 Dec 2015 23:20:05 -0800 Subject: Re: Proposal for German capital letter "ß" In-Reply-To: <566957C2.1060302@gmail.com> References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> <5668C79D.1050308@kli.org> <56690040.9090101@it.aoyama.ac.jp> <566957C2.1060302@gmail.com> Message-ID: <566A7925.3030007@efele.net> On 12/10/2015 2:45 AM, Frédéric Grosshans wrote: > Le 10/12/2015 05:32, Martin J. Dürst a écrit : >> A similar example is the use of accents on upper-case letters in >> French in France where 'officially', upper-case letters are written >> without accents. > Actually, the official body in charge of this (Académie Française) They actually mandate "Académie *f*rançaise". And "Imprimerie *n*ationale" (for Philippe; even if imprimerienationale.fr has forgotten that). > has always recommended upper-case letters with accents, but the > school teachers teach the other way, and accents on capital letters > was technically challenging (in printing, writing machines and keyboard), Thanks to gallica.fr and archive.org, it is easy to see what actually happened until the middle of the 20th century. What I have seen is that in both cold and hot metal, until the end of the 19th century, one only and always sees ? ? ? ? ? ? ?; on small caps, one can sometimes find ? ? ? ?. That matches all the descriptions of the "casse parisienne" and "police" (how many "a", "b", "c", etc. in a font) I have seen in typography manuals. Around the beginning of the 20th century, one starts to see books without accented capitals (and unfortunately books with inconsistent use of the accented capitals). Eric. From charupdate at orange.fr Fri Dec 11 04:30:50 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 11 Dec 2015 11:30:50 +0100 (CET) Subject: Re: Proposal for German capital letter "ß" In-Reply-To: References: <3E10480FE4510343914E4312AB46E74212B69AE4@DEFTHW99EH5MSX.ww902.siemens.net> <5668C79D.1050308@kli.org> <56690040.9090101@it.aoyama.ac.jp> <4B81AE39-C1BC-445C-9EBB-4097CFCCCD6A@viagenie.ca> Message-ID: <1229555934.9030.1449829851148.JavaMail.www@wwinf1m11> On Thu, 10 Dec 2015 10:56:50 -0800, Leo Broukhis wrote: > This prompts a question: for case conversion bijectivity in fr_FR > locale, should there be "invisible accents"? E.g. > déjà -> DE(combining invisible acute accent)JA(combining invisible > grave accent) -> déjà > whereas in fr_CA locale, it is simply > déjà -> DÉJÀ -> déjà In fr_FR locale, it is, too. Thank you for your courtesy; invisible diacritics would indeed be a very good idea if undiacriticized uppercase were really an actual need. But since your proposal is about case *conversion*, it's meant for *new* text, as opposed to historical editing. Introducing a mechanism to get accents off the caps without altering lowercase is doubly useless. First, because undiacriticized uppercase is far from being an ideal; it's a mere second best that grew usual for a time but should no longer have a place. Second, because it would mainly become useful in case conversion of *existing* all-caps text that obviously has been written without the new invisible accents. 
Eric's finding [http://www.unicode.org/mail-arch/unicode-ml/y2015-m12/0041.html] that 'E' was always diacriticized but 'A' wasn't always, illustrates partly the pragmatic second-best solution of avoiding the accent on top when it often breaks away on lead typography letters, and partly the dislike of such on-tip accents which some people considered as "ugly". But in turn this dislike could have been the product of simply seldom seeing the accent on the tip of the 'A'. Fortunately all these byways are now past and useless. Subsequently, I feel the need to strongly underscore Ralf Herrmann's conclusion on 23 Jan 2011 in the blog post that Asmus linked to [http://www.unicode.org/mail-arch/unicode-ml/y2015-m12/0036.html]: [quote] The capital Eszett is now used more every day. It is included in several Windows 7 fonts and more and more type designers are designing a capital Eszett for newly released typefaces. I would like to finish with a quote about the capital Eszett from 1879, which I consider as true today as it was then: "Indeed – it is a new character; but maybe this newness is the only thing you can hold against it." (Original quote: "Allerdings – es ist ein neues Zeichen; vielleicht ist aber die Neuheit das Einzige, was sich dagegen vorbringen lässt.") [/quote] IMHO the full achievement of Unicode is to be able to not only reproduce inherited practice, but above all, to enhance the actual one. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Fri Dec 11 06:28:48 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 11 Dec 2015 12:28:48 +0000 Subject: Devanagari and Subscript and Superscript In-Reply-To: References: Message-ID: <20151211122848.03ad0d7b@JRWUBU2> On Wed, 9 Dec 2015 03:24:39 +0000 Plug Gulp wrote: > I am trying to understand if there is a way to use Devanagari > characters (and grapheme clusters) as subscript and/or superscript in > unicode text. Why do you want to do this? Are you asking about writing Devanagari vertically rather than horizontally? If that is what you want, you should be looking at mark-up such as is found in cascading style sheets (CSS). It is an important issue for CJK and Mongolian, and there have been questions as to what is needed for Indian scripts. (There's also an antiquarian interest for historical scripts, such as Phags-pa and even Egyptian - moves are afoot to support the hieroglyphic script as plain text.) Richard. 
From plug.gulp at gmail.com Tue Dec 15 05:55:02 2015 From: plug.gulp at gmail.com (Plug Gulp) Date: Tue, 15 Dec 2015 11:55:02 +0000 Subject: Devanagari and Subscript and Superscript In-Reply-To: <5667B9B5.3010208@it.aoyama.ac.jp> References: <5667B9B5.3010208@it.aoyama.ac.jp> Message-ID: On Wed, Dec 9, 2015 at 5:18 AM, Martin J. Dürst wrote: > > I suggest using HTML: > > ?? ?? > This will work only if the end-users are always going to use a web browser to view the text content. It will help if the Unicode standard itself intrinsically supports generalised subscript/superscript text. I think the meaning of the text should be contained within the text itself rather than relying on external text markers and viewers. That way the text-content creator does not have to rely on what type of Unicode-compliant text viewer or editor the end user is using. The text should retain its meaning irrespective of the type of Unicode-compliant text viewer or editor used. Similarly, if the text has to be saved in a database without losing its meaning, then either it has to be saved with all the known markers of all the available editors, or some special processing needs to be incorporated to convert some saved marker to markers of various available text viewers and editors. Having generalised Unicode support for superscript and subscript will solve all these problems. Following is one of the use-cases where general Unicode support for superscript/subscript will help tremendously: A math teacher (??????? ??????) in a Marathi (?????) language school is writing notes, in her Unicode-compliant plain text editor, to explain mathematical terms to her students. Following is an excerpt from the notes that explains terms such as exponents (??????) and base (????). (English translation is given below): "?????? ??????? ?????? ???????? ???? ???? ??????? ???? ?????? ???? ?????????? ???????? ???????????? ???????? ?????? ??? ???????. ??????????, ? ?? ?????? ?? ??????? ? ???? ????? ??? ????, ?????? ? x ? x [...]" English translation: "Exponent is a shorthand notation that denotes a multiplication of a number by itself a number of times. For example, if a number 5 is multiplied by itself 3 times i.e. 5 x 5 x 5, then it is represented in an exponential form as 5^3. This exponential term is referred to as "5 raised to the power of 3". Let us consider another example, "2 raised to the power of 10", i.e. 2 is multiplied by itself 10 times. This is written in exponential form as 2^10. So, in general any number b that is multiplied by itself k number of times is written as b^k and the term is referred to as "b raised to the power of k". The number b is called the base, and the number k is called the exponent. In short, exponential term is written as base^exponent." Please note that the teacher had to use a Circumflex Accent (Caret) to indicate superscript, which is an unwritten convention, in the absence of proper superscript support within Unicode. To make the text available to a wider audience and still retain its meaning, the teacher will have to partly rely on Unicode support, partly on the markers available in the various text viewers of her students, partly on the markers available in the text editors of the peer-reviewers of her text and partly on the unwritten convention (such as the caret). This conundrum can be resolved only if there is a generalised support for superscript and subscript within the Unicode standard. The standard already has a section for superscript and subscript. Generalising and extending this support will help other languages and scripts. General support for all characters, words and sentences could be achieved by just three new formatting characters, e.g. SCR, SUP and SUB, similar to the way other formatting characters such as ZWS, ZWJ, ZWNJ etc are defined. 
The new formatting characters could be defined as: SCR: In a character stream, all the characters following this formatting character shall be treated as normal text until either the end of the character stream or the next SUP or SUB character is reached. This shall be the default marker i.e. if no marker is specified then the text shall be treated as normal text until either the end of the character stream or the next SUP or SUB character is reached. SUP: In a character stream, all the characters following this formatting character shall be treated as superscript text until either the end of the character stream or the next SCR or SUB character is reached. SUB: In a character stream, all the characters following this formatting character shall be treated as subscript text until either the end of the character stream or the next SCR or SUP character is reached. A general support within Unicode for subscripting and superscripting text(characters and words) will tremendously help languages and scripts that are not English/Latin. Thanks and kind regards, ~Plug >> >> Hi, >> >> I am trying to understand if there is a way to use Devanagari >> characters (and grapheme clusters) as subscript and/or superscript in >> unicode text. It will help if someone could please direct me to any >> document that explains how to achieve that. Is there a unicode marker >> that will treat the next grapheme cluster in the unicode text as >> super/subscript? For e.g. if one wants to represent "? raise to ???" >> how does one achieve that; is there a marker to represent it as >> follows: ? + SUP + ? + ? + ? >> where SUP acts as a marker for superscripting the next grapheme >> cluster. Similar for subscripting. >> >> Sorry if this is not the right place to ask this question; in that >> case please could you direct me to the right forum? >> >> Thanks and kind regards >> >> ~Plug >> >> . 
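As a side note on the caret convention in the teacher example: Unicode already encodes dedicated superscript characters for a limited repertoire, including the ASCII digits, so a purely numeric exponent such as 5^3 can be written in plain text today. The sketch below is a minimal illustration in Python, not a proposal; the helper name `superscript_digits` and the mapping table are assumptions made up for this example, and the table deliberately covers only the ten ASCII digits.

```python
# Map ASCII digits to their dedicated superscript code points
# (U+2070, U+00B9, U+00B2, U+00B3, U+2074..U+2079).
SUPERSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)


def superscript_digits(exponent: str) -> str:
    """Replace ASCII digits by superscript digits; leave everything else alone."""
    return exponent.translate(SUPERSCRIPT_DIGITS)


print("5" + superscript_digits("3"))        # numeric exponent, one digit
print("2" + superscript_digits("10"))       # numeric exponent, two digits
# DEVANAGARI DIGIT THREE is not in this table, so it is returned unchanged.
print("\u096b" + superscript_digits("\u0969"))
```

The last call shows the limitation that motivates the request in this message: characters outside the small mapped set pass through unchanged, so the approach does not generalise to arbitrary Devanagari text.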
From khaledhosny at eglug.org Tue Dec 15 09:08:56 2015 From: khaledhosny at eglug.org (Khaled Hosny) Date: Tue, 15 Dec 2015 19:08:56 +0400 Subject: Devanagari and Subscript and Superscript In-Reply-To: References: <5667B9B5.3010208@it.aoyama.ac.jp> Message-ID: <20151215150856.GA14575@khaled-laptop> On Tue, Dec 15, 2015 at 11:55:02AM +0000, Plug Gulp wrote: > Please note that the teacher had to use a Circumflex Accent (Caret) to > indicate superscript, which is an unwritten convention, in the absence > of proper superscript support within Unicode. If the teacher is explaining actual math to his students, then the superscript is the least of his worries. Math typesetting is two-dimensional, and is so much more complex than regular formatted text (let alone regular plain text) that it needs its own typesetting engines. There are various plain text markup languages to mark up math, if one really wants to represent complex mathematical notation in plain text. Regards, Khaled From doug at ewellic.org Tue Dec 15 11:46:06 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 15 Dec 2015 10:46:06 -0700 Subject: Devanagari and Subscript and Superscript Message-ID: <20151215104606.665a7a7059d7ee80bb4d670165c8327d.b928c0b462.wbe@email03.secureserver.net> Plug Gulp wrote: > It will help if Unicode standard itself intrinsically supports > generalised subscript/superscript text. This falls outside the scope of "plain text" as defined by Unicode, in much the same way as bold and italic styles and colors and font faces and sizes. There are several rich-text formats besides HTML that support arbitrary subscript and superscript text. PDF and Word leap to mind. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? 
From sisrivas at blueyonder.co.uk Tue Dec 15 12:00:16 2015 From: sisrivas at blueyonder.co.uk (srivas sinnathurai) Date: Tue, 15 Dec 2015 18:00:16 +0000 (GMT) Subject: Devanagari and Subscript and Superscript In-Reply-To: <20151215104606.665a7a7059d7ee80bb4d670165c8327d.b928c0b462.wbe@email03.secureserver.net> References: <20151215104606.665a7a7059d7ee80bb4d670165c8327d.b928c0b462.wbe@email03.secureserver.net> Message-ID: <85569083.271976.1450202416115.JavaMail.open-xchange@oxbe7.tb.ukmail.iss.as9143.net> Does the standard support the use of diacritics in plain text format, when used with all and any complex scripts? Regards Sinnathurai > > On 15 December 2015 at 17:46 Doug Ewell wrote: > > > Plug Gulp wrote: > > > It will help if Unicode standard itself intrinsically supports > > generalised subscript/superscript text. > > This falls outside the scope of "plain text" as defined by Unicode, in > much the same way as bold and italic styles and colors and font faces > and sizes. > > There are several rich-text formats besides HTML that support arbitrary > subscript and superscript text. PDF and Word leap to mind. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Tue Dec 15 13:26:38 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 15 Dec 2015 12:26:38 -0700 Subject: Devanagari and Subscript and Superscript Message-ID: <20151215122638.665a7a7059d7ee80bb4d670165c8327d.96ff80d5fe.wbe@email03.secureserver.net> srivas sinnathurai wrote: > Does the standard support the use of diacritics in plain text format, > when used with all and any complex scripts? It probably depends on what you mean by "support" and "diacritics." I can type a Tamil letter followed by a combining acute accent or diaeresis, and in Arial Unicode MS it actually looks halfway decent. Many years ago, William Overington famously put a combining circumflex on top of U+2604 COMET. 
You just type one character followed by another and hope for the best, display-wise. You don't get any other special behavior. I'm not sure if this was supposed to be a comment on my statement that arbitrary subscript and superscript is similar to other attributes that are not defined to be part of plain text. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From richard.wordingham at ntlworld.com Tue Dec 15 18:01:05 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 16 Dec 2015 00:01:05 +0000 Subject: Devanagari and Subscript and Superscript In-Reply-To: <85569083.271976.1450202416115.JavaMail.open-xchange@oxbe7.tb.ukmail.iss.as9143.net> References: <20151215104606.665a7a7059d7ee80bb4d670165c8327d.b928c0b462.wbe@email03.secureserver.net> <85569083.271976.1450202416115.JavaMail.open-xchange@oxbe7.tb.ukmail.iss.as9143.net> Message-ID: <20151216000105.21f5fb62@JRWUBU2> On Tue, 15 Dec 2015 18:00:16 +0000 (GMT) srivas sinnathurai wrote: > Does the standard support the use of diacritics in plain text format, > when used with all and any complex scripts? Relatively few scalar value sequences are prohibited - just possibly sequences containing unassigned characters that are not non-characters, but I can't think of any others. (The prohibition on unpaired surrogates applies to coded character sequences, but surrogate characters aren't scalar values.) It would appear by Conformance Requirement C5, 'A process shall not assume that it is required to interpret any particular coded character sequence', that a process is at liberty to decline to interpret a sequence of scalar values, even if it has just interpreted it. I am not aware of any requirements in the standard to interpret specific character sequences. In general, the interpretation of character sequences is undefined. 
For example, a request for advice on the interpretation of the combination of U+0331 COMBINING MACRON BELOW and U+0E39 THAI CHARACTER SARA UU was answered with the instruction to consult the non-existent typographical tradition. It's been left to rendering engine writers to define the interpretation. Indeed, I am not sure that every sequence of defined scalar values has an interpretation. Most pairs of regional indicators don't have an interpretation, and the interpretation of each variation sequence may change at least twice, once when the base character becomes defined (or is defined not to be a possible base character), and again when the variation sequence is assigned an interpretation as an ill-defined (or grossly ill-defined) family of glyphs. Do U+0337 COMBINING SHORT SOLIDUS OVERLAY and U+20E5 COMBINING REVERSE SOLIDUS OVERLAY have a defined interpretation when their base character is to be represented by a mirrored glyph? Note that in general, the Unicode standard does not define when a character is to be represented by a mirrored glyph. This may be defined by a lower level protocol (the font file). Richard. From doug at ewellic.org Wed Dec 16 12:16:28 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 16 Dec 2015 11:16:28 -0700 Subject: Devanagari and Subscript and Superscript Message-ID: <20151216111628.665a7a7059d7ee80bb4d670165c8327d.e6eee701f7.wbe@email03.secureserver.net> I missed this yesterday. Plug Gulp wrote: > General support for all characters, words and sentences could be > achieved by just three new formatting characters, e.g. SCR, SUP and > SUB, similar to the way other formatting characters such as ZWS, ZWJ, > ZWNJ etc are defined. The new formatting characters could be defined > as: > > SCR: In a character stream, all the characters following this > formatting character shall be treated as [...] > > SUP: In a character stream, all the characters following this > formatting character shall be treated as [...] 
> > SUB: In a character stream, all the characters following this > formatting character shall be treated as [...] This isn't similar to ZWSP or ZWJ or ZWNJ. Those formatting characters are not stateful; they affect the rendering of, at most, the single characters immediately preceding and following them. The ones you suggest are stateful; they affect the rendering of arbitrary amounts of subsequent data, in a way reminiscent of ECMA-48 ("ANSI") attribute switching, or ISO 2022 character-set switching. Unicode tries hard to avoid encoding such things. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From gwalla at gmail.com Wed Dec 16 12:17:06 2015 From: gwalla at gmail.com (Garth Wallace) Date: Wed, 16 Dec 2015 10:17:06 -0800 Subject: Hentaigana proposal In-Reply-To: References: Message-ID: On Wed, Dec 9, 2015 at 7:55 AM, Nicolas Tranter wrote: > I comment as a western Japanologist who teaches and researches using > hentaigana. I have published with hentaigana using image files (resulting in > two publisher errors) and will publish next year with hentaigana using the > Koin Hentaigana font (Koin????????.tte), and anticipate typesetting > problems. I refer to the 2015 proposal L2/15-239 to include hentaigana, > including the appended paper by Takada Tomokazu, Yada Tsutomu and Saito > Tatsuya ('The past, present and future of Hentaigana Standardization for > Information Interchange'). I also refer to Yada Tsutomu's support of the > proposal ('About the inclusion of standardized codepoints for Hentaigana', > L2/15-318). As the names and numbering of proposed characters is an issue I > deal with below, I also refer to individual hentaigana in the proposal by > their MJ-codes as used in the proposers' own websites (e.g. > http://mojikiban.ipa.go.jp/xb164/). > > > > SELECTION: The selection is good, consisting of 286 forms, although this > would be realised as 299 characters. 
The earlier 2009 proposal referred to > was based on the Mojikyo M113.ttf font, which has 213 hentaigana characters > and includes a few major basic gaps. The Koin Hentaigana font has 549 > characters, which excluding separate forms with voicing and 'half-voicing' > diacritics consists of 330 hentaigana, but includes some very rare forms, > including ones that do not occur in late period texts. > > > > The selection of 'academic' hentaigana is appropriate and lacks major gaps. > On the other hand, the Ministry of Justice hentaigana requirements are ones > that have been decided by the Ministry of Justice in 2004 for name > registration purposes, and so, although one could argue easily with their > 2004 decision (and I would), the fact that they are already official means > it is pointless to argue with their inclusion in Unicode. > > > > It's been noted that a few hentaigana are almost identical to normal > hiragana, especially e HENTAIGANA LETTER E VARIANT 4 = MJ090017 (cf. ?), shi > HENTAIGANA LETTER SI VARIANT 2 = MJ090072 (cf. HIRAGANA LETTER SI ?) and nu > HENTAIGANA LETTER NU VARIANT 2 = MJ090149 (cf. HIRAGANA LETTER NU ?): their > differences are solely that the 'brush' is removed from the paper on a > downward rather than a rightward flourish, reflecting vertical handwriting. > Ordinarily I would argue against including them, but since the MoJ has > recognised them as official variants they need to be included. > > > > The decision to propose in most cases one codepoint for the hentaigana > derived from a single Chinese character is sensible, as also is the decision > to allow multiple codepoints in certain cases where manuscripts use > side-by-side significantly distinct forms derived from the same Chinese > character and with the same value. 
An example of the latter is HENTAIGANA > LETTER KA VARIANT 3 = MJ090025 and KA VARIANT 4 = MJ090026, both pronounced > ka and both derived from the Chinese character ?, but which are routinely > both found in the same manuscript by the same hand as if they were separate > graphemes from the Heian to the Meiji periods. > > > > POLYPHONY. Several hentaigana are truly polyphonous (e.g. the ?-derived > hentaigana = ne MJ090151 or MJ090059 ko, or the ?-derived hentaigana = me > MJ090222 or ma MJ090205). In particular, those hentaigana derived from ? and > associated with n (MJ090298, MJ090299) historically (also the source of > HIRAGANA LETTER N ん) are also used for mu (MJ090214, MJ090215) and mo > (MJ090224, MJ090223). Diachronically, n in native Japanese words is usually > derived from an earlier mu. Takada et al. includes a list of 10 kanji > sources that this applies to in the proposed repertoire. (Strictly, this > affects 11 hentaigana, because the proposal has two forms for ?-derived > characters.) The proposal's solution is to assign different identifiers, > e.g. ? = HENTAIGANA LETTER NE VARIANT 1 and HENTAIGANA LETTER KO VARIANT 2, > ? = HENTAIGANA LETTER ME VARIANT 3 and HENTAIGANA LETTER MA VARIANT 7, and > the two derived from ? = HENTAIGANA LETTER N VARIANT 1, N VARIANT 2, MU > VARIANT 1, MU VARIANT 2, MO VARIANT 1 and MO VARIANT 2. This means that > there would be characters that are given more than one codepoint and > identifier but are formally and etymologically identical, adding 13 > unnecessary repetitions to the character set. I would favour Yada's naming > system, where the polyphonous characters are given a single codepoint and > identifier, e.g. ? = HENTAIGANA LETTER NE-KO, ? = HENTAIGANA ME-MA, and two > ?-derived forms = HENTAIGANA LETTER N-MU-MO 1 and N-MU-MO 2. Is there a reason for sticking with the "VARIANT 1"/"VARIANT 2" naming convention? 
The previous proposal was for standardized variation sequences, so this opaque numbering made sense (since "VARIANT 1" meant "using the first variation selector"), but the current one is to encode them all as atomic characters. Wouldn't it be more helpful to give them more descriptive names, possibly by identifying the particular ideographs each is derived from? For example, instead of HENTAIGANA LETTER E VARIANT 2, it could be HENTAIGANA LETTER E FROM CJK-76C8. This doesn't help with same-source variants, but physical features could work for that, e.g. HENTAIGANA LETTER YO VARIANT4 -> HENTAIGANA LETTER YO FROM CJK-8207 WITH CROSSBAR HENTAIGANA LETTER YO VARIANT5 -> HENTAIGANA LETTER YO FROM CJK-8207 WITH LOOP HENTAIGANA LETTER YO VARIANT6 -> HENTAIGANA LETTER YO FROM CJK-8207 WITH ZIGZAG It's more verbose but it seems like it would be useful to be able to identify which variant is which from the name instead of having to consult the code charts (which IIRC aren't normative) or some supplementary table. From leoboiko at namakajiri.net Wed Dec 16 12:31:46 2015 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Wed, 16 Dec 2015 16:31:46 -0200 Subject: Hentaigana proposal In-Reply-To: References: Message-ID: I like the more descriptive names, but I'd like to have this data available in some supplementary table available anyway, regardless of the naming scheme. -------------- next part -------------- An HTML attachment was scrubbed... 
URL:
From verdy_p at wanadoo.fr Wed Dec 16 19:50:09 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 17 Dec 2015 02:50:09 +0100
Subject: Devanagari and Subscript and Superscript
In-Reply-To: <20151216111628.665a7a7059d7ee80bb4d670165c8327d.e6eee701f7.wbe@email03.secureserver.net>
References: <20151216111628.665a7a7059d7ee80bb4d670165c8327d.e6eee701f7.wbe@email03.secureserver.net>
Message-ID:

2015-12-16 19:16 GMT+01:00 Doug Ewell :

> The ones you suggest are stateful; they affect the rendering of
> arbitrary amounts of subsequent data, in a way reminiscent of ECMA-48
> ("ANSI") attribute switching, or ISO 2022 character-set switching.
> Unicode tries hard to avoid encoding such things.

You can try as hard as you want, but there are cases where it is impossible to avoid stateful encoding if we want to avoid disunifications, and some characters cannot work at all without stateful analysis. Nor is this solved by style markup when that "style" is in fact completely semantic. The situation must be considered with more care:

- For example, the superscript Latin letter o, aka "ordinal masculine", is not just a superscript but a notation adding the semantics of an abbreviation for the final letters, linked to the other letters before it, the whole being semantically a single word. The superscript style does not create such an attachment; it creates a separate "word" inside it, so the character was disunified from the letter o.
- But it is not good practice to encode in Unicode things that are just styles without clear semantics (so encoding SUB/SUP is really a bad idea).
- On the other hand, it is simply impossible to work with Egyptian hieroglyphs, as the default clusters are clearly insufficient to create ANY kind of plain text: you need extra markup to add the necessary semantics, not style, and this markup should be encodable as plain text, without external markup for the presentation, when this presentation is fully semantic and clear (e.g.
the Egyptian "cartouche" for names of kings).
- A similar issue occurs with SignWriting and other scripts that ALWAYS require a complex (non-linear) layout where basic clusters are clearly insufficient in ALL texts, meaning that the characters that were encoded are almost **useless** in all plain-text documents: you need extra "format" characters to create some form of orthographic rule, independently of the style or of an external markup language.

I'm in favor of adding **semantic** format characters in Unicode, not stylistic-only format characters, as soon as there exists a well-known orthographic convention which would work independently of styling. But for now the encoded format characters only work on clusters that are too small; clusters are only linear, and this is clearly not enough (even for instructing other kinds of text analysis, such as breakers). Renderers will then be adapted and extended to work with more complex clusters whose internal structure is built from simpler cluster parts. Other renderers using the legacy rules will not be able to do that but will attempt to render some basic fallback (possibly with special visible glyphs for those controls).

One kind of semantic format character which is useful and encoded is the "invisible parentheses" for mathematics, which can be encoded for example after a radical sign: use them around a number to define the extension of the radical to more than one digit (making a clear visual and semantic distinction between "sqrt(24)" and "sqrt(2)4" when you don't want to render any parentheses, or between "sqrt(2+sqrt(3))" and "sqrt(2)+sqrt(3)").

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From doug at ewellic.org Thu Dec 17 15:19:53 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 17 Dec 2015 14:19:53 -0700 Subject: UN/LOCODE perspective on character sets Message-ID: <20151217141953.665a7a7059d7ee80bb4d670165c8327d.9d9d3214e7.wbe@email03.secureserver.net> UN/LOCODE version 2015-2 has been released [1], and the Manual still contains the following about character sets: "27. Place names in UN/LOCODE are given in their national language versions as expressed in the Roman alphabet using the 26 characters of the character set adopted for international trade data interchange, with diacritic signs, when practicable (cf. Paragraph 3.2.2 of the UN/LOCODE Manual). International ISO Standard character sets are laid down in ISO 8859-1 (1987) and ISO10646-1 (1993). (The standard United States character set (437), which conforms to these ISO standards, is also widely used in trade data interchange)." Spot the errors. [1] http://www.unece.org/cefact/codesfortrade/codes_index.html -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Thu Dec 17 15:55:00 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 17 Dec 2015 22:55:00 +0100 Subject: UN/LOCODE perspective on character sets In-Reply-To: <20151217141953.665a7a7059d7ee80bb4d670165c8327d.9d9d3214e7.wbe@email03.secureserver.net> References: <20151217141953.665a7a7059d7ee80bb4d670165c8327d.9d9d3214e7.wbe@email03.secureserver.net> Message-ID: Good catch. Once again a lot of misconception by someone who wrote it without looking at conformance requirements in these standards. The so called "standard United States character set (437)" is also a proprietary legacy charset widely used in the US but not adopted as an US standard. It should have been named "IBM/MS DOS code page 437" without reference to US (in fact it was used worldwide as the default charset on many PC's). 
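The point that code page 437 does not conform to ISO 8859-1 can be checked directly. What follows is a minimal Python sketch, not part of the original message. It assumes only the "cp437" and "latin-1" codecs that ship with Python; the helper names decode_both and shared_high_characters are made up for illustration.

```python
# Illustrative sketch: compare how CP437 and ISO 8859-1 interpret the same bytes.
# Only Python's built-in "cp437" and "latin-1" codecs are used.

def decode_both(byte_value):
    """Return (CP437 character, ISO 8859-1 character) for one byte value."""
    raw = bytes([byte_value])
    return raw.decode("cp437"), raw.decode("latin-1")


def shared_high_characters():
    """Non-ASCII characters that exist in both repertoires, at any byte position."""
    cp437 = {bytes([b]).decode("cp437") for b in range(0x80, 0x100)}
    # 0x80-0x9F are C1 control codes in ISO 8859-1, so start at 0xA0.
    latin1 = {bytes([b]).decode("latin-1") for b in range(0xA0, 0x100)}
    return cp437 & latin1


if __name__ == "__main__":
    cp437_char, latin1_char = decode_both(0xE9)
    print("byte 0xE9 as CP437:", cp437_char, "/ as ISO 8859-1:", latin1_char)
    print("shared non-ASCII characters:", "".join(sorted(shared_high_characters())))
```

The two charsets agree on the printable ASCII range, but the same byte above 0x7F generally means different things: 0xE9 is a Greek capital theta in CP437 and e with acute in ISO 8859-1, while CP437 stores e with acute at 0x82. Only the characters returned by shared_high_characters survive a round trip between the two, and only through transcoding.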
But basically what this says is that UN/LOCODE works only with the subset of characters found in both ISO 8859-1 and CP437, and this is what "diacritic signs, when practicable" means. Of course it is *interoperable* with ISO 10646-1, but only via a transcoding conversion.

CP437 ***was*** widely used in trade data interchange, but this has not been true for a long time (ISO 8859-1 was adopted much more widely, and now ISO 10646-1 is preferred, most of the time using UTF-8). There still exist some old files for dBase II/III (as used in the 1980s by old software running on MS-DOS) or similar that are encoded in CP437, but those old files are not updated with the changes needed in 2015. Modern databases run on SQL engines with interfaces exposing ISO 10646-1 (UTF-8), or only ISO 8859-1 in the US and Western Europe.

UN/LOCODE should not target just the US or Western Europe. It should work as a worldwide standard, so it has to accept names in languages such as Czech or Polish that need Latin letters with diacritics not found in ISO 8859-1 but found in other legacy ISO 8859-* charsets. Those languages are not transliterated to simpler forms, unlike names in Russian, Chinese, Thai, Hebrew or Arabic, which define their own standard romanizations that also require characters not found in ISO 8859-1.

For UN/LOCODE, the romanizations should preferably be the international romanizations defined for toponyms. But there is not even a reference to those existing standards (widely used in Russia, China, Japan, Israel, and Arabic countries). This omission is not forgivable.

My opinion is that this paragraph has in fact not been updated for a very long time, as it should have been in this 2015-2 version.
Due to that, the names listed in UN/LOCODE are very questionable (and anyway the location codes in UN/LOCODE are largely deprecated in favor of ISO 3166-* codes, where available, or names used by IATA or ICAO, or postal codes in countries that have defined them, or region codes defined by their national or regional statistics institute).

2015-12-17 22:19 GMT+01:00 Doug Ewell :

> UN/LOCODE version 2015-2 has been released [1], and the Manual still
> contains the following about character sets:
>
> "27. Place names in UN/LOCODE are given in their national language
> versions as expressed in the Roman alphabet using the 26 characters of
> the character set adopted for international trade data interchange, with
> diacritic signs, when practicable (cf. Paragraph 3.2.2 of the UN/LOCODE
> Manual). International ISO Standard character sets are laid down in ISO
> 8859-1 (1987) and ISO10646-1 (1993). (The standard United States
> character set (437), which conforms to these ISO standards, is also
> widely used in trade data interchange)."
>
> Spot the errors.
>
> [1] http://www.unece.org/cefact/codesfortrade/codes_index.html
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From boldewyn at gmail.com Thu Dec 17 16:06:10 2015
From: boldewyn at gmail.com (Manuel Strehl)
Date: Thu, 17 Dec 2015 23:06:10 +0100
Subject: Additional adoption form on codepoints.net
Message-ID: <567331D2.1000007@gmail.com>

Hi,

please let me start by saying that I think the adoption of characters is a very good idea to provide funding for the development of Unicode.

To promote this idea, I thought it could be worthwhile to place an "adopt this codepoint" button on the description pages of code points on https://codepoints.net. But before going live with that, I'd love to hear feedback, especially on whether the people in charge of the adoption process share my feelings.
This is the implementation on my staging site as of now (sans a bit polishing): https://beta.codepoints.net/U+2F45 With Javascript enabled you should see on the right three buttons, the last one labelled "Adopt this codepoint". On clicking a dialog opens, that provides the same input possibilities that the adoption page on unicode.org shows. Submitting the form leads to the processing form on unicode.org. I tried to be sensible and explicit as to how the affiliation situation is between codepoints.net and Unicode (there is none) and what happens, when the form is filled. If the consensus is, that I should go on with this, I'd like to ask some follow-up questions: * Will the current adoption form stay stable with regard to POST parameters it accepts and to its URL? * URL: Will the form be accessible by HTTPS in the future? * Copy: Is my wording OK? Should I change something (more details, legalese, ...)? * Can characters be double-adopted? If not, is there a machine-readable list, that I can access to remove the button on already adopted code points? Thanks for your time and consideration! Cheers, Manuel From leob at mailcom.com Thu Dec 17 16:38:21 2015 From: leob at mailcom.com (Leo Broukhis) Date: Thu, 17 Dec 2015 14:38:21 -0800 Subject: Additional adoption form on codepoints.net In-Reply-To: <567331D2.1000007@gmail.com> References: <567331D2.1000007@gmail.com> Message-ID: As far as I'm concerned, the pop-up contents should end with the link "Read more about codepoint adoption." In your brief description, you might want to add a proviso about the temporary nature of character adoption. Submitting forms to a 3rd party site is a bad idea. Leo On Thu, Dec 17, 2015 at 2:06 PM, Manuel Strehl wrote: > Hi, > > please let me start by saying, that I think the adoption of characters > is a very good idea to provide funding for the development of Unicode. 
> > To promote this idea, I thought it could be worthwhile to place an > "adopt this codepoint" button on the description pages of code points on > https://codepoints.net. But before going live with that, I'd love to > hear feedback, especially if the people in charge of the adoption > process share my feelings. > > This is the implementation on my staging site as of now (sans a bit > polishing): > > https://beta.codepoints.net/U+2F45 > > With Javascript enabled you should see on the right three buttons, the > last one labelled "Adopt this codepoint". On clicking a dialog opens, > that provides the same input possibilities that the adoption page on > unicode.org shows. Submitting the form leads to the processing form on > unicode.org. > > I tried to be sensible and explicit as to how the affiliation situation > is between codepoints.net and Unicode (there is none) and what happens, > when the form is filled. > > If the consensus is, that I should go on with this, I'd like to ask some > follow-up questions: > > * Will the current adoption form stay stable with regard to POST > parameters it accepts and to its URL? > * URL: Will the form be accessible by HTTPS in the future? > * Copy: Is my wording OK? Should I change something (more details, > legalese, ...)? > * Can characters be double-adopted? If not, is there a machine-readable > list, that I can access to remove the button on already adopted code points? > > Thanks for your time and consideration! > > Cheers, > Manuel From boldewyn at gmail.com Thu Dec 17 16:48:07 2015 From: boldewyn at gmail.com (Manuel Strehl) Date: Thu, 17 Dec 2015 23:48:07 +0100 Subject: Additional adoption form on codepoints.net In-Reply-To: References: <567331D2.1000007@gmail.com> Message-ID: <56733BA7.1060802@gmail.com> Thanks for the comment! > As far as I'm concerned, the pop-up contents should end with the link > "Read more about codepoint adoption." 
In your brief description, you > might want to add a proviso about the temporary nature of character > adoption. Good catch! The 12-month period is important to mention. Putting the "Read more" at the end sounds good to me, too. > Submitting forms to a 3rd party site is a bad idea. In principle I agree, But here I do that on purpose (the purpose being to prefill the "codepoint to adopt" field). Since the form on the adoption landing page doesn't fill in values given via form submit, I decided to re-build it and let it post directly to the next step. -Manuel From leob at mailcom.com Thu Dec 17 17:41:37 2015 From: leob at mailcom.com (Leo Broukhis) Date: Thu, 17 Dec 2015 15:41:37 -0800 Subject: Additional adoption form on codepoints.net In-Reply-To: <56733BA7.1060802@gmail.com> References: <567331D2.1000007@gmail.com> <56733BA7.1060802@gmail.com> Message-ID: By "should end with the link" I implied "should not contain anything that followed it at the time of my comment, including the form and the - thus obviated - disclaimer. :) If submitting a form to a 3rd party site results in an error for any reason, the error will be displayed by the target host. This will create a negative impression upon the target host, even if the bug is in the submitting page, therefore submitting forms to 3rd party sites should be avoided. You can provide an opportunity and a suggestion to copy (ctrl-C) the symbol in the pop-up. On Thu, Dec 17, 2015 at 2:48 PM, Manuel Strehl wrote: > Thanks for the comment! > >> As far as I'm concerned, the pop-up contents should end with the link >> "Read more about codepoint adoption." In your brief description, you >> might want to add a proviso about the temporary nature of character >> adoption. > Good catch! The 12-month period is important to mention. Putting the > "Read more" at the end sounds good to me, too. >> Submitting forms to a 3rd party site is a bad idea. 
> In principle I agree, But here I do that on purpose (the purpose being > to prefill the "codepoint to adopt" field). Since the form on the > adoption landing page doesn't fill in values given via form submit, I > decided to re-build it and let it post directly to the next step. > > -Manuel From unicode at maxtruxa.com Fri Dec 18 01:23:06 2015 From: unicode at maxtruxa.com (Max Truxa) Date: Fri, 18 Dec 2015 08:23:06 +0100 Subject: Additional adoption form on codepoints.net In-Reply-To: <56733BA7.1060802@gmail.com> References: <567331D2.1000007@gmail.com> <56733BA7.1060802@gmail.com> Message-ID: If that's the only reason to post the form yourself, you could request the addition of an optional get parameter on the adoption page. If it's clear that this parameter is intended to be used by third party services the "stability problem" would be solved. Otherwise, great idea! Thanks for the comment! > As far as I'm concerned, the pop-up contents should end with the link > "Read more about codepoint adoption." In your brief description, you > might want to add a proviso about the temporary nature of character > adoption. Good catch! The 12-month period is important to mention. Putting the "Read more" at the end sounds good to me, too. > Submitting forms to a 3rd party site is a bad idea. In principle I agree, But here I do that on purpose (the purpose being to prefill the "codepoint to adopt" field). Since the form on the adoption landing page doesn't fill in values given via form submit, I decided to re-build it and let it post directly to the next step. -Manuel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mark at macchiato.com Fri Dec 18 01:26:44 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Fri, 18 Dec 2015 08:26:44 +0100 Subject: UN/LOCODE perspective on character sets In-Reply-To: <20151217141953.665a7a7059d7ee80bb4d670165c8327d.9d9d3214e7.wbe@email03.secureserver.net> References: <20151217141953.665a7a7059d7ee80bb4d670165c8327d.9d9d3214e7.wbe@email03.secureserver.net> Message-ID: Haven't looked it over in detail, but here is the notice: http://www.unece.org/fileadmin/DAM/cefact/locode/2015-2_UNLOCODE_SecretariatNotes.pdf >From a quick scan: They've added latitude/longitude (to the minute, ~2km); that's great because often the names of locations are ambiguous. They still have deviations from the IATA codes, and various strange omissions. And (as you note) they don't include the native name, unless it can be spelled with a *subset* of Latin-1 characters (ugg). They list the ISO subdivision code sometimes, but no consistent inclusion relations for other codes (eg, they do have that San Francisco is in California, but they miss many other similar relations in other countries). And the latitude/longitude is often missing. More at http://www.unece.org/cefact/locode/welcome.html Mark On Thu, Dec 17, 2015 at 10:19 PM, Doug Ewell wrote: > UN/LOCODE version 2015-2 has been released [1], and the Manual still > contains the following about character sets: > > "27. Place names in UN/LOCODE are given in their national language > versions as expressed in the Roman alphabet using the 26 characters of > the character set adopted for international trade data interchange, with > diacritic signs, when practicable (cf. Paragraph 3.2.2 of the UN/LOCODE > Manual). International ISO Standard character sets are laid down in ISO > 8859-1 (1987) and ISO10646-1 (1993). (The standard United States > character set (437), which conforms to these ISO standards, is also > widely used in trade data interchange)." > > Spot the errors. 
> > [1] http://www.unece.org/cefact/codesfortrade/codes_index.html > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From boldewyn at gmail.com Fri Dec 18 01:47:03 2015 From: boldewyn at gmail.com (Manuel Strehl) Date: Fri, 18 Dec 2015 08:47:03 +0100 Subject: Additional adoption form on codepoints.net In-Reply-To: References: <567331D2.1000007@gmail.com> <56733BA7.1060802@gmail.com> Message-ID: Thank you! Yes, that's an implicit part of the "I'd like feedback from the people involved" :-) In fact, if such a GET parameter existed, I could remove the dialog and replace it with a simple link. This sounds like a good idea in principle. (It would also fix Leo Broukhis' issue.) Does anyone know, who is in charge of the actual form on the server, so that I can get in touch? Cheers, Manuel 2015-12-18 8:23 GMT+01:00 Max Truxa : > If that's the only reason to post the form yourself, you could request the > addition of an optional get parameter on the adoption page. If it's clear > that this parameter is intended to be used by third party services the > "stability problem" would be solved. > > Otherwise, great idea! > Thanks for the comment! > > > As far as I'm concerned, the pop-up contents should end with the link > > "Read more about codepoint adoption." In your brief description, you > > might want to add a proviso about the temporary nature of character > > adoption. > Good catch! The 12-month period is important to mention. Putting the > "Read more" at the end sounds good to me, too. > > Submitting forms to a 3rd party site is a bad idea. > In principle I agree, But here I do that on purpose (the purpose being > to prefill the "codepoint to adopt" field). Since the form on the > adoption landing page doesn't fill in values given via form submit, I > decided to re-build it and let it post directly to the next step. 
>
> -Manuel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From everson at evertype.com Mon Dec 21 18:10:26 2015
From: everson at evertype.com (Michael Everson)
Date: Tue, 22 Dec 2015 00:10:26 +0000
Subject: ISO 15924
Message-ID:

ISO 15924 has been updated.

"Root" has been exceptionally reserved at the request of the CLDR-TC.

Newa has been added for the Newa (Newar, Newari, Nepāla lipi) script.

Piqd has been added for Klingon (Klingon Language Institute pIqaD).

Zsye has been added as a variant of Zsym (compare Latf and Latg, variants of Latn) to indicate symbols used as emoji.

A Glad Solstice to all.

Michael Everson
Registrar

From costello at mitre.org Fri Dec 25 07:43:23 2015
From: costello at mitre.org (Costello, Roger L.)
Date: Fri, 25 Dec 2015 13:43:23 +0000
Subject: Symbol for an upside down capital L, pointing to the right?
Message-ID:

Hi Folks,

Here is the upside down capital L, pointing to the left:

⅂ - TURNED SANS-SERIF CAPITAL L (U+2142)

Is there a symbol for an upside down capital L, pointing to the right?

/Roger

From verdy_p at wanadoo.fr Fri Dec 25 08:04:11 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 25 Dec 2015 15:04:11 +0100
Subject: Symbol for an upside down capital L, pointing to the right?
In-Reply-To:
References:
Message-ID:

Greek Capital letter Gamma...

On 25 Dec 2015 14:54, "Costello, Roger L." wrote:

> Hi Folks,
>
> Here is the upside down capital L, pointing to the left:
>
> ⅂ - TURNED SANS-SERIF CAPITAL L (U+2142)
>
> Is there a symbol for an upside down capital L, pointing to the right?
>
> /Roger
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From jonathan at doves.demon.co.uk Mon Dec 28 15:26:14 2015
From: jonathan at doves.demon.co.uk (Jonathan Coxhead)
Date: Mon, 28 Dec 2015 13:26:14 -0800
Subject: Symbol for an upside down capital L, pointing to the right?
In-Reply-To:
References:
Message-ID: <5681A8F6.4070109@doves.demon.co.uk>

On 2015-12-25 5:43am, Costello, Roger L. wrote:

> Hi Folks,
>
> Here is the upside down capital L, pointing to the left:
>
> ⅂ - TURNED SANS-SERIF CAPITAL L (U+2142)
>
> Is there a symbol for an upside down capital L, pointing to the right?
>
> /Roger

Maybe these would help you?

⌜ TOP LEFT CORNER
⌝ TOP RIGHT CORNER
⌞ BOTTOM LEFT CORNER
⌟ BOTTOM RIGHT CORNER

-Jonathan

From asmus-inc at ix.netcom.com Mon Dec 28 17:48:05 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 28 Dec 2015 15:48:05 -0800
Subject: Symbol for an upside down capital L, pointing to the right?
In-Reply-To: <5681A8F6.4070109@doves.demon.co.uk>
References: <5681A8F6.4070109@doves.demon.co.uk>
Message-ID: <5681CA35.8000106@ix.netcom.com>

An HTML attachment was scrubbed...
URL:
From everson at evertype.com Tue Dec 29 06:01:06 2015
From: everson at evertype.com (Michael Everson)
Date: Tue, 29 Dec 2015 12:01:06 +0000
Subject: Symbol for an upside down capital L, pointing to the right?
In-Reply-To: <5681CA35.8000106@ix.netcom.com>
References: <5681A8F6.4070109@doves.demon.co.uk> <5681CA35.8000106@ix.netcom.com>
Message-ID: <86679926-6CAA-4819-BF13-7981FE43762D@evertype.com>

On 28 Dec 2015, at 23:48, Asmus Freytag (t) wrote:

> Rather than engage in reflexive ad-hoc unification like this, it would be useful to find out why U+2142 was disunified from TOP RIGHT CORNER and any other symbols having two strokes at right angle with one of them pointing down.

I think the letter-like symbols were added. Not "disunified". The default state is not that "everything is already encoded".

2142 and characters near it were added in Unicode 3.2, about the same time that the mathematical styled alphabets were added (3.1). I think you were involved with a lot of that, Asmus.
Michael Everson * http://www.evertype.com/

From asmus-inc at ix.netcom.com Tue Dec 29 07:09:13 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Tue, 29 Dec 2015 05:09:13 -0800
Subject: Symbol for an upside down capital L, pointing to the right?
In-Reply-To: <86679926-6CAA-4819-BF13-7981FE43762D@evertype.com>
References: <5681A8F6.4070109@doves.demon.co.uk> <5681CA35.8000106@ix.netcom.com> <86679926-6CAA-4819-BF13-7981FE43762D@evertype.com>
Message-ID: <568285F9.1010504@ix.netcom.com>

On 12/29/2015 4:01 AM, Michael Everson wrote:

> On 28 Dec 2015, at 23:48, Asmus Freytag (t) wrote:
>
>> Rather than engage in reflexive ad-hoc unification like this, it would be useful to find out why U+2142 was disunified from TOP RIGHT CORNER and any other symbols having two strokes at right angle with one of them pointing down.
> I think the letter-like symbols were added. Not "disunified". The default state is not that "everything is already encoded".
>
> 2142 and characters near it were added in Unicode 3.2, about the same time that the mathematical styled alphabets were added (3.1). I think you were involved with a lot of that Asmus.

I certainly was more active then.

If you follow the documents from the time, you'll find that U+2142 came from a set of letterlike characters that were (together) part of a set of mathematical characters being added. And, tellingly, it was not the only L shape in the set.

As the goal was to cover existing sets in full, and the source was entity or character sets rather than "examples in print", the analysis of some of the individual characters didn't go as deep as it has recently with the addition of one-off extensions. For example, the rationale for inclusion was membership in one (or more) of the sets; there was no independent verification that each member of these sets would individually merit encoding, and no examples of usage of individual characters were collected, the sets as such being well-established.
Still, a classification was carried out, based on information about usage available to various practitioners who were called as experts or were expert contributors to the proposal - but again, without further documentation for each individual character.

The characters later encoded at positions U+2142 and U+2143 were identified as "normal", that is, as variables (aka letter symbols) as opposed to operators, delimiters or the like. As they occur in the context of a turned capital G (U+2141) and inverted capital Y (U+2144) (all sans-serif), their classification appears to be well-motivated.

From the latest stages of the proposal documents it's not fully clear whether they were grouped in the source character sets, or whether their presentation in the current order already reflects a "sorting" of like characters for purposes of the final encoding. However, the character sets investigated did include other symbols (floors, ceilings) that were "L" shaped, and presumably the letter-like symbols existed, from the start, in contrast. This, together with the other sans-serif characters, even without documentation of where in mathematical notation these are employed, makes it unlikely that their identification should be questioned. (Again, the proposers may well have had more knowledge of usage than they documented, or some information may have only been present in hardcopy form - at the time not an uncommon occurrence.)

Anyway, unless we have specific reason to doubt that the classification of "normal" is indeed correct and that the shapes really are letters, let's assume that they were correctly identified and encoded as such.

If we now have a putative mirror image of one of these symbols, we need to know whether it is a letter, or some other symbol, perhaps an operator or a delimiter. If neither, then we can exclude unification with the corners, floors and ceilings, etc.
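The names and classification of the characters discussed in this thread can be looked up in the Unicode Character Database. The following is a small Python sketch added for illustration, not something from the original messages. It uses the standard unicodedata module; the list CODE_POINTS and the helper describe are hypothetical names, and the general category printed depends on the Unicode version bundled with the Python build in use.

```python
import unicodedata

# Letterlike sans-serif symbols from the thread, then the four corner symbols
# and the Greek letter that were suggested as alternatives.
CODE_POINTS = [
    0x2141, 0x2142, 0x2143, 0x2144,   # turned/reversed sans-serif capitals
    0x231C, 0x231D, 0x231E, 0x231F,   # top/bottom left/right corners
    0x0393,                           # Greek capital letter gamma
]


def describe(code_point):
    """Return 'U+XXXX NAME [General_Category]' for one code point."""
    ch = chr(code_point)
    return "U+%04X %s [%s]" % (
        code_point,
        unicodedata.name(ch),
        unicodedata.category(ch),
    )


if __name__ == "__main__":
    for cp in CODE_POINTS:
        print(describe(cp))
```

Running it lists, among others, U+2142 TURNED SANS-SERIF CAPITAL L next to U+2143 REVERSED SANS-SERIF CAPITAL L, which makes it easy to compare how the letterlike symbols and the corner symbols are classified.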
That's all,

A./

From A.Schappo at lboro.ac.uk Wed Dec 30 10:16:09 2015
From: A.Schappo at lboro.ac.uk (Andre Schappo)
Date: Wed, 30 Dec 2015 16:16:09 +0000
Subject: Unicode in the Curriculum?
In-Reply-To: <567331D2.1000007@gmail.com>
References: <567331D2.1000007@gmail.com>
Message-ID: <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk>

A few months ago I asked a class of 140+ first year Computer Science programme and Joint programme students -

Who has heard of Unicode?

About 20% of the students raised their hands.

Then I quickly followed it with the question

...and who understands Unicode?

Every single student whose hand was raised put it down.

Some of these students were really experienced programmers, having programmed from an early age.

Many times over the years I have informally asked students studying in the UK (1st, 2nd, 3rd year undergrad, MSc, PhD, home students, international students) what they know of Unicode, and the vast majority of the time they know nothing or next to nothing.

The fundamental problem, as I see it, is that the teaching of Unicode is not on the curriculum of Schools, Colleges or Universities in the UK. IMHO, it should be!

Wherever and whenever I can, I incorporate Unicode in my teaching, e.g. recently I gave an introductory lecture on Regular Expressions, and in my examples I used Unicode text and patterns and not just ASCII.

One such example I used was ? /^?+??+$/

This regex is a reference to Hong Kong and the visiting giant floating rubber duck??

My regex examples also include Emoji and Egyptian Hieroglyphs??

Does anyone on this list teach Unicode at an Educational Establishment, School, or College or University?

André Schappo

From jsbien at mimuw.edu.pl Wed Dec 30 10:43:10 2015
From: jsbien at mimuw.edu.pl (Janusz S. Bien)
Date: Wed, 30 Dec 2015 17:43:10 +0100
Subject: Unicode in the Curriculum?
In-Reply-To: <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk>
References: <567331D2.1000007@gmail.com> <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk>
Message-ID: <20151230174310.18964kwjj3z86sgu@mail.mimuw.edu.pl>

Quote - Andre Schappo (Wed, 30 Dec 2015, 17:16:09):

> Does anyone on this list teach Unicode at an Educational
> Establishment, School, or College or University?

In a sense:

https://usosweb.uw.edu.pl/kontroler.php?_action=katalog2/przedmioty/pokazPrzedmiot&kod=3322-TUS-OG

Regards

Janusz

--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From addison at lab126.com Wed Dec 30 10:45:34 2015
From: addison at lab126.com (Phillips, Addison)
Date: Wed, 30 Dec 2015 16:45:34 +0000
Subject: Unicode in the Curriculum?
In-Reply-To: <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk>
References: <567331D2.1000007@gmail.com> <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk>
Message-ID: <6370df3a44124747bb580b2760e9d144@EX13D08UWB002.ant.amazon.com>

> A few months ago I asked a class of 140+ first year Computer Science
> programme and Joint programme students -
>
> Who has heard of Unicode?

I do a similar survey whenever I teach the remedial I18N and Unicode classes at Amazon. When I ask whether software developers *ever* received any formal education on internationalization or on character encodings, results are almost universally negative--more like zero percent than 20%. Which is one reason why we have to spend a significant amount of effort maintaining a training and education program.

I suspect I'm not alone in the industry in thinking that educational establishments could do a better job of preparing developers with at least the basics of Unicode, character encodings, and internationalization.
Addison Phillips Principal SDE, I18N Architect (Amazon) Chair (W3C I18N WG) Internationalization is not a feature. It is an architecture. > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Andre > Schappo > Sent: Wednesday, December 30, 2015 8:16 AM > To: Unicode Public > Subject: Unicode in the Curriculum? > > A few months ago I asked a class of 140+ first year Computer Science > programme and Joint programme students - > > Who has heard of Unicode? > > about 20% of the students raised their hands. > > then I quickly followed it with the question > > ?and who understands Unicode? > > Every single student whose hand was raised put it down. > > Some of these students were really experienced programmers, having > programmed from an early age. > > Many times over the years I have informally asked students studying in the > UK (1st, 2nd, 3rd year undergrad, MSc, PhD, home students, international > students) what they know of Unicode and the vast majority of the time they > know nothing or next to nothing. > > The fundamental problem, as I see it, is that the teaching of Unicode is not > on the curriculum of Schools, Colleges or Universities in the UK. IMHO, It > should be! > > I do wherever and whenever I can, incorporate Unicode in my teaching e.g. > recently I gave an introductory lecture on Regular Expressions and in my > examples I demonstrated, using Unicode text and patterns and not just ASCII. > > One such example I used was ? /^?+??+$/ > > This regex is a reference to Hongkong and the visiting giant floating rubber > duck?? > > My regex examples also include Emoji and Egyptian Hieroglyphs?? > > Does anyone on this list teach Unicode at an Educational Establishment, > School, or College or University? > > Andr? Schappo > From dzo at bisharat.net Wed Dec 30 13:30:55 2015 From: dzo at bisharat.net (Don Osborn) Date: Wed, 30 Dec 2015 14:30:55 -0500 Subject: Unicode in the Curriculum? 
In-Reply-To: <6370df3a44124747bb580b2760e9d144@EX13D08UWB002.ant.amazon.com> References: <567331D2.1000007@gmail.com> <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk> <6370df3a44124747bb580b2760e9d144@EX13D08UWB002.ant.amazon.com> Message-ID: <568430EF.2080700@bisharat.net> Good question and interesting responses so far. I've taken the opportunity to expand on it quickly in the hopes of eliciting some information from Africa. See http://niamey.blogspot.com/2015/12/unicode-in-african-computer-science.html Note mention of the Hausa and Fulfulde apps developed by computer science students at American University of Nigeria. It may be that Unicode figures in the curriculum there. Don Osborn From boldewyn at gmail.com Wed Dec 30 15:37:21 2015 From: boldewyn at gmail.com (Manuel Strehl) Date: Wed, 30 Dec 2015 22:37:21 +0100 Subject: Unicode in the Curriculum? 
In-Reply-To: <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk> References: <567331D2.1000007@gmail.com> <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk> Message-ID: <56844E91.4020907@gmail.com> Not technically a school, but I gave a Batman-themed high-level overview of Unicode at Munich's local JavaScript user group two years ago: http://www.manuel-strehl.de/publications/holy-batman/presentation It was well received, especially for its lighter tone on this supposedly dry subject and for the real-world problems JS developers face, which I addressed in the talk. This is the gist of my response: to successfully introduce students to the concepts behind Unicode, it worked for me to start with problems and WTF-moments and work from there. I also taught university courses some years ago, teaching XML to physics undergrads, where a similar tactic worked quite well... Cheers, Manuel From chandrakantd at cdac.in Wed Dec 30 22:45:10 2015 From: chandrakantd at cdac.in (chandrakantd at cdac.in) Date: Thu, 31 Dec 2015 10:15:10 +0530 Subject: Unicode in the Curriculum? In-Reply-To: <56844E91.4020907@gmail.com> References: <567331D2.1000007@gmail.com> <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk> <56844E91.4020907@gmail.com> Message-ID: I found this link, which I read a long time ago: https://www.cs.tut.fi/~jkorpela/chars.html Regards, Chandrakant Dhutadmal From A.Schappo at lboro.ac.uk Thu Dec 31 05:08:06 2015 From: A.Schappo at lboro.ac.uk (Andre Schappo) Date: Thu, 31 Dec 2015 11:08:06 +0000 Subject: Unicode in the Curriculum? In-Reply-To: <6370df3a44124747bb580b2760e9d144@EX13D08UWB002.ant.amazon.com> References: <567331D2.1000007@gmail.com> <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk> <6370df3a44124747bb580b2760e9d144@EX13D08UWB002.ant.amazon.com> Message-ID: On 30 Dec 2015, at 16:45, Phillips, Addison wrote: >> A few months ago I asked a class of 140+ first year Computer Science >> programme and Joint programme students - >> >> Who has heard of Unicode? > > I do a similar survey whenever I teach the remedial I18N and Unicode classes at Amazon. When I ask if software developers *ever* received any formal education on internationalization or on character encodings, results are almost universally negative--more like zero percent than 20%. Which is one reason why we have to spend a significant amount of effort maintaining a training and education program. > > I suspect I'm not alone in the industry in thinking that educational establishments could do a better job of preparing developers with at least the basics of Unicode, character encodings, and internationalization. > > Addison Phillips > Principal SDE, I18N Architect (Amazon) > Chair (W3C I18N WG) > > Internationalization is not a feature. > It is an architecture. I have been hitting my head against the Academic Brick Wall for years WRT getting IT i18n and Unicode on the curriculum and I am losing. 
I did teach a final year elective module on IT i18n but a few months ago my University dropped it. I am continually puzzled by the lack of interest University Computer Science departments have in i18n. I appear to be a solitary UK University Computer Science voice when it comes to i18n. …and I think this is where Industry comes in. I think that Industry should be lobbying/pressuring University Computer Science departments to get i18n and Unicode on the curriculum. If industry does not speak up then I cannot see anything changing in Academia. Academia will continue teaching text processing using ASCII only. André Schappo From jcb+unicode at inf.ed.ac.uk Thu Dec 31 12:58:44 2015 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Thu, 31 Dec 2015 18:58:44 +0000 (GMT) Subject: Unicode in the Curriculum? References: <567331D2.1000007@gmail.com> <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk> <6370df3a44124747bb580b2760e9d144@EX13D08UWB002.ant.amazon.com> Message-ID: On 2015-12-31, Andre Schappo wrote: > I have been hitting my head against the Academic Brick Wall for > years WRT getting IT i18n and Unicode on the curriculum and I am > losing. I did teach a final year elective module on IT i18n but a > few months ago my University dropped it. I am continually puzzled by > the lack of interest University Computer Science departments have in > i18n. I appear to be a solitary UK University Computer Science voice > when it comes to i18n. Well, I'd say that it's not the business of Computer Science degrees to teach specific technical skills. It's our business to help people learn about the fundamentals of the subject, so that they can acquire any specific skill on demand, and use that skill competently. In those areas where we do teach specific skills (e.g. machine learning techniques) we teach those that have some intellectual content to them. (This is why we don't teach programming languages as such - we teach a programming language as a means of learning a programming paradigm.) 
In my experience so far, using Unicode and doing i18n is not very interesting (killingly boring, actually) from a purely CS technical point of view, unless you happen to be one of the small minority who enjoys script and font layout issues - the interesting bits of doing i18n are in producing linguistically and culturally appropriate messages, and that's where one should bring in experts, not expect typical software developers to be able to do it. If you still have the materials for your course, it would be interesting to see how you managed to get an interesting (and examinable!) course out of i18n. I do in fact mention Unicode and i18n in my introductory programming course (which is not for CS students), but all I say is "you should know it's there, and if you become a competent programmer, then you can read the manuals and tutorials to learn what you need". -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.