From moyogo at gmail.com Thu Sep 4 01:37:48 2014 From: moyogo at gmail.com (Denis Jacquerye) Date: Thu, 4 Sep 2014 07:37:48 +0100 Subject: Nosy: Emoji Mr Potato Head font Message-ID: http://rrry.me/nosy/ with a live demo http://rrry.me/nosydemo/ -- Denis Moyogo Jacquerye -------------- next part -------------- An HTML attachment was scrubbed... URL: From olopierpa at gmail.com Thu Sep 4 08:15:46 2014 From: olopierpa at gmail.com (Pierpaolo Bernardi) Date: Thu, 4 Sep 2014 15:15:46 +0200 Subject: Nosy: Emoji Mr Potato Head font In-Reply-To: References: Message-ID: Evilly genial! It's not clear how to get the font, though. On Thu, Sep 4, 2014 at 8:37 AM, Denis Jacquerye wrote: > http://rrry.me/nosy/ with a live demo http://rrry.me/nosydemo/ > > -- > Denis Moyogo Jacquerye > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From magnus at bodin.org Thu Sep 4 09:11:09 2014 From: magnus at bodin.org (=?UTF-8?Q?Magnus_Bodin_=E2=98=80?=) Date: Thu, 4 Sep 2014 16:11:09 +0200 Subject: Nosy: Emoji Mr Potato Head font In-Reply-To: References: Message-ID: There is a red button stating "DOWNLOAD NOSY" with white letters. On Thu, Sep 4, 2014 at 3:15 PM, Pierpaolo Bernardi wrote: > Evilly genial! > > It's not clear how to get the font, though. 
> > > > On Thu, Sep 4, 2014 at 8:37 AM, Denis Jacquerye wrote: >> http://rrry.me/nosy/ with a live demo http://rrry.me/nosydemo/ >> >> -- >> Denis Moyogo Jacquerye >> >> >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode >> > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode

From olopierpa at gmail.com Thu Sep 4 09:19:21 2014 From: olopierpa at gmail.com (Pierpaolo Bernardi) Date: Thu, 4 Sep 2014 16:19:21 +0200 Subject: Nosy: Emoji Mr Potato Head font In-Reply-To: References: Message-ID: On Thu, Sep 4, 2014 at 4:11 PM, Magnus Bodin ☀ wrote: > There is a red button stating "DOWNLOAD NOSY" with white letters. Yes, and it opens a popup which says it costs 0$ and asks for a credit card.

From beckiergb at gmail.com Thu Sep 4 11:18:47 2014 From: beckiergb at gmail.com (Rebecca Bettencourt) Date: Thu, 4 Sep 2014 09:18:47 -0700 Subject: Nosy: Emoji Mr Potato Head font In-Reply-To: References: Message-ID: On Thu, Sep 4, 2014 at 7:19 AM, Pierpaolo Bernardi wrote: > On Thu, Sep 4, 2014 at 4:11 PM, Magnus Bodin ☀ wrote: > > There is a red button stating "DOWNLOAD NOSY" with white letters. > > Yes, and it opens a popup which says it costs 0$ and asks for a credit > card. With no way to verify an https connection.

From magnus at bodin.org Thu Sep 4 11:29:01 2014 From: magnus at bodin.org (=?utf-8?Q?Magnus_Bodin_=E2=98=80?=) Date: Thu, 4 Sep 2014 18:29:01 +0200 Subject: Nosy: Emoji Mr Potato Head font In-Reply-To: References: Message-ID: <2B7D75AA-7028-4B78-940A-22922A79B6FF@bodin.org> Gift plead only. No fear. If you fill in zero in the dollar box, the credit card question politely disappears. > 4 sep 2014 kl. 16:19 skrev Pierpaolo Bernardi : > >> On Thu, Sep 4, 2014 at 4:11 PM, Magnus Bodin ☀ 
wrote: >> There is a red button stating "DOWNLOAD NOSY" with white letters. > > Yes, and it opens a popup which says it costs 0$ and asks for a credit card.

From mark at macchiato.com Thu Sep 4 14:32:59 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 4 Sep 2014 21:32:59 +0200 Subject: CLDR 26 final beta Message-ID: The CLDR version 26 data and specification are available for a final beta review. The modifications to the spec can be reviewed at: http://www.unicode.org/reports/tr35/proposed.html#Modifications The data is available at: http://unicode.org/Public/cldr/26/ Beta data charts are available at http://www.unicode.org/repos/cldr-aux/charts/26/index.html To file feedback, go to http://unicode.org/cldr/trac/newticket. At this point, we are particularly interested in "showstopper bugs"; others are absolutely worth filing, but would not be handled in this release. The final review is only open for a short period, until the morning of Wed, Sept 10. The data and spec may have further (small) changes during the course of the beta: a few outlying bugs are still being worked on, and the JSON data is not yet in the beta.

From boldewyn at gmail.com Fri Sep 5 06:54:31 2014 From: boldewyn at gmail.com (Manuel Strehl) Date: Fri, 5 Sep 2014 13:54:31 +0200 Subject: Nosy: Emoji Mr Potato Head font In-Reply-To: <2B7D75AA-7028-4B78-940A-22922A79B6FF@bodin.org> References: <2B7D75AA-7028-4B78-940A-22922A79B6FF@bodin.org> Message-ID: Or you could simply download Grumpy Emoji for the only face ever needed in Unicode: http://codepoints.github.io/grumpy/ SCNR. Apart from that, I think the "facetype" idea is neat. Interesting to see how OpenType features are exploited. Manuel 2014-09-04 18:29 GMT+02:00 Magnus Bodin ☀ : > Gift plead only. No fear. > > If you fill in zero in the dollar box, the credit card question politely > disappears. > > > > 4 sep 2014 kl. 
16:19 skrev Pierpaolo Bernardi : > > >> On Thu, Sep 4, 2014 at 4:11 PM, Magnus Bodin ☀ > wrote: > >> There is a red button stating "DOWNLOAD NOSY" with white letters. > > > > Yes, and it opens a popup which says it costs 0$ and asks for a credit > card. > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode >

From m-prosser at uchicago.edu Mon Sep 8 11:42:19 2014 From: m-prosser at uchicago.edu (Miller Prosser) Date: Mon, 8 Sep 2014 11:42:19 -0500 Subject: Transliterate Ugaritic aleph letters Message-ID: <01b101cfcb83$db5d1aa0$92174fe0$@uchicago.edu> Dear fellow Unicode mail list members, I'm writing to ask for help regarding the transliteration of three letters in the Ugaritic language. Have I overlooked the Unicode symbols for transliterating the letters below? If so, please point me in the direction of the proper Unicode code points. If not, I shall begin the process to propose that these three codes be added. While Unicode has adopted a block of code points for the 30 cuneiform Ugaritic letters (http://www.unicode.org/charts/PDF/U10380.pdf), it seems that we have not yet defined code points for the transliteration of the three unique Ugaritic aleph signs. The three cuneiform signs in question are:
10380 Ugaritic letter alpa
1039B Ugaritic letter I
1039C Ugaritic letter U
Standard transliteration of these letters would combine U+02BE over a, i, and u respectively. Any information would be greatly appreciated. Best, Miller Prosser, Ph.D. Research Database Consultant m-prosser at uchicago.edu OCHRE Data Service -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image001.jpg Type: image/jpeg Size: 4617 bytes Desc: not available URL:

From everson at evertype.com Mon Sep 8 12:33:46 2014 From: everson at evertype.com (Michael Everson) Date: Mon, 8 Sep 2014 18:33:46 +0100 Subject: Transliterate Ugaritic aleph letters In-Reply-To: <01b101cfcb83$db5d1aa0$92174fe0$@uchicago.edu> References: <01b101cfcb83$db5d1aa0$92174fe0$@uchicago.edu> Message-ID: On 8 Sep 2014, at 17:42, Miller Prosser wrote: > 10380 Ugaritic letter alpa > 1039B Ugaritic letter I > 1039C Ugaritic letter U > > Standard transliteration of these letters would combine U+02BE over a, i, and u respectively. You can use U+0357: just write a͗ i͗ u͗ Michael Everson * http://www.evertype.com/

From roozbeh at unicode.org Mon Sep 8 12:56:18 2014 From: roozbeh at unicode.org (Roozbeh Pournader) Date: Mon, 8 Sep 2014 10:56:18 -0700 Subject: Transliterate Ugaritic aleph letters In-Reply-To: References: <01b101cfcb83$db5d1aa0$92174fe0$@uchicago.edu> Message-ID: What Michael said, assuming the i loses its dot when the half-ring is displayed over it. If it keeps its dot, you should use instead. On Mon, Sep 8, 2014 at 10:33 AM, Michael Everson wrote: > On 8 Sep 2014, at 17:42, Miller Prosser wrote: > > > 10380 Ugaritic letter alpa > > 1039B Ugaritic letter I > > 1039C Ugaritic letter U > > > > Standard transliteration of these letters would combine U+02BE over a, > i, and u respectively. > > You can use U+0357: just write a͗ i͗ u͗ > > Michael Everson * http://www.evertype.com/ > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From m-prosser at uchicago.edu Mon Sep 8 13:26:00 2014 From: m-prosser at uchicago.edu (Miller Prosser) Date: Mon, 8 Sep 2014 13:26:00 -0500 Subject: Transliterate Ugaritic aleph letters In-Reply-To: References: <01b101cfcb83$db5d1aa0$92174fe0$@uchicago.edu> Message-ID: <007501cfcb92$579a7dc0$06cf7940$@uchicago.edu> Thank you very much for this solution. It works perfectly for the transliteration of the Ugaritic aleph signs. I see now that the answer is similar to that given for the transliteration of Egyptian "yod." http://www.unicode.org/faq/char_combmark.html#20 Much appreciated! Miller Prosser -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Michael Everson Sent: Monday, September 8, 2014 12:34 PM To: unicode Unicode Discussion Subject: Re: Transliterate Ugaritic aleph letters On 8 Sep 2014, at 17:42, Miller Prosser wrote: > 10380 Ugaritic letter alpa > 1039B Ugaritic letter I > 1039C Ugaritic letter U > > Standard transliteration of these letters would combine U+02BE over a, i, and u respectively. You can use U+0357: just write a͗ i͗ u͗ Michael Everson * http://www.evertype.com/ _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode

From john.armstrong.rn3 at gmail.com Mon Sep 8 14:03:10 2014 From: john.armstrong.rn3 at gmail.com (John Armstrong) Date: Mon, 8 Sep 2014 15:03:10 -0400 Subject: Discrepancies between kTotalStrokes and kRSUnicode in the Unihan database Message-ID: [Apologies if this issue has already been resolved. I searched the Unicode.org site for discussions but I only found a document dating from 2003 which touches on the issue: andrewcwest at alumni.princeton.edu RE: Unicode 4.0.1 Beta Review 1. kRSUnicode Field ( http://www.unicode.org/L2/L2003/03311-errata4.txt)] A CJK Han character is conventionally viewed as consisting of a radical plus a residual part or "phonetic". 
(For a character which is a radical the residual part is nothing. The term "phonetic", indicating that the residual part of the character points to the pronunciation of the character, properly only applies to 90-95% of characters, but it applies in the examples below.) The two parts of a character each consist of a specific arrangement of strokes, and together account for all the strokes in the character. In particular, the number of strokes in the radical portion plus the number of strokes in the residual portion equals the total number of strokes in the character. The stroke count of a radical combined with a residual part is not always the same as the stroke count of the radical appearing on its own, but may be slightly or significantly less due to a minor or major abbreviation. (A radical may have several forms which are used in different positions of the whole character, say left or right side vs. top or bottom. These variants may have the same or different stroke counts.) Because of abbreviated variants the total stroke count for a character cannot always be gotten by adding the stroke count of the radical in its standalone form to the stroke count of the residual portion. However, it can always be gotten by subtracting the stroke count of the residual portion from the total stroke count of the character. The Unihan database provides the exact data needed to make this calculation:

kTotalStrokes: stroke count for full character
kRSUnicode: radical number and residual stroke count (in format [']., where optional ' (apostrophe) in the latter indicates a widely used abbreviation for the radical with a significantly different appearance and a significantly (-3 or more) lower stroke count.) (But not all such forms are so marked - examples are forms with radical numbers 140, 162, 163, 170. It may be that the marker is limited to abbreviations used in Simplified as opposed to Traditional Chinese characters.) 
The formula is simply:

radStrokes(K) = kTotalStrokes(K) - kRSUnicode(K).resStrokes

This formula generally gives correct results, but not always. In fact, according to a reasonably accurate heuristic test I ran, it produces incorrect (or at least "suspicious") results in 2236 of the total 74911, or 3%, of characters in the database that have both kTotalStrokes and kRSUnicode data. Moreover the rate is significantly higher for the characters in the BMP than in the SIP - in fact it is really negligible in the latter. Most importantly it is 8.2% in the block containing all the most widely used characters, the base CJK Unified Ideographs block. The numbers for all the blocks are as follows:

RANGE   TOTAL*  SUSPICIOUS   PCT
BMP
 BASE    20941        1727   8.2
 CMP       302          29   9.6
 CMPS        4           0   0.0
 EXTA     6582         469   7.1
SIP
 EXTB    42711           6   0.0
 EXTC     4149           5   0.1
 EXTD      222           0   0.0
TOTAL    74911        2236   3.0
*with both kTotalStrokes and kRSUnicode

Some of the suspicious cases are actually valid, but I believe that the vast majority are truly incorrect, and that the rate of incorrect radical stroke counts implied by kTotalStrokes and kRSUnicode is at least 6-7% for the base CJK Unified Ideographs block. Here are a couple of examples where the stroke counts are fairly small and the radicals and the residual parts ("phonetics") occur widely. The first illustrates the situation where the radical stroke count implied by kTotalStrokes and kRSUnicode is greater than the correct value, and the second that where the implied radical stroke count is less than the correct value. (The second situation is much more common than the first, accounting for at least 80% of the "suspicious" items.)

Example 1: character U+4E9B 'a few'
kTotalStrokes = 8
kRSUnicode = 7.5
radical number = 7 'two'
residual strokes = 5
implied radical stroke count = 3 (8 - 5)
correct radical stroke count = 2
diff = 1 (implied count one too high)

The residual portion of the character occurs as an independent character U+6B64 'this, these'. 
Its kTotalStrokes is 6 and its kRSUnicode = 77.2. The radical #77 'stop' has 4 strokes in its standalone form, so the residual stroke count of 2 is consistent with a total count of 6. In the main character U+4E9B, therefore, the residual part has effectively lost a stroke in composition, being reduced from 6 to 5. (This actually seems to be the norm with this phonetic. Other examples are U+4F4C, U+5470, U+5472, U+59D5, U+67F4, U+75B5, U+7689, U+7725 and I'm sure more.)

Example 2: character U+5040 'distinguished person; English person'
kTotalStrokes = 10
kRSUnicode = 9.9
radical number = 9 'person'
residual strokes = 9
implied radical stroke count = 1 (10 - 9)
correct radical stroke count = 2
diff = -1 (implied count one too low)

Again the residual portion occurs as an independent character U+82F1 'distinguished; English'. Its kTotalStrokes is 8 and its kRSUnicode is 140.5. Radical #140 'grass' has 6 strokes in its standalone form but as the radical component of a larger character it is always abbreviated to a form with 3 strokes. That is the case here. Thus the residual count of 5 in the kRSUnicode of U+82F1 is consistent with the kTotalStrokes of 8 for the character. This count of 8 agrees with the residual count for the full character U+5040 implied by its 10 kTotalStrokes, but is one less than the 9 residual strokes specified in the kRSUnicode. In both examples the discrepancy between kTotalStrokes and kRSUnicode arises out of different residual stroke counts and has nothing to do with the radical, be it its identity, the variant used, or the stroke count. While there are some exceptions, this is clearly the normal situation. It also makes sense. Most disagreements on stroke counts have to do with the residual as opposed to the radical portion of the characters. 
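The arithmetic in the two examples can be checked mechanically. A minimal sketch in Python (the author mentions a Python program he used; this is not that program - the field values, and the standalone stroke counts for radicals 7 'two' and 9 'person', are taken only from the examples in this message):

```python
# Standalone stroke counts for the two radicals used in the examples.
# A real check would need all 214 KangXi radicals and their variants.
STANDALONE_STROKES = {7: 2, 9: 2}

def implied_radical_strokes(total_strokes, rs_unicode):
    """Radical stroke count implied by kTotalStrokes minus the residual
    stroke count from the (first) kRSUnicode value."""
    radical, residual = rs_unicode.split(".")
    radical = int(radical.rstrip("'"))  # apostrophe marks a simplified radical
    return radical, total_strokes - int(residual)

def is_suspicious(total_strokes, rs_unicode):
    """Flag entries whose implied radical stroke count differs from the
    radical's standalone count. This over-flags legitimately abbreviated
    radical forms, which is presumably why the author's own test is
    described as heuristic."""
    radical, implied = implied_radical_strokes(total_strokes, rs_unicode)
    return implied != STANDALONE_STROKES[radical]

# Example 1: U+4E9B, kTotalStrokes = 8, kRSUnicode = 7.5
print(implied_radical_strokes(8, "7.5"))    # (7, 3) - one too high
# Example 2: U+5040, kTotalStrokes = 10, kRSUnicode = 9.9
print(implied_radical_strokes(10, "9.9"))   # (9, 1) - one too low

print(is_suspicious(8, "7.5"), is_suspicious(10, "9.9"))  # True True
```

Both examples come out suspicious under this check, matching the diffs of +1 and -1 worked out above.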
(Questions of radical counts usually involve cases where the radical has more than one form in a given context, for example rad #140 'grass', which has 6 strokes in its full form but variants in the top position context with 3 and 4 strokes. Less commonly, they involve cases where the radical is fused with the residual portion or even lost altogether as part of a historical simplification.) As mentioned above, discrepancies of the type illustrated by the second example (implied radical strokes lower than correct) are much more common than discrepancies of the type illustrated by the first example (implied radical strokes higher than correct). To the extent the discrepancies involve the residual stroke counts and have nothing to do with the radical, the situation can be reframed in terms of residual stroke counts as:

Dominant pattern: the residual stroke count specified in kRSUnicode is greater than that implied by kTotalStrokes (9 vs. 8 strokes in Ex. 2)
Minor pattern: the residual stroke count specified in kRSUnicode is less than that implied by kTotalStrokes (5 vs. 6 strokes in Ex. 1)

The results of the heuristic test indicate that the great majority of cases of both patterns involve differences in residual stroke counts of one or occasionally two strokes. I believe this is in line with the variations in stroke counting that are observed in actual practice (dictionaries etc.). Still, the question needs to be asked: do the discrepancies (which occur in 5% of all characters in the base Unicode character set) simply represent different, but more or less equally valid, ways of counting strokes, or are they errors that need to be corrected or at least addressed in some way? In my view the answer depends on a more specific question: are kTotalStrokes and kRSUnicode intended to be consistent? That is, regardless of what exact count is chosen for a given character, should both terms reflect the same count? 
Here is how the two fields are described in the document Proposed Update to Unicode Standard Annex #38, Unicode 6.0.0 draft 1 ( http://www.unicode.org/reports/tr38/tr38-8.html):

kTotalStrokes: "The total number of strokes in the character (including the radical). _This value is for the character as drawn in the Unicode charts_."

kRSUnicode: "A standard radical/stroke count for this character in the form "radical.additional strokes". The radical is indicated by a number in the range (1..214) inclusive. An apostrophe (') after the radical indicates a simplified version of the given radical. The "additional strokes" value is the residual stroke-count, the count of all strokes remaining after eliminating all strokes associated with the radical. This field is also used for additional radical-stroke indices where either a character may be reasonably classified under more than one radical, or alternate stroke count algorithms may provide different stroke counts. _The first value is intended to reflect the same radical as the kRSKangXi field and the stroke count of the glyph used to print the character within the Unicode Standard_."

When I talk about kRSUnicode I always mean the first value in the list. Similarly my heuristic test always uses the first value. I mention this because of the way the last paragraph of the description refers specifically to this value. Both descriptions tie the specific values of the two fields to the specific glyphs used to draw/print the character in the Unicode charts (kTotalStrokes "character as drawn in the Unicode charts", kRSUnicode "the glyph used to print the character within the Unicode Standard"). Given this, the answer to the question of whether the two fields should be consistent certainly seems to be yes. And this means that the cases where they are not, i.e. where there are discrepancies, are errors. 
If it's conceded that the discrepancies do reflect errors, then I think it also needs to be conceded that they need to be addressed in some way. The most straightforward thing would be to go through all the cases and change either kTotalStrokes or kRSUnicode to (a) be consistent and (b) offer values appropriate to the specific glyph used in the standard. Given that kRSUnicode is used for ordering characters in the block (the radical number being used to determine what radical it is listed after and the residual count being used to determine where after the radical it appears - except for ties, which are ordered arbitrarily), while to the best of my knowledge kTotalStrokes is not used for anything within the standard, the most practical thing would be to keep the existing kRSUnicode value wherever it is not obviously incorrect and adjust the kTotalStrokes to be consistent with it. But this involves changing a lot of data - including data for the most widely used characters, those in the base CJK Unified Ideograph block - and may break systems that use the existing values. An alternative which I would suggest is to create a new field which could be called kRSUnicode2 or something similar and would have not two but three subfields (not counting apostrophe) [']'.. where the first and third subfields are the same (same meaning, same values, barring clear errors) as in kRSUnicode and the added second subfield is the number of strokes in the radical as it appears in the character. This new field would contain all the stroke count information that's needed for a character, including not only the residual strokes but also the radical strokes and, via calculation (adding the two values), the total strokes. The last can be compared with kTotalStrokes, but does not depend on it, and may be different. 
(Note that the presence of apostrophe would become largely predictable from a comparison of the radical stroke count in the first subfield with the count for the radical as a standalone character. In fact it would only be necessary to retain it if its purpose was not simply to indicate significantly abbreviated radicals in general but specifically to indicate forms that are used in Chinese Simplified but not the corresponding Traditional ones.) I see the following advantages to this approach:

(1) No constraints are placed on existing kTotalStrokes or kRSUnicode values - they can be left as is or changed at any point without implications for the new kRSUnicode2 values
(2) No systems that use the existing kTotalStrokes or kRSUnicode fields will break or be affected in any way (though they could be changed to use the self-standing kRSUnicode2 field with possibly more satisfactory results)
(3) All stroke information for a character is contained in a single field, kRSUnicode2, and can't be inconsistent (though it can be wrong)
(4) Stroke counting differences between fields can be directly found and quantified (particularly, by comparing the partial stroke information in kTotalStrokes and/or kRSUnicode to the full information in kRSUnicode2)
(5) An initial version of the full set of the new kRSUnicode2 field values could be generated algorithmically from kTotalStrokes and kRSUnicode and then revised by human inspection focusing on the proportionally small amount (8% in the base block, 3% overall) of "suspicious" 
cases detected by a heuristic procedure (which I?m sure could be made more accurate than the one I used, for example by bringing in more existing information sources) The main disadvantages I see are: (1) Confusion arising from the overlap between the old and new fields (2) The work involved (though anything other than dismissing or postponing the issue is going to involve work) If there is interest I will be glad to share the results of my heuristic test and the program (python) I used to produce them. John Armstrong Cambridge MA -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.armstrong.rn3 at gmail.com Tue Sep 9 09:02:26 2014 From: john.armstrong.rn3 at gmail.com (John Armstrong) Date: Tue, 9 Sep 2014 10:02:26 -0400 Subject: Discrepancies between kTotalStrokes and kRSUnicode in the Unihan database - repost all ascii Message-ID: [Apologies if this issue has already been resolved. I searched the Unicode.org site for discussions but I only found document dating from 2003 which touches on the issue: andrewcwest at alumni.princeton.edu RE: Unicode 4.0.1 Beta Review 1. kRSUnicode Field ( http://www.unicode.org/L2/L2003/03311-errata4.txt)] A CJK Han character is conventionally viewed as consisting of a radical plus a residual part or "phonetic". (For a character which is a radical the residual part is nothing. The term "phonetic", indicating that the residual part of the character points the pronunciation of the character, properly only applies to 90-95% of characters, but it applies in the examples below. ) The two parts of a character each consist of a specific arrange of strokes, and together account for all the strokes in the character. In particular, the number of strokes in the radical portion plus the number of strokes in the residual portion equals the total number of strokes in the character. 
The stroke count of a radical combined with a residual part is not always the same as the stroke count of the radical appearing on its own, but may be slightly or significantly less due to a minor or major abbreviation. (A radical may have several forms which are used in different positions of the whole character, say left or right side vs. top or bottom. These variants may have the same or different stroke counts.) Because of abbreviated variants the total stroke count for a character cannot be always be gotten by adding the stroke count of the radical in its standalone form to the stoke count of the residual portion. However, it can always be gotten by subtracting the stroke count of residual portion from the total stroke count of the character. The Unihan database provides the exact data needed to make this calculation: kTotalStrokes: stroke count for full character kRSUnicode: radical number and residual stroke count (in format [']., where optional ' (apostrophe) in the latter indicates a widely used abbreviation for the radical with a significantly different appearance and a significantly (-3 or more) lower stroke. (But not all such forms are so marked - examples are forms with radical numbers 140, 162,163,170. It may be that the marker is limited to abbreviations uses in Simplified as opposed to Traditional Chinese characters.) The formula is simply: radStrokes(K) = kTotalStrokes(K) - kRSUnicode(K).resStrokes This formula generally gives correct results, but not always. In fact, according to reasonably accurate heuristic test I ran it produces incorrect (or at least "suspicious") results in 2236 of the total 74911, or 3%, of characters in the database that have both kTotalStrokes and kRSUnicode data. Moreover the rate is significantly higher for the characters in the BMP than in the SIP - in fact it is really negligible in the latter. 
Most importantly it is 8.2% in the block containing all the most widely used characters, the base CJK Unified Ideographs block. The numbers for all the blocks are as follows: RANGE TOTAL* SUSPICIOUS PCT BMP BASE 20941 1727 8.2 CMP 302 29 9.6 CMPS 4 0 0.0 EXTA 6582 469 7.1 SIP EXTB 42711 6 0.0 EXTC 4149 5 0.1 EXTD 222 0 0.0 TOTAL 74911 2236 3.0 *with both kTotalStrokes and kRSUnicode Some of the suspicious cases are actually valid, but I believe that vast majority are truly incorrect, and that the rate of incorrect radical stroke counts implied by kTotalStrokes and kRSUnicode is at least 6-7% for the base CJK Unified Ideographs block. Here are a couple examples where the stroke counts are fairly small and the radicals and the residual parts ("phonetics") widely occurring. The first illustrates the situation where the radical stroke count implied by kTotalStrokes and kRSUnicode is greater than the correct value, and the second that where the implied radical stroke count is less than the correct value. (The second situation is much more common than the first, accounting for at least 80% of the "suspicious" items. Example 1: character U+4E9B 'a few' kTotalStrokes = 8 kSRUnicode = 7.5 radical number = 7 'two' residual strokes = 5 implied radical stroke count = 3 (8 - 5) correct radical stroke count = 2 diff = 1 (implied count one too high) The residual portion of the character occurs as an independent character U+6B64 'this, these'. Its kTotalStrokes is 6 and its kRSUnicode = 77.2. The radical #77 'stop' has 2 strokes in its standalone form, so the residual stroke count of 2 is consistent with a total count of 6. In the main character U+4E9B, therefore, the residual part has effectively lost a stroke in composition, being reduced from 6 to 5. (This actually seems to be the norm with this phonetic. Other examples are U+4F4C, U+5470, U+5472, U+59D5, U+67F4, U+75B5, U+7689, U+7725 and I'm sure more.) 
Example 2: is character U+5040 'distinguished person; English person' kTotalStrokes = 10 kSRUnicode = 9.9 radical number = 9 'person' residual strokes = 9 implied radical stroke count = 1 (10 - 9) correct radical stroke count = 2 diff = -1 (implied count one too low) Again the residual portion occurs as an independent character U+82F1 'distinguished; English'. Its kTotalStrokes is 8 and its kRSUnicode is 140.5. Radical #140 'grass' has 6 strokes in its standalone form but as the radical component of a larger character is always abbreviated to a form with 3 strokes. That is the case here. Thus residual count of 5 in the kRSUnicode of U+82F1 is consistent with the kTotalStrokes of 8 for the character. This count of 8 agrees with the residual count for the full character U+5040 implied by its 10 kTotalStrokes, but is one less than the 9 residual strokes specified in the kRSUnicode. In both examples the discrepancy between kTotalStrokes and KRSUnicode arise out of different residual stroke counts and have nothing to do with the radical, be it its identity, the variant used, or the stroke count. While there are some exceptions, this is clearly the normal situation. It also makes sense. Most disagreements on stroke counts have to do with the residual as opposed to the radical portion of the characters. (Question of radical counts usually involve cases where the radical has more than one form in a given context, for example rad #140 'grass', which has 6 strokes in its full form but variants in the top position context with 3 and 4 strokes. Less commonly, they involve cases where the radical is fused with the residual portion or even lost altogether as part of a historical simplification.) As mentioned above, discrepancies of the type illustrated by the second example (implied radical stokes higher than correct) are much more common than discrepancies of the type illustrated by the first example (implied radical stokes less than correct). 
To the extent the discrepancies involve the residual stroke counts and have nothing to do with the radical, the situation can be reframed in terms of residual stroke counts as: Dominant pattern: the residual stroke count specified in kRSUnicode is greater than that implied by kTotalStrokes (5 vs. 6 strokes in Ex. 2) Minor pattern: the residual stroke count specified in kRSUnicode is less than that implied by kTotalStrokes (9 vs. 8 strokes in Ex. 1) The results of the heuristic test indicate that the great majority of cases of both patterns involve differences in residual stroke counts of one or occasionally two strokes. I believe this is in line with the variations in stroke counting that are observed in actual practice (dictionaries etc.). Still, the question needs to be asked, do the discrepancies (which occur in 5% of all characters in the base Unicode character set) simply represent different, but more or less equally valid, ways of counting strokes, or are they errors that need to be corrected or at least addressed in some way? In my view the answer depends on a more specific question: are kTotalStrokes and KRSUnicode intended to be consistent? That is, regardless of what exact count is chosen for a given character, should both terms reflect the same count? Here is how the two fields are described in the document Proposed Update to Unicode Standard Annex #38 Unicode 6.0.0 draft 1 ( http://www.unicode.org/reports/tr38/tr38-8.html): kTotalStrokes: "The total number of strokes in the character (including the radical). _This value is for the character as drawn in the Unicode charts_." kRSUnicode: "A standard radical/stroke count for this character in the form "radical.additional strokes". The radical is indicated by a number in the range (1..214) inclusive. An apostrophe (') after the radical indicates a simplified version of the given radical. 
The "additional strokes" value is the residual stroke-count, the count of all strokes remaining after eliminating all strokes associated with the radical. This field is also used for additional radical-stroke indices where either a character may be reasonably classified under more than one radical, or alternate stroke count algorithms may provide different stroke counts. _The first value is intended to reflect the same radical as the kRSKangXi field and the stroke count of the glyph used to print the character within the Unicode Standard_. When I talk about kRSUnicode I always mean the first value in the list. Similarly my heuristic test always uses the first value. I mention this because of the way the last paragraph of the description refers specifically to this value. Both descriptions tie the specific values of the two fields to the specific glyphs used to draw/print the character in the Unicode charts (kTotalStrokes "character as drawn in the Unicode charts", kRSUnicode "the glyph used to print the character within the Unicode Standard"). Given this, the answer to the question of whether the two fields should be consistent certainly seems to be yes. And this means that the cases where they are not, i.e. where there are discrepancies, are errors. If it's conceded that the discrepancies do reflect errors, then I think it also needs to be conceded that they need to be addressed in some way. The most straightforward thing would be to go through all the cases and change either kTotalStrokes or kRSUnicode to (a) be consistent and (b) offer values appropriate to the specific glyph used in the standard. 
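The kind of heuristic test described here can be sketched as follows. This is my own minimal reconstruction, not the author's actual tool: the table of plausible radical stroke counts covers only the radicals mentioned in this thread, and a real check would need all 214 radicals plus their positional variants.

```python
# Sketch of a kTotalStrokes/kRSUnicode consistency check (illustrative only).
# PLAUSIBLE_RADICAL_STROKES maps a radical number to the stroke counts of its
# standalone form and its common in-character variants; the values here are
# limited to the radicals discussed in this thread.
PLAUSIBLE_RADICAL_STROKES = {
    7: {2},          # 'two'
    9: {2},          # 'person'
    77: {4},         # 'stop'
    140: {3, 4, 6},  # 'grass': 3- and 4-stroke top variants, 6-stroke standalone
}

def first_rs(krsunicode):
    """Parse the first value of a kRSUnicode field, e.g. "9.9" or "140'.5"."""
    radical, residual = krsunicode.split()[0].split(".")
    return int(radical.rstrip("'")), int(residual)

def suspicious(ktotal, krsunicode):
    """Flag entries whose implied radical stroke count is implausible."""
    radical, residual = first_rs(krsunicode)
    implied = ktotal - residual  # radical strokes implied by the two fields
    plausible = PLAUSIBLE_RADICAL_STROKES.get(radical)
    return plausible is not None and implied not in plausible

print(suspicious(10, "9.9"))   # U+5040: implied radical count 1 -> True
print(suspicious(8, "140.5"))  # U+82F1: implied radical count 3 -> False
```

A check along these lines flags U+5040 (implied radical count of 1 for the 2-stroke 'person' radical) while passing U+82F1.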
Given that kRSUnicode is used for ordering characters in the block (the radical number being used to determine what radical it is listed after and the residual count being used to determine where after the radical it appears - except for ties, which are ordered arbitrarily), while to the best of my knowledge kTotalStrokes is not used for anything within the standard, the most practical thing would be to keep the existing kRSUnicode value wherever it is not obviously incorrect and adjust the kTotalStrokes to be consistent with it. But this involves changing a lot of data - including data for the most widely used characters, those in the base CJK Unified Ideograph block - and may break systems that use the existing values. An alternative which I would suggest is to create a new field which could be called kRSUnicode2 or something similar and would have not two but three subfields (not counting apostrophe), radical number['].radical strokes.residual strokes, where the first and third subfields are the same (same meaning, same values, barring clear errors) as in kRSUnicode and the added second subfield is the number of strokes in the radical as it appears in the character. This new field would contain all the stroke count information that's needed for a character, including not only the residual strokes but also the radical strokes and, via calculation (adding the two values), the total strokes. The last can be compared with kTotalStrokes, but does not depend on it, and may be different. (Note that the presence of apostrophe would become largely predictable from a comparison of the radical stroke count in the first subfield with the count for the radical as a standalone character. In fact it would only be necessary to retain it if its purpose was not simply to indicate significantly abbreviated radicals in general but specifically to indicate forms that are used in Chinese Simplified but not the corresponding Traditional ones.) 
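Assuming the three subfields are separated by dots as in the existing kRSUnicode syntax (the exact notation is my assumption; the proposal does not fix it), the proposed field could be handled along these lines:

```python
# Sketch of parsing the proposed kRSUnicode2 field:
#   radical number['].radical strokes.residual strokes
# The dot-separated syntax is assumed from the description above.
def parse_krsunicode2(value):
    radical, rad_strokes, residual = value.split(".")
    simplified = radical.endswith("'")  # apostrophe marks an abbreviated radical
    return int(radical.rstrip("'")), simplified, int(rad_strokes), int(residual)

def total_strokes(value):
    """Total strokes derived by adding the radical and residual subfields."""
    _, _, rad, res = parse_krsunicode2(value)
    return rad + res

# U+5040 as drawn with a 2-stroke 'person' radical and 8 residual strokes:
print(total_strokes("9.2.8"))         # 10
# A hypothetical apostrophe-marked entry, for the parsing only:
print(parse_krsunicode2("140'.3.5"))  # (140, True, 3, 5)
```

The point of the design shows up in `total_strokes`: the total is derived from the field itself, so it can be compared against kTotalStrokes without depending on it.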
I see the following advantages to this approach: (1) No constraints are placed on existing kTotalStrokes or kRSUnicode values - they can be left as is or changed at any point without implications for the new kRSUnicode2 values (2) No systems that use the existing kTotalStrokes or kRSUnicode fields will break or be affected in any way (though they could be changed to use the self-standing kRSUnicode2 field with possibly more satisfactory results) (3) All stroke information for a character is contained in a single field, kRSUnicode2, and can't be inconsistent (though it can be wrong) (4) Stroke counting differences between fields can be directly found and quantified (particularly, by comparing the partial stroke information in kTotalStrokes and/or kRSUnicode to the full information in kRSUnicode2) (5) An initial version of the full set of the new kRSUnicode2 field values could be generated algorithmically from kTotalStrokes and kRSUnicode and then revised by human inspection focusing on the proportionally small number (8% in the base block, 3% overall) of "suspicious" cases detected by a heuristic procedure (which I'm sure could be made more accurate than the one I used, for example by bringing in more existing information sources) The main disadvantages I see are: (1) Confusion arising from the overlap between the old and new fields (2) The work involved (though anything other than dismissing or postponing the issue is going to involve work) If there is interest I will be glad to share the results of my heuristic test and the program (Python) I used to produce them. John Armstrong Cambridge MA -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From andrewcwest at gmail.com Tue Sep 9 10:57:16 2014 From: andrewcwest at gmail.com (Andrew West) Date: Tue, 9 Sep 2014 16:57:16 +0100 Subject: Discrepancies between kTotalStrokes and kRSUnicode in the Unihan database In-Reply-To: References: Message-ID: Hi John, You raise some interesting points, and I hope that one of the people who maintain the Unihan database can address your issues better than I can. I think that the reason why the main CJK block shows the greatest number of mismatches between kTotalStrokes and kRSUnicode is related to the way that CJK characters were ordered in the initial Unicode 1.0 repertoire, which seems to have been based on the glyph shape used in the particular source standard. To take U+5040 as an example, this is mainly a Cantonese character, and I guess that the source from which the Unicode character was derived used a traditional form of the character with a broken grass radical, giving a residual stroke count of 9, and it was thus ordered in the code charts as the first character with nine strokes under radical 9, hence kRSUnicode = 9.9 (you can see in the Unicode 2.0 code charts that U+5040 indeed has a 4-stroke grass radical). In a later version of Unicode a new font was used in which U+5040 was represented in the (single column) code charts with a 3-stroke grass radical glyph (see for example the Unicode 4.0 code charts). kTotalStrokes was presumably based on the glyph forms given in one of these later versions of the code charts, and so for U+5040 kTotalStrokes = 10. The problem of stroke counts is now compounded by the use of multi-column code charts for CJK, with each character illustrated with multiple regional glyph forms. In many cases different glyph forms with differing stroke counts are shown in different columns for the same character, so the kTotalStrokes and kRSUnicode fields may not reflect the stroke count for all regional variants of the same character. 
Furthermore, when regional variants of the same character do have varying stroke counts it is not obvious which character form should be used to calculate the values of kTotalStrokes and kRSUnicode, which makes these two fields very problematic in my opinion. That kRSUnicode allows for multiple values, but only provides more than one value in a tiny handful of cases (mostly where the character can be classified under more than one radical), makes the situation even worse. For processes that want to sort CJK characters, it is very useful to have a single nominal radical-stroke key for every encoded CJK character, but once you have multiple values for kRSUnicode (and no indication which value is preferable under which circumstances) then you are given a choice as to which value to use but no way of knowing which the best choice is. My solution would be to have a single kRSUnicode value giving a nominal radical-stroke value for each character, harmonized with kTotalStrokes, with stroke count for the two fields calculated consistently according to some defined criteria; and if there is more than one possible radical for a particular character then just use the radical under which it appears in the Unicode code charts. In addition I would create individual kTotalStrokes and kRSUnicode fields for each source (G, H, J, K, T, U, V, etc.), which would give the preferred radical and stroke count for each regional glyph form given in the code charts. Andrew On 8 September 2014 20:03, John Armstrong wrote: > [Apologies if this issue has already been resolved. I searched the > Unicode.org site for discussions but I only found a document dating from 2003 > which touches on the issue: andrewcwest at alumni.princeton.edu RE: Unicode > 4.0.1 Beta Review 1. kRSUnicode Field > (http://www.unicode.org/L2/L2003/03311-errata4.txt)] > > > > A CJK Han character is conventionally viewed as consisting of a radical plus > a residual part or 'phonetic'. 
(For a character which is a radical the > residual part is nothing. The term 'phonetic', indicating that the residual > part of the character points to the pronunciation of the character, properly > only applies to 90-95% of characters, but it applies in the examples below. > ) > > > > The two parts of a character each consist of a specific arrangement of strokes, > and together account for all the strokes in the character. In particular, > the number of strokes in the radical portion plus the number of strokes in > the residual portion equals the total number of strokes in the character. > The stroke count of a radical combined with a residual part is not always > the same as the stroke count of the radical appearing on its own, but may be > slightly or significantly less due to a minor or major abbreviation. (A > radical may have several forms which are used in different positions of the > whole character, say left or right side vs. top or bottom. These variants > may have the same or different stroke counts.) > > > > Because of abbreviated variants the total stroke count for a character > cannot always be gotten by adding the stroke count of the radical in its > standalone form to the stroke count of the residual portion. However, it can > always be gotten by subtracting the stroke count of the residual portion from > the total stroke count of the character. The Unihan database provides the > exact data needed to make this calculation: > > > > kTotalStrokes: stroke count for full character > > > > kRSUnicode: radical number and residual stroke count (in format > radical number['].residual strokes, where optional ' (apostrophe) in the latter > indicates a widely used abbreviation for the radical with a significantly > different appearance and a significantly (-3 or more) lower stroke count. (But > not all such forms are so marked - examples are forms with radical numbers > 140, 162, 163, 170. 
It may be that the marker is limited to abbreviations > used in Simplified as opposed to Traditional Chinese characters.) > > > > The formula is simply: > > > > radStrokes(K) = kTotalStrokes(K) - kRSUnicode(K).resStrokes > > > > This formula generally gives correct results, but not always. In fact, > according to a reasonably accurate heuristic test I ran it produces incorrect > (or at least 'suspicious') results in 2236 of the total 74911, or 3%, of > characters in the database that have both kTotalStrokes and kRSUnicode data. > Moreover the rate is significantly higher for the characters in the BMP than > in the SIP - in fact it is really negligible in the latter. Most > importantly it is 8.2% in the block containing all the most widely used > characters, the base CJK Unified Ideographs block. The numbers for all the > blocks are as follows: > > > > RANGE TOTAL* SUSPICIOUS PCT > > > > BMP > > BASE 20941 1727 8.2 > > CMP 302 29 9.6 > > CMPS 4 0 0.0 > > EXTA 6582 469 7.1 > > > > SIP > > EXTB 42711 6 0.0 > > EXTC 4149 5 0.1 > > EXTD 222 0 0.0 > > > > TOTAL 74911 2236 3.0 > > > > *with both kTotalStrokes and kRSUnicode > > > > > > Some of the suspicious cases are actually valid, but I believe that the vast > majority are truly incorrect, and that the rate of incorrect radical stroke > counts implied by kTotalStrokes and kRSUnicode is at least 6-7% for the base > CJK Unified Ideographs block. > > > > > > > > Here are a couple of examples where the stroke counts are fairly small and the > radicals and the residual parts ('phonetics') are widely occurring. The first > illustrates the situation where the radical stroke count implied by > kTotalStrokes and kRSUnicode is greater than the correct value, and the > second that where the implied radical stroke count is less than the correct > value. (The second situation is much more common than the first, accounting > for at least 80% of the 'suspicious' items.) > > > > > > Example 1: character U+4E9B 'a few' 
> > > > kTotalStrokes = 8 > > kRSUnicode = 7.5 > > radical number = 7 'two' > > residual strokes = 5 > > implied radical stroke count = 3 (8 - 5) > > correct radical stroke count = 2 > > diff = 1 (implied count one too high) > > > > > > The residual portion of the character occurs as an independent character > U+6B64 'this, these'. Its kTotalStrokes is 6 and its kRSUnicode = 77.2. > The radical #77 'stop' has 4 strokes in its standalone form, so the residual > stroke count of 2 is consistent with a total count of 6. In the main > character U+4E9B, therefore, the residual part has effectively lost a stroke > in composition, being reduced from 6 to 5. > > > > (This actually seems to be the norm with this phonetic. Other examples are > U+4F4C, U+5470, U+5472, U+59D5, U+67F4, U+75B5, U+7689, U+7725 and I'm sure > more.) > > > > > > Example 2: character U+5040 'distinguished person; English person' > > > > kTotalStrokes = 10 > > kRSUnicode = 9.9 > > radical number = 9 'person' > > residual strokes = 9 > > implied radical stroke count = 1 (10 - 9) > > correct radical stroke count = 2 > > diff = -1 (implied count one too low) > > > > Again the residual portion occurs as an independent character U+82F1 > 'distinguished; English'. Its kTotalStrokes is 8 and its kRSUnicode is > 140.5. Radical #140 'grass' has 6 strokes in its standalone form but as the > radical component of a larger character is always abbreviated to a form with > 3 strokes. That is the case here. Thus the residual count of 5 in the > kRSUnicode of U+82F1 is consistent with the kTotalStrokes of 8 for the > character. This count of 8 agrees with the residual count for the full > character U+5040 implied by its 10 kTotalStrokes, but is one less than the 9 > residual strokes specified in the kRSUnicode. 
> > > > In both examples the discrepancy between kTotalStrokes and kRSUnicode arises > out of different residual stroke counts and has nothing to do with the > radical, be it its identity, the variant used, or the stroke count. While > there are some exceptions, this is clearly the normal situation. It also > makes sense. Most disagreements on stroke counts have to do with the > residual as opposed to the radical portion of the characters. (Questions of > radical counts usually involve cases where the radical has more than one > form in a given context, for example rad #140 'grass', which has 6 strokes > in its full form but variants in the top position context with 3 and 4 > strokes. Less commonly, they involve cases where the radical is fused with > the residual portion or even lost altogether as part of a historical > simplification.) > > > > As mentioned above, discrepancies of the type illustrated by the second > example (implied radical strokes lower than correct) are much more common > than discrepancies of the type illustrated by the first example (implied > radical strokes higher than correct). To the extent the discrepancies involve > the residual stroke counts and have nothing to do with the radical, the > situation can be reframed in terms of residual stroke counts as: > > > > Dominant pattern: the residual stroke count specified in kRSUnicode is > greater than that implied by kTotalStrokes (9 vs. 8 strokes in Ex. 2) > > > > Minor pattern: the residual stroke count specified in kRSUnicode is less > than that implied by kTotalStrokes (5 vs. 6 strokes in Ex. 1) > > > > The results of the heuristic test indicate that the great majority of cases > of both patterns involve differences in residual stroke counts of one or > occasionally two strokes. I believe this is in line with the variations in > stroke counting that are observed in actual practice (dictionaries etc.). 
> Still, the question needs to be asked, do the discrepancies (which occur in > 5% of all characters in the base Unicode character set) simply represent > different, but more or less equally valid, ways of counting strokes, or are > they errors that need to be corrected or at least addressed in some way? > > > > In my view the answer depends on a more specific question: are kTotalStrokes > and kRSUnicode intended to be consistent? That is, regardless of what exact > count is chosen for a given character, should both fields reflect the same > count? > > > > Here is how the two fields are described in the document Proposed Update to > Unicode Standard Annex #38 Unicode 6.0.0 draft 1 > (http://www.unicode.org/reports/tr38/tr38-8.html): > > > > kTotalStrokes: > > > > "The total number of strokes in the character (including the radical). _This > value is for the character as drawn in the Unicode charts_." > > > > kRSUnicode: > > > > "A standard radical/stroke count for this character in the form > "radical.additional strokes". The radical is indicated by a number in the > range (1..214) inclusive. An apostrophe (') after the radical indicates a > simplified version of the given radical. The "additional strokes" value is > the residual stroke-count, the count of all strokes remaining after > eliminating all strokes associated with the radical. > > > > This field is also used for additional radical-stroke indices where either a > character may be reasonably classified under more than one radical, or > alternate stroke count algorithms may provide different stroke counts. > > > > _The first value is intended to reflect the same radical as the kRSKangXi > field and the stroke count of the glyph used to print the character within > the Unicode Standard_. > > > > When I talk about kRSUnicode I always mean the first value in the list. > Similarly my heuristic test always uses the first value. 
> I mention this > because of the way the last paragraph of the description refers specifically > to this value. > > > > Both descriptions tie the specific values of the two fields to the specific > glyphs used to draw/print the character in the Unicode charts (kTotalStrokes > "character as drawn in the Unicode charts", kRSUnicode "the glyph used to > print the character within the Unicode Standard"). Given this, the answer > to the question of whether the two fields should be consistent certainly > seems to be yes. And this means that the cases where they are not, i.e. > where there are discrepancies, are errors. > > > > If it's conceded that the discrepancies do reflect errors, then I think it > also needs to be conceded that they need to be addressed in some way. The > most straightforward thing would be to go through all the cases and change > either kTotalStrokes or kRSUnicode to (a) be consistent and (b) offer values > appropriate to the specific glyph used in the standard. > > > > Given that kRSUnicode is used for ordering characters in the block (the > radical number being used to determine what radical it is listed after and > the residual count being used to determine where after the radical it > appears - except for ties, which are ordered arbitrarily), while to the best > of my knowledge kTotalStrokes is not used for anything within the standard, > the most practical thing would be to keep the existing kRSUnicode value > wherever it is not obviously incorrect and adjust the kTotalStrokes to be > consistent with it. > > > > But this involves changing a lot of data - including data for the most > widely used characters, those in the base CJK Unified Ideograph block - and > may break systems that use the existing values. > > > > An alternative which I would suggest is to create a new field which could be > called kRSUnicode2 or something similar and would have not two but three > subfields (not counting apostrophe) > > > > radical number['].radical strokes.residual strokes 
> > > > where the first and third subfields are the same (same meaning, same values, > barring clear errors) as in kRSUnicode and the added second subfield is the > number of strokes in the radical as it appears in the character. > > > > This new field would contain all the stroke count information that's needed > for a character, including not only the residual strokes but also the > radical strokes and, via calculation (adding the two values), the total > strokes. The last can be compared with kTotalStrokes, but does not depend > on it, and may be different. > > > > (Note that the presence of apostrophe would become largely predictable from > a comparison of the radical stroke count in the first subfield with the > count for the radical as a standalone character. In fact it would only be > necessary to retain it if its purpose was not simply to indicate > significantly abbreviated radicals in general but specifically to indicate > forms that are used in Chinese Simplified but not the corresponding > Traditional ones.) > > > > I see the following advantages to this approach: > > > > (1) No constraints are placed on existing kTotalStrokes or kRSUnicode > values - 
they can be left as is or changed at any point without implications > for the new kRSUnicode2 values > > > > (2) No systems that use the existing kTotalStrokes or kRSUnicode fields > will break or be affected in any way (though they could be changed to use > the self-standing kRSUnicode2 field with possibly more satisfactory results) > > > > (3) All stroke information for a character is contained in a single field, > kRSUnicode2, and can't be inconsistent (though it can be wrong) > > > > (4) Stroke counting differences between fields can be directly found and > quantified (particularly, by comparing the partial stroke information in > kTotalStrokes and/or kRSUnicode to the full information in kRSUnicode2) > > > > (5) An initial version of the full set of the new kRSUnicode2 field values > could be generated algorithmically from kTotalStrokes and kRSUnicode and > then revised by human inspection focusing on the proportionally small number > (8% in the base block, 3% overall) of 'suspicious' cases detected by a > heuristic procedure (which I'm sure could be made more accurate than the one > I used, for example by bringing in more existing information sources) > > > > The main disadvantages I see are: > > > > (1) Confusion arising from the overlap between the old and new fields > > > > (2) The work involved (though anything other than dismissing or postponing > the issue is going to involve work) > > > > If there is interest I will be glad to share the results of my heuristic > test and the program (Python) I used to produce them. 
> > > > John Armstrong > > Cambridge MA > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From rscook at wenlin.com Tue Sep 9 13:19:49 2014 From: rscook at wenlin.com (Richard COOK) Date: Tue, 9 Sep 2014 11:19:49 -0700 Subject: Discrepancies between kTotalStrokes and kRSUnicode in the Unihan database - repost all ascii In-Reply-To: References: Message-ID: <171C83BF-16CC-46EE-88BD-BCED89BE694F@wenlin.com> >> On Sep 9, 2014, at 8:28 AM, Richard COOK wrote: > On Sep 8, 2014, at 12:03 PM, John Armstrong wrote: Mr. Armstrong, I see that my reply to your message bounced from the main Unicode list, due to length constraints. At any rate, the message did go through on the Unihan list, where people involved in Unihan development can read it. In sum, I was suggesting that you might prepare a list of variant property values for kRSUnicode and kTotalStrokes. This would feed into ongoing work on those properties. --- Richard Cook, Wénlín Institute, Inc. From john.armstrong.rn3 at gmail.com Tue Sep 9 15:16:41 2014 From: john.armstrong.rn3 at gmail.com (John Armstrong) Date: Tue, 9 Sep 2014 16:16:41 -0400 Subject: Discrepancies between kTotalStrokes and kRSUnicode in the Unihan database Message-ID: <540f6045.89158c0a.21e9.31b3@mx.google.com> Thanks for the comments Andrew. When I wrote up my second example I was not yet thinking of the complicating factor of the coexistence of 3- and 4-stroke variants of Radical #140 'grass' in top position. #140 is not the main radical in the example but it is the radical of the "phonetic", and if the glyph used to determine kTotalStrokes had the 4-stroke ("broken") form instead of the 3-stroke one which it had in all the forms I looked at, the example would be a case of a discrepancy due to a variation in radical stroke count (albeit within the residual portion of a larger character). I'll keep this case in mind as I go forward. 
I will also keep in mind that the counts for characters in the standard may be based on different glyphs than the ones that appear in current printings and electronic displays of the items. The fact that the creators and maintainers of the standard itself get tripped up in this way shows how fragile unification can be. -- John -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.armstrong.rn3 at gmail.com Tue Sep 9 16:22:45 2014 From: john.armstrong.rn3 at gmail.com (John Armstrong) Date: Tue, 9 Sep 2014 17:22:45 -0400 Subject: Discrepancies between kTotalStrokes and kRSUnicode in the Unihan database - repost all ascii Message-ID: <540f6fc1.115e8c0a.14b3.ffff93ae@mx.google.com> Thanks for your long and detailed reply Richard. (The full version came to me directly so I could see it.) It will take me some time to digest it, but since you suggest I submit something to UTC I want to clarify the extent of my knowledge of and ongoing involvement with Han characters. I started out life as a linguist but have worked in software for the past 20 years. My main work now involves web crawling and page and entity identification focusing strongly on English language sources. I ran into the issue I've described in this mailing list while doing a personal project involving correlating Sino-Japanese and Sino-Korean vocabulary. I actually am more interested in the readings of the characters (Japanese On and Korean Eum) than the characters themselves, but I am trying to leverage the fact that the two languages normally write cognate Sino-X words with the same characters (plus or minus variations in form). I wanted to have stroke counts of the form Total-Rad-Residual such as I am used to from Jack Halpern's Japanese character dictionaries and was fully confident that I could make them by simply combining the kTotalStrokes and kRSUnicode fields in the manner I indicate in my message. 
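That combination, deriving a Total-Rad-Residual triple from the two Unihan fields, amounts to something like the following sketch (my own reconstruction for illustration, not the actual tool):

```python
# Derive a Halpern-style Total-Radical-Residual stroke triple by combining
# kTotalStrokes with the first kRSUnicode value (illustrative reconstruction).
def total_rad_residual(ktotal, krsunicode):
    first = krsunicode.split()[0]          # kRSUnicode may hold several values
    _radical, residual = first.split(".")  # radical number is not needed here
    residual = int(residual)
    implied_radical = ktotal - residual    # radical strokes implied by the fields
    return ktotal, implied_radical, residual

print(total_rad_residual(8, "7.5"))    # U+4E9B -> (8, 3, 5): implied radical too high
print(total_rad_residual(10, "9.9"))   # U+5040 -> (10, 1, 9): implied radical too low
```

Both triples come out with an implausible middle value, which is exactly the symptom the discrepancies produce.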
But I started noticing occasional examples where the implied radical stroke counts seemed too large or too small, and I modified a tool I already had to try to detect cases of this algorithmically. I don't know Chinese and I have a lot of trouble making out Chinese characters when they are printed in normal size due to lack of familiarity made worse by poor eyesight and a touch of dyslexia. I am certainly willing to give UTC a complete list of characters (through Extension D) and their status as suspicious or not along with some stats that the tool uses to make its decisions. In fact I already have. Beyond that I might be able to commit to submitting a list of kTotalStrokes values that should be corrected to match the kRSUnicode values. I definitely do not have either the time or the knowledge to determine the correctness of the kRSUnicode values or do anything with variants. But I'm not sure I am the best person to do this. Based on the information about CDL I see on the Wenlin Institute website I sense you already have a full compositional model and could use it to produce a list of corrections that would be far more accurate than anything I could do. In terms of changing or adding fields, while I think the original separation of kTotalStrokes and kRSUnicode was a poor design choice (though maybe unavoidable for historical reasons), I'm thinking more and more that it's not worth making a change just to fix the issue I'm raising, and a better next step would be to represent characters as specific formally recognized radical variants (with fixed stroke counts) + residual stroke counts. This would be a first step towards a compositional model but could be done without getting into all the complexity and difficulty of a full recursive model. What do you think? Feel free to respond off-list if you prefer. -- John -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Fri Sep 19 17:56:07 2014 From: mark at kli.org (Mark E. 
Shoulson) Date: Fri, 19 Sep 2014 18:56:07 -0400 Subject: What happened to...? Message-ID: <541CB487.50004@kli.org> An HTML attachment was scrubbed... URL: From rick at unicode.org Fri Sep 19 18:26:38 2014 From: rick at unicode.org (Rick McGowan) Date: Fri, 19 Sep 2014 16:26:38 -0700 Subject: What happened to...? In-Reply-To: <541CB487.50004@kli.org> References: <541CB487.50004@kli.org> Message-ID: <541CBBAE.9050104@unicode.org> Hi Mark, This document ended up being delayed all the way into meeting #133, so the resolution is in those minutes: http://www.unicode.org/L2/L2012/12343.htm#133-A62 Regards, Rick On 9/19/2014 3:56 PM, Mark E. Shoulson wrote: > that, http://www.unicode.org/L2/L2011/11373-linguistic-doubt.pdf > proposes some From everson at evertype.com Fri Sep 19 19:16:11 2014 From: everson at evertype.com (Michael Everson) Date: Sat, 20 Sep 2014 01:16:11 +0100 Subject: What happened to...? In-Reply-To: <541CBBAE.9050104@unicode.org> References: <541CB487.50004@kli.org> <541CBBAE.9050104@unicode.org> Message-ID: "Declines to take action" is pretty thin. On 20 Sep 2014, at 00:26, Rick McGowan wrote: > Hi Mark, > > This document ended up being delayed all the way into meeting #133, so the resolution is in those minutes: > > http://www.unicode.org/L2/L2012/12343.htm#133-A62 > > Regards, > Rick > > > On 9/19/2014 3:56 PM, Mark E. Shoulson wrote: >> that, http://www.unicode.org/L2/L2011/11373-linguistic-doubt.pdf proposes some > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode Michael Everson * http://www.evertype.com/ From ken.whistler at sap.com Fri Sep 19 19:38:24 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Sat, 20 Sep 2014 00:38:24 +0000 Subject: What happened to...? In-Reply-To: References: <541CB487.50004@kli.org> <541CBBAE.9050104@unicode.org> Message-ID: Michael, > "Declines to take action" is pretty thin. 
A proposal which is declined by the UTC doesn't automatically create an obligation to write an extended dissertation explaining the rationale and putting that rationale on record. It might be one thing if there were a lot of controversy involved, and one group of participants asked for a rationale to be recorded, despite not having a consensus to move on something -- but this one wasn't even close. Nobody in the committee felt encoding was justified in this case. And not every mark on paper -- not even every mark *printed* in typeset material on paper -- is automatically an obvious candidate for encoding with a simple, plain text character representation. --Ken From asmusf at ix.netcom.com Fri Sep 19 20:13:11 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Fri, 19 Sep 2014 18:13:11 -0700 Subject: What happened to...? In-Reply-To: References: <541CB487.50004@kli.org> <541CBBAE.9050104@unicode.org> Message-ID: <541CD4A7.9000601@ix.netcom.com> On 9/19/2014 5:38 PM, Whistler, Ken wrote: > Michael, > >> "Declines to take action" is pretty thin. > A proposal which is declined by the UTC doesn't automatically > create an obligation to write an extended dissertation explaining > the rationale and putting that rationale on record. It might be > one thing if there were a lot of controversy involved, and one > group of participants asked for a rationale to be recorded, > despite not having a consensus to move on something -- but > this one wasn't even close. Nobody in the committee felt > encoding was justified in this case. > > And not every mark on paper -- not even every mark *printed* > in typeset material on paper -- is automatically an obvious > candidate for encoding with a simple, plain text character > representation. True, but a rationale (note that's not necessarily a dissertation) never hurts. "Declines to take action" may look like it is equivalent to "Nobody in the committee felt encoding was justified in this case", but it really isn't. 
The former allows for all sorts of non-substantive reasons, but the latter is pretty clear: the submitter failed to make the case. What you are looking for is something equivalent to "summary dismissal" of a legal action, but even there this usually gets some rationale or it has the benefit of a standardized legal principle (don't know for a fact, but sounds plausible). A./ > > --Ken > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From everson at evertype.com Sat Sep 20 04:11:20 2014 From: everson at evertype.com (Michael Everson) Date: Sat, 20 Sep 2014 10:11:20 +0100 Subject: What happened to...? In-Reply-To: References: <541CB487.50004@kli.org> <541CBBAE.9050104@unicode.org> Message-ID: On 20 Sep 2014, at 01:38, Whistler, Ken wrote: > Michael, > >> "Declines to take action" is pretty thin. > > A proposal which is declined by the UTC doesn't automatically create an obligation to write an extended dissertation explaining the rationale and putting that rationale on record. It can be frustrating feedback to the proposer. Michael Everson * http://www.evertype.com/ From mark at macchiato.com Sat Sep 20 04:32:57 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sat, 20 Sep 2014 02:32:57 -0700 Subject: What happened to...? In-Reply-To: <541CD4A7.9000601@ix.netcom.com> References: <541CB487.50004@kli.org> <541CBBAE.9050104@unicode.org> <541CD4A7.9000601@ix.netcom.com> Message-ID: I agree that we should minute at least some reason for declining. It need only be a sentence or two. (BTW I wasn't at that discussion.) {phone} On Sep 20, 2014 3:17 AM, "Asmus Freytag" wrote: > On 9/19/2014 5:38 PM, Whistler, Ken wrote: > >> Michael, >> >> "Declines to take action" is pretty thin.
>>> >> A proposal which is declined by the UTC doesn't automatically >> create an obligation to write an extended dissertation explaining >> the rationale and putting that rationale on record. It might be >> one thing if there were a lot of controversy involved, and one >> group of participants asked for a rationale to be recorded, >> despite not having a consensus to move on something -- but >> this one wasn't even close. Nobody in the committee felt >> encoding was justified in this case. >> >> And not every mark on paper -- not even every mark *printed* >> in typeset material on paper -- is automatically an obvious >> candidate for encoding with a simple, plain text character >> representation. >> > > True, but a rationale (note that's not necessarily a dissertation) never > hurts. > > "Declines to take action" may look like it is equivalent to "Nobody in the > committee felt > encoding was justified in this case", but it really isn't. The former > allows for all sorts of non-substantive reasons, but the latter is pretty > clear: the submitter failed to make the case. > > What you are looking for is something equivalent to "summary dismissal" of > a legal action, but even there this usually gets some rationale or it has > the benefit of a standardized legal principle (don't know for a fact, but > sounds plausible). > > > > A./ > >> >> --Ken >> >> >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode >> > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Sat Sep 20 18:49:02 2014 From: mark at kli.org (Mark E. Shoulson) Date: Sat, 20 Sep 2014 19:49:02 -0400 Subject: What happened to...?
In-Reply-To: <541CBBAE.9050104@unicode.org> References: <541CB487.50004@kli.org> <541CBBAE.9050104@unicode.org> Message-ID: <541E126E.6050403@kli.org> Thanks; I probably should have looked at later meetings too. ~mark On 09/19/2014 07:26 PM, Rick McGowan wrote: > Hi Mark, > > This document ended up being delayed all the way into meeting #133, so > the resolution is in those minutes: > > http://www.unicode.org/L2/L2012/12343.htm#133-A62 > > Regards, > Rick > > > On 9/19/2014 3:56 PM, Mark E. Shoulson wrote: >> that, http://www.unicode.org/L2/L2011/11373-linguistic-doubt.pdf >> proposes some From jonathan at doves.demon.co.uk Sat Sep 20 19:35:37 2014 From: jonathan at doves.demon.co.uk (Jonathan Coxhead) Date: Sat, 20 Sep 2014 17:35:37 -0700 Subject: Is this the oldest d20 on Earth? Message-ID: <541E1D59.9030606@doves.demon.co.uk> Here's an icosahedral die from the Ptolemaic period: http://www.metmuseum.org/collection/the-collection-online/search/551070 I find myself idly wondering whether the identities of the characters are all known and encoded ... Cheers Jonathan From rscook at wenlin.com Sat Sep 20 22:25:10 2014 From: rscook at wenlin.com (Richard Cook) Date: Sat, 20 Sep 2014 20:25:10 -0700 Subject: Is this the oldest d20 on Earth? In-Reply-To: <541E1D59.9030606@doves.demon.co.uk> References: <541E1D59.9030606@doves.demon.co.uk> Message-ID: On Sep 20, 2014, at 5:35 PM, Jonathan Coxhead wrote: > > Here's an icosahedral die from the Ptolemaic period: > > http://www.metmuseum.org/collection/the-collection-online/search/551070 > > I find myself idly wondering whether the identities of the characters are all known and encoded ... > The enlarged image doesn't show all of the sides.
> Cheers > > Jonathan > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From beckiergb at gmail.com Sat Sep 20 22:53:07 2014 From: beckiergb at gmail.com (Rebecca Bettencourt) Date: Sat, 20 Sep 2014 20:53:07 -0700 Subject: Is this the oldest d20 on Earth? In-Reply-To: References: <541E1D59.9030606@doves.demon.co.uk> Message-ID: According to this post, the 20 sides are the first 20 letters of the Greek/Coptic alphabet, with a stylized form of alpha (where the crossbar is a V) and lunate sigma (which looks like C instead of Σ). http://www.artisandice.com/blog/ptolemaic-d20/ Lunate sigma is U+03F9 (uppercase) and U+03F2 (lowercase). So yes, all 20 symbols are known and encoded. -- Rebecca Bettencourt On Sat, Sep 20, 2014 at 8:25 PM, Richard Cook wrote: > On Sep 20, 2014, at 5:35 PM, Jonathan Coxhead > wrote: > > > > Here's an icosahedral die from the Ptolemaic period: > > > > http://www.metmuseum.org/collection/the-collection-online/search/551070 > > > > I find myself idly wondering whether the identities of the characters > are all known and encoded ... > > > > The enlarged image doesn't show all of the sides. > > > Cheers > > > > Jonathan > > _______________________________________________ > > Unicode mailing list > > Unicode at unicode.org > > http://unicode.org/mailman/listinfo/unicode > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From public at khwilliamson.com Mon Sep 22 13:11:36 2014 From: public at khwilliamson.com (Karl Williamson) Date: Mon, 22 Sep 2014 12:11:36 -0600 Subject: What happened to...?
In-Reply-To: References: <541CB487.50004@kli.org> <541CBBAE.9050104@unicode.org> <541CD4A7.9000601@ix.netcom.com> Message-ID: <54206658.6040609@khwilliamson.com> On 09/20/2014 03:32 AM, Mark Davis ☕️ wrote: > I agree that we should minute at least some reason for declining. It > need only be a sentence or two. I would hope that the requesters get a detailed explanation of the rejection. It would be very wrong not to do so. If so, then the minutes could just copy and paste, deleting unnecessary detail. > > (BTW I wasn't at that discussion.) > > {phone} > > On Sep 20, 2014 3:17 AM, "Asmus Freytag" > wrote: > > On 9/19/2014 5:38 PM, Whistler, Ken wrote: > > Michael, > > "Declines to take action" is pretty thin. > > A proposal which is declined by the UTC doesn't automatically > create an obligation to write an extended dissertation explaining > the rationale and putting that rationale on record. It might be > one thing if there were a lot of controversy involved, and one > group of participants asked for a rationale to be recorded, > despite not having a consensus to move on something -- but > this one wasn't even close. Nobody in the committee felt > encoding was justified in this case. > > And not every mark on paper -- not even every mark *printed* > in typeset material on paper -- is automatically an obvious > candidate for encoding with a simple, plain text character > representation. > > > True, but a rationale (note that's not necessarily a dissertation) > never hurts. > > "Declines to take action" may look like it is equivalent to "Nobody > in the committee felt > encoding was justified in this case", but it really isn't. The > former allows for all sorts of non-substantive reasons, but the > latter is pretty clear: the submitter failed to make the case.
> > What you are looking for is something equivalent to "summary > dismissal" of a legal action, but even there this usually gets some > rationale or it has the benefit of a standardized legal principle > (don't know for a fact, but sounds plausible). > > > > A./ > > > --Ken > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From dzo at bisharat.net Fri Sep 26 09:41:20 2014 From: dzo at bisharat.net (dzo at bisharat.net) Date: Fri, 26 Sep 2014 07:41:20 -0700 Subject: Current support for N'Ko Message-ID: Some observations concerning N'Ko support in browsers may be of interest: http://niamey.blogspot.com/2014/09/nko-on-web-review-of-experience-with.html This is pursuant to reposting a translation in N'Ko of a World Health Organization FAQ on Ebola. That translation was one of several facilitated by Athinkra LLC, and available at https://sites.google.com/site/athinkra/ebola-faqs Don Osborn From lang.support at gmail.com Fri Sep 26 18:10:08 2014 From: lang.support at gmail.com (Andrew Cunningham) Date: Sat, 27 Sep 2014 09:10:08 +1000 Subject: Current support for N'Ko In-Reply-To: References: Message-ID: Hi Don, I will give a detailed reply offline to you and Charles. I am slowly working on notes on web deployment of various languages in my spare time. Been held up unpicking Myanmar script and possible errata/additions to UTN11. But N'ko is on my list of scripts to document. I will need to look at your pages and unpick them. But a couple of reflections. Your blog post is dealing with multiple issues.
* bidi support in HTML5 and CSS3, and to what extent scripts like N'ko are taken into account. * Which rendering system is being used by the browser. * What font is being used: OpenType, Graphite, AAT ... this will affect rendering in browsers. For OpenType, which script tag is being used, which will affect which OpenType features will be processed. So getting the font stack right is important. And the font stack will differ from browser to browser. I need to check for the existence of a cross-platform N'ko font. * NEVER try to copy and paste text from PDF. It is a preprint format and should be treated as such. Andrew On 27/09/2014 12:45 AM, wrote: > > Some observations concerning N'Ko support in browsers may be of interest: > > http://niamey.blogspot.com/2014/09/nko-on-web-review-of-experience-with.html > > This is pursuant to reposting a translation in N'Ko of a World Health Organization FAQ on Ebola. That translation was one of several facilitated by Athinkra LLC, and available at https://sites.google.com/site/athinkra/ebola-faqs > > Don Osborn > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From frederic.grosshans at gmail.com Mon Sep 29 07:58:31 2014 From: frederic.grosshans at gmail.com (=?windows-1252?Q?Fr=E9d=E9ric_Grosshans?=) Date: Mon, 29 Sep 2014 14:58:31 +0200 Subject: Current support for N'Ko In-Reply-To: References: Message-ID: <54295777.1010900@gmail.com> On 27/09/2014 01:10, Andrew Cunningham wrote: > * NEVER try to copy and paste text from PDF. It is a preprint format > and should be treated as such. Well... Having access to the raw text is often useful (for example, to allow blind users to have access to the content of PDF documents, or to search a word in a scanned historical document), and cut and pasting text from PDF often works, even if the "rich text" formatting is lost.
In the case of the Ebola FAQs (https://sites.google.com/site/athinkra/ebola-faqs) discussed here, it almost worked perfectly on my computer (Ubuntu Linux 14.04) for N'Ko (diacritics are shifted by one character) and Vai. Of course, the Adlam was not working (somehow converted to Arabic), but it was expected, since Adlam is not (yet?) in Unicode. From prosfilaes at gmail.com Mon Sep 29 13:07:41 2014 From: prosfilaes at gmail.com (David Starner) Date: Mon, 29 Sep 2014 11:07:41 -0700 Subject: Current support for N'Ko In-Reply-To: References: Message-ID: On Fri, Sep 26, 2014 at 4:10 PM, Andrew Cunningham wrote: > * NEVER try to copy and paste text from PDF. It is a preprint format and > should be treated as such. I'd try and cut and paste from print if I could. People are going to cut and paste from anything if it saves them a little time. If you disable cut and pasting from PDF, those who have easy access to OCR may just print to image and OCR it to cut and paste. To say don't do this is unproductive. -- Kie ekzistas vivo, ekzistas espero. From lang.support at gmail.com Mon Sep 29 15:03:16 2014 From: lang.support at gmail.com (Andrew Cunningham) Date: Tue, 30 Sep 2014 06:03:16 +1000 Subject: Current support for N'Ko In-Reply-To: <54295777.1010900@gmail.com> References: <54295777.1010900@gmail.com> Message-ID: On 29/09/2014 11:02 PM, "Frédéric Grosshans" wrote: > > On 27/09/2014 01:10, Andrew Cunningham wrote: > >> * NEVER try to copy and paste text from PDF. It is a preprint format and should be treated as such. > > Well... Having access to the raw text is often useful (for example, to allow blind users to have access to the content of PDF documents, or to search a word in a scanned historical document), and cut and pasting text from PDF often works, even if the "rich text" formatting is lost. > The problem is that often the actual text isn't necessarily the same as the original text used to generate the PDF.
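To make the "diacritics are shifted by one character" symptom concrete, the kind of post-processing pass involved might look like the sketch below. It uses Latin combining accents as stand-in sample data, since the actual glyph order you get for N'Ko depends on the font and the extraction tool:

```python
# Sketch: repair extracted text in which each combining mark came out
# one position early (before its base letter, instead of after it).
# The sample data is hypothetical; real PDF output varies per extractor.
import unicodedata

def fix_shifted_marks(text):
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        # A combining mark sitting directly before a base character:
        # swap the pair so the mark follows its base, as Unicode expects.
        if unicodedata.combining(chars[i]) and not unicodedata.combining(chars[i + 1]):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)

# "\u0301e" is COMBINING ACUTE ACCENT then "e"; repaired order is "e\u0301".
print(fix_shifted_marks("\u0301e") == "e\u0301")  # True
```

Real-world fixes are usually messier than this (glyph-to-character mappings, not just reordering), but the principle is the same: treat the extracted string as glyph-order data and normalize it back to logical order.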
Results will vary according to the fonts used and the tools used to generate the PDF. Even Adobe Acrobat contains different tools which can give vastly different results. It is best to think of PDF as dealing with glyphs rather than characters. I tend to mainly work with complex scripts, and the results with those are usually not encouraging. I know there is ActualText, but honestly I don't actually ever remember seeing a complex script PDF I could copy and paste from without post-processing of the text. The average person creating PDF files has no knowledge of how to achieve optimal results. N'ko is one of the easier scripts to deal with, thankfully. > In the case of the Ebola FAQs ( https://sites.google.com/site/athinkra/ebola-faqs) discussed here, it almost worked perfectly on my computer (Ubuntu Linux 14.04) for N'Ko (diacritics are shifted by one character) and Vai. Of course, the Adlam was not working (somehow converted to Arabic), but it was expected, since Adlam is not (yet?) in Unicode. > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From lang.support at gmail.com Mon Sep 29 15:13:54 2014 From: lang.support at gmail.com (Andrew Cunningham) Date: Tue, 30 Sep 2014 06:13:54 +1000 Subject: Current support for N'Ko In-Reply-To: References: Message-ID: On 30/09/2014 4:11 AM, "David Starner" wrote: > > On Fri, Sep 26, 2014 at 4:10 PM, Andrew Cunningham > wrote: > > * NEVER try to copy and paste text from PDF. It is a preprint format and > > should be treated as such. > > > I'd try and cut and paste from print if I could. People are going to > cut and paste from anything if it saves them a little time. If you > disable cut and pasting from PDF, those who have easy access to OCR > may just print to image and OCR it to cut and paste. To say don't do > this is unproductive.
> Ok, what I should say is that in the best case scenario for complex script text you can copy and paste and then do post-processing on the extracted text to get the actual text. Post-processing may involve reordering characters, or systematic conversions of glyph sequences. In the worst case scenario you get utter garbage from which you cannot reconstruct the original text. Searching and indexing is even more problematic. Honestly, for the languages I work with it would be quicker and more accurate in many cases to use OCR (even at 80% accuracy) than cut and paste from PDF. As I said in a previous email, results and effectiveness will differ depending on the fonts used and the PDF generator used. PDF was designed for preprint, not archival purposes. > -- > Kie ekzistas vivo, ekzistas espero. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Mon Sep 29 19:10:21 2014 From: petercon at microsoft.com (Peter Constable) Date: Tue, 30 Sep 2014 00:10:21 +0000 Subject: Current support for N'Ko In-Reply-To: References: Message-ID: <260fc92bfb5041349738b8b1b42c8774@CY1PR0301MB0698.namprd03.prod.outlook.com> Don, You mention testing IE 8. That's a 5.5-year-old version that shipped before N'Ko script was supported on any platform. It's interesting that anything worked. You also mentioned IE11 on Windows 7 but testing without the Deja Vu fonts. Windows has supported N'Ko since Windows 8. Did you try testing with that and using the Ebrima font? Btw, the text appears to display correctly on my Windows Phone.
Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of dzo at bisharat.net Sent: Friday, September 26, 2014 8:11 PM To: unicode at unicode.org Cc: charles.riley at yale.edu Subject: Current support for N'Ko Some observations concerning N'Ko support in browsers may be of interest: http://niamey.blogspot.com/2014/09/nko-on-web-review-of-experience-with.html This is pursuant to reposting a translation in N'Ko of a World Health Organization FAQ on Ebola. That translation was one of several facilitated by Athinkra LLC, and available at https://sites.google.com/site/athinkra/ebola-faqs Don Osborn _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From dzo at bisharat.net Tue Sep 30 07:09:14 2014 From: dzo at bisharat.net (dzo at bisharat.net) Date: Tue, 30 Sep 2014 05:09:14 -0700 Subject: Current support for N'Ko In-Reply-To: <260fc92bfb5041349738b8b1b42c8774@CY1PR0301MB0698.namprd03.prod.outlook.com> References: <260fc92bfb5041349738b8b1b42c8774@CY1PR0301MB0698.namprd03.prod.outlook.com> Message-ID: Peter, The "weird trick" in the HTML font command to get it to display on IE 8 was "serif." When I first posted (having copied over text from a Word document with formatting - another in the list of not so good practices), some N'Ko text showed but most was empty boxes. On looking at the HTML in Blogger, I found that the visible text was where the span/font command included serif after DejaVu Sans. Anything without "serif" - including DejaVu Sans alone - produced the empty boxes. Probably the generic serif command let the system find another font? Wrt IE 11, I will have to go back to that computer. Actually it was in a public library - not a bad way to get an idea of how a random system might see content in a script like N'Ko.
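(For reference, the markup difference described above amounted to something like the following - a reconstruction from memory, not the exact spans Blogger produced:

```html
<!-- Displayed on IE 8: the generic "serif" keyword at the end lets the
     browser substitute any installed font that covers N'Ko. -->
<span style="font-family: 'DejaVu Sans', serif;">&#x07D2;&#x07DE;&#x07CF;</span>

<!-- Empty boxes on IE 8: no generic fallback, so characters the named
     font lacks render as .notdef boxes. -->
<span style="font-family: 'DejaVu Sans';">&#x07D2;&#x07DE;&#x07CF;</span>
```

On Windows 8 and later, naming Ebrima before the generic keyword would presumably pick up the bundled N'Ko support Peter mentions.)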
(This also brings to mind a policy-related item on computer systems in countries where diverse scripts are used: That on various levels, incentives or regulations should be in place mandating inclusion of relevant fonts to facilitate display of languages/scripts in the country. Back in 2008, I had the experience of not being able to display extended Latin characters of Bambara on new Windows systems in the business center of a major hotel in Bamako, Mali. I'm quite sure that most computer systems in hotels, cyber cafés, government offices, and the few schools that have them in that country, will not be able to display N'Ko or Tifinagh properly, even if extended Latin is by now more widely supported.) Don On 29-09-2014 17:10, Peter Constable wrote: > Don, > > You mention testing IE 8. That's a 5.5-year-old version that shipped > before N'Ko script was supported on any platform. It's interesting > that anything worked. You also mentioned IE11 on Windows 7 but testing > without the Deja Vu fonts. Windows has supported N'Ko since Windows 8. > Did you try testing with that and using the Ebrima font? > > Btw, the text appears to display correctly on my Windows Phone. > > > Peter > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of > dzo at bisharat.net > Sent: Friday, September 26, 2014 8:11 PM > To: unicode at unicode.org > Cc: charles.riley at yale.edu > Subject: Current support for N'Ko > > Some observations concerning N'Ko support in browsers may be of > interest: > > http://niamey.blogspot.com/2014/09/nko-on-web-review-of-experience-with.html > > This is pursuant to reposting a translation in N'Ko of a World Health > Organization FAQ on Ebola.
That translation was one of several > facilitated by Athinkra LLC, and available at > https://sites.google.com/site/athinkra/ebola-faqs > > Don Osborn > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode