From unicode at unicode.org Thu Feb 1 01:03:31 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 1 Feb 2018 08:03:31 +0100
Subject: Internationalised Computer Science Exercises
In-Reply-To: <20180201013858.383c7313@JRWUBU2>
References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2> <20180129085741.6fcf00f8@JRWUBU2> <20180129205305.5d5d202d@JRWUBU2> <20180201013858.383c7313@JRWUBU2>
Message-ID: <...>

2018-02-01 2:38 GMT+01:00 Richard Wordingham via Unicode <unicode at unicode.org>:

> On Wed, 31 Jan 2018 19:45:56 +0100
> Philippe Verdy via Unicode wrote:
>
> > 2018-01-29 21:53 GMT+01:00 Richard Wordingham via Unicode <unicode at unicode.org>:
> >
> > > On Mon, 29 Jan 2018 14:15:04 +0100
> > > <...> was meant to be an example of a searched string. For example,
> > > <..., COMBINING DOT BELOW> contains, under canonical equivalence,
> > > the substring <...>. Your regular expressions would not detect this
> > > relationship.
>
> > My regular expression WILL detect this: scanning the text implies
> > first composing it to "full equivalent decomposition form" (without
> > even reordering it, and possibly recomposing it to NFD) while reading
> > it and buffering it in the forward direction (it just requires the
> > decomposition pairs from the UCD, including those that are "excluded"
> > from NFC/NFD).
>
> No. To find <..., COMBINING DOT BELOW>, you constructed, on
> "Sun, 28 Jan 2018 20:30:44 +0100":
>
> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> ( [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <...>
> | [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <...>
>
> To be consistent, to find <...> you would construct
>
> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> (
> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <...>
> |
> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <...>
> )
>
> (A final ')' got lost between brain and text; I have restored it.)

This was a minor omission: ONLY that final parenthesis was missing, as it
was truncated from its last line, where it was the only character (I don't
know why it was truncated there, but it is easy to restore). You did not
correct anything else.

> However, <U+01D6 LATIN SMALL LETTER U WITH DIAERESIS AND MACRON,
> COMBINING DOT BELOW> decomposes to <U+0075, COMBINING DIAERESIS,
> COMBINING MACRON, COMBINING DOT BELOW>. It doesn't match your regular
> expression, for between COMBINING DIAERESIS and COMBINING DOT BELOW
> there is COMBINING MACRON, for which ccc = above!

And my regexp contained all the necessary asterisks, so yes, it does not
match, because the combining macron blocks the combining dot below and the
combining diaeresis from commuting, and so there's no canonical
equivalence: <...> cannot be matched in any case when searching under
canonical equivalence rules. So this regexp is perfectly correct. No error
at all (except the missing final parenthesis), and my argument remains
valid.

> > The regexp engine will then only process the "fully decomposed" input
> > text to find matches, using the regexp transformed as above (which has
> > some initially "complex" setup to "fully decompose" the initial
> > regexp, but only once, when constructing it, and not while processing
> > the input text, which can then be done straightforwardly, with its
> > full decomposition easily performed on the fly without any additional
> > buffering except a very small look-ahead whose length is never longer
> > than the longest "canonical" decompositions in the UCD, i.e. at most 2
> > code points of look-ahead).
>
> Nitpick: U+1F84 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND
> YPOGEGRAMMENI decomposes to <U+03B1, U+0313, U+0301, U+0345>.
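That nitpick, and the look-ahead bound it corrects, are easy to verify
with Python's standard unicodedata module. A minimal sketch (the printed
figures reflect whatever UCD version is bundled with the interpreter):

    import sys
    import unicodedata

    # Full canonical decomposition (NFD) of U+1F84: four code points.
    nfd = unicodedata.normalize("NFD", "\u1f84")
    print([f"U+{ord(c):04X}" for c in nfd])
    # -> ['U+03B1', 'U+0313', 'U+0301', 'U+0345']

    # Longest NFD expansion of any single code point, i.e. the maximum
    # look-ahead a streaming decomposer needs per input code point:
    print(max(len(unicodedata.normalize("NFD", chr(cp)))
              for cp in range(sys.maxunicode + 1)))
    # -> 4 for current UCD versions, so the buffer stays tiny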
> Conversion to NFD on input only requires a small buffer for natural
> orthographies. I suspect the worst in natural language will come from
> either narrow IPA transcriptions or Classical Greek.

OK, the canonical decompositions may expand to more than 2 code points,
because some canonical decomposition pairs may themselves contain
decomposable pairs, but this is still bounded (4 here). The complete set
of full decompositions from the UCD is well known; it fits in a reasonably
small static table for each version of Unicode (and its size grows very
slowly, only when newly encoded characters are decomposable without
breaking the stability rules about all existing combining characters).
Even if this expands the input text to 4 times its length (in number of
code points), it still requires only a very small input look-ahead buffer.
Very few entries in this table decompose to more than 2 code points, and
this only occurs for the oldest characters in Unicode, notably in Greek,
because there's a **single** case of a combining character that has a
canonical decomposition pair (this comes from the encoding of a combining
character mapped for compatibility from a legacy non-Unicode charset). All
the other pairs are a base character (cc=0), possibly decomposable again
only one time (e.g. Vietnamese Latin letters), plus a single
non-decomposable combining character with cc>0, and so are fully
decomposable to 3 characters: these encoded characters have multiple
diacritics, and are quite rare in the UCD except in the extended Latin
blocks.

> > The automaton is of course the classic NFA used by regexp engines
> > (and not the DFA, which explodes combinatorially for some regexps),
> > but it is still fully deterministic (the current "state" in the
> > automaton is not a single integer for the node number in the
> > traversal graph, but a set of node numbers; and all regexps have a
> > finite number of nodes in the traversal graph, this number being
> > proportional to the length of the regexp, so it does not need a lot
> > of memory, and the size of the current "state" is also fully bounded,
> > never larger than the length of the regexp). Optimizing some
> > contextual parts of the NFA to a DFA is possible (to speed up the
> > matching process and reduce the maximum storage size of the "current
> > state"), but only if it does not cause a growth of the total number
> > of nodes in the traversal graph, or as long as this growth does not
> > exceed some threshold (e.g. not more than 2 or 3 times the regexp
> > size).
>
> In your claim, what is the length of the regexp for searching for ? in
> a trace? Is it 3, or is it about 14? If the former, I am very
> interested in how you do it. If the latter, I would say you already
> have a form of blow up in the way you cater for canonical equivalence.

For searching ?, the transformed regexp is just

[[ [^[[:cc=0:]]] - [[:cc=above:]] ]] * <...>

The NFA traversal graph contains nodes at the locations pointed to below
by apostrophes:

'
' [[ [^[[:cc=0:]]] - [[:cc=above:]] ]]
' *
' <...>
'

It has 5 nodes only (assuming that the regexp engine will compute lookup
tables to build the character classes). When there is a quantifier (like
"*" here) which is not "{1,1}", a node is inserted after each character or
character class it applies to.
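That "set of node numbers" state is the textbook way to run such an NFA
without combinatorial blow-up. A minimal, hypothetical Python sketch of
the idea (an invented node layout, not Philippe's exact representation):

    # Each node is (predicate, successors); predicate is None for
    # epsilon nodes (SOT/EOT and quantifier placeholders). The matcher's
    # "state" is a frontier: a set of node indices.

    def close(nodes, frontier):
        # Follow epsilon edges until the frontier is stable.
        stack, seen = list(frontier), set(frontier)
        while stack:
            i = stack.pop()
            pred, succs = nodes[i]
            if pred is None:
                for j in succs:
                    if j not in seen:
                        seen.add(j)
                        stack.append(j)
        return seen

    def matches(nodes, start, accept, text):
        frontier = close(nodes, {start})
        for ch in text:
            nxt = set()
            for i in frontier:
                pred, succs = nodes[i]
                if pred is not None and pred(ch):
                    nxt.update(succs)
            frontier = close(nodes, nxt)
        return accept in frontier

    # Graph for /a*b/: 0=SOT, 1='a' (loops), 2='b', 3=EOT.
    nodes = [
        (None, [1, 2]),                  # SOT: to the 'a' loop or 'b'
        (lambda c: c == "a", [1, 2]),    # 'a' under * : loop or move on
        (lambda c: c == "b", [3]),       # 'b'
        (None, []),                      # EOT
    ]
    print(matches(nodes, 0, 3, "aaab"))  # True
    print(matches(nodes, 0, 3, "ac"))    # False

The frontier never grows beyond the number of nodes, which is what keeps
the memory bound proportional to the regexp length.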
No node is inserted in the NFA traversal graph for non-capturing
parentheses, but nodes may be inserted for capturing parentheses; and
there are the two nodes representing the start and the end of the regexp
(a node is also inserted for the "^" or "$" context delimiters, to match
the start and end of input lines for regexps using the "multiline" flag,
or the start and end of the input text otherwise; they behave like
character classes and are excluded from the capture).

> Even with the dirty trick of normalising the searched trace for input
> (I wanted the NFA propagation to be defined by the trace - I also
> didn't want to have to worry about the well-formedness of DFAs or
> NFAs), I found that the number of states for a concatenation of regular
> languages of traces was bounded above by the product of the number of
> states

This worst case occurs when each regexp can match a zero-length input
string (i.e. its final node in the traversal graph is a quantifier like
"{0,m}" or "*" or "?" that applies to the whole regexp), or when the
traversal graph is made of parallel branches starting from the same point,
each ending with such a quantifier. The traversal graph needs to resolve
the parentheses and capturing groups so as to combine them into single
quantifier nodes; this transform from the bounded unresolved graph does
not cause it to expand in size (number of nodes), but instead causes it to
be compacted (character classes on links leading to the same target node
can be factorized: compute their intersection, separate it out of each
branch, merge it into a branch of its own, and drop every remaining branch
whose residual character class is empty). This is simple to compute.

For this worst case, you don't generate a product of the two NFA traversal
graphs; you can directly concatenate the graphs, the growth in size
remaining proportional to the total length of the initial regexp, with a
small bounded factor (this factor depends on how you represent each node:
with character classes possibly including SOT and EOT, or without
character classes, using separate nodes for each character and for SOT and
EOT). It does not seem unreasonable to build these character classes and
compute their unions/intersections where necessary across branches when
factorizing them; you just need to take care of the nodes added for
non-{1,1} quantifiers.

The regexp engine can also choose to expand {m,} or {m,n} quantifiers
where m > 1 by concatenating at most m occurrences of the subgraph before
it (it can do that at least once, so that /(a){2,}/ (without capturing
groups for the parentheses here) is treated as if it were /(a)(a)*/; and
if the subgraph for /(a)/ is small (not more than 64 nodes, for example)
it can perform this expansion more times). For example, the NFA traversal
graph for /(a|b){10,}/ is (assuming that a and b are orthogonal subgraphs
and not single characters that could be combined by computing character
classes):

    '
   / \
 'a   'b
   \ /
  '{10,}
    |
    '

It has 5 nodes (including those for SOT and EOT). Expanding it one time
gives:

      '
     / \
   'a   'b
   / \  / \
 'a 'b 'a 'b
   \ /   \ /
 '{9,}  '{9,}
     \   /
      '

It is with this pre-expansion of quantifiers that you see the graph
expand. If you expand /A{m,n}/ one time to /AA{m-1,n-1}/, where the graph
for /A{m,n}/ has (k+2) nodes (including SOT and EOT), then the new graph
will have (2k+2) nodes if n>2, or only (2k) nodes if n=2.
For example, /a{2}/ has 4 nodes (still marked by leading apostrophes
here), and so k=2:

'
|
'a
|
'{2,2}
|
'

You can expand the '{2,2} quantifier one time; this gives a graph with 2k
nodes (not 2k+2, because n=2 in this quantifier, and the expansion finally
drops the remaining {1,1} quantifier), i.e.

'
|
'a
|
'a
|
'

In that case, the expansion does not grow the graph size, because n=2 in
the quantifier and the subgraph on which the quantifier loops is small
enough (just a single character or character class!).

The regexp engine always has the option of precomputing this expansion...
or not. The expansion does not lower the number of "active states" in the
graph; it just allows faster traversal of the graph by avoiding passes
through quantifier nodes (which need counters in their state and require
an additional step). I suggest expanding {m,n} quantifiers no more than
one to four times, and not doing it at all if the subgraph is large (more
than 4 nodes, say). Beyond this, the performance gain is marginal, given
that the graph's size will grow dramatically, you will reduce the locality
of the processor's data caches for the graph itself, and you will need to
allocate more dynamic memory. This tuning should be settled by profiling
your actual implementation of the graph traversal in the matcher (when
scanning the input), and the resources (time and storage space) needed to
compile the regexp into this expanded graph.

From unicode at unicode.org Thu Feb 1 02:19:48 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 1 Feb 2018 09:19:48 +0100
Subject: Internationalised Computer Science Exercises
In-Reply-To: <...>
References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2> <20180129085741.6fcf00f8@JRWUBU2> <20180129205305.5d5d202d@JRWUBU2> <20180201013858.383c7313@JRWUBU2>
Message-ID: <...>

2018-02-01 8:03 GMT+01:00 Philippe Verdy <...>:

> 2018-02-01 2:38 GMT+01:00 Richard Wordingham via Unicode <unicode at unicode.org>:
>
> For example, /a{2}/ has 4 nodes (still marked by leading apostrophes
> here), and so k=2:
>
>  '   <----
>  |   /   \
> 'a   |    |
>  |   ^    |
> '{2,2}    |
>  |   \   /
>  '    --->

I forgot to describe how I represent the graph (when compiling regexps
only). And I forgot the second (looping) link from the quantifier (added
above).

The graph is just a vector of nodes stored as a linear array indexed by
integers. Nodes (with leading apostrophes in the notation above) are
objects with one of 4 types: SOT, character class, quantifier, or EOT.

- The SOT and EOT node types are trivial and have no other properties;
they exist once and only once in every graph, and they can be the same
actual type.

- The quantifier node type has two integer properties, min and max, taking
a positive or null value, or INFINITE for the unbounded quantifiers (e.g.
"+", "*", or "{2,}"), with the constraint that min<=max. INFINITE can be
represented in an integer as -1 or MAXINT. It also has another computed
property, the counter index, i.e. an index into the array of counter
values you'll allocate in the "state" variable you'll use in the matcher.
It has two other properties: the next node number (in the represented
graph) to go to, according to whether the counter has reached the
[min, max] interval or not, and possibly a "greedy" flag to specify which
condition (false or true) you'll evaluate first (instead of this flag you
can create two separate subtypes for greedy and non-greedy quantifiers, to
avoid this test at runtime in the matcher).

- The character class node type can have subtypes: single character, basic
character class (like [abc]), or a more complex character class,
implemented by a character class method "is(c)", where c is the input code
point from the text being scanned. The functional method can evaluate, for
example, Unicode character properties such as "isupper(c)" evaluating the
[:gc=Lu:] character class, or "isdigit(c)" evaluating the [:gc=Nd:]
character class, or isin(c, "string") to detect whether c is present in
the given string containing a list of characters, or isnotin(c, "string")
for the inverse. In JavaScript it is trivial to build such functions on
the fly; in C/C++ you need a representation to call the appropriate method
with some parameters.

You may also have two additional node types for capturing groups: SOC
(start of capturing group) and EOC (end of capturing group), both with a
single property, the capturing group index; you'll allocate a new index
into the array of captured groups to allocate in the "state" variable used
by the matcher. This type of node is unconditionally traversed, but
traversing it just consists of storing the current input position in the
array of captured groups (and making sure that, while running the matcher,
you keep the already scanned input text in a buffer as soon as you've
started capturing any group).

All node types also have an array of output node indexes (this array is
unordered: all of them will be taken simultaneously by a step in the
matcher, so you can segregate node types instead of using polymorphic
nodes, and then store in each node an array of indexes for each output
node type); and then, instead of using a single array of nodes for storing
the graph, use a separate array for quantifier nodes and a separate array
for character class nodes. As SOT will never be part of the output nodes
to link to, you just need a boolean flag to say whether the EOT node is
among the output nodes of a given node in your graph.

From unicode at unicode.org Thu Feb 1 13:20:04 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 1 Feb 2018 19:20:04 +0000
Subject: Internationalised Computer Science Exercises
In-Reply-To: <...>
References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2> <20180129085741.6fcf00f8@JRWUBU2> <20180129205305.5d5d202d@JRWUBU2> <20180201013858.383c7313@JRWUBU2>
Message-ID: <20180201192004.4ab8c9c5@JRWUBU2>

On Thu, 1 Feb 2018 08:03:31 +0100
Philippe Verdy via Unicode wrote:

> 2018-02-01 2:38 GMT+01:00 Richard Wordingham via Unicode <unicode at unicode.org>:
>> On Wed, 31 Jan 2018 19:45:56 +0100
>> Philippe Verdy via Unicode wrote:
>>> 2018-01-29 21:53 GMT+01:00 Richard Wordingham via Unicode <unicode at unicode.org>:
>>>> For example, <..., COMBINING DOT BELOW> contains, under canonical
>>>> equivalence, the substring <...>. Your regular expressions would
>>>> not detect this relationship.
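Relationships of this kind can be tested directly, outside any regex
engine: two strings are canonically equivalent exactly when their NFD
forms are identical. A minimal Python check with the standard unicodedata
module (the sample sequences here are illustrative, not the exact strings
under discussion):

    import unicodedata

    def canonically_equivalent(a, b):
        # Canonical equivalence <=> identical NFD forms.
        return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

    # COMBINING DOT BELOW (ccc=220) and COMBINING DIAERESIS (ccc=230)
    # commute, so these two orders are equivalent:
    print(canonically_equivalent("u\u0323\u0308", "u\u0308\u0323"))  # True

    # A COMBINING MACRON (ccc=230) between them blocks the commutation:
    print(canonically_equivalent("u\u0308\u0304\u0323",
                                 "u\u0304\u0308\u0323"))  # False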
>>> My regular expression WILL detect this: scanning the text implies
>>> first composing it to "full equivalent decomposition form" (without
>>> even reordering it, and possibly recomposing it to NFD) while
>>> reading it and buffering it in the forward direction (it just
>>> requires the decomposition pairs from the UCD, including those that
>>> are "excluded" from NFC/NFD).

>> To be consistent, to find <...> you would construct (i.e. Philippe
>> Verdy would construct)
>>
>> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
>> (
>> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <...>
>> |
>> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <...>
>> )

>> However, <U+01D6 LATIN SMALL LETTER U WITH DIAERESIS AND MACRON,
>> COMBINING DOT BELOW> decomposes to <U+0075, COMBINING DIAERESIS,
>> COMBINING MACRON, COMBINING DOT BELOW>. It doesn't match your regular
>> expression, for between COMBINING DIAERESIS and COMBINING DOT BELOW
>> there is COMBINING MACRON, for which ccc = above!

> And my regexp contained all the necessary asterisks, so yes, it does
> not match, because the combining macron blocks the combining dot below
> and combining diaeresis from commuting, and so there's no canonical
> equivalence.

I'm not sure what you mean by 'commuting' in this case, but either your
statement or your deduction is wrong! Although adjacent characters with
the same non-zero canonical combining class cannot be interchanged, that
does not stop the members of the pair commuting with their neighbour with
a different non-zero ccc whilst preserving canonical equivalence. Thus the
searched string is canonically equivalent to <...>. In your scheme,
assuming you are looking for the most compact match, one should generate
the regular expression:

[[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
(
[[ [^[[:cc=0:]]] - [[:cc=below:]] ]] * <...>
|
[[ [^[[:cc=0:]]] - [[:cc=above:]] ]] * <...>
)

Have you got a program doing this and reporting to you, or did you
assemble the construction by hand? Constructing regular expressions is
known to be tricky. You cannot replace this by a more restrictive albeit
wordier regex as you suggested on Sunday 28 January
(http://www.unicode.org/mail-arch/unicode-ml/y2018-m01/0145.html). There
is no upper bound on the length of matching expressions.

>>> The automaton is of course the classic NFA used by regexp engines
>>> (and not the DFA, which explodes combinatorially for some regexps),
>>> but it is still fully deterministic (the current "state" in the
>>> automaton is not a single integer for the node number in the
>>> traversal graph, but a set of node numbers; and all regexps have a
>>> finite number of nodes in the traversal graph, this number being
>>> proportional to the length of the regexp, so it does not need a lot
>>> of memory, and the size of the current "state" is also fully
>>> bounded, never larger than the length of the regexp). Optimizing
>>> some contextual parts of the NFA to a DFA is possible (to speed up
>>> the matching process and reduce the maximum storage size of the
>>> "current state"), but only if it does not cause a growth of the
>>> total number of nodes in the traversal graph, or as long as this
>>> growth does not exceed some threshold (e.g. not more than 2 or 3
>>> times the regexp size).

>> In your claim, what is the length of the regexp for searching for ?
>> in a trace? Is it 3, or is it about 14? If the former, I am very
>> interested in how you do it. If the latter, I would say you already
>> have a form of blow up in the way you cater for canonical
>> equivalence.
> For searching ?, the transformed regexp is just
>
> [[ [^[[:cc=0:]]] - [[:cc=above:]] ]] * <...>
>
> The NFA traversal graph contains nodes at the locations pointed to
> below by apostrophes:
>
> '
> ' [[ [^[[:cc=0:]]] - [[:cc=above:]] ]]
> ' *
> ' <...>
> '
>
> It has 5 nodes only (assuming that the regexp engine will compute
> lookup tables to build the character classes).

I asked about ???, not ???. Anyway, you've effectively answered the
question. You're talking about regular expressions for strings, not
regular expressions for traces.

> When there is a quantifier (like "*" here) which is not "{1,1}", a
> node is inserted after each character or character class it applies
> to. No node is inserted in the NFA traversal graph for non-capturing
> parentheses, but nodes may be inserted for capturing parentheses; and
> there are the two nodes representing the start and the end of the
> regexp (a node is also inserted for the "^" or "$" context delimiters,
> to match the start and end of input lines for regexps using the
> "multiline" flag, or the start and end of the input text otherwise;
> they behave like character classes and are excluded from the capture).

Quantifier nodes like {2,4} probably break down for non-deterministic
expressions like (a(bc)?|b){2,4}. The string "ab" contains 2 iterations,
but the longer string "abc" contains one iteration.

>> Even with the dirty trick of normalising the searched trace for input
>> (I wanted the NFA propagation to be defined by the trace - I also
>> didn't want to have to worry about the well-formedness of DFAs or
>> NFAs), I found that the number of states for a concatenation of
>> regular languages of traces was bounded above by the product of the
>> number of states

> For this worst case, you don't generate a product of the two NFA
> traversal graphs; you can directly concatenate the graphs, the growth
> in size remaining proportional to the total length of the initial
> regexp, with a small bounded factor (this factor depends on how you
> represent each node: with character classes possibly including SOT and
> EOT, or without character classes, using separate nodes for each
> character and for SOT and EOT). It does not seem unreasonable to build
> these character classes and compute their unions/intersections where
> necessary across branches when factorizing them; you just need to take
> care of the nodes added for non-{1,1} quantifiers.

Concatenation doesn't work with traces when the first expression can end
in non-starters and the second expression can begin with them. This is a
real issue when parsing sequences of characters for grammatical
correctness, which is why I got interested in traces.

A regular trace expression of the form

[:ccc=1:][:ccc=2:]...[:ccc=n:]

seems to require 2^n states in your scheme. As I effectively only apply
the regex to NFD input strings, I use fewer states. However, the
efficiency of my scheme depends on the order of the commuting factors -
reverse order would require the 2^n states.

Richard.
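The commuting behind that 2^n figure is easy to observe concretely: marks
with pairwise distinct non-zero ccc values all commute, so every ordering
of them is canonically equivalent to the same NFD form, and a string-based
regex honouring canonical equivalence must accept all n! orders. A small
Python illustration (the three marks are just convenient distinct-ccc
examples, not drawn from the thread):

    import unicodedata
    from itertools import permutations

    # Three combining marks with pairwise distinct non-zero ccc values:
    marks = ["\u0316", "\u031B", "\u0301"]   # ccc 220, 216, 230
    for m in marks:
        print(f"U+{ord(m):04X} ccc={unicodedata.combining(m)}")

    # Every permutation normalizes to the same NFD string:
    forms = {unicodedata.normalize("NFD", "a" + "".join(p))
             for p in permutations(marks)}
    print(len(forms))  # 1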
From unicode at unicode.org Thu Feb 1 17:45:07 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 1 Feb 2018 23:45:07 +0000
Subject: Internationalised Computer Science Exercises - Correction
In-Reply-To: <20180201013858.383c7313@JRWUBU2>
References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2> <20180129085741.6fcf00f8@JRWUBU2> <20180129205305.5d5d202d@JRWUBU2> <20180201013858.383c7313@JRWUBU2>
Message-ID: <20180201234507.27831346@JRWUBU2>

On Thu, 1 Feb 2018 01:38:58 +0000
Richard Wordingham via Unicode wrote:

> I believe the concurrent star of a language A is (|A|)*, where
>
> |A| = {x ∈ A : {x}* is a regular language}
>
> (The definition works for the trace of fully decomposed Unicode
> character strings under canonical equivalence.)

I misremembered. The notation is (/A/)*, where starter-free x ∈ A is
dealt with by converting x into its maximal substrings all of whose
characters are of the same canonical combining class and putting them in
/A/ in place of x.

> Concurrent star is not a perfect generalisation. If ab = ba, then
> X = {aa, ab, b} has the annoying property that X* is a regular trace
> language, but |X|* is a proper subset of X*. For Unicode, X would be
> a rather unusual regular language.

So this is the other way round. We will get /X/ = {aa, a, b}, so X* is a
proper subset of /X/*.

Richard.

From unicode at unicode.org Mon Feb 5 15:37:30 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 5 Feb 2018 21:37:30 +0000
Subject: Internationalised Computer Science Exercises
In-Reply-To: <20180201192004.4ab8c9c5@JRWUBU2>
References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2> <20180129085741.6fcf00f8@JRWUBU2> <20180129205305.5d5d202d@JRWUBU2> <20180201013858.383c7313@JRWUBU2> <20180201192004.4ab8c9c5@JRWUBU2>
Message-ID: <20180205213730.3abe8ce1@JRWUBU2>

On Thu, 1 Feb 2018 19:20:04 +0000
Richard Wordingham via Unicode wrote:

> A regular trace expression of the form
>
> [:ccc=1:][:ccc=2:]...[:ccc=n:]
>
> seems to require 2^n states in your scheme. As I effectively only
> apply the regex to NFD input strings, I use fewer states. However,
> the efficiency of my scheme depends on the order of the commuting
> factors - reverse order would require the 2^n states.

I've overstated the compactness of my scheme. Firstly, I split the state
for an optionally final matched character into two states according to
whether it is to be the final character or not. Secondly, the DFA for a
Unicode character is quite large. I've kept it simple and identify most
states by the matched Unicode character, which means I have nearly a
million states, whereas I could probably whittle it down to more like a
thousand or so, at a vast increase in complexity.

Richard.

From unicode at unicode.org Wed Feb 7 17:47:11 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 7 Feb 2018 23:47:11 +0000
Subject: What is U+0E46 THAI CHARACTER MAIYAMOK?
Message-ID: <20180207234712.63df3470@JRWUBU2>

I am having trouble identifying just what is represented by ๆ U+0E46
THAI CHARACTER MAIYAMOK. My problem is that the grammatical texts that I
have state that when the Thai punctuation mark mai yamok (ไม้ยมก) is used
with words, it is flanked by spaces, a position reiterated by the Thai
Wikipedia entry on the mark at http://th.wikipedia.org/wiki/ไม้ยมก. It is
not clear to me whether the Unicode character includes those spaces or
not.
I have encountered fonts whose glyph for U+0E46 has so much space on the
left that I believe it is intended to give the appearance of a preceding
space. The glyphs in the reference chart appear to be centred, so I cannot
tell whether spaces are incorporated. It does appear that those who
believe U+0E46 is flanked by spaces between words omit the following space
before Western punctuation marks. So, does U+0E46 include either of those
flanking spaces, and, if so, which?

A related question is whether dictionary items like "?????? ๆ", which
lacks a corresponding simplex "??????", constitute a single word in any
sense relevant to Unicode, the CLDR or ICU. I think a spell-checker will
work better if they do.

Richard.

From unicode at unicode.org Wed Feb 7 22:16:21 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Wed, 7 Feb 2018 20:16:21 -0800
Subject: What is U+0E46 THAI CHARACTER MAIYAMOK?
In-Reply-To: <20180207234712.63df3470@JRWUBU2>
References: <20180207234712.63df3470@JRWUBU2>
Message-ID: <...>

In the example, "?????? ๆ", there's a space character in the text, which
seems right. There's no space between MAIYAMOK and the closing quotation
mark, which also seems right. If a font included extra spacing around
MAIYAMOK, the display of something like...

THAI CHARACTER MAIYAMOK (ๆ)

...would be off, I'd think.

> ... when the Thai punctuation mark mai yamok (ไม้ยมก)
> is used with words, it is flanked by spaces, ...

Is there a contrasting use where this mark is not used with words? Maybe
numbers?

From unicode at unicode.org Wed Feb 7 23:23:06 2018
From: unicode at unicode.org (Theppitak Karoonboonyanan via Unicode)
Date: Thu, 8 Feb 2018 12:23:06 +0700
Subject: What is U+0E46 THAI CHARACTER MAIYAMOK?
In-Reply-To: <...>
References: <20180207234712.63df3470@JRWUBU2> <...>
Message-ID: <...>

On Thu, Feb 8, 2018 at 11:16 AM, James Kass via Unicode wrote:

> In the example, "?????? ๆ", there's a space character in the text,
> which seems right. There's no space between MAIYAMOK and the closing
> quotation mark, which also seems right. If a font included extra
> spacing around MAIYAMOK, the display of something like...
> THAI CHARACTER MAIYAMOK (ๆ)
> ...would be off, I'd think.

I think the thin space in the glyph is a hack, not the norm.

The regulation as defined by the Thai Royal Institute is to use a space
or thin space before MAIYAMOK, and, if it's followed by a word, to use
another space after. But current uses are not consistent. For example:

- ????????? (without space at all)
- ???? ๆ ???? (with space before and after, as regulated)
- ????? ???? (without space before, but with one after)

The argument for not using a space before MAIYAMOK is that most line
break algorithms will break the line before it, tearing it apart from its
associated word, which is undesirable. To mitigate this, while fulfilling
the regulation when printed, some font creators hack the leading space
into the glyph and suggest that their users not prepend a space before
MAIYAMOK at all. But the hack also affects people who follow the
regulation, as they get too wide a space between the word and MAIYAMOK.

An apparent way to do it properly is to use NBSP before MAIYAMOK and a
normal space after, and not to include any leading space in the glyph,
but it seems inconvenient to input NBSP in common text editors.

> > ... when the Thai punctuation mark mai yamok (ไม้ยมก)
> > is used with words, it is flanked by spaces, ...
>
> Is there a contrasting use where this mark is not used with words?
> Maybe numbers?

None. MAIYAMOK is only used with words by definition.
Regards,
--
Theppitak Karoonboonyanan
http://linux.thai.net/~thep/

From unicode at unicode.org Thu Feb 8 00:02:28 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 7 Feb 2018 22:02:28 -0800
Subject: What is U+0E46 THAI CHARACTER MAIYAMOK?
In-Reply-To: <...>
References: <20180207234712.63df3470@JRWUBU2> <...>
Message-ID: <68e38377-bd82-a6f8-73ed-dacc9a831f52@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <...>

From unicode at unicode.org Thu Feb 8 00:20:15 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Wed, 7 Feb 2018 22:20:15 -0800
Subject: What is U+0E46 THAI CHARACTER MAIYAMOK?
In-Reply-To: <68e38377-bd82-a6f8-73ed-dacc9a831f52@ix.netcom.com>
References: <20180207234712.63df3470@JRWUBU2> <68e38377-bd82-a6f8-73ed-dacc9a831f52@ix.netcom.com>
Message-ID: <...>

Asmus Freytag wrote,

> Any text editor that has the ability to handle
> slightly more complex input scenarios could be
> programmed to convert SP to NBSP before MAIYAMOK.

Yes. If I were developing a Thai text editor it would also globally
replace any instances of SPACE + MAIYAMOK with NBSP + MAIYAMOK upon
File-Save automatically. (To handle updating older files and copy-pasting
text from external sources.)

From unicode at unicode.org Thu Feb 8 01:57:37 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 8 Feb 2018 07:57:37 +0000
Subject: What is U+0E46 THAI CHARACTER MAIYAMOK?
In-Reply-To: <68e38377-bd82-a6f8-73ed-dacc9a831f52@ix.netcom.com>
References: <20180207234712.63df3470@JRWUBU2> <68e38377-bd82-a6f8-73ed-dacc9a831f52@ix.netcom.com>
Message-ID: <20180208075737.415863bc@JRWUBU2>

On Wed, 7 Feb 2018 22:02:28 -0800
Asmus Freytag via Unicode wrote:

> On 2/7/2018 9:23 PM, Theppitak Karoonboonyanan via Unicode wrote:
> > An apparent way to do it properly is to use NBSP before
> > MAIYAMOK and a normal space after, and not to include
> > any leading space in the glyph, but it seems inconvenient
> > to input NBSP in common text editors.
>
> Any text editor that has the ability to handle slightly more complex
> input scenarios could be programmed to convert SP to NBSP before
> MAIYAMOK.
>
> A./

For any compliant tailorable implementation of the Unicode line-breaking
algorithm, the correct method is for U+0E46 to be tailored to have
line_break=exclamation. (U+0021 EXCLAMATION MARK is often offset by a
space in Thai.) I know it works in ICU Version 53; I haven't tested later
versions.

Although NBSP is available in Windows-874 and IANA-registered tis620 (as
0xA0), it is not available in TIS-620, the national 8-bit standard.

What of a word break between a letter and MAIYAMOK in text tagged as
Thai? Should it be never, always or sometimes? Should it depend on
whether there is an intermediate space?

Richard.

From unicode at unicode.org Thu Feb 8 01:59:14 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 8 Feb 2018 07:59:14 +0000
Subject: What is U+0E46 THAI CHARACTER MAIYAMOK?
In-Reply-To: <...>
References: <20180207234712.63df3470@JRWUBU2> <...>
Message-ID: <20180208075914.0e8e5609@JRWUBU2>

On Wed, 7 Feb 2018 20:16:21 -0800
James Kass via Unicode wrote:

> Is there a contrasting use where this mark is not used with words?
> Maybe numbers?

The only other use I've seen is quotation of the mark - putting it in
parentheses seems quite common.

Richard.
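The SP-to-NBSP conversion that Asmus and James describe is a one-line
transform over the buffer. A minimal sketch in Python (the save-hook
framing and function name are invented for illustration):

    MAIYAMOK = "\u0E46"  # Thai character mai yamok
    NBSP = "\u00A0"

    def normalize_maiyamok_spacing(text: str) -> str:
        # Keep the space the regulation calls for, but make it
        # non-breaking so MAIYAMOK is never torn from its word.
        return text.replace(" " + MAIYAMOK, NBSP + MAIYAMOK)

    # e.g. run over the whole buffer on File-Save:
    assert normalize_maiyamok_spacing("\u0E19\u0E32\u0E19 \u0E46") == \
           "\u0E19\u0E32\u0E19\u00A0\u0E46"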
From unicode at unicode.org Fri Feb 9 14:31:11 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Fri, 9 Feb 2018 21:31:11 +0100 (CET)
Subject: Cross-Locale Keyboard Features for the General Public
Message-ID: <5529894.24216.1518208271272.JavaMail.www@wwinf1c20>

Approx. 400 or more subscribers of Unicode Public happen not to be
subscribed to CLDR-Users. Now there is a thread that might be of some
interest also to non-CLDR users. It's about some main functionalities of
keyboards intended for many locales, not about the specific details of a
particular locale tailoring. Please take a look at:

http://unicode.org/pipermail/cldr-users/2018-February/000731.html

to learn more if interested.

Regards,

Marcel

From unicode at unicode.org Sun Feb 11 17:26:49 2018
From: unicode at unicode.org (Pierpaolo Bernardi via Unicode)
Date: Mon, 12 Feb 2018 00:26:49 +0100
Subject: Fwd: Unicode Emoji 11.0 characters now final for 2018
In-Reply-To: <...>
References: <5A7B4E41.7000503@unicode.org>
Message-ID: <...>

Enthusiastic reactions to the new emoji announcement:

https://xkcd.com/1953/

PB

On Wed, Feb 7, 2018 at 8:06 PM, <...> wrote:
> Emoji 11.0 data has been released, with 157 new emoji such as:

From unicode at unicode.org Tue Feb 13 02:34:35 2018
From: unicode at unicode.org (Shriramana Sharma via Unicode)
Date: Tue, 13 Feb 2018 14:04:35 +0530
Subject: Emoji blooper
Message-ID: <...>

Recently sent this message to a friends list:

????????????

Apparently one font has the trumpet facing left and one has it facing
right! So before hitting Send in GMail's web interface, the text appeared
fine, but after doing so, in my browser it is showing as if the music is
emanating from the back of the trumpet! LOL.

--
Shriramana Sharma

From unicode at unicode.org Tue Feb 13 02:39:56 2018
From: unicode at unicode.org (Shriramana Sharma via Unicode)
Date: Tue, 13 Feb 2018 14:09:56 +0530
Subject: Emoji blooper
In-Reply-To: <...>
References: <...>
Message-ID: <...>

To illustrate...

--
Shriramana Sharma

-------------- next part --------------
A non-text attachment was scrubbed...
Name: trumpet-left+wrong.png
Type: image/png
Size: 2065 bytes
Desc: not available
URL: <...>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trumpet-right-on-both-counts.png
Type: image/png
Size: 1757 bytes
Desc: not available
URL: <...>

From unicode at unicode.org Tue Feb 13 13:27:48 2018
From: unicode at unicode.org (Markus Scherer via Unicode)
Date: Tue, 13 Feb 2018 11:27:48 -0800
Subject: Emoji blooper
In-Reply-To: <...>
References: <...>
Message-ID: <...>

On my machine (Chromebox+Gmail), the trumpets point down to the lower
left. If you want to convey precise images, then send images...

markus

From unicode at unicode.org Wed Feb 14 02:53:28 2018
From: unicode at unicode.org (Erik Pedersen via Unicode)
Date: Wed, 14 Feb 2018 00:53:28 -0800
Subject: Why so much emoji nonsense?
Message-ID: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca>

Dear Unicode Digest list members,

Emoji, in my opinion, are almost entirely outside the scope of the
Unicode project. Unlike text composed of the world's traditional
alphabetic, syllabic, abugida or CJK characters, emoji convey no
utilitarian and unambiguous information content. Let us, therefore,
abandon Emoji support in Unicode as a project that failed.
If corporations want to maintain support for Emoji, let's require them to
use only the Private Use Area and, henceforth, confine Unicode expansion
to attested characters from so far unsupported scripts.

Kind regards,

Erik Bjørn Pedersen
Victoria, B.C., Canada

From unicode at unicode.org Wed Feb 14 03:18:51 2018
From: unicode at unicode.org (David Starner via Unicode)
Date: Wed, 14 Feb 2018 09:18:51 +0000
Subject: Why so much emoji nonsense?
In-Reply-To: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca>
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca>
Message-ID: <...>

On Wed, Feb 14, 2018 at 12:55 AM Erik Pedersen via Unicode <unicode at unicode.org> wrote:

> Emoji, in my opinion, are almost entirely outside the scope of the
> Unicode project. Unlike text composed of the world's traditional
> alphabetic, syllabic, abugida or CJK characters, emoji convey no
> utilitarian and unambiguous information content. Let us, therefore,
> abandon Emoji support in Unicode as a project that failed. If
> corporations want to maintain support for Emoji, let's require them to
> use only the Private Use Area and, henceforth, confine Unicode
> expansion to attested characters from so far unsupported scripts.

Because ' has so much unambiguous information content. Or even just c.
(What's the phonetic value of that letter? Okay, I'll be "easy" on you:
what's the phonetic value of that letter in English? What about e?)

Also, who are the full members of Unicode?
http://www.unicode.org/consortium/members.html says Google, Apple,
Huawei, Facebook, Microsoft, etc. By show of hands, who wants a
substantial part of the user's data to become incompatible? I think they
just voted this down.

Even ignoring that, this road has been crossed. Unicode will not tear out
anything, but if they could, people could probably survive Cuneiform or
Linear A going by the wayside. A not insubstantial part of the Unicode
data in the world includes emoji, and removing it would break everything.
Like many standards before it that made radical changes, a new Unicode
standard without emoji would be dead in the water, and someone else would
create a competing back-compatible character standard and everyone would
forget about Unicode™ and start using The One CCS™.

It's like demanding that C use bounds checking on its arrays, or that
"island" go back to being spelled "iland" now that we recognize it's not
related to "isle". Even if mistakes were made, they were carved into
stone, and going back is not an option.

From unicode at unicode.org Wed Feb 14 05:23:06 2018
From: unicode at unicode.org (Konstantin Ritt via Unicode)
Date: Wed, 14 Feb 2018 14:23:06 +0300
Subject: Why so much emoji nonsense?
In-Reply-To: <...>
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <...>
Message-ID: <...>

2018-02-14 12:18 GMT+03:00 David Starner via Unicode:

> Even if mistakes were made, they were carved into stone, and going
> back is not an option.

Sure. However, that doesn't mean Unicode should keep adding more and more
emoji nonsense. A billion cat faces, pile of poo, * skin tone
Santa/vampire/superwoman/levitating man, keycaps and clocks - are they
really that important for the Standard to be encoded separately?! Well,
that was a rhetorical question...

Regards,
Konstantin
URL: From unicode at unicode.org Wed Feb 14 07:25:50 2018 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Wed, 14 Feb 2018 18:55:50 +0530 Subject: Why so much emoji nonsense? In-Reply-To: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: >From a mail which I had sent to two other Unicode contributors just a few days ago: Frankly I agree that this whole emoji thing is a Pandora box. It should have been restricted to emoticons to express facial or physical gestures which are insufficiently representable by words. When it starts representing objects like ???? then it becomes a problem as to where to draw the line. I mean I can see the argument for ?? representing gratitude, but which fruits are valid and which not... And which food items are valid and which not, else you would get proposals for idli and dosa emojis as well! (Those who don't know what those are see https://en.wikipedia.org/wiki/Idli and https://en.wikipedia.org/wiki/Dosa) It seems to me that graphical items previously rejected as such are now being encoded. I mean, if other things like bat ball etc then "why not this one" cannot be refused, but the question is whether encoding bat ball in the first place was keeping with the original intention or spirit of Unicode. Anyhow, what is done is done and the Pandora's box is now open and I don't envy the ESC their job. I don't know, maybe sometimes they may just feel like hitting "ESC" too! -- Shriramana Sharma ???????????? ???????????? ???????????????????????? From unicode at unicode.org Wed Feb 14 10:14:06 2018 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Wed, 14 Feb 2018 21:44:06 +0530 Subject: UNICODE vehicle vanity registration? Message-ID: Given that in the US vanity vehicle registrations with arbitrary alphanumeric sequences upto 7 characters are permitted (I am correct I hope?), I wonder who (here?) owns the UNICODE registration? -- Shriramana Sharma ???????????? ???????????? ???????????????????????? From unicode at unicode.org Wed Feb 14 10:24:37 2018 From: unicode at unicode.org (Stephane Bortzmeyer via Unicode) Date: Wed, 14 Feb 2018 17:24:37 +0100 Subject: UNICODE vehicle vanity registration? In-Reply-To: References: Message-ID: <20180214162437.zaxd3zqisv5dz3cu@nic.fr> On Wed, Feb 14, 2018 at 09:44:06PM +0530, Shriramana Sharma via Unicode wrote a message of 6 lines which said: > Given that in the US vanity vehicle registrations with arbitrary > alphanumeric sequences upto 7 characters are permitted (I am correct > I hope?), I wonder who (here?) owns the UNICODE registration? Won't work in New York, unfortunately https://dmv.ny.gov/learn-about-personalized-plates "A character is a letter (A-Z), number (0-9) or space. Each space counts as one character." From unicode at unicode.org Wed Feb 14 10:29:53 2018 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Wed, 14 Feb 2018 21:59:53 +0530 Subject: UNICODE vehicle vanity registration? In-Reply-To: References: <20180214162437.zaxd3zqisv5dz3cu@nic.fr> Message-ID: Sorry but "UNICODE" does fit within those rules doesn't it? On 14-Feb-2018 21:54, "Stephane Bortzmeyer" wrote: On Wed, Feb 14, 2018 at 09:44:06PM +0530, Shriramana Sharma via Unicode wrote a message of 6 lines which said: > Given that in the US vanity vehicle registrations with arbitrary > alphanumeric sequences upto 7 characters are permitted (I am correct > I hope?), I wonder who (here?) owns the UNICODE registration? 
Won't work in New York, unfortunately https://dmv.ny.gov/learn-about-personalized-plates "A character is a letter (A-Z), number (0-9) or space. Each space counts as one character." -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Feb 14 10:32:42 2018 From: unicode at unicode.org (Andrew West via Unicode) Date: Wed, 14 Feb 2018 16:32:42 +0000 Subject: UNICODE vehicle vanity registration? In-Reply-To: <20180214162437.zaxd3zqisv5dz3cu@nic.fr> References: <20180214162437.zaxd3zqisv5dz3cu@nic.fr> Message-ID: You can use ????? in California. Someone has U+1F913 ?? ( https://www.instagram.com/p/BVYtIHensDu/) Andrew On 14 February 2018 at 16:24, Stephane Bortzmeyer via Unicode < unicode at unicode.org> wrote: > On Wed, Feb 14, 2018 at 09:44:06PM +0530, > Shriramana Sharma via Unicode wrote > a message of 6 lines which said: > > > Given that in the US vanity vehicle registrations with arbitrary > > alphanumeric sequences upto 7 characters are permitted (I am correct > > I hope?), I wonder who (here?) owns the UNICODE registration? > > Won't work in New York, unfortunately > > https://dmv.ny.gov/learn-about-personalized-plates > > "A character is a letter (A-Z), number (0-9) or space. Each space > counts as one character." > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Feb 14 10:34:15 2018 From: unicode at unicode.org (Stephane Bortzmeyer via Unicode) Date: Wed, 14 Feb 2018 17:34:15 +0100 Subject: UNICODE vehicle vanity registration? In-Reply-To: References: <20180214162437.zaxd3zqisv5dz3cu@nic.fr> Message-ID: <20180214163415.xbsktiqut4xskzfa@nic.fr> On Wed, Feb 14, 2018 at 09:59:53PM +0530, Shriramana Sharma wrote a message of 54 lines which said: > Sorry but "UNICODE" does fit within those rules doesn't it? I doubt that the Departement of Motor Vehicles will accept "but it is in category Ll" as a good reason :-) From unicode at unicode.org Wed Feb 14 11:15:44 2018 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Wed, 14 Feb 2018 17:15:44 +0000 Subject: UNICODE vehicle vanity registration? In-Reply-To: References: <20180214162437.zaxd3zqisv5dz3cu@nic.fr> Message-ID: <4087B8B9-7DF2-457D-ACD5-019F8E411347@alastairs-place.net> On 14 Feb 2018, at 16:29, Shriramana Sharma via Unicode wrote: > > Sorry but "UNICODE" does fit within those rules doesn't it? Yes. Stephane has misunderstood. (Shriramana meant the literal text ?UNICODE?, which is indeed composed of letters A-Z and meets the definition quoted.) I?d hope that Mark Davis has ?UNICODE? on his car. However, I?m not sure how relevant it really is to this mailing list. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Wed Feb 14 11:45:27 2018 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Wed, 14 Feb 2018 17:45:27 +0000 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> On 14 Feb 2018, at 13:25, Shriramana Sharma via Unicode wrote: > > From a mail which I had sent to two other Unicode contributors just a > few days ago: > > Frankly I agree that this whole emoji thing is a Pandora box. It > should have been restricted to emoticons to express facial or physical > gestures which are insufficiently representable by words. When it > starts representing objects like ???? then it becomes a problem as to > where to draw the line. 
A lot of the emoji were encoded because they were in use on Japanese
mobile phones. A fair proportion of those may very well not meet the
selection factors (see <...>) required for new emoji, but they were
definitely within the scope of the Unicode project, as encoding them
provides interoperability.

As for newer emoji, whether they are encoded or not is up to the UTC,
and, as I say, they apply (or are supposed to apply) the criteria on the
"Submitting Emoji Proposals" page.

There is certainly an argument that the encoding of new emoji should be
discouraged in favour of functionality at higher layers (e.g. tags in
HTML), but, honestly, I think that ship has probably sailed. Similarly
there are, I think, good reasons to object to the skin tone and gender
modifiers, but we've already opened that can of worms and so will now
have to put up with demands for red hair (or, quite probably, freckles,
monobrows, different hats, hair, beard and moustache styles and so on).

Kind regards,

Alastair.

--
http://alastairs-place.net

From unicode at unicode.org Wed Feb 14 12:37:01 2018
From: unicode at unicode.org (Shriramana Sharma via Unicode)
Date: Thu, 15 Feb 2018 00:07:01 +0530
Subject: UNICODE vehicle vanity registration?
In-Reply-To: <4087B8B9-7DF2-457D-ACD5-019F8E411347@alastairs-place.net>
References: <20180214162437.zaxd3zqisv5dz3cu@nic.fr> <4087B8B9-7DF2-457D-ACD5-019F8E411347@alastairs-place.net>
Message-ID: <...>

On 14-Feb-2018 22:45, "Alastair Houghton" wrote:

> I'd hope that Mark Davis has "UNICODE" on his car. However, I'm not
> sure how relevant it really is to this mailing list.

You're right. My apologies. It *is* somewhat OT to the actual purpose of
this list. But I figured if anyone knew the answer to my question they'd
be here.

From unicode at unicode.org Wed Feb 14 13:14:22 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Wed, 14 Feb 2018 11:14:22 -0800
Subject: Why so much emoji nonsense?
In-Reply-To: <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net>
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net>
Message-ID: <...>

Alastair Houghton wrote,

> ...but they were definitely within the scope of the
> Unicode project, as encoding them provides interoperability.

That's one way of looking at it. Another way would be that the emoji were
definitely outside the scope of the Unicode project, as encoding them
violated Unicode's initial encoding principles. The opposition was
strong, but resistance was futile.

Anyone interested in the arguments made at the time should check the
Unicode public list archives in late 2008 and early 2009. Here's the link
for January 2009:

http://www.unicode.org/mail-arch/unicode-ml/y2009-m01/index.html

Surprisingly, though, I have found at least one roundabout use for the
emoji. When reading message boards and comment pages I've found that it's
quite simple to skip any messages which are peppered with emoji without
missing anything of substance.

As far as interoperability goes, there are scads of emoji in the wild
which aren't currently in Unicode. Every kind of hobby or interest seems
to generate emoji specific to that area of interest.

From unicode at unicode.org Wed Feb 14 13:50:01 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Wed, 14 Feb 2018 11:50:01 -0800
Subject: Why so much emoji nonsense?
In-Reply-To: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca>
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca>
Message-ID: <...>

On 2/14/2018 12:53 AM, Erik Pedersen via Unicode wrote:

> Unlike text composed of the world's traditional alphabetic, syllabic,
> abugida or CJK characters, emoji convey no utilitarian and unambiguous
> information content.

I think this represents a misunderstanding of the function of emoji in
written communication, as well as a rather narrow concept of how writing
systems work and why they have evolved.

RECALLTHATWHENALPHABETSWEREFIRSTINVENTEDPEOPLEWROTETEXTLIKETHIS

The invention and development of word spacing, punctuation, and casing,
among other elements of typography, represent the addition of meta-level
information to written communication that assists in legibility, helps
identify lexical and syntactic units, conveys prosody, and carries other
information that is not well conveyed by simply setting down letters of
an alphabet one right after the other.

Emoticons were invented, in large part, to fill another major hole in
written communication -- the need to convey emotional state and affective
attitudes towards the text. This is the kind of information that
face-to-face communication has a huge and evolutionarily deep bandwidth
for, but at which written communication typically fails miserably. Just
adding a little happy face :-) or sad face :-( to a short email manages
to convey some affect much more easily and effectively than adding on
entire paragraphs trying to explain how one feels about what was just
said. Novelists have the skill to do that in text without using little
pictographic icons, but most of us are not professional writers! Note
that emoticons were invented almost as soon as people started
communicating in digital mediums like email -- so they long predate
anything Unicode came up with.

Other kinds of emoji that we've been adding recently may have a somewhat
more uncertain trajectory, but the ones that seem to be most successful
are precisely those which manage to connect emotionally with people, and
which assist them in conveying how they *feel* about what they are
writing.

So I would suggest that people not just dismiss (or diss) this ongoing
phenomenon. Emoji are widely used for many good reasons. And of course,
like any other aspect of writing, they get misused in various ways as
well. But you can be sure that their impact on the evolution of world
writing is here to stay and will be the topic of serious scholarly papers
by scholars of writing for decades to come. ;-)

--Ken

From unicode at unicode.org Wed Feb 14 14:49:57 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 14 Feb 2018 21:49:57 +0100
Subject: Why so much emoji nonsense?
In-Reply-To: <...>
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <...>
Message-ID: <...>

2018-02-14 20:50 GMT+01:00 Ken Whistler via Unicode <unicode at unicode.org>:

> On 2/14/2018 12:53 AM, Erik Pedersen via Unicode wrote:
>
> > Unlike text composed of the world's traditional alphabetic,
> > syllabic, abugida or CJK characters, emoji convey no utilitarian and
> > unambiguous information content.
>
> I think this represents a misunderstanding of the function of emoji in
> written communication, as well as a rather narrow concept of how
> writing systems work and why they have evolved.
>
> RECALLTHATWHENALPHABETSWEREFIRSTINVENTEDPEOPLEWROTETEXTLIKETHIS
The concept of vowels as distinctive letters came later, even the letter A was initially a representation of a glottal stop consonnant, sometimes mute, only written to indicate a word that did not start by a consonnant in their first syllable, letter. This has survived today in abjads and abugidas where vowels became optional diacritics, but that evolved as plain diacritics in Indic abugidas. The situation is even more complex because clusters of consonnants were also represented in early vowel-less alphabets to represent full syllables (this has formed the base of todays syllabaries when only some glyph variants of the base consonnant was introduced to distinguish their vocalization; Indic abugidas with their complex clusters where vowel diacritic create contextual variant forms of the base consonnant is also a remnant of this old age): the separation of phonetic consonnants came only later. Today's alphabets have a long history of evolution and adaptation to new needs for more precise communication and easier distinctions in languages that have also evolved; some new letters or diacritics were progressively abandonned, and but as the historic alphabets have persisted, then came the concept of digrams to represent a single sound by multiple letters, instead of inventing a new letter or diacritic, because the language in which these digrams were used almost never needed the phonetic letter pairs or their phonology (or such letter pair was too rarely needed that such use of digrams did not make the text undecipherable given the context of use). Over time the alphabets became less and less representative of the phonology (which evolved more rapidly than orthographies for texts that languages wanted to preserve, or because various local phonetic variants of the languages could stil lremain unified by keeping mute letters or letters representing sounds realized differently across regions). The invention of bicameral scripts later allowed easier distinction or reading when contextual forms could be used to emphasize the structure without necessarily using punctuation signs (the lowercase letters came from handwriting, because the initial engraved letters were to difficult to trace with a plum or pencil: letters were joined). Punctuation signs came later which could have deprecated the use of bicameral orthography, but languages have constinued to borrow terms from other languages, and the bicameral distinction became important to preserve. The invention of printing also produced artefacts in the orthography by the adoption of many abbreviation signs (because the paper or parchemins were expensive), and forced some simplifications of the handwritten style with a plum or pencil. Our recent age of computers (or even before the mechanical typewritters) have also dramatically simplified the alphabets because the character set was severely reduced by limitations of the initial technologies (this could have potentially killed all the abjads, abugidas, syllabaries or ideo-phonographic scripts during the 20th century, if there was not a popular resistance to preserve the culture of the initial texts written by humans, and notably the precious religious books): it is still difficult today to preserve many of the non-alphabetic scripts, and there's also difficulties to preserve the meaning diacritics in abjads and abugidas and even in alphabets, as well as bicameral distinctions. 
Finally, the preservation of letters inherited from etymology, to allow readers to infer semantics from words, is difficult: this is the well-known problem of orthographic reforms, which tend to remove mute letters, remove some phonetic distinctions in letters, and infer more and more of the semantics from the context: we are in fact slowly returning to the old age of: RCLLTHTWHNLPHBTSWRFRSTNVNTDPPLWRTTXTLKTHS ! And the use (or abuse) of emojis is returning us to the prehistory when people drew animals on the walls of caverns: this was a very slow form of communication, conveying no rich semantics, full of ambiguities about what was really meant, and in fact a severe loss of knowledge, where people will not communicate easily and rapidly. Emojis are a threat to the inherited culture, knowledge and science in general: we won't understand what was meant, and will lose our language to a point where it will be very unproductive and will generate more conflicts between people... Since the beginning of the 20th century (and notably since WW2) we have developed a lot of communication means, but we have also seen recently a severe degradation of literacy and a growing social fracture in access to knowledge: the huge recent development of audio/video instead of text is a severe threat to the preservation of culture, as audio/video contents are much more difficult to preserve than text. We can expect a degradation of the general knowledge of the population, and a growing gap with those that have access to the inherited culture, if we don't preserve (with Unicode) our text heritage, which has proven to be very productive, allowed the development of science, and allowed us to coordinate varied societies and to communicate with people of varied cultures and across generations... -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Feb 14 15:37:05 2018 From: unicode at unicode.org (David Starner via Unicode) Date: Wed, 14 Feb 2018 21:37:05 +0000 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> Message-ID: On Wed, Feb 14, 2018 at 11:16 AM James Kass via Unicode wrote: > That's one way of looking at it. Another way would be that the emoji > were definitely outside the scope of the Unicode project as encoding > them violated Unicode's initial encoding principles. > They were characters being interchanged as text in current use. They are more inside the scope than many of the line-drawing characters for 8-bit computers that have been there since day one, and analogous to many of the dingbats that have also been there since day one. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Feb 14 16:33:48 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 14 Feb 2018 14:33:48 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> Message-ID: David Starner wrote, > They were characters being interchanged as text > in current use. They were in-line graphics being interchanged as though they were text. And they still are. And we still disagree. From unicode at unicode.org Wed Feb 14 16:34:22 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 14 Feb 2018 14:34:22 -0800 Subject: UNICODE vehicle vanity registration?
In-Reply-To: References: Message-ID: <19f51a1f-2dd5-969f-7d00-33ae0887fd6e@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Feb 14 17:26:24 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Wed, 14 Feb 2018 15:26:24 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: <2efcd3a8-f41d-82c0-754c-5e17db993f4d@att.net> On 2/14/2018 12:49 PM, Philippe Verdy via Unicode wrote: > > > RCLLTHTWHNLPHBTSWRFRSTNVNTDPPLWRTTXTLKTHS ! > > [ ... lots to say about the history of writing ... ] > And the use (or abuse) of emojis is returning us to the prehistory > when people draw animals on walls of caverns: this was a very slow > communication, not giving a rich semantic, full of ambiguities about > what is really meant, and in fact a severe loss of knowledge where > people will not communicate easily and rapidly. =-O Perhaps Philippe was missing my point about how and why emoji are actually used. --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Feb 14 17:47:04 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 14 Feb 2018 15:47:04 -0800 Subject: UNICODE vehicle vanity registration? In-Reply-To: References: <20180214162437.zaxd3zqisv5dz3cu@nic.fr> <4087B8B9-7DF2-457D-ACD5-019F8E411347@alastairs-place.net> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Feb 14 19:14:26 2018 From: unicode at unicode.org (David Starner via Unicode) Date: Thu, 15 Feb 2018 01:14:26 +0000 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> Message-ID: On Wed, Feb 14, 2018 at 2:35 PM James Kass via Unicode wrote: > David Starner wrote, > > > They were characters being interchanged as text > > in current use. > > They were in-line graphics being interchanged as though they were > text. And they still are. And we still disagree. > They were units of things being interchanged in formats of MIME types starting with text/ . From the beginning, Unicode has supported all the cruft that's being interchanged in formats of MIME types starting with text/. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Feb 14 19:49:05 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 14 Feb 2018 17:49:05 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> Message-ID: On Wed, Feb 14, 2018 at 5:14 PM, David Starner wrote: > They were units of things being interchanged in formats of MIME types > starting with text/ . From the beginning, Unicode has supported all the > cruft that's being interchanged in formats of MIME types starting with > text/. Yes, except that Unicode "supported" all manner of things being interchanged by setting aside a range of code points for private use. Which enabled certain cell phone companies to save some bandwidth by assigning various popular in-line graphics to PUA code points. The "problem" was that these phone companies failed to get together on those PUA code point assignments, so they could not exchange their icons in a standard fashion between competing phone systems. [Image of the world's smallest violin playing.] 
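(As a concrete footnote to the PUA discussion above: the Private Use ranges are fixed by The Unicode Standard itself, so checking whether a code point is "private use" is trivial. A minimal sketch in Python, purely illustrative and not anything proposed in this thread:

    def is_private_use(cp: int) -> bool:
        # The three Private Use Areas defined by The Unicode Standard:
        # U+E000..U+F8FF on the BMP, plus Planes 15 and 16.
        return (0xE000 <= cp <= 0xF8FF
                or 0xF0000 <= cp <= 0xFFFFD     # Plane 15 (PUA-A)
                or 0x100000 <= cp <= 0x10FFFD)  # Plane 16 (PUA-B)

    # Example: the ConScript registry places Klingon at U+F8D0..U+F8FF,
    # inside the BMP PUA, so parties sharing that mapping can interchange
    # it as ordinary Unicode text.
    assert is_private_use(0xF8D0)

The point being that nothing in such a private agreement is visible to anyone outside it, which is exactly the interoperability problem the carriers ran into.)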
I've personally exchanged text data with others using the PUA for both Klingon and Ewellic. [winks] From unicode at unicode.org Wed Feb 14 20:20:49 2018 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Thu, 15 Feb 2018 11:20:49 +0900 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> Message-ID: <46b3edf0-4d4e-1354-fc36-baf3691f0d3a@it.aoyama.ac.jp> On 2018/02/15 10:49, James Kass via Unicode wrote: > Yes, except that Unicode "supported" all manner of things being > interchanged by setting aside a range of code points for private use. > Which enabled certain cell phone companies to save some bandwidth by > assigning various popular in-line graphics to PUA code points. The original Japanese cell phone carrier emoji were defined in the unassigned area of Shift_JIS, not Unicode. Shift_JIS doesn't have an official private area, but use of the empty area by companies had already happened for Kanji (by IBM, NEC, Microsoft). Also, there was some transcoding software initially that mapped some of the emoji to areas in Unicode besides the PUA, based on very simplistic conversion. > The > "problem" was that these phone companies failed to get together on > those PUA code point assignments, so they could not exchange their > icons in a standard fashion between competing phone systems. [Image > of the world's smallest violin playing.] Emoji were originally a competitive device. As an example, NTT Docomo allowed the ticket service PIA to have an emoji for their service, most probably in order to entice them to sign up to participate in the original I-mode (first case of Web on mobile phones) service. Of course, that specific emoji (or was it several) wasn't encoded in Unicode because of trademark issues. Regards, Martin. From unicode at unicode.org Wed Feb 14 20:59:14 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 14 Feb 2018 18:59:14 -0800 Subject: Why so much emoji nonsense? In-Reply-To: <46b3edf0-4d4e-1354-fc36-baf3691f0d3a@it.aoyama.ac.jp> References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> <46b3edf0-4d4e-1354-fc36-baf3691f0d3a@it.aoyama.ac.jp> Message-ID: Martin J. Dürst wrote: > The original Japanese cell phone carrier emoji were defined in the > unassigned area of Shift_JIS, not Unicode. Thank you (and another list member) for reminding us that it was originally hacked SJIS rather than proper PUA Unicode. From unicode at unicode.org Thu Feb 15 08:16:28 2018 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Thu, 15 Feb 2018 15:16:28 +0100 (CET) Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> <46b3edf0-4d4e-1354-fc36-baf3691f0d3a@it.aoyama.ac.jp> Message-ID: <706530656.61950.1518704188693@ox.hosteurope.de> James Kass via Unicode : > Martin J. Dürst > >> The original Japanese cell phone carrier emoji were defined in the >> unassigned area of Shift_JIS, not Unicode. > > Thank you (and another list member) for reminding us that it was > originally hacked SJIS rather than proper PUA Unicode. Japanese telcos were also not the first to use this space for pictographs and ideographs. Look at Sharp electronic typewriters from the early 1990s for instance (which can also be considered laptop computers), e.g.
WD-A521 or WD-A551 or WD-A750. They already included much of what later became J-Phone / Vodafone / Softbank emojis. From unicode at unicode.org Thu Feb 15 11:21:26 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 09:21:26 -0800 Subject: Unicode of Death 2.0 Message-ID: This article: https://techcrunch.com/2018/02/15/iphone-text-bomb-ios-mac-crash-apple/?ncid=mobilenavtrend The single Unicode symbol referred to in the article results from a string of Telugu characters. The article doesn't list or display the characters, so Mac users can visit the above link. A link in one of the comments leads to a page which does display the characters. From unicode at unicode.org Thu Feb 15 12:58:10 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 15 Feb 2018 19:58:10 +0100 Subject: Unicode of Death 2.0 In-Reply-To: References: Message-ID: That's probably not a bug of Unicode but of the MacOS/iOS text renderers, with some fonts using advanced composition features. Similar bugs could as well affect the new advanced features added in Windows or Android to support multicolored emojis, variable fonts, contextual glyph transforms, style variants, or more font formats (not just OpenType). The bug may also be in the graphic renderer (incorrect clipping when drawing the glyph into the glyph cache, with buffer overflows possibly caused by incorrectly computed splines), or in the display driver (or in a hardware accelerator having some limitations on the complexity of the multipolygons to fill and antialias), causing some infinite recursion loop, or too deep a recursion exhausting the stack limit. Finally, the bug could be in the OpenType hinting engine moving some points outside the clipping area (the math theory may say that such placement of a point outside the clipping area is impossible, but various mathematical simplifications and shortcuts are used to simplify or accelerate the rendering, at the price of some quirks). Even the SVG standard (in constant evolution) could be affected as well in its implementation. There are tons of possible bugs here. 2018-02-15 18:21 GMT+01:00 James Kass via Unicode : > This article: > https://techcrunch.com/2018/02/15/iphone-text-bomb-ios- > mac-crash-apple/?ncid=mobilenavtrend > > The single Unicode symbol referred to in the article results from a > string of Telugu characters. The article doesn't list or display the > characters, so Mac users can visit the above link. A link in one of > the comments leads to a page which does display the characters. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Feb 15 14:53:00 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 12:53:00 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: Ken Whistler replied to Erik Pedersen, > Emoticons were invented, in large part, to fill another > major hole in written communication -- the need to convey > emotional state and affective attitudes towards the text. There is no such need. If one can't string words together which 'speak for themselves', there are other media. I suspect that emoticons were invented for much the same reason that "typewriter art" was invented: because it's there, it's cute, it's clever, and it's novel.
> This is the kind of information that face-to-face > communication has a huge and evolutionarily deep > bandwidth for, but which written communication > typically fails miserably at. Does Braille include emoji? Are there tonal emoticons available for telephone or voice transmission? Does the telephone "fail miserably" at oral communication because there's no video to transmit facial tics and hand gestures? Did Pontius Pilate have a cousin named Otto? These are rhetorical questions. For me, the emoji are a symptom of our moving into a post-literate age. We already have people in positions of power who pride themselves on their marginal literacy and boast about the fact that they don't read much. Sad! From unicode at unicode.org Thu Feb 15 15:38:19 2018 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Thu, 15 Feb 2018 21:38:19 +0000 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: For voice we certainly get clues about the speaker's intent from their tone. That tone can change the meaning of the same written word quite a bit. There is no need for video to wildly change the meaning of two different readings of the exact same words. Writers have always taken liberties with the written word to convey ideas that aren't purely grammatically correct. This may be most obvious in poetry, but it happens even in other writings. Maybe their entire reason was so that future English teachers would ask us why some author chose some peculiar structure or whatever. I find it odd that I write things like "I'd've thought" (AFAIK I hadn't been exposed to I'd've and it just spontaneously occurred, but apparently others (mis)use it as well). I realize "I'd've" isn't "right", but it better conveys my current state of mind than spelling it out would've. Similarly, if I find myself smiling internally while I'm writing, it's going to get a :) Though I may use :), I agree that most of my use of emoji is more decorative, however including other emoji can also make the sentence feel more "fun". If I receive a ?? as the only response to a comment I made, that conveys information that I would have a difficult time putting into words. I don't find emoji to necessarily be a "post-literate" thing. Just a different way of communicating. I have also seen them used in a "pre-literate" fashion. Helping people that were struggling to learn to read get past the initial difficulties they were having on their way to becoming more literate. -Shawn -----Original Message----- From: Unicode On Behalf Of James Kass via Unicode Sent: Thursday, February 15, 2018 12:53 PM To: Ken Whistler Cc: Erik Pedersen ; Unicode Public Subject: Re: Why so much emoji nonsense? Ken Whistler replied to Erik Pedersen, > Emoticons were invented, in large part, to fill another major hole in > written communication -- the need to convey emotional state and > affective attitudes towards the text. There is no such need. If one can't string words together which 'speak for themselves', there are other media. I suspect that emoticons were invented for much the same reason that "typewriter art" was invented: because it's there, it's cute, it's clever, and it's novel. > This is the kind of information that face-to-face communication has a > huge and evolutionarily deep bandwidth for, but which written > communication typically fails miserably at. Does Braille include emoji? Are there tonal emoticons available for telephone or voice transmission? 
Does the telephone "fail miserably" at oral communication because there's no video to transmit facial tics and hand gestures? Did Pontius Pilate have a cousin named Otto? These are rhetorical questions. For me, the emoji are a symptom of our moving into a post-literate age. We already have people in positions of power who pride themselves on their marginal literacy and boast about the fact that they don't read much. Sad! From unicode at unicode.org Thu Feb 15 16:24:18 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 15 Feb 2018 23:24:18 +0100 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: 2018-02-15 22:38 GMT+01:00 Shawn Steele via Unicode : > > I don't find emoji to necessarily be a "post-literate" thing. Just a > different way of communicating. I have also seen them used in a > "pre-literate" fashion. Helping people that were struggling to learn to > read get past the initial difficulties they were having on their way to > becoming more literate. > If you just look at how more and more people "communicate" today on the Internet, it's mostly by video, most of it of poor quality and with no real graphic value, where a single photo of the speaker on his profile would be enough. So the web is now overwhelmed by poor videos just containing speech, with very low value. But the worst is that this fabulous collection is almost impossible to qualify, sort, or organize; it is not reusable, and almost not transmissible (except on the social network where the videos are posted, and where they'll soon disappear, because there's simply no way to build efficient archives that would be usable in the near future): just a haystack where even the precious gold needles are extremely difficult to find. If people don't know how to read and cannot reuse the content and transmit it, they become just consumers, and in fact less and less producers or creators of content. Just look at the opinions under videos: most of them are just "thumbs up", "like", "+1", merely counted, never qualified (there's not even a thumbs down). Even these terms are avoided on the interface and you just see an icon for the counter: do you have something to learn when seeing these icons? I fear that those who, in the near future, won't be able to read, and will only be able to listen to the media produced by others, will not even be able to make any judgement, and then will be easily manipulated. And it's in the mission of Unicode, IMHO, to promote litteracy, because it is necessary for preserving, transmitting, and expanding the cultures, as well as to reconcile people with science instead of just following the voice of new gurus only because they look "fun". -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Feb 15 16:30:49 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 15 Feb 2018 22:30:49 +0000 Subject: Why so much emoji nonsense? - Proscription In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: <20180215223049.3e4c3692@JRWUBU2> On Thu, 15 Feb 2018 21:38:19 +0000 Shawn Steele via Unicode wrote: > I realize "I'd've" isn't > "right", Where did that proscription come from? Is it perhaps a perversion of the proscription of "I'd of"? Richard.
From unicode at unicode.org Thu Feb 15 16:33:12 2018 From: unicode at unicode.org (Oren Watson via Unicode) Date: Thu, 15 Feb 2018 17:33:12 -0500 Subject: Invisible characters must be specified to be visible in security-sensitive situations Message-ID: https://securelist.com/zero-day-vulnerability-in-telegram/83800/ You could disallow these characters in filenames, but when filename handling is charset-agnostic due to the extended-ascii principle this is impractical. I think a better solution is to specify a visible form of these characters to be used (e.g. through otf font variants) when security is of importance. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Feb 15 16:35:23 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 15 Feb 2018 22:35:23 +0000 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> Message-ID: <20180215223523.6a7a5abb@JRWUBU2> On Wed, 14 Feb 2018 17:49:05 -0800 James Kass via Unicode wrote: > I've personally exchanged text data with others using the PUA for both > Klingon and Ewellic. [winks] But wasn't that using a supplementary standard, the ConScript Unicode Registry? Richard. From unicode at unicode.org Thu Feb 15 16:35:54 2018 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Thu, 15 Feb 2018 22:35:54 +0000 Subject: Why so much emoji nonsense? - Proscription In-Reply-To: <20180215223049.3e4c3692@JRWUBU2> References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <20180215223049.3e4c3692@JRWUBU2> Message-ID: Depends on your perspective I guess ;) -----Original Message----- From: Unicode On Behalf Of Richard Wordingham via Unicode Sent: Thursday, February 15, 2018 2:31 PM To: unicode at unicode.org Subject: Re: Why so much emoji nonsense? - Proscription On Thu, 15 Feb 2018 21:38:19 +0000 Shawn Steele via Unicode wrote: > I realize "I'd've" isn't > "right", Where did that proscription come from? Is it perhaps a perversion of the proscription of "I'd of"? Richard. From unicode at unicode.org Thu Feb 15 16:41:23 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Thu, 15 Feb 2018 14:41:23 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: On 2/15/2018 2:24 PM, Philippe Verdy via Unicode wrote: > And it's in the mission of Unicode, IMHO, to promote litteracy Um, no. And not even literacy, either. ;-) https://en.wikipedia.org/wiki/Category:Organizations_promoting_literacy --Ken From unicode at unicode.org Thu Feb 15 16:47:51 2018 From: unicode at unicode.org (Nelson H. F. Beebe via Unicode) Date: Thu, 15 Feb 2018 15:47:51 -0700 Subject: Invisible characters must be specified to be visible in security-sensitive situations Message-ID: A list poster reported this story today: https://securelist.com/zero-day-vulnerability-in-telegram/83800/ For a view from the co-father of the Internet, see this recent article: Desirable Properties of Internet Identifiers Vinton G. Cerf https://www.computer.org/csdl/mags/ic/2017/06/mic2017060063.html ------------------------------------------------------------------------------- - Nelson H. F. 
Beebe Tel: +1 801 581 5254 - - University of Utah FAX: +1 801 581 4148 - - Department of Mathematics, 110 LCB Internet e-mail: beebe at math.utah.edu - - 155 S 1400 E RM 233 beebe at acm.org beebe at computer.org - - Salt Lake City, UT 84112-0090, USA URL: http://www.math.utah.edu/~beebe/ - ------------------------------------------------------------------------------- From unicode at unicode.org Thu Feb 15 16:52:28 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 14:52:28 -0800 Subject: Why so much emoji nonsense? In-Reply-To: <20180215223523.6a7a5abb@JRWUBU2> References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> <20180215223523.6a7a5abb@JRWUBU2> Message-ID: Richard Wordingham wrote, >> Klingon and Ewellic. [winks] > > But wasn't that using a supplementary standard, the ConScript Unicode > Registry? The code points registered with CSUR were used for the interchange. But, to clarify, CSUR is not an official supplement to The Unicode Standard. Of course, any exchange of PUA data requires an agreement between senders and recipients. CSUR offers character mappings which private individuals may agree to use for data exchange. From unicode at unicode.org Thu Feb 15 17:19:41 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 15:19:41 -0800 Subject: Why so much emoji nonsense? - Proscription In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <20180215223049.3e4c3692@JRWUBU2> Message-ID: I'd not've thought "I'd've" was proscribed. Who woulda guessed? On Thu, Feb 15, 2018 at 2:35 PM, Shawn Steele via Unicode wrote: > Depends on your perspective I guess ;) > > -----Original Message----- > From: Unicode On Behalf Of Richard Wordingham via Unicode > Sent: Thursday, February 15, 2018 2:31 PM > To: unicode at unicode.org > Subject: Re: Why so much emoji nonsense? - Proscription > > On Thu, 15 Feb 2018 21:38:19 +0000 > Shawn Steele via Unicode wrote: > >> I realize "I'd've" isn't >> "right", > > Where did that proscription come from? Is it perhaps a perversion of the proscription of "I'd of"? > > Richard. > From unicode at unicode.org Thu Feb 15 17:49:18 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 16 Feb 2018 00:49:18 +0100 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: Oh well the 1 to 2 T is a minor English typo (there are two Ts in French for the similar word family, sorry). But I included "IMHO", which means that even if it's not official, it has been the motivating reason why various members joined the project, to try to put an end to the destruction of written languages and the loss of our written heritage, which is still the essential way for humanity to communicate (much more than oral languages, which are all threatened with rapid death and with being forgotten if they are not written). Written languages easily cross borders, generations and cultures; with them you can extend your own language and culture, get more ideas and more inventions, better understand the world, and have the means to be more creative, rather than only following what the most visible leaders are saying. Everywhere, literacy is improving people's lives and offering more means of living. And it really helps preserve your own personal memory (you do that with photos/videos or audio, which are almost impossible to organize without attaching text to them)!
2018-02-15 23:41 GMT+01:00 Ken Whistler : > > > On 2/15/2018 2:24 PM, Philippe Verdy via Unicode wrote: > >> And it's in the mission of Unicode, IMHO, to promote litteracy >> > > Um, no. And not even literacy, either. ;-) > > https://en.wikipedia.org/wiki/Category:Organizations_promoting_literacy > > --Ken > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Feb 15 18:16:23 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 16:16:23 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: Philippe Verdy wrote, >>> And it's in the mission of Unicode, IMHO, to promote litteracy >> >> Um, no. And not even literacy, either. ;-) > > Oh well the 1 to 2 T is a minor English typo (there are two Ts in French for the > similar word family, sorry). > > But I included "IMHO", which means that even if it's not official, it has > been the motivating reason why various members joined the project ... In this case the punctuation emoticon tacked onto Ken's message apparently did little to diminish the sting of his correcting both your spelling and your opinion. Unicode's stated mission is more along the lines of ensuring that computer text can be universally interchanged in a standard fashion. As a tool, Unicode can be used to promote either literacy or illiteracy. It can be used to exchange messages of joy and love, or hatred and despair. I completely agree that promoting literacy and preserving texts has been a motivating factor for many people supporting the project. From unicode at unicode.org Thu Feb 15 18:17:17 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 16 Feb 2018 01:17:17 +0100 Subject: Invisible characters must be specified to be visible in security-sensitive situations In-Reply-To: References: Message-ID: The suggested filename has no real importance; it could be garbage, and displaying it exactly does not matter. What is important is to display the MIME type (which is transmitted separately from the filename, and frequently without any filename at all, the browser then trying to infer a suitable filename from the URL; but it should respect the MIME type). The acceptable MIME types (and especially, as here, when they denote something executable like JavaScript) should be clearly identified, and the file extension removed from what is displayed when it matches the MIME type. With these, the user would not be confused by the presence of a Bidi override control. So "photo_high_re"++"gnp.js" becomes the text field (to embed in a directional isolate) "photo_high_re"++"gnp (text/javascript)", rendered as "photo_high_regnp" (text/javascript). The browser may also be smarter and describe it as an executable script. But here, in an alert box where it detects potentially harmful content, the suggested filename to display should simply be filtered of these Bidi controls, and the suggested file extension removed and replaced by the default extension for the MIME type (outside the isolate).
The user would then see: "photo_high_regnp.js" (text/javascript), where the suggested filename was altered. (In such an alert, the suggested file names should also be truncated to a maximum length, with an indication of the truncation before the replaced extension, such as: "photo_high[...].js" (text/javascript).) Also, the generic icon used is not descriptive enough and is counterproductive, as the user may think the icon is a preview of a PNG image; that's why the MIME type should be clearly exposed. 2018-02-15 23:33 GMT+01:00 Oren Watson via Unicode : > https://securelist.com/zero-day-vulnerability-in-telegram/83800/ > > You could disallow these characters in filenames, but when filename > handling is charset-agnostic due to the extended-ascii principle this is > impractical. I think a better solution is to specify a visible form of > these characters to be used (e.g. through otf font variants) when security > is of importance. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Feb 15 18:59:11 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 16 Feb 2018 00:59:11 +0000 Subject: Origin of Alphasyllabaries (was: Why so much emoji nonsense?) In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: <20180216005911.7b2b9012@JRWUBU2> On Wed, 14 Feb 2018 21:49:57 +0100 Philippe Verdy via Unicode wrote: > The concept of vowels as distinctive letters came later; even the > letter A was initially the representation of a glottal stop consonant, > sometimes mute, written only to indicate a word whose first syllable > did not begin with a consonant. This has survived today in abjads and > abugidas, where vowels became optional diacritics, and evolved into > plain diacritics in Indic abugidas. OK. > The situation is even more complex because clusters of consonants > were also represented in early vowel-less alphabets to stand for full > syllables (this formed the base of today's syllabaries, once glyph > variants of the base consonant were introduced to distinguish their > vocalization; The only syllabary where what you say might be true is the Ethiopic syllabary, and I have grave doubts as to that case. I hope you are aware that most syllabaries do not derive from alphabets, abjads or abugidas. > the Indic abugidas, with their complex clusters in which vowel > diacritics create contextual variant forms of the base consonant, are > also a remnant of this old age): I see no reason to regard consonant-vowel ligatures as going back to an earlier system without dependent vowels. > the separation of > phonetic consonants came only later. Old Brahmi stacked consonants are generally very clear compositions. Opaque ligatures are a later development. Writing consonants linearly is a later development; is this what you are referring to? Richard.
I suspect that emoticons were invented for much the same reason that "typewriter art" was invented: because it's there, it's cute, it's clever, and it's novel. By the standard of "if one can't string word together that speak for themselves can use otger media", then we can scrap Unicode and simply use voice recording for all the purposes. ?_? > This is the kind of information that face-to-face > communication has a huge and evolutionarily deep > bandwidth for, but which written communication > typically fails miserably at. Does Braille include emoji? Are there tonal emoticons available for telephone or voice transmission? Does the telephone "fail miserably" at oral communication because there's no video to transmit facial tics and hand gestures? Did Pontius Pilate have a cousin named Otto? These are rhetorical questions. Tonal emoticon for telephone or voice transmission? There are tones for voice based transmission system And yes, there are limits in these technology which make teleconferencing still not all that popular and people still have to fly across the world just to attend all different sort of meetings. For me, the emoji are a symptom of our moving into a post-literate age. We already have people in positions of power who pride themselves on their marginal literacy and boast about the fact that they don't read much. Sad! Emoji is part of the literacy. Remember that Japanese writing system use ideographic characters plus kana, it won't be odd to add yet another set of pictographic writing system in line to express what you don't want to spell out. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Feb 15 20:37:02 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 18:37:02 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: On Thu, Feb 15, 2018 at 6:19 PM, Phake Nick via Unicode wrote: > > > 2018-02-16 04:55, "James Kass via Unicode" wrote: > > Ken Whistler replied to Erik Pedersen, > >> Emoticons were invented, in large part, to fill another >> major hole in written communication -- the need to convey >> emotional state and affective attitudes towards the text. > > There is no such need. If one can't string words together which > 'speak for themselves', there are other media. I suspect that > emoticons were invented for much the same reason that "typewriter art" > was invented: because it's there, it's cute, it's clever, and it's > novel. > > By the standard of "if one can't string word together that speak for > themselves can use otger media", then we can scrap Unicode and simply use > voice recording for all the purposes. ?_? > > >> This is the kind of information that face-to-face >> communication has a huge and evolutionarily deep >> bandwidth for, but which written communication >> typically fails miserably at. > > Does Braille include emoji? Are there tonal emoticons available for > telephone or voice transmission? Does the telephone "fail miserably" > at oral communication because there's no video to transmit facial tics > and hand gestures? Did Pontius Pilate have a cousin named Otto? > These are rhetorical questions. > > Tonal emoticon for telephone or voice transmission? 
There are tones for > voice based transmission system > And yes, there are limits in these technology which make teleconferencing > still not all that popular and people still have to fly across the world > just to attend all different sort of meetings. > > > For me, the emoji are a symptom of our moving into a post-literate > age. We already have people in positions of power who pride > themselves on their marginal literacy and boast about the fact that > they don't read much. Sad! > > Emoji is part of the literacy. Remember that Japanese writing system use > ideographic characters plus kana, it won't be odd to add yet another set of > pictographic writing system in line to express what you don't want to spell > out. From unicode at unicode.org Thu Feb 15 20:46:16 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 18:46:16 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: Phake Nick wrote, > By the standard of "if one can't string word together that speak for > themselves can use otger media", then we can scrap Unicode and simply use > voice recording for all the purposes. ?_? Not for me, I can still type faster than I can talk. Besides, voice recordings are all about communicating by stringing words together. >> These are rhetorical questions. > > Tonal emoticon for telephone or voice transmission? There are tones for > voice based transmission system > And yes, there are limits in these technology which make teleconferencing > still not all that popular and people still have to fly across the world > just to attend all different sort of meetings. At least, that's what they tell their accountants and tax people, right? > Emoji is part of the literacy. Remember that Japanese writing system use > ideographic characters plus kana, it won't be odd to add yet another set of > pictographic writing system in line to express what you don't want to spell > out. Yes, it's a done deal. For better or for worse. From unicode at unicode.org Thu Feb 15 21:26:00 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 19:26:00 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: If someone were to be smiling and shrugging while giving you the finger, would you be smiling too? Heck, I'd probably be laughing out loud while running for my life! So, poor example. OK. A smiling creep is still a creep. Suppose for a moment that you and I are pals in the same room having a face-to-face conversation. I advise you that, due to unforeseen events, I'm a bit financially strapped and could use a spot of cash to sort of tide me over until my ship comes into orbit. You smile and nod your head while saying "no". Which response applies? Words suffice. We go by what people actually say rather than whatever they might have meant. When we read text, we go by what's written. An inability to communicate any essential feelings and overtones using words is not a gross failure of either language or writing. It's more about the skill levels of the speaker, listener, author, and reader. As for the thread title question, perhaps the exchanges within the thread offer insight. Emoji exist and are interchanged. Unicode enables them to be interchanged in a standard fashion. Even if they're just for fun, frivolous, silly, and ephemeral. Even if some people consider them beyond the scope of The Unicode Standard. 
The best time to argue against the addition of emoji to Unicode would be 2007 or 2008, but you'd be wasting your time travel. Trust me. From unicode at unicode.org Thu Feb 15 22:58:31 2018 From: unicode at unicode.org (Pierpaolo Bernardi via Unicode) Date: Fri, 16 Feb 2018 05:58:31 +0100 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: On Fri, Feb 16, 2018 at 4:26 AM, James Kass via Unicode wrote: > The best time to argue against the addition of emoji to Unicode would be > 2007 or 2008, but you'd be wasting your time travel. Trust me. But it's always a good time to argue against the addition of more nonsense to what we already have got. From unicode at unicode.org Thu Feb 15 23:24:58 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 15 Feb 2018 22:24:58 -0700 Subject: +1 (was: Re: Why so much emoji nonsense?) Message-ID: <4D59975202364CD9959680B1D523958F@DougEwell> Philippe Verdy wrote: > If people don't know how to read and cannot reuse the content and > transmit it, they become just consumers, and in fact less and less > producers or creators of content. Just look at the opinions under > videos: most of them are just "thumbs up", "like", "+1", merely > counted, never qualified (there's not even a thumbs down). +1 is actually a convenient shorthand when all that needs to be said is "I agree" or "me too" (especially now that the latter has taken on a highly charged meaning in the U.S.). It is especially popular in the IETF. It is not intended for situations that require explanation or details. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Thu Feb 15 23:33:58 2018 From: unicode at unicode.org (Anshuman Pandey via Unicode) Date: Thu, 15 Feb 2018 23:33:58 -0600 Subject: =?utf-8?Q?End_of_discussion,_please_=E2=80=94_Re:_Why_so_much_em?= =?utf-8?Q?oji_nonsense=3F?= In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: <40CA88F8-8C33-4F3A-9ECA-38B66A5B4680@umich.edu> > On Feb 15, 2018, at 10:58 PM, Pierpaolo Bernardi via Unicode wrote: > > On Fri, Feb 16, 2018 at 4:26 AM, James Kass via Unicode > wrote: > >> The best time to argue against the addition of emoji to Unicode would be >> 2007 or 2008, but you'd be wasting your time travel. Trust me. > > But it's always a good time to argue against the addition of more > nonsense to what we already have got. I think it's a good time to end this conversation. Whether "nonsense" or not, emoji are here and they're in Unicode. This conversation has itself become nonsense, d'y'all agree? The amount of time that people have spent on this discussion could've been directed towards work on any one of the unencoded scripts listed at: http://www.linguistics.berkeley.edu/sei/scripts-not-encoded.html As many have noted during this discussion, the emoji "ship has already sailed". I'd've jumped aboard sooner, but this metaphor is now also quite tired. ?? All my best, Anshu -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 16 00:04:12 2018 From: unicode at unicode.org (Phake Nick via Unicode) Date: Fri, 16 Feb 2018 14:04:12 +0800 Subject: Why so much emoji nonsense?
In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: 2018-02-16 10:46, "James Kass" wrote: Phake Nick wrote, > By the standard of "if one can't string word together that speak for > themselves can use otger media", then we can scrap Unicode and simply use > voice recording for all the purposes. ?_? Not for me, I can still type faster than I can talk. Besides, voice recordings are all about communicating by stringing words together. There are thousands of situations where one would want to express something in text form instead of voice form, other than to be fast. Voice communication isn't just about stringing words together: emotion and other things are also transferred. That's also why carriers are supporting HQ Voice transmission over telephony systems, for better clarity in this aspect. >> These are rhetorical questions. > Tonal emoticon for telephone or voice transmission? There are tones for > voice based transmission system > And yes, there are limits in these technology which make teleconferencing > still not all that popular and people still have to fly across the world > just to attend all different sort of meetings. At least, that's what they tell their accountants and tax people, right? Then why do those people who pay for their own trips still do so? > [...] 2018-02-16 11:27, "James Kass via Unicode" wrote: If someone were to be smiling and shrugging while giving you the finger, would you be smiling too? Heck, I'd probably be laughing out loud while running for my life! So, poor example. OK. A smiling creep is still a creep. This is an example of extravocal communication. If the person said thank you with a smiling face while giving you a middle finger, it would be a totally different context from a regular thank you given by other people. Suppose for a moment that you and I are pals in the same room having a face-to-face conversation. I advise you that, due to unforeseen events, I'm a bit financially strapped and could use a spot of cash to sort of tide me over until my ship comes into orbit. You smile and nod your head while saying "no". Which response applies? Words suffice. We go by what people actually say rather than whatever they might have meant. When we read text, we go by what's written. Then, what would the listener feel if he only heard you say no, but didn't know about your facial and body reaction? He might not be able to grasp the level of "no" you are giving out, and you would need some rather lengthy description to explain to the person why you want to reject him. Why do that when a simple non-verbal expression is enough? An inability to communicate any essential feelings and overtones using words is not a gross failure of either language or writing. It's more about the skill levels of the speaker, listener, author, and reader. https://en.wikipedia.org/wiki/Nonverbal_communication As for the thread title question, perhaps the exchanges within the thread offer insight. Emoji exist and are interchanged. Unicode enables them to be interchanged in a standard fashion. Even if they're just for fun, frivolous, silly, and ephemeral. Even if some people consider them beyond the scope of The Unicode Standard. The best time to argue against the addition of emoji to Unicode would be 2007 or 2008, but you'd be wasting your time travel. Trust me. I would like to add that, if Unicode hadn't included emoji at the time, then I suspect many more systems would have continued to use Shift-JIS instead.
Individual mobile phone carriers would continue to use each of their own private code points, and app/platform developers would either have to find a way to convert code points between the different emoji sets in use (remember that the implementations by each carrier didn't strictly correspond to each other), or invent yet another private-use font to cover all those emoji within their platform. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 16 01:24:38 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 23:24:38 -0800 Subject: =?UTF-8?Q?Re=3A_End_of_discussion=2C_please_=E2=80=94_Re=3A_Why_so_much_em?= =?UTF-8?Q?oji_nonsense=3F?= In-Reply-To: <40CA88F8-8C33-4F3A-9ECA-38B66A5B4680@umich.edu> References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <40CA88F8-8C33-4F3A-9ECA-38B66A5B4680@umich.edu> Message-ID: Anshuman Pandey wrote: > I think it's a good time to end this conversation. Whether "nonsense" or not, > emoji are here and they're in Unicode. This conversation has itself become > nonsense, d'y'all agree? No. Other than the part about emoji being here and in Unicode. > The amount of time that people have spent on this discussion could've been > directed towards work on any one of the unencoded scripts listed at: > > http://www.linguistics.berkeley.edu/sei/scripts-not-encoded.html https://en.wikipedia.org/wiki/All_work_and_no_play_makes_Jack_a_dull_boy From unicode at unicode.org Fri Feb 16 01:47:11 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 16 Feb 2018 09:47:11 +0200 Subject: Invisible characters must be specified to be visible in security-sensitive situations In-Reply-To: (message from Oren Watson via Unicode on Thu, 15 Feb 2018 17:33:12 -0500) References: Message-ID: <83tvuhe5q8.fsf@gnu.org> > Date: Thu, 15 Feb 2018 17:33:12 -0500 > From: Oren Watson via Unicode > > https://securelist.com/zero-day-vulnerability-in-telegram/83800/ > > You could disallow these characters in filenames, but when filename handling is charset-agnostic due to the > extended-ascii principle this is impractical. I think a better solution is to specify a visible form of these > characters to be used (e.g. through otf font variants) when security is of importance. Emacs has a special function that searches a given region of a buffer of text or of a text string for characters whose Bidi_Class property has been overridden by RLO or LRO. Emacs application programs can use this function to detect and flag such regions of text, and prevent such malicious attacks. From unicode at unicode.org Fri Feb 16 01:54:04 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 23:54:04 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: Pierpaolo Bernardi wrote: > But it's always a good time to argue against the addition of more > nonsense to what we already have got. It's an open-ended set and precedent for encoding them exists. Generally, input regarding the addition of characters to a repertoire is solicited from the user community, of which I am not a member. My personal feeling is that all of the time, effort, and money spent by the various corporations in promoting the emoji into Unicode would have been better directed towards something more worthwhile, such as the unencoded scripts listed at: http://www.linguistics.berkeley.edu/sei/scripts-not-encoded.html ... but nobody asked me.
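(Tying the security subthread together: Oren Watson's proposal, Philippe Verdy's filtering suggestion, and the Emacs facility Eli Zaretskii describes all reduce to the same first step, scanning a string for the explicit directional formatting characters before trusting its visual order. A minimal sketch in Python, with the function name invented here for illustration; this is not Emacs's actual API:

    # Explicit bidi embedding/override/isolate controls: U+202A..U+202E
    # and U+2066..U+2069. RLO (U+202E) is the one abused in the Telegram
    # vulnerability discussed above.
    BIDI_CONTROLS = ({chr(c) for c in range(0x202A, 0x202F)}
                     | {chr(c) for c in range(0x2066, 0x206A)})

    def has_directional_controls(text: str) -> bool:
        # True if the string's display order may differ from its
        # logical order because of explicit bidi controls.
        return any(ch in BIDI_CONTROLS for ch in text)

    # "photo_high_re" + RLO + "gnp.js" displays as "photo_high_resj.png"
    # yet names a .js file; a UI can flag, strip, or visibly escape it.
    spoofed = "photo_high_re\u202Egnp.js"
    assert has_directional_controls(spoofed)

A display layer can then replace each flagged control with a visible escape such as "<U+202E>", which is one way to realize the "visible form" Oren asked for.)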
From unicode at unicode.org Fri Feb 16 02:06:17 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 16 Feb 2018 00:06:17 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: > Words suffice. We go by what people actually say rather than whatever > they might have meant. When we read text, we go by what's written. That is a worthy opinion, but not one that is shared, either in principle or in lived practice (esp. related to digital communication) by vast numbers of people. One of the strengths of Unicode has always been its willingness to deal with actual use of writing and notational systems - sometimes after a bit of a delay. In other words, Unicode is rarely prescriptive, unless positive interchange isn't possible otherwise. And that reactiveness is a good thing, as much as the result can look a bit "messy" at times and time and again refuses to fit a nice&clean single conceptual framework. A./ From unicode at unicode.org Fri Feb 16 02:25:03 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 16 Feb 2018 00:25:03 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 16 02:40:50 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 16 Feb 2018 00:40:50 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 16 02:56:00 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 16 Feb 2018 00:56:00 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: Asmus Freytag wrote: >> Words suffice. We go by what people actually say rather than whatever >> they might have meant. When we read text, we go by what's written. > > That is a worthy opinion, but not one that is shared, either in principle > or in lived practice (esp. related to digital communication) by vast numbers > of people. True, but there are also plenty of people who strive to say what they mean and mean what they say. From unicode at unicode.org Fri Feb 16 04:04:54 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Fri, 16 Feb 2018 11:04:54 +0100 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: A few points 1. To add to what Asmus said, see also http://unicode.org/L2/L2018/18044-encoding-emoji.pdf "Their encoding, surprisingly, has been a boon for language support. The emoji draw on Unicode mechanisms that are used by various languages, but which had been incompletely implemented on many platforms. Because of the demand for emoji, many implementations have upgraded their Unicode support substantially. That means that implementations now have far better support for the languages that use the more complicated Unicode mechanisms." An example of that is MySQL, where the rise of emoji led to non-BMP support. 2. Aside from SEI (at UCB), we've also been able to fund a number of projects such as http://blog.unicode.org/2016/12/adopt-character-grant-to-support-indic.html 4.
Finally, I'd like to point out that this external mailing list is open to anyone (subject to civil behavior), with the main goal being to provide a forum for people to ask questions about how to deploy, use, and contribute to Unicode, and get answers from a community of users. Those who want to engage in extended kvetching can take that to the rightful place: *Twitter*. Mark Mark On Fri, Feb 16, 2018 at 9:25 AM, Asmus Freytag via Unicode < unicode at unicode.org> wrote: > On 2/15/2018 11:54 PM, James Kass via Unicode wrote: > > Pierpaolo Bernardi wrote: > > > But it's always a good time to argue against the addition of more > nonsense to what we already have got. > > It's an open-ended set and precedent for encoding them exists. > Generally, input regarding the addition of characters to a repertoire > is solicited from the user community, of which I am not a member. > > My personal feeling is that all of the time, effort, and money spent > by the various corporations in promoting the emoji into Unicode would > have been better directed towards something more worthwhile, such as > the unencoded scripts listed at: > > http://www.linguistics.berkeley.edu/sei/scripts-not-encoded.html > > ... but nobody asked me. > > > Curiously enough it is the emoji that keep a large number of users (and > companies > serving them) engaged with Unicode who would otherwise be likely to come > to the conclusion that Unicode is "done" as far as their needs are > concerned. > > Few, if any, of the not-yet-encoded scripts are used by large living > populations, > therefore they are not urgently missing / needed in daily life and are of > interest > primarily to specialists. > > Emoji are definitely up-ending that dynamic, which I would argue is a good > thing. > > A financially well endowed Consortium with strong membership is a > prerequisite > to fulfilling the larger cultural mission of Unicode. Sure, for the > populations > whose scripts are already encoded, there are separate issues that will keep > some interest alive, like solving problems related to algorithms and > locales, but > also dealing with extensions of existing scripts and notational systems - > although > few enough of those are truly urgent/widely used. > > The University of Berkeley people would be the first to tell you how their > funding > picture is positively influenced by the current perceived relevancy of > the Unicode > Consortium - much of it being due to those emoji. > > A./ > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 16 04:42:51 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 16 Feb 2018 11:42:51 +0100 Subject: Origin of Alphasyllabaries (was: Why so much emoji nonsense?) In-Reply-To: <20180216005911.7b2b9012@JRWUBU2> References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <20180216005911.7b2b9012@JRWUBU2> Message-ID: 2018-02-16 1:59 GMT+01:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Wed, 14 Feb 2018 21:49:57 +0100 > Philippe Verdy via Unicode wrote: > > > The concept of vowels as distinctive letters came later; even the > > letter A was initially the representation of a glottal stop consonant, > > sometimes mute, written only to indicate a word whose first syllable > > did not begin with a consonant. This has survived today in abjads and > > abugidas, where vowels became optional diacritics, and evolved into > > plain diacritics in Indic abugidas. > > OK.
> > The situation is even more complex because clusters of consonants
> > were also represented in early vowel-less alphabets to represent full
> > syllables (this has formed the base of today's syllabaries, when only
> > some glyph variants of the base consonant were introduced to
> > distinguish their vocalization;
>
> The only syllabary where what you say might be true is the Ethiopic
> syllabary, and I have grave doubts as to that case.
>
> I hope you are aware that most syllabaries do not derive from
> alphabets, abjads or abugidas.
>
I said the opposite: the alphabets, abjads, abugidas and today's full
syllabaries derive from early simplified syllabaries, themselves derived
from simplified pictograms (ideograms becoming phonograms).
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri Feb 16 04:57:57 2018
From: unicode at unicode.org (Phake Nick via Unicode)
Date: Fri, 16 Feb 2018 10:57:57 +0000
Subject: Why so much emoji nonsense?
Message-ID:

2018-02-16 FRI 15:55, James Kass via Unicode wrote:
> Pierpaolo Bernardi wrote:
>
> > But it's always a good time to argue against the addition of more
> > nonsense to what we already have got.
>
> It's an open-ended set and precedent for encoding them exists.
> Generally, input regarding the addition of characters to a repertoire
> is solicited from the user community, of which I am not a member.
>
> My personal feeling is that all of the time, effort, and money spent
> by the various corporations in promoting the emoji into Unicode would
> have been better directed towards something more worthwhile, such as
> the unencoded scripts listed at:
>
> http://www.linguistics.berkeley.edu/sei/scripts-not-encoded.html
>
> ... but nobody asked me.
>

1. In UTS #51, it is mentioned that embedded graphics are the way to go
as a longer-term solution to emoji, in addition to emoji characters. But
that would require substantial infrastructure changes, and even then they
would most probably not be supported in pure text environments.

2. Actually, the problem is not just limited to emoji. Many ideographic
characters (Chinese, Japanese, etc.) are added to Unicode each year.
While at the current rate there is still plenty of room in the Unicode
standard to contain them, the repertoire is still more open-ended than
would be desired for a multilingual encoding system, and that also makes
it hard to expect newly encoded ideographic characters to just "work" on
different systems with sufficient font support. The fact that a character
has to be encoded in Unicode before it can be exchanged digitally has
also limited users' ability to create new characters in an ad hoc manner,
which is something that probably happened more often in the pre-digital
era. Different parties have proposed solutions to dynamically construct
and use such characters as desired instead of relying on an encoding
mechanism, but they all seem so radically different from modern computer
infrastructure that they are not being adopted.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri Feb 16 09:37:11 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 16 Feb 2018 15:37:11 +0000
Subject: Origin of Alphasyllabaries (was: Why so much emoji nonsense?)
In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <20180216005911.7b2b9012@JRWUBU2>
Message-ID: <20180216153711.597f90cc@JRWUBU2>

On Fri, 16 Feb 2018 11:42:51 +0100
Philippe Verdy via Unicode wrote:

> I said the opposite: the alphabets, abjads, abugidas and today's full
> syllabaries derive from early simplified syllabaries,...

In the Old World, alphabets and abugidas derive from abjads, which do not
derive from syllabaries. I'm counting Ancient Egyptian as an abjad, as
that is the category that fits the purely phonetic writings best.

> ...themselves
> derived from simplified pictograms (ideograms becoming phonograms).

This bit is true.

From unicode at unicode.org Fri Feb 16 10:00:40 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 16 Feb 2018 16:00:40 +0000
Subject: Why so much emoji nonsense?
In-Reply-To: References: Message-ID: <20180216160040.1e630740@JRWUBU2>

On Fri, 16 Feb 2018 10:57:57 +0000
Phake Nick via Unicode wrote:

> 2. Actually, the problem is not just limited to emoji. Many ideographic
> characters (Chinese, Japanese, etc.) are added to Unicode each year.
> While at the current rate there is still plenty of room in the Unicode
> standard to contain them, the repertoire is still more open-ended than
> would be desired for a multilingual encoding system, and that also
> makes it hard to expect newly encoded ideographic characters to just
> "work" on different systems with sufficient font support.

Isn't Unicode designed to stifle innovation? -:)

Actually, there are two mechanisms that could be made to support
innovations.

For characters with limited dissemination, one can revert to a font-based
mechanism that defines properties for graphical PUA characters. The
problem is that that won't work at all well in plain text like this
email. I thought a specialised version of the scheme was already working
for Japanese names - PUAs started as a temporary extension measure for
CJK encodings.

A more portable solution for ideographs is to render an Ideographic
Description Sequence (IDS) as an approximation to the character it
describes. The Unicode Standard carefully does not prohibit so doing,
and a similar scheme is being developed for blocks of Egyptian
Hieroglyphs, and has been proposed for Mayan as well. There may be merit
in making the rendering of an IDS ugly, so as to encourage its
replacement by the encoding of the character.

I gather that making the use of IDSes consistent with searching is
considered daunting.

Richard.

From unicode at unicode.org Fri Feb 16 10:22:23 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Fri, 16 Feb 2018 08:22:23 -0800
Subject: IDC's versus Egyptian format controls (was: Re: Why so much emoji nonsense?)
In-Reply-To: <20180216160040.1e630740@JRWUBU2>
References: <20180216160040.1e630740@JRWUBU2>
Message-ID:

On 2/16/2018 8:00 AM, Richard Wordingham via Unicode wrote:

> A more portable solution for ideographs is to render an Ideographic
> Description Sequence (IDS) as an approximation to the character it
> describes. The Unicode Standard carefully does not prohibit so doing,
> and a similar scheme is being developed for blocks of Egyptian
> Hieroglyphs, and has been proposed for Mayan as well.

A point of clarification: The IDC's (ideographic description characters)
are explicitly *not* format controls. They are visible graphic symbols
that sit visibly in text.
There is a specified syntax for stringing them together into sequences
with ideographic characters and radicals to *suggest* a specific form of
CJK (or other ideographic) character assembled from the pieces in a
certain order -- but there is no implication that a generic text layout
process *should* attempt to assemble that described character as a
single glyph. IDC's are a *description* methodology. IDC's are
General_Category=So.

The Egyptian quadrat controls, on the other hand, are full-fledged
Unicode format controls. They do not just describe hieroglyphic quadrats
-- they are intended to be implemented in text format software and
OpenType fonts to actually construct and display fully-formed quadrats
on the fly. They will be General_Category=Cf.

Mayan will work in a similar manner, although the specification of the
sign list and exact required set of format controls is not yet as mature
as that for Egyptian.

--Ken

From unicode at unicode.org Fri Feb 16 10:41:47 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Fri, 16 Feb 2018 08:41:47 -0800
Subject: IDC's versus Egyptian format controls
In-Reply-To: References: <20180216160040.1e630740@JRWUBU2>
Message-ID:

On 2/16/2018 8:22 AM, Ken Whistler wrote:

> The Egyptian quadrat controls, on the other hand, are full-fledged
> Unicode format controls.

One more point of distinction: The (gc=So) IDC's follow a syntax that
uses Polish notation order for the descriptive operators (inherited from
the intended use in GB 18030, where these came from in the first place).
That order minimizes ambiguity of representation without requiring
bracketing, but it has the disadvantage of being hard for humans to
interpret easily in complicated cases.

The Egyptian format controls use an infix notation, instead. That
follows current Egyptologists' practice of representing quadrats with
MdC conventions. It is also a better order for the layout engine
processing. The disadvantage is that it requires a bracketing notation
to deal with ambiguities of operator precedence in complicated cases.

--Ken

From unicode at unicode.org Fri Feb 16 12:20:00 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 16 Feb 2018 18:20:00 +0000
Subject: IDC's versus Egyptian format controls
In-Reply-To: References: <20180216160040.1e630740@JRWUBU2>
Message-ID: <20180216182000.5c2a4431@JRWUBU2>

On Fri, 16 Feb 2018 08:22:23 -0800
Ken Whistler via Unicode wrote:

> On 2/16/2018 8:00 AM, Richard Wordingham via Unicode wrote:
>
> > A more portable solution for ideographs is to render an Ideographic
> > Description Sequence (IDS) as an approximation to the character it
> > describes. The Unicode Standard carefully does not prohibit so
> > doing, and a similar scheme is being developed for blocks of
> > Egyptian Hieroglyphs, and has been proposed for Mayan as well.
>
> A point of clarification: The IDC's (ideographic description
> characters) are explicitly *not* format controls. They are visible
> graphic symbols that sit visibly in text.

That doesn't square well with, "An implementation may render a valid
Ideographic Description Sequence either by rendering the individual
characters separately or by parsing the Ideographic Description
Sequence and drawing the ideograph so described." (TUS 10.0 p704, in
Section 18.2)

The reason for comparison with Egyptian quadrat controls is the scaling
issue. The thickness of brush strokes should be consistent across the
ideograph, which increases the complexity of a font that parses the
descriptions.
Outline hieroglyphic quadrats have the same problem. However, as I said
before, there is a good argument for rendering an IDS inelegantly.

Richard.

From unicode at unicode.org Fri Feb 16 12:44:29 2018
From: unicode at unicode.org (Manish Goregaokar via Unicode)
Date: Fri, 16 Feb 2018 10:44:29 -0800
Subject: Unicode of Death 2.0
In-Reply-To: References: Message-ID:

FWIW I dissected the crashing strings; it's basically all
<consonant1, virama, consonant, zwnj, vowel> sequences in Telugu,
Bengali, Devanagari where the consonant is suffix-joining (ra in
Devanagari, jo and ro in Bengali, and all Telugu consonants), the vowel
is not Bengali au or o / Telugu ai, and if the second consonant is ra/ro
the first one is not also ra/ro (or ro-with-line-through-it).

https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/

-Manish

On Thu, Feb 15, 2018 at 10:58 AM, Philippe Verdy via Unicode <
unicode at unicode.org> wrote:

> That's probably not a bug of Unicode but of MacOS/iOS text renderers
> with some fonts using advanced composition features.
>
> Similar bugs could as well affect the new advanced features added in
> Windows or Android to support multicolored emojis, variable fonts,
> contextual glyph transforms, style variants, or more font formats (not
> just OpenType); the bug may also be in the graphic renderer (incorrect
> clipping when drawing the glyph into the glyph cache, with buffer
> overflows possibly caused by incorrectly computed splines), and it
> could be in the display driver (or in the hardware accelerator having
> some limitations on the complexity of multipolygons to fill and to
> antialias), causing some infinite recursion loop, or too deep recursion
> exhausting the stack limit.
>
> Finally, the bug could be in the OpenType hinting engine moving some
> points outside the clipping area (the math theory may say that such
> placement of a point outside the clipping area may be impossible, but
> various mathematical simplifications and shortcuts are used to simplify
> or accelerate the rendering, at the price of some quirks). Even the SVG
> standard (in constant evolution) could be affected as well in its
> implementation.
>
> There are tons of possible bugs here.
>
> 2018-02-15 18:21 GMT+01:00 James Kass via Unicode :
>
>> This article:
>> https://techcrunch.com/2018/02/15/iphone-text-bomb-ios-mac-crash-apple/?ncid=mobilenavtrend
>>
>> The single Unicode symbol referred to in the article results from a
>> string of Telugu characters. The article doesn't list or display the
>> characters, so Mac users can visit the above link. A link in one of
>> the comments leads to a page which does display the characters.
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri Feb 16 13:00:45 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Fri, 16 Feb 2018 11:00:45 -0800
Subject: IDC's versus Egyptian format controls
In-Reply-To: <20180216182000.5c2a4431@JRWUBU2>
References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2>
Message-ID: <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com>

On 2/16/2018 10:20 AM, Richard Wordingham via Unicode wrote:

> On Fri, 16 Feb 2018 08:22:23 -0800
> Ken Whistler via Unicode wrote:
>
>> A point of clarification: The IDC's (ideographic description
>> characters) are explicitly *not* format controls. They are visible
>> graphic symbols that sit visibly in text.
>
> That doesn't square well with, "An implementation may render a valid
> Ideographic Description Sequence either by rendering the individual
> characters separately or by parsing the Ideographic Description
> Sequence and drawing the ideograph so described." (TUS 10.0 p704, in
> Section 18.2)

Should we ask to make the default behavior (visible IDS characters) more
explicit?

I don't mind allowing the other as an option (it's kind of the reverse of
the "show invisible" mode, which we also allow, but for which we do have
a clear default).

> The reason for comparison with Egyptian quadrat controls is the scaling
> issue. The thickness of brush strokes should be consistent across the
> ideograph, which increases the complexity of a font that parses the
> descriptions. Outline hieroglyphic quadrats have the same problem.
> However, as I said before, there is a good argument for rendering an
> IDS inelegantly.
>
> Richard.

From unicode at unicode.org Fri Feb 16 13:10:29 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Fri, 16 Feb 2018 11:10:29 -0800
Subject: IDC's versus Egyptian format controls
In-Reply-To: <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com>
References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com>
Message-ID:

On 2/16/2018 11:00 AM, Asmus Freytag via Unicode wrote:

>> That doesn't square well with, "An implementation *may* render a valid
>> Ideographic Description Sequence either by rendering the individual
>> characters separately or by parsing the Ideographic Description
>> Sequence and drawing the ideograph so described." (TUS 10.0 p704, in
>> Section 18.2)

Emphasis on the "may". In point of fact, no widespread layout engine or
set of fonts does parse IDS'es to turn them into single ideographs for
display. That would be a highly specialized display.

> Should we ask to make the default behavior (visible IDS characters)
> more explicit?

Ask away.

--Ken

> I don't mind allowing the other as an option (it's kind of the reverse
> of the "show invisible" mode, which we also allow, but for which we do
> have a clear default).

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri Feb 16 13:28:13 2018
From: unicode at unicode.org (Asmus Freytag (c) via Unicode)
Date: Fri, 16 Feb 2018 11:28:13 -0800
Subject: IDC's versus Egyptian format controls
In-Reply-To: References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com>
Message-ID: <86554342-00d5-8fd8-ad80-3cb1cca6f50b@ix.netcom.com>

On 2/16/2018 11:10 AM, Ken Whistler wrote:

It's the "may either" which is not the same as "may also".

A./

> Emphasis on the "may". In point of fact, no widespread layout engine
> or set of fonts does parse IDS'es to turn them into single ideographs
> for display. That would be a highly specialized display.
>
>> Should we ask to make the default behavior (visible IDS characters)
>> more explicit?
>
> Ask away.
>
> --Ken

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri Feb 16 16:27:24 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 16 Feb 2018 22:27:24 +0000
Subject: IDC's versus Egyptian format controls
In-Reply-To: References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com>
Message-ID: <20180216222724.00b2cbb4@JRWUBU2>

On Fri, 16 Feb 2018 11:10:29 -0800
Ken Whistler via Unicode wrote:

> Emphasis on the "may". In point of fact, no widespread layout engine
> or set of fonts does parse IDS'es to turn them into single ideographs
> for display. That would be a highly specialized display.

And doing it reasonably well could be a lot of work. However, I don't
see any good reason to discourage fonts from doing it by default, which
is what is now being proposed.

> > Should we ask to make the default behavior (visible IDS characters)
> > more explicit?
>
> Ask away.
>
> > I don't mind allowing the other as an option (it's kind of the
> > reverse of the "show invisible" mode, which we also allow, but for
> > which we do have a clear default).

If that analogy is to be enforced, that strikes me as a major change to
the allowed meaning of the IDCs. A default form should be the natural
form for reading, and it has already been stated that visible IDCs are
not intuitive. And I thought I was joking when I suggested that Unicode
was deliberately designed to stifle innovation.

Now, one could suggest that IDCs should be retained as sutures in parsed
IDSes. However, even that is a change in the character identity.

Having visible IDCs is rather like making every Devanagari virama
visible. It's an admission that the font cannot cope. For IDSes, it is
not unreasonable for a font to lack the ability to parse them.

Richard.
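(A small aside for those following the parsing subthread: the prefix
syntax Ken describes is mechanically easy to check. Below is a minimal
sketch in Python, using the published IDC arities, U+2FF0..U+2FFB, where
U+2FF2 and U+2FF3 are ternary and the rest binary. It simplifies by
treating any non-IDC code point as a valid operand; the function names
are just illustrative.)

    # Arity of each Ideographic Description Character (U+2FF0..U+2FFB):
    # U+2FF2 and U+2FF3 take three operands, the other ten take two.
    IDC_ARITY = {chr(cp): 3 if cp in (0x2FF2, 0x2FF3) else 2
                 for cp in range(0x2FF0, 0x2FFC)}

    def parse_ids(s, i=0):
        """Consume one IDS starting at index i; return the index past it.

        An IDS is either a single non-IDC character (an ideograph,
        radical, or stroke) or an IDC followed by its operands, in
        Polish (prefix) order.
        """
        if i >= len(s):
            raise ValueError("truncated IDS")
        c = s[i]
        i += 1
        for _ in range(IDC_ARITY.get(c, 0)):  # plain characters take none
            i = parse_ids(s, i)
        return i

    def is_valid_ids(s):
        """True if s is exactly one well-formed IDS."""
        try:
            return parse_ids(s) == len(s)
        except ValueError:
            return False

    # U+2FF0 U+6728 U+6728 describes U+6797 (two trees side by side).
    assert is_valid_ids("\u2FF0\u6728\u6728")
    assert not is_valid_ids("\u2FF0\u6728")   # missing second operand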
From unicode at unicode.org Fri Feb 16 17:25:22 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Fri, 16 Feb 2018 15:25:22 -0800
Subject: IDC's versus Egyptian format controls
In-Reply-To: <20180216222724.00b2cbb4@JRWUBU2>
References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2>
Message-ID:

Richard Wordingham wrote,

> And doing it reasonably well could be a lot of work.
> However, I don't see any good reason to discourage
> fonts from doing it by default, which is what is now
> being proposed.

Some people studying Han characters use the IDCs to illustrate the
ideographs and their components for various purposes. For example:

U-0002A8B8 ?? ???
U-0002A8B9 ?? ???
U-0002A8BA ?? ???
U-0002A8BB ?? ???
U-0002A8BC ?? ???
U-0002A8BD ?? ???
U-0002A8BE ?? ???
U-0002A8BF ?? ???
U-0002A8C0 ?? ???
U-0002A8C1 ?? ???

It would probably be disconcerting if the display of those sequences
changed into their respective characters overnight. Such usage might be
limited to scholars and students, and a desire for default composition
might outweigh scholarly concerns, but IMHO to say that 'doing it
reasonably well at the font level would be a lot of work' is a vast
understatement.

From unicode at unicode.org Fri Feb 16 18:48:10 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 17 Feb 2018 00:48:10 +0000
Subject: IDC's versus Egyptian format controls
In-Reply-To: References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2>
Message-ID: <20180217004810.6238bf5c@JRWUBU2>

On Fri, 16 Feb 2018 15:25:22 -0800
James Kass via Unicode wrote:

> Some people studying Han characters use the IDCs to illustrate the
> ideographs and their components for various purposes. [...]
>
> It would probably be disconcerting if the display of those
> sequences changed into their respective characters overnight.

And it would be extremely disconcerting if this post was suddenly
rendered in mediaeval black letters, but in theory that could happen.
One can argue that once the compound ideograph has been encoded, the IDS
should no longer be interpreted. However, I think it will be difficult
to do this in practice.

> Such usage might be limited to scholars and students, and a desire for
> default composition might outweigh scholarly concerns,

The lack of mix and match control of the font choices for 'plain text'
presentations is disappointing. We probably need a pair of OpenType
features, one to discourage and one to encourage interpretation of
IDSes. For web pages and PDFs one should be able to specify the font or
fonts, and OpenType features are increasingly being supported.

> but IMHO to say that 'doing it reasonably well at the font level would
> be a lot of work' is a vast understatement.

That was my first thought, but I had worried that I might have been
overestimating. For the examples you give above, I strongly suspect
that Code2001 already contains the requisite glyph halves.

There is another possible use of the latitude given by TUS 5.0 to 10.0
and possibly earlier.
I can certainly imagine a case where someone writes a font so that an
unencoded character may be manipulated like any other character. He has
two choices - he can put it in the PUA, or he can make it the ligature
for the IDS. If he chooses the former, and then the text and font are
separated, the recipient of the text is left with tofu for the
character. If he chooses the latter, the recipient of the text would at
least have the IDS. I think the latter outcome is the better outcome.

Richard.

From unicode at unicode.org Fri Feb 16 19:34:22 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Fri, 16 Feb 2018 17:34:22 -0800
Subject: IDC's versus Egyptian format controls
In-Reply-To: <20180217004810.6238bf5c@JRWUBU2>
References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2> <20180217004810.6238bf5c@JRWUBU2>
Message-ID:

Richard Wordingham wrote:

> There is another possible use of the latitude given by TUS 5.0 to 10.0
> and possibly earlier. I can certainly imagine a case where someone
> writes a font so that an unencoded character may be manipulated like
> any other character. He has two choices - he can put it in the PUA, or
> he can make it the ligature for the IDS. If he chooses the former, and
> then the text and font are separated, the recipient of the text is left
> with tofu for the character. If he chooses the latter, the recipient
> of the text would at least have the IDS. I think the latter outcome is
> the better outcome.

Yes, I think it's much better to leave the unencoded ideograph unmapped
(not assigned within the font to a Unicode code point) and treated as a
font ligature. If the unencoded ideograph is encoded, then the ligature
glyph would be mapped to the actual character, of course.

When estimating the complexity of the look-up tables involved, please
keep in mind that, as the complexity of the ideograph increases, so does
the number of different ways of breaking down that ideograph. And all of
those ways would need to be accommodated in the look-up tables. For
example, U+2A7FF "??", according to my notes, can be described in two
pieces (???). The right half "?" can be further broken down into three
components (?????). The left half could also be broken down further.

From unicode at unicode.org Fri Feb 16 20:05:41 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Fri, 16 Feb 2018 18:05:41 -0800
Subject: IDC's versus Egyptian format controls
In-Reply-To: <20180217004810.6238bf5c@JRWUBU2>
References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2> <20180217004810.6238bf5c@JRWUBU2>
Message-ID:

Richard Wordingham wrote:

> One can argue that once the compound ideograph has been encoded, the
> IDS should no longer be interpreted.

Wouldn't that break existing data? If this sort of thing were done at
OS or app level, it might be possible to replace the IDS string with the
appropriate character upon file save in some kind of automatic fashion.
But I'd sure hate for that to happen to any of my text files without
warning.
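(To make that worry concrete, here is a minimal sketch of such a
save-time replacement pass. The mapping table and the confirmation
callback are invented for the example; a real table would have to come
from the encoding data, and the callback is exactly the "warning" that a
fully automatic pass would otherwise skip.)

    # Illustrative only: IDSes whose described characters are encoded.
    # The single entry uses an already-encoded character as a stand-in.
    IDS_TO_CHAR = {
        "\u2FF0\u6728\u6728": "\u6797",  # IDS for 'two trees' -> U+6797
    }

    def replace_ids_on_save(text, confirm):
        """Replace known IDSes by their encoded characters, longest
        match first, asking confirm(ids, char) before each change."""
        entries = sorted(IDS_TO_CHAR.items(),
                         key=lambda kv: len(kv[0]), reverse=True)
        for ids, char in entries:
            if ids in text and confirm(ids, char):
                text = text.replace(ids, char)
        return text

    # A cautious caller passes an interactive prompt as confirm;
    # passing (lambda ids, char: False) leaves the file untouched.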
From unicode at unicode.org Fri Feb 16 20:08:54 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 16 Feb 2018 18:08:54 -0800 Subject: IDC's versus Egyptian format controls In-Reply-To: References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2> <20180217004810.6238bf5c@JRWUBU2> Message-ID: > Wouldn't that break existing data? Functionality, not data. From unicode at unicode.org Sat Feb 17 03:43:58 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 17 Feb 2018 09:43:58 +0000 Subject: IDC's versus Egyptian format controls In-Reply-To: References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2> <20180217004810.6238bf5c@JRWUBU2> Message-ID: <20180217094358.05292de8@JRWUBU2> On Fri, 16 Feb 2018 18:05:41 -0800 James Kass via Unicode wrote: > Richard Wordingham wrote: > > > One can argue that once the compound ideograph have been encoded, > > the IDS should no longer be interpreted. > > Wouldn't that break existing data? If this sort of thing were done at > OS or app level, it might be possible to replace the IDS string with > the appropriate character upon file save in some kind of automatic > fashion. But I'd sure hate for that to happen to any of my text files > without warning. TUS allows one to use an IDS in place of an unencoded character, but not in place of an encoded character. Once the character is encoded, the IDS substitutions should be weeded out. Of course, there is the problem that upgrades to a new version of Unicode can be a mosaic process, with data tables, fonts and rendering engines out of alignment. At least it's a graceful break, unlike the probability of PUA mappings simply vanishing or, worse, changing. Ideally, searching as just searching would use a collation to equate character and IDS. There may be a problem in that two distinct characters could have the same IDS. Search and automatic replacement is more of a problem. I strongly suspect that the rule not to use an IDS in place of an encoded character would only be applied to an input method. There is the very common interpretation that 'should' in the principal clause of a requirement cancels the requirement; formally the justification is that it would be too much work. Enforcing the rule for an unsupported encoded character would be a hostile act. Richard. From unicode at unicode.org Sat Feb 17 06:39:04 2018 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Sat, 17 Feb 2018 13:39:04 +0100 (CET) Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: <1079226995.76169.1518871144119@ox.hosteurope.de> James Kass: > Asmus Freytag wrote: > >>> Words suffice. We go by what people actually say rather than whatever >>> they might have meant. When we read text, we go by what's written. > >> That is a worthy opinion, but not one that is shared, either in principle >> or in lived practice (...) by vast numbers of people. > > True, but there are also plenty of people who strive to say what they > mean and mean what they say. It's astonishing how you apparently ignore how human communication actually works. 
We are not machines where the Shannon-Weaver model of a message encoded
by the sender and accurately decoded by the receiver applies (with some
correction for errors induced by noise in the transmission channel).
Communication, even written, is a very complex process that involves a
lot of unspoken assumptions and external knowledge on all sides. Words
do not suffice. We do not go simply by what's written.

Stuff like typography or emoji can improve the effectiveness and
efficiency of textual communication a lot. (And if used badly or
maliciously they can deter it as well.)

From unicode at unicode.org Sat Feb 17 12:36:08 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sat, 17 Feb 2018 19:36:08 +0100 (CET)
Subject: Why so much emoji nonsense?
In-Reply-To: <1079226995.76169.1518871144119@ox.hosteurope.de>
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <1079226995.76169.1518871144119@ox.hosteurope.de>
Message-ID: <809239934.8782.1518892568487.JavaMail.www@wwinf1h28>

On 17/02/18 13:43, Christoph Päper via Unicode wrote:
[...]
> Stuff like typography or emoji can improve the effectiveness and
> efficiency of textual communication a lot. (And if used badly or
> maliciously they can deter it as well.)

Since poor typography can deteriorate our communication as well, what
people also need is a keyboard layout that can be left on all the time
while giving straightforward access to what we need. Here we'll get the
letter apostrophe and curly quotes, along with acute accent, tilde, and
diaeresis/umlaut, in the Base shift state:

http://charupdate.info/doc/kbenintu/

As already mailed to CLDR-Users, feedback is always welcome.

http://unicode.org/pipermail/cldr-users/2018-February/000737.html

Regards,

Marcel

From unicode at unicode.org Sat Feb 17 13:49:34 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Sat, 17 Feb 2018 12:49:34 -0700
Subject: Unicode of Death 2.0
In-Reply-To: References: Message-ID:

Manish Goregaokar wrote:

> FWIW I dissected the crashing strings; it's basically all
> <consonant1, virama, consonant, zwnj, vowel> sequences in Telugu,
> Bengali, Devanagari where the consonant is suffix-joining (ra in
> Devanagari, jo and ro in Bengali, and all Telugu consonants), the
> vowel is not Bengali au or o / Telugu ai, and if the second consonant
> is ra/ro the first one is not also ra/ro (or ro-with-line-through-it).
>
> https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/

Thanks for this very detailed and informative blog post. It's certainly
better than "probably not a bug of Unicode," implying an outside chance
that it might be.

I've linked Manish's post on FB as a reply to one of those mainstream
articles that repeatedly calls the conjunct a "single character,"
written by a staffer who couldn't be bothered to find out how a writing
system used by 78 million people works.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Sat Feb 17 14:22:55 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sat, 17 Feb 2018 21:22:55 +0100
Subject: Unicode of Death 2.0
In-Reply-To: References: Message-ID:

I would have liked your invented term "left-joining consonants" to take
the usual name "phala forms" (for RA or JA/JO after a virama, generally
named "raphala" or "japhala/jophala").
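(For concreteness: the widely circulated Telugu crasher is one instance
of the pattern Manish describes, assuming the usual identification of
the sequence from the press reports. It can be rebuilt from code points
as below, for testing only on a system you can afford to hang.)

    # <consonant1, virama, consonant, ZWNJ, vowel sign>, here in Telugu.
    crasher = "".join([
        "\u0C1C",  # TELUGU LETTER JA
        "\u0C4D",  # TELUGU SIGN VIRAMA
        "\u0C1E",  # TELUGU LETTER NYA
        "\u200C",  # ZERO WIDTH NON-JOINER
        "\u0C3E",  # TELUGU VOWEL SIGN AA
    ])
    print([hex(ord(c)) for c in crasher])  # inspect without rendering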
And the reason this bug does not occur with some vowels is that these
are vowels in two parts, which are first decomposed into two separate
glyphs and reordered in the glyph buffer, while other vowels do not need
this prior mapping and keep their initial direct mapping from their code
points in fonts. This means the bug has to do with the way the ZWNJ
handling looks for the glyphs of the vowels in the glyph buffer and not
in the initial code point buffer: there's some desynchronization, and
most probably an uninitialized data field (for the lookup made in
handling ZWNJ) if no vowel decomposition was done. (The same data field
is correctly initialized when it is the first consonant which takes an
alternate form before a virama, as in most Indic consonant clusters,
because a glyph buffer is then created.)

Now we have some hints about why the bug does not occur in Kannada or
Khmer: there a glyph buffer is always created, but some shortcut was made
in Devanagari, Bengali, and Telugu to allow processing clusters faster
without always creating a glyph buffer (which allows reordering glyphs
before positioning them), working directly on the code point stream
instead.

So it seems related to the fact that OpenType fonts do not need to
include rules for glyph substitution: the phala forms are represented
without any glyph substitution, by mapping them directly in a separate
table for the consonants. Because there has been no code-to-glyph
substitution, the glyph buffer is not created; but then, when processing
the ZWNJ, the engine looks for data in a glyph buffer that has still not
been initialized (and this is specific to the renderers implemented by
Apple in iOS and MacOS). The bug does not occur if another text
rendering engine is used (e.g. in non-Apple web browsers).

2018-02-16 19:44 GMT+01:00 Manish Goregaokar :

> FWIW I dissected the crashing strings; it's basically all
> <consonant1, virama, consonant, zwnj, vowel> sequences in Telugu,
> Bengali, Devanagari where the consonant is suffix-joining [...]
>
> https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/
>
> -Manish

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sat Feb 17 14:54:57 2018
From: unicode at unicode.org (Manish Goregaokar via Unicode)
Date: Sat, 17 Feb 2018 12:54:57 -0800
Subject: Unicode of Death 2.0
In-Reply-To: References: Message-ID:

Heh, I wasn't aware of the word "phala-form", though that seems
Bengali-specific?

Interesting observation about the vowel glyphs, I'll mention this in the
post. Initially I missed this because I hadn't realized that the Bengali
o vowel crashed (which made me discount this).

Thanks!

-Manish

On Sat, Feb 17, 2018 at 12:22 PM, Philippe Verdy wrote:

> I would have liked your invented term "left-joining consonants" to take
> the usual name "phala forms" (for RA or JA/JO after a virama, generally
> named "raphala" or "japhala/jophala"). [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sat Feb 17 16:09:31 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sat, 17 Feb 2018 23:09:31 +0100 (CET)
Subject: Unicode of Death 2.0
In-Reply-To: References: Message-ID: <269119379.10220.1518905371457.JavaMail.www@wwinf1h28>

On 17/02/18 21:01, Doug Ewell via Unicode wrote:
[...]
> I've linked Manish's post on FB as a reply to one of those mainstream
> articles that repeatedly calls the conjunct a "single character,"
> written by a staffer who couldn't be bothered to find out how a writing
> system used by 78 million people works.

That's how what is initially a faulty Unicode implementation gets
distorted when brought to the attention of inadvertent customers. It
should be surprising, but isn't really. What about "Apple of Death"?
Looks like something well-known.

Made curious again by Manish's blog post and Doug's comment, I've tried
it in my browser: the Telugu cluster is as inoffensive as any Unicode
characters! Works just fine on Windows. There are cases where I'm not
tempted to cease being conservative :)

Regards,

Marcel

From unicode at unicode.org Sat Feb 17 16:18:25 2018
From: unicode at unicode.org (Adam Borowski via Unicode)
Date: Sat, 17 Feb 2018 23:18:25 +0100
Subject: metric for block coverage
Message-ID: <20180217221825.wovnzpnzftpsjp37@angband.pl>

Hi!
As a part of Debian fonts team work, we're trying to improve fonts
review: ways to organize them, add metadata, pick which fonts are
installed by default and/or recommended to users, etc.

I'm looking for a way to determine a font's coverage of available
scripts. It's probably reasonable to do this per Unicode block. Also,
it's a safe assumption that a font which doesn't know a codepoint can do
no complex shaping of such a glyph, thus looking at just codepoints
should be adequate for our purposes.

A naïve way would be to count codepoints present in the font vs the
number of all codepoints in the block. Alas, there's way too much chaff
for such an approach to be reasonable: ? or ? count the same as LATIN
TURNED CAPITAL LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH
CARON.

Another idea would be giving every codepoint a weight equal to the
number of languages which currently use such a letter. Too bad, that
wouldn't work for symbols, or for dead scripts: a good runic font will
have a complete coverage of Elder Futhark, Anglo-Saxon, Younger and
medieval runes, while only a completionist would care about the Franks
Casket or Tolkien's inventions.

I don't think I'm the first to have this question. Any suggestions?

????!
--
??????? ??????? A dumb species has no way to open a tuna can.
??????? A smart species invents a can opener.
??????? A master species delegates.

From unicode at unicode.org Sat Feb 17 18:30:09 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 18 Feb 2018 01:30:09 +0100
Subject: Unicode of Death 2.0
In-Reply-To: References: Message-ID:

My opinion about this bug is that Apple's text renderer dynamically
allocates a glyph buffer only when needed (lazily), but a test is
missing for the lazy construction of this buffer (which is not needed
for most texts that require no glyph substitution or reordering, where a
single accessor from the code point can find the glyph data directly by
lookup in font tables), and this is causing a null pointer exception at
run time.
The bug occurs effectively when processing the vowel that occurs after the ZWNJ, if the code assumes that there's a glyphs buffer already constructed for the cluster, in order to place the vowel over the correct glyph (which may have been reordered in that buffer). Microsoft's text renderer, or other engines use do not delay the constructiuon of the glyphs buffer, which can be reused for processing the rest of the text, provided it is correctly reset after processing a cluster. 2018-02-17 21:54 GMT+01:00 Manish Goregaokar : > Heh, I wasn't aware of the word "phala-form", though that seems > Bengali-specific? > > Interesting observation about the vowel glyphs, I'll mention this in the > post. Initially I missed this because I hadn't realized that the bengali o > vowel crashed (which made me discount this). > > > Thanks! > > -Manish > > On Sat, Feb 17, 2018 at 12:22 PM, Philippe Verdy > wrote: > >> I would have liked that your invented term of "left-joining consonants" >> took the usual name "phala forms" (to represent RA or JA/JO after a virama, >> generally named "raphala" or "japhala/jophala"). >> >> And why this bug does not occur with some vowels is because these are >> vowels in two parts, that are first decomposed into two separate glyphs >> reordered in the buffer of glyphs, while other vowels do not need this >> prior mapping and keep their initial direct mapping from their codepoints >> in fonts, which means that this has to do to the way the ZWNJ looks for the >> glyphs of the vowels in the glyphs buffer and not in the initial codepoints >> buffer: there's some desynchronization, and more probably an uninitialized >> data field (for the lookup made in handling ZWNJ) if no vowel decomposition >> was done (the same data field is correctly initialized when it is the first >> consonnant which takes an alternate form before a virama, like in most >> Indic consonnant clusters, because the a glyph buffer is created. >> >> Now we have some hints about why the bug does not occur in Kannada or >> Khmer: a glyph buffer is always created, but there was some shortcut made >> in Devanagari, Bengali, and Telugu to allow processing clusters faster >> without having to create always a gyphs buffer (to allow reordering glyphs >> before positioning them), and working directly on the codepoints streams. >> >> So it seems related to the fact that OpenType fonts do not need to >> include rules for glyph substitution, but the PHALA forms are represented >> without any glyph substitution, by mapping directly the phala forms in a >> separate table for the consonants. Because there's been no code to glyph >> subtitution, the glyph buffer is not created, but then when processing the >> ZWNJ, it looks for data in a glyph buffer that has still not be initialized >> (and this is specific to the renderers implemented by Apple in iOS and >> MacOS). This bug does not occur if another text rendering engine is used >> (e.g. in non-Apple web browsers). >> >> >> 2018-02-16 19:44 GMT+01:00 Manish Goregaokar : >> >>> FWIW I dissected the crashing strings, it's basically all >> virama, consonant, zwnj, vowel> sequences in Telugu, Bengali, Devanagari >>> where the consonant is suffix-joining (ra in Devanagari, jo and ro in >>> Bengali, and all Telugu consonants), the vowel is not Bengali au or o / >>> Telugu ai, and if the second consonant is ra/ro the first one is not also >>> ra/ro (or ro-with-line-through-it). 
>>> >>> https://manishearth.github.io/blog/2018/02/15/picking-apart- >>> the-crashing-ios-string/ >>> >>> -Manish >>> >>> On Thu, Feb 15, 2018 at 10:58 AM, Philippe Verdy via Unicode < >>> unicode at unicode.org> wrote: >>> >>>> That's probably not a bug of Unicode but of MacOS/iOS text renderers >>>> with some fonts using advanced composition feature. >>>> >>>> Similar bugs could as well the new advanced features added in Windows >>>> or Android to support multicolored emojis, variable fonts, contextual glyph >>>> transforms, style variants, or more font formats (not just OpenType); the >>>> bug may also be in the graphic renderer (incorrect clipping when drawing >>>> the glyph into the glyph cache, with buffer overflows possibly caused by >>>> incorrectly computed splines), and it could be in the display driver (or in >>>> the hardware accelerator having some limitations on the compelxity of >>>> multipolygons to fill and to antialias), causing some infinite recursion >>>> loop, or too deep recursion exhausting the stack limit; >>>> >>>> Finally the bug could be in the OpenType hinting engine moving some >>>> points outside the clipping area (the math theory may say that such >>>> plcement of a point outside the clipping area may be impossible, but >>>> various mathematical simplifcations and shortcuts are used to simplify or >>>> accelerate the rendering, at the price of some quirks. Even the SVG >>>> standard (in constant evolution) could be affected as well in its >>>> implementation. >>>> >>>> There are tons of possible bugs here. >>>> >>>> 2018-02-15 18:21 GMT+01:00 James Kass via Unicode >>>> : >>>> >>>>> This article: >>>>> https://techcrunch.com/2018/02/15/iphone-text-bomb-ios-mac-c >>>>> rash-apple/?ncid=mobilenavtrend >>>>> >>>>> The single Unicode symbol referred to in the article results from a >>>>> string of Telugu characters. The article doesn't list or display the >>>>> characters, so Mac users can visit the above link. A link in one of >>>>> the comments leads to a page which does display the characters. >>>>> >>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Feb 17 18:36:22 2018 From: unicode at unicode.org (David Starner via Unicode) Date: Sun, 18 Feb 2018 00:36:22 +0000 Subject: metric for block coverage In-Reply-To: <20180217221825.wovnzpnzftpsjp37@angband.pl> References: <20180217221825.wovnzpnzftpsjp37@angband.pl> Message-ID: On Sat, Feb 17, 2018 at 3:30 PM Adam Borowski via Unicode < unicode at unicode.org> wrote: > ? or ? count the same as LATIN TURNED CAPITAL LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH CARON. ? is in Latin-1, and ? is in Latin-A; the first is essential, even in its marginal characters, and the second is pretty consistently useful in the modern world. I don't see the problem or solution here; if something supports a good chunk of the Arabic block, then it supports Arabic, and if you need Persian and it supports Urdu instead, or vice versa, that's no comfort. Too bad, that wouldn't work for symbols, or for dead scripts: a good runic > font will have a complete coverage of elder futhark, anglo-saxon, younger > and medieval, while only a completionist would care about franks casket or > Tolkien's inventions. 
> Where as I might guess that the serious users of Tolkien's runic might rival or outnumber the users of the scripts for other purposes; after all, Anglo-Saxon and other languages that appeared in Runic all have standard Latin orthographies that are more suitable for scholarly purposes. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Feb 17 18:40:26 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 18 Feb 2018 01:40:26 +0100 Subject: Unicode of Death 2.0 In-Reply-To: References: Message-ID: An interesting read: https://docs.microsoft.com/fr-fr/typography/script-development/bengali#reor 2018-02-18 1:30 GMT+01:00 Philippe Verdy : > My opinion about this bug is that Apple's text renderer dynamically > allocates a glyphs buffer only when needed (lazily), but a test is missing > for the lazy construction of this buffer (which is not needed for most > texts not needing glyph substitutions or reordering when a single accessor > from the code point can find the glyph data directly by lookup in font > tables) and this is causing a null pointer exception at run time. > > The bug occurs effectively when processing the vowel that occurs after the > ZWNJ, if the code assumes that there's a glyphs buffer already constructed > for the cluster, in order to place the vowel over the correct glyph (which > may have been reordered in that buffer). > > Microsoft's text renderer, or other engines use do not delay the > constructiuon of the glyphs buffer, which can be reused for processing the > rest of the text, provided it is correctly reset after processing a cluster. > > > 2018-02-17 21:54 GMT+01:00 Manish Goregaokar : > >> Heh, I wasn't aware of the word "phala-form", though that seems >> Bengali-specific? >> >> Interesting observation about the vowel glyphs, I'll mention this in the >> post. Initially I missed this because I hadn't realized that the bengali o >> vowel crashed (which made me discount this). >> >> >> Thanks! >> >> -Manish >> >> On Sat, Feb 17, 2018 at 12:22 PM, Philippe Verdy >> wrote: >> >>> I would have liked that your invented term of "left-joining consonants" >>> took the usual name "phala forms" (to represent RA or JA/JO after a virama, >>> generally named "raphala" or "japhala/jophala"). >>> >>> And why this bug does not occur with some vowels is because these are >>> vowels in two parts, that are first decomposed into two separate glyphs >>> reordered in the buffer of glyphs, while other vowels do not need this >>> prior mapping and keep their initial direct mapping from their codepoints >>> in fonts, which means that this has to do to the way the ZWNJ looks for the >>> glyphs of the vowels in the glyphs buffer and not in the initial codepoints >>> buffer: there's some desynchronization, and more probably an uninitialized >>> data field (for the lookup made in handling ZWNJ) if no vowel decomposition >>> was done (the same data field is correctly initialized when it is the first >>> consonnant which takes an alternate form before a virama, like in most >>> Indic consonnant clusters, because the a glyph buffer is created. 
>>> >>> Now we have some hints about why the bug does not occur in Kannada or >>> Khmer: a glyph buffer is always created, but there was some shortcut made >>> in Devanagari, Bengali, and Telugu to allow processing clusters faster >>> without having to create always a gyphs buffer (to allow reordering glyphs >>> before positioning them), and working directly on the codepoints streams. >>> >>> So it seems related to the fact that OpenType fonts do not need to >>> include rules for glyph substitution, but the PHALA forms are represented >>> without any glyph substitution, by mapping directly the phala forms in a >>> separate table for the consonants. Because there's been no code to glyph >>> subtitution, the glyph buffer is not created, but then when processing the >>> ZWNJ, it looks for data in a glyph buffer that has still not be initialized >>> (and this is specific to the renderers implemented by Apple in iOS and >>> MacOS). This bug does not occur if another text rendering engine is used >>> (e.g. in non-Apple web browsers). >>> >>> >>> 2018-02-16 19:44 GMT+01:00 Manish Goregaokar : >>> >>>> FWIW I dissected the crashing strings, it's basically all >>> virama, consonant, zwnj, vowel> sequences in Telugu, Bengali, Devanagari >>>> where the consonant is suffix-joining (ra in Devanagari, jo and ro in >>>> Bengali, and all Telugu consonants), the vowel is not Bengali au or o / >>>> Telugu ai, and if the second consonant is ra/ro the first one is not also >>>> ra/ro (or ro-with-line-through-it). >>>> >>>> https://manishearth.github.io/blog/2018/02/15/picking-apart- >>>> the-crashing-ios-string/ >>>> >>>> -Manish >>>> >>>> On Thu, Feb 15, 2018 at 10:58 AM, Philippe Verdy via Unicode < >>>> unicode at unicode.org> wrote: >>>> >>>>> That's probably not a bug of Unicode but of MacOS/iOS text renderers >>>>> with some fonts using advanced composition feature. >>>>> >>>>> Similar bugs could as well the new advanced features added in Windows >>>>> or Android to support multicolored emojis, variable fonts, contextual glyph >>>>> transforms, style variants, or more font formats (not just OpenType); the >>>>> bug may also be in the graphic renderer (incorrect clipping when drawing >>>>> the glyph into the glyph cache, with buffer overflows possibly caused by >>>>> incorrectly computed splines), and it could be in the display driver (or in >>>>> the hardware accelerator having some limitations on the compelxity of >>>>> multipolygons to fill and to antialias), causing some infinite recursion >>>>> loop, or too deep recursion exhausting the stack limit; >>>>> >>>>> Finally the bug could be in the OpenType hinting engine moving some >>>>> points outside the clipping area (the math theory may say that such >>>>> plcement of a point outside the clipping area may be impossible, but >>>>> various mathematical simplifcations and shortcuts are used to simplify or >>>>> accelerate the rendering, at the price of some quirks. Even the SVG >>>>> standard (in constant evolution) could be affected as well in its >>>>> implementation. >>>>> >>>>> There are tons of possible bugs here. >>>>> >>>>> 2018-02-15 18:21 GMT+01:00 James Kass via Unicode >>>> >: >>>>> >>>>>> This article: >>>>>> https://techcrunch.com/2018/02/15/iphone-text-bomb-ios-mac-c >>>>>> rash-apple/?ncid=mobilenavtrend >>>>>> >>>>>> The single Unicode symbol referred to in the article results from a >>>>>> string of Telugu characters. The article doesn't list or display the >>>>>> characters, so Mac users can visit the above link. 
From unicode at unicode.org Sun Feb 18 00:31:12 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sat, 17 Feb 2018 22:31:12 -0800
Subject: Why so much emoji nonsense?
In-Reply-To: <1079226995.76169.1518871144119@ox.hosteurope.de>
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <1079226995.76169.1518871144119@ox.hosteurope.de>
Message-ID: 

Christoph Päper wrote,

> Stuff like typography or emoji can improve the
> effectiveness and efficiency of textual communication
> a lot.

"Given that rich text equals plain text plus added information, the
extra information in rich text can be stripped away to reveal the
"pure" text underneath."

"Plain text must contain enough information to permit the text to be
rendered legibly and nothing more."

"The Unicode Standard encodes plain text."

(Above quotes from The Unicode Standard 5.0, pages 18 and 19)

It's true that added features can make for a better presentation.
Removing the special features shouldn't alter the message. The Unicode
Standard draws the line between minimal legibility and special
features. Emoji are in The Standard and have, therefore, been
determined to be required for minimal legibility. If the emoji
repertoire expands and Unicode does not include the new emoji, then
Unicode cannot be depended upon to exchange legible textual data. The
addition of more emoji to Unicode is inevitable.

From unicode at unicode.org Sun Feb 18 01:25:50 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sat, 17 Feb 2018 23:25:50 -0800
Subject: IDC's versus Egyptian format controls
In-Reply-To: <20180217094358.05292de8@JRWUBU2>
References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2> <20180217004810.6238bf5c@JRWUBU2> <20180217094358.05292de8@JRWUBU2>
Message-ID: 

I apologize for apparently misunderstanding the scope of what was being
proposed.

If a finite set of unencoded Han characters needs to be displayed
correctly using IDSes, then the complexity of the look-up tables
depends upon how many characters are in the set. It would probably
best be handled at the font level, and we shouldn't expect any
mainstream support.

If any reasonable IDS is expected to be displayed as a Han ideograph,
then the project would be vast. It's doable, and I'm sure there would
be several approaches. It would not be feasible at the font level.

I agree that the language in the text cited earlier in this thread
should be strengthened to clarify that visible display of the IDCs is
expected behavior, while enabling higher-level protocols to remain
conformant if they attempt to display constructs in place of IDSes.
But I don't think that the fact that a previously unencoded character
has become encoded should forbid any application from making a display
substitution based on IDSes.
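For the finite-set case, the look-up table could be as small as a map
from IDS strings to precomposed glyphs; a minimal sketch (Python; the
IDS entry, the PUA code point, and the whole mapping are hypothetical
examples, not taken from any real font):

.----
#!/usr/bin/python3
# Sketch of a look-up for a *finite* set of unencoded Han characters:
# each known IDS maps to a precomposed glyph (a PUA code point stands in
# for the glyph here). Every entry below is a hypothetical example.

IDS_TO_GLYPH = {
    # U+2FF0 IDC LEFT TO RIGHT + U+6C35 (water radical) + U+9B5A (fish)
    "\u2FF0\u6C35\u9B5A": "\uE000",
}

def substitute_ids(text):
    """Replace known IDS sequences by their precomposed glyphs;
    unknown IDSes stay visible, IDCs and all."""
    for ids, glyph in IDS_TO_GLYPH.items():
        text = text.replace(ids, glyph)
    return text

print(substitute_ids("\u2FF0\u6C35\u9B5A"))    # -> "\ue000"
`----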
From unicode at unicode.org Sun Feb 18 01:57:43 2018
From: unicode at unicode.org (Manish Goregaokar via Unicode)
Date: Sat, 17 Feb 2018 23:57:43 -0800
Subject: Unicode of Death 2.0
In-Reply-To: 
References: 
Message-ID: 

Ah, looking at that, the OpenType `pstf` feature seems relevant, though
I cannot get it to crash with Gurmukhi (where the consonant ya is a
postform).

-Manish

On Sat, Feb 17, 2018 at 4:40 PM, Philippe Verdy wrote:

> An interesting read:
>
> https://docs.microsoft.com/fr-fr/typography/script-development/bengali#reor

[...]
From unicode at unicode.org Sun Feb 18 02:01:53 2018
From: unicode at unicode.org (Manish Goregaokar via Unicode)
Date: Sun, 18 Feb 2018 00:01:53 -0800
Subject: Unicode of Death 2.0
In-Reply-To: 
References: 
Message-ID: 

Oh, also vatu.

Seems like that ordering algorithm is indeed relevant.

-Manish

On Sat, Feb 17, 2018 at 11:57 PM, Manish Goregaokar wrote:

> Ah, looking at that, the OpenType `pstf` feature seems relevant, though
> I cannot get it to crash with Gurmukhi (where the consonant ya is a
> postform).

[...]
From unicode at unicode.org Sun Feb 18 02:18:01 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 18 Feb 2018 00:18:01 -0800
Subject: Unicode of Death 2.0
In-Reply-To: 
References: 
Message-ID: 

Doug Ewell wrote,

> I've linked Manish's post on FB as a reply to one of those mainstream
> articles that repeatedly calls the conjunct a "single character," written by
> a staffer who couldn't be bothered to find out how a writing system used by
> 78 million people works.

Linking Manish's information in reply was a Good Thing™. A lot of
people aren't even aware that complex scripts exist and may have no
suspicion that any writing system other than their own would work
differently.

From unicode at unicode.org Sun Feb 18 04:14:46 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 18 Feb 2018 02:14:46 -0800
Subject: metric for block coverage
In-Reply-To: <20180217221825.wovnzpnzftpsjp37@angband.pl>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl>
Message-ID: 

Adam Borowski wrote,

> I'm looking for a way to determine a font's coverage of available scripts.
> It's probably reasonable to do this per Unicode block. Also, it's a safe
> assumption that a font which doesn't know a codepoint can do no complex
> shaping of such a glyph, thus looking at just codepoints should be adequate
> for our purposes.

You probably already know that basic script coverage information is
stored internally in OpenType fonts in the OS/2 table.

https://docs.microsoft.com/en-us/typography/opentype/spec/os2

Parsing the bits in the "ulUnicodeRange..." entries may be the
simplest way to get basic script coverage info.

OpenType fonts also include script coverage information in the
OpenType tables. A font with an OpenType table for a script would be
likely to have at least some complex script shaping abilities for that
script.
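For what it's worth, those bits are easy to read mechanically; a
minimal sketch (Python, assuming the fontTools library is installed;
the font file name is a placeholder, and the bit numbering comes from
the OpenType OS/2 table specification):

.----
#!/usr/bin/python3
# Sketch: read the OS/2 ulUnicodeRange coverage bits from a font.
# Assumes fontTools; per the OpenType OS/2 specification, bit 0 is
# Basic Latin and bit 9 is Cyrillic.

from fontTools.ttLib import TTFont

def unicode_range_bits(path):
    os2 = TTFont(path)["OS/2"]
    fields = (os2.ulUnicodeRange1, os2.ulUnicodeRange2,
              os2.ulUnicodeRange3, os2.ulUnicodeRange4)
    bits = set()
    for word, dword in enumerate(fields):
        for bit in range(32):
            if dword & (1 << bit):
                bits.add(word * 32 + bit)
    return bits

bits = unicode_range_bits("SomeFont.ttf")    # hypothetical file name
print("claims Cyrillic:", 9 in bits)         # a single yes/no bit, nothing more
`----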
From unicode at unicode.org Sun Feb 18 05:16:04 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 18 Feb 2018 03:16:04 -0800
Subject: metric for block coverage
In-Reply-To: 
References: <20180217221825.wovnzpnzftpsjp37@angband.pl>
Message-ID: 

> OpenType fonts also include script coverage information in the
> OpenType tables. A font with an OpenType table for a script would be
> likely to have at least some complex script shaping abilities for that
> script.

https://docs.microsoft.com/en-us/typography/opentype/spec/chapter2#slTbl_sRec

From unicode at unicode.org Sun Feb 18 05:26:10 2018
From: unicode at unicode.org (Khaled Hosny via Unicode)
Date: Sun, 18 Feb 2018 13:26:10 +0200
Subject: metric for block coverage
In-Reply-To: 
References: <20180217221825.wovnzpnzftpsjp37@angband.pl>
Message-ID: <20180218112610.GA18088@macbook.localdomain>

On Sun, Feb 18, 2018 at 02:14:46AM -0800, James Kass via Unicode wrote:
> Adam Borowski wrote,
>
> > I'm looking for a way to determine a font's coverage of available scripts.
> > It's probably reasonable to do this per Unicode block. Also, it's a safe
> > assumption that a font which doesn't know a codepoint can do no complex
> > shaping of such a glyph, thus looking at just codepoints should be adequate
> > for our purposes.
>
> You probably already know that basic script coverage information is
> stored internally in OpenType fonts in the OS/2 table.
>
> https://docs.microsoft.com/en-us/typography/opentype/spec/os2
>
> Parsing the bits in the "ulUnicodeRange..." entries may be the
> simplest way to get basic script coverage info.

Though this might not be very reliable, since OpenType does not have a
definition of what it means for a Unicode block to be supported; some
font authoring tools use a percentage, others use the presence of any
characters in the range, and fonts might even provide incorrect data
for any reason.

However, I don't think script or block coverage is that useful; what
users are usually interested in is the language coverage.

Regards,
Khaled

From unicode at unicode.org Sun Feb 18 05:40:32 2018
From: unicode at unicode.org (Adam Borowski via Unicode)
Date: Sun, 18 Feb 2018 12:40:32 +0100
Subject: metric for block coverage
In-Reply-To: 
References: <20180217221825.wovnzpnzftpsjp37@angband.pl>
Message-ID: <20180218114031.dlg4vw3g2lm4iodn@angband.pl>

On Sun, Feb 18, 2018 at 12:36:22AM +0000, David Starner wrote:
> On Sat, Feb 17, 2018 at 3:30 PM Adam Borowski via Unicode <
> unicode at unicode.org> wrote:
> > ? or ? count the same as LATIN TURNED CAPITAL
> > LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH CARON.
>
> ? is in Latin-1, and ? is in Latin-A; the first is essential, even in its
> marginal characters, and the second is pretty consistently useful in the
> modern world. I don't see the problem or solution here; if something
> supports a good chunk of the Arabic block, then it supports Arabic, and if
> you need Persian and it supports Urdu instead, or vice versa, that's no
> comfort.

I probably used a bad example: scripts like Cyrillic (not even
Supplement) include both essential letters and those which are historic
only, or used by old folks in a language spoken by 1000 people who use
Russian (or English...) for all computer use anyway -- all within one
block.

What I'm thinking is that a beautiful font that covers Russian,
Ukrainian, Serbian, Kazakh, Mongolian Cyrillic, etc., should be
recommended to users before one whose only grace is including every
single codepoint.

> Too bad, that wouldn't work for symbols, or for dead scripts: a good runic
> font will have a complete coverage of elder futhark, anglo-saxon, younger
> and medieval, while only a completionist would care about franks casket or
> Tolkien's inventions.
> Whereas I might guess that the serious users of Tolkien's runic might
> rival or outnumber the users of the scripts for other purposes; after all,
> Anglo-Saxon and other languages that appeared in Runic all have standard
> Latin orthographies that are more suitable for scholarly purposes.

Hasn't Tolkien moved to Cirth soon after (excuse my ignorance)?

Not sure if I understand your advice right: you're recommending to
ignore all the complexity and to go with just a raw count of in-block
coverage? This could work: a released font probably has the codepoints
its author considers important.

????!
-- 
??????? ??????? Vat kind uf sufficiently advanced technology iz dis!?
??????? -- Genghis Ht'rok'din
???????

From unicode at unicode.org Sun Feb 18 06:04:16 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 18 Feb 2018 13:04:16 +0100
Subject: Unicode of Death 2.0
In-Reply-To: 
References: 
Message-ID: 

Yes, I found other possible crashes, all caused by the glyph
reordering. It seems really that Apple implemented some unsafe
shortcuts by not creating a glyphs buffer in all cases (using lazy
instantiation only when needed), but forgot some cases, and the code
assumes that the glyphs buffer has been initialized and then probably
fails with a null pointer exception or similar.

2018-02-18 9:01 GMT+01:00 Manish Goregaokar :

> Oh, also vatu.
>
> Seems like that ordering algorithm is indeed relevant.
>
> -Manish

[...]
From unicode at unicode.org Sun Feb 18 06:05:29 2018
From: unicode at unicode.org (Adam Borowski via Unicode)
Date: Sun, 18 Feb 2018 13:05:29 +0100
Subject: metric for block coverage
In-Reply-To: 
References: <20180217221825.wovnzpnzftpsjp37@angband.pl>
Message-ID: <20180218120529.funepdzaa2bh3hjt@angband.pl>

On Sun, Feb 18, 2018 at 02:14:46AM -0800, James Kass wrote:
> Adam Borowski wrote,
> > I'm looking for a way to determine a font's coverage of available scripts.
> > It's probably reasonable to do this per Unicode block. Also, it's a safe
> > assumption that a font which doesn't know a codepoint can do no complex
> > shaping of such a glyph, thus looking at just codepoints should be adequate
> > for our purposes.
>
> You probably already know that basic script coverage information is
> stored internally in OpenType fonts in the OS/2 table.
>
> https://docs.microsoft.com/en-us/typography/opentype/spec/os2

It's only a single bit without a meaning beyond "range is considered
functional". No "basic coverage" vs "good coverage" vs "full coverage".

On the other hand, listing raw codepoints in a universal way is as
simple as:

.----
#!/usr/bin/perl -w
# Print every code point covered by the font's character map, one per line.
use Font::FreeType;
$#ARGV==0 or die "Usage: $0 <font-file>\n";
Font::FreeType->new->face($ARGV[0])->foreach_char(sub {
    printf("%04X\n", $_->char_code);
});
`----

These codepoints can then be grouped by block -- but interpreting such
lists is what's unobvious.

????!
-- 
??????? I've read an article about how lively happy music boosts
??????? productivity. You can read it, too, you just need the
??????? right music while doing so. I recommend Skepticism
??????? (funeral doom metal).
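A possible sketch of that grouping step (Python this time, assuming a
local copy of the UCD's Blocks.txt; it reads the codepoint list printed
by the Perl snippet above on stdin):

.----
#!/usr/bin/python3
# Sketch: group codepoints (one hex value per line on stdin, as printed
# by the Perl snippet above) by Unicode block. Assumes a local copy of
# https://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt

import sys
from collections import Counter

def load_blocks(path="Blocks.txt"):
    blocks = []                          # (first, last, name) triples
    for line in open(path, encoding="utf-8"):
        line = line.split("#")[0].strip()
        if not line:
            continue
        rng, name = line.split(";")
        first, last = rng.split("..")
        blocks.append((int(first, 16), int(last, 16), name.strip()))
    return blocks

blocks = load_blocks()
per_block = Counter()
for cp in (int(line, 16) for line in sys.stdin if line.strip()):
    for first, last, name in blocks:
        if first <= cp <= last:
            per_block[name] += 1
            break

for name, count in per_block.most_common():
    print(f"{count:6d}  {name}")         # interpreting this is the hard part
`----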
From unicode at unicode.org Sun Feb 18 06:09:44 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 18 Feb 2018 04:09:44 -0800
Subject: metric for block coverage
In-Reply-To: <20180218114031.dlg4vw3g2lm4iodn@angband.pl>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218114031.dlg4vw3g2lm4iodn@angband.pl>
Message-ID: 

Adam Borowski wrote,

> What I'm thinking is that a beautiful font that covers Russian, Ukrainian,
> Serbian, Kazakh, Mongolian Cyrillic, etc., should be recommended to users
> before one whose only grace is including every single codepoint.

https://docs.microsoft.com/en-us/typography/opentype/spec/chapter2#scripts-and-languages

If there's any language tag in addition to 'dflt' (default) under a
particular script, the font is likely to be more expert in its
development. Beauty is in the eye of the beholder, and both fancy and
utilitarian typefaces can be typographically elegant.

From unicode at unicode.org Sun Feb 18 07:05:42 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 18 Feb 2018 14:05:42 +0100
Subject: Unicode of Death 2.0
In-Reply-To: 
References: 
Message-ID: 

Now, what I suspect in Apple's implementation is the following: the
OpenType specification details the steps to parse strings, find cluster
boundaries, and identify the various character types (joining,
associativity, decomposable characters...).

At first, Apple's engine parses the clusters and marks those that may
require reordering: it can detect the possible presence of reph forms,
before-base consonants, or vowels with multiple components. If this
condition is true, it goes to a "slow path" using the complex algorithm
that requires the preparation of a glyphs buffer. Otherwise it uses a
"fast path" and can work directly at the code point level.

Here the bug is manifested by the behavior of ZWNJ + vowel, because
this code assumes it runs only in the "slow path" (where a glyphs
buffer has been prepared), but we are here in a case for the "fast
path", determined only by the conditions set during cluster parsing.

The "glyphs buffer" may also still be prepared lazily in case of
application of complex GSUB (i.e. not 1-to-1 mappings) in some of the
features. (I don't think that Apple has a bug here; this still allows
switching dynamically from the "fast path" to the "slow path" on
demand, depending on features implemented in fonts.) But any operation
in OpenType that requires reordering requires a glyphs buffer. This
could even apply to Latin if Microsoft really intends to support
normalization (i.e. canonical equivalences) in its own USE engine (for
now it does not), because it would also require a glyphs buffer to
allow correct reordering of glyphs (according to their properties,
notably for "beforebase", or for special placement of some diacritics
such as the cedilla, which moves from "belowbase" to "abovebase" when
the base is the letter "g").

Unfortunately, the OpenType specification is not very clear and is
still a mess to read. In addition, it has been repeatedly moved around
on Microsoft's website (broken URLs all the time): this specification
hosted by Microsoft would better live on a separate, stable website,
not necessarily linked to Microsoft.
These repeated moves (and content conversions, when Microsoft decides
to change the site layout for its own online "developers network"
center) are a problem: the conversion has once again broken a part of
the documentation (see the missing images for illustrations or for
showing some glyphs...). If OpenType is supposed to be interoperable,
Microsoft should make it more stable outside MSDN (GitHub suggested;
Microsoft already moves many of its open-sourced or cooperative
projects there, and GitHub still allows integration from Microsoft's
website, including for commenting the Microsoft documentation for
Windows, Office or XBox apps, which is now on GitHub; GitHub still
permits Microsoft to link back to its own website with tools on the
sidebar without breaking the local content in GitHub projects). This
move would also allow cleaner versioning than what is on MSDN.

--- side comment:

In fact, even the Windows/Office/XBox public developer documentation
could be moved to GitHub. MSDN is completely broken now: it mixes all
versions, with too many "Page not found" errors everywhere, and it is
extremely difficult to make stable references to the doc in development
projects for Windows, when it changes at each major Windows release, or
when a new version is in preparation. MSDN focuses only on the most
recent version; documentation for older versions is completely
forgotten and too frequently broken. This is also a problem for the
support sites of many third-party developers, as well as within
Microsoft's own solution centers and forums, where the solutions are
hard to evaluate and unstable... Microsoft still does not want to honor
a strong recommendation made by the W3C and the IETF: URLs must be
stable (and Microsoft's idea of using its own GUIDs or article IDs to
reference the contents via an indirection is not a solution, because
Microsoft frequently forgets to maintain the targets of these redirects
when content is moved "elsewhere").

2018-02-18 13:04 GMT+01:00 Philippe Verdy :

> Yes, I found other possible crashes, all caused by the glyph
> reordering.

[...]
From unicode at unicode.org Sun Feb 18 07:13:22 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 18 Feb 2018 14:13:22 +0100
Subject: Unicode of Death 2.0
In-Reply-To: 
References: 
Message-ID: 

But any operation in OpenType that requires reordering requires a
glyphs buffer. This could even apply to Latin if Microsoft really
intends to support normalization (i.e. canonical equivalences) in its
own USE engine (for now it does not), because it would also require a
glyphs buffer to allow correct reordering of glyphs (according to their
properties, notably for "beforebase", or for special placement of some
diacritics such as the cedilla, which moves from "belowbase" to
"abovebase" when the base is the letter "g").
Similar complex shaping features may also exist for rendering Latin
Fraktur or Latin medieval texts... Latin is also a very complex script
(probably much more so than even most Indic scripts), as it has really
a lot of contextual "features". Complex shaping is also needed for more
correct handling of the classic cursive style, or decorated "swash"
styles!

Now, with the introduction of "variable fonts", the complexity is
increasing (think about hinting, or kerning, and how some glyphs may
need non-linear interpolation with breaks, for example with variable
weights). The Microsoft USE engine is still a work in progress, and
OpenType will also need major updates to support more scripts (some
scripts are still only partly supported, such as Lanna/Tai Tham).

From unicode at unicode.org Sun Feb 18 07:30:33 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 18 Feb 2018 05:30:33 -0800
Subject: metric for block coverage
In-Reply-To: <20180218120529.funepdzaa2bh3hjt@angband.pl>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl>
Message-ID: 

Adam Borowski wrote,

> It's only a single bit without a meaning beyond "range is considered
> functional". No "basic coverage" vs "good coverage" vs "full coverage".
> ...
> These codepoints can then be grouped by block -- but interpreting such lists
> is what's unobvious.

Compare the number of glyphs in the range with the number of assigned
characters in the range. Older fonts would lack anything added to The
Standard after the font was made.

+1 if the font has any glyphs in the range
+1 if the font has a good portion of glyphs in the range
+1 if the font has all the glyphs in the range
+1 if the font has OpenType tables covering the script
+1 if the script has 1 language tag in addition to 'dflt' tag
+1 if the script has 2 language tags in addition to 'dflt' tag
...

And for a "good portion of glyphs in the range", possibly the number of
characters in the range which were assigned as of Unicode 3.0 would
indicate a more-or-less "basic coverage" of that range.
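That heuristic is straightforward to prototype; a minimal sketch
(Python; the per-range counts and tag sets are assumed to have been
extracted already, and the 0.8 "good portion" cutoff is an arbitrary
assumption):

.----
#!/usr/bin/python3
# Sketch of the scoring heuristic proposed above. Inputs (glyph counts
# per range, OpenType script/language-system tags) are assumed to have
# been extracted elsewhere; the 0.8 threshold is arbitrary.

def range_score(glyphs_in_range, needed_in_range, ot_script_covered,
                language_tags):
    score = 0
    if glyphs_in_range > 0:
        score += 1                               # any coverage at all
    if glyphs_in_range >= 0.8 * needed_in_range:
        score += 1                               # a good portion
    if glyphs_in_range >= needed_in_range:
        score += 1                               # everything needed
    if ot_script_covered:
        score += 1                               # OpenType shaping tables
    extra = [t for t in language_tags if t != "dflt"]
    score += min(len(extra), 2)                  # +1 per extra tag, capped at 2
    return score

# Full coverage, shaping support, two extra language systems ('POL ' is
# the OpenType tag for Polish, 'CSY ' for Czech):
print(range_score(256, 256, True, ["dflt", "POL ", "CSY "]))   # -> 6
`----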
From unicode at unicode.org Sun Feb 18 07:35:00 2018
From: unicode at unicode.org (Janusz S. Bień via Unicode)
Date: Sun, 18 Feb 2018 14:35:00 +0100
Subject: Unicode Digest, Vol 50, Issue 13
In-Reply-To: (via Unicode's message of "Sun, 18 Feb 2018 07:06:14 -0600")
References: 
Message-ID: <86k1vaxvy3.fsf@mimuw.edu.pl>

On Sun, Feb 18 2018 at 14:06 CET, unicode at unicode.org writes:

[...]

> From: Adam Borowski via Unicode
> Subject: metric for block coverage
> To: unicode at unicode.org
> Date: Sat, 17 Feb 2018 23:18:25 +0100
>
> Hi!
> As a part of Debian fonts team work, we're trying to improve fonts review:
> ways to organize them, add metadata, pick which fonts are installed by
> default and/or recommended to users, etc.
>
> I'm looking for a way to determine a font's coverage of available scripts.
> It's probably reasonable to do this per Unicode block. Also, it's a safe
> assumption that a font which doesn't know a codepoint can do no complex
> shaping of such a glyph, thus looking at just codepoints should be adequate
> for our purposes.

As a Debian user using some rare characters for old Polish
transliteration, I would be happy with a tool which scans
available/installed fonts for a specific list of characters and shows
only those fonts which support the whole list. Of course, showing also
the characters in question would be very desirable.

Best regards

Janusz

-- 
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From unicode at unicode.org Sun Feb 18 07:43:09 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 18 Feb 2018 05:43:09 -0800
Subject: metric for block coverage
In-Reply-To: 
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl>
Message-ID: 

> +1 if the font has all the glyphs in the range

should be

> +1 if the font has all the glyphs needed for the range

From unicode at unicode.org Sun Feb 18 10:33:00 2018
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Sun, 18 Feb 2018 18:33:00 +0200
Subject: metric for block coverage
In-Reply-To: <86k1vaxvy3.fsf@mimuw.edu.pl> (unicode@unicode.org)
References: <86k1vaxvy3.fsf@mimuw.edu.pl>
Message-ID: <83606ub6mb.fsf@gnu.org>

> Date: Sun, 18 Feb 2018 14:35:00 +0100
> From: "Janusz S. Bień via Unicode"
>
> As a Debian user using some rare characters for old Polish
> transliteration I would be happy with a tool which scans
> available/installed fonts for a specific list of characters and shows
> only those fonts which support the whole list. Of course showing also
> the characters in question would be very desirable.

I'm sure you know about BabelMap. It has such a feature.

From unicode at unicode.org Sun Feb 18 10:45:36 2018
From: unicode at unicode.org (Janusz S. Bień via Unicode)
Date: Sun, 18 Feb 2018 17:45:36 +0100
Subject: metric for block coverage
In-Reply-To: <83606ub6mb.fsf@gnu.org> (Eli Zaretskii's message of "Sun, 18 Feb 2018 18:33:00 +0200")
References: <86k1vaxvy3.fsf@mimuw.edu.pl> <83606ub6mb.fsf@gnu.org>
Message-ID: <86fu5yxn4f.fsf@mimuw.edu.pl>

On Sun, Feb 18 2018 at 17:33 CET, eliz at gnu.org writes:

> I'm sure you know about BabelMap. It has such a feature.

Yes, I know about BabelMap, but was not aware of the feature. Thank
you.

I'm interested in a tool for Linux. I suppose BabelMap can be run on
Linux with Wine, but will this feature work in such a situation? I can
of course give it a try, but I have practically no experience with
Wine.

Best regards

Janusz

-- 
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
From unicode at unicode.org Sun Feb 18 11:03:43 2018
From: unicode at unicode.org (Adam Borowski via Unicode)
Date: Sun, 18 Feb 2018 18:03:43 +0100
Subject: Unicode Digest, Vol 50, Issue 13
In-Reply-To: <86k1vaxvy3.fsf@mimuw.edu.pl>
References: <86k1vaxvy3.fsf@mimuw.edu.pl>
Message-ID: <20180218170343.ev5z5lqcf3sceutk@angband.pl>

On Sun, Feb 18, 2018 at 02:35:00PM +0100, Janusz S. Bień via Unicode wrote:
> As a Debian user using some rare characters for old Polish
> transliteration I would be happy with a tool which scans
> available/installed fonts for a specific list of characters and shows
> only those fonts which support the whole list. Of course showing also
> the characters in question would be very desirable.

Thanks, your suggestion is a good addition to the wishlist of features
we'd want to have. Especially for the "available" case -- it'd be
tedious to install all candidates just to check them.

As for "installed":

fc-list ':charset=16e5' file family

????!
-- 
??????? ??????? Imagine there are bandits in your house, your kid is bleeding out,
??????? the house is on fire, and seven big-ass trumpets are playing in the
??????? sky. Your cat demands food. The priority should be obvious...
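And for scripting the same test over a whole character list (the tool
wished for above), a minimal sketch (Python, assuming the fontTools
library; the sample codepoints and the font search path are just
examples):

.----
#!/usr/bin/python3
# Sketch: keep only the installed fonts whose cmap covers *every*
# character in a given list. Assumes fontTools; the sample list and
# the search path are examples only.

import glob
from fontTools.ttLib import TTFont

NEEDED = {0x16E5, 0x0105, 0x1E0D}    # example codepoints, incl. U+16E5

for path in glob.glob("/usr/share/fonts/**/*.ttf", recursive=True):
    try:
        cmap = TTFont(path)["cmap"].getBestCmap()
    except Exception:
        continue                     # skip unreadable or odd fonts
    if NEEDED <= cmap.keys():
        print(path)                  # supports the whole list
`----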
From unicode at unicode.org Sun Feb 18 11:19:50 2018
From: unicode at unicode.org (Janusz S. Bień via Unicode)
Date: Sun, 18 Feb 2018 18:19:50 +0100
Subject: Unicode Digest, Vol 50, Issue 13
In-Reply-To: <20180218170343.ev5z5lqcf3sceutk@angband.pl> (Adam Borowski's message of "Sun, 18 Feb 2018 18:03:43 +0100")
References: <86k1vaxvy3.fsf@mimuw.edu.pl> <20180218170343.ev5z5lqcf3sceutk@angband.pl>
Message-ID: <86bmgmxljd.fsf@mimuw.edu.pl>

On Sun, Feb 18 2018 at 18:03 CET, kilobyte at angband.pl writes:

> Thanks, your suggestion is a good addition to the wishlist of features
> we'd want to have. Especially for the "available" case -- it'd be
> tedious to install all candidates just to check them.
>
> As for "installed":
> fc-list ':charset=16e5' file family

Thanks! Some time ago I was looking at various Debian font utilities
and found nothing suitable, but it looks like I should use Google more
intensively:

https://unix.stackexchange.com/questions/162305/find-the-best-font-for-rendering-a-codepoint

Best regards

Janusz

-- 
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From unicode at unicode.org Sun Feb 18 13:10:36 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 18 Feb 2018 19:10:36 +0000
Subject: metric for block coverage
In-Reply-To: <20180218120529.funepdzaa2bh3hjt@angband.pl>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl>
Message-ID: <20180218191036.44ffa6e0@JRWUBU2>

On Sun, 18 Feb 2018 13:05:29 +0100
Adam Borowski via Unicode wrote:

> On Sun, Feb 18, 2018 at 02:14:46AM -0800, James Kass wrote:
> > You probably already know that basic script coverage information is
> > stored internally in OpenType fonts in the OS/2 table.
> >
> > https://docs.microsoft.com/en-us/typography/opentype/spec/os2
>
> It's only a single bit without a meaning beyond "range is considered
> functional". No "basic coverage" vs "good coverage" vs "full
> coverage".

It's worse than that when a script uses characters primarily associated
with another script. For example, to have any confidence that my Tai
Tham font will be used for U+0E4A THAI CHARACTER MAI TRI or U+0E4B THAI
CHARACTER MAI CHATTAWA placed on U+1A4B TAI THAM LETTER A, I have to
set the Thai bit, even though I only have four Thai characters in my
font. (The other two are punctuation.)

Richard.

From unicode at unicode.org Sun Feb 18 13:38:42 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 18 Feb 2018 19:38:42 +0000
Subject: Unicode of Death 2.0
In-Reply-To: 
References: 
Message-ID: <20180218193842.1935d0ce@JRWUBU2>

On Sun, 18 Feb 2018 14:13:22 +0100
Philippe Verdy via Unicode wrote:

> But any operation in OpenType that requires reordering requires a
> glyphs buffer. This could even apply to Latin if Microsoft really
> intends to support normalization (i.e. canonical equivalences) in its
> own USE engine (for now it does not), because it would also require a
> glyphs buffer to allow correct reordering of glyphs (according to
> their properties, notably for "beforebase", or for special placement
> of some diacritics such as the cedilla that moves from "belowbase" to
> "abovebase" when the base is the letter "g").

The examples accompanying the OpenType specification assume a font may
insert spacing glyphs for punctuation in French, so there's no need to
consider anything complicated.

Microsoft renderers aren't immune to problems. I've had whole lines
vanish because of undocumented shortcomings in the implementation of
multiple ligations in a contextual substitution. (I presume the
vanishing was to save me from something worse, such as memory
corruption.) I couldn't see anything wrong with the maxp parameters.
OpenType semantics have not been thoroughly reverse engineered.

Richard.
From unicode at unicode.org Sun Feb 18 13:10:36 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 18 Feb 2018 19:10:36 +0000
Subject: metric for block coverage
In-Reply-To: <20180218120529.funepdzaa2bh3hjt@angband.pl>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl>
Message-ID: <20180218191036.44ffa6e0@JRWUBU2>

On Sun, 18 Feb 2018 13:05:29 +0100 Adam Borowski via Unicode wrote:

> On Sun, Feb 18, 2018 at 02:14:46AM -0800, James Kass wrote:
> > You probably already know that basic script coverage information is
> > stored internally in OpenType fonts in the OS/2 table.
> >
> > https://docs.microsoft.com/en-us/typography/opentype/spec/os2
>
> It's only a single bit without a meaning beyond "range is considered
> functional". No "basic coverage" vs "good coverage" vs "full coverage".

It's worse than that when a script uses characters primarily associated
with another script. For example, to have any confidence that my Tai Tham
font will be used for U+0E4A THAI CHARACTER MAI TRI or U+0E4B THAI
CHARACTER MAI CHATTAWA placed on U+1A4B TAI THAM LETTER A, I have to set
the Thai bit, even though I have only four Thai characters in my font.
(The other two are punctuation.)

Richard.

From unicode at unicode.org Sun Feb 18 13:38:42 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 18 Feb 2018 19:38:42 +0000
Subject: Unicode of Death 2.0
In-Reply-To:
References:
Message-ID: <20180218193842.1935d0ce@JRWUBU2>

On Sun, 18 Feb 2018 14:13:22 +0100 Philippe Verdy via Unicode wrote:

> But any operation in OpenType that requires reordering requires a glyph
> buffer. This could even apply to Latin if Microsoft really intends to
> support normalization (i.e. canonical equivalences) in its own USE
> engine (for now it does not), because it would also require a glyph
> buffer to allow correct reordering of glyphs (according to their
> properties, notably for "beforebase", or for the special placement of
> some diacritics such as the cedilla, which moves from "belowbase" to
> "abovebase" when the base is the letter "g").

The examples accompanying the OpenType specification assume a font may
insert spacing glyphs for punctuation in French, so there's no need to
consider anything complicated.

Microsoft renderers aren't immune to problems. I've had whole lines
vanish because of undocumented shortcomings in the implementation of
multiple ligations in a contextual substitution. (I presume the vanishing
was to save me from something worse, such as memory corruption.) I
couldn't see anything wrong with the maxp parameters. OpenType semantics
have not been thoroughly reverse engineered.

Richard.

From unicode at unicode.org Sun Feb 18 13:47:34 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 18 Feb 2018 20:47:34 +0100
Subject: Unicode of Death 2.0
In-Reply-To: <20180218193842.1935d0ce@JRWUBU2>
References: <20180218193842.1935d0ce@JRWUBU2>
Message-ID:

2018-02-18 20:38 GMT+01:00 Richard Wordingham via Unicode <unicode at unicode.org>:
> […]
> The examples accompanying the OpenType specification assume a font may
> insert spacing glyphs for punctuation in French, so there's no need to
> consider anything complicated.

I was not talking about the possible additional spacing of punctuation in
French; that is simple to handle and does not involve the shaper. It is
just a matter of per-language alternate glyph selection defined in the
font, which can have different mappings with different metrics or
different GPOS, even when the glyphs share the same vector definition via
a simple affine transform.

From unicode at unicode.org Sun Feb 18 14:06:42 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 18 Feb 2018 20:06:42 +0000
Subject: Why so much emoji nonsense?
In-Reply-To:
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <1079226995.76169.1518871144119@ox.hosteurope.de>
Message-ID: <20180218200642.6de1fe52@JRWUBU2>

On Sat, 17 Feb 2018 22:31:12 -0800 James Kass via Unicode wrote:

> It's true that added features can make for a better presentation.
> Removing the special features shouldn't alter the message.

I think I've encountered the use of italics in novels for sotto voce or
asides.

> The Unicode Standard draws the line between minimal legibility and
> special features. Emoji are in The Standard and have, therefore, been
> determined to be required for minimal legibility.

That is a fuzzy boundary, as is evidenced by the optional effects of ZWJ
and ZWNJ in most scripts and of variation sequences (all scripts).
Unicode also avoids text that is 'wrong' but still comprehensible.

Richard.

From unicode at unicode.org Sun Feb 18 14:22:14 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 18 Feb 2018 21:22:14 +0100
Subject: Unicode of Death 2.0
In-Reply-To:
References: <20180218193842.1935d0ce@JRWUBU2>
Message-ID:

To be clear, the OpenType feature application II profile (initially
defined for Arabic) may also be needed in Latin for correctly rendering
cursive Latin styles. For now this application profile II
(https://docs.microsoft.com/fr-fr/typography/script-development/use#featureapplicationii)
has not been extended to cover contextual shaping for cursive Latin, but
it is not nonsense (IMHO) to think about such an extension (using the same
"isol", "init", "medi" and "fina" features that Arabic requires, but as
optional features in Latin?)
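Whether a given font actually carries such rules can be probed with HarfBuzz's Python bindings. A rough sketch, assuming uharfbuzz is installed and a hypothetical font file CursiveLatin.ttf that really has Latin isol/init/medi/fina lookups (both are assumptions, not facts from this thread):

    import uharfbuzz as hb

    # Load the (hypothetical) font into a HarfBuzz font object.
    blob = hb.Blob(open("CursiveLatin.ttf", "rb").read())
    font = hb.Font(hb.Face(blob))

    def shape(text, features):
        buf = hb.Buffer()
        buf.add_str(text)
        buf.guess_segment_properties()
        hb.shape(font, buf, features)
        return [info.codepoint for info in buf.glyph_infos]  # glyph IDs

    # Compare glyph choices with the joining features forced off and on.
    off = shape("minimum", {"init": False, "medi": False, "fina": False})
    on = shape("minimum", {"init": True, "medi": True, "fina": True})
    print(off != on)  # True only if the font really applies such Latin rules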
Fraktur and medieval Latin styles are also challenging, and not correctly
covered by the basic ("standard") profile
(https://docs.microsoft.com/fr-fr/typography/script-development/standard).
Look at the issues listed in the "Other encoding issues" sections of the
specs.

As OpenType is a project co-managed by Microsoft and Adobe, with
additional consultation of Apple, Unicode and Linux developers, I think it
should be brought under a more separate subcommittee, and its
documentation moved to its own website/repository outside the Microsoft
website itself, even if Microsoft still controls its publication and
modification (under agreements with the other OpenType participants in
that ad hoc subcommittee). Given that these companies are also full
Unicode members (and Linux developers are also represented by companies
creating and supporting Linux distributions, including Google and Oracle),
this OpenType initiative should officially become a subcommittee of the
Unicode Consortium (just as when IBM transferred its CLDR project to
Unicode as a subcommittee). But this does not mean that Unicode needs to
host and manage the documentation itself, or some reference implementation
(GitHub looks great for that), or links to existing implementations of the
OpenType core algorithms on popular development platforms (shaping,
vectorization, hinting, rasterization, variable font shapes, colorimetry
for colored emoji, device capability profiles, programmatic transforms of
shapes for generated styles or animated shapes, or for 3D/OpenGL/DirectX
with the addition of rotations and non-linear projections like
perspective...), and some conformance test tools.

In my opinion the "shaper" part of OpenType rendering is the most
important part where the Unicode Consortium and TUS must be synchronized
(and stabilized: we have seen that lack of stability is a severe security
problem; this Apple bug is a big precedent showing that this specification
must be studied more seriously by an open committee).

2018-02-18 20:47 GMT+01:00 Philippe Verdy :
> […]
From unicode at unicode.org Sun Feb 18 15:39:42 2018
From: unicode at unicode.org (David Starner via Unicode)
Date: Sun, 18 Feb 2018 21:39:42 +0000
Subject: metric for block coverage
In-Reply-To: <20180218114031.dlg4vw3g2lm4iodn@angband.pl>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218114031.dlg4vw3g2lm4iodn@angband.pl>
Message-ID:

On Sun, Feb 18, 2018 at 3:42 AM Adam Borowski wrote:

> I probably used a bad example: scripts like Cyrillic (not even Supplement)
> include both essential letters and those which are historic only or used
> by old folks in a language spoken by 1000, who use Russian (or English...)
> for all computer use anyway -- all within one block.
>
> What I'm thinking is that a beautiful font that covers Russian, Ukrainian,
> Serbian, Kazakh, Mongolian cyr, etc., should be recommended to users
> before one whose only grace is including every single codepoint.

I'm not sure what your goal is. Opening up gucharmap shows me that
FreeSerif and Noto Serif both have complete coverage of Cyrillic and
Cyrillic Supplement. We have reasonable fonts to offer users that cover
everything Cyrillic, or pretty much any script in use. I'm not sure where
and how you're trying to draw a line between a beautiful multilingual font
and a workable full font.

Ultimately, when I look at fonts, I look for Esperanto support. I'd be a
little surprised if it didn't come with Polish support, but that's
unlikely to be my problem. A useful feature for a font selector, for me,
would be being able to select English, German and Esperanto and get just
the fonts that support those languages (in an extended sense, including
the extra-ASCII punctuation and accents English needs, for example). It
does me absolutely no good to know that a font has "good, but not
complete" Latin Extended-A support. Likewise, if you're a Persian speaker,
knowing that the Arabic block has "good, but not complete" support is
worthless.

For single-language ancient scripts, like Ancient Greek, virtually any
font with decent coverage should cover the generally useful stuff. For
more complex ancient scripts, it pretty much has to be done per language.
For some ancient scripts, like Runic and Old Italic, I understand that
after the unification of the various writings, most people feel a
language-specific font is necessary for any serious work.

The ultimate problem is the question of whether it will support my needs.
Language can often be used as a proxy, but names can often foil that. And
symbols are worse; ? is the only character from Currency Symbols that's
used in an extended work in many, many instances, but so is ?. Percentage
of block support is minimally helpful. Miscellaneous Symbols lives up to
its name; ?, ?, ?, ?, and ? are all useful symbols, but not likely to be
found in the same work. Again, recommend 100% coverage or do the manual
work of separating them into groups and offering a specific font (game,
occult, etc.) that covers each, but messing around with a beautiful font
with less than 100% coverage versus a decent font with 100% coverage seems
counterproductive.

> Not sure if I understand your advice right: you're recommending to ignore
> all the complexity and going with just raw count of in-block coverage?
> This could work: a released font probably has codepoints its author
> considers important.

I guess separating out by language when you need to is going to be the way
that helps people the most.
Where that's most complex, I'm not sure why you're not just offering a
decent 100% coverage font (which Debian has a decent selection of) and
stepping back.

From unicode at unicode.org Sun Feb 18 16:03:26 2018
From: unicode at unicode.org (Leonardo Boiko via Unicode)
Date: Sun, 18 Feb 2018 23:03:26 +0100
Subject: metric for block coverage
In-Reply-To:
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218114031.dlg4vw3g2lm4iodn@angband.pl>
Message-ID:

The most useful feature for me (Debian user, linguist) would be a search
system where I can provide a string and filter fonts to those that include
glyphs for all its characters; ideally one I could also combine with other
search criteria, like OTF features (true small caps, etc.). I often write
academic texts where I use specialized characters not really classifiable
by language, script or block (say, '?/?' for pīnyīn, plus IPA tone marks,
plus multiple combining diacritics like 'a??', all in the same running
text). I then need visual inspection to choose a font that actually looks
halfway decent, typographically speaking, and to check for bugs in IPA
kerning, etc.

For a long time now, I've been using a simple Python script to filter
fonts in this manner (it just straightforwardly renders the provided
characters, then uses `pango.Layout.get_unknown_glyphs_count()` to remove
fonts lacking them, and displays all the rest for inspection).
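A minimal modern sketch of the same filtering idea, using today's GObject-introspection Pango bindings (the font family name is a placeholder, and this is not the script referred to above; note that Pango's own font fallback can mask gaps):

    import gi
    gi.require_version("Pango", "1.0")
    gi.require_version("PangoCairo", "1.0")
    from gi.repository import Pango, PangoCairo
    import cairo

    # A throwaway 1x1 surface is enough to obtain a Pango layout.
    surface = cairo.ImageSurface(cairo.FORMAT_ARGB32, 1, 1)
    layout = PangoCairo.create_layout(cairo.Context(surface))
    layout.set_font_description(Pango.FontDescription.from_string("Some Family 12"))
    layout.set_text("pīnyīn", -1)
    # 0 means every character was given a real glyph rather than a hex box.
    print(layout.get_unknown_glyphs_count())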
2018-02-18 22:39 GMT+01:00 David Starner via Unicode :
> […]

From unicode at unicode.org Sun Feb 18 17:06:06 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 19 Feb 2018 00:06:06 +0100
Subject: metric for block coverage
In-Reply-To:
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218114031.dlg4vw3g2lm4iodn@angband.pl>
Message-ID:

For Latin, looking at the coverage of Vietnamese usually works quite
well... except for African languages, which need additional uncommon Latin
letters (open o, open e, alpha, some turned/mirrored/stroked letters); in
that case you should also look at IPA coverage (though you may still miss
the associated capital letters sometimes needed in these African
languages). Unfortunately the IPA subset of "Latin" symbols includes many
letters (lowercase only) that have no associated capitals, and matching
the full coverage of IPA is not needed for African languages (and the
variant of "g" added to Latin only for IPA is really unfortunate, where it
should have been encoded as a variant of the standard "g"). And the
CAPITAL SHARP S (Eszett in German) may still frequently be missing from a
font (I don't know whether renderers will look up the glyph from another
font, or whether they will fall back to the lowercase sharp s mapped in
the font!).
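Both this Vietnamese heuristic and the per-language selection wished for earlier can be approximated with fontconfig's language matching. A rough sketch, assuming fc-list is available (the language tags are examples only):

    import subprocess

    def families(pattern):
        out = subprocess.run(["fc-list", pattern, "family"],
                             capture_output=True, text=True, check=True).stdout
        return {line.strip() for line in out.splitlines() if line.strip()}

    print(families(":lang=vi"))                         # claims Vietnamese coverage
    print(families(":lang=de") & families(":lang=eo"))  # German AND Esperanto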
2018-02-18 22:39 GMT+01:00 David Starner via Unicode :
> […]

From unicode at unicode.org Sun Feb 18 23:26:24 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Mon, 19 Feb 2018 06:26:24 +0100 (CET)
Subject: Why so much emoji nonsense?
In-Reply-To: <20180218200642.6de1fe52@JRWUBU2>
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <1079226995.76169.1518871144119@ox.hosteurope.de> <20180218200642.6de1fe52@JRWUBU2>
Message-ID: <2015894766.110.1519017984337.JavaMail.www@wwinf1g21>

On Sun, 18 Feb 2018 20:06:42 +0000, Richard Wordingham via Unicode wrote:
[…]
> Unicode also avoids text that is 'wrong' but still comprehensible.

Unicode should then legalize the use of preformatted superscripts in
Latin script.
This convention appears to be rooted in medieval Latin, for which Unicode
has added all the required superscripts. Interoperable digital
representation of modern languages may differ in policy, but it does not
differ in principle. In practice, a sample layout provides access to all
existing small superscript Latin base letters on live key positions:
http://charupdate.info/doc/kbenintu/#N
The '??' sequence is on key E12, level 1B (with AltGr/Option):
http://charupdate.info/doc/kbenintu/#B
And a "Superscript" dead key is on key C02 [S].

Regards,

Marcel

From unicode at unicode.org Mon Feb 19 07:06:24 2018
From: unicode at unicode.org (=?utf-8?Q? J.=C2=A0S._Choi ?= via Unicode)
Date: Mon, 19 Feb 2018 07:06:24 -0600
Subject: metric for block coverage
Message-ID: <4B23B567-3AC6-401A-AF52-E3FCF17AE498@icloud.com>

Better heuristics for a font's coverage of a human script sound useful,
but don't the standards discourage using code point blocks to determine
whether a character belongs to the repertoire of a human language or
script? Although the specification authors try to arrange characters into
code point blocks as logically as they can, code point blocks are, most of
all, artifacts of Unicode's own patchwork history. Additional information,
such as script information or ICU data, is often required to correctly
determine a script's repertoire of essential characters. See also
https://www.unicode.org/reports/tr18/#Character_Blocks and
https://www.unicode.org/faq/blocks_ranges.html#16.

Another thing to point out is that correct rendering of script-essential
combining characters is another important part of font quality. This
would be difficult to evaluate with a heuristic based only on code point
blocks.

J. S. Choi
Saint Louis University School of Medicine

From unicode at unicode.org Mon Feb 19 08:58:29 2018
From: unicode at unicode.org (Bobby de Vos via Unicode)
Date: Mon, 19 Feb 2018 07:58:29 -0700
Subject: metric for block coverage
In-Reply-To: <20180218191036.44ffa6e0@JRWUBU2>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl> <20180218191036.44ffa6e0@JRWUBU2>
Message-ID: <98a0b14a-8210-33d1-5fe8-01e1e9e06060@sil.org>

On 2018-02-18 12:10, Richard Wordingham via Unicode wrote:
>> It's only a single bit without a meaning beyond "range is considered
>> functional". No "basic coverage" vs "good coverage" vs "full coverage".
> It's worse than that when a script uses characters primarily
> associated with another script. For example, to have any confidence
> that my Tai Tham font will be used for U+0E4A THAI CHARACTER MAI
> TRI or U+0E4B THAI CHARACTER MAI CHATTAWA placed on U+1A4B TAI THAM
> LETTER A, I have to set the Thai bit, even though I only have four Thai
> characters in my font. (The other two are punctuation.)

Indic scripts (other than Devanagari) also use a few characters from
another block. Specifically, two punctuation characters (from the
Devanagari block)

* U+0964 DEVANAGARI DANDA
* U+0965 DEVANAGARI DOUBLE DANDA

are expected to be used with the non-Devanagari Indic scripts. Looking at
the fonts Noto Sans Kannada and Noto Sans Tamil, the expected Unicode
range bit is set for Kannada or Tamil, but not for Devanagari, even though
those fonts contain U+0964 and U+0965.

Bobby

--
Bobby de Vos
bobby_devos at sil.org
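The mismatch Bobby describes can be checked mechanically with fontTools; a rough sketch, where the font path is a placeholder (bit 15 of ulUnicodeRange1 is the Devanagari range bit in the OpenType OS/2 specification):

    from fontTools.ttLib import TTFont

    font = TTFont("NotoSansTamil-Regular.ttf")  # placeholder path
    cmap = font.getBestCmap()
    os2 = font["OS/2"]

    has_dandas = 0x0964 in cmap and 0x0965 in cmap
    deva_bit = bool(os2.ulUnicodeRange1 & (1 << 15))  # bit 15 = Devanagari
    print(has_dandas, deva_bit)  # Bobby's observation corresponds to (True, False)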
From unicode at unicode.org Mon Feb 19 13:02:28 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 19 Feb 2018 20:02:28 +0100
Subject: metric for block coverage
In-Reply-To: <98a0b14a-8210-33d1-5fe8-01e1e9e06060@sil.org>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl> <20180218191036.44ffa6e0@JRWUBU2> <98a0b14a-8210-33d1-5fe8-01e1e9e06060@sil.org>
Message-ID:

This pair of punctuation marks should long since have been treated as
common punctuation (independently of their assigned names), i.e. assigned
the script property value "Common" and not "Deva". I don't see why they
could not be used in non-Indic scripts (they are not semantically
equivalent to Latin punctuation in their use). I can easily imagine valid
use cases even in Latin, Greek or Cyrillic, to properly translate poems,
religious texts or citations without transforming them into inaccurate
full stops, colons, semicolons, commas, or even exclamation marks (such a
transform is an interpretation by the translator); they would typically be
used with surrounding spaces, not glued to Latin/Greek/Cyrillic words.
Such use in Latin would be part of "extended Latin", but if these
punctuation marks are "Common", it is not so much an extension, and many
fonts could carry these two simple punctuation marks (which need no
"complex" OpenType feature).

Their presence in fonts designed for Indic scripts should be mandatory or
strongly recommended (just like the mapping of SPACE, NBSP, the dotted
circle or blank square, and a few others listed in the OpenType
development documentation). Given their "Common" script property, we would
not need to test their presence to compute script coverage: a renderer
could take the glyph from any other available font if an Indic font is
defective in not mapping them, just as a renderer is allowed to substitute
or synthesize a glyph for the dotted circle, the blank square, or any
whitespace variant when they are not mapped, using only the generic font
metrics (average widths and heights and the relative position of the
baselines in the em square) to scale the glyph or infer a suitable advance
width/height.

2018-02-19 15:58 GMT+01:00 Bobby de Vos via Unicode :
> […]
From unicode at unicode.org Mon Feb 19 14:41:01 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 19 Feb 2018 20:41:01 +0000
Subject: metric for block coverage
In-Reply-To:
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl> <20180218191036.44ffa6e0@JRWUBU2> <98a0b14a-8210-33d1-5fe8-01e1e9e06060@sil.org>
Message-ID: <20180219204101.7cabb833@JRWUBU2>

On Mon, 19 Feb 2018 20:02:28 +0100 Philippe Verdy via Unicode wrote:

> This pair of punctuation marks should long since have been treated as
> common punctuation (independently of their assigned names), i.e.
> assigned the script property value "Common" and not "Deva". I don't see
> why they could not be used in non-Indic scripts (they are not
> semantically equivalent to Latin punctuation in their use).

They currently both have sc=Common, so common sense prevails here.

> I can easily imagine valid use cases even in Latin, Greek or
> Cyrillic, to properly translate poems, religious texts or citations...

They have had scx ∋ Latn, but no longer. It may be because CLDR lacks
sa_Latn; perhaps someone will claim that the dandas and double dandas I've
seen in Sanskrit verses in Latin script are actually something else.

> Their presence in fonts designed for Indic scripts should be
> mandatory or strongly recommended...

They're generally not necessary for scripts in whose encoding Michael
Everson has had a significant hand. He defines script-specific dandas.
Tai Tham has two such pairs!

> ... (just like the mapping of SPACE, NBSP, the dotted circle or blank
> square, and a few others listed in the OpenType development
> documentation) ...

Microsoft Word and the USE document the use or recommendation of quite a
few such shapes and special letters. They make ulUnicodeRange rather
unreliable. Note, however, that ulUnicodeRange works by Unicode range,
not by script.

Richard.

From unicode at unicode.org Tue Feb 20 09:13:16 2018
From: unicode at unicode.org (Dreiheller, Albrecht via Unicode)
Date: Tue, 20 Feb 2018 15:13:16 +0000
Subject: AW: metric for block coverage
In-Reply-To: <98a0b14a-8210-33d1-5fe8-01e1e9e06060@sil.org>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl> <20180218191036.44ffa6e0@JRWUBU2> <98a0b14a-8210-33d1-5fe8-01e1e9e06060@sil.org>
Message-ID: <3E10480FE4510343914E4312AB46E74212D3A2F9@DEFTHW99EH5MSX.ww902.siemens.net>

Could someone please supply an example (web link ...) for usage of danda /
double danda in Tamil?
Thanks, Albrecht

From: Unicode [mailto:unicode-bounces at unicode.org] On behalf of Bobby de Vos via Unicode
Sent: Monday, 19
February 2018 15:58
To: unicode at unicode.org
Subject: Re: metric for block coverage

[…]

From unicode at unicode.org Tue Feb 20 13:40:40 2018
From: unicode at unicode.org (=?ISO-8859-1?Q?Christoph_P=E4per?= via Unicode)
Date: Tue, 20 Feb 2018 20:40:40 +0100
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net>
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net>
Message-ID: <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de>

Apparently the presidential decree prescribing the new Kazakh Latin
orthography and alphabet has been amended recently. The change completely
dumps the previous approach of digraphs with an apostrophe in second
position in favor of an acute diacritic mark above the base letters, for
the vowels ?/?, ?/?, ?/?, ?/?, ?/? and two consonants ?/? and ?/?, while
the other two become the commonly encountered H digraphs, Ch/ch and Sh/sh.

Rejoice.

https://tengrinews.kz/kazakhstan_news/novyiy-variant-kazahskogo-alfavita-latinitse-utverdil-338010
http://www.akorda.kz/kz/legal_acts/decrees/kazak-tili-alipbiin-kirillicadan-latyn-grafikasyna-koshiru-turaly-kazakstan-respublikasy-prezidentinin-2017-zhylgy-26-kazandagy-569-zharlygy
http://www.akorda.kz/upload/media/files/785986f23c47a407facbfa52b935fc85.doc

--
This message was sent from my Android device with K-9 Mail.

From unicode at unicode.org Tue Feb 20 13:56:29 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Tue, 20 Feb 2018 20:56:29 +0100
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de>
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de>
Message-ID:

Maybe we've been heard... Now this makes better sense. But I wonder why
they did not choose the caron over C and S, as in other Eastern European
languages; carons have been well supported for a long time and cause no
problems...

2018-02-20 20:40 GMT+01:00 Christoph Päper via Unicode :
> Apparently the presidential decree prescribing the new Kazakh Latin
> orthography and alphabet has been amended recently.
> […]
From unicode at unicode.org Tue Feb 20 14:12:36 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 20 Feb 2018 20:12:36 +0000
Subject: metric for block coverage
In-Reply-To: <3E10480FE4510343914E4312AB46E74212D3A2F9@DEFTHW99EH5MSX.ww902.siemens.net>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl> <20180218191036.44ffa6e0@JRWUBU2> <98a0b14a-8210-33d1-5fe8-01e1e9e06060@sil.org> <3E10480FE4510343914E4312AB46E74212D3A2F9@DEFTHW99EH5MSX.ww902.siemens.net>
Message-ID: <20180220201236.56435946@JRWUBU2>

On Tue, 20 Feb 2018 15:13:16 +0000 "Dreiheller, Albrecht via Unicode" wrote:

> Could someone please supply an example (web link ...) for usage of
> danda / double danda in Tamil? Thanks, Albrecht

Take your pick from http://www.prapatti.com/slokas/slokasbyname.html .
Do they meet your requirements, or do you perhaps want text in the Tamil
language, as opposed to PDFs of Sanskrit in Tamil script? I found the
likes of my example by googling for 'Tamil Shloka' without quotes.

Richard.

From unicode at unicode.org Tue Feb 20 14:23:12 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Tue, 20 Feb 2018 12:23:12 -0800
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To:
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de>
Message-ID:

We'll probably never know which factors influenced the decision, but
apparently some kind of message got through.

From unicode at unicode.org Tue Feb 20 14:26:25 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Tue, 20 Feb 2018 20:26:25 +0000
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de>
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de>
Message-ID:

Why on earth would they use Ch and Sh when 1) C isn't used by itself and
2) if you're using ?? you may as well use ?? ??.

Groan.

> On 20 Feb 2018, at 19:40, Christoph Päper via Unicode wrote:
>
> Apparently the presidential decree prescribing the new Kazakh Latin
> orthography and alphabet has been amended recently.
> […]
From unicode at unicode.org Tue Feb 20 14:40:27 2018
From: unicode at unicode.org (=?ISO-8859-1?Q?Christoph_P=E4per?= via Unicode)
Date: Tue, 20 Feb 2018 21:40:27 +0100
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To:
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de>
Message-ID:

Michael Everson:
> Why on earth would they use Ch and Sh when 1) C isn't used by itself
> and 2) if you're using ?? you may as well use ?? ??.

I would have argued in favor of digraphs for G' and N' as well if there
already was a decision for Ch and Sh.

Many European orthographies use the digraph Qu although the letter Q does
not occur otherwise.

From unicode at unicode.org Tue Feb 20 15:04:31 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Tue, 20 Feb 2018 21:04:31 +0000
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To:
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de>
Message-ID: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com>

Not using Turkic letters is daft, particularly as there was a widely-used
transliteration in Kazakhstan anyway. And even if not ? ?, they could have
used ? and ?.

There's no value in using digraphs in Kazakh, particularly when there
could be a one-to-one relation with the Cyrillic orthography, and I bet
you anything there will be ambiguity where some morpheme ends in -s and
the next begins with h-, where you have [sx] and not [ʃ].

Groan.

> On 20 Feb 2018, at 20:40, Christoph Päper wrote:
> […]

From unicode at unicode.org Tue Feb 20 15:38:57 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Tue, 20 Feb 2018 22:38:57 +0100
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com>
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com>
Message-ID:

As well, the Latin letter "c/C" is not used on its own, only in the
digraph "ch/Ch". And two distinct Cyrillic letters are mapped to Latin
"h/H", when one of them could have been mapped to Latin "x/X" with almost
the same letterform, preserving the orthography. The three versions of the
Cyrillic letter i are mapped to one and a half Latin letters
(distinguished only in lowercase, by the Turkic dotless i, but not
distinguished in uppercase, where there is no dot at all...). It should
have used at least two distinct letters (I with and without acute).
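To make the one-to-one argument concrete, a tiny transliteration sketch; the mapping is an illustrative subset rather than the official 2018 table, so treat individual pairs as assumptions. The sh digraph shows where strict reversibility breaks, which is exactly the s+h ambiguity raised earlier:

    # One-character-to-one-letter pairs stay losslessly reversible...
    TABLE = str.maketrans({
        "а": "a", "з": "z", "қ": "q",
        "ә": "á", "ө": "ó", "ү": "ú", "ң": "ń",
        "ш": "sh",  # ...but a digraph target is no longer one-to-one
    })

    def to_latin(text):
        return text.translate(TABLE)

    print(to_latin("қазақ"))  # -> qazaq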
Yes, it was possible to have a one-to-one mapping, allowing full
compatibility with existing Kazakh Cyrillic keyboard layouts (though not
necessarily with the additional US QWERTY layout, whose existing Latin
extension was only made for typing English and which could be dropped and
replaced by the Kazakh Cyrillic-to-Latin one-to-one transliteration). No
additional keystrokes would then be necessary, and no new hardware
keyboards would be needed for using the new orthography, if users just
look at the existing Cyrillic keycaps. New hardware could have Latin
keycaps in the same positions (with the positions of the Cyrillic letters
also inferred by one-to-one transliteration). All documents in Kazakh
could then be transliterated extremely simply and without loss. And new
documents would be instantly readable through the one-to-one
transliterators by those trained only in the Latin alphabet who want to
read historic documents.

But I fear the Kazakh government does not care much about keeping things
from the past, and history is not their problem. They will realize that
this is not so simple, because there are tons of historic documents still
in the Cyrillic orthography that must legally be kept unchanged (including
international treaties, long-term contracts, Kazakh court decisions, legal
personal records...): using a non-one-to-one transliteration will cause
legal problems even inside the country, and various administrative
problems with their citizens; or they will need to duplicate the official
databases to maintain the two orthographies, which will cost them
computing and storage, and cause problems in applications that search for
one form and won't find the other, unless those applications are corrected
(additional costs there too!).

2018-02-20 22:04 GMT+01:00 Michael Everson via Unicode :
> Not using Turkic letters is daft, particularly as there was a widely-used
> transliteration in Kazakhstan anyway.
> […]
From unicode at unicode.org Tue Feb 20 19:15:52 2018
From: unicode at unicode.org (Garth Wallace via Unicode)
Date: Wed, 21 Feb 2018 01:15:52 +0000
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com>
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com>
Message-ID:

AIUI "doesn't look like Turkish" was one of the design criteria, for
political reasons.

On Tue, Feb 20, 2018 at 1:07 PM Michael Everson via Unicode <unicode at unicode.org> wrote:
> […]

From unicode at unicode.org Tue Feb 20 19:19:55 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Wed, 21 Feb 2018 01:19:55 +0000
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To:
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com>
Message-ID: <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com>

Stalin would be very pleased. Divide and conquer.

> On 21 Feb 2018, at 01:15, Garth Wallace via Unicode wrote:
>
> AIUI "doesn't look like Turkish" was one of the design criteria, for
> political reasons.
> […]

From unicode at unicode.org Tue Feb 20 20:19:48 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 21 Feb 2018 03:19:48 +0100
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com>
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com>
Message-ID:

I call that more isolationism. If I can understand the political reasons
for not looking like Turkish, why then do they use the dotless i in this
last version (not distinguished, however, from the dotted i in the
capital)? This is not just a transliteration; it is also a proposal to
simplify the orthography at the same time by reducing the alphabet.
They'll have trouble with people's names and new cases of homonymy,
causing administrative difficulties for these people...
When other countries are now going in the opposite direction (accepting to
extend their alphabets by adding more letters or distinguishing variants,
and then treating orthographic simplifications not globally but via
selected lists of terms studied by their local linguistic authorities,
otherwise allowing or requiring simplifications only in selected
applications), here it is the reverse: create an alphabet that looks
neither like Turkish, nor like Russian, nor like other Eastern European
languages, and also not like their own national language, erasing a
significant part of its history. All the difficulties come at the same
time and will cost them a lot, because there is no room at all for
transition and adaptation!

2018-02-21 2:19 GMT+01:00 Michael Everson via Unicode :
> Stalin would be very pleased. Divide and conquer.
> […]

From unicode at unicode.org Tue Feb 20 20:24:26 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Tue, 20 Feb 2018 18:24:26 -0800
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com>
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com>
Message-ID:

A desire to choose their own writing system rather than have one imposed
upon them is understandable. If they also want it to be distinctive, who
could blame them?

From unicode at unicode.org Tue Feb 20 21:15:48 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Wed, 21 Feb 2018 03:15:48 +0000
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To:
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com>
Message-ID: <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com>

I absolutely disagree. There's a whole lot of related languages out there,
and the speakers share some things in common. Orthographic harmonization
between these languages can ONLY help any speaker of one to access
information in any of the others. That expands people's worlds.
That would be a good goal.

> On 21 Feb 2018, at 02:24, James Kass via Unicode wrote:
>
> A desire to choose their own writing system rather than have one
> imposed upon them is understandable. If they also want it to be
> distinctive, who could blame them?

From unicode at unicode.org Tue Feb 20 21:38:25 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 21 Feb 2018 04:38:25 +0100
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com>
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com>
Message-ID:

That's true; this area is a mix of cultures and ethnicities, some of them
in trouble or conflict, and creating additional linguistic problems, or
trying to block communication between them, will not help make the
situation more peaceful. So yes, "divide and conquer" is a probable
intent, but so is the desire to erase a part of the country's history. In
a few decades we will see what this attempt created: just more
complication and more costs for everyone. Experience has shown that people
maintain their culture independently of what their government does (see
what happened after 70 years of the USSR: religions and languages were not
forgotten at all), and such a reform will never succeed completely before
several centuries, and only after a long period of peace in which people
want to reconcile and then reinvent a common way of speaking to each other
(with less control by the government itself), by voluntary adoption rather
than by imposed law. Going to Latin, why not, but only with large
compatibility with the past, no added ambiguities, and a smooth transition
so that people can take the time to understand and adopt it.

2018-02-21 4:15 GMT+01:00 Michael Everson via Unicode :
> I absolutely disagree. There's a whole lot of related languages out
> there, and the speakers share some things in common.
> […]
The good news is that the thread title question is moot. From unicode at unicode.org Tue Feb 20 22:01:41 2018 From: unicode at unicode.org (Anshuman Pandey via Unicode) Date: Tue, 20 Feb 2018 22:01:41 -0600 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> Message-ID: <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> > On Feb 20, 2018, at 9:49 PM, James Kass via Unicode wrote: > > Michael Everson wrote: > >> Orthographic harmonization between these languages can ONLY help any >> speaker of one to access information in any of the others. That expands >> people?s worlds. That would be a good goal. > > Wouldn't dream of arguing with that. Expanding people's worlds is why > many of us have supported Unicode. Agreed! > The good news is that the thread title question is moot. Yes, now let?s please return to discussing emoji. All my best, Anshu From unicode at unicode.org Tue Feb 20 22:11:37 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 21 Feb 2018 05:11:37 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> Message-ID: 2018-02-21 5:01 GMT+01:00 Anshuman Pandey via Unicode : > > The good news is that the thread title question is moot. > > Yes, now let?s please return to discussing emoji. > Or NOT !!! This is NOT at all the same topic -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Feb 20 22:15:45 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 20 Feb 2018 20:15:45 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> Message-ID: Philippe, it was a jest. (Good one, too!) On Tue, Feb 20, 2018 at 8:11 PM, Philippe Verdy wrote: > 2018-02-21 5:01 GMT+01:00 Anshuman Pandey via Unicode : >> >> > The good news is that the thread title question is moot. >> >> Yes, now let?s please return to discussing emoji. > > > Or NOT !!! This is NOT at all the same topic From unicode at unicode.org Tue Feb 20 22:31:10 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 21 Feb 2018 05:31:10 +0100 Subject: 0027, 02BC, 2019, or a new character? 
In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> Message-ID:

Sorry, but such subtle English interpretations are not in my mind; don't assume everyone reads irony into everything posted here. These are just unneeded diversions causing trouble, and they do not make the thread easy to follow. 2018-02-21 5:15 GMT+01:00 James Kass : > Philippe, it was a jest. (Good one, too!) >

From unicode at unicode.org Tue Feb 20 23:29:23 2018 From: unicode at unicode.org (Phake Nick via Unicode) Date: Wed, 21 Feb 2018 05:29:23 +0000 Subject: IDC's versus Egyptian format controls In-Reply-To: References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2> <20180217004810.6238bf5c@JRWUBU2> <20180217094358.05292de8@JRWUBU2> Message-ID:

Actually, given that the IDS characters are confusing -- in that some users might expect them to show the composition, while in other situations users might expect them to be composited together -- would it be a good idea to encode a copy of the IDS characters explicitly for use as combining characters, while the original IDS characters can be left to show compositions?

From unicode at unicode.org Wed Feb 21 00:10:29 2018 From: unicode at unicode.org (Robert Wheelock via Unicode) Date: Wed, 21 Feb 2018 01:10:29 -0500 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> Message-ID:

The whole *ASCII apostrophe* thing for Qazaqi (Kazakh) could be avoided by using a Turkish-based orthography; this way, /h/ can still be distinguished from /x/, /u/ from /w/, ... !
? for front rounded vowels /? ? y/
? for laminal fricatives /? ?/, and for laminal affricates /t? d?/
? for /x ~ ?/, and for its voiced counterpart /? ~ ?/
? The Turkish dull-I letter for the phoneme /? ~ ? ~ ?/
? for the *eng* sound /?/
... .
So, a Turkish-based ASDF keyboard layout would do fine for typing in Qazaqi using our Latin/Roman alphabet. On Tue, Feb 20, 2018 at 11:31 PM, Philippe Verdy via Unicode < unicode at unicode.org> wrote: > Sorry, but such subtle English interpretations are not in my mind; don't > assume everyone reads irony into everything posted here. These are just > unneeded diversions causing trouble, and they do not make the thread easy > to follow. > > 2018-02-21 5:15 GMT+01:00 James Kass : > >> Philippe, it was a jest. (Good one, too!) >> >

From unicode at unicode.org Wed Feb 21 00:15:55 2018 From: unicode at unicode.org (Robert Wheelock via Unicode) Date: Wed, 21 Feb 2018 01:15:55 -0500 Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> Message-ID:

CORRECTION: The Turkish dull-I letter for the sound /? ~ ? ~ ?/ DOESN'T HAVE A DOT ATOP IT!!!! It's simply written as , while the normal I letter for the sound /? ~ i:/ DOES HAVE A DOT ATOP THAT, and is written as . On Wed, Feb 21, 2018 at 1:10 AM, Robert Wheelock wrote: > The whole *ASCII apostrophe* thing for Qazaqi (Kazakh) could be avoided > by using a Turkish-based orthography; this way, /h/ can still be > distinguished from /x/, /u/ from /w/, ... ! > > ? for front rounded vowels /? ? y/ > ? for laminal fricatives /? ?/, and for laminal affricates /t? d?/ > ? for /x ~ ?/, and for its voiced counterpart /? ~ ?/ > ? The Turkish dull-I letter for the phoneme /? ~ ? ~ ?/ > ? for the *eng* sound /?/ > ... . > > So, a Turkish-based ASDF keyboard layout would do fine for typing in > Qazaqi using our Latin/Roman alphabet. > > > On Tue, Feb 20, 2018 at 11:31 PM, Philippe Verdy via Unicode < > unicode at unicode.org> wrote: > >> Sorry, but such subtle English interpretations are not in my mind; don't >> assume everyone reads irony into everything posted here. These are just >> unneeded diversions causing trouble, and they do not make the thread easy >> to follow. >> >> 2018-02-21 5:15 GMT+01:00 James Kass : >> >>> Philippe, it was a jest. (Good one, too!) >>> >> >

From unicode at unicode.org Wed Feb 21 08:51:08 2018 From: unicode at unicode.org (Khaled Hosny via Unicode) Date: Wed, 21 Feb 2018 16:51:08 +0200 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> Message-ID: <20180221145108.GC1439@macbook.localdomain>

Now if he had used an emoji that shows the mode of the text it would have been a lot more obvious, but we already established that the world does not need emoji. Regards, Khaled On Wed, Feb 21, 2018 at 05:31:10AM +0100, Philippe Verdy via Unicode wrote: > Sorry, but such subtle English interpretations are not in my mind; don't > assume everyone reads irony into everything posted here. These are just > unneeded diversions causing trouble, and they do not make the thread easy > to follow. > > 2018-02-21 5:15 GMT+01:00 James Kass : > > > Philippe, it was a jest. (Good one, too!) > >

From unicode at unicode.org Wed Feb 21 09:28:14 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 21 Feb 2018 16:28:14 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <20180221145108.GC1439@macbook.localdomain> References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> Message-ID:

2018-02-21 15:51 GMT+01:00 Khaled Hosny : > Now if he had used an emoji that shows the mode of the text it would > have been a lot more obvious, but we already established that the world > does not need emoji.
> No, I don't need emojis. An emoji can mean anything or nothing; they are just unnecessary and annoying eye-catching distractions. I even hope that there will be a setting in all browsers, OSes, mobiles, and apps to refuse any colorful rendering, and just render them as monochromatic symbols. In summary: COMPLETELY DISABLE the colorful extensions of OpenType made for them.

From unicode at unicode.org Wed Feb 21 09:23:23 2018 From: unicode at unicode.org (Jeb Eldridge via Unicode) Date: Wed, 21 Feb 2018 10:23:23 -0500 Subject: Suggestions? In-Reply-To: <5a8d8cfd.088a1f0a.a94be.30c3@mx.google.com> References: <5a8d8cfd.088a1f0a.a94be.30c3@mx.google.com> Message-ID: <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com>

Where can I post suggestions and feedback for Unicode?

From unicode at unicode.org Wed Feb 21 11:05:01 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 21 Feb 2018 09:05:01 -0800 Subject: Suggestions? In-Reply-To: <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com> References: <5a8d8cfd.088a1f0a.a94be.30c3@mx.google.com> <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com> Message-ID: An HTML attachment was scrubbed... URL:

From unicode at unicode.org Wed Feb 21 11:10:13 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 21 Feb 2018 09:10:13 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> Message-ID: An HTML attachment was scrubbed... URL:

From unicode at unicode.org Wed Feb 21 11:23:28 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 21 Feb 2018 18:23:28 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> Message-ID:

2018-02-21 18:10 GMT+01:00 Asmus Freytag via Unicode : > Feeling a bit curmudgeony, are we, today? :-) > I don't know what it means; I've never heard that word, and it's not in dictionaries. Probably local US jargon, or a typo in your strange word.

From unicode at unicode.org Wed Feb 21 11:36:54 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 21 Feb 2018 18:36:54 +0100 Subject: Suggestions? In-Reply-To: <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com> References: <5a8d8cfd.088a1f0a.a94be.30c3@mx.google.com> <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com> Message-ID:

The Unicode website has a section for feedback in its menu, but in separate projects for TUS and for CLDR. Feedback is also requested for every proposed amendment to the standard, its annexes, and its data. First search for the relevant topic on the website, then look at the sidebar if there's no specific feedback link in the main page content. Feedback or proposals are submitted via an online form, and will then be forwarded by email to the interested subcommittees and any subscribers. Data submissions to CLDR are made through the Survey Tool, when it is open.
For reference implementations that have an open-source repository, feedback is submitted via the links given in the repository itself. Basically, you need to look for the most relevant topic, and then use the appropriate link so that your feedback can be sorted and sent to the correct people. There's also a feedback channel for questions related to Unicode membership, or for legal requests. There's also a general feedback link, but don't expect an immediate response: it may take time for it to reach the right people, and unsorted/unqualified feedback takes time to be classified and extracted from the fog of incoming spam and irrelevant submissions. If you don't know where to post, this mailing list can guide you, but it is not the place to submit a formal request. Various people (including me) may reply to you, and any reply you receive from this list is not officially endorsed by Unicode; this is more a "community" list used to interconnect interested people, discuss how to improve proposals, get guidance before submitting a qualified formal request, or ask for peer review before submitting it. 2018-02-21 16:23 GMT+01:00 Jeb Eldridge via Unicode : > Where can I post suggestions and feedback for Unicode?

From unicode at unicode.org Wed Feb 21 11:39:03 2018 From: unicode at unicode.org (John W Kennedy via Unicode) Date: Wed, 21 Feb 2018 12:39:03 -0500 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> Message-ID: <3557C048-9503-43A9-96E7-0E03A0DBD06D@gmail.com>

"Curmudgeonly" is a perfectly good English word attested back to 1590. -- > On Feb 21, 2018, at 12:23 PM, Philippe Verdy via Unicode wrote: > > 2018-02-21 18:10 GMT+01:00 Asmus Freytag via Unicode : >> Feeling a bit curmudgeony, are we, today? :-) > I don't know what it means; I've never heard that word, and it's not in dictionaries. Probably local US jargon, or a typo in your strange word. >

From unicode at unicode.org Wed Feb 21 11:49:29 2018 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Wed, 21 Feb 2018 09:49:29 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> Message-ID: <7ec7a119-dd7a-7907-63ae-e868c2f328bc@ix.netcom.com>

On 2/21/2018 9:23 AM, Philippe Verdy wrote: > 2018-02-21 18:10 GMT+01:00 Asmus Freytag via Unicode > >: > > Feeling a bit curmudgeony, are we, today? :-) > > I don't know what it means; I've never heard that word, and it's not in > dictionaries. Probably local US jargon, or a typo in your strange word. > Sorry for the typo. Dropped an "l". :-[ curmudgeonly, from curmudgeon+ly. The word is attested from the late 1500s in the forms /curmudgeon/ and /curmudgen/, and during the 17th century in numerous spelling variants, including /cormogeon, cormogion, cormoggian, cormudgeon, curmudgion, curmuggion, curmudgin, curr-mudgin, curre-megient/.
Don't think the US existed in the late 1500s... A./

From unicode at unicode.org Wed Feb 21 12:11:58 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 21 Feb 2018 10:11:58 -0800 Subject: Suggestions? In-Reply-To: References: <5a8d8cfd.088a1f0a.a94be.30c3@mx.google.com> <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com> Message-ID:

http://www.unicode.org/faq/faq_on_faqs.html#34

From unicode at unicode.org Wed Feb 21 13:45:32 2018 From: unicode at unicode.org (David Starner via Unicode) Date: Wed, 21 Feb 2018 19:45:32 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <3557C048-9503-43A9-96E7-0E03A0DBD06D@gmail.com> References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> <3557C048-9503-43A9-96E7-0E03A0DBD06D@gmail.com> Message-ID:

On Wed, Feb 21, 2018 at 9:40 AM John W Kennedy via Unicode < unicode at unicode.org> wrote: > "Curmudgeonly" is a perfectly good English word attested back to 1590. > Curmudgeony may be identified as misspelled by Google, but it's got a bit of usage dating back a hundred years. Wiktionary's entry at [[-y]] says "This suffix is still very productive and can be added to almost any word.", and that matches my feeling that this is a perfectly good word, a perfectly wordy word, even if it wouldn't be used in formal English.

From unicode at unicode.org Wed Feb 21 13:54:31 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 21 Feb 2018 19:54:31 +0000 Subject: Coloured Characters (was: 0027, 02BC, 2019, or a new character?) In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> Message-ID: <20180221195431.46d1e37c@JRWUBU2>

On Wed, 21 Feb 2018 16:28:14 +0100 Philippe Verdy via Unicode wrote: > I even hope that there will be a setting in all browsers, OSes, > mobiles, and apps to refuse any colorful rendering, and just render > them as monochromatic symbols. In summary: COMPLETELY DISABLE the > colorful extensions of OpenType made for them. But hieroglyphs look so much better in colour! What's more, they were meant to be read in colour. If you want monochrome, you should make do with hieratic! On a more practical level, I've made a font that colours subscript coda consonants differently to subscript onset consonants for the purpose of proof-reading Northern Thai text. It was a pleasant surprise to see colour-coded suggested spelling corrections when I used it on Firefox. I had installed the spell-checker for LibreOffice, which currently lacks the colour capability, but Firefox helped itself to it. So you may not like emoji, but the colour extensions have perfectly good uses. Richard.

From unicode at unicode.org Wed Feb 21 15:01:10 2018 From: unicode at unicode.org (David Starner via Unicode) Date: Wed, 21 Feb 2018 21:01:10 +0000 Subject: Suggestions?
In-Reply-To: <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com> References: <5a8d8cfd.088a1f0a.a94be.30c3@mx.google.com> <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com> Message-ID:

On Wed, Feb 21, 2018 at 7:55 AM Jeb Eldridge via Unicode < unicode at unicode.org> wrote: > Where can I post suggestions and feedback for Unicode? > Here is as good as any place. There are specific places for a few specific things, but if you do have something that's likely to get changed, you'll likely need the help of someone here to get through the process. It is a quarter-century-old technical standard embedded in most electronics, so I would temper any expectations for major changes; it works the way it works because that's the way previous versions worked, and nobody is interested in the trouble changing them would involve.

From unicode at unicode.org Wed Feb 21 15:04:32 2018 From: unicode at unicode.org (=?ISO-8859-1?Q?Christoph_P=E4per?= via Unicode) Date: Wed, 21 Feb 2018 22:04:32 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> Message-ID: <89E51B32-CDDB-41CD-BFD4-3BC49664749C@crissov.de>

Philippe Verdy: > > I even hope that there will be a setting in all browsers, OSes, mobiles, > and apps to refuse any colorful rendering, and just render them as > monochromatic symbols. In summary: COMPLETELY DISABLE the colorful > extensions of OpenType made for them. See and linked issues for CSS.

From unicode at unicode.org Wed Feb 21 17:04:34 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 22 Feb 2018 00:04:34 +0100 Subject: Coloured Characters (was: 0027, 02BC, 2019, or a new character?) In-Reply-To: <20180221195431.46d1e37c@JRWUBU2> References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> <20180221195431.46d1e37c@JRWUBU2> Message-ID:

I'm not speaking about hieroglyphs, even if they are perfectly readable in monochrome on monuments. I was just saying that colorful **emojis** are a nuisance, and colors in them do not add any semantic value except making them more visible; in fact they look spammy and needlessly distracting. (Flags are a possible exception, though even there the disambiguation should make the country name readable and accessible; skin tones were added only to avoid a never-ending battle over ethnic biases in implementations, and most of the time they are not meaningful at all!) Given that emojis are extremely ambiguous and unreadable, can mean practically anything, and look very different across implementations, their colorful aspect is also not semantically useful. By contrast, color in Arabic or hieroglyphic texts is a useful form of emphasis and is sometimes semantically significant (some rare old scripts also used distinctive colors): this case is similar to the encoded semantic variants for mathematics symbols. But here again color causes a severe problem of accessibility and rendering on various surfaces (e.g. is the paper/screen white or black?
If you cannot see the encoded color correctly and it is interpreted verbatim, the text will not be readable at all; what is really needed is a set of symbolic colors -- normal color, color variant 1, color variant 2 -- and Unicode could perfectly well encode these as combining diacritics!) 2018-02-21 20:54 GMT+01:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Wed, 21 Feb 2018 16:28:14 +0100 > Philippe Verdy via Unicode wrote: > > > I even hope that there will be a setting in all browsers, OSes, > > mobiles, and apps to refuse any colorful rendering, and just render > > them as monochromatic symbols. In summary: COMPLETELY DISABLE the > > colorful extensions of OpenType made for them. > > But hieroglyphs look so much better in colour! What's more, they were > meant to be read in colour. If you want monochrome, you should make do > with hieratic! > > On a more practical level, I've made a font that colours subscript coda > consonants differently to subscript onset consonants for the purpose of > proof-reading Northern Thai text. It was a pleasant surprise to see > colour-coded suggested spelling corrections when I used it on Firefox. > I had installed the spell-checker for LibreOffice, which currently > lacks the colour capability, but Firefox helped itself to it. > > So you may not like emoji, but the colour extensions have perfectly > good uses. > > Richard. >

From unicode at unicode.org Wed Feb 21 17:09:11 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 21 Feb 2018 15:09:11 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> <3557C048-9503-43A9-96E7-0E03A0DBD06D@gmail.com> Message-ID: <60727f9e-15ab-c8ae-e35b-38c79d3ec3d8@ix.netcom.com> An HTML attachment was scrubbed... URL:

From unicode at unicode.org Wed Feb 21 19:13:25 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 22 Feb 2018 01:13:25 +0000 Subject: Coloured Characters (was: 0027, 02BC, 2019, or a new character?) In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> <20180221195431.46d1e37c@JRWUBU2> Message-ID: <20180222011325.5f8e6c53@JRWUBU2>

On Thu, 22 Feb 2018 00:04:34 +0100 Philippe Verdy via Unicode wrote: > By contrast, color in Arabic or hieroglyphic texts is a useful form of > emphasis and is sometimes semantically significant (some rare old > scripts also used distinctive colors): this case is similar to the > encoded semantic variants for mathematics symbols. But here again > color causes a severe problem of accessibility and rendering on > various surfaces (e.g. is the paper/screen white or black?
> If you cannot see the encoded color correctly and it is interpreted > verbatim, the text will not be readable at all; what is really needed > is a set of symbolic colors -- normal color, color variant 1, color > variant 2 -- and Unicode could perfectly well encode these as combining > diacritics!) In my case, I just used the colours 'foreground' and 'red'. They work well on both light and dark backgrounds. The difference wasn't so easy to see when the foreground was a different shade of red! Heraldry has the same problem when objects are depicted in their natural colours. (The colour term then used in English heraldry is 'proper'.) Microsoft has a scheme of palettes, but the design is that the application chooses the palette from a predefined list. The font can nominate palettes for light and dark backgrounds; otherwise the selection protocol is completely up to the application. 'Foreground' and 'background' are the only externally defined colours. There's no ability to explicitly choose, say 'text stroked sable and dotted gules'. Instead, it's 'text stroked sable and dotted proper', with a choice of palettes to define 'proper'. Richard.

From unicode at unicode.org Thu Feb 22 01:01:33 2018 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Thu, 22 Feb 2018 16:01:33 +0900 Subject: IDC's versus Egyptian format controls In-Reply-To: References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2> Message-ID: <67789106-f63c-e68a-30de-48d3cd2e01c4@it.aoyama.ac.jp>

On 2018/02/17 08:25, James Kass via Unicode wrote: > Some people studying Han characters use the IDCs to illustrate the > ideographs and their components for various purposes. Well, as far as I understand, this was their original (and is still their main) purpose. > For example:
> U-0002A8B8 ?? ???
> U-0002A8B9 ?? ???
> U-0002A8BA ?? ???
> U-0002A8BB ?? ???
> U-0002A8BC ?? ???
> U-0002A8BD ?? ???
> U-0002A8BE ?? ???
> U-0002A8BF ?? ???
> U-0002A8C0 ?? ???
> U-0002A8C1 ?? ???
Is it only me, or did you get some of this data wrong? To me, it definitely looks like U-0002A8BC ?? ??? rather than U-0002A8BC ?? ???, and U-0002A8BF ?? ??? rather than U-0002A8BF ?? ???, and changes seem to be needed for all the others, too. (The descriptions seem to be four lines later than the characters where they actually belong.) > It would probably be disconcerting if the display of those > sequences changed into their respective characters overnight. Yes indeed. Regards, Martin.

From unicode at unicode.org Thu Feb 22 04:37:39 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 22 Feb 2018 02:37:39 -0800 Subject: IDC's versus Egyptian format controls In-Reply-To: <67789106-f63c-e68a-30de-48d3cd2e01c4@it.aoyama.ac.jp> References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2> <67789106-f63c-e68a-30de-48d3cd2e01c4@it.aoyama.ac.jp> Message-ID:

Martin J. Dürst wrote: > Is it only me, or did you get some of this data wrong? Yes, sorry. There's an offset. I copy/pasted data from an archive which apparently predates the formal release of Ext C, and IIRC there was some shifting. Unfortunately the font I used to view the data matches the data, and so is also incorrect.
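A quick way to verify the standing of the IDCs underlying this thread: the twelve Ideographic Description Characters U+2FF0..U+2FFB are ordinary visible symbols (general category So) with no canonical decomposition, so normalization can never turn an IDS into the ideograph it describes. A minimal Python sketch, assuming only the standard unicodedata module:

    import unicodedata

    # The twelve Ideographic Description Characters, U+2FF0..U+2FFB.
    for cp in range(0x2FF0, 0x2FFC):
        ch = chr(cp)
        print("U+%04X %s category=%s decomposition=%r"
              % (cp, unicodedata.name(ch), unicodedata.category(ch),
                 unicodedata.decomposition(ch)))

    # Every line prints category 'So' and an empty decomposition string,
    # so NFC/NFD leave IDS strings unchanged.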
From unicode at unicode.org Thu Feb 22 06:21:46 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 22 Feb 2018 12:21:46 +0000 Subject: Coloured Characters In-Reply-To: <26696985.14288.1519296923530.JavaMail.defaultUser@defaultHost> References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> <20180221195431.46d1e37c@JRWUBU2> <20180222011325.5f8e6c53@JRWUBU2> <26696985.14288.1519296923530.JavaMail.defaultUser@defaultHost> Message-ID: <20180222122146.70f1b344@JRWUBU2>

On Thu, 22 Feb 2018 10:55:23 +0000 (GMT) William_J_G Overington wrote: > Richard Wordingham wrote: > > > 'Foreground' and 'background' are the only externally defined > > colours. There's no ability to explicitly choose, say 'text stroked > > sable and dotted gules'. Instead, it's 'text stroked sable and > > dotted proper', with a choice of palettes to define 'proper'. > External selection of decoration colours would theoretically be > possible; I do not know how difficult this would be to implement. The problem lies in changing existing interfaces. I can only speak with any real knowledge for the OpenType COLR/CPAL method. The change would be a major pain in programming languages with obligatory (even if implicit) typing. At present, foreground and background need to be specified (if only by default) and passed into the painting routines. You now want to expand the foreground argument into a list of colours - or possibly a callback routine. The next issue is what is to happen when the list provided is too short. Without suitable handling, this may cause problems with fonts that already work in applications that at one interface level know nothing about colour fonts. For example, the HTML code that I have been using with my font knows nothing about colour fonts as such. To get colour with my web page, I just select a coloured font. The final issue that springs to mind is that the COLR table of OpenType allows for 65,535 different colours in glyphs; 0xFFFF is the only reserved colour ID. It represents the foreground colour. If there is only one palette in the font, 0xFFFE can be a legitimate user-defined colour ID. I wouldn't be surprised if such an assignment survived the transition from a proof-of-principle font to a released font. A less painful method for interfaces might be the selection of palettes by name. However, there are rather more possible colour combinations than can be accommodated in an sfnt name table, so an approximation algorithm would be required. It would also make the CPAL tables larger and much more difficult to generate. There are also 30 unassigned bits left in the palette's type attribute. Of course, Unicode is not constrained by what is currently available, and as an entity is interested at most in what is feasible rather than the precise mechanisms. Several full members, though, will care about precise mechanisms. Richard.
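For anyone who wants to poke at the structures Richard describes, the COLR/CPAL data can be dumped in a few lines of Python with fontTools. This is a minimal inspection sketch, not a rendering recipe: the font path is a hypothetical placeholder, and it assumes a font carrying COLR version 0 plus a CPAL table, where colour ID 0xFFFF means "use the current foreground colour" as noted above:

    from fontTools.ttLib import TTFont

    font = TTFont("SomeColorFont.ttf")  # hypothetical path
    cpal, colr = font["CPAL"], font["COLR"]

    print("CPAL version", cpal.version, "-",
          cpal.numPaletteEntries, "entries per palette")
    for i, palette in enumerate(cpal.palettes):
        # Each palette entry is an 8-bit-per-channel RGBA colour record.
        print("palette", i,
              ["#%02X%02X%02X%02X" % (c.red, c.green, c.blue, c.alpha)
               for c in palette])

    # COLR v0 maps a base glyph to ordered layers: glyph name + CPAL index.
    for base, layers in sorted(colr.ColorLayers.items()):
        for layer in layers:
            colour = "foreground" if layer.colorID == 0xFFFF else layer.colorID
            print(base, "->", layer.name, "in colour", colour)

Nothing here chooses among multiple palettes; as Richard says, that protocol is left to the application.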
From unicode at unicode.org Thu Feb 22 08:27:52 2018 From: unicode at unicode.org (Dreiheller, Albrecht via Unicode) Date: Thu, 22 Feb 2018 14:27:52 +0000 Subject: Re: metric for block coverage In-Reply-To: <20180220201236.56435946@JRWUBU2> References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl> <20180218191036.44ffa6e0@JRWUBU2> <98a0b14a-8210-33d1-5fe8-01e1e9e06060@sil.org> <3E10480FE4510343914E4312AB46E74212D3A2F9@DEFTHW99EH5MSX.ww902.siemens.net> <20180220201236.56435946@JRWUBU2> Message-ID: <3E10480FE4510343914E4312AB46E74212D3AABE@DEFTHW99EH5MSX.ww902.siemens.net>

Thanks a lot. If I understand it right, these are examples in the Sanskrit language using the Tamil script? More precisely, my question is whether there are examples in (today's) Tamil language using Danda or Double Danda. I tried to detect these characters in Tamil's Wikipedia texts, but I didn't find any. Albrecht -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham via Unicode Sent: Tuesday, 20 February 2018 21:13 To: unicode at unicode.org Subject: Re: metric for block coverage On Tue, 20 Feb 2018 15:13:16 +0000 "Dreiheller, Albrecht via Unicode" wrote: > Could someone please supply an example (web link ...) for usage of > danda / double danda in Tamil? Thanks, Albrecht Take your pick from http://www.prapatti.com/slokas/slokasbyname.html . Do they meet your requirements, or do you perhaps want text in the Tamil language as opposed to PDFs of Sanskrit in Tamil script? I found the likes of my example by googling for 'Tamil Shloka' without quotes. Richard.

From unicode at unicode.org Thu Feb 22 04:55:23 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Thu, 22 Feb 2018 10:55:23 +0000 (GMT) Subject: Coloured Characters In-Reply-To: <20180222011325.5f8e6c53@JRWUBU2> References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> <20180221195431.46d1e37c@JRWUBU2> <20180222011325.5f8e6c53@JRWUBU2> Message-ID: <26696985.14288.1519296923530.JavaMail.defaultUser@defaultHost>

Richard Wordingham wrote: > 'Foreground' and 'background' are the only externally defined colours. There's no ability to explicitly choose, say 'text stroked sable and dotted gules'. Instead, it's 'text stroked sable and dotted proper', with a choice of palettes to define 'proper'. External selection of decoration colours would theoretically be possible; I do not know how difficult this would be to implement. I remember posting about that somewhere some years ago but I cannot find it at the moment. The following thread now mentions that possibility and also has, from 2014, an idea of how to have shading from one colour to another. https://forum.high-logic.com/viewtopic.php?f=37&t=5024 In that thread, on 7 June 2014, I wrote as follows. quote The standardization process has a rule that if someone (individual or company) puts forward a proposal for standardization, then that person has to agree to provide a working demonstration. I put forward some ideas for how to extend the COLR/CPAL model so as to provide colour shading of glyphs as well as the existing solid colour. Yet I could not formally propose them for standardization as I do not have the facilities to provide a working demonstration.
end quote So the ideas are there and maybe they could be implemented, though alas I cannot implement them myself. William Overington Thursday 22 February 2018

From unicode at unicode.org Thu Feb 22 13:39:33 2018 From: unicode at unicode.org (David Corbett via Unicode) Date: Thu, 22 Feb 2018 14:39:33 -0500 Subject: Bidi edge cases in Hangul and Indic Message-ID:

Although the Unicode Bidirectional Algorithm clearly defines how to reorder characters in memory, I don't understand precisely what it means to display one character after another once they've been reordered; specifically, when bidi reordering changes the number of user-perceived characters. For example, after a right-to-left override, the Hangul string 보기 ("bogi") becomes 기보 ("gibo") in visual order. However, its NFD form is reordered by jamo instead of by syllable; that is, it looks like "igob". I don't think it is the intent of the algorithm that canonically equivalent strings display so very differently, but I can't find any explicit guidance. What should a UBA-conformant renderer do? Another unclear case is Indic clusters. ???? is unambiguously two clusters, but after an RLO, and after following rule L3 to put combining marks after their bases, it looks like one cluster: ????. If Devanagari were actually written right-to-left, I would expect it to stay as two clusters: ?????. Does the UBA prefer one rendering over the other, or is this outside its scope?

From unicode at unicode.org Thu Feb 22 17:32:45 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Thu, 22 Feb 2018 15:32:45 -0800 Subject: Bidi edge cases in Hangul and Indic In-Reply-To: References: Message-ID:

On 2/22/2018 11:39 AM, David Corbett via Unicode wrote: > For example, after a right-to-left override, the Hangul string 보기 > ("bogi") becomes 기보 ("gibo") in visual order. However, its NFD form is > reordered by jamo instead of by syllable; that is, it looks like "igob". Nope. *tilt* The UBA reorders the display order in layout -- not the underlying string. "bogi" is the sequence <1107, 1169, 1100, 1175> in NFD, or <BCF4, AE30> in NFC. Because of canonical equivalence, for display of the NFD string, the sequence <1107,1169> needs to be mapped onto the same *glyph* as BCF4, and the sequence <1100,1175> onto the same *glyph* as AE30. If you override the normal left-to-right ordering with bidi override controls, then the layout order is reversed, but what is actually laid out is those two glyphs.
So you just reverse the order of the two syllables for display, in either case. You could force display of "igob", but only if you had inserted some character in between the conjoining jamos that was preventing their equivalence to the syllables, anyway. > I don't think it is the intent of the algorithm that canonically > equivalent strings display so very differently, but I can't find any > explicit guidance. What should a UBA-conformant renderer do? The right thing. ;-) --Ken

From unicode at unicode.org Thu Feb 22 21:21:26 2018 From: unicode at unicode.org (David Corbett via Unicode) Date: Thu, 22 Feb 2018 22:21:26 -0500 Subject: Bidi edge cases in Hangul and Indic In-Reply-To: References: Message-ID:

On Thu, Feb 22, 2018 at 6:32 PM, Ken Whistler wrote: > > If you override the normal left-to-right ordering with bidi override > controls, then the layout order is reversed, but what is actually laid out > is those two glyphs. So you just reverse the order of the two syllables for > display, in either case. > My confusion stems from Unicode's online bidi utility. Compare https://unicode.org/cldr/utility/bidi.jsp?a=%E2%80%AE%EB%B3%B4%EA%B8%B0 (NFC) to https://unicode.org/cldr/utility/bidi.jsp?a=%E2%80%AE%E1%84%87%E1%85%A9%E1%84%80%E1%85%B5 (NFD). Concatenating each one's characters in reordered display position order produces canonically different results. Here is a more practical example. A sequence of an emoji modifier base and an emoji modifier in an RTL run will be display-reordered such that the modifier is to the left of the base. Clearly, the right thing is to not reorder them, because they should ligate to form a single glyph. Contrast this with "fl" in an RTL run, which will be display-reordered to "lf": it would be wrong to apply the previous rationale here just because "fl" may have a single glyph. It sounds like the UBA doesn't specify how to reorder the glyphs of the characters within a level run. That's about what I expected. I was just worried it might require an easily implemented but wrong order, so thanks for the reassurance.

From unicode at unicode.org Thu Feb 22 21:36:08 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Thu, 22 Feb 2018 19:36:08 -0800 Subject: Bidi edge cases in Hangul and Indic In-Reply-To: References: Message-ID: <9d47f560-e447-f121-9505-cc4f48e0171a@att.net>

David, On 2/22/2018 7:21 PM, David Corbett via Unicode wrote: > My confusion stems from Unicode's online bidi utility. That bidi utility has known defects in it. It is not yet conformant with changes to UBA 6.3, let alone later changes to UBA. And the mapping of memory position to display position in that utility does not take into account the complex mapping that has to occur in the layout engines and fonts in real applications. --Ken

From unicode at unicode.org Thu Feb 22 21:52:33 2018 From: unicode at unicode.org (via Unicode) Date: Fri, 23 Feb 2018 11:54:33 +0800 Subject: Suggestions? In-Reply-To: References: <5a8d8cfd.088a1f0a.a94be.30c3@mx.google.com> <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com> Message-ID:

On 22.02.2018 05:01, David Starner via Unicode wrote: > On Wed, Feb 21, 2018 at 7:55 AM Jeb Eldridge via Unicode > wrote: > >> Where can I post suggestions and feedback for Unicode? > > Here is as good as any place. There are specific places for a few > specific things, but likely if you do have something that's likely to > get changed, you'll need the help of someone here to get through the > process. It is a quarter-century-old technical standard embedded in > most electronics, so I would temper any expectations for major > changes; it works the way it works because that's the way previous > versions worked, and nobody is interested in the trouble changing them > would involve. > Yes and no. This list is for informal discussion, so someone unsure about things may start here, but posting on this list does not count as feedback or suggestions to Unicode. So by all means post some of your ideas here and learn more. Regards, John Knightley

From unicode at unicode.org Fri Feb 23 01:17:31 2018 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Fri, 23 Feb 2018 16:17:31 +0900 Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> Message-ID: <007d12b4-79fa-a48a-a7de-730f3be2ece6@it.aoyama.ac.jp>

On 2018/02/21 12:15, Michael Everson via Unicode wrote: > I absolutely disagree. There's a whole lot of related languages out there, and the speakers share some things in common. Orthographic harmonization between these languages can ONLY help any speaker of one to access information in any of the others. That expands people's worlds. That would be a good goal. It's definitely a good goal. But it's not rocket science to learn the different orthographies. If the languages are similar, then different orthographies are just a minor nuisance. As an example, German and Dutch also have different orthographies, but that's really a very minor issue when learning one language from the other even though these languages are very close. Regards, Martin.

From unicode at unicode.org Fri Feb 23 12:15:32 2018 From: unicode at unicode.org (Norbert Lindenberg via Unicode) Date: Fri, 23 Feb 2018 10:15:32 -0800 Subject: metric for block coverage In-Reply-To: <20180218112610.GA18088@macbook.localdomain> References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218112610.GA18088@macbook.localdomain> Message-ID: <5FA91BEC-C649-462C-A999-A9D7BDEACA88@lindenbergsoftware.com>

> On Feb 18, 2018, at 3:26 , Khaled Hosny via Unicode wrote: > > On Sun, Feb 18, 2018 at 02:14:46AM -0800, James Kass via Unicode wrote: >> Adam Borowski wrote, >> >>> I'm looking for a way to determine a font's coverage of available scripts. >>> It's probably reasonable to do this per Unicode block. Also, it's a safe >>> assumption that a font which doesn't know a codepoint can do no complex >>> shaping of such a glyph, thus looking at just codepoints should be adequate >>> for our purposes. >> >> You probably already know that basic script coverage information is >> stored internally in OpenType fonts in the OS/2 table. >> >> https://docs.microsoft.com/en-us/typography/opentype/spec/os2 >> >> Parsing the bits in the "ulUnicodeRange..." entries may be the >> simplest way to get basic script coverage info. > > Though this might not be very reliable since OpenType does not have a > definition of what it means for a Unicode block to be supported; some > font authoring tools use a percentage, others use the presence of any > characters in the range, and fonts might even provide incorrect data for > any reason. > > However, I don't think script or block coverage is that useful, what > users are usually interested in is the language coverage. > > Regards, > Khaled All true. In addition, ulUnicodeRange ran out of bits around Unicode 5.1, so scripts/blocks added to Unicode after that, such as Javanese, Tangut, or Adlam, cannot be represented. Norbert

From unicode at unicode.org Tue Feb 27 09:36:55 2018 From: unicode at unicode.org (Peter Constable via Unicode) Date: Tue, 27 Feb 2018 15:36:55 +0000 Subject: metric for block coverage In-Reply-To: <20180217221825.wovnzpnzftpsjp37@angband.pl> References: <20180217221825.wovnzpnzftpsjp37@angband.pl> Message-ID:
That is obsolete: as Khaled pointed out, there has never been a clear definition of "supported" and practice has been inconsistent. Moreover, the available bits were exhausted after Unicode 5.2, and we're now working on Unicode 11. Both Apple and Microsoft have started to use 'dlng' and 'slng' values in the 'meta' table of OpenType fonts to convey what a font can and is designed to support ? a distinction that the OS/2 table never allows for, but that is actually more useful. (I'd also point out that, in the upcoming Windows 10 feature update, the 'dlng' entries in fonts is used to determine what preview strings to use in the Fonts settings UI.) For scripts like Latin that have a large set of characters, most of which have infrequent usage, there can still be a challenge in characterizing the font, but the mechanism does provide flexibility in what is declared. But again, you haven't said what data to put into fonts is your issue. If you are trying to determine whether a given font supports a particular language, the OS/2 and 'meta' table provide heuristics ? with 'meta' being recommended; but the only way to know for absolute certain is to compare an exemplar character list for the particular language with the font's cmap table. But note, that can only tell you that a font _is able to support_ the language, which doesn't necessarily imply that it's actually a good choice for users of that language. For example, every font in Windows includes Basic Latin characters, but that definitely doesn't mean that the fonts are useful for an English speaker. This is why the 'dlng' entry in the 'meta' table was created. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Adam Borowski via Unicode Sent: Saturday, February 17, 2018 2:18 PM To: unicode at unicode.org Subject: metric for block coverage Hi! As a part of Debian fonts team work, we're trying to improve fonts review: ways to organize them, add metadata, pick which fonts are installed by default and/or recommended to users, etc. I'm looking for a way to determine a font's coverage of available scripts. It's probably reasonable to do this per Unicode block. Also, it's a safe assumption that a font which doesn't know a codepoint can do no complex shaping of such a glyph, thus looking at just codepoints should be adequate for our purposes. A na?ve way would be to count codepoints present in the font vs the number of all codepoints in the block. Alas, there's way too much chaff for such an approach to be reasonable: ? or ? count the same as LATIN TURNED CAPITAL LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH CARON. Another idea would be giving every codepoint a weight equal to the number of languages which currently use such a letter. Too bad, that wouldn't work for symbols, or for dead scripts: a good runic font will have a complete coverage of elder futhark, anglo-saxon, younger and medieval, while only a completionist would care about franks casket or Tolkien's inventions. I don't think I'm the first to have this question. Any suggestions? ????! -- ??????? ??????? A dumb species has no way to open a tuna can. ??????? A smart species invents a can opener. ??????? A master species delegates. 
From unicode at unicode.org Tue Feb 27 10:29:35 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 27 Feb 2018 09:29:35 -0700 Subject: Missing Kazakh Latin letters (was: Re: 0027, 02BC, 2019, or a new =?UTF-8?Q?character=3F=29?= Message-ID: <20180227092935.665a7a7059d7ee80bb4d670165c8327d.8651b8dbdd.wbe@email03.godaddy.com>

Michael Everson wrote: > Why on earth would they use Ch and Sh when 1) C isn't used by itself > and 2) if you're using ?? you may as well use ?? ??. Philippe Verdy wrote: > The three versions of the Cyrillic letter i are mapped to 1.5 > (distinguished only in lowercase by the Turkic lowercase dotless i, > but not distinguished in uppercase where there's no dot at all...). > It should have used two distinct letters at least (I with or without > acute). There's another problem. No Latin equivalents are listed for the Cyrillic letters Ц ц Ъ ъ Ь ь Э э Ю ю Я я, in either the old charts with apostrophes or the new chart with acutes. These are code points 0426, 042A, 042C, 042D, 042E, and 042F and corresponding lowercase. All of these letters, in lowercase or both, are used in the Kazakh translation of the UDHR currently available from the "UDHR in Unicode" project. So either the UDHR translation is wildly incorrect, which seems unlikely, or the transliteration tables are incomplete. Wikipedia shows digraphs I? ?? for Ю ю, and Ia ?a for Я я, and nothing for the others, though it is not clear where the digraphs came from, and of course the usual Wikipedia caveats apply. -- Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Tue Feb 27 10:45:36 2018 From: unicode at unicode.org (Neil Patel via Unicode) Date: Tue, 27 Feb 2018 11:45:36 -0500 Subject: Unicode Digest, Vol 50, Issue 20 In-Reply-To: References: Message-ID:

Do the ulUnicodeRange bits get used to dictate rendering behavior or script recognition? I am just wondering about whether the lack of bits to indicate an Adlam charset can cause other issues in applications. -Neil On Sat, Feb 24, 2018 at 1:00 PM, via Unicode wrote: > [...] > > Today's Topics: > > 1. Re: metric for block coverage (Norbert Lindenberg via Unicode) > > ---------- Forwarded message ---------- > From: Norbert Lindenberg via Unicode > To: Khaled Hosny > Cc: James Kass , Adam Borowski < kilobyte at angband.pl>, Unicode Public , Norbert Lindenberg > Date: Fri, 23 Feb 2018 10:15:32 -0800 > Subject: Re: metric for block coverage > > > On Feb 18, 2018, at 3:26 , Khaled Hosny via Unicode > wrote: > > > > On Sun, Feb 18, 2018 at 02:14:46AM -0800, James Kass via Unicode wrote: > >> Adam Borowski wrote, > >> > >>> I'm looking for a way to determine a font's coverage of available > >>> scripts. It's probably reasonable to do this per Unicode block. Also, > >>> it's a safe assumption that a font which doesn't know a codepoint can > >>> do no complex shaping of such a glyph, thus looking at just codepoints > >>> should be adequate for our purposes.
> >> > >> You probably already know that basic script coverage information is > >> stored internally in OpenType fonts in the OS/2 table. > >> > >> https://docs.microsoft.com/en-us/typography/opentype/spec/os2 > >> > >> Parsing the bits in the "ulUnicodeRange..." entries may be the > >> simplest way to get basic script coverage info. > > > > Though this might not be very reliable since OpenType does not have a > > definition of what it means for a Unicode block to be supported; some > > font authoring tools use a percentage, others use the presence of any > > characters in the range, and fonts might even provide incorrect data for > > any reason. > > > > However, I don't think script or block coverage is that useful, what > > users are usually interested in is the language coverage. > > > > Regards, > > Khaled > > All true. In addition, ulUnicodeRange ran out of bits around Unicode 5.1, > so scripts/blocks added to Unicode after that, such as Javanese, Tangut, or > Adlam, cannot be represented. > > Norbert

From unicode at unicode.org Tue Feb 27 13:09:38 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 27 Feb 2018 20:09:38 +0100 Subject: Unicode Digest, Vol 50, Issue 20 In-Reply-To: References: Message-ID:

I bet these bit sets are just for legacy applications that depend on them to detect support for the scripts encoded in the set with a simple test. I've not seen whether a standard extension of this legacy bitset was ever approved. For detecting support for other scripts not encoded in these bitsets, you'll need to check that there are sufficient mappings in the relevant blocks (most of these scripts are not in the BMP and are small enough to be encoded completely, except possibly extended emojis, musical notations, or new blocks for game and astrological symbols and the like, which belong to special symbolic scripts). You cannot just enumerate the languages present in the implemented OpenType "features" tables either, as most languages don't need such specific per-language tuning of the font and just use the default (locale-neutral) set of features; these per-language tunings are optional, and most often implemented only in CJK fonts (for script variants: Korean Hanja, Japanese Kanji, Simplified and Traditional Hanzi). 2018-02-27 17:45 GMT+01:00 Neil Patel via Unicode : > Do the ulUnicodeRange bits get used to dictate rendering behavior or > script recognition? > > I am just wondering about whether the lack of bits to indicate an Adlam > charset can cause other issues in applications. >

From unicode at unicode.org Tue Feb 27 13:32:58 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 27 Feb 2018 20:32:58 +0100 Subject: metric for block coverage In-Reply-To: References: <20180217221825.wovnzpnzftpsjp37@angband.pl> Message-ID:

I agree that 'dlng' is far better than this old legacy bitset (which was defined at a time when all of Unicode was in the BMP, and the envisioned CJK extension blocks outside the BMP were assumed to be handled by the bits defined for CJK). At least 'dlng' is intended to indicate whether a font adequately supports the exemplar character set needed for each language (or language-script pair) rather than a full script.
This is however challenging for rendering arbitrary text where the language is not identified (by metadata beside the text itself, including lang="" attributes in HTML/XML and lang() selectors in CSS, or document-level metadata or MIME headers in HTTP or emails): many documents do not properly tag the language they use, and don't identify all embedded foreign languages in multilingual documents; some applications do not even have such info (e.g. text fields in most SQL databases, or files with simple structures like CSV, dBF...), and renderers may need to use a "language guesser" heuristic (which may turn out to be wrong on short text fields, where it will simply be better to check whether all characters are covered). So there's no simple solution. What has been done in most OSes is to provide a better basic set of preinstalled fonts that have good coverage, and use them as fallbacks each time there's a problem and an application did not indicate a specific font (or just used generic font name aliases like "serif", "sans-serif", "monospace", "symbols"). These OSes (or libraries in independent text rendering engines) also contain in their renderers a database of rules for font fallbacks from well-known font names, which may be replaced by other supported fonts with "similar" characteristics and metrics. 2018-02-27 16:36 GMT+01:00 Peter Constable via Unicode : > You haven't clarified what exactly the usage is; you've only asked what it > means to cover a script. > > James Kass mentioned a font's OS/2 table. That is obsolete: as Khaled > pointed out, there has never been a clear definition of "supported" and > practice has been inconsistent. Moreover, the available bits were exhausted > after Unicode 5.2, and we're now working on Unicode 11. Both Apple and > Microsoft have started to use 'dlng' and 'slng' values in the 'meta' table > of OpenType fonts to convey what a font can and is designed to support -- a > distinction that the OS/2 table never allows for, but that is actually more > useful. (I'd also point out that, in the upcoming Windows 10 feature > update, the 'dlng' entries in fonts are used to determine what preview > strings to use in the Fonts settings UI.) For scripts like Latin that have > a large set of characters, most of which have infrequent usage, there can > still be a challenge in characterizing the font, but the mechanism does > provide flexibility in what is declared. > > But again, you haven't said whether the issue for you is what data to put > into fonts. If you are trying to determine whether a given font supports a > particular language, the OS/2 and 'meta' tables provide heuristics -- with > 'meta' being recommended -- but the only way to know for absolute certain > is to compare an exemplar character list for the particular language with > the font's cmap table. But note, that can only tell you that a font _is > able to support_ the language, which doesn't necessarily imply that it's > actually a good choice for users of that language. For example, every font > in Windows includes Basic Latin characters, but that definitely doesn't > mean that all the fonts are useful for an English speaker. This is why the > 'dlng' entry in the 'meta' table was created. >
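Since the thread keeps coming back to the 'meta' table, here is a matching minimal sketch for reading it with Python's fontTools, which exposes 'meta' as a mapping from 4-byte tags to values, with 'dlng' and 'slng' decoded as comma-separated ScriptLangTag text. Again, the font path is a hypothetical placeholder:

    from fontTools.ttLib import TTFont

    font = TTFont("SomeFont.ttf")  # hypothetical path
    if "meta" in font:
        data = font["meta"].data
        # 'dlng' = languages/scripts the font was *designed* for;
        # 'slng' = everything it is *capable* of supporting.
        for tag in ("dlng", "slng"):
            if tag in data:
                print(tag, "=", [v.strip() for v in data[tag].split(",")])
    else:
        print("no 'meta' table: fall back to OS/2 ranges or a cmap scan")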
From unicode at unicode.org  Wed Feb 28 00:37:33 2018
From: unicode at unicode.org (Peter Constable via Unicode)
Date: Wed, 28 Feb 2018 06:37:33 +0000
Subject: Unicode Digest, Vol 50, Issue 20
In-Reply-To: 
References: 
Message-ID: 

The OpenType spec doesn't in any way suggest that the bits be used that
way. It's impossible to assert that there are no applications out there
that do that, but I wouldn't expect there to be many widely-used apps that
do that today.

On the other hand, something that the bits might affect is behaviour like
font selection / font binding. For example, if you paste plain text into a
rich-text app, it must select a default font for that text, since it's a
rich-text app. Now, an obvious choice would be to use the font applied to
the characters on either side of the insertion point. But if it turned out
that that font didn't support the text being pasted, that would create a
rendering problem; so the app probably wants to avoid that. An app just
might use these bits as a heuristic to decide whether the current font can
support the text or not.

I say that Unicode-range bits probably wouldn't affect rendering in
current apps, though that wasn't necessarily the case in the past. Word 97
was one of the very first mainstream apps to support Unicode, but it was
limited in the scripts that were actually supported. Word 2000 was still
early in terms of mainstream Unicode support, and still had limitations. I
recall working on font projects for the Ethiopic and Yi scripts (with SIL
at the time) and needing to set Unicode range or code page bits in order
to get text working in Word using our fonts.

One particular issue was a font-binding issue: Word would lump the Yi
characters in with CJK (they're not Western, and they're not among the few
complex scripts that were supported, so assume they're CJK), but wouldn't
allow the font to be applied until I set bits to make Word think the font
supported CJK. But then with the Ethiopic font, a different effect, a
rendering issue, became apparent: Ethiopic characters have many different
widths, but Word ignored the actual glyph metrics and displayed every
glyph with the same width (the apparent assumption being that the
characters are all CJK and all have the same width). Again, bits had to be
set to make it observe the actual glyph metrics. IIRC, in one case I
needed to set the Shift-JIS code page bit, and in the other case, to set a
bit for one of the kana blocks. But that was many years ago now. I can't
think of seeing Unicode-range bits affecting rendering in a long time.

Peter

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Neil
Patel via Unicode
Sent: Tuesday, February 27, 2018 8:46 AM
To: unicode at unicode.org; unicode-request at unicode.org
Subject: Re: Unicode Digest, Vol 50, Issue 20

Do the ulUnicodeRange bits get used to dictate rendering behavior or
script recognition?

I am just wondering about whether the lack of bits to indicate an Adlam
charset can cause other issues in applications.

-Neil

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
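The paste-time heuristic Peter describes could look roughly like this (an
illustration only, assuming Python with fontTools; BLOCK_BITS is a
three-entry excerpt of the 128 bit assignments in the OpenType OS/2
specification, so a real implementation would carry the full table,
including the extra ranges some bits cover):

    from fontTools.ttLib import TTFont

    # Excerpt of OS/2 ulUnicodeRange bit assignments (bit -> block range).
    # Only three of the 128 bits are listed here, for illustration.
    BLOCK_BITS = {
        0: (0x0000, 0x007F),  # Basic Latin
        1: (0x0080, 0x00FF),  # Latin-1 Supplement
        9: (0x0400, 0x04FF),  # Cyrillic (the full spec adds more ranges)
    }

    def font_claims_text(font_path, text):
        """Heuristic: does the font claim coverage of every character's
        Unicode block via its ulUnicodeRange bits?"""
        os2 = TTFont(font_path)["OS/2"]
        fields = [os2.ulUnicodeRange1, os2.ulUnicodeRange2,
                  os2.ulUnicodeRange3, os2.ulUnicodeRange4]

        def bit_set(bit):
            return bool(fields[bit // 32] & (1 << (bit % 32)))

        for ch in text:
            cp = ord(ch)
            bits = [b for b, (lo, hi) in BLOCK_BITS.items()
                    if lo <= cp <= hi]
            # Characters in blocks with no assigned bit can never be
            # claimed, so they always fail this heuristic.
            if not any(bit_set(b) for b in bits):
                return False
        return True

Note how a character in a block with no assigned bit at all, such as
Adlam, makes the heuristic fail regardless of the font's actual cmap,
which is exactly the kind of knock-on effect Neil is asking about.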
From unicode at unicode.org  Wed Feb 28 04:38:13 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Wed, 28 Feb 2018 11:38:13 +0100
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: <5A95D192.5050608@unicode.org>
	(announcements@unicode.org's message of "Tue, 27 Feb 2018 13:45:54 -0800")
References: <5A95D192.5050608@unicode.org>
Message-ID: <86tvu12ycq.fsf@mimuw.edu.pl>

On Tue, Feb 27 2018 at 13:45 -0800, announcements at unicode.org writes:

> The 157 new Emoji are now available for adoption, to help the Unicode
> Consortium's work on digitally disadvantaged languages.

I'm quite curious what the relation is between the new emojis and the
digitally disadvantaged languages. I see none.

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From unicode at unicode.org  Wed Feb 28 04:48:22 2018
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Wed, 28 Feb 2018 19:48:22 +0900
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: <86tvu12ycq.fsf@mimuw.edu.pl>
References: <5A95D192.5050608@unicode.org> <86tvu12ycq.fsf@mimuw.edu.pl>
Message-ID: <950a5dfa-36a6-b48b-eed1-befb210bb0ec@it.aoyama.ac.jp>

On 2018/02/28 19:38, Janusz S. Bień
via Unicode wrote:

> On Tue, Feb 27 2018 at 13:45 -0800, announcements at unicode.org writes:
>
>> The 157 new Emoji are now available for adoption, to help the Unicode
>> Consortium's work on digitally disadvantaged languages.
>
> I'm quite curious what the relation is between the new emojis and the
> digitally disadvantaged languages. I see none.

I think this was mentioned before on this list, in particular by Mark:
The money collected from character adoptions (where emoji are a prominent
target) is (mostly?) used to support work on not-yet-encoded (thus
digitally disadvantaged) scripts. See e.g. the recent announcement at
http://blog.unicode.org/2018/02/adopt-character-grant-to-support-three.html.

Regards, Martin.

From unicode at unicode.org  Wed Feb 28 04:53:41 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Wed, 28 Feb 2018 11:53:41 +0100
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: <950a5dfa-36a6-b48b-eed1-befb210bb0ec@it.aoyama.ac.jp>
References: <5A95D192.5050608@unicode.org> <86tvu12ycq.fsf@mimuw.edu.pl>
	<950a5dfa-36a6-b48b-eed1-befb210bb0ec@it.aoyama.ac.jp>
Message-ID: 

Also, please click through from the announcement to
http://www.unicode.org/consortium/adopt-a-character.html.

If it isn't apparent from that page what the relationship is, we have some
work to do...

Mark

On Wed, Feb 28, 2018 at 11:48 AM, Martin J. Dürst via Unicode <
unicode at unicode.org> wrote:

> On 2018/02/28 19:38, Janusz S. Bień via Unicode wrote:
>
>> On Tue, Feb 27 2018 at 13:45 -0800, announcements at unicode.org writes:
>>
>>> The 157 new Emoji are now available for adoption, to help the Unicode
>>> Consortium's work on digitally disadvantaged languages.
>>
>> I'm quite curious what the relation is between the new emojis and the
>> digitally disadvantaged languages. I see none.
>
> I think this was mentioned before on this list, in particular by Mark:
> The money collected from character adoptions (where emoji are a prominent
> target) is (mostly?) used to support work on not-yet-encoded (thus
> digitally disadvantaged) scripts. See e.g. the recent announcement at
> http://blog.unicode.org/2018/02/adopt-character-grant-to-support-three.html.
>
> Regards, Martin.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org  Wed Feb 28 05:22:08 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Wed, 28 Feb 2018 12:22:08 +0100
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: 
	(Mark Davis's message of "Wed, 28 Feb 2018 11:53:41 +0100")
References: <5A95D192.5050608@unicode.org> <86tvu12ycq.fsf@mimuw.edu.pl>
	<950a5dfa-36a6-b48b-eed1-befb210bb0ec@it.aoyama.ac.jp>
Message-ID: <86po4p2wbj.fsf@mimuw.edu.pl>

Thanks to all who answered. The answers are very clear, but the original
message and the adoption page are in my opinion much less clear. I can
however live with it :-)

Best regards

Janusz

On Wed, Feb 28 2018 at 11:53 +0100, mark at macchiato.com writes:

>> Also, please click through from the announcement to
>> http://www.unicode.org/consortium/adopt-a-character.html.
>>
>> If it isn't apparent from that page what the relationship is, we have
>> some work to do...
>>
>> Mark
>>
>> On Wed, Feb 28, 2018 at 11:48 AM, Martin J. Dürst via Unicode wrote:
>>
>> On 2018/02/28 19:38, Janusz S. Bień
>> via Unicode wrote:
>>
>> On Tue, Feb 27 2018 at 13:45 -0800, announcements at unicode.org writes:
>>
>> The 157 new Emoji are now available for adoption, to help the Unicode
>> Consortium's work on digitally disadvantaged languages.
>>
>> I'm quite curious what the relation is between the new emojis and the
>> digitally disadvantaged languages. I see none.
>>
>> I think this was mentioned before on this list, in particular by Mark:
>> The money collected from character adoptions (where emoji are a
>> prominent target) is (mostly?) used to support work on not-yet-encoded
>> (thus digitally disadvantaged) scripts. See e.g. the recent
>> announcement at
>> http://blog.unicode.org/2018/02/adopt-character-grant-to-support-three.html.

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From unicode at unicode.org  Wed Feb 28 05:58:45 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Wed, 28 Feb 2018 12:58:45 +0100
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: <86po4p2wbj.fsf@mimuw.edu.pl>
References: <5A95D192.5050608@unicode.org> <86tvu12ycq.fsf@mimuw.edu.pl>
	<950a5dfa-36a6-b48b-eed1-befb210bb0ec@it.aoyama.ac.jp>
	<86po4p2wbj.fsf@mimuw.edu.pl>
Message-ID: 

I'm more interested in what areas you found unclear, because wherever you
did, I'm sure many others would as well. You can reply off-list if you
want.

Mark

On Wed, Feb 28, 2018 at 12:22 PM, Janusz S. Bień wrote:

> Thanks to all who answered. The answers are very clear, but the original
> message and the adoption page are in my opinion much less clear. I can
> however live with it :-)
>
> Best regards
>
> Janusz
>
> On Wed, Feb 28 2018 at 11:53 +0100, mark at macchiato.com writes:
> > Also, please click through from the announcement to
> > http://www.unicode.org/consortium/adopt-a-character.html.
> >
> > If it isn't apparent from that page what the relationship is, we have
> > some work to do...
> >
> > Mark
> >
> > On Wed, Feb 28, 2018 at 11:48 AM, Martin J. Dürst via Unicode <
> > unicode at unicode.org> wrote:
> >
> > On 2018/02/28 19:38, Janusz S. Bień via Unicode wrote:
> >
> > On Tue, Feb 27 2018 at 13:45 -0800, announcements at unicode.org writes:
> >
> > The 157 new Emoji are now available for adoption, to help the Unicode
> > Consortium's work on digitally disadvantaged languages.
> >
> > I'm quite curious what the relation is between the new emojis and the
> > digitally disadvantaged languages. I see none.
> >
> > I think this was mentioned before on this list, in particular by Mark:
> > The money collected from character adoptions (where emoji are a
> > prominent target) is (mostly?) used to support work on not-yet-encoded
> > (thus digitally disadvantaged) scripts. See e.g. the recent
> > announcement at
> > http://blog.unicode.org/2018/02/adopt-character-grant-to-support-three.html.
>
> --
> Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra
> Lingwistyki Formalnej)
> Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
> jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org  Wed Feb 28 07:22:32 2018
From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode)
Date: Wed, 28 Feb 2018 14:22:32 +0100 (CET)
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: <5A95D192.5050608@unicode.org>
References: <5A95D192.5050608@unicode.org>
Message-ID: <91680448.22170.1519824152519@ox.hosteurope.de>

announcements at unicode.org:

> The 157 new Emoji are now available for adoption,

But Unicode 11.0 (which all new emojis but Pirate Flag and Infinity rely
upon) is not even in beta yet.

> There are approximately 7,000 living human languages,
> but fewer than 100 of these languages are well-supported on computers,
> mobile phones, and other devices. Adopt-a-character donations are used
> to improve Unicode support for digitally disadvantaged languages, and to
> help preserve the world's linguistic heritage.

Why is the announcement mentioning those numbers of languages at all? The
script coverage of written living human languages, except for constructed
ones, is almost complete in Unicode, and rendering for most of them is
reasonably well supported by all modern operating systems (despite
recently discovered bugs). Availability of translations or original
material is another matter entirely. Languages that have no written
tradition are irrelevant to Unicode (but not to the world's linguistic
heritage).

In other words, no future update to the UCS will significantly change that
100-out-of-7000 metric, but the announcement makes it sound like it would.
CLDR may have some influence, but character adoptions and the research
grants they enable are not at all associated with that.

From unicode at unicode.org  Wed Feb 28 07:41:14 2018
From: unicode at unicode.org (Andrew West via Unicode)
Date: Wed, 28 Feb 2018 13:41:14 +0000
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: <950a5dfa-36a6-b48b-eed1-befb210bb0ec@it.aoyama.ac.jp>
References: <5A95D192.5050608@unicode.org> <86tvu12ycq.fsf@mimuw.edu.pl>
	<950a5dfa-36a6-b48b-eed1-befb210bb0ec@it.aoyama.ac.jp>
Message-ID: 

On 28 February 2018 at 10:48, Martin J. Dürst via Unicode wrote:
>>
>>> The 157 new Emoji are now available for adoption, to help the Unicode
>>> Consortium's work on digitally disadvantaged languages.
>>
>> I'm quite curious what the relation is between the new emojis and the
>> digitally disadvantaged languages. I see none.
>
> I think this was mentioned before on this list, in particular by Mark:
> The money collected from character adoptions (where emoji are a prominent
> target) is (mostly?) used to support work on not-yet-encoded (thus
> digitally disadvantaged) scripts. See e.g. the recent announcement at
> http://blog.unicode.org/2018/02/adopt-character-grant-to-support-three.html.
>
> Regards, Martin.

Over $250,000 has been raised from Unicode character adoptions to date. I
am curious as to how much of this money has been spent, and would very
much like to see annual accounts showing how much money has been received,
and how much has been disbursed, to whom, and for what.

Andrew

From unicode at unicode.org  Wed Feb 28 09:39:06 2018
From: unicode at unicode.org (QSJN 4 UKR via Unicode)
Date: Wed, 28 Feb 2018 17:39:06 +0200
Subject: Bidi edge cases in Hangul and Indic
In-Reply-To: <3a5c8b7c-9b86-4654-fbf1-1432011e603f@att.net>
References: <9d47f560-e447-f121-9505-cc4f48e0171a@att.net>
	<3a5c8b7c-9b86-4654-fbf1-1432011e603f@att.net>
Message-ID: 

Thank you.
Section 3.5 confused me: shaping, that is, the selection of cursively
connected shapes, is applied after the UBA reordering. However, other
character-to-glyph conversions are applied before it "(taking the
embedding levels into account for mirroring)".

> 2018-02-26 21:45 GMT+02:00, Ken Whistler :
> On 2/26/2018 7:11 AM, QSJN 4 UKR wrote:
>>> The UBA reorders the display order in layout -- not the underlying
>>> string.
>>
>> What?
>>
>> UBA reorders characters, not glyphs.
>
> Actually it does not. The backing storage order of the text is
> unaffected. See UAX #9:
>
> "When working with bidirectional text, the characters are still
> interpreted in logical order--only the display is affected."
>
> And see Section 3.4, Reordering Resolved Levels. The character stream is
> mapped onto glyphs *in logical order*.

From unicode at unicode.org  Wed Feb 28 08:00:54 2018
From: unicode at unicode.org (Andrew West via Unicode)
Date: Wed, 28 Feb 2018 14:00:54 +0000
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: <91680448.22170.1519824152519@ox.hosteurope.de>
References: <5A95D192.5050608@unicode.org>
	<91680448.22170.1519824152519@ox.hosteurope.de>
Message-ID: 

On 28 February 2018 at 13:22, Christoph Päper via Unicode wrote:
>>
>> The 157 new Emoji are now available for adoption
>
> But Unicode 11.0 (which all new emojis but Pirate Flag and Infinity rely
> upon) is not even in beta yet.

Don't even get me started on that!

>> There are approximately 7,000 living human languages,
>> but fewer than 100 of these languages are well-supported on computers,
>> mobile phones, and other devices. Adopt-a-character donations are used
>> to improve Unicode support for digitally disadvantaged languages, and to
>> help preserve the world's linguistic heritage.
>
> Why is the announcement mentioning those numbers of languages at all?

I agree, the figures are meaningless and misleading (and intended to
mislead). I could list a hundred languages that are written with the Latin
script without pausing for breath. There are very, very few scripts in
modern daily use that are not yet encoded in the UCS, but letting out that
secret will not help the Unicode Consortium to raise money from character
adoption.

The latest grant to Anshu from Character Adoption money is for three
historic scripts
(http://blog.unicode.org/2018/02/adopt-character-grant-to-support-three.html).
If there were still so many digitally disadvantaged languages urgently in
need of script encoding, then surely the Unicode Consortium would be
sponsoring those as a priority rather than historic scripts.

Andrew

From unicode at unicode.org  Wed Feb 28 16:33:03 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 28 Feb 2018 23:33:03 +0100
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: <91680448.22170.1519824152519@ox.hosteurope.de>
References: <5A95D192.5050608@unicode.org>
	<91680448.22170.1519824152519@ox.hosteurope.de>
Message-ID: 

2018-02-28 14:22 GMT+01:00 Christoph Päper via Unicode :

> > There are approximately 7,000 living human languages,
> > but fewer than 100 of these languages are well-supported on computers,
> > mobile phones, and other devices.

Fewer than 100 languages is a bit small; I can count nearly 200 languages
well supported with all the necessary basic support to develop them with
content.
The limitation, however, is elsewhere: in education and literacy levels
for these languages, so that people start using them as well on the web
and in other media, or use them more easily in their daily life, and
improve the quality and coverage of the data available in these languages.
This includes developing an orthography (many languages don't have any
developed and supported orthography, even if there have been attempts to
create dictionaries, including online with Wiktionary).
This these languages are living, it should > not > be difficult to support most of them with the existing scripts that > are already encoded (weve reched the point where we only have to > encode historic scripts, to preserve the cultures or languages that > have disappeared or are dying fast since the begining of the 20th > century). Even if major languages will persist and regional languages > will die, this should not be done without reintegrating in those > major > languages some significant parts of the past regional cultures, which > can still become sources for enriching these major languages so that > they become more precise and more useful and allow then easier access > to past regional languages, possibly then directly in their original > script, with people then able to decipher them or being interested to > study them. Past languages and preserved texts will then remain as a > rich source for keeping existing languages alive, vivid, productive > for new terms, without having to necessarily borrow terms from less > than 20 large "international" languages (ar, de, en, es,?fa, fr, nl, > id, ja, ko, pt, ru, hi, zh), written in only 6 well developed scripts > (Arab, Latn, Cyrl, Deva, Hang, Hans, Jpan). > Pen, or brush and paper is much more flexible. With thousands of names of people and places still not encoded I am not sure if I would describe hans (simplified Chinese characters) as well supported. nor with current policy which limits China with over one billion people to submitting less than 500 Chinese characters a year on average, and names not being all to be added, it is hard to say which decade hans will be well supported. John Knightley > > > Links: > ------ > [1] mailto:unicode at unicode.org