From unicode at unicode.org Thu Feb 1 01:03:31 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 1 Feb 2018 08:03:31 +0100
Subject: Internationalised Computer Science Exercises
In-Reply-To: <20180201013858.383c7313@JRWUBU2>
References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2> <20180129085741.6fcf00f8@JRWUBU2> <20180129205305.5d5d202d@JRWUBU2> <20180201013858.383c7313@JRWUBU2>
Message-ID: <...>

2018-02-01 2:38 GMT+01:00 Richard Wordingham via Unicode <unicode at unicode.org>:

> On Wed, 31 Jan 2018 19:45:56 +0100
> Philippe Verdy via Unicode wrote:
>
> > 2018-01-29 21:53 GMT+01:00 Richard Wordingham via Unicode <unicode at unicode.org>:
> >
> > > On Mon, 29 Jan 2018 14:15:04 +0100
> > > <...> was meant to be an example of a searched string. For example,
> > > <..., COMBINING DOT BELOW> contains, under canonical equivalence,
> > > the substring <...>. Your regular expressions would not detect this
> > > relationship.
>
> > My regular expression WILL detect this: scanning the text implies
> > first composing it to "full equivalent decomposition form" (without
> > even reordering it, and possibly recomposing it to NFD) while reading
> > it and buffering it in the forward direction (it just requires the
> > decomposition pairs from the UCD, including those that are "excluded"
> > from NFC/NFD).
>
> No. To find <..., COMBINING DOT BELOW>, you constructed, on
> "Sun, 28 Jan 2018 20:30:44 +0100":
>
> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> ( [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <...>
> | [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <...>
>
> To be consistent, to find <...> you would construct
>
> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> (
> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <...>
> |
> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <...>
> )
>
> (A final ')' got lost between brain and text; I have restored it.)

This was a minor omission: ONLY that final parenthesis was missing, as it
was truncated from its last line, where it was the only character (I don't
know why it was truncated there, but it is easy to restore). You did not
correct anything else.

> However, <U+01D6 LATIN SMALL LETTER U WITH DIAERESIS AND MACRON,
> COMBINING DOT BELOW> decomposes to <U+0075, COMBINING DIAERESIS,
> COMBINING MACRON, COMBINING DOT BELOW>. It doesn't match your regular
> expression, for between COMBINING DIAERESIS and COMBINING DOT BELOW
> there is COMBINING MACRON, for which ccc = above!

And my regexp contained all the necessary asterisks, so yes, it does not
match, because the combining macron blocks the combining dot below and the
combining diaeresis from commuting, and so there's no canonical
equivalence: <...> cannot be matched in any case when searching under
canonical equivalence rules. So this regexp is perfectly correct. No error
at all (except the missing final parenthesis), and my argument remains
valid.

> > The regexp engine will then only process the "fully decomposed" input
> > text to find matches, using the regexp transformed as above (which has
> > some initially "complex" setup to "fully decompose" the initial
> > regexp, but only once, when constructing it, and not while processing
> > the input text, which can then be done straightforwardly, with its
> > full decomposition easily performed on the fly without any additional
> > buffering except a very small look-ahead whose length is never longer
> > than the longest "canonical" decompositions in the UCD, i.e. at most 2
> > code points of look-ahead).
>
> Nitpick: U+1F84 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND
> YPOGEGRAMMENI decomposes to <U+03B1, U+0313, U+0301, U+0345>.
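That nitpick, and the look-ahead bound it corrects, are easy to verify
with Python's standard unicodedata module. A minimal sketch (the printed
figures reflect whatever UCD version is bundled with the interpreter):

    import sys
    import unicodedata

    # Full canonical decomposition (NFD) of U+1F84: four code points.
    nfd = unicodedata.normalize("NFD", "\u1f84")
    print([f"U+{ord(c):04X}" for c in nfd])
    # -> ['U+03B1', 'U+0313', 'U+0301', 'U+0345']

    # Longest NFD expansion of any single code point, i.e. the maximum
    # look-ahead a streaming decomposer needs per input code point:
    print(max(len(unicodedata.normalize("NFD", chr(cp)))
              for cp in range(sys.maxunicode + 1)))
    # -> 4 for current UCD versions, so the buffer stays tiny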
> Conversion to NFD on input only requires a small buffer for natural
> orthographies. I suspect the worst in natural language will come from
> either narrow IPA transcriptions or Classical Greek.

OK, the canonical decompositions may expand to more than 2 code points,
because some canonical decomposition pairs may themselves contain
decomposable pairs, but this is still bounded (4 here). The complete set
of full decompositions from the UCD is well known; it fits in a reasonably
small static table for each version of Unicode (and its size grows very
slowly, only when newly encoded characters are decomposable without
breaking the stability rules about all existing combining characters).
Even if this expands the input text to 4 times its length (in number of
code points), it still requires only a very small input look-ahead buffer.
Very few entries in this table decompose to more than 2 code points, and
this only occurs for the oldest characters in Unicode, notably in Greek,
because there's a **single** case of a combining character that has a
canonical decomposition pair (this comes from the encoding of a combining
character mapped for compatibility from a legacy non-Unicode charset). All
the other pairs are a base character (cc=0), possibly decomposable again
only one time (e.g. Vietnamese Latin letters), plus a single
non-decomposable combining character with cc>0, and so are fully
decomposable to 3 characters: these encoded characters have multiple
diacritics, and are quite rare in the UCD except in the extended Latin
blocks.

> > The automaton is of course the classic NFA used by regexp engines
> > (and not the DFA, which explodes combinatorially for some regexps),
> > but it is still fully deterministic (the current "state" in the
> > automaton is not a single integer for the node number in the
> > traversal graph, but a set of node numbers; and all regexps have a
> > finite number of nodes in the traversal graph, this number being
> > proportional to the length of the regexp, so it does not need a lot
> > of memory, and the size of the current "state" is also fully bounded,
> > never larger than the length of the regexp). Optimizing some
> > contextual parts of the NFA to a DFA is possible (to speed up the
> > matching process and reduce the maximum storage size of the "current
> > state"), but only if it does not cause a growth of the total number
> > of nodes in the traversal graph, or as long as this growth does not
> > exceed some threshold (e.g. not more than 2 or 3 times the regexp
> > size).
>
> In your claim, what is the length of the regexp for searching for ? in
> a trace? Is it 3, or is it about 14? If the former, I am very
> interested in how you do it. If the latter, I would say you already
> have a form of blow up in the way you cater for canonical equivalence.

For searching ?, the transformed regexp is just

[[ [^[[:cc=0:]]] - [[:cc=above:]] ]] * <...>

The NFA traversal graph contains nodes at the locations pointed to below
by apostrophes:

'
' [[ [^[[:cc=0:]]] - [[:cc=above:]] ]]
' *
' <...>
'

It has 5 nodes only (assuming that the regexp engine will compute lookup
tables to build the character classes). When there is a quantifier (like
"*" here) which is not "{1,1}", a node is inserted after each character or
character class it applies to.
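That "set of node numbers" state is the textbook way to run such an NFA
without combinatorial blow-up. A minimal, hypothetical Python sketch of
the idea (an invented node layout, not Philippe's exact representation):

    # Each node is (predicate, successors); predicate is None for
    # epsilon nodes (SOT/EOT and quantifier placeholders). The matcher's
    # "state" is a frontier: a set of node indices.

    def close(nodes, frontier):
        # Follow epsilon edges until the frontier is stable.
        stack, seen = list(frontier), set(frontier)
        while stack:
            i = stack.pop()
            pred, succs = nodes[i]
            if pred is None:
                for j in succs:
                    if j not in seen:
                        seen.add(j)
                        stack.append(j)
        return seen

    def matches(nodes, start, accept, text):
        frontier = close(nodes, {start})
        for ch in text:
            nxt = set()
            for i in frontier:
                pred, succs = nodes[i]
                if pred is not None and pred(ch):
                    nxt.update(succs)
            frontier = close(nodes, nxt)
        return accept in frontier

    # Graph for /a*b/: 0=SOT, 1='a' (loops), 2='b', 3=EOT.
    nodes = [
        (None, [1, 2]),                  # SOT: to the 'a' loop or 'b'
        (lambda c: c == "a", [1, 2]),    # 'a' under * : loop or move on
        (lambda c: c == "b", [3]),       # 'b'
        (None, []),                      # EOT
    ]
    print(matches(nodes, 0, 3, "aaab"))  # True
    print(matches(nodes, 0, 3, "ac"))    # False

The frontier never grows beyond the number of nodes, which is what keeps
the memory bound proportional to the regexp length.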
No node is inserted in the NFA traversal graph for non-capturing
parentheses, but nodes may be inserted for capturing parentheses; and
there are the two nodes representing the start and the end of the regexp
(a node is also inserted for the "^" or "$" context delimiters, to match
the start and end of input lines for regexps using the "multiline" flag,
or the start and end of the input text otherwise; they behave like
character classes and are excluded from the capture).

> Even with the dirty trick of normalising the searched trace for input
> (I wanted the NFA propagation to be defined by the trace - I also
> didn't want to have to worry about the well-formedness of DFAs or
> NFAs), I found that the number of states for a concatenation of regular
> languages of traces was bounded above by the product of the number of
> states

This worst case occurs when each regexp can match a zero-length input
string (i.e. its final node in the traversal graph is a quantifier like
"{0,m}" or "*" or "?" that applies to the whole regexp), or when the
traversal graph is made of parallel branches starting from the same point,
each ending with such a quantifier. The traversal graph needs to resolve
the parentheses and capturing groups so as to combine them into single
quantifier nodes; this transform from the bounded unresolved graph does
not cause it to expand in size (number of nodes), but instead causes it to
be compacted (character classes on links leading to the same target node
can be factorized: compute their intersection, separate it out of each
branch, merge it into a branch of its own, and drop every remaining branch
whose residual character class is empty). This is simple to compute.

For this worst case, you don't generate a product of the two NFA traversal
graphs; you can directly concatenate the graphs, the growth in size
remaining proportional to the total length of the initial regexp, with a
small bounded factor (this factor depends on how you represent each node:
with character classes possibly including SOT and EOT, or without
character classes, using separate nodes for each character and for SOT and
EOT). It does not seem unreasonable to build these character classes and
compute their unions/intersections where necessary across branches when
factorizing them; you just need to take care of the nodes added for
non-{1,1} quantifiers.

The regexp engine can also choose to expand {m,} or {m,n} quantifiers
where m > 1 by concatenating at most m occurrences of the subgraph before
it (it can do that at least once, so that /(a){2,}/ (without capturing
groups for the parentheses here) is treated as if it were /(a)(a)*/; and
if the subgraph for /(a)/ is small (not more than 64 nodes, for example)
it can perform this expansion more times). For example, the NFA traversal
graph for /(a|b){10,}/ is (assuming that a and b are orthogonal subgraphs
and not single characters that could be combined by computing character
classes):

    '
   / \
 'a   'b
   \ /
  '{10,}
    |
    '

It has 5 nodes (including those for SOT and EOT). Expanding it one time
gives:

      '
     / \
   'a   'b
   / \  / \
 'a 'b 'a 'b
   \ /   \ /
 '{9,}  '{9,}
     \   /
      '

It is with this pre-expansion of quantifiers that you see the graph
expand. If you expand /A{m,n}/ one time to /AA{m-1,n-1}/, where the graph
for /A{m,n}/ has (k+2) nodes (including SOT and EOT), then the new graph
will have (2k+2) nodes if n>2, or only (2k) nodes if n=2.
For example, /a{2}/ has 4 nodes (still marked by leading apostrophes
here), and so k=2:

'
|
'a
|
'{2,2}
|
'

You can expand the '{2,2} quantifier one time; this gives a graph with 2k
nodes (not 2k+2, because n=2 in this quantifier, and the expansion finally
drops the remaining {1,1} quantifier), i.e.

'
|
'a
|
'a
|
'

In that case, the expansion does not grow the graph size, because n=2 in
the quantifier and the subgraph on which the quantifier loops is small
enough (just a single character or character class!).

The regexp engine always has the option of precomputing this expansion...
or not. The expansion does not lower the number of "active states" in the
graph; it just allows faster traversal of the graph by avoiding passes
through quantifier nodes (which need counters in their state and require
an additional step). I suggest expanding {m,n} quantifiers no more than
one to four times, and not doing it at all if the subgraph is large (more
than 4 nodes, say). Beyond this, the performance gain is marginal, given
that the graph's size will grow dramatically, you will reduce the locality
of the processor's data caches for the graph itself, and you will need to
allocate more dynamic memory. This tuning should be settled by profiling
your actual implementation of the graph traversal in the matcher (when
scanning the input), and the resources (time and storage space) needed to
compile the regexp into this expanded graph.

From unicode at unicode.org Thu Feb 1 02:19:48 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 1 Feb 2018 09:19:48 +0100
Subject: Internationalised Computer Science Exercises
In-Reply-To: <...>
References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2> <20180129085741.6fcf00f8@JRWUBU2> <20180129205305.5d5d202d@JRWUBU2> <20180201013858.383c7313@JRWUBU2>
Message-ID: <...>

2018-02-01 8:03 GMT+01:00 Philippe Verdy <...>:

> 2018-02-01 2:38 GMT+01:00 Richard Wordingham via Unicode <unicode at unicode.org>:
>
> For example, /a{2}/ has 4 nodes (still marked by leading apostrophes
> here), and so k=2:
>
>  '   <----
>  |   /   \
> 'a   |    |
>  |   ^    |
> '{2,2}    |
>  |   \   /
>  '    --->

I forgot to describe how I represent the graph (when compiling regexps
only). And I forgot the second (looping) link from the quantifier (added
above).

The graph is just a vector of nodes stored as a linear array indexed by
integers. Nodes (with leading apostrophes in the notation above) are
objects with one of 4 types: SOT, character class, quantifier, or EOT.

- The SOT and EOT node types are trivial and have no other properties;
they exist once and only once in every graph, and they can be the same
actual type.

- The quantifier node type has two integer properties, min and max, taking
a positive or null value, or INFINITE for the unbounded quantifiers (e.g.
"+", "*", or "{2,}"), with the constraint that min<=max. INFINITE can be
represented in an integer as -1 or MAXINT. It also has another computed
property, the counter index, i.e. an index into the array of counter
values you'll allocate in the "state" variable you'll use in the matcher.
It has two other properties: the next node number (in the represented
graph) to go to, according to whether the counter has reached the
[min, max] interval or not, and possibly a "greedy" flag to specify which
condition (false or true) you'll evaluate first (instead of this flag you
can create two separate subtypes for greedy and non-greedy quantifiers, to
avoid this test at runtime in the matcher).

- The character class node type can have subtypes: single character, basic
character class (like [abc]), or a more complex character class,
implemented by a character class method "is(c)", where c is the input code
point from the text being scanned. The functional method can evaluate, for
example, Unicode character properties such as "isupper(c)" evaluating the
[:gc=Lu:] character class, or "isdigit(c)" evaluating the [:gc=Nd:]
character class, or isin(c, "string") to detect whether c is present in
the given string containing a list of characters, or isnotin(c, "string")
for the inverse. In JavaScript it is trivial to build such functions on
the fly; in C/C++ you need a representation to call the appropriate method
with some parameters.

You may also have two additional node types for capturing groups: SOC
(start of capturing group) and EOC (end of capturing group), both with a
single property, the capturing group index; you'll allocate a new index
into the array of captured groups to allocate in the "state" variable used
by the matcher. This type of node is unconditionally traversed, but
traversing it just consists of storing the current input position in the
array of captured groups (and making sure that, while running the matcher,
you keep the already scanned input text in a buffer as soon as you've
started capturing any group).

All node types also have an array of output node indexes (this array is
unordered: all of them will be taken simultaneously by a step in the
matcher, so you can segregate node types instead of using polymorphic
nodes, and then store in each node an array of indexes for each output
node type); and then, instead of using a single array of nodes for storing
the graph, use a separate array for quantifier nodes and a separate array
for character class nodes. As SOT will never be part of the output nodes
to link to, you just need a boolean flag to say whether the EOT node is
among the output nodes of a given node in your graph.

From unicode at unicode.org Thu Feb 1 13:20:04 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 1 Feb 2018 19:20:04 +0000
Subject: Internationalised Computer Science Exercises
In-Reply-To: <...>
References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2> <20180129085741.6fcf00f8@JRWUBU2> <20180129205305.5d5d202d@JRWUBU2> <20180201013858.383c7313@JRWUBU2>
Message-ID: <20180201192004.4ab8c9c5@JRWUBU2>

On Thu, 1 Feb 2018 08:03:31 +0100
Philippe Verdy via Unicode wrote:

> 2018-02-01 2:38 GMT+01:00 Richard Wordingham via Unicode <unicode at unicode.org>:
>> On Wed, 31 Jan 2018 19:45:56 +0100
>> Philippe Verdy via Unicode wrote:
>>> 2018-01-29 21:53 GMT+01:00 Richard Wordingham via Unicode <unicode at unicode.org>:
>>>> For example, <..., COMBINING DOT BELOW> contains, under canonical
>>>> equivalence, the substring <...>. Your regular expressions would
>>>> not detect this relationship.
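Relationships of this kind can be tested directly, outside any regex
engine: two strings are canonically equivalent exactly when their NFD
forms are identical. A minimal Python check with the standard unicodedata
module (the sample sequences here are illustrative, not the exact strings
under discussion):

    import unicodedata

    def canonically_equivalent(a, b):
        # Canonical equivalence <=> identical NFD forms.
        return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

    # COMBINING DOT BELOW (ccc=220) and COMBINING DIAERESIS (ccc=230)
    # commute, so these two orders are equivalent:
    print(canonically_equivalent("u\u0323\u0308", "u\u0308\u0323"))  # True

    # A COMBINING MACRON (ccc=230) between them blocks the commutation:
    print(canonically_equivalent("u\u0308\u0304\u0323",
                                 "u\u0304\u0308\u0323"))  # False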
>>> My regular expression WILL detect this: scanning the text implies
>>> first composing it to "full equivalent decomposition form" (without
>>> even reordering it, and possibly recomposing it to NFD) while
>>> reading it and buffering it in the forward direction (it just
>>> requires the decomposition pairs from the UCD, including those that
>>> are "excluded" from NFC/NFD).

>> To be consistent, to find <...> you would construct (i.e. Philippe
>> Verdy would construct)
>>
>> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
>> (
>> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <...>
>> |
>> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <...>
>> )

>> However, <U+01D6 LATIN SMALL LETTER U WITH DIAERESIS AND MACRON,
>> COMBINING DOT BELOW> decomposes to <U+0075, COMBINING DIAERESIS,
>> COMBINING MACRON, COMBINING DOT BELOW>. It doesn't match your regular
>> expression, for between COMBINING DIAERESIS and COMBINING DOT BELOW
>> there is COMBINING MACRON, for which ccc = above!

> And my regexp contained all the necessary asterisks, so yes, it does
> not match, because the combining macron blocks the combining dot below
> and combining diaeresis from commuting, and so there's no canonical
> equivalence.

I'm not sure what you mean by 'commuting' in this case, but either your
statement or your deduction is wrong! Although adjacent characters with
the same non-zero canonical combining class cannot be interchanged, that
does not stop the members of the pair commuting with their neighbour with
a different non-zero ccc whilst preserving canonical equivalence. Thus the
searched string is canonically equivalent to <...>. In your scheme,
assuming you are looking for the most compact match, one should generate
the regular expression:

[[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
(
[[ [^[[:cc=0:]]] - [[:cc=below:]] ]] * <...>
|
[[ [^[[:cc=0:]]] - [[:cc=above:]] ]] * <...>
)

Have you got a program doing this and reporting to you, or did you
assemble the construction by hand? Constructing regular expressions is
known to be tricky. You cannot replace this by a more restrictive albeit
wordier regex as you suggested on Sunday 28 January
(http://www.unicode.org/mail-arch/unicode-ml/y2018-m01/0145.html). There
is no upper bound on the length of matching expressions.

>>> The automaton is of course the classic NFA used by regexp engines
>>> (and not the DFA, which explodes combinatorially for some regexps),
>>> but it is still fully deterministic (the current "state" in the
>>> automaton is not a single integer for the node number in the
>>> traversal graph, but a set of node numbers; and all regexps have a
>>> finite number of nodes in the traversal graph, this number being
>>> proportional to the length of the regexp, so it does not need a lot
>>> of memory, and the size of the current "state" is also fully
>>> bounded, never larger than the length of the regexp). Optimizing
>>> some contextual parts of the NFA to a DFA is possible (to speed up
>>> the matching process and reduce the maximum storage size of the
>>> "current state"), but only if it does not cause a growth of the
>>> total number of nodes in the traversal graph, or as long as this
>>> growth does not exceed some threshold (e.g. not more than 2 or 3
>>> times the regexp size).

>> In your claim, what is the length of the regexp for searching for ?
>> in a trace? Is it 3, or is it about 14? If the former, I am very
>> interested in how you do it. If the latter, I would say you already
>> have a form of blow up in the way you cater for canonical
>> equivalence.
> For searching ?, the transformed regexp is just
>
> [[ [^[[:cc=0:]]] - [[:cc=above:]] ]] * <...>
>
> The NFA traversal graph contains nodes at the locations pointed to
> below by apostrophes:
>
> '
> ' [[ [^[[:cc=0:]]] - [[:cc=above:]] ]]
> ' *
> ' <...>
> '
>
> It has 5 nodes only (assuming that the regexp engine will compute
> lookup tables to build the character classes).

I asked about ???, not ???. Anyway, you've effectively answered the
question. You're talking about regular expressions for strings, not
regular expressions for traces.

> When there is a quantifier (like "*" here) which is not "{1,1}", a
> node is inserted after each character or character class it applies
> to. No node is inserted in the NFA traversal graph for non-capturing
> parentheses, but nodes may be inserted for capturing parentheses; and
> there are the two nodes representing the start and the end of the
> regexp (a node is also inserted for the "^" or "$" context delimiters,
> to match the start and end of input lines for regexps using the
> "multiline" flag, or the start and end of the input text otherwise;
> they behave like character classes and are excluded from the capture).

Quantifier nodes like {2,4} probably break down for non-deterministic
expressions like (a(bc)?|b){2,4}. The string "ab" contains 2 iterations,
but the longer string "abc" contains one iteration.

>> Even with the dirty trick of normalising the searched trace for input
>> (I wanted the NFA propagation to be defined by the trace - I also
>> didn't want to have to worry about the well-formedness of DFAs or
>> NFAs), I found that the number of states for a concatenation of
>> regular languages of traces was bounded above by the product of the
>> number of states

> For this worst case, you don't generate a product of the two NFA
> traversal graphs; you can directly concatenate the graphs, the growth
> in size remaining proportional to the total length of the initial
> regexp, with a small bounded factor (this factor depends on how you
> represent each node: with character classes possibly including SOT and
> EOT, or without character classes, using separate nodes for each
> character and for SOT and EOT). It does not seem unreasonable to build
> these character classes and compute their unions/intersections where
> necessary across branches when factorizing them; you just need to take
> care of the nodes added for non-{1,1} quantifiers.

Concatenation doesn't work with traces when the first expression can end
in non-starters and the second expression can begin with them. This is a
real issue when parsing sequences of characters for grammatical
correctness, which is why I got interested in traces.

A regular trace expression of the form

[:ccc=1:][:ccc=2:]...[:ccc=n:]

seems to require 2^n states in your scheme. As I effectively only apply
the regex to NFD input strings, I use fewer states. However, the
efficiency of my scheme depends on the order of the commuting factors -
reverse order would require the 2^n states.

Richard.
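The commuting behind that 2^n figure is easy to observe concretely: marks
with pairwise distinct non-zero ccc values all commute, so every ordering
of them is canonically equivalent to the same NFD form, and a string-based
regex honouring canonical equivalence must accept all n! orders. A small
Python illustration (the three marks are just convenient distinct-ccc
examples, not drawn from the thread):

    import unicodedata
    from itertools import permutations

    # Three combining marks with pairwise distinct non-zero ccc values:
    marks = ["\u0316", "\u031B", "\u0301"]   # ccc 220, 216, 230
    for m in marks:
        print(f"U+{ord(m):04X} ccc={unicodedata.combining(m)}")

    # Every permutation normalizes to the same NFD string:
    forms = {unicodedata.normalize("NFD", "a" + "".join(p))
             for p in permutations(marks)}
    print(len(forms))  # 1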
From unicode at unicode.org Thu Feb 1 17:45:07 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 1 Feb 2018 23:45:07 +0000
Subject: Internationalised Computer Science Exercises - Correction
In-Reply-To: <20180201013858.383c7313@JRWUBU2>
References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2> <20180129085741.6fcf00f8@JRWUBU2> <20180129205305.5d5d202d@JRWUBU2> <20180201013858.383c7313@JRWUBU2>
Message-ID: <20180201234507.27831346@JRWUBU2>

On Thu, 1 Feb 2018 01:38:58 +0000
Richard Wordingham via Unicode wrote:

> I believe the concurrent star of a language A is (|A|)*, where
>
> |A| = {x ∈ A : {x}* is a regular language}
>
> (The definition works for the trace of fully decomposed Unicode
> character strings under canonical equivalence.)

I misremembered. The notation is (/A/)*, where starter-free x ∈ A is
dealt with by converting x into its maximal substrings all of whose
characters are of the same canonical combining class and putting them in
/A/ in place of x.

> Concurrent star is not a perfect generalisation. If ab = ba, then
> X = {aa, ab, b} has the annoying property that X* is a regular trace
> language, but |X|* is a proper subset of X*. For Unicode, X would be
> a rather unusual regular language.

So this is the other way round. We will get /X/ = {aa, a, b}, so X* is a
proper subset of /X/*.

Richard.

From unicode at unicode.org Mon Feb 5 15:37:30 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 5 Feb 2018 21:37:30 +0000
Subject: Internationalised Computer Science Exercises
In-Reply-To: <20180201192004.4ab8c9c5@JRWUBU2>
References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2> <20180129085741.6fcf00f8@JRWUBU2> <20180129205305.5d5d202d@JRWUBU2> <20180201013858.383c7313@JRWUBU2> <20180201192004.4ab8c9c5@JRWUBU2>
Message-ID: <20180205213730.3abe8ce1@JRWUBU2>

On Thu, 1 Feb 2018 19:20:04 +0000
Richard Wordingham via Unicode wrote:

> A regular trace expression of the form
>
> [:ccc=1:][:ccc=2:]...[:ccc=n:]
>
> seems to require 2^n states in your scheme. As I effectively only
> apply the regex to NFD input strings, I use fewer states. However,
> the efficiency of my scheme depends on the order of the commuting
> factors - reverse order would require the 2^n states.

I've overstated the compactness of my scheme. Firstly, I split the state
for an optionally final matched character into two states according to
whether it is to be the final character or not. Secondly, the DFA for a
Unicode character is quite large. I've kept it simple and identify most
states by the matched Unicode character, which means I have nearly a
million states, whereas I could probably whittle it down to more like a
thousand or so, at a vast increase in complexity.

Richard.

From unicode at unicode.org Wed Feb 7 17:47:11 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 7 Feb 2018 23:47:11 +0000
Subject: What is U+0E46 THAI CHARACTER MAIYAMOK?
Message-ID: <20180207234712.63df3470@JRWUBU2>

I am having trouble identifying just what is represented by ๆ U+0E46
THAI CHARACTER MAIYAMOK. My problem is that the grammatical texts that I
have state that when the Thai punctuation mark mai yamok (ไม้ยมก) is used
with words, it is flanked by spaces, a position reiterated by the Thai
Wikipedia entry on the mark at http://th.wikipedia.org/wiki/ไม้ยมก. It is
not clear to me whether the Unicode character includes those spaces or
not.
I have encountered fonts whose glyph for U+0E46 has so much space on the
left that I believe it is intended to give the appearance of a preceding
space. The glyphs in the reference chart appear to be centred, so I cannot
tell whether spaces are incorporated. It does appear that those who
believe U+0E46 is flanked by spaces between words omit the following space
before Western punctuation marks. So, does U+0E46 include either of those
flanking spaces, and, if so, which?

A related question is whether dictionary items like "?????? ๆ", which
lacks a corresponding simplex "??????", constitute a single word in any
sense relevant to Unicode, the CLDR or ICU. I think a spell-checker will
work better if they do.

Richard.

From unicode at unicode.org Wed Feb 7 22:16:21 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Wed, 7 Feb 2018 20:16:21 -0800
Subject: What is U+0E46 THAI CHARACTER MAIYAMOK?
In-Reply-To: <20180207234712.63df3470@JRWUBU2>
References: <20180207234712.63df3470@JRWUBU2>
Message-ID: <...>

In the example, "?????? ๆ", there's a space character in the text, which
seems right. There's no space between MAIYAMOK and the closing quotation
mark, which also seems right. If a font included extra spacing around
MAIYAMOK, the display of something like...

THAI CHARACTER MAIYAMOK (ๆ)

...would be off, I'd think.

> ... when the Thai punctuation mark mai yamok (ไม้ยมก)
> is used with words, it is flanked by spaces, ...

Is there a contrasting use where this mark is not used with words? Maybe
numbers?

From unicode at unicode.org Wed Feb 7 23:23:06 2018
From: unicode at unicode.org (Theppitak Karoonboonyanan via Unicode)
Date: Thu, 8 Feb 2018 12:23:06 +0700
Subject: What is U+0E46 THAI CHARACTER MAIYAMOK?
In-Reply-To: <...>
References: <20180207234712.63df3470@JRWUBU2> <...>
Message-ID: <...>

On Thu, Feb 8, 2018 at 11:16 AM, James Kass via Unicode wrote:

> In the example, "?????? ๆ", there's a space character in the text,
> which seems right. There's no space between MAIYAMOK and the closing
> quotation mark, which also seems right. If a font included extra
> spacing around MAIYAMOK, the display of something like...
> THAI CHARACTER MAIYAMOK (ๆ)
> ...would be off, I'd think.

I think the thin space in the glyph is a hack, not the norm.

The regulation as defined by the Thai Royal Institute is to use a space
or thin space before MAIYAMOK, and, if it's followed by a word, to use
another space after. But current uses are not consistent. For example:

- ????????? (without space at all)
- ???? ๆ ???? (with space before and after, as regulated)
- ????? ???? (without space before, but with one after)

The argument for not using a space before MAIYAMOK is that most line
break algorithms will break the line before it, tearing it apart from its
associated word, which is undesirable. To mitigate this, while fulfilling
the regulation when printed, some font creators hack the leading space
into the glyph and suggest that their users not prepend a space before
MAIYAMOK at all. But the hack also affects people who follow the
regulation, as they get too wide a space between the word and MAIYAMOK.

An apparent way to do it properly is to use NBSP before MAIYAMOK and a
normal space after, and not to include any leading space in the glyph,
but it seems inconvenient to input NBSP in common text editors.

> > ... when the Thai punctuation mark mai yamok (ไม้ยมก)
> > is used with words, it is flanked by spaces, ...
>
> Is there a contrasting use where this mark is not used with words?
> Maybe numbers?

None. MAIYAMOK is only used with words by definition.
Regards,
--
Theppitak Karoonboonyanan
http://linux.thai.net/~thep/

From unicode at unicode.org Thu Feb 8 00:02:28 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 7 Feb 2018 22:02:28 -0800
Subject: What is U+0E46 THAI CHARACTER MAIYAMOK?
In-Reply-To: <...>
References: <20180207234712.63df3470@JRWUBU2> <...>
Message-ID: <68e38377-bd82-a6f8-73ed-dacc9a831f52@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <...>

From unicode at unicode.org Thu Feb 8 00:20:15 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Wed, 7 Feb 2018 22:20:15 -0800
Subject: What is U+0E46 THAI CHARACTER MAIYAMOK?
In-Reply-To: <68e38377-bd82-a6f8-73ed-dacc9a831f52@ix.netcom.com>
References: <20180207234712.63df3470@JRWUBU2> <68e38377-bd82-a6f8-73ed-dacc9a831f52@ix.netcom.com>
Message-ID: <...>

Asmus Freytag wrote,

> Any text editor that has the ability to handle
> slightly more complex input scenarios could be
> programmed to convert SP to NBSP before MAIYAMOK.

Yes. If I were developing a Thai text editor it would also globally
replace any instances of SPACE + MAIYAMOK with NBSP + MAIYAMOK upon
File-Save automatically. (To handle updating older files and copy-pasting
text from external sources.)

From unicode at unicode.org Thu Feb 8 01:57:37 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 8 Feb 2018 07:57:37 +0000
Subject: What is U+0E46 THAI CHARACTER MAIYAMOK?
In-Reply-To: <68e38377-bd82-a6f8-73ed-dacc9a831f52@ix.netcom.com>
References: <20180207234712.63df3470@JRWUBU2> <68e38377-bd82-a6f8-73ed-dacc9a831f52@ix.netcom.com>
Message-ID: <20180208075737.415863bc@JRWUBU2>

On Wed, 7 Feb 2018 22:02:28 -0800
Asmus Freytag via Unicode wrote:

> On 2/7/2018 9:23 PM, Theppitak Karoonboonyanan via Unicode wrote:
> > An apparent way to do it properly is to use NBSP before
> > MAIYAMOK and a normal space after, and not to include
> > any leading space in the glyph, but it seems inconvenient
> > to input NBSP in common text editors.
>
> Any text editor that has the ability to handle slightly more complex
> input scenarios could be programmed to convert SP to NBSP before
> MAIYAMOK.
>
> A./

For any compliant tailorable implementation of the Unicode line-breaking
algorithm, the correct method is for U+0E46 to be tailored to have
line_break=exclamation. (U+0021 EXCLAMATION MARK is often offset by a
space in Thai.) I know it works in ICU Version 53; I haven't tested later
versions.

Although NBSP is available in Windows-874 and IANA-registered tis620 (as
0xA0), it is not available in TIS-620, the national 8-bit standard.

What of a word break between a letter and MAIYAMOK in text tagged as
Thai? Should it be never, always or sometimes? Should it depend on
whether there is an intermediate space?

Richard.

From unicode at unicode.org Thu Feb 8 01:59:14 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 8 Feb 2018 07:59:14 +0000
Subject: What is U+0E46 THAI CHARACTER MAIYAMOK?
In-Reply-To: <...>
References: <20180207234712.63df3470@JRWUBU2> <...>
Message-ID: <20180208075914.0e8e5609@JRWUBU2>

On Wed, 7 Feb 2018 20:16:21 -0800
James Kass via Unicode wrote:

> Is there a contrasting use where this mark is not used with words?
> Maybe numbers?

The only other use I've seen is quotation of the mark - putting it in
parentheses seems quite common.

Richard.
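The SP-to-NBSP conversion that Asmus and James describe is a one-line
transform over the buffer. A minimal sketch in Python (the save-hook
framing and function name are invented for illustration):

    MAIYAMOK = "\u0E46"  # Thai character mai yamok
    NBSP = "\u00A0"

    def normalize_maiyamok_spacing(text: str) -> str:
        # Keep the space the regulation calls for, but make it
        # non-breaking so MAIYAMOK is never torn from its word.
        return text.replace(" " + MAIYAMOK, NBSP + MAIYAMOK)

    # e.g. run over the whole buffer on File-Save:
    assert normalize_maiyamok_spacing("\u0E19\u0E32\u0E19 \u0E46") == \
           "\u0E19\u0E32\u0E19\u00A0\u0E46"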
From unicode at unicode.org Fri Feb 9 14:31:11 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Fri, 9 Feb 2018 21:31:11 +0100 (CET)
Subject: Cross-Locale Keyboard Features for the General Public
Message-ID: <5529894.24216.1518208271272.JavaMail.www@wwinf1c20>

Approx. 400 or more subscribers of Unicode Public happen not to be
subscribed to CLDR-Users. Now there is a thread that might be of some
interest also to non-CLDR users. It's about some main functionalities of
keyboards intended for many locales, not about the specific details of a
particular locale tailoring. Please take a look at:

http://unicode.org/pipermail/cldr-users/2018-February/000731.html

to learn more if interested.

Regards,

Marcel

From unicode at unicode.org Sun Feb 11 17:26:49 2018
From: unicode at unicode.org (Pierpaolo Bernardi via Unicode)
Date: Mon, 12 Feb 2018 00:26:49 +0100
Subject: Fwd: Unicode Emoji 11.0 characters now final for 2018
In-Reply-To: <...>
References: <5A7B4E41.7000503@unicode.org>
Message-ID: <...>

Enthusiastic reactions to the new emoji announcement:

https://xkcd.com/1953/

PB

On Wed, Feb 7, 2018 at 8:06 PM, <...> wrote:
> Emoji 11.0 data has been released, with 157 new emoji such as:

From unicode at unicode.org Tue Feb 13 02:34:35 2018
From: unicode at unicode.org (Shriramana Sharma via Unicode)
Date: Tue, 13 Feb 2018 14:04:35 +0530
Subject: Emoji blooper
Message-ID: <...>

Recently sent this message to a friends list:

????????????

Apparently one font has the trumpet facing left and one has it facing
right! So before hitting Send in GMail's web interface, the text appeared
fine, but after doing so, in my browser it is showing as if the music is
emanating from the back of the trumpet! LOL.

--
Shriramana Sharma

From unicode at unicode.org Tue Feb 13 02:39:56 2018
From: unicode at unicode.org (Shriramana Sharma via Unicode)
Date: Tue, 13 Feb 2018 14:09:56 +0530
Subject: Emoji blooper
In-Reply-To: <...>
References: <...>
Message-ID: <...>

To illustrate...

--
Shriramana Sharma

-------------- next part --------------
A non-text attachment was scrubbed...
Name: trumpet-left+wrong.png
Type: image/png
Size: 2065 bytes
Desc: not available
URL: <...>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trumpet-right-on-both-counts.png
Type: image/png
Size: 1757 bytes
Desc: not available
URL: <...>

From unicode at unicode.org Tue Feb 13 13:27:48 2018
From: unicode at unicode.org (Markus Scherer via Unicode)
Date: Tue, 13 Feb 2018 11:27:48 -0800
Subject: Emoji blooper
In-Reply-To: <...>
References: <...>
Message-ID: <...>

On my machine (Chromebox+Gmail), the trumpets point down to the lower
left. If you want to convey precise images, then send images...

markus

From unicode at unicode.org Wed Feb 14 02:53:28 2018
From: unicode at unicode.org (Erik Pedersen via Unicode)
Date: Wed, 14 Feb 2018 00:53:28 -0800
Subject: Why so much emoji nonsense?
Message-ID: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca>

Dear Unicode Digest list members,

Emoji, in my opinion, are almost entirely outside the scope of the
Unicode project. Unlike text composed of the world's traditional
alphabetic, syllabic, abugida or CJK characters, emoji convey no
utilitarian and unambiguous information content. Let us, therefore,
abandon Emoji support in Unicode as a project that failed.
If corporations want to maintain support for Emoji, let's require them to
use only the Private Use Area and, henceforth, confine Unicode expansion
to attested characters from so far unsupported scripts.

Kind regards,

Erik Bjørn Pedersen
Victoria, B.C., Canada

From unicode at unicode.org Wed Feb 14 03:18:51 2018
From: unicode at unicode.org (David Starner via Unicode)
Date: Wed, 14 Feb 2018 09:18:51 +0000
Subject: Why so much emoji nonsense?
In-Reply-To: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca>
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca>
Message-ID: <...>

On Wed, Feb 14, 2018 at 12:55 AM Erik Pedersen via Unicode <unicode at unicode.org> wrote:

> Emoji, in my opinion, are almost entirely outside the scope of the
> Unicode project. Unlike text composed of the world's traditional
> alphabetic, syllabic, abugida or CJK characters, emoji convey no
> utilitarian and unambiguous information content. Let us, therefore,
> abandon Emoji support in Unicode as a project that failed. If
> corporations want to maintain support for Emoji, let's require them to
> use only the Private Use Area and, henceforth, confine Unicode
> expansion to attested characters from so far unsupported scripts.

Because ' has so much unambiguous information content. Or even just c.
(What's the phonetic value of that letter? Okay, I'll be "easy" on you:
what's the phonetic value of that letter in English? What about e?)

Also, who are the full members of Unicode?
http://www.unicode.org/consortium/members.html says Google, Apple,
Huawei, Facebook, Microsoft, etc. By show of hands, who wants a
substantial part of the user's data to become incompatible? I think they
just voted this down.

Even ignoring that, this road has been crossed. Unicode will not tear out
anything, but if they could, people could probably survive Cuneiform or
Linear A going by the wayside. A not insubstantial part of the Unicode
data in the world includes emoji, and removing it would break everything.
Like many standards before it that made radical changes, a new Unicode
standard without emoji would be dead in the water, and someone else would
create a competing back-compatible character standard and everyone would
forget about Unicode™ and start using The One CCS™.

It's like demanding that C use bounds checking on its arrays, or that
"island" go back to being spelled "iland" now that we recognize it's not
related to "isle". Even if mistakes were made, they were carved into
stone, and going back is not an option.

From unicode at unicode.org Wed Feb 14 05:23:06 2018
From: unicode at unicode.org (Konstantin Ritt via Unicode)
Date: Wed, 14 Feb 2018 14:23:06 +0300
Subject: Why so much emoji nonsense?
In-Reply-To: <...>
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <...>
Message-ID: <...>

2018-02-14 12:18 GMT+03:00 David Starner via Unicode:

> Even if mistakes were made, they were carved into stone, and going
> back is not an option.

Sure. However, that doesn't mean Unicode should keep adding more and more
emoji nonsense. A billion cat faces, pile of poo, * skin tone
Santa/vampire/superwoman/levitating man, keycaps and clocks - are they
really that important for the Standard to be encoded separately?! Well,
that was a rhetorical question...

Regards,
Konstantin
URL: From unicode at unicode.org Wed Feb 14 07:25:50 2018 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Wed, 14 Feb 2018 18:55:50 +0530 Subject: Why so much emoji nonsense? In-Reply-To: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: >From a mail which I had sent to two other Unicode contributors just a few days ago: Frankly I agree that this whole emoji thing is a Pandora box. It should have been restricted to emoticons to express facial or physical gestures which are insufficiently representable by words. When it starts representing objects like ???? then it becomes a problem as to where to draw the line. I mean I can see the argument for ?? representing gratitude, but which fruits are valid and which not... And which food items are valid and which not, else you would get proposals for idli and dosa emojis as well! (Those who don't know what those are see https://en.wikipedia.org/wiki/Idli and https://en.wikipedia.org/wiki/Dosa) It seems to me that graphical items previously rejected as such are now being encoded. I mean, if other things like bat ball etc then "why not this one" cannot be refused, but the question is whether encoding bat ball in the first place was keeping with the original intention or spirit of Unicode. Anyhow, what is done is done and the Pandora's box is now open and I don't envy the ESC their job. I don't know, maybe sometimes they may just feel like hitting "ESC" too! -- Shriramana Sharma ???????????? ???????????? ???????????????????????? From unicode at unicode.org Wed Feb 14 10:14:06 2018 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Wed, 14 Feb 2018 21:44:06 +0530 Subject: UNICODE vehicle vanity registration? Message-ID: Given that in the US vanity vehicle registrations with arbitrary alphanumeric sequences upto 7 characters are permitted (I am correct I hope?), I wonder who (here?) owns the UNICODE registration? -- Shriramana Sharma ???????????? ???????????? ???????????????????????? From unicode at unicode.org Wed Feb 14 10:24:37 2018 From: unicode at unicode.org (Stephane Bortzmeyer via Unicode) Date: Wed, 14 Feb 2018 17:24:37 +0100 Subject: UNICODE vehicle vanity registration? In-Reply-To: References: Message-ID: <20180214162437.zaxd3zqisv5dz3cu@nic.fr> On Wed, Feb 14, 2018 at 09:44:06PM +0530, Shriramana Sharma via Unicode wrote a message of 6 lines which said: > Given that in the US vanity vehicle registrations with arbitrary > alphanumeric sequences upto 7 characters are permitted (I am correct > I hope?), I wonder who (here?) owns the UNICODE registration? Won't work in New York, unfortunately https://dmv.ny.gov/learn-about-personalized-plates "A character is a letter (A-Z), number (0-9) or space. Each space counts as one character." From unicode at unicode.org Wed Feb 14 10:29:53 2018 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Wed, 14 Feb 2018 21:59:53 +0530 Subject: UNICODE vehicle vanity registration? In-Reply-To: References: <20180214162437.zaxd3zqisv5dz3cu@nic.fr> Message-ID: Sorry but "UNICODE" does fit within those rules doesn't it? On 14-Feb-2018 21:54, "Stephane Bortzmeyer" wrote: On Wed, Feb 14, 2018 at 09:44:06PM +0530, Shriramana Sharma via Unicode wrote a message of 6 lines which said: > Given that in the US vanity vehicle registrations with arbitrary > alphanumeric sequences upto 7 characters are permitted (I am correct > I hope?), I wonder who (here?) owns the UNICODE registration? 
Won't work in New York, unfortunately https://dmv.ny.gov/learn-about-personalized-plates "A character is a letter (A-Z), number (0-9) or space. Each space counts as one character." -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Feb 14 10:32:42 2018 From: unicode at unicode.org (Andrew West via Unicode) Date: Wed, 14 Feb 2018 16:32:42 +0000 Subject: UNICODE vehicle vanity registration? In-Reply-To: <20180214162437.zaxd3zqisv5dz3cu@nic.fr> References: <20180214162437.zaxd3zqisv5dz3cu@nic.fr> Message-ID: You can use ????? in California. Someone has U+1F913 ?? ( https://www.instagram.com/p/BVYtIHensDu/) Andrew On 14 February 2018 at 16:24, Stephane Bortzmeyer via Unicode < unicode at unicode.org> wrote: > On Wed, Feb 14, 2018 at 09:44:06PM +0530, > Shriramana Sharma via Unicode wrote > a message of 6 lines which said: > > > Given that in the US vanity vehicle registrations with arbitrary > > alphanumeric sequences upto 7 characters are permitted (I am correct > > I hope?), I wonder who (here?) owns the UNICODE registration? > > Won't work in New York, unfortunately > > https://dmv.ny.gov/learn-about-personalized-plates > > "A character is a letter (A-Z), number (0-9) or space. Each space > counts as one character." > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Feb 14 10:34:15 2018 From: unicode at unicode.org (Stephane Bortzmeyer via Unicode) Date: Wed, 14 Feb 2018 17:34:15 +0100 Subject: UNICODE vehicle vanity registration? In-Reply-To: References: <20180214162437.zaxd3zqisv5dz3cu@nic.fr> Message-ID: <20180214163415.xbsktiqut4xskzfa@nic.fr> On Wed, Feb 14, 2018 at 09:59:53PM +0530, Shriramana Sharma wrote a message of 54 lines which said: > Sorry but "UNICODE" does fit within those rules doesn't it? I doubt that the Departement of Motor Vehicles will accept "but it is in category Ll" as a good reason :-) From unicode at unicode.org Wed Feb 14 11:15:44 2018 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Wed, 14 Feb 2018 17:15:44 +0000 Subject: UNICODE vehicle vanity registration? In-Reply-To: References: <20180214162437.zaxd3zqisv5dz3cu@nic.fr> Message-ID: <4087B8B9-7DF2-457D-ACD5-019F8E411347@alastairs-place.net> On 14 Feb 2018, at 16:29, Shriramana Sharma via Unicode wrote: > > Sorry but "UNICODE" does fit within those rules doesn't it? Yes. Stephane has misunderstood. (Shriramana meant the literal text ?UNICODE?, which is indeed composed of letters A-Z and meets the definition quoted.) I?d hope that Mark Davis has ?UNICODE? on his car. However, I?m not sure how relevant it really is to this mailing list. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Wed Feb 14 11:45:27 2018 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Wed, 14 Feb 2018 17:45:27 +0000 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> On 14 Feb 2018, at 13:25, Shriramana Sharma via Unicode wrote: > > From a mail which I had sent to two other Unicode contributors just a > few days ago: > > Frankly I agree that this whole emoji thing is a Pandora box. It > should have been restricted to emoticons to express facial or physical > gestures which are insufficiently representable by words. When it > starts representing objects like ???? then it becomes a problem as to > where to draw the line. 
A lot of the emoji were encoded because they were in use on Japanese
mobile phones. A fair proportion of those may very well not meet the
selection factors (see <...>) required for new emoji, but they were
definitely within the scope of the Unicode project, as encoding them
provides interoperability.

As for newer emoji, whether they are encoded or not is up to the UTC,
and, as I say, they apply (or are supposed to apply) the criteria on the
"Submitting Emoji Proposals" page.

There is certainly an argument that the encoding of new emoji should be
discouraged in favour of functionality at higher layers (e.g. tags in
HTML), but, honestly, I think that ship has probably sailed. Similarly
there are, I think, good reasons to object to the skin tone and gender
modifiers, but we've already opened that can of worms and so will now
have to put up with demands for red hair (or, quite probably, freckles,
monobrows, different hats, hair, beard and moustache styles and so on).

Kind regards,

Alastair.

--
http://alastairs-place.net

From unicode at unicode.org Wed Feb 14 12:37:01 2018
From: unicode at unicode.org (Shriramana Sharma via Unicode)
Date: Thu, 15 Feb 2018 00:07:01 +0530
Subject: UNICODE vehicle vanity registration?
In-Reply-To: <4087B8B9-7DF2-457D-ACD5-019F8E411347@alastairs-place.net>
References: <20180214162437.zaxd3zqisv5dz3cu@nic.fr> <4087B8B9-7DF2-457D-ACD5-019F8E411347@alastairs-place.net>
Message-ID: <...>

On 14-Feb-2018 22:45, "Alastair Houghton" wrote:

> I'd hope that Mark Davis has "UNICODE" on his car. However, I'm not
> sure how relevant it really is to this mailing list.

You're right. My apologies. It *is* somewhat OT to the actual purpose of
this list. But I figured if anyone knew the answer to my question they'd
be here.

From unicode at unicode.org Wed Feb 14 13:14:22 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Wed, 14 Feb 2018 11:14:22 -0800
Subject: Why so much emoji nonsense?
In-Reply-To: <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net>
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net>
Message-ID: <...>

Alastair Houghton wrote,

> ...but they were definitely within the scope of the
> Unicode project, as encoding them provides interoperability.

That's one way of looking at it. Another way would be that the emoji were
definitely outside the scope of the Unicode project, as encoding them
violated Unicode's initial encoding principles. The opposition was
strong, but resistance was futile.

Anyone interested in the arguments made at the time should check the
Unicode public list archives in late 2008 and early 2009. Here's the link
for January 2009:

http://www.unicode.org/mail-arch/unicode-ml/y2009-m01/index.html

Surprisingly, though, I have found at least one roundabout use for the
emoji. When reading message boards and comment pages I've found that it's
quite simple to skip any messages which are peppered with emoji without
missing anything of substance.

As far as interoperability goes, there are scads of emoji in the wild
which aren't currently in Unicode. Every kind of hobby or interest seems
to generate emoji specific to that area of interest.

From unicode at unicode.org Wed Feb 14 13:50:01 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Wed, 14 Feb 2018 11:50:01 -0800
Subject: Why so much emoji nonsense?
In-Reply-To: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca>
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca>
Message-ID: <...>

On 2/14/2018 12:53 AM, Erik Pedersen via Unicode wrote:

> Unlike text composed of the world's traditional alphabetic, syllabic,
> abugida or CJK characters, emoji convey no utilitarian and unambiguous
> information content.

I think this represents a misunderstanding of the function of emoji in
written communication, as well as a rather narrow concept of how writing
systems work and why they have evolved.

RECALLTHATWHENALPHABETSWEREFIRSTINVENTEDPEOPLEWROTETEXTLIKETHIS

The invention and development of word spacing, punctuation, and casing,
among other elements of typography, represent the addition of meta-level
information to written communication that assists in legibility, helps
identify lexical and syntactic units, conveys prosody, and carries other
information that is not well conveyed by simply setting down letters of
an alphabet one right after the other.

Emoticons were invented, in large part, to fill another major hole in
written communication -- the need to convey emotional state and affective
attitudes towards the text. This is the kind of information that
face-to-face communication has a huge and evolutionarily deep bandwidth
for, but at which written communication typically fails miserably. Just
adding a little happy face :-) or sad face :-( to a short email manages
to convey some affect much more easily and effectively than adding on
entire paragraphs trying to explain how one feels about what was just
said. Novelists have the skill to do that in text without using little
pictographic icons, but most of us are not professional writers! Note
that emoticons were invented almost as soon as people started
communicating in digital mediums like email -- so they long predate
anything Unicode came up with.

Other kinds of emoji that we've been adding recently may have a somewhat
more uncertain trajectory, but the ones that seem to be most successful
are precisely those which manage to connect emotionally with people, and
which assist them in conveying how they *feel* about what they are
writing.

So I would suggest that people not just dismiss (or diss) this ongoing
phenomenon. Emoji are widely used for many good reasons. And of course,
like any other aspect of writing, they get misused in various ways as
well. But you can be sure that their impact on the evolution of world
writing is here to stay and will be the topic of serious scholarly papers
by scholars of writing for decades to come. ;-)

--Ken

From unicode at unicode.org Wed Feb 14 14:49:57 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 14 Feb 2018 21:49:57 +0100
Subject: Why so much emoji nonsense?
In-Reply-To: <...>
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <...>
Message-ID: <...>

2018-02-14 20:50 GMT+01:00 Ken Whistler via Unicode <unicode at unicode.org>:

> On 2/14/2018 12:53 AM, Erik Pedersen via Unicode wrote:
>
> > Unlike text composed of the world's traditional alphabetic,
> > syllabic, abugida or CJK characters, emoji convey no utilitarian and
> > unambiguous information content.
>
> I think this represents a misunderstanding of the function of emoji in
> written communication, as well as a rather narrow concept of how
> writing systems work and why they have evolved.
>
> RECALLTHATWHENALPHABETSWEREFIRSTINVENTEDPEOPLEWROTETEXTLIKETHIS
The concept of vowels as distinctive letters came later, even the letter A was initially a representation of a glottal stop consonnant, sometimes mute, only written to indicate a word that did not start by a consonnant in their first syllable, letter. This has survived today in abjads and abugidas where vowels became optional diacritics, but that evolved as plain diacritics in Indic abugidas. The situation is even more complex because clusters of consonnants were also represented in early vowel-less alphabets to represent full syllables (this has formed the base of todays syllabaries when only some glyph variants of the base consonnant was introduced to distinguish their vocalization; Indic abugidas with their complex clusters where vowel diacritic create contextual variant forms of the base consonnant is also a remnant of this old age): the separation of phonetic consonnants came only later. Today's alphabets have a long history of evolution and adaptation to new needs for more precise communication and easier distinctions in languages that have also evolved; some new letters or diacritics were progressively abandonned, and but as the historic alphabets have persisted, then came the concept of digrams to represent a single sound by multiple letters, instead of inventing a new letter or diacritic, because the language in which these digrams were used almost never needed the phonetic letter pairs or their phonology (or such letter pair was too rarely needed that such use of digrams did not make the text undecipherable given the context of use). Over time the alphabets became less and less representative of the phonology (which evolved more rapidly than orthographies for texts that languages wanted to preserve, or because various local phonetic variants of the languages could stil lremain unified by keeping mute letters or letters representing sounds realized differently across regions). The invention of bicameral scripts later allowed easier distinction or reading when contextual forms could be used to emphasize the structure without necessarily using punctuation signs (the lowercase letters came from handwriting, because the initial engraved letters were to difficult to trace with a plum or pencil: letters were joined). Punctuation signs came later which could have deprecated the use of bicameral orthography, but languages have constinued to borrow terms from other languages, and the bicameral distinction became important to preserve. The invention of printing also produced artefacts in the orthography by the adoption of many abbreviation signs (because the paper or parchemins were expensive), and forced some simplifications of the handwritten style with a plum or pencil. Our recent age of computers (or even before the mechanical typewritters) have also dramatically simplified the alphabets because the character set was severely reduced by limitations of the initial technologies (this could have potentially killed all the abjads, abugidas, syllabaries or ideo-phonographic scripts during the 20th century, if there was not a popular resistance to preserve the culture of the initial texts written by humans, and notably the precious religious books): it is still difficult today to preserve many of the non-alphabetic scripts, and there's also difficulties to preserve the meaning diacritics in abjads and abugidas and even in alphabets, as well as bicameral distinctions. 
Finally, the preservation of letters inherited from etymology, to allow readers to infer semantics from words, is difficult: this is the well-known problem of orthographic reforms, which tend to remove mute letters, remove some phonetic distinctions in letters, and infer more and more of the semantics from the context: we are in fact slowly returning to the old age of: RCLLTHTWHNLPHBTSWRFRSTNVNTDPPLWRTTXTLKTHS ! And the use (or abuse) of emojis is returning us to the prehistory when people drew animals on the walls of caverns: this was a very slow form of communication, conveying no rich semantics, full of ambiguities about what was really meant, and in fact a severe loss of knowledge, where people will not communicate easily and rapidly. Emojis are a threat to the inherited culture, knowledge and science in general: we won't understand what was meant, and will lose our language to a point where it will be very unproductive and will generate more conflicts between people... Since the beginning of the 20th century (and notably since WW2) we have developed a lot of communication means, but we have also seen recently a severe degradation of literacy and a growing social fracture in access to knowledge: the huge recent development of audio/video instead of text is a severe threat to the preservation of culture, as audio/video contents are much more difficult to preserve than text. We can expect a degradation of the general knowledge of the population, and a growing gap with those that have access to the inherited culture, if we don't preserve (with Unicode) our text heritage, which has proven to be very productive, allowed the development of science, and allowed us to coordinate varied societies and to communicate with people of varied cultures and across generations... -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Feb 14 15:37:05 2018 From: unicode at unicode.org (David Starner via Unicode) Date: Wed, 14 Feb 2018 21:37:05 +0000 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> Message-ID: On Wed, Feb 14, 2018 at 11:16 AM James Kass via Unicode wrote: > That's one way of looking at it. Another way would be that the emoji > were definitely outside the scope of the Unicode project as encoding > them violated Unicode's initial encoding principles. > They were characters being interchanged as text in current use. They are more inside the scope than many of the line-drawing characters for 8-bit computers that have been there since day one, and analogous to many of the dingbats that have also been there since day one. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Feb 14 16:33:48 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 14 Feb 2018 14:33:48 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> Message-ID: David Starner wrote, > They were characters being interchanged as text > in current use. They were in-line graphics being interchanged as though they were text. And they still are. And we still disagree. From unicode at unicode.org Wed Feb 14 16:34:22 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 14 Feb 2018 14:34:22 -0800 Subject: UNICODE vehicle vanity registration?
In-Reply-To: References: Message-ID: <19f51a1f-2dd5-969f-7d00-33ae0887fd6e@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Feb 14 17:26:24 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Wed, 14 Feb 2018 15:26:24 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: <2efcd3a8-f41d-82c0-754c-5e17db993f4d@att.net> On 2/14/2018 12:49 PM, Philippe Verdy via Unicode wrote: > > > RCLLTHTWHNLPHBTSWRFRSTNVNTDPPLWRTTXTLKTHS ! > > [ ... lots to say about the history of writing ... ] > And the use (or abuse) of emojis is returning us to the prehistory > when people draw animals on walls of caverns: this was a very slow > communication, not giving a rich semantic, full of ambiguities about > what is really meant, and in fact a severe loss of knowledge where > people will not communicate easily and rapidly. =-O Perhaps Philippe was missing my point about how and why emoji are actually used. --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Feb 14 17:47:04 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 14 Feb 2018 15:47:04 -0800 Subject: UNICODE vehicle vanity registration? In-Reply-To: References: <20180214162437.zaxd3zqisv5dz3cu@nic.fr> <4087B8B9-7DF2-457D-ACD5-019F8E411347@alastairs-place.net> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Feb 14 19:14:26 2018 From: unicode at unicode.org (David Starner via Unicode) Date: Thu, 15 Feb 2018 01:14:26 +0000 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> Message-ID: On Wed, Feb 14, 2018 at 2:35 PM James Kass via Unicode wrote: > David Starner wrote, > > > They were characters being interchanged as text > > in current use. > > They were in-line graphics being interchanged as though they were > text. And they still are. And we still disagree. > They were units of things being interchanged in formats of MIME types starting with text/ . From the beginning, Unicode has supported all the cruft that's being interchanged in formats of MIME types starting with text/. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Feb 14 19:49:05 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 14 Feb 2018 17:49:05 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> Message-ID: On Wed, Feb 14, 2018 at 5:14 PM, David Starner wrote: > They were units of things being interchanged in formats of MIME types > starting with text/ . From the beginning, Unicode has supported all the > cruft that's being interchanged in formats of MIME types starting with > text/. Yes, except that Unicode "supported" all manner of things being interchanged by setting aside a range of code points for private use. Which enabled certain cell phone companies to save some bandwidth by assigning various popular in-line graphics to PUA code points. The "problem" was that these phone companies failed to get together on those PUA code point assignments, so they could not exchange their icons in a standard fashion between competing phone systems. [Image of the world's smallest violin playing.] 
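(As a concrete footnote to the PUA discussion above: the Private Use ranges are fixed by The Unicode Standard itself, so checking whether a code point is "private use" is trivial. A minimal sketch in Python, purely illustrative and not anything proposed in this thread:

    def is_private_use(cp: int) -> bool:
        # The three Private Use Areas defined by The Unicode Standard:
        # U+E000..U+F8FF on the BMP, plus Planes 15 and 16.
        return (0xE000 <= cp <= 0xF8FF
                or 0xF0000 <= cp <= 0xFFFFD     # Plane 15 (PUA-A)
                or 0x100000 <= cp <= 0x10FFFD)  # Plane 16 (PUA-B)

    # Example: the ConScript registry places Klingon at U+F8D0..U+F8FF,
    # inside the BMP PUA, so parties sharing that mapping can interchange
    # it as ordinary Unicode text.
    assert is_private_use(0xF8D0)

The point being that nothing in such a private agreement is visible to anyone outside it, which is exactly the interoperability problem the carriers ran into.)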
I've personally exchanged text data with others using the PUA for both Klingon and Ewellic. [winks] From unicode at unicode.org Wed Feb 14 20:20:49 2018 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Thu, 15 Feb 2018 11:20:49 +0900 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> Message-ID: <46b3edf0-4d4e-1354-fc36-baf3691f0d3a@it.aoyama.ac.jp> On 2018/02/15 10:49, James Kass via Unicode wrote: > Yes, except that Unicode "supported" all manner of things being > interchanged by setting aside a range of code points for private use. > Which enabled certain cell phone companies to save some bandwidth by > assigning various popular in-line graphics to PUA code points. The original Japanese cell phone carrier emoji were defined in the unassigned area of Shift_JIS, not Unicode. Shift_JIS doesn't have an official private area, but use of the empty area by companies had already happened for Kanji (by IBM, NEC, Microsoft). Also, there was some transcoding software initially that mapped some of the emoji to areas in Unicode besides the PUA, based on very simplistic conversion. > The > "problem" was that these phone companies failed to get together on > those PUA code point assignments, so they could not exchange their > icons in a standard fashion between competing phone systems. [Image > of the world's smallest violin playing.] Emoji were originally a competitive device. As an example, NTT Docomo allowed the ticket service PIA to have an emoji for their service, most probably in order to entice them to sign up to participate in the original I-mode (first case of Web on mobile phones) service. Of course, that specific emoji (or was it several) wasn't encoded in Unicode because of trademark issues. Regards, Martin. From unicode at unicode.org Wed Feb 14 20:59:14 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 14 Feb 2018 18:59:14 -0800 Subject: Why so much emoji nonsense? In-Reply-To: <46b3edf0-4d4e-1354-fc36-baf3691f0d3a@it.aoyama.ac.jp> References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> <46b3edf0-4d4e-1354-fc36-baf3691f0d3a@it.aoyama.ac.jp> Message-ID: Martin J. Dürst wrote: > The original Japanese cell phone carrier emoji were defined in the > unassigned area of Shift_JIS, not Unicode. Thank you (and another list member) for reminding us that it was originally hacked SJIS rather than proper PUA Unicode. From unicode at unicode.org Thu Feb 15 08:16:28 2018 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Thu, 15 Feb 2018 15:16:28 +0100 (CET) Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> <46b3edf0-4d4e-1354-fc36-baf3691f0d3a@it.aoyama.ac.jp> Message-ID: <706530656.61950.1518704188693@ox.hosteurope.de> James Kass via Unicode : > Martin J. Dürst > >> The original Japanese cell phone carrier emoji were defined in the >> unassigned area of Shift_JIS, not Unicode. > > Thank you (and another list member) for reminding us that it was > originally hacked SJIS rather than proper PUA Unicode. Japanese telcos were also not the first to use this space for pictographs and ideographs. Look at Sharp electronic typewriters from the early 1990s for instance (which can also be considered laptop computers), e.g.
WD-A521 or WD-A551 or WD-A750. They already included much of what later became J-Phone / Vodafone / Softbank emojis. From unicode at unicode.org Thu Feb 15 11:21:26 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 09:21:26 -0800 Subject: Unicode of Death 2.0 Message-ID: This article: https://techcrunch.com/2018/02/15/iphone-text-bomb-ios-mac-crash-apple/?ncid=mobilenavtrend The single Unicode symbol referred to in the article results from a string of Telugu characters. The article doesn't list or display the characters, so Mac users can visit the above link. A link in one of the comments leads to a page which does display the characters. From unicode at unicode.org Thu Feb 15 12:58:10 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 15 Feb 2018 19:58:10 +0100 Subject: Unicode of Death 2.0 In-Reply-To: References: Message-ID: That's probably not a bug of Unicode but of the MacOS/iOS text renderers, with some fonts using advanced composition features. Similar bugs could as well affect the new advanced features added in Windows or Android to support multicolored emojis, variable fonts, contextual glyph transforms, style variants, or more font formats (not just OpenType). The bug may also be in the graphic renderer (incorrect clipping when drawing the glyph into the glyph cache, with buffer overflows possibly caused by incorrectly computed splines), or in the display driver (or in a hardware accelerator having some limitations on the complexity of the multipolygons to fill and antialias), causing some infinite recursion loop, or too deep a recursion exhausting the stack limit. Finally, the bug could be in the OpenType hinting engine moving some points outside the clipping area (the math theory may say that such placement of a point outside the clipping area is impossible, but various mathematical simplifications and shortcuts are used to simplify or accelerate the rendering, at the price of some quirks). Even the SVG standard (in constant evolution) could be affected as well in its implementation. There are tons of possible bugs here. 2018-02-15 18:21 GMT+01:00 James Kass via Unicode : > This article: > https://techcrunch.com/2018/02/15/iphone-text-bomb-ios- > mac-crash-apple/?ncid=mobilenavtrend > > The single Unicode symbol referred to in the article results from a > string of Telugu characters. The article doesn't list or display the > characters, so Mac users can visit the above link. A link in one of > the comments leads to a page which does display the characters. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Feb 15 14:53:00 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 12:53:00 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: Ken Whistler replied to Erik Pedersen, > Emoticons were invented, in large part, to fill another > major hole in written communication -- the need to convey > emotional state and affective attitudes towards the text. There is no such need. If one can't string words together which 'speak for themselves', there are other media. I suspect that emoticons were invented for much the same reason that "typewriter art" was invented: because it's there, it's cute, it's clever, and it's novel.
> This is the kind of information that face-to-face > communication has a huge and evolutionarily deep > bandwidth for, but which written communication > typically fails miserably at. Does Braille include emoji? Are there tonal emoticons available for telephone or voice transmission? Does the telephone "fail miserably" at oral communication because there's no video to transmit facial tics and hand gestures? Did Pontius Pilate have a cousin named Otto? These are rhetorical questions. For me, the emoji are a symptom of our moving into a post-literate age. We already have people in positions of power who pride themselves on their marginal literacy and boast about the fact that they don't read much. Sad! From unicode at unicode.org Thu Feb 15 15:38:19 2018 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Thu, 15 Feb 2018 21:38:19 +0000 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: For voice we certainly get clues about the speaker's intent from their tone. That tone can change the meaning of the same written word quite a bit. There is no need for video to wildly change the meaning of two different readings of the exact same words. Writers have always taken liberties with the written word to convey ideas that aren't purely grammatically correct. This may be most obvious in poetry, but it happens even in other writings. Maybe their entire reason was so that future English teachers would ask us why some author chose some peculiar structure or whatever. I find it odd that I write things like "I'd've thought" (AFAIK I hadn't been exposed to I'd've and it just spontaneously occurred, but apparently others (mis)use it as well). I realize "I'd've" isn't "right", but it better conveys my current state of mind than spelling it out would've. Similarly, if I find myself smiling internally while I'm writing, it's going to get a :) Though I may use :), I agree that most of my use of emoji is more decorative, however including other emoji can also make the sentence feel more "fun". If I receive a ?? as the only response to a comment I made, that conveys information that I would have a difficult time putting into words. I don't find emoji to necessarily be a "post-literate" thing. Just a different way of communicating. I have also seen them used in a "pre-literate" fashion. Helping people that were struggling to learn to read get past the initial difficulties they were having on their way to becoming more literate. -Shawn -----Original Message----- From: Unicode On Behalf Of James Kass via Unicode Sent: Thursday, February 15, 2018 12:53 PM To: Ken Whistler Cc: Erik Pedersen ; Unicode Public Subject: Re: Why so much emoji nonsense? Ken Whistler replied to Erik Pedersen, > Emoticons were invented, in large part, to fill another major hole in > written communication -- the need to convey emotional state and > affective attitudes towards the text. There is no such need. If one can't string words together which 'speak for themselves', there are other media. I suspect that emoticons were invented for much the same reason that "typewriter art" was invented: because it's there, it's cute, it's clever, and it's novel. > This is the kind of information that face-to-face communication has a > huge and evolutionarily deep bandwidth for, but which written > communication typically fails miserably at. Does Braille include emoji? Are there tonal emoticons available for telephone or voice transmission? 
Does the telephone "fail miserably" at oral communication because there's no video to transmit facial tics and hand gestures? Did Pontius Pilate have a cousin named Otto? These are rhetorical questions. For me, the emoji are a symptom of our moving into a post-literate age. We already have people in positions of power who pride themselves on their marginal literacy and boast about the fact that they don't read much. Sad! From unicode at unicode.org Thu Feb 15 16:24:18 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 15 Feb 2018 23:24:18 +0100 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: 2018-02-15 22:38 GMT+01:00 Shawn Steele via Unicode : > > I don't find emoji to necessarily be a "post-literate" thing. Just a > different way of communicating. I have also seen them used in a > "pre-literate" fashion. Helping people that were struggling to learn to > read get past the initial difficulties they were having on their way to > becoming more literate. > If you just look at how more and more people "communicate" today on the Internet, it's mostly by video, most of it of poor quality and with no real graphic value, where a single photo of the speaker on his profile would be enough. So the web is now overwhelmed by poor videos just containing speech, with very low value. But the worst is that this fabulous collection is almost impossible to qualify, sort, or organize; it is not reusable, and almost not transmissible (except on the social network where the videos are posted, and where they'll soon disappear, because there's simply no way to build efficient archives that would be usable in the near future): just a haystack where even the precious gold needles are extremely difficult to find. If people don't know how to read and cannot reuse the content and transmit it, they become just consumers, and in fact less and less producers or creators of content. Just look at the opinions under videos: most of them are just "thumbs up", "like", "+1", merely counted, never qualified (there's not even a thumbs down). Even these terms are avoided on the interface and you just see an icon for the counter: do you have something to learn when seeing these icons? I fear that those who, in the near future, won't be able to read, and will only be able to listen to the media produced by others, will not even be able to make any judgement, and then will be easily manipulated. And it's in the mission of Unicode, IMHO, to promote litteracy, because it is necessary for preserving, transmitting, and expanding the cultures, as well as to reconcile people with science instead of just following the voice of new gurus only because they look "fun". -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Feb 15 16:30:49 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 15 Feb 2018 22:30:49 +0000 Subject: Why so much emoji nonsense? - Proscription In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: <20180215223049.3e4c3692@JRWUBU2> On Thu, 15 Feb 2018 21:38:19 +0000 Shawn Steele via Unicode wrote: > I realize "I'd've" isn't > "right", Where did that proscription come from? Is it perhaps a perversion of the proscription of "I'd of"? Richard.
From unicode at unicode.org Thu Feb 15 16:33:12 2018 From: unicode at unicode.org (Oren Watson via Unicode) Date: Thu, 15 Feb 2018 17:33:12 -0500 Subject: Invisible characters must be specified to be visible in security-sensitive situations Message-ID: https://securelist.com/zero-day-vulnerability-in-telegram/83800/ You could disallow these characters in filenames, but when filename handling is charset-agnostic due to the extended-ascii principle this is impractical. I think a better solution is to specify a visible form of these characters to be used (e.g. through otf font variants) when security is of importance. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Feb 15 16:35:23 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 15 Feb 2018 22:35:23 +0000 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> Message-ID: <20180215223523.6a7a5abb@JRWUBU2> On Wed, 14 Feb 2018 17:49:05 -0800 James Kass via Unicode wrote: > I've personally exchanged text data with others using the PUA for both > Klingon and Ewellic. [winks] But wasn't that using a supplementary standard, the ConScript Unicode Registry? Richard. From unicode at unicode.org Thu Feb 15 16:35:54 2018 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Thu, 15 Feb 2018 22:35:54 +0000 Subject: Why so much emoji nonsense? - Proscription In-Reply-To: <20180215223049.3e4c3692@JRWUBU2> References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <20180215223049.3e4c3692@JRWUBU2> Message-ID: Depends on your perspective I guess ;) -----Original Message----- From: Unicode On Behalf Of Richard Wordingham via Unicode Sent: Thursday, February 15, 2018 2:31 PM To: unicode at unicode.org Subject: Re: Why so much emoji nonsense? - Proscription On Thu, 15 Feb 2018 21:38:19 +0000 Shawn Steele via Unicode wrote: > I realize "I'd've" isn't > "right", Where did that proscription come from? Is it perhaps a perversion of the proscription of "I'd of"? Richard. From unicode at unicode.org Thu Feb 15 16:41:23 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Thu, 15 Feb 2018 14:41:23 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: On 2/15/2018 2:24 PM, Philippe Verdy via Unicode wrote: > And it's in the mission of Unicode, IMHO, to promote litteracy Um, no. And not even literacy, either. ;-) https://en.wikipedia.org/wiki/Category:Organizations_promoting_literacy --Ken From unicode at unicode.org Thu Feb 15 16:47:51 2018 From: unicode at unicode.org (Nelson H. F. Beebe via Unicode) Date: Thu, 15 Feb 2018 15:47:51 -0700 Subject: Invisible characters must be specified to be visible in security-sensitive situations Message-ID: A list poster reported this story today: https://securelist.com/zero-day-vulnerability-in-telegram/83800/ For a view from the co-father of the Internet, see this recent article: Desirable Properties of Internet Identifiers Vinton G. Cerf https://www.computer.org/csdl/mags/ic/2017/06/mic2017060063.html ------------------------------------------------------------------------------- - Nelson H. F. 
Beebe Tel: +1 801 581 5254 - - University of Utah FAX: +1 801 581 4148 - - Department of Mathematics, 110 LCB Internet e-mail: beebe at math.utah.edu - - 155 S 1400 E RM 233 beebe at acm.org beebe at computer.org - - Salt Lake City, UT 84112-0090, USA URL: http://www.math.utah.edu/~beebe/ - ------------------------------------------------------------------------------- From unicode at unicode.org Thu Feb 15 16:52:28 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 14:52:28 -0800 Subject: Why so much emoji nonsense? In-Reply-To: <20180215223523.6a7a5abb@JRWUBU2> References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <8259CB2B-0416-4053-9D64-D258B2F66A45@alastairs-place.net> <20180215223523.6a7a5abb@JRWUBU2> Message-ID: Richard Wordingham wrote, >> Klingon and Ewellic. [winks] > > But wasn't that using a supplementary standard, the ConScript Unicode > Registry? The code points registered with CSUR were used for the interchange. But, to clarify, CSUR is not an official supplement to The Unicode Standard. Of course, any exchange of PUA data requires an agreement between senders and recipients. CSUR offers character mappings which private individuals may agree to use for data exchange. From unicode at unicode.org Thu Feb 15 17:19:41 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 15:19:41 -0800 Subject: Why so much emoji nonsense? - Proscription In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <20180215223049.3e4c3692@JRWUBU2> Message-ID: I'd not've thought "I'd've" was proscribed. Who woulda guessed? On Thu, Feb 15, 2018 at 2:35 PM, Shawn Steele via Unicode wrote: > Depends on your perspective I guess ;) > > -----Original Message----- > From: Unicode On Behalf Of Richard Wordingham via Unicode > Sent: Thursday, February 15, 2018 2:31 PM > To: unicode at unicode.org > Subject: Re: Why so much emoji nonsense? - Proscription > > On Thu, 15 Feb 2018 21:38:19 +0000 > Shawn Steele via Unicode wrote: > >> I realize "I'd've" isn't >> "right", > > Where did that proscription come from? Is it perhaps a perversion of the proscription of "I'd of"? > > Richard. > From unicode at unicode.org Thu Feb 15 17:49:18 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 16 Feb 2018 00:49:18 +0100 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: Oh well the 1 to 2 T is a minor English typo (there are two Ts in French for the similar word family, sorry). But I included "IMHO", which means that even if it's not official, it has been the motivating reason why various members joined the project, to try to put an end to the destruction of written languages and the loss of our written heritage, which is still the essential way for humanity to communicate (much more than oral languages, which are all threatened with rapid death and with being forgotten if they are not written). Written languages easily cross borders, generations and cultures; with them you can extend your own language and culture, get more ideas and more inventions, better understand the world, and have the means to be more creative, rather than only following what the most visible leaders are saying. Everywhere, literacy is improving people's lives and offering more means of living. And it really helps preserve your own personal memory (you do that with photos/videos or audio, which are almost impossible to organize without attaching text to them)!
2018-02-15 23:41 GMT+01:00 Ken Whistler : > > > On 2/15/2018 2:24 PM, Philippe Verdy via Unicode wrote: > >> And it's in the mission of Unicode, IMHO, to promote litteracy >> > > Um, no. And not even literacy, either. ;-) > > https://en.wikipedia.org/wiki/Category:Organizations_promoting_literacy > > --Ken > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Feb 15 18:16:23 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 16:16:23 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: Philippe Verdy wrote, >>> And it's in the mission of Unicode, IMHO, to promote litteracy >> >> Um, no. And not even literacy, either. ;-) > > Oh well the 1 to 2 T is a minor English typo (there are two Ts in French for the > similar word family, sorry). > > But I included "IMHO", which means that even if it's not official, it has > been the motivating reason why various members joined the project ... In this case the punctuation emoticon tacked onto Ken's message apparently did little to diminish the sting of his correcting both your spelling and your opinion. Unicode's stated mission is more along the lines of ensuring that computer text can be universally interchanged in a standard fashion. As a tool, Unicode can be used to promote either literacy or illiteracy. It can be used to exchange messages of joy and love, or hatred and despair. I completely agree that promoting literacy and preserving texts has been a motivating factor for many people supporting the project. From unicode at unicode.org Thu Feb 15 18:17:17 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 16 Feb 2018 01:17:17 +0100 Subject: Invisible characters must be specified to be visible in security-sensitive situations In-Reply-To: References: Message-ID: The suggested filename has no real importance; it could be garbage, and displaying it exactly does not matter. What is important is to display the MIME type (which is transmitted separately from the filename, and frequently without any filename at all, the browser then trying to infer a suitable filename from the URL; but it should respect the MIME type). The acceptable MIME types (and especially, as here, when they denote something executable like JavaScript) should be clearly identified, and the file extension removed from what is displayed when it matches the MIME type. With these, the user would not be confused by the presence of a Bidi override control. So "photo_high_re"++"gnp.js" becomes the text field (to embed in a directional isolate) "photo_high_re"++"gnp (text/javascript)", rendered as "photo_high_regnp" (text/javascript). The browser may also be smarter and describe it as an executable script. But here, in an alert box where it detects potentially harmful content, the suggested filename to display should simply be filtered of these Bidi controls, and the suggested file extension removed and replaced by the default extension for the MIME type (outside the isolate).
The user would then see: "photo_high_regnp.js" (text/javascript), where the suggested filename was altered. (In such an alert, the suggested file names should also be truncated to a maximum length, with an indication of the truncation before the replaced extension, such as: "photo_high[...].js" (text/javascript).) Also, the generic icon used is not descriptive enough and is counterproductive, as the user may think the icon is a preview of a PNG image; that's why the MIME type should be clearly exposed. 2018-02-15 23:33 GMT+01:00 Oren Watson via Unicode : > https://securelist.com/zero-day-vulnerability-in-telegram/83800/ > > You could disallow these characters in filenames, but when filename > handling is charset-agnostic due to the extended-ascii principle this is > impractical. I think a better solution is to specify a visible form of > these characters to be used (e.g. through otf font variants) when security > is of importance. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Feb 15 18:59:11 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 16 Feb 2018 00:59:11 +0000 Subject: Origin of Alphasyllabaries (was: Why so much emoji nonsense?) In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: <20180216005911.7b2b9012@JRWUBU2> On Wed, 14 Feb 2018 21:49:57 +0100 Philippe Verdy via Unicode wrote: > The concept of vowels as distinctive letters came later; even the > letter A was initially the representation of a glottal stop consonant, > sometimes mute, written only to indicate a word whose first syllable > did not begin with a consonant. This has survived today in abjads and > abugidas, where vowels became optional diacritics, and evolved into > plain diacritics in Indic abugidas. OK. > The situation is even more complex because clusters of consonants > were also represented in early vowel-less alphabets to stand for full > syllables (this formed the base of today's syllabaries, once glyph > variants of the base consonant were introduced to distinguish their > vocalization; The only syllabary where what you say might be true is the Ethiopic syllabary, and I have grave doubts as to that case. I hope you are aware that most syllabaries do not derive from alphabets, abjads or abugidas. > the Indic abugidas, with their complex clusters in which vowel > diacritics create contextual variant forms of the base consonant, are > also a remnant of this old age): I see no reason to regard consonant-vowel ligatures as going back to an earlier system without dependent vowels. > the separation of > phonetic consonants came only later. Old Brahmi stacked consonants are generally very clear compositions. Opaque ligatures are a later development. Writing consonants linearly is a later development; is this what you are referring to? Richard.
I suspect that emoticons were invented for much the same reason that "typewriter art" was invented: because it's there, it's cute, it's clever, and it's novel. By the standard of "if one can't string word together that speak for themselves can use otger media", then we can scrap Unicode and simply use voice recording for all the purposes. ?_? > This is the kind of information that face-to-face > communication has a huge and evolutionarily deep > bandwidth for, but which written communication > typically fails miserably at. Does Braille include emoji? Are there tonal emoticons available for telephone or voice transmission? Does the telephone "fail miserably" at oral communication because there's no video to transmit facial tics and hand gestures? Did Pontius Pilate have a cousin named Otto? These are rhetorical questions. Tonal emoticon for telephone or voice transmission? There are tones for voice based transmission system And yes, there are limits in these technology which make teleconferencing still not all that popular and people still have to fly across the world just to attend all different sort of meetings. For me, the emoji are a symptom of our moving into a post-literate age. We already have people in positions of power who pride themselves on their marginal literacy and boast about the fact that they don't read much. Sad! Emoji is part of the literacy. Remember that Japanese writing system use ideographic characters plus kana, it won't be odd to add yet another set of pictographic writing system in line to express what you don't want to spell out. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Feb 15 20:37:02 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 18:37:02 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: On Thu, Feb 15, 2018 at 6:19 PM, Phake Nick via Unicode wrote: > > > 2018-02-16 04:55, "James Kass via Unicode" wrote: > > Ken Whistler replied to Erik Pedersen, > >> Emoticons were invented, in large part, to fill another >> major hole in written communication -- the need to convey >> emotional state and affective attitudes towards the text. > > There is no such need. If one can't string words together which > 'speak for themselves', there are other media. I suspect that > emoticons were invented for much the same reason that "typewriter art" > was invented: because it's there, it's cute, it's clever, and it's > novel. > > By the standard of "if one can't string word together that speak for > themselves can use otger media", then we can scrap Unicode and simply use > voice recording for all the purposes. ?_? > > >> This is the kind of information that face-to-face >> communication has a huge and evolutionarily deep >> bandwidth for, but which written communication >> typically fails miserably at. > > Does Braille include emoji? Are there tonal emoticons available for > telephone or voice transmission? Does the telephone "fail miserably" > at oral communication because there's no video to transmit facial tics > and hand gestures? Did Pontius Pilate have a cousin named Otto? > These are rhetorical questions. > > Tonal emoticon for telephone or voice transmission? 
There are tones for > voice based transmission system > And yes, there are limits in these technology which make teleconferencing > still not all that popular and people still have to fly across the world > just to attend all different sort of meetings. > > > For me, the emoji are a symptom of our moving into a post-literate > age. We already have people in positions of power who pride > themselves on their marginal literacy and boast about the fact that > they don't read much. Sad! > > Emoji is part of the literacy. Remember that Japanese writing system use > ideographic characters plus kana, it won't be odd to add yet another set of > pictographic writing system in line to express what you don't want to spell > out. From unicode at unicode.org Thu Feb 15 20:46:16 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 18:46:16 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: Phake Nick wrote, > By the standard of "if one can't string word together that speak for > themselves can use otger media", then we can scrap Unicode and simply use > voice recording for all the purposes. ?_? Not for me, I can still type faster than I can talk. Besides, voice recordings are all about communicating by stringing words together. >> These are rhetorical questions. > > Tonal emoticon for telephone or voice transmission? There are tones for > voice based transmission system > And yes, there are limits in these technology which make teleconferencing > still not all that popular and people still have to fly across the world > just to attend all different sort of meetings. At least, that's what they tell their accountants and tax people, right? > Emoji is part of the literacy. Remember that Japanese writing system use > ideographic characters plus kana, it won't be odd to add yet another set of > pictographic writing system in line to express what you don't want to spell > out. Yes, it's a done deal. For better or for worse. From unicode at unicode.org Thu Feb 15 21:26:00 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 19:26:00 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: If someone were to be smiling and shrugging while giving you the finger, would you be smiling too? Heck, I'd probably be laughing out loud while running for my life! So, poor example. OK. A smiling creep is still a creep. Suppose for a moment that you and I are pals in the same room having a face-to-face conversation. I advise you that, due to unforeseen events, I'm a bit financially strapped and could use a spot of cash to sort of tide me over until my ship comes into orbit. You smile and nod your head while saying "no". Which response applies? Words suffice. We go by what people actually say rather than whatever they might have meant. When we read text, we go by what's written. An inability to communicate any essential feelings and overtones using words is not a gross failure of either language or writing. It's more about the skill levels of the speaker, listener, author, and reader. As for the thread title question, perhaps the exchanges within the thread offer insight. Emoji exist and are interchanged. Unicode enables them to be interchanged in a standard fashion. Even if they're just for fun, frivolous, silly, and ephemeral. Even if some people consider them beyond the scope of The Unicode Standard. 
The best time to argue against the addition of emoji to Unicode would be 2007 or 2008, but you'd be wasting your time travel. Trust me. From unicode at unicode.org Thu Feb 15 22:58:31 2018 From: unicode at unicode.org (Pierpaolo Bernardi via Unicode) Date: Fri, 16 Feb 2018 05:58:31 +0100 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: On Fri, Feb 16, 2018 at 4:26 AM, James Kass via Unicode wrote: > The best time to argue against the addition of emoji to Unicode would be > 2007 or 2008, but you'd be wasting your time travel. Trust me. But it's always a good time to argue against the addition of more nonsense to what we already have got. From unicode at unicode.org Thu Feb 15 23:24:58 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 15 Feb 2018 22:24:58 -0700 Subject: +1 (was: Re: Why so much emoji nonsense?) Message-ID: <4D59975202364CD9959680B1D523958F@DougEwell> Philippe Verdy wrote: > If people don't know how to read and cannot reuse the content and > transmit it, they become just consumers, and in fact less and less > producers or creators of content. Just look at the opinions under > videos: most of them are just "thumbs up", "like", "+1", merely > counted, never qualified (there's not even a thumbs down). +1 is actually a convenient shorthand when all that needs to be said is "I agree" or "me too" (especially now that the latter has taken on a highly charged meaning in the U.S.). It is especially popular in the IETF. It is not intended for situations that require explanation or details. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Thu Feb 15 23:33:58 2018 From: unicode at unicode.org (Anshuman Pandey via Unicode) Date: Thu, 15 Feb 2018 23:33:58 -0600 Subject: =?utf-8?Q?End_of_discussion,_please_=E2=80=94_Re:_Why_so_much_em?= =?utf-8?Q?oji_nonsense=3F?= In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: <40CA88F8-8C33-4F3A-9ECA-38B66A5B4680@umich.edu> > On Feb 15, 2018, at 10:58 PM, Pierpaolo Bernardi via Unicode wrote: > > On Fri, Feb 16, 2018 at 4:26 AM, James Kass via Unicode > wrote: > >> The best time to argue against the addition of emoji to Unicode would be >> 2007 or 2008, but you'd be wasting your time travel. Trust me. > > But it's always a good time to argue against the addition of more > nonsense to what we already have got. I think it's a good time to end this conversation. Whether "nonsense" or not, emoji are here and they're in Unicode. This conversation has itself become nonsense, d'y'all agree? The amount of time that people have spent on this discussion could've been directed towards work on any one of the unencoded scripts listed at: http://www.linguistics.berkeley.edu/sei/scripts-not-encoded.html As many have noted during this discussion, the emoji "ship has already sailed". I'd've jumped aboard sooner, but this metaphor is now also quite tired. ?? All my best, Anshu -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 16 00:04:12 2018 From: unicode at unicode.org (Phake Nick via Unicode) Date: Fri, 16 Feb 2018 14:04:12 +0800 Subject: Why so much emoji nonsense?
In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: 2018-02-16 10:46, "James Kass" wrote: Phake Nick wrote, > By the standard of "if one can't string word together that speak for > themselves can use otger media", then we can scrap Unicode and simply use > voice recording for all the purposes. ?_? Not for me, I can still type faster than I can talk. Besides, voice recordings are all about communicating by stringing words together. There are thousands of situations where one would want to express something in text form instead of voice form, other than to be fast. Voice communication isn't just about stringing words together: emotion and other things are also transferred. That's also why carriers are supporting HQ Voice transmission over telephony systems, for better clarity in this aspect. >> These are rhetorical questions. > Tonal emoticon for telephone or voice transmission? There are tones for > voice based transmission system > And yes, there are limits in these technology which make teleconferencing > still not all that popular and people still have to fly across the world > just to attend all different sort of meetings. At least, that's what they tell their accountants and tax people, right? Then why do those people who pay for their own trips still do so? > [...] 2018-02-16 11:27, "James Kass via Unicode" wrote: If someone were to be smiling and shrugging while giving you the finger, would you be smiling too? Heck, I'd probably be laughing out loud while running for my life! So, poor example. OK. A smiling creep is still a creep. This is an example of extravocal communication. If the person said thank you with a smiling face while giving you a middle finger, it would be a totally different context from a regular thank you given by other people. Suppose for a moment that you and I are pals in the same room having a face-to-face conversation. I advise you that, due to unforeseen events, I'm a bit financially strapped and could use a spot of cash to sort of tide me over until my ship comes into orbit. You smile and nod your head while saying "no". Which response applies? Words suffice. We go by what people actually say rather than whatever they might have meant. When we read text, we go by what's written. Then, what would the listener feel if he only heard you say no, but didn't know about your facial and body reaction? He might not be able to grasp the level of "no" you are giving out, and you would need some rather lengthy description to explain to the person why you want to reject him. Why do that when a simple non-verbal expression is enough? An inability to communicate any essential feelings and overtones using words is not a gross failure of either language or writing. It's more about the skill levels of the speaker, listener, author, and reader. https://en.wikipedia.org/wiki/Nonverbal_communication As for the thread title question, perhaps the exchanges within the thread offer insight. Emoji exist and are interchanged. Unicode enables them to be interchanged in a standard fashion. Even if they're just for fun, frivolous, silly, and ephemeral. Even if some people consider them beyond the scope of The Unicode Standard. The best time to argue against the addition of emoji to Unicode would be 2007 or 2008, but you'd be wasting your time travel. Trust me. I would like to add that, if Unicode hadn't included emoji at the time, then I suspect many more systems would have continued to use Shift-JIS instead.
Individual mobile phone carriers would continue to use each of their own private code points, and app/platform developers would either have to find a way to convert code points between the different emoji sets in use (remember that the implementations by each carrier didn't strictly correspond to each other), or invent yet another private-use font to cover all those emoji within their platform. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 16 01:24:38 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 23:24:38 -0800 Subject: =?UTF-8?Q?Re=3A_End_of_discussion=2C_please_=E2=80=94_Re=3A_Why_so_much_em?= =?UTF-8?Q?oji_nonsense=3F?= In-Reply-To: <40CA88F8-8C33-4F3A-9ECA-38B66A5B4680@umich.edu> References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <40CA88F8-8C33-4F3A-9ECA-38B66A5B4680@umich.edu> Message-ID: Anshuman Pandey wrote: > I think it's a good time to end this conversation. Whether "nonsense" or not, > emoji are here and they're in Unicode. This conversation has itself become > nonsense, d'y'all agree? No. Other than the part about emoji being here and in Unicode. > The amount of time that people have spent on this discussion could've been > directed towards work on any one of the unencoded scripts listed at: > > http://www.linguistics.berkeley.edu/sei/scripts-not-encoded.html https://en.wikipedia.org/wiki/All_work_and_no_play_makes_Jack_a_dull_boy From unicode at unicode.org Fri Feb 16 01:47:11 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 16 Feb 2018 09:47:11 +0200 Subject: Invisible characters must be specified to be visible in security-sensitive situations In-Reply-To: (message from Oren Watson via Unicode on Thu, 15 Feb 2018 17:33:12 -0500) References: Message-ID: <83tvuhe5q8.fsf@gnu.org> > Date: Thu, 15 Feb 2018 17:33:12 -0500 > From: Oren Watson via Unicode > > https://securelist.com/zero-day-vulnerability-in-telegram/83800/ > > You could disallow these characters in filenames, but when filename handling is charset-agnostic due to the > extended-ascii principle this is impractical. I think a better solution is to specify a visible form of these > characters to be used (e.g. through otf font variants) when security is of importance. Emacs has a special function that searches a given region of a buffer of text or of a text string for characters whose Bidi_Class property has been overridden by RLO or LRO. Emacs application programs can use this function to detect and flag such regions of text, and prevent such malicious attacks. From unicode at unicode.org Fri Feb 16 01:54:04 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Feb 2018 23:54:04 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: Pierpaolo Bernardi wrote: > But it's always a good time to argue against the addition of more > nonsense to what we already have got. It's an open-ended set and precedent for encoding them exists. Generally, input regarding the addition of characters to a repertoire is solicited from the user community, of which I am not a member. My personal feeling is that all of the time, effort, and money spent by the various corporations in promoting the emoji into Unicode would have been better directed towards something more worthwhile, such as the unencoded scripts listed at: http://www.linguistics.berkeley.edu/sei/scripts-not-encoded.html ... but nobody asked me.
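(Tying the security subthread together: Oren Watson's proposal, Philippe Verdy's filtering suggestion, and the Emacs facility Eli Zaretskii describes all reduce to the same first step, scanning a string for the explicit directional formatting characters before trusting its visual order. A minimal sketch in Python, with the function name invented here for illustration; this is not Emacs's actual API:

    # Explicit bidi embedding/override/isolate controls: U+202A..U+202E
    # and U+2066..U+2069. RLO (U+202E) is the one abused in the Telegram
    # vulnerability discussed above.
    BIDI_CONTROLS = ({chr(c) for c in range(0x202A, 0x202F)}
                     | {chr(c) for c in range(0x2066, 0x206A)})

    def has_directional_controls(text: str) -> bool:
        # True if the string's display order may differ from its
        # logical order because of explicit bidi controls.
        return any(ch in BIDI_CONTROLS for ch in text)

    # "photo_high_re" + RLO + "gnp.js" displays as "photo_high_resj.png"
    # yet names a .js file; a UI can flag, strip, or visibly escape it.
    spoofed = "photo_high_re\u202Egnp.js"
    assert has_directional_controls(spoofed)

A display layer can then replace each flagged control with a visible escape such as "<U+202E>", which is one way to realize the "visible form" Oren asked for.)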
From unicode at unicode.org Fri Feb 16 02:06:17 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 16 Feb 2018 00:06:17 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: > Words suffice. We go by what people actually say rather than whatever > they might have meant. When we read text, we go by what's written. That is a worthy opinion, but not one that is shared, either in principle or in lived practice (esp. related to digital communication) by vast numbers of people. One of the strengths of Unicode has always been its willingness to deal with actual use of writing and notational systems - sometimes after a bit of a delay. In other words, Unicode is rarely prescriptive, unless positive interchange isn't possible otherwise. And that reactiveness is a good thing, as much as the result can look a bit "messy" at times and time and again refuses to fit a nice&clean single conceptual framework. A./ From unicode at unicode.org Fri Feb 16 02:25:03 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 16 Feb 2018 00:25:03 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 16 02:40:50 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 16 Feb 2018 00:40:50 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 16 02:56:00 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 16 Feb 2018 00:56:00 -0800 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: Asmus Freytag wrote: >> Words suffice. We go by what people actually say rather than whatever >> they might have meant. When we read text, we go by what's written. > > That is a worthy opinion, but not one that is shared, either in principle > or in lived practice (esp. related to digital communication) by vast numbers > of people. True, but there are also plenty of people who strive to say what they mean and mean what they say. From unicode at unicode.org Fri Feb 16 04:04:54 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Fri, 16 Feb 2018 11:04:54 +0100 Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: A few points 1. To add to what Asmus said, see also http://unicode.org/L2/L2018/18044-encoding-emoji.pdf "Their encoding, surprisingly, has been a boon for language support. The emoji draw on Unicode mechanisms that are used by various languages, but which had been incompletely implemented on many platforms. Because of the demand for emoji, many implementations have upgraded their Unicode support substantially. That means that implementations now have far better support for the languages that use the more complicated Unicode mechanisms." An example of that is MySQL, where the rise of emoji led to non-BMP support. 2. Aside from SEI (at UCB), we've also been able to fund a number of projects such as http://blog.unicode.org/2016/12/adopt-character-grant-to-support-indic.html 4.
Finally, I'd like to point out that this external mailing list is open to anyone (subject to civil behavior), with the main goal being to provide a forum for people to ask questions about how to deploy, use, and contribute to Unicode, and get answers from a community of users. Those who want to engage in extended kvetching can take that to the rightful place: *Twitter*. Mark Mark On Fri, Feb 16, 2018 at 9:25 AM, Asmus Freytag via Unicode < unicode at unicode.org> wrote: > On 2/15/2018 11:54 PM, James Kass via Unicode wrote: > > Pierpaolo Bernardi wrote: > > > But it's always a good time to argue against the addition of more > nonsense to what we already have got. > > It's an open-ended set and precedent for encoding them exists. > Generally, input regarding the addition of characters to a repertoire > is solicited from the user community, of which I am not a member. > > My personal feeling is that all of the time, effort, and money spent > by the various corporations in promoting the emoji into Unicode would > have been better directed towards something more worthwhile, such as > the unencoded scripts listed at: > > http://www.linguistics.berkeley.edu/sei/scripts-not-encoded.html > > ... but nobody asked me. > > > Curiously enough it is the emoji that keep a large number of users (and > companies > serving them) engaged with Unicode who would otherwise be likely to come > to the conclusion that Unicode is "done" as far as their needs are > concerned. > > Few, if any, of the not-yet-encoded scripts are used by large living > populations, > therefore they are not urgently missing / needed in daily life and are of > interest > primarily to specialists. > > Emoji are definitely up-ending that dynamic, which I would argue is a good > thing. > > A financially well endowed Consortium with strong membership is a > prerequisite > to fulfilling the larger cultural mission of Unicode. Sure, for the > populations > whose scripts are already encoded, there are separate issues that will keep > some interest alive, like solving problems related to algorithms and > locales, but > also dealing with extensions of existing scripts and notational systems - > although > few enough of those are truly urgent/widely used. > > The University of Berkeley people would be the first to tell you how their > funding > picture is positively influenced by the current perceived relevancy of > the Unicode > Consortium - much of it being due to those emoji. > > A./ > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 16 04:42:51 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 16 Feb 2018 11:42:51 +0100 Subject: Origin of Alphasyllabaries (was: Why so much emoji nonsense?) In-Reply-To: <20180216005911.7b2b9012@JRWUBU2> References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <20180216005911.7b2b9012@JRWUBU2> Message-ID: 2018-02-16 1:59 GMT+01:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Wed, 14 Feb 2018 21:49:57 +0100 > Philippe Verdy via Unicode wrote: > > > The concept of vowels as distinctive letters came later; even the > > letter A was initially the representation of a glottal stop consonant, > > sometimes mute, written only to indicate a word whose first syllable > > did not begin with a consonant. This has survived today in abjads and > > abugidas, where vowels became optional diacritics, and evolved into > > plain diacritics in Indic abugidas. > > OK.
> > The situation is even more complex because clusters of consonants
> > were also represented in early vowel-less alphabets to represent full
> > syllables (this has formed the base of today's syllabaries, when only
> > some glyph variants of the base consonant were introduced to
> > distinguish their vocalization;
>
> The only syllabary where what you say might be true is the Ethiopic
> syllabary, and I have grave doubts as to that case.
>
> I hope you are aware that most syllabaries do not derive from
> alphabets, abjads or abugidas.
>
I said the opposite: the alphabets, abjads, abugidas and today's full
syllabaries derive from early simplified syllabaries, themselves derived
from simplified pictograms (ideograms becoming phonograms).
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri Feb 16 04:57:57 2018
From: unicode at unicode.org (Phake Nick via Unicode)
Date: Fri, 16 Feb 2018 10:57:57 +0000
Subject: Why so much emoji nonsense?
Message-ID:

2018-02-16 FRI 15:55, James Kass via Unicode wrote:
> Pierpaolo Bernardi wrote:
>
> > But it's always a good time to argue against the addition of more
> > nonsense to what we already have got.
>
> It's an open-ended set and precedent for encoding them exists.
> Generally, input regarding the addition of characters to a repertoire
> is solicited from the user community, of which I am not a member.
>
> My personal feeling is that all of the time, effort, and money spent
> by the various corporations in promoting the emoji into Unicode would
> have been better directed towards something more worthwhile, such as
> the unencoded scripts listed at:
>
> http://www.linguistics.berkeley.edu/sei/scripts-not-encoded.html
>
> ... but nobody asked me.
>

1. In UTS #51, it is mentioned that embedded graphics are the way to go
as a longer-term solution to emoji, in addition to emoji characters. But
that would require substantial infrastructure changes, and even then they
would most probably not be supported in pure text environments.

2. Actually, the problem is not just limited to emoji. Many ideographic
characters (Chinese, Japanese, etc.) are added to Unicode each year.
While at the current rate there is still plenty of room in the Unicode
standard to contain them, the repertoire is still more open-ended than
would be desired for a multilingual encoding system, and that also makes
it hard to expect newly encoded ideographic characters to just "work" on
different systems with sufficient font support. The fact that a character
has to be encoded in Unicode before it can be exchanged digitally has
also limited users' ability to create new characters in an ad hoc manner,
which is something that probably happened more often in the pre-digital
era. Different parties have proposed solutions to dynamically construct
and use such characters as desired instead of relying on an encoding
mechanism, but they all seem so radically different from modern computer
infrastructure that they are not being adopted.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri Feb 16 09:37:11 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 16 Feb 2018 15:37:11 +0000
Subject: Origin of Alphasyllabaries (was: Why so much emoji nonsense?)
In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <20180216005911.7b2b9012@JRWUBU2>
Message-ID: <20180216153711.597f90cc@JRWUBU2>

On Fri, 16 Feb 2018 11:42:51 +0100
Philippe Verdy via Unicode wrote:

> I said the opposite: the alphabets, abjads, abugidas and today's full
> syllabaries derive from early simplified syllabaries,...

In the Old World, alphabets and abugidas derive from abjads, which do not
derive from syllabaries. I'm counting Ancient Egyptian as an abjad, as
that is the category that fits the purely phonetic writings best.

> ...themselves
> derived from simplified pictograms (ideograms becoming phonograms).

This bit is true.

From unicode at unicode.org Fri Feb 16 10:00:40 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 16 Feb 2018 16:00:40 +0000
Subject: Why so much emoji nonsense?
In-Reply-To: References: Message-ID: <20180216160040.1e630740@JRWUBU2>

On Fri, 16 Feb 2018 10:57:57 +0000
Phake Nick via Unicode wrote:

> 2. Actually, the problem is not just limited to emoji. Many ideographic
> characters (Chinese, Japanese, etc.) are added to Unicode each year.
> While at the current rate there is still plenty of room in the Unicode
> standard to contain them, the repertoire is still more open-ended than
> would be desired for a multilingual encoding system, and that also
> makes it hard to expect newly encoded ideographic characters to just
> "work" on different systems with sufficient font support.

Isn't Unicode designed to stifle innovation? -:)

Actually, there are two mechanisms that could be made to support
innovations.

For characters with limited dissemination, one can revert to a font-based
mechanism that defines properties for graphical PUA characters. The
problem is that that won't work at all well in plain text like this
email. I thought a specialised version of the scheme was already working
for Japanese names - PUAs started as a temporary extension measure for
CJK encodings.

A more portable solution for ideographs is to render an Ideographic
Description Sequence (IDS) as an approximation to the character it
describes. The Unicode Standard carefully does not prohibit so doing,
and a similar scheme is being developed for blocks of Egyptian
Hieroglyphs, and has been proposed for Mayan as well. There may be merit
in making the rendering of an IDS ugly, so as to encourage its
replacement by the encoding of the character.

I gather that making the use of IDSes consistent with searching is
considered daunting.

Richard.

From unicode at unicode.org Fri Feb 16 10:22:23 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Fri, 16 Feb 2018 08:22:23 -0800
Subject: IDC's versus Egyptian format controls (was: Re: Why so much emoji nonsense?)
In-Reply-To: <20180216160040.1e630740@JRWUBU2>
References: <20180216160040.1e630740@JRWUBU2>
Message-ID:

On 2/16/2018 8:00 AM, Richard Wordingham via Unicode wrote:

> A more portable solution for ideographs is to render an Ideographic
> Description Sequence (IDS) as an approximation to the character it
> describes. The Unicode Standard carefully does not prohibit so doing,
> and a similar scheme is being developed for blocks of Egyptian
> Hieroglyphs, and has been proposed for Mayan as well.

A point of clarification: The IDC's (ideographic description characters)
are explicitly *not* format controls. They are visible graphic symbols
that sit visibly in text.
There is a specified syntax for stringing them together into sequences
with ideographic characters and radicals to *suggest* a specific form of
CJK (or other ideographic) character assembled from the pieces in a
certain order -- but there is no implication that a generic text layout
process *should* attempt to assemble that described character as a
single glyph. IDC's are a *description* methodology. IDC's are
General_Category=So.

The Egyptian quadrat controls, on the other hand, are full-fledged
Unicode format controls. They do not just describe hieroglyphic quadrats
-- they are intended to be implemented in text format software and
OpenType fonts to actually construct and display fully-formed quadrats
on the fly. They will be General_Category=Cf.

Mayan will work in a similar manner, although the specification of the
sign list and exact required set of format controls is not yet as mature
as that for Egyptian.

--Ken

From unicode at unicode.org Fri Feb 16 10:41:47 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Fri, 16 Feb 2018 08:41:47 -0800
Subject: IDC's versus Egyptian format controls
In-Reply-To: References: <20180216160040.1e630740@JRWUBU2>
Message-ID:

On 2/16/2018 8:22 AM, Ken Whistler wrote:

> The Egyptian quadrat controls, on the other hand, are full-fledged
> Unicode format controls.

One more point of distinction: The (gc=So) IDC's follow a syntax that
uses Polish notation order for the descriptive operators (inherited from
the intended use in GB 18030, where these came from in the first place).
That order minimizes ambiguity of representation without requiring
bracketing, but it has the disadvantage of being hard for humans to
interpret easily in complicated cases.

The Egyptian format controls use an infix notation, instead. That
follows current Egyptologists' practice of representing quadrats with
MdC conventions. It is also a better order for the layout engine
processing. The disadvantage is that it requires a bracketing notation
to deal with ambiguities of operator precedence in complicated cases.

--Ken

From unicode at unicode.org Fri Feb 16 12:20:00 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 16 Feb 2018 18:20:00 +0000
Subject: IDC's versus Egyptian format controls
In-Reply-To: References: <20180216160040.1e630740@JRWUBU2>
Message-ID: <20180216182000.5c2a4431@JRWUBU2>

On Fri, 16 Feb 2018 08:22:23 -0800
Ken Whistler via Unicode wrote:

> On 2/16/2018 8:00 AM, Richard Wordingham via Unicode wrote:
>
> > A more portable solution for ideographs is to render an Ideographic
> > Description Sequence (IDS) as an approximation to the character it
> > describes. The Unicode Standard carefully does not prohibit so
> > doing, and a similar scheme is being developed for blocks of
> > Egyptian Hieroglyphs, and has been proposed for Mayan as well.
>
> A point of clarification: The IDC's (ideographic description
> characters) are explicitly *not* format controls. They are visible
> graphic symbols that sit visibly in text.

That doesn't square well with, "An implementation may render a valid
Ideographic Description Sequence either by rendering the individual
characters separately or by parsing the Ideographic Description
Sequence and drawing the ideograph so described." (TUS 10.0 p704, in
Section 18.2)

The reason for comparison with Egyptian quadrat controls is the scaling
issue. The thickness of brush strokes should be consistent across the
ideograph, which increases the complexity of a font that parses the
descriptions.
Outline hieroglyphic quadrats have the same problem. However, as I said
before, there is a good argument for rendering an IDS inelegantly.

Richard.

From unicode at unicode.org Fri Feb 16 12:44:29 2018
From: unicode at unicode.org (Manish Goregaokar via Unicode)
Date: Fri, 16 Feb 2018 10:44:29 -0800
Subject: Unicode of Death 2.0
In-Reply-To: References: Message-ID:

FWIW I dissected the crashing strings; it's basically all
<consonant1, virama, consonant, zwnj, vowel> sequences in Telugu,
Bengali, Devanagari where the consonant is suffix-joining (ra in
Devanagari, jo and ro in Bengali, and all Telugu consonants), the vowel
is not Bengali au or o / Telugu ai, and if the second consonant is ra/ro
the first one is not also ra/ro (or ro-with-line-through-it).

https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/

-Manish

On Thu, Feb 15, 2018 at 10:58 AM, Philippe Verdy via Unicode <
unicode at unicode.org> wrote:

> That's probably not a bug of Unicode but of MacOS/iOS text renderers
> with some fonts using advanced composition features.
>
> Similar bugs could as well affect the new advanced features added in
> Windows or Android to support multicolored emojis, variable fonts,
> contextual glyph transforms, style variants, or more font formats (not
> just OpenType); the bug may also be in the graphic renderer (incorrect
> clipping when drawing the glyph into the glyph cache, with buffer
> overflows possibly caused by incorrectly computed splines), and it
> could be in the display driver (or in the hardware accelerator having
> some limitations on the complexity of multipolygons to fill and to
> antialias), causing some infinite recursion loop, or too deep recursion
> exhausting the stack limit.
>
> Finally, the bug could be in the OpenType hinting engine moving some
> points outside the clipping area (the math theory may say that such
> placement of a point outside the clipping area may be impossible, but
> various mathematical simplifications and shortcuts are used to simplify
> or accelerate the rendering, at the price of some quirks). Even the SVG
> standard (in constant evolution) could be affected as well in its
> implementation.
>
> There are tons of possible bugs here.
>
> 2018-02-15 18:21 GMT+01:00 James Kass via Unicode :
>
>> This article:
>> https://techcrunch.com/2018/02/15/iphone-text-bomb-ios-mac-crash-apple/?ncid=mobilenavtrend
>>
>> The single Unicode symbol referred to in the article results from a
>> string of Telugu characters. The article doesn't list or display the
>> characters, so Mac users can visit the above link. A link in one of
>> the comments leads to a page which does display the characters.
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri Feb 16 13:00:45 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Fri, 16 Feb 2018 11:00:45 -0800
Subject: IDC's versus Egyptian format controls
In-Reply-To: <20180216182000.5c2a4431@JRWUBU2>
References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2>
Message-ID: <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com>

On 2/16/2018 10:20 AM, Richard Wordingham via Unicode wrote:

> On Fri, 16 Feb 2018 08:22:23 -0800
> Ken Whistler via Unicode wrote:
>
>> A point of clarification: The IDC's (ideographic description
>> characters) are explicitly *not* format controls. They are visible
>> graphic symbols that sit visibly in text.
>
> That doesn't square well with, "An implementation may render a valid
> Ideographic Description Sequence either by rendering the individual
> characters separately or by parsing the Ideographic Description
> Sequence and drawing the ideograph so described." (TUS 10.0 p704, in
> Section 18.2)

Should we ask to make the default behavior (visible IDS characters) more
explicit?

I don't mind allowing the other as an option (it's kind of the reverse of
the "show invisible" mode, which we also allow, but for which we do have
a clear default).

> The reason for comparison with Egyptian quadrat controls is the scaling
> issue. The thickness of brush strokes should be consistent across the
> ideograph, which increases the complexity of a font that parses the
> descriptions. Outline hieroglyphic quadrats have the same problem.
> However, as I said before, there is a good argument for rendering an
> IDS inelegantly.
>
> Richard.

From unicode at unicode.org Fri Feb 16 13:10:29 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Fri, 16 Feb 2018 11:10:29 -0800
Subject: IDC's versus Egyptian format controls
In-Reply-To: <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com>
References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com>
Message-ID:

On 2/16/2018 11:00 AM, Asmus Freytag via Unicode wrote:

>> That doesn't square well with, "An implementation *may* render a valid
>> Ideographic Description Sequence either by rendering the individual
>> characters separately or by parsing the Ideographic Description
>> Sequence and drawing the ideograph so described." (TUS 10.0 p704, in
>> Section 18.2)

Emphasis on the "may". In point of fact, no widespread layout engine or
set of fonts does parse IDS'es to turn them into single ideographs for
display. That would be a highly specialized display.

> Should we ask to make the default behavior (visible IDS characters)
> more explicit?

Ask away.

--Ken

> I don't mind allowing the other as an option (it's kind of the reverse
> of the "show invisible" mode, which we also allow, but for which we do
> have a clear default).

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri Feb 16 13:28:13 2018
From: unicode at unicode.org (Asmus Freytag (c) via Unicode)
Date: Fri, 16 Feb 2018 11:28:13 -0800
Subject: IDC's versus Egyptian format controls
In-Reply-To: References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com>
Message-ID: <86554342-00d5-8fd8-ad80-3cb1cca6f50b@ix.netcom.com>

On 2/16/2018 11:10 AM, Ken Whistler wrote:

It's the "may either" which is not the same as "may also".

A./

> Emphasis on the "may". In point of fact, no widespread layout engine
> or set of fonts does parse IDS'es to turn them into single ideographs
> for display. That would be a highly specialized display.
>
>> Should we ask to make the default behavior (visible IDS characters)
>> more explicit?
>
> Ask away.
>
> --Ken

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri Feb 16 16:27:24 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 16 Feb 2018 22:27:24 +0000
Subject: IDC's versus Egyptian format controls
In-Reply-To: References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com>
Message-ID: <20180216222724.00b2cbb4@JRWUBU2>

On Fri, 16 Feb 2018 11:10:29 -0800
Ken Whistler via Unicode wrote:

> Emphasis on the "may". In point of fact, no widespread layout engine
> or set of fonts does parse IDS'es to turn them into single ideographs
> for display. That would be a highly specialized display.

And doing it reasonably well could be a lot of work. However, I don't
see any good reason to discourage fonts from doing it by default, which
is what is now being proposed.

> > Should we ask to make the default behavior (visible IDS characters)
> > more explicit?
>
> Ask away.
>
> > I don't mind allowing the other as an option (it's kind of the
> > reverse of the "show invisible" mode, which we also allow, but for
> > which we do have a clear default).

If that analogy is to be enforced, that strikes me as a major change to
the allowed meaning of the IDCs. A default form should be the natural
form for reading, and it has already been stated that visible IDCs are
not intuitive. And I thought I was joking when I suggested that Unicode
was deliberately designed to stifle innovation.

Now, one could suggest that IDCs should be retained as sutures in parsed
IDSes. However, even that is a change in the character identity.

Having visible IDCs is rather like making every Devanagari virama
visible. It's an admission that the font cannot cope. For IDSes, it is
not unreasonable for a font to lack the ability to parse them.

Richard.
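(A small aside for those following the parsing subthread: the prefix
syntax Ken describes is mechanically easy to check. Below is a minimal
sketch in Python, using the published IDC arities, U+2FF0..U+2FFB, where
U+2FF2 and U+2FF3 are ternary and the rest binary. It simplifies by
treating any non-IDC code point as a valid operand; the function names
are just illustrative.)

    # Arity of each Ideographic Description Character (U+2FF0..U+2FFB):
    # U+2FF2 and U+2FF3 take three operands, the other ten take two.
    IDC_ARITY = {chr(cp): 3 if cp in (0x2FF2, 0x2FF3) else 2
                 for cp in range(0x2FF0, 0x2FFC)}

    def parse_ids(s, i=0):
        """Consume one IDS starting at index i; return the index past it.

        An IDS is either a single non-IDC character (an ideograph,
        radical, or stroke) or an IDC followed by its operands, in
        Polish (prefix) order.
        """
        if i >= len(s):
            raise ValueError("truncated IDS")
        c = s[i]
        i += 1
        for _ in range(IDC_ARITY.get(c, 0)):  # plain characters take none
            i = parse_ids(s, i)
        return i

    def is_valid_ids(s):
        """True if s is exactly one well-formed IDS."""
        try:
            return parse_ids(s) == len(s)
        except ValueError:
            return False

    # U+2FF0 U+6728 U+6728 describes U+6797 (two trees side by side).
    assert is_valid_ids("\u2FF0\u6728\u6728")
    assert not is_valid_ids("\u2FF0\u6728")   # missing second operand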
From unicode at unicode.org Fri Feb 16 17:25:22 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Fri, 16 Feb 2018 15:25:22 -0800
Subject: IDC's versus Egyptian format controls
In-Reply-To: <20180216222724.00b2cbb4@JRWUBU2>
References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2>
Message-ID:

Richard Wordingham wrote,

> And doing it reasonably well could be a lot of work.
> However, I don't see any good reason to discourage
> fonts from doing it by default, which is what is now
> being proposed.

Some people studying Han characters use the IDCs to illustrate the
ideographs and their components for various purposes. For example:

U-0002A8B8 ?? ???
U-0002A8B9 ?? ???
U-0002A8BA ?? ???
U-0002A8BB ?? ???
U-0002A8BC ?? ???
U-0002A8BD ?? ???
U-0002A8BE ?? ???
U-0002A8BF ?? ???
U-0002A8C0 ?? ???
U-0002A8C1 ?? ???

It would probably be disconcerting if the display of those sequences
changed into their respective characters overnight. Such usage might be
limited to scholars and students, and a desire for default composition
might outweigh scholarly concerns, but IMHO to say that 'doing it
reasonably well at the font level would be a lot of work' is a vast
understatement.

From unicode at unicode.org Fri Feb 16 18:48:10 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 17 Feb 2018 00:48:10 +0000
Subject: IDC's versus Egyptian format controls
In-Reply-To: References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2>
Message-ID: <20180217004810.6238bf5c@JRWUBU2>

On Fri, 16 Feb 2018 15:25:22 -0800
James Kass via Unicode wrote:

> Some people studying Han characters use the IDCs to illustrate the
> ideographs and their components for various purposes. [...]
>
> It would probably be disconcerting if the display of those
> sequences changed into their respective characters overnight.

And it would be extremely disconcerting if this post was suddenly
rendered in mediaeval black letters, but in theory that could happen.
One can argue that once the compound ideograph has been encoded, the IDS
should no longer be interpreted. However, I think it will be difficult
to do this in practice.

> Such usage might be limited to scholars and students, and a desire for
> default composition might outweigh scholarly concerns,

The lack of mix and match control of the font choices for 'plain text'
presentations is disappointing. We probably need a pair of OpenType
features, one to discourage and one to encourage interpretation of
IDSes. For web pages and PDFs one should be able to specify the font or
fonts, and OpenType features are increasingly being supported.

> but IMHO to say that 'doing it reasonably well at the font level would
> be a lot of work' is a vast understatement.

That was my first thought, but I had worried that I might have been
overestimating. For the examples you give above, I strongly suspect
that Code2001 already contains the requisite glyph halves.

There is another possible use of the latitude given by TUS 5.0 to 10.0
and possibly earlier.
I can certainly imagine a case where someone writes a font so that an
unencoded character may be manipulated like any other character. He has
two choices - he can put it in the PUA, or he can make it the ligature
for the IDS. If he chooses the former, and then the text and font are
separated, the recipient of the text is left with tofu for the
character. If he chooses the latter, the recipient of the text would at
least have the IDS. I think the latter outcome is the better outcome.

Richard.

From unicode at unicode.org Fri Feb 16 19:34:22 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Fri, 16 Feb 2018 17:34:22 -0800
Subject: IDC's versus Egyptian format controls
In-Reply-To: <20180217004810.6238bf5c@JRWUBU2>
References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2> <20180217004810.6238bf5c@JRWUBU2>
Message-ID:

Richard Wordingham wrote:

> There is another possible use of the latitude given by TUS 5.0 to 10.0
> and possibly earlier. I can certainly imagine a case where someone
> writes a font so that an unencoded character may be manipulated like
> any other character. He has two choices - he can put it in the PUA, or
> he can make it the ligature for the IDS. If he chooses the former, and
> then the text and font are separated, the recipient of the text is left
> with tofu for the character. If he chooses the latter, the recipient
> of the text would at least have the IDS. I think the latter outcome is
> the better outcome.

Yes, I think it's much better to leave the unencoded ideograph unmapped
(not assigned within the font to a Unicode code point) and treated as a
font ligature. If the unencoded ideograph is encoded, then the ligature
glyph would be mapped to the actual character, of course.

When estimating the complexity of the look-up tables involved, please
keep in mind that, as the complexity of the ideograph increases, so does
the number of different ways of breaking down that ideograph. And all of
those ways would need to be accommodated in the look-up tables. For
example, U+2A7FF "??", according to my notes, can be described in two
pieces (???). The right half "?" can be further broken down into three
components (?????). The left half could also be broken down further.

From unicode at unicode.org Fri Feb 16 20:05:41 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Fri, 16 Feb 2018 18:05:41 -0800
Subject: IDC's versus Egyptian format controls
In-Reply-To: <20180217004810.6238bf5c@JRWUBU2>
References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2> <20180217004810.6238bf5c@JRWUBU2>
Message-ID:

Richard Wordingham wrote:

> One can argue that once the compound ideograph has been encoded, the
> IDS should no longer be interpreted.

Wouldn't that break existing data? If this sort of thing were done at
OS or app level, it might be possible to replace the IDS string with the
appropriate character upon file save in some kind of automatic fashion.
But I'd sure hate for that to happen to any of my text files without
warning.
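(To make that worry concrete, here is a minimal sketch of such a
save-time replacement pass. The mapping table and the confirmation
callback are invented for the example; a real table would have to come
from the encoding data, and the callback is exactly the "warning" that a
fully automatic pass would otherwise skip.)

    # Illustrative only: IDSes whose described characters are encoded.
    # The single entry uses an already-encoded character as a stand-in.
    IDS_TO_CHAR = {
        "\u2FF0\u6728\u6728": "\u6797",  # IDS for 'two trees' -> U+6797
    }

    def replace_ids_on_save(text, confirm):
        """Replace known IDSes by their encoded characters, longest
        match first, asking confirm(ids, char) before each change."""
        entries = sorted(IDS_TO_CHAR.items(),
                         key=lambda kv: len(kv[0]), reverse=True)
        for ids, char in entries:
            if ids in text and confirm(ids, char):
                text = text.replace(ids, char)
        return text

    # A cautious caller passes an interactive prompt as confirm;
    # passing (lambda ids, char: False) leaves the file untouched.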
From unicode at unicode.org Fri Feb 16 20:08:54 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 16 Feb 2018 18:08:54 -0800 Subject: IDC's versus Egyptian format controls In-Reply-To: References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2> <20180217004810.6238bf5c@JRWUBU2> Message-ID: > Wouldn't that break existing data? Functionality, not data. From unicode at unicode.org Sat Feb 17 03:43:58 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 17 Feb 2018 09:43:58 +0000 Subject: IDC's versus Egyptian format controls In-Reply-To: References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2> <20180217004810.6238bf5c@JRWUBU2> Message-ID: <20180217094358.05292de8@JRWUBU2> On Fri, 16 Feb 2018 18:05:41 -0800 James Kass via Unicode wrote: > Richard Wordingham wrote: > > > One can argue that once the compound ideograph have been encoded, > > the IDS should no longer be interpreted. > > Wouldn't that break existing data? If this sort of thing were done at > OS or app level, it might be possible to replace the IDS string with > the appropriate character upon file save in some kind of automatic > fashion. But I'd sure hate for that to happen to any of my text files > without warning. TUS allows one to use an IDS in place of an unencoded character, but not in place of an encoded character. Once the character is encoded, the IDS substitutions should be weeded out. Of course, there is the problem that upgrades to a new version of Unicode can be a mosaic process, with data tables, fonts and rendering engines out of alignment. At least it's a graceful break, unlike the probability of PUA mappings simply vanishing or, worse, changing. Ideally, searching as just searching would use a collation to equate character and IDS. There may be a problem in that two distinct characters could have the same IDS. Search and automatic replacement is more of a problem. I strongly suspect that the rule not to use an IDS in place of an encoded character would only be applied to an input method. There is the very common interpretation that 'should' in the principal clause of a requirement cancels the requirement; formally the justification is that it would be too much work. Enforcing the rule for an unsupported encoded character would be a hostile act. Richard. From unicode at unicode.org Sat Feb 17 06:39:04 2018 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Sat, 17 Feb 2018 13:39:04 +0100 (CET) Subject: Why so much emoji nonsense? In-Reply-To: References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> Message-ID: <1079226995.76169.1518871144119@ox.hosteurope.de> James Kass: > Asmus Freytag wrote: > >>> Words suffice. We go by what people actually say rather than whatever >>> they might have meant. When we read text, we go by what's written. > >> That is a worthy opinion, but not one that is shared, either in principle >> or in lived practice (...) by vast numbers of people. > > True, but there are also plenty of people who strive to say what they > mean and mean what they say. It's astonishing how you apparently ignore how human communication actually works. 
We are not machines where the Shannon-Weaver model of a message encoded
by the sender and accurately decoded by the receiver applies (with some
correction for errors induced by noise in the transmission channel).
Communication, even written, is a very complex process that involves a
lot of unspoken assumptions and external knowledge on all sides. Words
do not suffice. We do not go simply by what's written.

Stuff like typography or emoji can improve the effectiveness and
efficiency of textual communication a lot. (And if used badly or
maliciously they can deter it as well.)

From unicode at unicode.org Sat Feb 17 12:36:08 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sat, 17 Feb 2018 19:36:08 +0100 (CET)
Subject: Why so much emoji nonsense?
In-Reply-To: <1079226995.76169.1518871144119@ox.hosteurope.de>
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <1079226995.76169.1518871144119@ox.hosteurope.de>
Message-ID: <809239934.8782.1518892568487.JavaMail.www@wwinf1h28>

On 17/02/18 13:43, Christoph Päper via Unicode wrote:
[...]
> Stuff like typography or emoji can improve the effectiveness and
> efficiency of textual communication a lot. (And if used badly or
> maliciously they can deter it as well.)

Since poor typography can deteriorate our communication as well, what
people also need is a keyboard layout that can be left on all the time
while giving straightforward access to what we need. Here we'll get the
letter apostrophe and curly quotes, along with acute accent, tilde, and
diaeresis/umlaut, in the Base shift state:

http://charupdate.info/doc/kbenintu/

As already mailed to CLDR-Users, feedback is always welcome.

http://unicode.org/pipermail/cldr-users/2018-February/000737.html

Regards,

Marcel

From unicode at unicode.org Sat Feb 17 13:49:34 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Sat, 17 Feb 2018 12:49:34 -0700
Subject: Unicode of Death 2.0
In-Reply-To: References: Message-ID:

Manish Goregaokar wrote:

> FWIW I dissected the crashing strings; it's basically all
> <consonant1, virama, consonant, zwnj, vowel> sequences in Telugu,
> Bengali, Devanagari where the consonant is suffix-joining (ra in
> Devanagari, jo and ro in Bengali, and all Telugu consonants), the
> vowel is not Bengali au or o / Telugu ai, and if the second consonant
> is ra/ro the first one is not also ra/ro (or ro-with-line-through-it).
>
> https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/

Thanks for this very detailed and informative blog post. It's certainly
better than "probably not a bug of Unicode," implying an outside chance
that it might be.

I've linked Manish's post on FB as a reply to one of those mainstream
articles that repeatedly calls the conjunct a "single character,"
written by a staffer who couldn't be bothered to find out how a writing
system used by 78 million people works.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Sat Feb 17 14:22:55 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sat, 17 Feb 2018 21:22:55 +0100
Subject: Unicode of Death 2.0
In-Reply-To: References: Message-ID:

I would have liked your invented term "left-joining consonants" to take
the usual name "phala forms" (for RA or JA/JO after a virama, generally
named "raphala" or "japhala/jophala").
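(For concreteness: the widely circulated Telugu crasher is one instance
of the pattern Manish describes, assuming the usual identification of
the sequence from the press reports. It can be rebuilt from code points
as below, for testing only on a system you can afford to hang.)

    # <consonant1, virama, consonant, ZWNJ, vowel sign>, here in Telugu.
    crasher = "".join([
        "\u0C1C",  # TELUGU LETTER JA
        "\u0C4D",  # TELUGU SIGN VIRAMA
        "\u0C1E",  # TELUGU LETTER NYA
        "\u200C",  # ZERO WIDTH NON-JOINER
        "\u0C3E",  # TELUGU VOWEL SIGN AA
    ])
    print([hex(ord(c)) for c in crasher])  # inspect without rendering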
And the reason this bug does not occur with some vowels is that these
are vowels in two parts, which are first decomposed into two separate
glyphs and reordered in the glyph buffer, while other vowels do not need
this prior mapping and keep their initial direct mapping from their code
points in fonts. This means the bug has to do with the way the ZWNJ
handling looks for the glyphs of the vowels in the glyph buffer and not
in the initial code point buffer: there's some desynchronization, and
most probably an uninitialized data field (for the lookup made in
handling ZWNJ) if no vowel decomposition was done. (The same data field
is correctly initialized when it is the first consonant which takes an
alternate form before a virama, as in most Indic consonant clusters,
because a glyph buffer is then created.)

Now we have some hints about why the bug does not occur in Kannada or
Khmer: there a glyph buffer is always created, but some shortcut was made
in Devanagari, Bengali, and Telugu to allow processing clusters faster
without always creating a glyph buffer (which allows reordering glyphs
before positioning them), working directly on the code point stream
instead.

So it seems related to the fact that OpenType fonts do not need to
include rules for glyph substitution: the phala forms are represented
without any glyph substitution, by mapping them directly in a separate
table for the consonants. Because there has been no code-to-glyph
substitution, the glyph buffer is not created; but then, when processing
the ZWNJ, the engine looks for data in a glyph buffer that has still not
been initialized (and this is specific to the renderers implemented by
Apple in iOS and MacOS). The bug does not occur if another text
rendering engine is used (e.g. in non-Apple web browsers).

2018-02-16 19:44 GMT+01:00 Manish Goregaokar :

> FWIW I dissected the crashing strings; it's basically all
> <consonant1, virama, consonant, zwnj, vowel> sequences in Telugu,
> Bengali, Devanagari where the consonant is suffix-joining [...]
>
> https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/
>
> -Manish

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sat Feb 17 14:54:57 2018
From: unicode at unicode.org (Manish Goregaokar via Unicode)
Date: Sat, 17 Feb 2018 12:54:57 -0800
Subject: Unicode of Death 2.0
In-Reply-To: References: Message-ID:

Heh, I wasn't aware of the word "phala-form", though that seems
Bengali-specific?

Interesting observation about the vowel glyphs, I'll mention this in the
post. Initially I missed this because I hadn't realized that the Bengali
o vowel crashed (which made me discount this).

Thanks!

-Manish

On Sat, Feb 17, 2018 at 12:22 PM, Philippe Verdy wrote:

> I would have liked your invented term "left-joining consonants" to take
> the usual name "phala forms" (for RA or JA/JO after a virama, generally
> named "raphala" or "japhala/jophala"). [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sat Feb 17 16:09:31 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sat, 17 Feb 2018 23:09:31 +0100 (CET)
Subject: Unicode of Death 2.0
In-Reply-To: References: Message-ID: <269119379.10220.1518905371457.JavaMail.www@wwinf1h28>

On 17/02/18 21:01, Doug Ewell via Unicode wrote:
[...]
> I've linked Manish's post on FB as a reply to one of those mainstream
> articles that repeatedly calls the conjunct a "single character,"
> written by a staffer who couldn't be bothered to find out how a writing
> system used by 78 million people works.

That's how what is initially a faulty Unicode implementation gets
distorted when brought to the attention of inadvertent customers. It
should be surprising, but isn't really. What about "Apple of Death"?
Looks like something well-known.

Made curious again by Manish's blog post and Doug's comment, I've tried
it in my browser: the Telugu cluster is as inoffensive as any Unicode
characters! Works just fine on Windows. There are cases where I'm not
tempted to cease being conservative :)

Regards,

Marcel

From unicode at unicode.org Sat Feb 17 16:18:25 2018
From: unicode at unicode.org (Adam Borowski via Unicode)
Date: Sat, 17 Feb 2018 23:18:25 +0100
Subject: metric for block coverage
Message-ID: <20180217221825.wovnzpnzftpsjp37@angband.pl>

Hi!
As a part of Debian fonts team work, we're trying to improve fonts
review: ways to organize them, add metadata, pick which fonts are
installed by default and/or recommended to users, etc.

I'm looking for a way to determine a font's coverage of available
scripts. It's probably reasonable to do this per Unicode block. Also,
it's a safe assumption that a font which doesn't know a codepoint can do
no complex shaping of such a glyph, thus looking at just codepoints
should be adequate for our purposes.

A naïve way would be to count codepoints present in the font vs the
number of all codepoints in the block. Alas, there's way too much chaff
for such an approach to be reasonable: ? or ? count the same as LATIN
TURNED CAPITAL LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH
CARON.

Another idea would be giving every codepoint a weight equal to the
number of languages which currently use such a letter. Too bad, that
wouldn't work for symbols, or for dead scripts: a good runic font will
have a complete coverage of Elder Futhark, Anglo-Saxon, Younger and
medieval runes, while only a completionist would care about the Franks
Casket or Tolkien's inventions.

I don't think I'm the first to have this question. Any suggestions?

????!
--
??????? ??????? A dumb species has no way to open a tuna can.
??????? A smart species invents a can opener.
??????? A master species delegates.

From unicode at unicode.org Sat Feb 17 18:30:09 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 18 Feb 2018 01:30:09 +0100
Subject: Unicode of Death 2.0
In-Reply-To: References: Message-ID:

My opinion about this bug is that Apple's text renderer dynamically
allocates a glyph buffer only when needed (lazily), but a test is
missing for the lazy construction of this buffer (which is not needed
for most texts that require no glyph substitution or reordering, where a
single accessor from the code point can find the glyph data directly by
lookup in font tables), and this is causing a null pointer exception at
run time.
The bug occurs effectively when processing the vowel that occurs after the ZWNJ, if the code assumes that there's a glyphs buffer already constructed for the cluster, in order to place the vowel over the correct glyph (which may have been reordered in that buffer). Microsoft's text renderer, or other engines use do not delay the constructiuon of the glyphs buffer, which can be reused for processing the rest of the text, provided it is correctly reset after processing a cluster. 2018-02-17 21:54 GMT+01:00 Manish Goregaokar : > Heh, I wasn't aware of the word "phala-form", though that seems > Bengali-specific? > > Interesting observation about the vowel glyphs, I'll mention this in the > post. Initially I missed this because I hadn't realized that the bengali o > vowel crashed (which made me discount this). > > > Thanks! > > -Manish > > On Sat, Feb 17, 2018 at 12:22 PM, Philippe Verdy > wrote: > >> I would have liked that your invented term of "left-joining consonants" >> took the usual name "phala forms" (to represent RA or JA/JO after a virama, >> generally named "raphala" or "japhala/jophala"). >> >> And why this bug does not occur with some vowels is because these are >> vowels in two parts, that are first decomposed into two separate glyphs >> reordered in the buffer of glyphs, while other vowels do not need this >> prior mapping and keep their initial direct mapping from their codepoints >> in fonts, which means that this has to do to the way the ZWNJ looks for the >> glyphs of the vowels in the glyphs buffer and not in the initial codepoints >> buffer: there's some desynchronization, and more probably an uninitialized >> data field (for the lookup made in handling ZWNJ) if no vowel decomposition >> was done (the same data field is correctly initialized when it is the first >> consonnant which takes an alternate form before a virama, like in most >> Indic consonnant clusters, because the a glyph buffer is created. >> >> Now we have some hints about why the bug does not occur in Kannada or >> Khmer: a glyph buffer is always created, but there was some shortcut made >> in Devanagari, Bengali, and Telugu to allow processing clusters faster >> without having to create always a gyphs buffer (to allow reordering glyphs >> before positioning them), and working directly on the codepoints streams. >> >> So it seems related to the fact that OpenType fonts do not need to >> include rules for glyph substitution, but the PHALA forms are represented >> without any glyph substitution, by mapping directly the phala forms in a >> separate table for the consonants. Because there's been no code to glyph >> subtitution, the glyph buffer is not created, but then when processing the >> ZWNJ, it looks for data in a glyph buffer that has still not be initialized >> (and this is specific to the renderers implemented by Apple in iOS and >> MacOS). This bug does not occur if another text rendering engine is used >> (e.g. in non-Apple web browsers). >> >> >> 2018-02-16 19:44 GMT+01:00 Manish Goregaokar : >> >>> FWIW I dissected the crashing strings, it's basically all >> virama, consonant, zwnj, vowel> sequences in Telugu, Bengali, Devanagari >>> where the consonant is suffix-joining (ra in Devanagari, jo and ro in >>> Bengali, and all Telugu consonants), the vowel is not Bengali au or o / >>> Telugu ai, and if the second consonant is ra/ro the first one is not also >>> ra/ro (or ro-with-line-through-it). 
>>> >>> https://manishearth.github.io/blog/2018/02/15/picking-apart- >>> the-crashing-ios-string/ >>> >>> -Manish >>> >>> On Thu, Feb 15, 2018 at 10:58 AM, Philippe Verdy via Unicode < >>> unicode at unicode.org> wrote: >>> >>>> That's probably not a bug of Unicode but of MacOS/iOS text renderers >>>> with some fonts using advanced composition feature. >>>> >>>> Similar bugs could as well the new advanced features added in Windows >>>> or Android to support multicolored emojis, variable fonts, contextual glyph >>>> transforms, style variants, or more font formats (not just OpenType); the >>>> bug may also be in the graphic renderer (incorrect clipping when drawing >>>> the glyph into the glyph cache, with buffer overflows possibly caused by >>>> incorrectly computed splines), and it could be in the display driver (or in >>>> the hardware accelerator having some limitations on the compelxity of >>>> multipolygons to fill and to antialias), causing some infinite recursion >>>> loop, or too deep recursion exhausting the stack limit; >>>> >>>> Finally the bug could be in the OpenType hinting engine moving some >>>> points outside the clipping area (the math theory may say that such >>>> plcement of a point outside the clipping area may be impossible, but >>>> various mathematical simplifcations and shortcuts are used to simplify or >>>> accelerate the rendering, at the price of some quirks. Even the SVG >>>> standard (in constant evolution) could be affected as well in its >>>> implementation. >>>> >>>> There are tons of possible bugs here. >>>> >>>> 2018-02-15 18:21 GMT+01:00 James Kass via Unicode >>>> : >>>> >>>>> This article: >>>>> https://techcrunch.com/2018/02/15/iphone-text-bomb-ios-mac-c >>>>> rash-apple/?ncid=mobilenavtrend >>>>> >>>>> The single Unicode symbol referred to in the article results from a >>>>> string of Telugu characters. The article doesn't list or display the >>>>> characters, so Mac users can visit the above link. A link in one of >>>>> the comments leads to a page which does display the characters. >>>>> >>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Feb 17 18:36:22 2018 From: unicode at unicode.org (David Starner via Unicode) Date: Sun, 18 Feb 2018 00:36:22 +0000 Subject: metric for block coverage In-Reply-To: <20180217221825.wovnzpnzftpsjp37@angband.pl> References: <20180217221825.wovnzpnzftpsjp37@angband.pl> Message-ID: On Sat, Feb 17, 2018 at 3:30 PM Adam Borowski via Unicode < unicode at unicode.org> wrote: > ? or ? count the same as LATIN TURNED CAPITAL LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH CARON. ? is in Latin-1, and ? is in Latin-A; the first is essential, even in its marginal characters, and the second is pretty consistently useful in the modern world. I don't see the problem or solution here; if something supports a good chunk of the Arabic block, then it supports Arabic, and if you need Persian and it supports Urdu instead, or vice versa, that's no comfort. Too bad, that wouldn't work for symbols, or for dead scripts: a good runic > font will have a complete coverage of elder futhark, anglo-saxon, younger > and medieval, while only a completionist would care about franks casket or > Tolkien's inventions. 
> Where as I might guess that the serious users of Tolkien's runic might rival or outnumber the users of the scripts for other purposes; after all, Anglo-Saxon and other languages that appeared in Runic all have standard Latin orthographies that are more suitable for scholarly purposes. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Feb 17 18:40:26 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 18 Feb 2018 01:40:26 +0100 Subject: Unicode of Death 2.0 In-Reply-To: References: Message-ID: An interesting read: https://docs.microsoft.com/fr-fr/typography/script-development/bengali#reor 2018-02-18 1:30 GMT+01:00 Philippe Verdy : > My opinion about this bug is that Apple's text renderer dynamically > allocates a glyphs buffer only when needed (lazily), but a test is missing > for the lazy construction of this buffer (which is not needed for most > texts not needing glyph substitutions or reordering when a single accessor > from the code point can find the glyph data directly by lookup in font > tables) and this is causing a null pointer exception at run time. > > The bug occurs effectively when processing the vowel that occurs after the > ZWNJ, if the code assumes that there's a glyphs buffer already constructed > for the cluster, in order to place the vowel over the correct glyph (which > may have been reordered in that buffer). > > Microsoft's text renderer, or other engines use do not delay the > constructiuon of the glyphs buffer, which can be reused for processing the > rest of the text, provided it is correctly reset after processing a cluster. > > > 2018-02-17 21:54 GMT+01:00 Manish Goregaokar : > >> Heh, I wasn't aware of the word "phala-form", though that seems >> Bengali-specific? >> >> Interesting observation about the vowel glyphs, I'll mention this in the >> post. Initially I missed this because I hadn't realized that the bengali o >> vowel crashed (which made me discount this). >> >> >> Thanks! >> >> -Manish >> >> On Sat, Feb 17, 2018 at 12:22 PM, Philippe Verdy >> wrote: >> >>> I would have liked that your invented term of "left-joining consonants" >>> took the usual name "phala forms" (to represent RA or JA/JO after a virama, >>> generally named "raphala" or "japhala/jophala"). >>> >>> And why this bug does not occur with some vowels is because these are >>> vowels in two parts, that are first decomposed into two separate glyphs >>> reordered in the buffer of glyphs, while other vowels do not need this >>> prior mapping and keep their initial direct mapping from their codepoints >>> in fonts, which means that this has to do to the way the ZWNJ looks for the >>> glyphs of the vowels in the glyphs buffer and not in the initial codepoints >>> buffer: there's some desynchronization, and more probably an uninitialized >>> data field (for the lookup made in handling ZWNJ) if no vowel decomposition >>> was done (the same data field is correctly initialized when it is the first >>> consonnant which takes an alternate form before a virama, like in most >>> Indic consonnant clusters, because the a glyph buffer is created. 
>>> >>> Now we have some hints about why the bug does not occur in Kannada or >>> Khmer: a glyph buffer is always created, but there was some shortcut made >>> in Devanagari, Bengali, and Telugu to allow processing clusters faster >>> without having to create always a gyphs buffer (to allow reordering glyphs >>> before positioning them), and working directly on the codepoints streams. >>> >>> So it seems related to the fact that OpenType fonts do not need to >>> include rules for glyph substitution, but the PHALA forms are represented >>> without any glyph substitution, by mapping directly the phala forms in a >>> separate table for the consonants. Because there's been no code to glyph >>> subtitution, the glyph buffer is not created, but then when processing the >>> ZWNJ, it looks for data in a glyph buffer that has still not be initialized >>> (and this is specific to the renderers implemented by Apple in iOS and >>> MacOS). This bug does not occur if another text rendering engine is used >>> (e.g. in non-Apple web browsers). >>> >>> >>> 2018-02-16 19:44 GMT+01:00 Manish Goregaokar : >>> >>>> FWIW I dissected the crashing strings, it's basically all >>> virama, consonant, zwnj, vowel> sequences in Telugu, Bengali, Devanagari >>>> where the consonant is suffix-joining (ra in Devanagari, jo and ro in >>>> Bengali, and all Telugu consonants), the vowel is not Bengali au or o / >>>> Telugu ai, and if the second consonant is ra/ro the first one is not also >>>> ra/ro (or ro-with-line-through-it). >>>> >>>> https://manishearth.github.io/blog/2018/02/15/picking-apart- >>>> the-crashing-ios-string/ >>>> >>>> -Manish >>>> >>>> On Thu, Feb 15, 2018 at 10:58 AM, Philippe Verdy via Unicode < >>>> unicode at unicode.org> wrote: >>>> >>>>> That's probably not a bug of Unicode but of MacOS/iOS text renderers >>>>> with some fonts using advanced composition feature. >>>>> >>>>> Similar bugs could as well the new advanced features added in Windows >>>>> or Android to support multicolored emojis, variable fonts, contextual glyph >>>>> transforms, style variants, or more font formats (not just OpenType); the >>>>> bug may also be in the graphic renderer (incorrect clipping when drawing >>>>> the glyph into the glyph cache, with buffer overflows possibly caused by >>>>> incorrectly computed splines), and it could be in the display driver (or in >>>>> the hardware accelerator having some limitations on the compelxity of >>>>> multipolygons to fill and to antialias), causing some infinite recursion >>>>> loop, or too deep recursion exhausting the stack limit; >>>>> >>>>> Finally the bug could be in the OpenType hinting engine moving some >>>>> points outside the clipping area (the math theory may say that such >>>>> plcement of a point outside the clipping area may be impossible, but >>>>> various mathematical simplifcations and shortcuts are used to simplify or >>>>> accelerate the rendering, at the price of some quirks. Even the SVG >>>>> standard (in constant evolution) could be affected as well in its >>>>> implementation. >>>>> >>>>> There are tons of possible bugs here. >>>>> >>>>> 2018-02-15 18:21 GMT+01:00 James Kass via Unicode >>>> >: >>>>> >>>>>> This article: >>>>>> https://techcrunch.com/2018/02/15/iphone-text-bomb-ios-mac-c >>>>>> rash-apple/?ncid=mobilenavtrend >>>>>> >>>>>> The single Unicode symbol referred to in the article results from a >>>>>> string of Telugu characters. The article doesn't list or display the >>>>>> characters, so Mac users can visit the above link. 
From unicode at unicode.org Sun Feb 18 00:31:12 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sat, 17 Feb 2018 22:31:12 -0800
Subject: Why so much emoji nonsense?
In-Reply-To: <1079226995.76169.1518871144119@ox.hosteurope.de>
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <1079226995.76169.1518871144119@ox.hosteurope.de>
Message-ID: 

Christoph Päper wrote,

> Stuff like typography or emoji can improve the
> effectiveness and efficiency of textual communication
> a lot.

"Given that rich text equals plain text plus added information, the
extra information in rich text can be stripped away to reveal the
"pure" text underneath."

"Plain text must contain enough information to permit the text to be
rendered legibly and nothing more."

"The Unicode Standard encodes plain text."

(Above quotes from The Unicode Standard 5.0, pages 18 and 19)

It's true that added features can make for a better presentation.
Removing the special features shouldn't alter the message. The Unicode
Standard draws the line between minimal legibility and special
features. Emoji are in The Standard and have, therefore, been
determined to be required for minimal legibility. If the emoji
repertoire expands and Unicode does not include the new emoji, then
Unicode cannot be depended upon to exchange legible textual data. The
addition of more emoji to Unicode is inevitable.

From unicode at unicode.org Sun Feb 18 01:25:50 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sat, 17 Feb 2018 23:25:50 -0800
Subject: IDC's versus Egyptian format controls
In-Reply-To: <20180217094358.05292de8@JRWUBU2>
References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2> <20180217004810.6238bf5c@JRWUBU2> <20180217094358.05292de8@JRWUBU2>
Message-ID: 

I apologize for apparently misunderstanding the scope of what was being
proposed.

If a finite set of unencoded Han characters needs to be displayed
correctly using IDSes, then the complexity of the look-up tables
depends upon how many characters are in the set. It would probably
best be handled at the font level, and we shouldn't expect any
mainstream support.

If any reasonable IDS is expected to be displayed as a Han ideograph,
then the project would be vast. It's doable, and I'm sure there would
be several approaches. It would not be feasible at the font level.

I agree that the language in the text cited earlier in this thread
should be strengthened to clarify that visible display of the IDCs is
expected behavior, while enabling higher-level protocols to remain
conformant if they attempt to display constructs in place of IDSes.
But I don't think that the fact that a previously unencoded character
has become encoded should forbid any application from making a display
substitution based on IDSes.
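For the finite-set case, the look-up table could be as small as a map
from IDS strings to precomposed glyphs; a minimal sketch (Python; the
IDS entry, the PUA code point, and the whole mapping are hypothetical
examples, not taken from any real font):

.----
#!/usr/bin/python3
# Sketch of a look-up for a *finite* set of unencoded Han characters:
# each known IDS maps to a precomposed glyph (a PUA code point stands in
# for the glyph here). Every entry below is a hypothetical example.

IDS_TO_GLYPH = {
    # U+2FF0 IDC LEFT TO RIGHT + U+6C35 (water radical) + U+9B5A (fish)
    "\u2FF0\u6C35\u9B5A": "\uE000",
}

def substitute_ids(text):
    """Replace known IDS sequences by their precomposed glyphs;
    unknown IDSes stay visible, IDCs and all."""
    for ids, glyph in IDS_TO_GLYPH.items():
        text = text.replace(ids, glyph)
    return text

print(substitute_ids("\u2FF0\u6C35\u9B5A"))    # -> "\ue000"
`----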
From unicode at unicode.org Sun Feb 18 01:57:43 2018
From: unicode at unicode.org (Manish Goregaokar via Unicode)
Date: Sat, 17 Feb 2018 23:57:43 -0800
Subject: Unicode of Death 2.0
In-Reply-To: 
References: 
Message-ID: 

Ah, looking at that, the OpenType `pstf` feature seems relevant, though
I cannot get it to crash with Gurmukhi (where the consonant ya is a
postform).

-Manish

On Sat, Feb 17, 2018 at 4:40 PM, Philippe Verdy wrote:

> An interesting read:
>
> https://docs.microsoft.com/fr-fr/typography/script-development/bengali#reor

[...]
From unicode at unicode.org Sun Feb 18 02:01:53 2018
From: unicode at unicode.org (Manish Goregaokar via Unicode)
Date: Sun, 18 Feb 2018 00:01:53 -0800
Subject: Unicode of Death 2.0
In-Reply-To: 
References: 
Message-ID: 

Oh, also vatu.

Seems like that ordering algorithm is indeed relevant.

-Manish

On Sat, Feb 17, 2018 at 11:57 PM, Manish Goregaokar wrote:

> Ah, looking at that, the OpenType `pstf` feature seems relevant, though
> I cannot get it to crash with Gurmukhi (where the consonant ya is a
> postform).

[...]
From unicode at unicode.org Sun Feb 18 02:18:01 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 18 Feb 2018 00:18:01 -0800
Subject: Unicode of Death 2.0
In-Reply-To: 
References: 
Message-ID: 

Doug Ewell wrote,

> I've linked Manish's post on FB as a reply to one of those mainstream
> articles that repeatedly calls the conjunct a "single character," written by
> a staffer who couldn't be bothered to find out how a writing system used by
> 78 million people works.

Linking Manish's information in reply was a Good Thing™. A lot of
people aren't even aware that complex scripts exist and may have no
suspicion that any writing system other than their own would work
differently.

From unicode at unicode.org Sun Feb 18 04:14:46 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 18 Feb 2018 02:14:46 -0800
Subject: metric for block coverage
In-Reply-To: <20180217221825.wovnzpnzftpsjp37@angband.pl>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl>
Message-ID: 

Adam Borowski wrote,

> I'm looking for a way to determine a font's coverage of available scripts.
> It's probably reasonable to do this per Unicode block. Also, it's a safe
> assumption that a font which doesn't know a codepoint can do no complex
> shaping of such a glyph, thus looking at just codepoints should be adequate
> for our purposes.

You probably already know that basic script coverage information is
stored internally in OpenType fonts in the OS/2 table.

https://docs.microsoft.com/en-us/typography/opentype/spec/os2

Parsing the bits in the "ulUnicodeRange..." entries may be the
simplest way to get basic script coverage info.

OpenType fonts also include script coverage information in the
OpenType tables. A font with an OpenType table for a script would be
likely to have at least some complex script shaping abilities for that
script.
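For what it's worth, those bits are easy to read mechanically; a
minimal sketch (Python, assuming the fontTools library is installed;
the font file name is a placeholder, and the bit numbering comes from
the OpenType OS/2 table specification):

.----
#!/usr/bin/python3
# Sketch: read the OS/2 ulUnicodeRange coverage bits from a font.
# Assumes fontTools; per the OpenType OS/2 specification, bit 0 is
# Basic Latin and bit 9 is Cyrillic.

from fontTools.ttLib import TTFont

def unicode_range_bits(path):
    os2 = TTFont(path)["OS/2"]
    fields = (os2.ulUnicodeRange1, os2.ulUnicodeRange2,
              os2.ulUnicodeRange3, os2.ulUnicodeRange4)
    bits = set()
    for word, dword in enumerate(fields):
        for bit in range(32):
            if dword & (1 << bit):
                bits.add(word * 32 + bit)
    return bits

bits = unicode_range_bits("SomeFont.ttf")    # hypothetical file name
print("claims Cyrillic:", 9 in bits)         # a single yes/no bit, nothing more
`----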
From unicode at unicode.org Sun Feb 18 05:16:04 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 18 Feb 2018 03:16:04 -0800
Subject: metric for block coverage
In-Reply-To: 
References: <20180217221825.wovnzpnzftpsjp37@angband.pl>
Message-ID: 

> OpenType fonts also include script coverage information in the
> OpenType tables. A font with an OpenType table for a script would be
> likely to have at least some complex script shaping abilities for that
> script.

https://docs.microsoft.com/en-us/typography/opentype/spec/chapter2#slTbl_sRec

From unicode at unicode.org Sun Feb 18 05:26:10 2018
From: unicode at unicode.org (Khaled Hosny via Unicode)
Date: Sun, 18 Feb 2018 13:26:10 +0200
Subject: metric for block coverage
In-Reply-To: 
References: <20180217221825.wovnzpnzftpsjp37@angband.pl>
Message-ID: <20180218112610.GA18088@macbook.localdomain>

On Sun, Feb 18, 2018 at 02:14:46AM -0800, James Kass via Unicode wrote:
> Adam Borowski wrote,
>
> > I'm looking for a way to determine a font's coverage of available scripts.
> > It's probably reasonable to do this per Unicode block. Also, it's a safe
> > assumption that a font which doesn't know a codepoint can do no complex
> > shaping of such a glyph, thus looking at just codepoints should be adequate
> > for our purposes.
>
> You probably already know that basic script coverage information is
> stored internally in OpenType fonts in the OS/2 table.
>
> https://docs.microsoft.com/en-us/typography/opentype/spec/os2
>
> Parsing the bits in the "ulUnicodeRange..." entries may be the
> simplest way to get basic script coverage info.

Though this might not be very reliable, since OpenType does not have a
definition of what it means for a Unicode block to be supported; some
font authoring tools use a percentage, others use the presence of any
characters in the range, and fonts might even provide incorrect data
for any reason.

However, I don't think script or block coverage is that useful; what
users are usually interested in is the language coverage.

Regards,
Khaled

From unicode at unicode.org Sun Feb 18 05:40:32 2018
From: unicode at unicode.org (Adam Borowski via Unicode)
Date: Sun, 18 Feb 2018 12:40:32 +0100
Subject: metric for block coverage
In-Reply-To: 
References: <20180217221825.wovnzpnzftpsjp37@angband.pl>
Message-ID: <20180218114031.dlg4vw3g2lm4iodn@angband.pl>

On Sun, Feb 18, 2018 at 12:36:22AM +0000, David Starner wrote:
> On Sat, Feb 17, 2018 at 3:30 PM Adam Borowski via Unicode <
> unicode at unicode.org> wrote:
> > ? or ? count the same as LATIN TURNED CAPITAL
> > LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH CARON.
>
> ? is in Latin-1, and ? is in Latin-A; the first is essential, even in its
> marginal characters, and the second is pretty consistently useful in the
> modern world. I don't see the problem or solution here; if something
> supports a good chunk of the Arabic block, then it supports Arabic, and if
> you need Persian and it supports Urdu instead, or vice versa, that's no
> comfort.

I probably used a bad example: scripts like Cyrillic (not even
Supplement) include both essential letters and those which are historic
only, or used by old folks in a language spoken by 1000 people who use
Russian (or English...) for all computer use anyway -- all within one
block.

What I'm thinking is that a beautiful font that covers Russian,
Ukrainian, Serbian, Kazakh, Mongolian Cyrillic, etc., should be
recommended to users before one whose only grace is including every
single codepoint.

> Too bad, that wouldn't work for symbols, or for dead scripts: a good runic
> font will have a complete coverage of elder futhark, anglo-saxon, younger
> and medieval, while only a completionist would care about franks casket or
> Tolkien's inventions.
> Whereas I might guess that the serious users of Tolkien's runic might
> rival or outnumber the users of the scripts for other purposes; after all,
> Anglo-Saxon and other languages that appeared in Runic all have standard
> Latin orthographies that are more suitable for scholarly purposes.

Hasn't Tolkien moved to Cirth soon after (excuse my ignorance)?

Not sure if I understand your advice right: you're recommending to
ignore all the complexity and to go with just a raw count of in-block
coverage? This could work: a released font probably has the codepoints
its author considers important.

????!
-- 
??????? ??????? Vat kind uf sufficiently advanced technology iz dis!?
??????? -- Genghis Ht'rok'din
???????

From unicode at unicode.org Sun Feb 18 06:04:16 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 18 Feb 2018 13:04:16 +0100
Subject: Unicode of Death 2.0
In-Reply-To: 
References: 
Message-ID: 

Yes, I found other possible crashes, all caused by the glyph
reordering. It seems really that Apple implemented some unsafe
shortcuts by not creating a glyphs buffer in all cases (using lazy
instantiation only when needed), but forgot some cases, and the code
assumes that the glyphs buffer has been initialized and then probably
fails with a null pointer exception or similar.

2018-02-18 9:01 GMT+01:00 Manish Goregaokar :

> Oh, also vatu.
>
> Seems like that ordering algorithm is indeed relevant.
>
> -Manish

[...]
From unicode at unicode.org Sun Feb 18 06:05:29 2018
From: unicode at unicode.org (Adam Borowski via Unicode)
Date: Sun, 18 Feb 2018 13:05:29 +0100
Subject: metric for block coverage
In-Reply-To: 
References: <20180217221825.wovnzpnzftpsjp37@angband.pl>
Message-ID: <20180218120529.funepdzaa2bh3hjt@angband.pl>

On Sun, Feb 18, 2018 at 02:14:46AM -0800, James Kass wrote:
> Adam Borowski wrote,
> > I'm looking for a way to determine a font's coverage of available scripts.
> > It's probably reasonable to do this per Unicode block. Also, it's a safe
> > assumption that a font which doesn't know a codepoint can do no complex
> > shaping of such a glyph, thus looking at just codepoints should be adequate
> > for our purposes.
>
> You probably already know that basic script coverage information is
> stored internally in OpenType fonts in the OS/2 table.
>
> https://docs.microsoft.com/en-us/typography/opentype/spec/os2

It's only a single bit without a meaning beyond "range is considered
functional". No "basic coverage" vs "good coverage" vs "full coverage".

On the other hand, listing raw codepoints in a universal way is as
simple as:

.----
#!/usr/bin/perl -w
# Print every code point covered by the font's character map, one per line.
use Font::FreeType;
$#ARGV==0 or die "Usage: $0 <font-file>\n";
Font::FreeType->new->face($ARGV[0])->foreach_char(sub {
    printf("%04X\n", $_->char_code);
});
`----

These codepoints can then be grouped by block -- but interpreting such
lists is what's unobvious.

????!
-- 
??????? I've read an article about how lively happy music boosts
??????? productivity. You can read it, too, you just need the
??????? right music while doing so. I recommend Skepticism
??????? (funeral doom metal).
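A possible sketch of that grouping step (Python this time, assuming a
local copy of the UCD's Blocks.txt; it reads the codepoint list printed
by the Perl snippet above on stdin):

.----
#!/usr/bin/python3
# Sketch: group codepoints (one hex value per line on stdin, as printed
# by the Perl snippet above) by Unicode block. Assumes a local copy of
# https://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt

import sys
from collections import Counter

def load_blocks(path="Blocks.txt"):
    blocks = []                          # (first, last, name) triples
    for line in open(path, encoding="utf-8"):
        line = line.split("#")[0].strip()
        if not line:
            continue
        rng, name = line.split(";")
        first, last = rng.split("..")
        blocks.append((int(first, 16), int(last, 16), name.strip()))
    return blocks

blocks = load_blocks()
per_block = Counter()
for cp in (int(line, 16) for line in sys.stdin if line.strip()):
    for first, last, name in blocks:
        if first <= cp <= last:
            per_block[name] += 1
            break

for name, count in per_block.most_common():
    print(f"{count:6d}  {name}")         # interpreting this is the hard part
`----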
From unicode at unicode.org Sun Feb 18 06:09:44 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 18 Feb 2018 04:09:44 -0800
Subject: metric for block coverage
In-Reply-To: <20180218114031.dlg4vw3g2lm4iodn@angband.pl>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218114031.dlg4vw3g2lm4iodn@angband.pl>
Message-ID: 

Adam Borowski wrote,

> What I'm thinking is that a beautiful font that covers Russian, Ukrainian,
> Serbian, Kazakh, Mongolian Cyrillic, etc., should be recommended to users
> before one whose only grace is including every single codepoint.

https://docs.microsoft.com/en-us/typography/opentype/spec/chapter2#scripts-and-languages

If there's any language tag in addition to 'dflt' (default) under a
particular script, the font is likely to be more expert in its
development. Beauty is in the eye of the beholder, and both fancy and
utilitarian typefaces can be typographically elegant.

From unicode at unicode.org Sun Feb 18 07:05:42 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 18 Feb 2018 14:05:42 +0100
Subject: Unicode of Death 2.0
In-Reply-To: 
References: 
Message-ID: 

Now, what I suspect in Apple's implementation is the following: the
OpenType specification details the steps to parse strings, find cluster
boundaries, and identify the various character types (joining,
associativity, decomposable characters...).

At first, Apple's engine parses the clusters and marks those that may
require reordering: it can detect the possible presence of reph forms,
before-base consonants, or vowels with multiple components. If this
condition is true, it goes to a "slow path" using the complex algorithm
that requires the preparation of a glyphs buffer. Otherwise it uses a
"fast path" and can work directly at the code point level.

Here the bug is manifested by the behavior of ZWNJ + vowel, because
this code assumes it runs only in the "slow path" (where a glyphs
buffer has been prepared), but we are here in a case for the "fast
path", determined only by the conditions set during cluster parsing.

The "glyphs buffer" may also still be prepared lazily in case of
application of complex GSUB (i.e. not 1-to-1 mappings) in some of the
features. (I don't think that Apple has a bug here; this still allows
switching dynamically from the "fast path" to the "slow path" on
demand, depending on features implemented in fonts.) But any operation
in OpenType that requires reordering requires a glyphs buffer. This
could even apply to Latin if Microsoft really intends to support
normalization (i.e. canonical equivalences) in its own USE engine (for
now it does not), because it would also require a glyphs buffer to
allow correct reordering of glyphs (according to their properties,
notably for "beforebase", or for special placement of some diacritics
such as the cedilla, which moves from "belowbase" to "abovebase" when
the base is the letter "g").

Unfortunately, the OpenType specification is not very clear and is
still a mess to read. In addition, it has been repeatedly moved around
on Microsoft's website (broken URLs all the time): this specification
hosted by Microsoft would better live on a separate, stable website,
not necessarily linked to Microsoft.
These repeated moves (and content conversions, when Microsoft decides
to change the site layout for its own online "developers network"
center) are a problem: the conversion has once again broken a part of
the documentation (see the missing images for illustrations or for
showing some glyphs...). If OpenType is supposed to be interoperable,
Microsoft should make it more stable outside MSDN (GitHub suggested;
Microsoft already moves many of its open-sourced or cooperative
projects there, and GitHub still allows integration from Microsoft's
website, including for commenting the Microsoft documentation for
Windows, Office or XBox apps, which is now on GitHub; GitHub still
permits Microsoft to link back to its own website with tools on the
sidebar without breaking the local content in GitHub projects). This
move would also allow cleaner versioning than what is on MSDN.

--- side comment:

In fact, even the Windows/Office/XBox public developer documentation
could be moved to GitHub. MSDN is completely broken now: it mixes all
versions, with too many "Page not found" errors everywhere, and it is
extremely difficult to make stable references to the doc in development
projects for Windows, when it changes at each major Windows release, or
when a new version is in preparation. MSDN focuses only on the most
recent version; documentation for older versions is completely
forgotten and too frequently broken. This is also a problem for the
support sites of many third-party developers, as well as within
Microsoft's own solution centers and forums, where the solutions are
hard to evaluate and unstable... Microsoft still does not want to honor
a strong recommendation made by the W3C and the IETF: URLs must be
stable (and Microsoft's idea of using its own GUIDs or article IDs to
reference the contents via an indirection is not a solution, because
Microsoft frequently forgets to maintain the targets of these redirects
when content is moved "elsewhere").

2018-02-18 13:04 GMT+01:00 Philippe Verdy :

> Yes, I found other possible crashes, all caused by the glyph
> reordering.

[...]
From unicode at unicode.org Sun Feb 18 07:13:22 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 18 Feb 2018 14:13:22 +0100
Subject: Unicode of Death 2.0
In-Reply-To: 
References: 
Message-ID: 

But any operation in OpenType that requires reordering requires a
glyphs buffer. This could even apply to Latin if Microsoft really
intends to support normalization (i.e. canonical equivalences) in its
own USE engine (for now it does not), because it would also require a
glyphs buffer to allow correct reordering of glyphs (according to their
properties, notably for "beforebase", or for special placement of some
diacritics such as the cedilla, which moves from "belowbase" to
"abovebase" when the base is the letter "g").
Similar complex shaping features may also exist for rendering Latin
Fraktur or Latin medieval texts... Latin is also a very complex script
(probably much more so than even most Indic scripts), as it has really
a lot of contextual "features". Complex shaping is also needed for more
correct handling of the classic cursive style, or decorated "swash"
styles!

Now, with the introduction of "variable fonts", the complexity is
increasing (think about hinting, or kerning, and how some glyphs may
need non-linear interpolation with breaks, for example with variable
weights). The Microsoft USE engine is still a work in progress, and
OpenType will also need major updates to support more scripts (some
scripts are still only partly supported, such as Lanna/Tai Tham).

From unicode at unicode.org Sun Feb 18 07:30:33 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 18 Feb 2018 05:30:33 -0800
Subject: metric for block coverage
In-Reply-To: <20180218120529.funepdzaa2bh3hjt@angband.pl>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl>
Message-ID: 

Adam Borowski wrote,

> It's only a single bit without a meaning beyond "range is considered
> functional". No "basic coverage" vs "good coverage" vs "full coverage".
> ...
> These codepoints can then be grouped by block -- but interpreting such lists
> is what's unobvious.

Compare the number of glyphs in the range with the number of assigned
characters in the range. Older fonts would lack anything added to The
Standard after the font was made.

+1 if the font has any glyphs in the range
+1 if the font has a good portion of glyphs in the range
+1 if the font has all the glyphs in the range
+1 if the font has OpenType tables covering the script
+1 if the script has 1 language tag in addition to 'dflt' tag
+1 if the script has 2 language tags in addition to 'dflt' tag
...

And for a "good portion of glyphs in the range", possibly the number of
characters in the range which were assigned as of Unicode 3.0 would
indicate a more-or-less "basic coverage" of that range.
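That heuristic is straightforward to prototype; a minimal sketch
(Python; the per-range counts and tag sets are assumed to have been
extracted already, and the 0.8 "good portion" cutoff is an arbitrary
assumption):

.----
#!/usr/bin/python3
# Sketch of the scoring heuristic proposed above. Inputs (glyph counts
# per range, OpenType script/language-system tags) are assumed to have
# been extracted elsewhere; the 0.8 threshold is arbitrary.

def range_score(glyphs_in_range, needed_in_range, ot_script_covered,
                language_tags):
    score = 0
    if glyphs_in_range > 0:
        score += 1                               # any coverage at all
    if glyphs_in_range >= 0.8 * needed_in_range:
        score += 1                               # a good portion
    if glyphs_in_range >= needed_in_range:
        score += 1                               # everything needed
    if ot_script_covered:
        score += 1                               # OpenType shaping tables
    extra = [t for t in language_tags if t != "dflt"]
    score += min(len(extra), 2)                  # +1 per extra tag, capped at 2
    return score

# Full coverage, shaping support, two extra language systems ('POL ' is
# the OpenType tag for Polish, 'CSY ' for Czech):
print(range_score(256, 256, True, ["dflt", "POL ", "CSY "]))   # -> 6
`----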
From unicode at unicode.org Sun Feb 18 07:35:00 2018
From: unicode at unicode.org (Janusz S. Bień via Unicode)
Date: Sun, 18 Feb 2018 14:35:00 +0100
Subject: Unicode Digest, Vol 50, Issue 13
In-Reply-To: (via Unicode's message of "Sun, 18 Feb 2018 07:06:14 -0600")
References: 
Message-ID: <86k1vaxvy3.fsf@mimuw.edu.pl>

On Sun, Feb 18 2018 at 14:06 CET, unicode at unicode.org writes:

[...]

> From: Adam Borowski via Unicode
> Subject: metric for block coverage
> To: unicode at unicode.org
> Date: Sat, 17 Feb 2018 23:18:25 +0100
>
> Hi!
> As a part of Debian fonts team work, we're trying to improve fonts review:
> ways to organize them, add metadata, pick which fonts are installed by
> default and/or recommended to users, etc.
>
> I'm looking for a way to determine a font's coverage of available scripts.
> It's probably reasonable to do this per Unicode block. Also, it's a safe
> assumption that a font which doesn't know a codepoint can do no complex
> shaping of such a glyph, thus looking at just codepoints should be adequate
> for our purposes.

As a Debian user using some rare characters for old Polish
transliteration, I would be happy with a tool which scans
available/installed fonts for a specific list of characters and shows
only those fonts which support the whole list. Of course, showing also
the characters in question would be very desirable.

Best regards

Janusz

-- 
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From unicode at unicode.org Sun Feb 18 07:43:09 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 18 Feb 2018 05:43:09 -0800
Subject: metric for block coverage
In-Reply-To: 
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl>
Message-ID: 

> +1 if the font has all the glyphs in the range

should be

> +1 if the font has all the glyphs needed for the range

From unicode at unicode.org Sun Feb 18 10:33:00 2018
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Sun, 18 Feb 2018 18:33:00 +0200
Subject: metric for block coverage
In-Reply-To: <86k1vaxvy3.fsf@mimuw.edu.pl> (unicode@unicode.org)
References: <86k1vaxvy3.fsf@mimuw.edu.pl>
Message-ID: <83606ub6mb.fsf@gnu.org>

> Date: Sun, 18 Feb 2018 14:35:00 +0100
> From: "Janusz S. Bień via Unicode"
>
> As a Debian user using some rare characters for old Polish
> transliteration I would be happy with a tool which scans
> available/installed fonts for a specific list of characters and shows
> only those fonts which support the whole list. Of course showing also
> the characters in question would be very desirable.

I'm sure you know about BabelMap. It has such a feature.

From unicode at unicode.org Sun Feb 18 10:45:36 2018
From: unicode at unicode.org (Janusz S. Bień via Unicode)
Date: Sun, 18 Feb 2018 17:45:36 +0100
Subject: metric for block coverage
In-Reply-To: <83606ub6mb.fsf@gnu.org> (Eli Zaretskii's message of "Sun, 18 Feb 2018 18:33:00 +0200")
References: <86k1vaxvy3.fsf@mimuw.edu.pl> <83606ub6mb.fsf@gnu.org>
Message-ID: <86fu5yxn4f.fsf@mimuw.edu.pl>

On Sun, Feb 18 2018 at 17:33 CET, eliz at gnu.org writes:

> I'm sure you know about BabelMap. It has such a feature.

Yes, I know about BabelMap, but was not aware of the feature. Thank
you.

I'm interested in a tool for Linux. I suppose BabelMap can be run on
Linux with Wine, but will this feature work in such a situation? I can
of course give it a try, but I have practically no experience with
Wine.

Best regards

Janusz

-- 
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
From unicode at unicode.org Sun Feb 18 11:03:43 2018
From: unicode at unicode.org (Adam Borowski via Unicode)
Date: Sun, 18 Feb 2018 18:03:43 +0100
Subject: Unicode Digest, Vol 50, Issue 13
In-Reply-To: <86k1vaxvy3.fsf@mimuw.edu.pl>
References: <86k1vaxvy3.fsf@mimuw.edu.pl>
Message-ID: <20180218170343.ev5z5lqcf3sceutk@angband.pl>

On Sun, Feb 18, 2018 at 02:35:00PM +0100, Janusz S. Bień via Unicode wrote:
> As a Debian user using some rare characters for old Polish
> transliteration I would be happy with a tool which scans
> available/installed fonts for a specific list of characters and shows
> only those fonts which support the whole list. Of course showing also
> the characters in question would be very desirable.

Thanks, your suggestion is a good addition to the wishlist of features
we'd want to have. Especially for the "available" case -- it'd be
tedious to install all candidates just to check them.

As for "installed":

fc-list ':charset=16e5' file family

????!
-- 
??????? ??????? Imagine there are bandits in your house, your kid is bleeding out,
??????? the house is on fire, and seven big-ass trumpets are playing in the
??????? sky. Your cat demands food. The priority should be obvious...
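And for scripting the same test over a whole character list (the tool
wished for above), a minimal sketch (Python, assuming the fontTools
library; the sample codepoints and the font search path are just
examples):

.----
#!/usr/bin/python3
# Sketch: keep only the installed fonts whose cmap covers *every*
# character in a given list. Assumes fontTools; the sample list and
# the search path are examples only.

import glob
from fontTools.ttLib import TTFont

NEEDED = {0x16E5, 0x0105, 0x1E0D}    # example codepoints, incl. U+16E5

for path in glob.glob("/usr/share/fonts/**/*.ttf", recursive=True):
    try:
        cmap = TTFont(path)["cmap"].getBestCmap()
    except Exception:
        continue                     # skip unreadable or odd fonts
    if NEEDED <= cmap.keys():
        print(path)                  # supports the whole list
`----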
From unicode at unicode.org Sun Feb 18 11:19:50 2018
From: unicode at unicode.org (Janusz S. Bień via Unicode)
Date: Sun, 18 Feb 2018 18:19:50 +0100
Subject: Unicode Digest, Vol 50, Issue 13
In-Reply-To: <20180218170343.ev5z5lqcf3sceutk@angband.pl> (Adam Borowski's message of "Sun, 18 Feb 2018 18:03:43 +0100")
References: <86k1vaxvy3.fsf@mimuw.edu.pl> <20180218170343.ev5z5lqcf3sceutk@angband.pl>
Message-ID: <86bmgmxljd.fsf@mimuw.edu.pl>

On Sun, Feb 18 2018 at 18:03 CET, kilobyte at angband.pl writes:

> Thanks, your suggestion is a good addition to the wishlist of features
> we'd want to have. Especially for the "available" case -- it'd be
> tedious to install all candidates just to check them.
>
> As for "installed":
> fc-list ':charset=16e5' file family

Thanks! Some time ago I was looking at various Debian font utilities
and found nothing suitable, but it looks like I should use Google more
intensively:

https://unix.stackexchange.com/questions/162305/find-the-best-font-for-rendering-a-codepoint

Best regards

Janusz

-- 
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From unicode at unicode.org Sun Feb 18 13:10:36 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 18 Feb 2018 19:10:36 +0000
Subject: metric for block coverage
In-Reply-To: <20180218120529.funepdzaa2bh3hjt@angband.pl>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl>
Message-ID: <20180218191036.44ffa6e0@JRWUBU2>

On Sun, 18 Feb 2018 13:05:29 +0100
Adam Borowski via Unicode wrote:

> On Sun, Feb 18, 2018 at 02:14:46AM -0800, James Kass wrote:
> > You probably already know that basic script coverage information is
> > stored internally in OpenType fonts in the OS/2 table.
> >
> > https://docs.microsoft.com/en-us/typography/opentype/spec/os2
>
> It's only a single bit without a meaning beyond "range is considered
> functional". No "basic coverage" vs "good coverage" vs "full
> coverage".

It's worse than that when a script uses characters primarily associated
with another script. For example, to have any confidence that my Tai
Tham font will be used for U+0E4A THAI CHARACTER MAI TRI or U+0E4B THAI
CHARACTER MAI CHATTAWA placed on U+1A4B TAI THAM LETTER A, I have to
set the Thai bit, even though I only have four Thai characters in my
font. (The other two are punctuation.)

Richard.

From unicode at unicode.org Sun Feb 18 13:38:42 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 18 Feb 2018 19:38:42 +0000
Subject: Unicode of Death 2.0
In-Reply-To: 
References: 
Message-ID: <20180218193842.1935d0ce@JRWUBU2>

On Sun, 18 Feb 2018 14:13:22 +0100
Philippe Verdy via Unicode wrote:

> But any operation in OpenType that requires reordering requires a
> glyphs buffer. This could even apply to Latin if Microsoft really
> intends to support normalization (i.e. canonical equivalences) in its
> own USE engine (for now it does not), because it would also require a
> glyphs buffer to allow correct reordering of glyphs (according to
> their properties, notably for "beforebase", or for special placement
> of some diacritics such as the cedilla that moves from "belowbase" to
> "abovebase" when the base is the letter "g").

The examples accompanying the OpenType specification assume a font may
insert spacing glyphs for punctuation in French, so there's no need to
consider anything complicated.

Microsoft renderers aren't immune to problems. I've had whole lines
vanish because of undocumented shortcomings in the implementation of
multiple ligations in a contextual substitution. (I presume the
vanishing was to save me from something worse, such as memory
corruption.) I couldn't see anything wrong with the maxp parameters.
OpenType semantics have not been thoroughly reverse engineered.

Richard.
From unicode at unicode.org Sun Feb 18 13:10:36 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 18 Feb 2018 19:10:36 +0000
Subject: metric for block coverage
In-Reply-To: <20180218120529.funepdzaa2bh3hjt@angband.pl>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl>
Message-ID: <20180218191036.44ffa6e0@JRWUBU2>

On Sun, 18 Feb 2018 13:05:29 +0100 Adam Borowski via Unicode wrote:

> On Sun, Feb 18, 2018 at 02:14:46AM -0800, James Kass wrote:
> > You probably already know that basic script coverage information is
> > stored internally in OpenType fonts in the OS/2 table.
> >
> > https://docs.microsoft.com/en-us/typography/opentype/spec/os2
>
> It's only a single bit without a meaning beyond "range is considered
> functional". No "basic coverage" vs "good coverage" vs "full coverage".

It's worse than that when a script uses characters primarily associated
with another script. For example, to have any confidence that my Tai Tham
font will be used for U+0E4A THAI CHARACTER MAI TRI or U+0E4B THAI
CHARACTER MAI CHATTAWA placed on U+1A4B TAI THAM LETTER A, I have to set
the Thai bit, even though I have only four Thai characters in my font.
(The other two are punctuation.)

Richard.

From unicode at unicode.org Sun Feb 18 13:38:42 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 18 Feb 2018 19:38:42 +0000
Subject: Unicode of Death 2.0
In-Reply-To:
References:
Message-ID: <20180218193842.1935d0ce@JRWUBU2>

On Sun, 18 Feb 2018 14:13:22 +0100 Philippe Verdy via Unicode wrote:

> But any operation in OpenType that requires reordering requires a glyph
> buffer. This could even apply to Latin if Microsoft really intends to
> support normalization (i.e. canonical equivalences) in its own USE
> engine (for now it does not), because it would also require a glyph
> buffer to allow correct reordering of glyphs (according to their
> properties, notably for "beforebase", or for the special placement of
> some diacritics such as the cedilla, which moves from "belowbase" to
> "abovebase" when the base is the letter "g").

The examples accompanying the OpenType specification assume a font may
insert spacing glyphs for punctuation in French, so there's no need to
consider anything complicated.

Microsoft renderers aren't immune to problems. I've had whole lines
vanish because of undocumented shortcomings in the implementation of
multiple ligations in a contextual substitution. (I presume the vanishing
was to save me from something worse, such as memory corruption.) I
couldn't see anything wrong with the maxp parameters. OpenType semantics
have not been thoroughly reverse engineered.

Richard.

From unicode at unicode.org Sun Feb 18 13:47:34 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 18 Feb 2018 20:47:34 +0100
Subject: Unicode of Death 2.0
In-Reply-To: <20180218193842.1935d0ce@JRWUBU2>
References: <20180218193842.1935d0ce@JRWUBU2>
Message-ID:

2018-02-18 20:38 GMT+01:00 Richard Wordingham via Unicode <unicode at unicode.org>:
> […]
> The examples accompanying the OpenType specification assume a font may
> insert spacing glyphs for punctuation in French, so there's no need to
> consider anything complicated.

I was not talking about the possible additional spacing of punctuation in
French; that is simple to handle and does not involve the shaper. It is
just a matter of per-language alternate glyph selection defined in the
font, which can have different mappings with different metrics or
different GPOS, even when the glyphs share the same vector definition via
a simple affine transform.

From unicode at unicode.org Sun Feb 18 14:06:42 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 18 Feb 2018 20:06:42 +0000
Subject: Why so much emoji nonsense?
In-Reply-To:
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <1079226995.76169.1518871144119@ox.hosteurope.de>
Message-ID: <20180218200642.6de1fe52@JRWUBU2>

On Sat, 17 Feb 2018 22:31:12 -0800 James Kass via Unicode wrote:

> It's true that added features can make for a better presentation.
> Removing the special features shouldn't alter the message.

I think I've encountered the use of italics in novels for sotto voce or
asides.

> The Unicode Standard draws the line between minimal legibility and
> special features. Emoji are in The Standard and have, therefore, been
> determined to be required for minimal legibility.

That is a fuzzy boundary, as is evidenced by the optional effects of ZWJ
and ZWNJ in most scripts and of variation sequences (all scripts).
Unicode also avoids text that is 'wrong' but still comprehensible.

Richard.

From unicode at unicode.org Sun Feb 18 14:22:14 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 18 Feb 2018 21:22:14 +0100
Subject: Unicode of Death 2.0
In-Reply-To:
References: <20180218193842.1935d0ce@JRWUBU2>
Message-ID:

To be clear, the OpenType feature application II profile (initially
defined for Arabic) may also be needed in Latin for correctly rendering
cursive Latin styles. For now this application profile II
(https://docs.microsoft.com/fr-fr/typography/script-development/use#featureapplicationii)
has not been extended to cover contextual shaping for cursive Latin, but
it is not nonsense (IMHO) to think about such an extension (using the same
"isol", "init", "medi" and "fina" features that Arabic requires, but as
optional features in Latin?)
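Whether a given font actually carries such rules can be probed with HarfBuzz's Python bindings. A rough sketch, assuming uharfbuzz is installed and a hypothetical font file CursiveLatin.ttf that really has Latin isol/init/medi/fina lookups (both are assumptions, not facts from this thread):

    import uharfbuzz as hb

    # Load the (hypothetical) font into a HarfBuzz font object.
    blob = hb.Blob(open("CursiveLatin.ttf", "rb").read())
    font = hb.Font(hb.Face(blob))

    def shape(text, features):
        buf = hb.Buffer()
        buf.add_str(text)
        buf.guess_segment_properties()
        hb.shape(font, buf, features)
        return [info.codepoint for info in buf.glyph_infos]  # glyph IDs

    # Compare glyph choices with the joining features forced off and on.
    off = shape("minimum", {"init": False, "medi": False, "fina": False})
    on = shape("minimum", {"init": True, "medi": True, "fina": True})
    print(off != on)  # True only if the font really applies such Latin rules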
Fraktur and medieval Latin styles are also challenging, and not correctly
covered by the basic ("standard") profile
(https://docs.microsoft.com/fr-fr/typography/script-development/standard).
Look at the issues listed in the "Other encoding issues" sections of the
specs.

As OpenType is a project co-managed by Microsoft and Adobe, with
additional consultation of Apple, Unicode and Linux developers, I think it
should be brought under a more separate subcommittee, and its
documentation moved to its own website/repository outside the Microsoft
website itself, even if Microsoft still controls its publication and
modification (under agreements with the other OpenType participants in
that ad hoc subcommittee). Given that these companies are also full
Unicode members (and Linux developers are also represented by companies
creating and supporting Linux distributions, including Google and Oracle),
this OpenType initiative should officially become a subcommittee of the
Unicode Consortium (just as when IBM transferred its CLDR project to
Unicode as a subcommittee). But this does not mean that Unicode needs to
host and manage the documentation itself, or some reference implementation
(GitHub looks great for that), or links to existing implementations of the
OpenType core algorithms on popular development platforms (shaping,
vectorization, hinting, rasterization, variable font shapes, colorimetry
for colored emoji, device capability profiles, programmatic transforms of
shapes for generated styles or animated shapes, or for 3D/OpenGL/DirectX
with the addition of rotations and non-linear projections like
perspective...), and some conformance test tools.

In my opinion the "shaper" part of OpenType rendering is the most
important part where the Unicode Consortium and TUS must be synchronized
(and stabilized: we have seen that lack of stability is a severe security
problem; this Apple bug is a big precedent showing that this specification
must be studied more seriously by an open committee).

2018-02-18 20:47 GMT+01:00 Philippe Verdy :
> […]
From unicode at unicode.org Sun Feb 18 15:39:42 2018
From: unicode at unicode.org (David Starner via Unicode)
Date: Sun, 18 Feb 2018 21:39:42 +0000
Subject: metric for block coverage
In-Reply-To: <20180218114031.dlg4vw3g2lm4iodn@angband.pl>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218114031.dlg4vw3g2lm4iodn@angband.pl>
Message-ID:

On Sun, Feb 18, 2018 at 3:42 AM Adam Borowski wrote:

> I probably used a bad example: scripts like Cyrillic (not even Supplement)
> include both essential letters and those which are historic only or used
> by old folks in a language spoken by 1000, who use Russian (or English...)
> for all computer use anyway -- all within one block.
>
> What I'm thinking is that a beautiful font that covers Russian, Ukrainian,
> Serbian, Kazakh, Mongolian cyr, etc., should be recommended to users
> before one whose only grace is including every single codepoint.

I'm not sure what your goal is. Opening up gucharmap shows me that
FreeSerif and Noto Serif both have complete coverage of Cyrillic and
Cyrillic Supplement. We have reasonable fonts to offer users that cover
everything Cyrillic, or pretty much any script in use. I'm not sure where
and how you're trying to draw a line between a beautiful multilingual font
and a workable full font.

Ultimately, when I look at fonts, I look for Esperanto support. I'd be a
little surprised if it didn't come with Polish support, but that's
unlikely to be my problem. A useful feature for a font selector, for me,
would be being able to select English, German and Esperanto and get just
the fonts that support those languages (in an extended sense, including
the extra-ASCII punctuation and accents English needs, for example). It
does me absolutely no good to know that a font has "good, but not
complete" Latin Extended-A support. Likewise, if you're a Persian speaker,
knowing that the Arabic block has "good, but not complete" support is
worthless.

For single-language ancient scripts, like Ancient Greek, virtually any
font with decent coverage should cover the generally useful stuff. For
more complex ancient scripts, it pretty much has to be done per language.
For some ancient scripts, like Runic and Old Italic, I understand that
after the unification of the various writings, most people feel a
language-specific font is necessary for any serious work.

The ultimate problem is the question of whether it will support my needs.
Language can often be used as a proxy, but names can often foil that. And
symbols are worse; ? is the only character from Currency Symbols that's
used in an extended work in many, many instances, but so is ?. Percentage
of block support is minimally helpful. Miscellaneous Symbols lives up to
its name; ?, ?, ?, ?, and ? are all useful symbols, but not likely to be
found in the same work. Again, recommend 100% coverage or do the manual
work of separating them into groups and offering a specific font (game,
occult, etc.) that covers each, but messing around with a beautiful font
with less than 100% coverage versus a decent font with 100% coverage seems
counterproductive.

> Not sure if I understand your advice right: you're recommending to ignore
> all the complexity and going with just raw count of in-block coverage?
> This could work: a released font probably has codepoints its author
> considers important.

I guess separating out by language when you need to is going to be the way
that helps people the most.
Where that's most complex, I'm not sure why you're not just offering a
decent 100% coverage font (which Debian has a decent selection of) and
stepping back.

From unicode at unicode.org Sun Feb 18 16:03:26 2018
From: unicode at unicode.org (Leonardo Boiko via Unicode)
Date: Sun, 18 Feb 2018 23:03:26 +0100
Subject: metric for block coverage
In-Reply-To:
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218114031.dlg4vw3g2lm4iodn@angband.pl>
Message-ID:

The most useful feature for me (Debian user, linguist) would be a search
system where I can provide a string and filter fonts to those that include
glyphs for all its characters; ideally one I could also combine with other
search criteria, like OTF features (true small caps, etc.). I often write
academic texts where I use specialized characters not really classifiable
by language, script or block (say, '?/?' for pīnyīn, plus IPA tone marks,
plus multiple combining diacritics like 'a??', all in the same running
text). I then need visual inspection to choose a font that actually looks
halfway decent, typographically speaking, and to check for bugs in IPA
kerning, etc.

For a long time now, I've been using a simple Python script to filter
fonts in this manner (it just straightforwardly renders the provided
characters, then uses `pango.Layout.get_unknown_glyphs_count()` to remove
fonts lacking them, and displays all the rest for inspection).
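A minimal modern sketch of the same filtering idea, using today's GObject-introspection Pango bindings (the font family name is a placeholder, and this is not the script referred to above; note that Pango's own font fallback can mask gaps):

    import gi
    gi.require_version("Pango", "1.0")
    gi.require_version("PangoCairo", "1.0")
    from gi.repository import Pango, PangoCairo
    import cairo

    # A throwaway 1x1 surface is enough to obtain a Pango layout.
    surface = cairo.ImageSurface(cairo.FORMAT_ARGB32, 1, 1)
    layout = PangoCairo.create_layout(cairo.Context(surface))
    layout.set_font_description(Pango.FontDescription.from_string("Some Family 12"))
    layout.set_text("pīnyīn", -1)
    # 0 means every character was given a real glyph rather than a hex box.
    print(layout.get_unknown_glyphs_count())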
2018-02-18 22:39 GMT+01:00 David Starner via Unicode :
> […]

From unicode at unicode.org Sun Feb 18 17:06:06 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 19 Feb 2018 00:06:06 +0100
Subject: metric for block coverage
In-Reply-To:
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218114031.dlg4vw3g2lm4iodn@angband.pl>
Message-ID:

For Latin, looking at the coverage of Vietnamese usually works quite
well... except for African languages, which need additional uncommon Latin
letters (open o, open e, alpha, some turned/mirrored/stroked letters); in
that case you should also look at IPA coverage (though you may still miss
the associated capital letters sometimes needed in these African
languages). Unfortunately the IPA subset of "Latin" symbols includes many
letters (lowercase only) that have no associated capitals, and matching
the full coverage of IPA is not needed for African languages (and the
variant of "g" added to Latin only for IPA is really unfortunate, where it
should have been encoded as a variant of the standard "g"). And the
CAPITAL SHARP S (Eszett in German) may still frequently be missing from a
font (I don't know whether renderers will look up the glyph from another
font, or whether they will fall back to the lowercase sharp s mapped in
the font!).
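Both this Vietnamese heuristic and the per-language selection wished for earlier can be approximated with fontconfig's language matching. A rough sketch, assuming fc-list is available (the language tags are examples only):

    import subprocess

    def families(pattern):
        out = subprocess.run(["fc-list", pattern, "family"],
                             capture_output=True, text=True, check=True).stdout
        return {line.strip() for line in out.splitlines() if line.strip()}

    print(families(":lang=vi"))                         # claims Vietnamese coverage
    print(families(":lang=de") & families(":lang=eo"))  # German AND Esperanto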
2018-02-18 22:39 GMT+01:00 David Starner via Unicode :
> […]

From unicode at unicode.org Sun Feb 18 23:26:24 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Mon, 19 Feb 2018 06:26:24 +0100 (CET)
Subject: Why so much emoji nonsense?
In-Reply-To: <20180218200642.6de1fe52@JRWUBU2>
References: <1C417885-6C52-490F-90E6-661E56160A6D@shaw.ca> <1079226995.76169.1518871144119@ox.hosteurope.de> <20180218200642.6de1fe52@JRWUBU2>
Message-ID: <2015894766.110.1519017984337.JavaMail.www@wwinf1g21>

On Sun, 18 Feb 2018 20:06:42 +0000, Richard Wordingham via Unicode wrote:
[…]
> Unicode also avoids text that is 'wrong' but still comprehensible.

Unicode should then legalize the use of preformatted superscripts in
Latin script.
This convention appears to be rooted in medieval Latin, for which Unicode
has added all the required superscripts. Interoperable digital
representation of modern languages may differ in policy, but it does not
differ in principle. In practice, a sample layout provides access to all
existing small superscript Latin base letters on live key positions:
http://charupdate.info/doc/kbenintu/#N
The '??' sequence is on key E12, level 1B (with AltGr/Option):
http://charupdate.info/doc/kbenintu/#B
And a "Superscript" dead key is on key C02 [S].

Regards,

Marcel

From unicode at unicode.org Mon Feb 19 07:06:24 2018
From: unicode at unicode.org (=?utf-8?Q? J.=C2=A0S._Choi ?= via Unicode)
Date: Mon, 19 Feb 2018 07:06:24 -0600
Subject: metric for block coverage
Message-ID: <4B23B567-3AC6-401A-AF52-E3FCF17AE498@icloud.com>

Better heuristics for a font's coverage of a human script sound useful,
but don't the standards discourage using code point blocks to determine
whether a character belongs to the repertoire of a human language or
script? Although the specification authors try to arrange characters into
code point blocks as logically as they can, code point blocks are, most of
all, artifacts of Unicode's own patchwork history. Additional information,
such as script information or ICU data, is often required to correctly
determine a script's repertoire of essential characters. See also
https://www.unicode.org/reports/tr18/#Character_Blocks and
https://www.unicode.org/faq/blocks_ranges.html#16.

Another thing to point out is that correct rendering of script-essential
combining characters is another important part of font quality. This
would be difficult to evaluate with a heuristic based only on code point
blocks.

J. S. Choi
Saint Louis University School of Medicine

From unicode at unicode.org Mon Feb 19 08:58:29 2018
From: unicode at unicode.org (Bobby de Vos via Unicode)
Date: Mon, 19 Feb 2018 07:58:29 -0700
Subject: metric for block coverage
In-Reply-To: <20180218191036.44ffa6e0@JRWUBU2>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl> <20180218191036.44ffa6e0@JRWUBU2>
Message-ID: <98a0b14a-8210-33d1-5fe8-01e1e9e06060@sil.org>

On 2018-02-18 12:10, Richard Wordingham via Unicode wrote:
>> It's only a single bit without a meaning beyond "range is considered
>> functional". No "basic coverage" vs "good coverage" vs "full coverage".
> It's worse than that when a script uses characters primarily
> associated with another script. For example, to have any confidence
> that my Tai Tham font will be used for U+0E4A THAI CHARACTER MAI
> TRI or U+0E4B THAI CHARACTER MAI CHATTAWA placed on U+1A4B TAI THAM
> LETTER A, I have to set the Thai bit, even though I only have four Thai
> characters in my font. (The other two are punctuation.)

Indic scripts (other than Devanagari) also use a few characters from
another block. Specifically, two punctuation characters (from the
Devanagari block)

* U+0964 DEVANAGARI DANDA
* U+0965 DEVANAGARI DOUBLE DANDA

are expected to be used with the non-Devanagari Indic scripts. Looking at
the fonts Noto Sans Kannada and Noto Sans Tamil, the expected Unicode
range bit is set for Kannada or Tamil, but not for Devanagari, even though
those fonts contain U+0964 and U+0965.

Bobby

--
Bobby de Vos
bobby_devos at sil.org
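The mismatch Bobby describes can be checked mechanically with fontTools; a rough sketch, where the font path is a placeholder (bit 15 of ulUnicodeRange1 is the Devanagari range bit in the OpenType OS/2 specification):

    from fontTools.ttLib import TTFont

    font = TTFont("NotoSansTamil-Regular.ttf")  # placeholder path
    cmap = font.getBestCmap()
    os2 = font["OS/2"]

    has_dandas = 0x0964 in cmap and 0x0965 in cmap
    deva_bit = bool(os2.ulUnicodeRange1 & (1 << 15))  # bit 15 = Devanagari
    print(has_dandas, deva_bit)  # Bobby's observation corresponds to (True, False)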
From unicode at unicode.org Mon Feb 19 13:02:28 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 19 Feb 2018 20:02:28 +0100
Subject: metric for block coverage
In-Reply-To: <98a0b14a-8210-33d1-5fe8-01e1e9e06060@sil.org>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl> <20180218191036.44ffa6e0@JRWUBU2> <98a0b14a-8210-33d1-5fe8-01e1e9e06060@sil.org>
Message-ID:

This pair of punctuation marks should long since have been treated as
common punctuation (independently of their assigned names), i.e. assigned
the script property value "Common" and not "Deva". I don't see why they
could not be used in non-Indic scripts (they are not semantically
equivalent to Latin punctuation in their use). I can easily imagine valid
use cases even in Latin, Greek or Cyrillic, to properly translate poems,
religious texts or citations without transforming them into inaccurate
full stops, colons, semicolons, commas, or even exclamation marks (such a
transform is an interpretation by the translator); they would typically be
used with surrounding spaces, not glued to Latin/Greek/Cyrillic words.
Such use in Latin would be part of "extended Latin", but if these
punctuation marks are "Common", it is not so much an extension, and many
fonts could carry these two simple punctuation marks (which need no
"complex" OpenType feature).

Their presence in fonts designed for Indic scripts should be mandatory or
strongly recommended (just like the mapping of SPACE, NBSP, the dotted
circle or blank square, and a few others listed in the OpenType
development documentation). Given their "Common" script property, we would
not need to test their presence to compute script coverage: a renderer
could take the glyph from any other available font if an Indic font is
defective in not mapping them, just as a renderer is allowed to substitute
or synthesize a glyph for the dotted circle, the blank square, or any
whitespace variant when they are not mapped, using only the generic font
metrics (average widths and heights and the relative position of the
baselines in the em square) to scale the glyph or infer a suitable advance
width/height.

2018-02-19 15:58 GMT+01:00 Bobby de Vos via Unicode :
> […]
From unicode at unicode.org Mon Feb 19 14:41:01 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 19 Feb 2018 20:41:01 +0000
Subject: metric for block coverage
In-Reply-To:
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl> <20180218191036.44ffa6e0@JRWUBU2> <98a0b14a-8210-33d1-5fe8-01e1e9e06060@sil.org>
Message-ID: <20180219204101.7cabb833@JRWUBU2>

On Mon, 19 Feb 2018 20:02:28 +0100 Philippe Verdy via Unicode wrote:

> This pair of punctuation marks should long since have been treated as
> common punctuation (independently of their assigned names), i.e.
> assigned the script property value "Common" and not "Deva". I don't see
> why they could not be used in non-Indic scripts (they are not
> semantically equivalent to Latin punctuation in their use).

They currently both have sc=Common, so common sense prevails here.

> I can easily imagine valid use cases even in Latin, Greek or
> Cyrillic, to properly translate poems, religious texts or citations...

They have had scx ∋ Latn, but no longer. It may be because CLDR lacks
sa_Latn; perhaps someone will claim that the dandas and double dandas I've
seen in Sanskrit verses in Latin script are actually something else.

> Their presence in fonts designed for Indic scripts should be
> mandatory or strongly recommended...

They're generally not necessary for scripts in whose encoding Michael
Everson has had a significant hand. He defines script-specific dandas.
Tai Tham has two such pairs!

> ... (just like the mapping of SPACE, NBSP, the dotted circle or blank
> square, and a few others listed in the OpenType development
> documentation) ...

Microsoft Word and the USE document the use or recommendation of quite a
few such shapes and special letters. They make ulUnicodeRange rather
unreliable. Note, however, that ulUnicodeRange works by Unicode range,
not by script.

Richard.

From unicode at unicode.org Tue Feb 20 09:13:16 2018
From: unicode at unicode.org (Dreiheller, Albrecht via Unicode)
Date: Tue, 20 Feb 2018 15:13:16 +0000
Subject: AW: metric for block coverage
In-Reply-To: <98a0b14a-8210-33d1-5fe8-01e1e9e06060@sil.org>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl> <20180218191036.44ffa6e0@JRWUBU2> <98a0b14a-8210-33d1-5fe8-01e1e9e06060@sil.org>
Message-ID: <3E10480FE4510343914E4312AB46E74212D3A2F9@DEFTHW99EH5MSX.ww902.siemens.net>

Could someone please supply an example (web link ...) for usage of danda /
double danda in Tamil?
Thanks, Albrecht

From: Unicode [mailto:unicode-bounces at unicode.org] On behalf of Bobby de Vos via Unicode
Sent: Monday, 19
February 2018 15:58
To: unicode at unicode.org
Subject: Re: metric for block coverage

[…]

From unicode at unicode.org Tue Feb 20 13:40:40 2018
From: unicode at unicode.org (=?ISO-8859-1?Q?Christoph_P=E4per?= via Unicode)
Date: Tue, 20 Feb 2018 20:40:40 +0100
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net>
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net>
Message-ID: <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de>

Apparently the presidential decree prescribing the new Kazakh Latin
orthography and alphabet has been amended recently. The change completely
dumps the previous approach of digraphs with an apostrophe in second
position in favor of an acute diacritic mark above the base letters, for
the vowels ?/?, ?/?, ?/?, ?/?, ?/? and two consonants ?/? and ?/?, while
the other two become the commonly encountered H digraphs, Ch/ch and Sh/sh.

Rejoice.

https://tengrinews.kz/kazakhstan_news/novyiy-variant-kazahskogo-alfavita-latinitse-utverdil-338010
http://www.akorda.kz/kz/legal_acts/decrees/kazak-tili-alipbiin-kirillicadan-latyn-grafikasyna-koshiru-turaly-kazakstan-respublikasy-prezidentinin-2017-zhylgy-26-kazandagy-569-zharlygy
http://www.akorda.kz/upload/media/files/785986f23c47a407facbfa52b935fc85.doc

--
This message was sent from my Android device with K-9 Mail.

From unicode at unicode.org Tue Feb 20 13:56:29 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Tue, 20 Feb 2018 20:56:29 +0100
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de>
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de>
Message-ID:

Maybe we've been heard... Now this makes better sense. But I wonder why
they did not choose the caron over C and S, as in other Eastern European
languages; carons have been well supported for a long time and cause no
problems...

2018-02-20 20:40 GMT+01:00 Christoph Päper via Unicode :
> Apparently the presidential decree prescribing the new Kazakh Latin
> orthography and alphabet has been amended recently.
> […]
From unicode at unicode.org Tue Feb 20 14:12:36 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 20 Feb 2018 20:12:36 +0000
Subject: metric for block coverage
In-Reply-To: <3E10480FE4510343914E4312AB46E74212D3A2F9@DEFTHW99EH5MSX.ww902.siemens.net>
References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl> <20180218191036.44ffa6e0@JRWUBU2> <98a0b14a-8210-33d1-5fe8-01e1e9e06060@sil.org> <3E10480FE4510343914E4312AB46E74212D3A2F9@DEFTHW99EH5MSX.ww902.siemens.net>
Message-ID: <20180220201236.56435946@JRWUBU2>

On Tue, 20 Feb 2018 15:13:16 +0000 "Dreiheller, Albrecht via Unicode" wrote:

> Could someone please supply an example (web link ...) for usage of
> danda / double danda in Tamil? Thanks, Albrecht

Take your pick from http://www.prapatti.com/slokas/slokasbyname.html .
Do they meet your requirements, or do you perhaps want text in the Tamil
language, as opposed to PDFs of Sanskrit in Tamil script? I found the
likes of my example by googling for 'Tamil Shloka' without quotes.

Richard.

From unicode at unicode.org Tue Feb 20 14:23:12 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Tue, 20 Feb 2018 12:23:12 -0800
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To:
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de>
Message-ID:

We'll probably never know which factors influenced the decision, but
apparently some kind of message got through.

From unicode at unicode.org Tue Feb 20 14:26:25 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Tue, 20 Feb 2018 20:26:25 +0000
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de>
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de>
Message-ID:

Why on earth would they use Ch and Sh when 1) C isn't used by itself and
2) if you're using ?? you may as well use ?? ??.

Groan.

> On 20 Feb 2018, at 19:40, Christoph Päper via Unicode wrote:
>
> Apparently the presidential decree prescribing the new Kazakh Latin
> orthography and alphabet has been amended recently.
> […]
From unicode at unicode.org Tue Feb 20 14:40:27 2018
From: unicode at unicode.org (=?ISO-8859-1?Q?Christoph_P=E4per?= via Unicode)
Date: Tue, 20 Feb 2018 21:40:27 +0100
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To:
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de>
Message-ID:

Michael Everson:
> Why on earth would they use Ch and Sh when 1) C isn't used by itself
> and 2) if you're using ?? you may as well use ?? ??.

I would have argued in favor of digraphs for G' and N' as well if there
already was a decision for Ch and Sh.

Many European orthographies use the digraph Qu although the letter Q does
not occur otherwise.

From unicode at unicode.org Tue Feb 20 15:04:31 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Tue, 20 Feb 2018 21:04:31 +0000
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To:
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de>
Message-ID: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com>

Not using Turkic letters is daft, particularly as there was a widely-used
transliteration in Kazakhstan anyway. And even if not ? ?, they could have
used ? and ?.

There's no value in using digraphs in Kazakh, particularly when there
could be a one-to-one relation with the Cyrillic orthography, and I bet
you anything there will be ambiguity where some morpheme ends in -s and
the next begins with h-, where you have [sx] and not [ʃ].

Groan.

> On 20 Feb 2018, at 20:40, Christoph Päper wrote:
> […]

From unicode at unicode.org Tue Feb 20 15:38:57 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Tue, 20 Feb 2018 22:38:57 +0100
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com>
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com>
Message-ID:

As well, the Latin letter "c/C" is not used on its own, only in the
digraph "ch/Ch". And two distinct Cyrillic letters are mapped to Latin
"h/H", when one of them could have been mapped to Latin "x/X" with almost
the same letterform, preserving the orthography. The three versions of the
Cyrillic letter i are mapped to one and a half Latin letters
(distinguished only in lowercase, by the Turkic dotless i, but not
distinguished in uppercase, where there is no dot at all...). It should
have used at least two distinct letters (I with and without acute).
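To make the one-to-one argument concrete, a tiny transliteration sketch; the mapping is an illustrative subset rather than the official 2018 table, so treat individual pairs as assumptions. The sh digraph shows where strict reversibility breaks, which is exactly the s+h ambiguity raised earlier:

    # One-character-to-one-letter pairs stay losslessly reversible...
    TABLE = str.maketrans({
        "а": "a", "з": "z", "қ": "q",
        "ә": "á", "ө": "ó", "ү": "ú", "ң": "ń",
        "ш": "sh",  # ...but a digraph target is no longer one-to-one
    })

    def to_latin(text):
        return text.translate(TABLE)

    print(to_latin("қазақ"))  # -> qazaq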
Yes, it was possible to have a one-to-one mapping, allowing full
compatibility with existing Kazakh Cyrillic keyboard layouts (though not
necessarily with the additional US QWERTY layout, whose existing Latin
extension was only made for typing English and which could be dropped and
replaced by the Kazakh Cyrillic-to-Latin one-to-one transliteration). No
additional keystrokes would then be necessary, and no new hardware
keyboards would be needed for using the new orthography, if users just
look at the existing Cyrillic keycaps. New hardware could have Latin
keycaps in the same positions (with the positions of the Cyrillic letters
also inferred by one-to-one transliteration). All documents in Kazakh
could then be transliterated extremely simply and without loss. And new
documents would be instantly readable through the one-to-one
transliterators by those trained only in the Latin alphabet who want to
read historic documents.

But I fear the Kazakh government does not care much about keeping things
from the past, and history is not their problem. They will realize that
this is not so simple, because there are tons of historic documents still
in the Cyrillic orthography that must legally be kept unchanged (including
international treaties, long-term contracts, Kazakh court decisions, legal
personal records...): using a non-one-to-one transliteration will cause
legal problems even inside the country, and various administrative
problems with their citizens; or they will need to duplicate the official
databases to maintain the two orthographies, which will cost them
computing and storage, and cause problems in applications that search for
one form and won't find the other, unless those applications are corrected
(additional costs there too!).

2018-02-20 22:04 GMT+01:00 Michael Everson via Unicode :
> Not using Turkic letters is daft, particularly as there was a widely-used
> transliteration in Kazakhstan anyway.
> […]
From unicode at unicode.org Tue Feb 20 19:15:52 2018
From: unicode at unicode.org (Garth Wallace via Unicode)
Date: Wed, 21 Feb 2018 01:15:52 +0000
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com>
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com>
Message-ID:

AIUI "doesn't look like Turkish" was one of the design criteria, for
political reasons.

On Tue, Feb 20, 2018 at 1:07 PM Michael Everson via Unicode <unicode at unicode.org> wrote:
> […]

From unicode at unicode.org Tue Feb 20 19:19:55 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Wed, 21 Feb 2018 01:19:55 +0000
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To:
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com>
Message-ID: <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com>

Stalin would be very pleased. Divide and conquer.

> On 21 Feb 2018, at 01:15, Garth Wallace via Unicode wrote:
>
> AIUI "doesn't look like Turkish" was one of the design criteria, for
> political reasons.
> […]

From unicode at unicode.org Tue Feb 20 20:19:48 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 21 Feb 2018 03:19:48 +0100
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com>
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com>
Message-ID:

I call that more isolationism. If I can understand the political reasons
for not looking like Turkish, why then do they use the dotless i in this
last version (not distinguished, however, from the dotted i in the
capital)? This is not just a transliteration; it is also a proposal to
simplify the orthography at the same time by reducing the alphabet.
They'll have trouble with people's names and new cases of homonymy,
causing administrative difficulties for these people...
When other countries are now going in the opposite direction (accepting to
extend their alphabets by adding more letters or distinguishing variants,
and then treating orthographic simplifications not globally but via
selected lists of terms studied by their local linguistic authorities,
otherwise allowing or requiring simplifications only in selected
applications), here it is the reverse: create an alphabet that looks
neither like Turkish, nor like Russian, nor like other Eastern European
languages, and also not like their own national language, erasing a
significant part of its history. All the difficulties come at the same
time and will cost them a lot, because there is no room at all for
transition and adaptation!

2018-02-21 2:19 GMT+01:00 Michael Everson via Unicode :
> Stalin would be very pleased. Divide and conquer.
> […]

From unicode at unicode.org Tue Feb 20 20:24:26 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Tue, 20 Feb 2018 18:24:26 -0800
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com>
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com>
Message-ID:

A desire to choose their own writing system rather than have one imposed
upon them is understandable. If they also want it to be distinctive, who
could blame them?

From unicode at unicode.org Tue Feb 20 21:15:48 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Wed, 21 Feb 2018 03:15:48 +0000
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To:
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com>
Message-ID: <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com>

I absolutely disagree. There's a whole lot of related languages out there,
and the speakers share some things in common. Orthographic harmonization
between these languages can ONLY help any speaker of one to access
information in any of the others. That expands people's worlds.
That would be a good goal.

> On 21 Feb 2018, at 02:24, James Kass via Unicode wrote:
>
> A desire to choose their own writing system rather than have one
> imposed upon them is understandable. If they also want it to be
> distinctive, who could blame them?

From unicode at unicode.org Tue Feb 20 21:38:25 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 21 Feb 2018 04:38:25 +0100
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com>
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com>
Message-ID:

That's true; this area is a mix of cultures and ethnicities, some of them
in trouble or conflict, and creating additional linguistic problems, or
trying to block communication between them, will not help make the
situation more peaceful. So yes, "divide and conquer" is a probable
intent, but so is the desire to erase a part of the country's history. In
a few decades we will see what this attempt created: just more
complication and more costs for everyone. Experience has shown that people
maintain their culture independently of what their government does (see
what happened after 70 years of the USSR: religions and languages were not
forgotten at all), and such a reform will never succeed completely before
several centuries, and only after a long period of peace in which people
want to reconcile and then reinvent a common way of speaking to each other
(with less control by the government itself), by voluntary adoption rather
than by imposed law. Going to Latin, why not, but only with large
compatibility with the past, no added ambiguities, and a smooth transition
so that people can take the time to understand and adopt it.

2018-02-21 4:15 GMT+01:00 Michael Everson via Unicode :
> I absolutely disagree. There's a whole lot of related languages out
> there, and the speakers share some things in common.
> […]
The good news is that the thread title question is moot. From unicode at unicode.org Tue Feb 20 22:01:41 2018 From: unicode at unicode.org (Anshuman Pandey via Unicode) Date: Tue, 20 Feb 2018 22:01:41 -0600 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> Message-ID: <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> > On Feb 20, 2018, at 9:49 PM, James Kass via Unicode wrote: > > Michael Everson wrote: > >> Orthographic harmonization between these languages can ONLY help any >> speaker of one to access information in any of the others. That expands >> people?s worlds. That would be a good goal. > > Wouldn't dream of arguing with that. Expanding people's worlds is why > many of us have supported Unicode. Agreed! > The good news is that the thread title question is moot. Yes, now let?s please return to discussing emoji. All my best, Anshu From unicode at unicode.org Tue Feb 20 22:11:37 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 21 Feb 2018 05:11:37 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> Message-ID: 2018-02-21 5:01 GMT+01:00 Anshuman Pandey via Unicode : > > The good news is that the thread title question is moot. > > Yes, now let?s please return to discussing emoji. > Or NOT !!! This is NOT at all the same topic -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Feb 20 22:15:45 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 20 Feb 2018 20:15:45 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> Message-ID: Philippe, it was a jest. (Good one, too!) On Tue, Feb 20, 2018 at 8:11 PM, Philippe Verdy wrote: > 2018-02-21 5:01 GMT+01:00 Anshuman Pandey via Unicode : >> >> > The good news is that the thread title question is moot. >> >> Yes, now let?s please return to discussing emoji. > > > Or NOT !!! This is NOT at all the same topic From unicode at unicode.org Tue Feb 20 22:31:10 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 21 Feb 2018 05:31:10 +0100 Subject: 0027, 02BC, 2019, or a new character? 
In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> Message-ID:

Sorry, but such subtle English interpretations are not in my mind; don't assume everyone reads irony into everything posted here. These are just unneeded diversions causing trouble, and they do not make the thread easy to follow. 2018-02-21 5:15 GMT+01:00 James Kass : > Philippe, it was a jest. (Good one, too!) >

From unicode at unicode.org Tue Feb 20 23:29:23 2018 From: unicode at unicode.org (Phake Nick via Unicode) Date: Wed, 21 Feb 2018 05:29:23 +0000 Subject: IDC's versus Egyptian format controls In-Reply-To: References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2> <20180217004810.6238bf5c@JRWUBU2> <20180217094358.05292de8@JRWUBU2> Message-ID:

Actually, given that the IDS characters are confusing -- in that some users might expect them to show the composition, while in other situations users might expect them to be composited together -- would it be a good idea to encode a copy of the IDS characters explicitly for use as combining characters, while the original IDS characters can be left to show compositions?

From unicode at unicode.org Wed Feb 21 00:10:29 2018 From: unicode at unicode.org (Robert Wheelock via Unicode) Date: Wed, 21 Feb 2018 01:10:29 -0500 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> Message-ID:

The whole *ASCII apostrophe* thing for Qazaqi (Kazakh) could be avoided by using a Turkish-based orthography; this way, /h/ can still be distinguished from /x/, /u/ from /w/, ... !
? for front rounded vowels /? ? y/
? for laminal fricatives /? ?/, and for laminal affricates /t? d?/
? for /x ~ ?/, and for its voiced counterpart /? ~ ?/
? The Turkish dull-I letter for the phoneme /? ~ ? ~ ?/
? for the *eng* sound /?/
... .
So, a Turkish-based ASDF keyboard layout would do fine for typing in Qazaqi using our Latin/Roman alphabet. On Tue, Feb 20, 2018 at 11:31 PM, Philippe Verdy via Unicode < unicode at unicode.org> wrote: > Sorry, but such subtle English interpretations are not in my mind; don't > assume everyone reads irony into everything posted here. These are just > unneeded diversions causing trouble, and they do not make the thread easy > to follow. > > 2018-02-21 5:15 GMT+01:00 James Kass : > >> Philippe, it was a jest. (Good one, too!) >> >

From unicode at unicode.org Wed Feb 21 00:15:55 2018 From: unicode at unicode.org (Robert Wheelock via Unicode) Date: Wed, 21 Feb 2018 01:15:55 -0500 Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> Message-ID:

CORRECTION: The Turkish dull-I letter for the sound /? ~ ? ~ ?/ DOESN'T HAVE A DOT ATOP IT!!!! It's simply written as , while the normal I letter for the sound /? ~ i:/ DOES HAVE A DOT ATOP THAT, and is written as . On Wed, Feb 21, 2018 at 1:10 AM, Robert Wheelock wrote: > The whole *ASCII apostrophe* thing for Qazaqi (Kazakh) could be avoided > by using a Turkish-based orthography; this way, /h/ can still be > distinguished from /x/, /u/ from /w/, ... ! > > ? for front rounded vowels /? ? y/ > ? for laminal fricatives /? ?/, and for laminal affricates /t? d?/ > ? for /x ~ ?/, and for its voiced counterpart /? ~ ?/ > ? The Turkish dull-I letter for the phoneme /? ~ ? ~ ?/ > ? for the *eng* sound /?/ > ... . > > So, a Turkish-based ASDF keyboard layout would do fine for typing in > Qazaqi using our Latin/Roman alphabet. > > > On Tue, Feb 20, 2018 at 11:31 PM, Philippe Verdy via Unicode < > unicode at unicode.org> wrote: > >> Sorry, but such subtle English interpretations are not in my mind; don't >> assume everyone reads irony into everything posted here. These are just >> unneeded diversions causing trouble, and they do not make the thread easy >> to follow. >> >> 2018-02-21 5:15 GMT+01:00 James Kass : >> >>> Philippe, it was a jest. (Good one, too!) >>> >> >

From unicode at unicode.org Wed Feb 21 08:51:08 2018 From: unicode at unicode.org (Khaled Hosny via Unicode) Date: Wed, 21 Feb 2018 16:51:08 +0200 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> Message-ID: <20180221145108.GC1439@macbook.localdomain>

Now if he had used an emoji that shows the mode of the text it would have been a lot more obvious, but we already established that the world does not need emoji. Regards, Khaled On Wed, Feb 21, 2018 at 05:31:10AM +0100, Philippe Verdy via Unicode wrote: > Sorry, but such subtle English interpretations are not in my mind; don't > assume everyone reads irony into everything posted here. These are just > unneeded diversions causing trouble, and they do not make the thread easy > to follow. > > 2018-02-21 5:15 GMT+01:00 James Kass : > > > Philippe, it was a jest. (Good one, too!) > >

From unicode at unicode.org Wed Feb 21 09:28:14 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 21 Feb 2018 16:28:14 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <20180221145108.GC1439@macbook.localdomain> References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> Message-ID:

2018-02-21 15:51 GMT+01:00 Khaled Hosny : > Now if he had used an emoji that shows the mode of the text it would > have been a lot more obvious, but we already established that the world > does not need emoji.
> No, I don't need emojis. An emoji can mean anything or nothing; they are just unnecessary and annoying eye-catching distractions. I even hope that there will be a setting in all browsers, OSes, mobiles, and apps to refuse any colorful rendering, and just render them as monochromatic symbols. In summary: COMPLETELY DISABLE the colorful extensions of OpenType made for them.

From unicode at unicode.org Wed Feb 21 09:23:23 2018 From: unicode at unicode.org (Jeb Eldridge via Unicode) Date: Wed, 21 Feb 2018 10:23:23 -0500 Subject: Suggestions? In-Reply-To: <5a8d8cfd.088a1f0a.a94be.30c3@mx.google.com> References: <5a8d8cfd.088a1f0a.a94be.30c3@mx.google.com> Message-ID: <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com>

Where can I post suggestions and feedback for Unicode?

From unicode at unicode.org Wed Feb 21 11:05:01 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 21 Feb 2018 09:05:01 -0800 Subject: Suggestions? In-Reply-To: <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com> References: <5a8d8cfd.088a1f0a.a94be.30c3@mx.google.com> <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com> Message-ID: An HTML attachment was scrubbed... URL:

From unicode at unicode.org Wed Feb 21 11:10:13 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 21 Feb 2018 09:10:13 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> Message-ID: An HTML attachment was scrubbed... URL:

From unicode at unicode.org Wed Feb 21 11:23:28 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 21 Feb 2018 18:23:28 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> Message-ID:

2018-02-21 18:10 GMT+01:00 Asmus Freytag via Unicode : > Feeling a bit curmudgeony, are we, today? :-) > I don't know what it means; I've never heard that word, and it's not in dictionaries. Probably local US jargon, or a typo in your strange word.

From unicode at unicode.org Wed Feb 21 11:36:54 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 21 Feb 2018 18:36:54 +0100 Subject: Suggestions? In-Reply-To: <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com> References: <5a8d8cfd.088a1f0a.a94be.30c3@mx.google.com> <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com> Message-ID:

The Unicode website has a section for feedback in its menu, but in separate projects for TUS and for CLDR. Feedback is also requested for every proposed amendment to the standard, its annexes, and its data. First search for the relevant topic on the website, then look at the sidebar if there's no specific feedback link in the main page content. Feedback or proposals are submitted via an online form, and will then be forwarded by email to the interested subcommittees and any subscribers. Data submissions to CLDR are made through the Survey Tool, when it is open.
For reference implementations that have an open-source repository, feedback is submitted via the links given in the repository itself. Basically, you need to look for the most relevant topic, and then use the appropriate link so that your feedback can be sorted and sent to the correct people. There's also a feedback channel for questions related to Unicode membership, or for legal requests. There's also a general feedback link, but don't expect an immediate response: it may take time for it to reach the right people, and unsorted/unqualified feedback takes time to be classified and extracted from the fog of incoming spam and irrelevant submissions. If you don't know where to post, this mailing list can guide you, but it is not the place to submit a formal request. Various people (including me) may reply to you, and any reply you receive from this list is not officially endorsed by Unicode; this is more a "community" list used to interconnect interested people, discuss how to improve proposals, get guidance before submitting a qualified formal request, or ask for peer review before submitting it. 2018-02-21 16:23 GMT+01:00 Jeb Eldridge via Unicode : > Where can I post suggestions and feedback for Unicode?

From unicode at unicode.org Wed Feb 21 11:39:03 2018 From: unicode at unicode.org (John W Kennedy via Unicode) Date: Wed, 21 Feb 2018 12:39:03 -0500 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> Message-ID: <3557C048-9503-43A9-96E7-0E03A0DBD06D@gmail.com>

"Curmudgeonly" is a perfectly good English word attested back to 1590. -- > On Feb 21, 2018, at 12:23 PM, Philippe Verdy via Unicode wrote: > > 2018-02-21 18:10 GMT+01:00 Asmus Freytag via Unicode : >> Feeling a bit curmudgeony, are we, today? :-) > I don't know what it means; I've never heard that word, and it's not in dictionaries. Probably local US jargon, or a typo in your strange word. >

From unicode at unicode.org Wed Feb 21 11:49:29 2018 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Wed, 21 Feb 2018 09:49:29 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> Message-ID: <7ec7a119-dd7a-7907-63ae-e868c2f328bc@ix.netcom.com>

On 2/21/2018 9:23 AM, Philippe Verdy wrote: > 2018-02-21 18:10 GMT+01:00 Asmus Freytag via Unicode > >: > > Feeling a bit curmudgeony, are we, today? :-) > > I don't know what it means; I've never heard that word, and it's not in > dictionaries. Probably local US jargon, or a typo in your strange word. > Sorry for the typo. Dropped an "l". :-[ curmudgeonly, from curmudgeon+ly. The word is attested from the late 1500s in the forms /curmudgeon/ and /curmudgen/, and during the 17th century in numerous spelling variants, including /cormogeon, cormogion, cormoggian, cormudgeon, curmudgion, curmuggion, curmudgin, curr-mudgin, curre-megient/.
Don't think the US existed in the late 1500s... A./

From unicode at unicode.org Wed Feb 21 12:11:58 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 21 Feb 2018 10:11:58 -0800 Subject: Suggestions? In-Reply-To: References: <5a8d8cfd.088a1f0a.a94be.30c3@mx.google.com> <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com> Message-ID:

http://www.unicode.org/faq/faq_on_faqs.html#34

From unicode at unicode.org Wed Feb 21 13:45:32 2018 From: unicode at unicode.org (David Starner via Unicode) Date: Wed, 21 Feb 2018 19:45:32 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <3557C048-9503-43A9-96E7-0E03A0DBD06D@gmail.com> References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> <3557C048-9503-43A9-96E7-0E03A0DBD06D@gmail.com> Message-ID:

On Wed, Feb 21, 2018 at 9:40 AM John W Kennedy via Unicode < unicode at unicode.org> wrote: > "Curmudgeonly" is a perfectly good English word attested back to 1590. > Curmudgeony may be identified as misspelled by Google, but it's got a bit of usage dating back a hundred years. Wiktionary's entry at [[-y]] says "This suffix is still very productive and can be added to almost any word.", and that matches my feeling that this is a perfectly good word, a perfectly wordy word, even if it wouldn't be used in formal English.

From unicode at unicode.org Wed Feb 21 13:54:31 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 21 Feb 2018 19:54:31 +0000 Subject: Coloured Characters (was: 0027, 02BC, 2019, or a new character?) In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> Message-ID: <20180221195431.46d1e37c@JRWUBU2>

On Wed, 21 Feb 2018 16:28:14 +0100 Philippe Verdy via Unicode wrote: > I even hope that there will be a setting in all browsers, OSes, > mobiles, and apps to refuse any colorful rendering, and just render > them as monochromatic symbols. In summary: COMPLETELY DISABLE the > colorful extensions of OpenType made for them. But hieroglyphs look so much better in colour! What's more, they were meant to be read in colour. If you want monochrome, you should make do with hieratic! On a more practical level, I've made a font that colours subscript coda consonants differently to subscript onset consonants for the purpose of proof-reading Northern Thai text. It was a pleasant surprise to see colour-coded suggested spelling corrections when I used it on Firefox. I had installed the spell-checker for LibreOffice, which currently lacks the colour capability, but Firefox helped itself to it. So you may not like emoji, but the colour extensions have perfectly good uses. Richard.

From unicode at unicode.org Wed Feb 21 15:01:10 2018 From: unicode at unicode.org (David Starner via Unicode) Date: Wed, 21 Feb 2018 21:01:10 +0000 Subject: Suggestions?
In-Reply-To: <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com> References: <5a8d8cfd.088a1f0a.a94be.30c3@mx.google.com> <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com> Message-ID:

On Wed, Feb 21, 2018 at 7:55 AM Jeb Eldridge via Unicode < unicode at unicode.org> wrote: > Where can I post suggestions and feedback for Unicode? > Here is as good as any place. There are specific places for a few specific things, but if you do have something that's likely to get changed, you'll likely need the help of someone here to get through the process. It is a quarter-century-old technical standard embedded in most electronics, so I would temper any expectations for major changes; it works the way it works because that's the way previous versions worked, and nobody is interested in the trouble changing them would involve.

From unicode at unicode.org Wed Feb 21 15:04:32 2018 From: unicode at unicode.org (=?ISO-8859-1?Q?Christoph_P=E4per?= via Unicode) Date: Wed, 21 Feb 2018 22:04:32 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> Message-ID: <89E51B32-CDDB-41CD-BFD4-3BC49664749C@crissov.de>

Philippe Verdy: > > I even hope that there will be a setting in all browsers, OSes, mobiles, > and apps to refuse any colorful rendering, and just render them as > monochromatic symbols. In summary: COMPLETELY DISABLE the colorful > extensions of OpenType made for them. See and linked issues for CSS.

From unicode at unicode.org Wed Feb 21 17:04:34 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 22 Feb 2018 00:04:34 +0100 Subject: Coloured Characters (was: 0027, 02BC, 2019, or a new character?) In-Reply-To: <20180221195431.46d1e37c@JRWUBU2> References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> <20180221195431.46d1e37c@JRWUBU2> Message-ID:

I'm not speaking about hieroglyphs, even if they are perfectly readable in monochrome on monuments. I was just saying that colorful **emojis** are a nuisance, and colors in them do not add any semantic value except making them more visible; in fact they look spammy and needlessly distracting. (Flags are a possible exception, though even there the disambiguation should make the country name readable and accessible; skin tones were added only to avoid a never-ending battle over ethnic biases in implementations, and most of the time they are not meaningful at all!) Given that emojis are extremely ambiguous and unreadable, can mean practically anything, and look very different across implementations, their colorful aspect is also not semantically useful. By contrast, color in Arabic or hieroglyphic texts is a useful form of emphasis and is sometimes semantically significant (some rare old scripts also used distinctive colors): this case is similar to the encoded semantic variants for mathematics symbols. But here again color causes a severe problem of accessibility and rendering on various surfaces (e.g. is the paper/screen white or black?
If you cannot see the encoded color correctly and it is interpreted verbatim, the text will not be readable at all; what is really needed is a set of symbolic colors -- normal color, color variant 1, color variant 2 -- and Unicode could perfectly well encode these as combining diacritics!) 2018-02-21 20:54 GMT+01:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Wed, 21 Feb 2018 16:28:14 +0100 > Philippe Verdy via Unicode wrote: > > > I even hope that there will be a setting in all browsers, OSes, > > mobiles, and apps to refuse any colorful rendering, and just render > > them as monochromatic symbols. In summary: COMPLETELY DISABLE the > > colorful extensions of OpenType made for them. > > But hieroglyphs look so much better in colour! What's more, they were > meant to be read in colour. If you want monochrome, you should make do > with hieratic! > > On a more practical level, I've made a font that colours subscript coda > consonants differently to subscript onset consonants for the purpose of > proof-reading Northern Thai text. It was a pleasant surprise to see > colour-coded suggested spelling corrections when I used it on Firefox. > I had installed the spell-checker for LibreOffice, which currently > lacks the colour capability, but Firefox helped itself to it. > > So you may not like emoji, but the colour extensions have perfectly > good uses. > > Richard. >

From unicode at unicode.org Wed Feb 21 17:09:11 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 21 Feb 2018 15:09:11 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> <3557C048-9503-43A9-96E7-0E03A0DBD06D@gmail.com> Message-ID: <60727f9e-15ab-c8ae-e35b-38c79d3ec3d8@ix.netcom.com> An HTML attachment was scrubbed... URL:

From unicode at unicode.org Wed Feb 21 19:13:25 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 22 Feb 2018 01:13:25 +0000 Subject: Coloured Characters (was: 0027, 02BC, 2019, or a new character?) In-Reply-To: References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> <20180221195431.46d1e37c@JRWUBU2> Message-ID: <20180222011325.5f8e6c53@JRWUBU2>

On Thu, 22 Feb 2018 00:04:34 +0100 Philippe Verdy via Unicode wrote: > By contrast, color in Arabic or hieroglyphic texts is a useful form of > emphasis and is sometimes semantically significant (some rare old > scripts also used distinctive colors): this case is similar to the > encoded semantic variants for mathematics symbols. But here again > color causes a severe problem of accessibility and rendering on > various surfaces (e.g. is the paper/screen white or black?
> If you cannot see the encoded color correctly and it is interpreted > verbatim, the text will not be readable at all; what is really needed > is a set of symbolic colors -- normal color, color variant 1, color > variant 2 -- and Unicode could perfectly well encode these as combining > diacritics!) In my case, I just used the colours 'foreground' and 'red'. They work well on both light and dark backgrounds. The difference wasn't so easy to see when the foreground was a different shade of red! Heraldry has the same problem when objects are depicted in their natural colours. (The colour term then used in English heraldry is 'proper'.) Microsoft has a scheme of palettes, but the design is that the application chooses the palette from a predefined list. The font can nominate palettes for light and dark backgrounds; otherwise the selection protocol is completely up to the application. 'Foreground' and 'background' are the only externally defined colours. There's no ability to explicitly choose, say 'text stroked sable and dotted gules'. Instead, it's 'text stroked sable and dotted proper', with a choice of palettes to define 'proper'. Richard.

From unicode at unicode.org Thu Feb 22 01:01:33 2018 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Thu, 22 Feb 2018 16:01:33 +0900 Subject: IDC's versus Egyptian format controls In-Reply-To: References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2> Message-ID: <67789106-f63c-e68a-30de-48d3cd2e01c4@it.aoyama.ac.jp>

On 2018/02/17 08:25, James Kass via Unicode wrote: > Some people studying Han characters use the IDCs to illustrate the > ideographs and their components for various purposes. Well, as far as I understand, this was their original (and is still their main) purpose. > For example:
> U-0002A8B8 ?? ???
> U-0002A8B9 ?? ???
> U-0002A8BA ?? ???
> U-0002A8BB ?? ???
> U-0002A8BC ?? ???
> U-0002A8BD ?? ???
> U-0002A8BE ?? ???
> U-0002A8BF ?? ???
> U-0002A8C0 ?? ???
> U-0002A8C1 ?? ???
Is it only me, or did you get some of this data wrong? To me, it definitely looks like U-0002A8BC ?? ??? rather than U-0002A8BC ?? ???, and U-0002A8BF ?? ??? rather than U-0002A8BF ?? ???, and changes seem to be needed for all the others, too. (The descriptions seem to be four lines later than the characters where they actually belong.) > It would probably be disconcerting if the display of those > sequences changed into their respective characters overnight. Yes indeed. Regards, Martin.

From unicode at unicode.org Thu Feb 22 04:37:39 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 22 Feb 2018 02:37:39 -0800 Subject: IDC's versus Egyptian format controls In-Reply-To: <67789106-f63c-e68a-30de-48d3cd2e01c4@it.aoyama.ac.jp> References: <20180216160040.1e630740@JRWUBU2> <20180216182000.5c2a4431@JRWUBU2> <5c7bc0fe-4c4e-66aa-3779-fdf1e0851cdb@ix.netcom.com> <20180216222724.00b2cbb4@JRWUBU2> <67789106-f63c-e68a-30de-48d3cd2e01c4@it.aoyama.ac.jp> Message-ID:

Martin J. Dürst wrote: > Is it only me, or did you get some of this data wrong? Yes, sorry. There's an offset. I copy/pasted data from an archive which apparently predates the formal release of Ext C, and IIRC there was some shifting. Unfortunately the font I used to view the data matches the data, and so is also incorrect.
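A quick way to verify the standing of the IDCs underlying this thread: the twelve Ideographic Description Characters U+2FF0..U+2FFB are ordinary visible symbols (general category So) with no canonical decomposition, so normalization can never turn an IDS into the ideograph it describes. A minimal Python sketch, assuming only the standard unicodedata module:

    import unicodedata

    # The twelve Ideographic Description Characters, U+2FF0..U+2FFB.
    for cp in range(0x2FF0, 0x2FFC):
        ch = chr(cp)
        print("U+%04X %s category=%s decomposition=%r"
              % (cp, unicodedata.name(ch), unicodedata.category(ch),
                 unicodedata.decomposition(ch)))

    # Every line prints category 'So' and an empty decomposition string,
    # so NFC/NFD leave IDS strings unchanged.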
From unicode at unicode.org Thu Feb 22 06:21:46 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 22 Feb 2018 12:21:46 +0000 Subject: Coloured Characters In-Reply-To: <26696985.14288.1519296923530.JavaMail.defaultUser@defaultHost> References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> <20180221195431.46d1e37c@JRWUBU2> <20180222011325.5f8e6c53@JRWUBU2> <26696985.14288.1519296923530.JavaMail.defaultUser@defaultHost> Message-ID: <20180222122146.70f1b344@JRWUBU2>

On Thu, 22 Feb 2018 10:55:23 +0000 (GMT) William_J_G Overington wrote: > Richard Wordingham wrote: > > > 'Foreground' and 'background' are the only externally defined > > colours. There's no ability to explicitly choose, say 'text stroked > > sable and dotted gules'. Instead, it's 'text stroked sable and > > dotted proper', with a choice of palettes to define 'proper'. > External selection of decoration colours would theoretically be > possible; I do not know how difficult this would be to implement. The problem lies in changing existing interfaces. I can only speak with any real knowledge for the OpenType COLR/CPAL method. The change would be a major pain in programming languages with obligatory (even if implicit) typing. At present, foreground and background need to be specified (if only by default) and passed into the painting routines. You now want to expand the foreground argument into a list of colours - or possibly a callback routine. The next issue is what is to happen when the list provided is too short. Without suitable handling, this may cause problems with fonts that already work in applications that at one interface level know nothing about colour fonts. For example, the HTML code that I have been using with my font knows nothing about colour fonts as such. To get colour with my web page, I just select a coloured font. The final issue that springs to mind is that the COLR table of OpenType allows for 65,535 different colours in glyphs; 0xFFFF is the only reserved colour ID. It represents the foreground colour. If there is only one palette in the font, 0xFFFE can be a legitimate user-defined colour ID. I wouldn't be surprised if such an assignment survived the transition from a proof-of-principle font to a released font. A less painful method for interfaces might be the selection of palettes by name. However, there are rather more possible colour combinations than can be accommodated in an sfnt name table, so an approximation algorithm would be required. It would also make the CPAL tables larger and much more difficult to generate. There are also 30 unassigned bits left in the palette's type attribute. Of course, Unicode is not constrained by what is currently available, and as an entity is interested at most in what is feasible rather than the precise mechanisms. Several full members, though, will care about precise mechanisms. Richard.
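For anyone who wants to poke at the structures Richard describes, the COLR/CPAL data can be dumped in a few lines of Python with fontTools. This is a minimal inspection sketch, not a rendering recipe: the font path is a hypothetical placeholder, and it assumes a font carrying COLR version 0 plus a CPAL table, where colour ID 0xFFFF means "use the current foreground colour" as noted above:

    from fontTools.ttLib import TTFont

    font = TTFont("SomeColorFont.ttf")  # hypothetical path
    cpal, colr = font["CPAL"], font["COLR"]

    print("CPAL version", cpal.version, "-",
          cpal.numPaletteEntries, "entries per palette")
    for i, palette in enumerate(cpal.palettes):
        # Each palette entry is an 8-bit-per-channel RGBA colour record.
        print("palette", i,
              ["#%02X%02X%02X%02X" % (c.red, c.green, c.blue, c.alpha)
               for c in palette])

    # COLR v0 maps a base glyph to ordered layers: glyph name + CPAL index.
    for base, layers in sorted(colr.ColorLayers.items()):
        for layer in layers:
            colour = "foreground" if layer.colorID == 0xFFFF else layer.colorID
            print(base, "->", layer.name, "in colour", colour)

Nothing here chooses among multiple palettes; as Richard says, that protocol is left to the application.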
From unicode at unicode.org Thu Feb 22 08:27:52 2018 From: unicode at unicode.org (Dreiheller, Albrecht via Unicode) Date: Thu, 22 Feb 2018 14:27:52 +0000 Subject: Re: metric for block coverage In-Reply-To: <20180220201236.56435946@JRWUBU2> References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218120529.funepdzaa2bh3hjt@angband.pl> <20180218191036.44ffa6e0@JRWUBU2> <98a0b14a-8210-33d1-5fe8-01e1e9e06060@sil.org> <3E10480FE4510343914E4312AB46E74212D3A2F9@DEFTHW99EH5MSX.ww902.siemens.net> <20180220201236.56435946@JRWUBU2> Message-ID: <3E10480FE4510343914E4312AB46E74212D3AABE@DEFTHW99EH5MSX.ww902.siemens.net>

Thanks a lot. If I understand it right, these are examples in the Sanskrit language using the Tamil script? More precisely, my question is whether there are examples in (today's) Tamil language using Danda or Double Danda. I tried to detect these characters in Tamil's Wikipedia texts, but I didn't find any. Albrecht -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham via Unicode Sent: Tuesday, 20 February 2018 21:13 To: unicode at unicode.org Subject: Re: metric for block coverage On Tue, 20 Feb 2018 15:13:16 +0000 "Dreiheller, Albrecht via Unicode" wrote: > Could someone please supply an example (web link ...) for usage of > danda / double danda in Tamil? Thanks, Albrecht Take your pick from http://www.prapatti.com/slokas/slokasbyname.html . Do they meet your requirements, or do you perhaps want text in the Tamil language as opposed to PDFs of Sanskrit in Tamil script? I found the likes of my example by googling for 'Tamil Shloka' without quotes. Richard.

From unicode at unicode.org Thu Feb 22 04:55:23 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Thu, 22 Feb 2018 10:55:23 +0000 (GMT) Subject: Coloured Characters In-Reply-To: <20180222011325.5f8e6c53@JRWUBU2> References: <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> <3D125407-A700-42A1-93C7-EF3347A68A42@umich.edu> <20180221145108.GC1439@macbook.localdomain> <20180221195431.46d1e37c@JRWUBU2> <20180222011325.5f8e6c53@JRWUBU2> Message-ID: <26696985.14288.1519296923530.JavaMail.defaultUser@defaultHost>

Richard Wordingham wrote: > 'Foreground' and 'background' are the only externally defined colours. There's no ability to explicitly choose, say 'text stroked sable and dotted gules'. Instead, it's 'text stroked sable and dotted proper', with a choice of palettes to define 'proper'. External selection of decoration colours would theoretically be possible; I do not know how difficult this would be to implement. I remember posting about that somewhere some years ago but I cannot find it at the moment. The following thread now mentions that possibility and also has, from 2014, an idea of how to have shading from one colour to another. https://forum.high-logic.com/viewtopic.php?f=37&t=5024 In that thread, on 7 June 2014, I wrote as follows. quote The standardization process has a rule that if someone (individual or company) puts forward a proposal for standardization, then that person has to agree to provide a working demonstration. I put forward some ideas for how to extend the COLR/CPAL model so as to provide colour shading of glyphs as well as the existing solid colour. Yet I could not formally propose them for standardization as I do not have the facilities to provide a working demonstration.
end quote So the ideas are there and maybe they could be implemented, though alas I cannot implement them myself. William Overington Thursday 22 February 2018

From unicode at unicode.org Thu Feb 22 13:39:33 2018 From: unicode at unicode.org (David Corbett via Unicode) Date: Thu, 22 Feb 2018 14:39:33 -0500 Subject: Bidi edge cases in Hangul and Indic Message-ID:

Although the Unicode Bidirectional Algorithm clearly defines how to reorder characters in memory, I don't understand precisely what it means to display one character after another once they've been reordered; specifically, when bidi reordering changes the number of user-perceived characters. For example, after a right-to-left override, the Hangul string 보기 ("bogi") becomes 기보 ("gibo") in visual order. However, its NFD form is reordered by jamo instead of by syllable; that is, it looks like "igob". I don't think it is the intent of the algorithm that canonically equivalent strings display so very differently, but I can't find any explicit guidance. What should a UBA-conformant renderer do? Another unclear case is Indic clusters. ???? is unambiguously two clusters, but after an RLO, and after following rule L3 to put combining marks after their bases, it looks like one cluster: ????. If Devanagari were actually written right-to-left, I would expect it to stay as two clusters: ?????. Does the UBA prefer one rendering over the other, or is this outside its scope?

From unicode at unicode.org Thu Feb 22 17:32:45 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Thu, 22 Feb 2018 15:32:45 -0800 Subject: Bidi edge cases in Hangul and Indic In-Reply-To: References: Message-ID:

On 2/22/2018 11:39 AM, David Corbett via Unicode wrote: > For example, after a right-to-left override, the Hangul string 보기 > ("bogi") becomes 기보 ("gibo") in visual order. However, its NFD form is > reordered by jamo instead of by syllable; that is, it looks like "igob". Nope. *tilt* The UBA reorders the display order in layout -- not the underlying string. "bogi" is the sequence <1107, 1169, 1100, 1175> in NFD, or <BCF4, AE30> in NFC. Because of canonical equivalence, for display of the NFD string, the sequence <1107,1169> needs to be mapped onto the same *glyph* as BCF4, and the sequence <1100,1175> onto the same *glyph* as AE30. If you override the normal left-to-right ordering with bidi override controls, then the layout order is reversed, but what is actually laid out is those two glyphs.
So you just reverse the order of the two syllables for display, in either case. You could force display of "igob", but only if you had inserted some character in between the conjoining jamos that was preventing their equivalence to the syllables, anyway. > I don't think it is the intent of the algorithm that canonically > equivalent strings display so very differently, but I can't find any > explicit guidance. What should a UBA-conformant renderer do? The right thing. ;-) --Ken

From unicode at unicode.org Thu Feb 22 21:21:26 2018 From: unicode at unicode.org (David Corbett via Unicode) Date: Thu, 22 Feb 2018 22:21:26 -0500 Subject: Bidi edge cases in Hangul and Indic In-Reply-To: References: Message-ID:

On Thu, Feb 22, 2018 at 6:32 PM, Ken Whistler wrote: > > If you override the normal left-to-right ordering with bidi override > controls, then the layout order is reversed, but what is actually laid out > is those two glyphs. So you just reverse the order of the two syllables for > display, in either case. > My confusion stems from Unicode's online bidi utility. Compare https://unicode.org/cldr/utility/bidi.jsp?a=%E2%80%AE%EB%B3%B4%EA%B8%B0 (NFC) to https://unicode.org/cldr/utility/bidi.jsp?a=%E2%80%AE%E1%84%87%E1%85%A9%E1%84%80%E1%85%B5 (NFD). Concatenating each one's characters in reordered display position order produces canonically different results. Here is a more practical example. A sequence of an emoji modifier base and an emoji modifier in an RTL run will be display-reordered such that the modifier is to the left of the base. Clearly, the right thing is to not reorder them, because they should ligate to form a single glyph. Contrast this with "fl" in an RTL run, which will be display-reordered to "lf": it would be wrong to apply the previous rationale here just because "fl" may have a single glyph. It sounds like the UBA doesn't specify how to reorder the glyphs of the characters within a level run. That's about what I expected. I was just worried it might require an easily implemented but wrong order, so thanks for the reassurance.

From unicode at unicode.org Thu Feb 22 21:36:08 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Thu, 22 Feb 2018 19:36:08 -0800 Subject: Bidi edge cases in Hangul and Indic In-Reply-To: References: Message-ID: <9d47f560-e447-f121-9505-cc4f48e0171a@att.net>

David, On 2/22/2018 7:21 PM, David Corbett via Unicode wrote: > My confusion stems from Unicode's online bidi utility. That bidi utility has known defects in it. It is not yet conformant with changes to UBA 6.3, let alone later changes to UBA. And the mapping of memory position to display position in that utility does not take into account the complex mapping that has to occur in the layout engines and fonts in real applications. --Ken

From unicode at unicode.org Thu Feb 22 21:52:33 2018 From: unicode at unicode.org (via Unicode) Date: Fri, 23 Feb 2018 11:54:33 +0800 Subject: Suggestions? In-Reply-To: References: <5a8d8cfd.088a1f0a.a94be.30c3@mx.google.com> <5a8d8ee4.4c8c1f0a.fb753.f463@mx.google.com> Message-ID:

On 22.02.2018 05:01, David Starner via Unicode wrote: > On Wed, Feb 21, 2018 at 7:55 AM Jeb Eldridge via Unicode > wrote: > >> Where can I post suggestions and feedback for Unicode? > > Here is as good as any place. There are specific places for a few > specific things, but likely if you do have something that's likely to > get changed, you'll need the help of someone here to get through the > process. It is a quarter-century-old technical standard embedded in > most electronics, so I would temper any expectations for major > changes; it works the way it works because that's the way previous > versions worked, and nobody is interested in the trouble changing them > would involve. > Yes and no. This list is for informal discussion, so someone unsure about things may start here, but posting on this list does not count as feedback or suggestions to Unicode. So by all means post some of your ideas here and learn more. Regards, John Knightley

From unicode at unicode.org Fri Feb 23 01:17:31 2018 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Fri, 23 Feb 2018 16:17:31 +0900 Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <2513463F-4549-41EF-9253-1000C8E07E15@crissov.de> <51A1DC1B-C718-4A16-AF1C-4048AA7FA26D@evertype.com> <20E09CCC-2B6B-4117-981F-01362DD0C62F@evertype.com> <6AB9C520-7C76-47AF-9CF0-A5D8E6A71930@evertype.com> Message-ID: <007d12b4-79fa-a48a-a7de-730f3be2ece6@it.aoyama.ac.jp>

On 2018/02/21 12:15, Michael Everson via Unicode wrote: > I absolutely disagree. There's a whole lot of related languages out there, and the speakers share some things in common. Orthographic harmonization between these languages can ONLY help any speaker of one to access information in any of the others. That expands people's worlds. That would be a good goal. It's definitely a good goal. But it's not rocket science to learn the different orthographies. If the languages are similar, then different orthographies are just a minor nuisance. As an example, German and Dutch also have different orthographies, but that's really a very minor issue when learning one language from the other even though these languages are very close. Regards, Martin.

From unicode at unicode.org Fri Feb 23 12:15:32 2018 From: unicode at unicode.org (Norbert Lindenberg via Unicode) Date: Fri, 23 Feb 2018 10:15:32 -0800 Subject: metric for block coverage In-Reply-To: <20180218112610.GA18088@macbook.localdomain> References: <20180217221825.wovnzpnzftpsjp37@angband.pl> <20180218112610.GA18088@macbook.localdomain> Message-ID: <5FA91BEC-C649-462C-A999-A9D7BDEACA88@lindenbergsoftware.com>

> On Feb 18, 2018, at 3:26 , Khaled Hosny via Unicode wrote: > > On Sun, Feb 18, 2018 at 02:14:46AM -0800, James Kass via Unicode wrote: >> Adam Borowski wrote, >> >>> I'm looking for a way to determine a font's coverage of available scripts. >>> It's probably reasonable to do this per Unicode block. Also, it's a safe >>> assumption that a font which doesn't know a codepoint can do no complex >>> shaping of such a glyph, thus looking at just codepoints should be adequate >>> for our purposes. >> >> You probably already know that basic script coverage information is >> stored internally in OpenType fonts in the OS/2 table. >> >> https://docs.microsoft.com/en-us/typography/opentype/spec/os2 >> >> Parsing the bits in the "ulUnicodeRange..." entries may be the >> simplest way to get basic script coverage info. > > Though this might not be very reliable since OpenType does not have a > definition of what it means for a Unicode block to be supported; some > font authoring tools use a percentage, others use the presence of any > characters in the range, and fonts might even provide incorrect data for > any reason. > > However, I don't think script or block coverage is that useful, what > users are usually interested in is the language coverage. > > Regards, > Khaled All true. In addition, ulUnicodeRange ran out of bits around Unicode 5.1, so scripts/blocks added to Unicode after that, such as Javanese, Tangut, or Adlam, cannot be represented. Norbert

From unicode at unicode.org Tue Feb 27 09:36:55 2018 From: unicode at unicode.org (Peter Constable via Unicode) Date: Tue, 27 Feb 2018 15:36:55 +0000 Subject: metric for block coverage In-Reply-To: <20180217221825.wovnzpnzftpsjp37@angband.pl> References: <20180217221825.wovnzpnzftpsjp37@angband.pl> Message-ID:
That is obsolete: as Khaled pointed out, there has never been a clear definition of "supported" and practice has been inconsistent. Moreover, the available bits were exhausted after Unicode 5.2, and we're now working on Unicode 11. Both Apple and Microsoft have started to use 'dlng' and 'slng' values in the 'meta' table of OpenType fonts to convey what a font can and is designed to support ? a distinction that the OS/2 table never allows for, but that is actually more useful. (I'd also point out that, in the upcoming Windows 10 feature update, the 'dlng' entries in fonts is used to determine what preview strings to use in the Fonts settings UI.) For scripts like Latin that have a large set of characters, most of which have infrequent usage, there can still be a challenge in characterizing the font, but the mechanism does provide flexibility in what is declared. But again, you haven't said what data to put into fonts is your issue. If you are trying to determine whether a given font supports a particular language, the OS/2 and 'meta' table provide heuristics ? with 'meta' being recommended; but the only way to know for absolute certain is to compare an exemplar character list for the particular language with the font's cmap table. But note, that can only tell you that a font _is able to support_ the language, which doesn't necessarily imply that it's actually a good choice for users of that language. For example, every font in Windows includes Basic Latin characters, but that definitely doesn't mean that the fonts are useful for an English speaker. This is why the 'dlng' entry in the 'meta' table was created. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Adam Borowski via Unicode Sent: Saturday, February 17, 2018 2:18 PM To: unicode at unicode.org Subject: metric for block coverage Hi! As a part of Debian fonts team work, we're trying to improve fonts review: ways to organize them, add metadata, pick which fonts are installed by default and/or recommended to users, etc. I'm looking for a way to determine a font's coverage of available scripts. It's probably reasonable to do this per Unicode block. Also, it's a safe assumption that a font which doesn't know a codepoint can do no complex shaping of such a glyph, thus looking at just codepoints should be adequate for our purposes. A na?ve way would be to count codepoints present in the font vs the number of all codepoints in the block. Alas, there's way too much chaff for such an approach to be reasonable: ? or ? count the same as LATIN TURNED CAPITAL LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH CARON. Another idea would be giving every codepoint a weight equal to the number of languages which currently use such a letter. Too bad, that wouldn't work for symbols, or for dead scripts: a good runic font will have a complete coverage of elder futhark, anglo-saxon, younger and medieval, while only a completionist would care about franks casket or Tolkien's inventions. I don't think I'm the first to have this question. Any suggestions? ????! -- ??????? ??????? A dumb species has no way to open a tuna can. ??????? A smart species invents a can opener. ??????? A master species delegates. 
From unicode at unicode.org Tue Feb 27 10:29:35 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 27 Feb 2018 09:29:35 -0700 Subject: Missing Kazakh Latin letters (was: Re: 0027, 02BC, 2019, or a new =?UTF-8?Q?character=3F=29?= Message-ID: <20180227092935.665a7a7059d7ee80bb4d670165c8327d.8651b8dbdd.wbe@email03.godaddy.com>

Michael Everson wrote: > Why on earth would they use Ch and Sh when 1) C isn't used by itself > and 2) if you're using ?? you may as well use ?? ??. Philippe Verdy wrote: > The three versions of the Cyrillic letter i are mapped to 1.5 > (distinguished only in lowercase by the Turkic lowercase dotless i, > but not distinguished in uppercase where there's no dot at all...). > It should have used two distinct letters at least (I with or without > acute). There's another problem. No Latin equivalents are listed for the Cyrillic letters Ц ц Ъ ъ Ь ь Э э Ю ю Я я, in either the old charts with apostrophes or the new chart with acutes. These are code points 0426, 042A, 042C, 042D, 042E, and 042F and corresponding lowercase. All of these letters, in lowercase or both, are used in the Kazakh translation of the UDHR currently available from the "UDHR in Unicode" project. So either the UDHR translation is wildly incorrect, which seems unlikely, or the transliteration tables are incomplete. Wikipedia shows digraphs I? ?? for Ю ю, and Ia ?a for Я я, and nothing for the others, though it is not clear where the digraphs came from, and of course the usual Wikipedia caveats apply. -- Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Tue Feb 27 10:45:36 2018 From: unicode at unicode.org (Neil Patel via Unicode) Date: Tue, 27 Feb 2018 11:45:36 -0500 Subject: Unicode Digest, Vol 50, Issue 20 In-Reply-To: References: Message-ID:

Do the ulUnicodeRange bits get used to dictate rendering behavior or script recognition? I am just wondering about whether the lack of bits to indicate an Adlam charset can cause other issues in applications. -Neil On Sat, Feb 24, 2018 at 1:00 PM, via Unicode wrote: > [...] > > Today's Topics: > > 1. Re: metric for block coverage (Norbert Lindenberg via Unicode) > > ---------- Forwarded message ---------- > From: Norbert Lindenberg via Unicode > To: Khaled Hosny > Cc: James Kass , Adam Borowski < kilobyte at angband.pl>, Unicode Public , Norbert Lindenberg > Date: Fri, 23 Feb 2018 10:15:32 -0800 > Subject: Re: metric for block coverage > > > On Feb 18, 2018, at 3:26 , Khaled Hosny via Unicode > wrote: > > > > On Sun, Feb 18, 2018 at 02:14:46AM -0800, James Kass via Unicode wrote: > >> Adam Borowski wrote, > >> > >>> I'm looking for a way to determine a font's coverage of available > >>> scripts. It's probably reasonable to do this per Unicode block. Also, > >>> it's a safe assumption that a font which doesn't know a codepoint can > >>> do no complex shaping of such a glyph, thus looking at just codepoints > >>> should be adequate for our purposes.
> >> > >> You probably already know that basic script coverage information is > >> stored internally in OpenType fonts in the OS/2 table. > >> > >> https://docs.microsoft.com/en-us/typography/opentype/spec/os2 > >> > >> Parsing the bits in the "ulUnicodeRange..." entries may be the > >> simplest way to get basic script coverage info. > > > > Though this might not be very reliable since OpenType does not have a > > definition of what it means for a Unicode block to be supported; some > > font authoring tools use a percentage, others use the presence of any > > characters in the range, and fonts might even provide incorrect data for > > any reason. > > > > However, I don't think script or block coverage is that useful, what > > users are usually interested in is the language coverage. > > > > Regards, > > Khaled > > All true. In addition, ulUnicodeRange ran out of bits around Unicode 5.1, > so scripts/blocks added to Unicode after that, such as Javanese, Tangut, or > Adlam, cannot be represented. > > Norbert

From unicode at unicode.org Tue Feb 27 13:09:38 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 27 Feb 2018 20:09:38 +0100 Subject: Unicode Digest, Vol 50, Issue 20 In-Reply-To: References: Message-ID:

I bet these bit sets are just for legacy applications that depend on them to detect support for the scripts encoded in the set with a simple test. I've not seen whether a standard extension of this legacy bitset was ever approved. For detecting support for other scripts not encoded in these bitsets, you'll need to check that there are sufficient mappings in the relevant blocks (most of these scripts are not in the BMP and are small enough to be encoded completely, except possibly extended emojis, musical notations, or new blocks for game and astrological symbols and the like, which belong to special symbolic scripts). You cannot just enumerate the languages present in the implemented OpenType "features" tables either, as most languages don't need such specific per-language tuning of the font and just use the default (locale-neutral) set of features; these per-language tunings are optional, and most often implemented only in CJK fonts (for script variants: Korean Hanja, Japanese Kanji, Simplified and Traditional Hanzi). 2018-02-27 17:45 GMT+01:00 Neil Patel via Unicode : > Do the ulUnicodeRange bits get used to dictate rendering behavior or > script recognition? > > I am just wondering about whether the lack of bits to indicate an Adlam > charset can cause other issues in applications. >

From unicode at unicode.org Tue Feb 27 13:32:58 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 27 Feb 2018 20:32:58 +0100 Subject: metric for block coverage In-Reply-To: References: <20180217221825.wovnzpnzftpsjp37@angband.pl> Message-ID:

I agree that 'dlng' is far better than this old legacy bitset (which was defined at a time when all of Unicode was in the BMP, and the envisioned CJK extension blocks outside the BMP were assumed to be handled by the bits defined for CJK). At least 'dlng' is intended to indicate whether a font adequately supports the exemplar character set needed for each language (or language-script pair) rather than a full script.
This is however challenging for rendering arbitrary text where the language is not identified (by metadata beside the text itself, including lang="" attributes in HTML/XML and lang() selectors in CSS, or document-level metadata or MIME headers in HTTP or emails): many documents do not properly tag the language they use, and don't identify all embedded foreign languages in multilingual documents; some applications do not even have such info (e.g. text fields in most SQL databases, or files with simple structures like CSV, dBF...), and renderers may need to use a "language guesser" heuristic (which may turn out to be wrong on short text fields, where it will simply be better to check whether all characters are covered). So there's no simple solution. What has been done in most OSes is to provide a better basic set of preinstalled fonts that have good coverage, and use them as fallbacks each time there's a problem and an application did not indicate a specific font (or just used generic font name aliases like "serif", "sans-serif", "monospace", "symbols"). These OSes (or libraries in independent text rendering engines) also contain in their renderers a database of rules for font fallbacks from well-known font names, which may be replaced by other supported fonts with "similar" characteristics and metrics. 2018-02-27 16:36 GMT+01:00 Peter Constable via Unicode : > You haven't clarified what exactly the usage is; you've only asked what it > means to cover a script. > > James Kass mentioned a font's OS/2 table. That is obsolete: as Khaled > pointed out, there has never been a clear definition of "supported" and > practice has been inconsistent. Moreover, the available bits were exhausted > after Unicode 5.2, and we're now working on Unicode 11. Both Apple and > Microsoft have started to use 'dlng' and 'slng' values in the 'meta' table > of OpenType fonts to convey what a font can and is designed to support -- a > distinction that the OS/2 table never allows for, but that is actually more > useful. (I'd also point out that, in the upcoming Windows 10 feature > update, the 'dlng' entries in fonts are used to determine what preview > strings to use in the Fonts settings UI.) For scripts like Latin that have > a large set of characters, most of which have infrequent usage, there can > still be a challenge in characterizing the font, but the mechanism does > provide flexibility in what is declared. > > But again, you haven't said whether the issue for you is what data to put > into fonts. If you are trying to determine whether a given font supports a > particular language, the OS/2 and 'meta' tables provide heuristics -- with > 'meta' being recommended -- but the only way to know for absolute certain > is to compare an exemplar character list for the particular language with > the font's cmap table. But note, that can only tell you that a font _is > able to support_ the language, which doesn't necessarily imply that it's > actually a good choice for users of that language. For example, every font > in Windows includes Basic Latin characters, but that definitely doesn't > mean that all the fonts are useful for an English speaker. This is why the > 'dlng' entry in the 'meta' table was created. >
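Since the thread keeps coming back to the 'meta' table, here is a matching minimal sketch for reading it with Python's fontTools, which exposes 'meta' as a mapping from 4-byte tags to values, with 'dlng' and 'slng' decoded as comma-separated ScriptLangTag text. Again, the font path is a hypothetical placeholder:

    from fontTools.ttLib import TTFont

    font = TTFont("SomeFont.ttf")  # hypothetical path
    if "meta" in font:
        data = font["meta"].data
        # 'dlng' = languages/scripts the font was *designed* for;
        # 'slng' = everything it is *capable* of supporting.
        for tag in ("dlng", "slng"):
            if tag in data:
                print(tag, "=", [v.strip() for v in data[tag].split(",")])
    else:
        print("no 'meta' table: fall back to OS/2 ranges or a cmap scan")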
From unicode at unicode.org  Wed Feb 28 00:37:33 2018
From: unicode at unicode.org (Peter Constable via Unicode)
Date: Wed, 28 Feb 2018 06:37:33 +0000
Subject: Unicode Digest, Vol 50, Issue 20
In-Reply-To: 
References: 
Message-ID: 

The OpenType spec doesn't in any way suggest that the bits be used that
way. It's impossible to assert that there are no applications out there
that do that, but I wouldn't expect there to be many widely-used apps that
do that today.

On the other hand, something that the bits might affect is behaviour like
font selection / font binding. For example, if you paste plain text into a
rich-text app, it must select a default font for that text, since it's a
rich-text app. Now, an obvious choice would be to use the font applied to
the characters on either side of the insertion point. But if it turned out
that that font didn't support the text being pasted, that would create a
rendering problem; so the app probably wants to avoid that. An app just
might use these bits as a heuristic to decide whether the current font can
support the text or not.

I say that Unicode-range bits probably wouldn't affect rendering in
current apps, though that wasn't necessarily the case in the past. Word 97
was one of the very first mainstream apps to support Unicode, but it was
limited in the scripts that were actually supported. Word 2000 was still
early in terms of mainstream Unicode support, and still had limitations. I
recall working on font projects for the Ethiopic and Yi scripts (with SIL
at the time) and needing to set Unicode range or code page bits in order
to get text working in Word using our fonts.

One particular issue was a font-binding issue: Word would lump the Yi
characters in with CJK (they're not Western, and they're not among the few
complex scripts that were supported, so assume they're CJK), but wouldn't
allow the font to be applied until I set bits to make Word think the font
supported CJK. But then with the Ethiopic font, a different effect, a
rendering issue, became apparent: Ethiopic characters have many different
widths, but Word ignored the actual glyph metrics and displayed every
glyph with the same width (the apparent assumption being that the
characters are all CJK and all have the same width). Again, bits had to be
set to make it observe the actual glyph metrics. IIRC, in one case I
needed to set the Shift-JIS code page bit, and in the other case, to set a
bit for one of the kana blocks. But that was many years ago now. I can't
think of seeing Unicode-range bits affecting rendering in a long time.

Peter

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Neil
Patel via Unicode
Sent: Tuesday, February 27, 2018 8:46 AM
To: unicode at unicode.org; unicode-request at unicode.org
Subject: Re: Unicode Digest, Vol 50, Issue 20

Do the ulUnicodeRange bits get used to dictate rendering behavior or
script recognition?

I am just wondering about whether the lack of bits to indicate an Adlam
charset can cause other issues in applications.

-Neil

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
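The paste-time heuristic Peter describes could look roughly like this (an
illustration only, assuming Python with fontTools; BLOCK_BITS is a
three-entry excerpt of the 128 bit assignments in the OpenType OS/2
specification, so a real implementation would carry the full table,
including the extra ranges some bits cover):

    from fontTools.ttLib import TTFont

    # Excerpt of OS/2 ulUnicodeRange bit assignments (bit -> block range).
    # Only three of the 128 bits are listed here, for illustration.
    BLOCK_BITS = {
        0: (0x0000, 0x007F),  # Basic Latin
        1: (0x0080, 0x00FF),  # Latin-1 Supplement
        9: (0x0400, 0x04FF),  # Cyrillic (the full spec adds more ranges)
    }

    def font_claims_text(font_path, text):
        """Heuristic: does the font claim coverage of every character's
        Unicode block via its ulUnicodeRange bits?"""
        os2 = TTFont(font_path)["OS/2"]
        fields = [os2.ulUnicodeRange1, os2.ulUnicodeRange2,
                  os2.ulUnicodeRange3, os2.ulUnicodeRange4]

        def bit_set(bit):
            return bool(fields[bit // 32] & (1 << (bit % 32)))

        for ch in text:
            cp = ord(ch)
            bits = [b for b, (lo, hi) in BLOCK_BITS.items()
                    if lo <= cp <= hi]
            # Characters in blocks with no assigned bit can never be
            # claimed, so they always fail this heuristic.
            if not any(bit_set(b) for b in bits):
                return False
        return True

Note how a character in a block with no assigned bit at all, such as
Adlam, makes the heuristic fail regardless of the font's actual cmap,
which is exactly the kind of knock-on effect Neil is asking about.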
From unicode at unicode.org  Wed Feb 28 04:38:13 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Wed, 28 Feb 2018 11:38:13 +0100
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: <5A95D192.5050608@unicode.org>
	(announcements@unicode.org's message of "Tue, 27 Feb 2018 13:45:54 -0800")
References: <5A95D192.5050608@unicode.org>
Message-ID: <86tvu12ycq.fsf@mimuw.edu.pl>

On Tue, Feb 27 2018 at 13:45 -0800, announcements at unicode.org writes:

> The 157 new Emoji are now available for adoption, to help the Unicode
> Consortium's work on digitally disadvantaged languages.

I'm quite curious what the relation is between the new emojis and the
digitally disadvantaged languages. I see none.

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From unicode at unicode.org  Wed Feb 28 04:48:22 2018
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Wed, 28 Feb 2018 19:48:22 +0900
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: <86tvu12ycq.fsf@mimuw.edu.pl>
References: <5A95D192.5050608@unicode.org> <86tvu12ycq.fsf@mimuw.edu.pl>
Message-ID: <950a5dfa-36a6-b48b-eed1-befb210bb0ec@it.aoyama.ac.jp>

On 2018/02/28 19:38, Janusz S. Bień
via Unicode wrote:

> On Tue, Feb 27 2018 at 13:45 -0800, announcements at unicode.org writes:
>
>> The 157 new Emoji are now available for adoption, to help the Unicode
>> Consortium's work on digitally disadvantaged languages.
>
> I'm quite curious what the relation is between the new emojis and the
> digitally disadvantaged languages. I see none.

I think this was mentioned before on this list, in particular by Mark:
The money collected from character adoptions (where emoji are a prominent
target) is (mostly?) used to support work on not-yet-encoded (thus
digitally disadvantaged) scripts. See e.g. the recent announcement at
http://blog.unicode.org/2018/02/adopt-character-grant-to-support-three.html.

Regards, Martin.

From unicode at unicode.org  Wed Feb 28 04:53:41 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Wed, 28 Feb 2018 11:53:41 +0100
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: <950a5dfa-36a6-b48b-eed1-befb210bb0ec@it.aoyama.ac.jp>
References: <5A95D192.5050608@unicode.org> <86tvu12ycq.fsf@mimuw.edu.pl>
	<950a5dfa-36a6-b48b-eed1-befb210bb0ec@it.aoyama.ac.jp>
Message-ID: 

Also, please click through from the announcement to
http://www.unicode.org/consortium/adopt-a-character.html.

If it isn't apparent from that page what the relationship is, we have some
work to do...

Mark

On Wed, Feb 28, 2018 at 11:48 AM, Martin J. Dürst via Unicode <
unicode at unicode.org> wrote:

> On 2018/02/28 19:38, Janusz S. Bień via Unicode wrote:
>
>> On Tue, Feb 27 2018 at 13:45 -0800, announcements at unicode.org writes:
>>
>>> The 157 new Emoji are now available for adoption, to help the Unicode
>>> Consortium's work on digitally disadvantaged languages.
>>
>> I'm quite curious what the relation is between the new emojis and the
>> digitally disadvantaged languages. I see none.
>
> I think this was mentioned before on this list, in particular by Mark:
> The money collected from character adoptions (where emoji are a prominent
> target) is (mostly?) used to support work on not-yet-encoded (thus
> digitally disadvantaged) scripts. See e.g. the recent announcement at
> http://blog.unicode.org/2018/02/adopt-character-grant-to-support-three.html.
>
> Regards, Martin.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org  Wed Feb 28 05:22:08 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Wed, 28 Feb 2018 12:22:08 +0100
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: 
	(Mark Davis's message of "Wed, 28 Feb 2018 11:53:41 +0100")
References: <5A95D192.5050608@unicode.org> <86tvu12ycq.fsf@mimuw.edu.pl>
	<950a5dfa-36a6-b48b-eed1-befb210bb0ec@it.aoyama.ac.jp>
Message-ID: <86po4p2wbj.fsf@mimuw.edu.pl>

Thanks to all who answered. The answers are very clear, but the original
message and the adoption page are in my opinion much less clear. I can
however live with it :-)

Best regards

Janusz

On Wed, Feb 28 2018 at 11:53 +0100, mark at macchiato.com writes:

>> Also, please click through from the announcement to
>> http://www.unicode.org/consortium/adopt-a-character.html.
>>
>> If it isn't apparent from that page what the relationship is, we have
>> some work to do...
>>
>> Mark
>>
>> On Wed, Feb 28, 2018 at 11:48 AM, Martin J. Dürst via Unicode wrote:
>>
>> On 2018/02/28 19:38, Janusz S. Bień
>> via Unicode wrote:
>>
>> On Tue, Feb 27 2018 at 13:45 -0800, announcements at unicode.org writes:
>>
>> The 157 new Emoji are now available for adoption, to help the Unicode
>> Consortium's work on digitally disadvantaged languages.
>>
>> I'm quite curious what the relation is between the new emojis and the
>> digitally disadvantaged languages. I see none.
>>
>> I think this was mentioned before on this list, in particular by Mark:
>> The money collected from character adoptions (where emoji are a
>> prominent target) is (mostly?) used to support work on not-yet-encoded
>> (thus digitally disadvantaged) scripts. See e.g. the recent
>> announcement at
>> http://blog.unicode.org/2018/02/adopt-character-grant-to-support-three.html.

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From unicode at unicode.org  Wed Feb 28 05:58:45 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Wed, 28 Feb 2018 12:58:45 +0100
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: <86po4p2wbj.fsf@mimuw.edu.pl>
References: <5A95D192.5050608@unicode.org> <86tvu12ycq.fsf@mimuw.edu.pl>
	<950a5dfa-36a6-b48b-eed1-befb210bb0ec@it.aoyama.ac.jp>
	<86po4p2wbj.fsf@mimuw.edu.pl>
Message-ID: 

I'm more interested in what areas you found unclear, because wherever you
did, I'm sure many others would as well. You can reply off-list if you
want.

Mark

On Wed, Feb 28, 2018 at 12:22 PM, Janusz S. Bień wrote:

> Thanks to all who answered. The answers are very clear, but the original
> message and the adoption page are in my opinion much less clear. I can
> however live with it :-)
>
> Best regards
>
> Janusz
>
> On Wed, Feb 28 2018 at 11:53 +0100, mark at macchiato.com writes:
> > Also, please click through from the announcement to
> > http://www.unicode.org/consortium/adopt-a-character.html.
> >
> > If it isn't apparent from that page what the relationship is, we have
> > some work to do...
> >
> > Mark
> >
> > On Wed, Feb 28, 2018 at 11:48 AM, Martin J. Dürst via Unicode <
> > unicode at unicode.org> wrote:
> >
> > On 2018/02/28 19:38, Janusz S. Bień via Unicode wrote:
> >
> > On Tue, Feb 27 2018 at 13:45 -0800, announcements at unicode.org writes:
> >
> > The 157 new Emoji are now available for adoption, to help the Unicode
> > Consortium's work on digitally disadvantaged languages.
> >
> > I'm quite curious what the relation is between the new emojis and the
> > digitally disadvantaged languages. I see none.
> >
> > I think this was mentioned before on this list, in particular by Mark:
> > The money collected from character adoptions (where emoji are a
> > prominent target) is (mostly?) used to support work on not-yet-encoded
> > (thus digitally disadvantaged) scripts. See e.g. the recent
> > announcement at
> > http://blog.unicode.org/2018/02/adopt-character-grant-to-support-three.html.
>
> --
> Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra
> Lingwistyki Formalnej)
> Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
> jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org  Wed Feb 28 07:22:32 2018
From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode)
Date: Wed, 28 Feb 2018 14:22:32 +0100 (CET)
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: <5A95D192.5050608@unicode.org>
References: <5A95D192.5050608@unicode.org>
Message-ID: <91680448.22170.1519824152519@ox.hosteurope.de>

announcements at unicode.org:

> The 157 new Emoji are now available for adoption,

But Unicode 11.0 (which all new emojis but Pirate Flag and Infinity rely
upon) is not even in beta yet.

> There are approximately 7,000 living human languages,
> but fewer than 100 of these languages are well-supported on computers,
> mobile phones, and other devices. Adopt-a-character donations are used
> to improve Unicode support for digitally disadvantaged languages, and to
> help preserve the world's linguistic heritage.

Why is the announcement mentioning those numbers of languages at all? The
script coverage of written living human languages, except for constructed
ones, is almost complete in Unicode, and rendering for most of them is
reasonably well supported by all modern operating systems (despite
recently discovered bugs). Availability of translations or original
material is another matter entirely. Languages that have no written
tradition are irrelevant to Unicode (but not to the world's linguistic
heritage).

In other words, no future update to the UCS will significantly change that
100-out-of-7000 metric, but the announcement makes it sound like it would.
CLDR may have some influence, but character adoptions and the research
grants they enable are not at all associated with that.

From unicode at unicode.org  Wed Feb 28 07:41:14 2018
From: unicode at unicode.org (Andrew West via Unicode)
Date: Wed, 28 Feb 2018 13:41:14 +0000
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: <950a5dfa-36a6-b48b-eed1-befb210bb0ec@it.aoyama.ac.jp>
References: <5A95D192.5050608@unicode.org> <86tvu12ycq.fsf@mimuw.edu.pl>
	<950a5dfa-36a6-b48b-eed1-befb210bb0ec@it.aoyama.ac.jp>
Message-ID: 

On 28 February 2018 at 10:48, Martin J. Dürst via Unicode wrote:
>>
>>> The 157 new Emoji are now available for adoption, to help the Unicode
>>> Consortium's work on digitally disadvantaged languages.
>>
>> I'm quite curious what the relation is between the new emojis and the
>> digitally disadvantaged languages. I see none.
>
> I think this was mentioned before on this list, in particular by Mark:
> The money collected from character adoptions (where emoji are a prominent
> target) is (mostly?) used to support work on not-yet-encoded (thus
> digitally disadvantaged) scripts. See e.g. the recent announcement at
> http://blog.unicode.org/2018/02/adopt-character-grant-to-support-three.html.
>
> Regards, Martin.

Over $250,000 has been raised from Unicode character adoptions to date. I
am curious as to how much of this money has been spent, and would very
much like to see annual accounts showing how much money has been received,
and how much has been disbursed, to whom, and for what.

Andrew

From unicode at unicode.org  Wed Feb 28 09:39:06 2018
From: unicode at unicode.org (QSJN 4 UKR via Unicode)
Date: Wed, 28 Feb 2018 17:39:06 +0200
Subject: Bidi edge cases in Hangul and Indic
In-Reply-To: <3a5c8b7c-9b86-4654-fbf1-1432011e603f@att.net>
References: <9d47f560-e447-f121-9505-cc4f48e0171a@att.net>
	<3a5c8b7c-9b86-4654-fbf1-1432011e603f@att.net>
Message-ID: 

Thank you.
Section 3.5 confused me: shaping, that is, the selection of cursively
connected shapes, is applied after the UBA reordering. However, other
character-to-glyph conversions are applied before it "(taking the
embedding levels into account for mirroring)".

> 2018-02-26 21:45 GMT+02:00, Ken Whistler :
> On 2/26/2018 7:11 AM, QSJN 4 UKR wrote:
>>> The UBA reorders the display order in layout -- not the underlying
>>> string.
>>
>> What?
>>
>> UBA reorders characters, not glyphs.
>
> Actually it does not. The backing storage order of the text is
> unaffected. See UAX #9:
>
> "When working with bidirectional text, the characters are still
> interpreted in logical order--only the display is affected."
>
> And see Section 3.4, Reordering Resolved Levels. The character stream is
> mapped onto glyphs *in logical order*.

From unicode at unicode.org  Wed Feb 28 08:00:54 2018
From: unicode at unicode.org (Andrew West via Unicode)
Date: Wed, 28 Feb 2018 14:00:54 +0000
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: <91680448.22170.1519824152519@ox.hosteurope.de>
References: <5A95D192.5050608@unicode.org>
	<91680448.22170.1519824152519@ox.hosteurope.de>
Message-ID: 

On 28 February 2018 at 13:22, Christoph Päper via Unicode wrote:
>>
>> The 157 new Emoji are now available for adoption
>
> But Unicode 11.0 (which all new emojis but Pirate Flag and Infinity rely
> upon) is not even in beta yet.

Don't even get me started on that!

>> There are approximately 7,000 living human languages,
>> but fewer than 100 of these languages are well-supported on computers,
>> mobile phones, and other devices. Adopt-a-character donations are used
>> to improve Unicode support for digitally disadvantaged languages, and to
>> help preserve the world's linguistic heritage.
>
> Why is the announcement mentioning those numbers of languages at all?

I agree, the figures are meaningless and misleading (and intended to
mislead). I could list a hundred languages that are written with the Latin
script without pausing for breath. There are very, very few scripts in
modern daily use that are not yet encoded in the UCS, but letting out that
secret will not help the Unicode Consortium to raise money from character
adoption.

The latest grant to Anshu from Character Adoption money is for three
historic scripts
(http://blog.unicode.org/2018/02/adopt-character-grant-to-support-three.html).
If there were still so many digitally disadvantaged languages urgently in
need of script encoding, then surely the Unicode Consortium would be
sponsoring those as a priority rather than historic scripts.

Andrew

From unicode at unicode.org  Wed Feb 28 16:33:03 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 28 Feb 2018 23:33:03 +0100
Subject: Unicode Emoji 11.0 characters now ready for adoption!
In-Reply-To: <91680448.22170.1519824152519@ox.hosteurope.de>
References: <5A95D192.5050608@unicode.org>
	<91680448.22170.1519824152519@ox.hosteurope.de>
Message-ID: 

2018-02-28 14:22 GMT+01:00 Christoph Päper via Unicode :

> > There are approximately 7,000 living human languages,
> > but fewer than 100 of these languages are well-supported on computers,
> > mobile phones, and other devices.

Fewer than 100 languages is a bit small; I can count nearly 200 languages
well supported with all the necessary basic support to develop them with
content.
The limitation, however, is elsewhere: in education and literacy levels
for these languages, so that people start using them as well on the web
and in other media, or use them more easily in their daily life, and
improve the quality and coverage of the data available in these languages.
This includes developing an orthography (many languages don't have any
developed and supported orthography, even if there have been attempts to
create dictionaries, including online with Wiktionary).
This these languages are living, it should > not > be difficult to support most of them with the existing scripts that > are already encoded (weve reched the point where we only have to > encode historic scripts, to preserve the cultures or languages that > have disappeared or are dying fast since the begining of the 20th > century). Even if major languages will persist and regional languages > will die, this should not be done without reintegrating in those > major > languages some significant parts of the past regional cultures, which > can still become sources for enriching these major languages so that > they become more precise and more useful and allow then easier access > to past regional languages, possibly then directly in their original > script, with people then able to decipher them or being interested to > study them. Past languages and preserved texts will then remain as a > rich source for keeping existing languages alive, vivid, productive > for new terms, without having to necessarily borrow terms from less > than 20 large "international" languages (ar, de, en, es,?fa, fr, nl, > id, ja, ko, pt, ru, hi, zh), written in only 6 well developed scripts > (Arab, Latn, Cyrl, Deva, Hang, Hans, Jpan). > Pen, or brush and paper is much more flexible. With thousands of names of people and places still not encoded I am not sure if I would describe hans (simplified Chinese characters) as well supported. nor with current policy which limits China with over one billion people to submitting less than 500 Chinese characters a year on average, and names not being all to be added, it is hard to say which decade hans will be well supported. John Knightley > > > Links: > ------ > [1] mailto:unicode at unicode.org