From unicode at unicode.org Fri May 11 05:14:16 2018 From: unicode at unicode.org (Maggie Oates via Unicode) Date: Fri, 11 May 2018 06:14:16 -0400 Subject: code of ethics and conduct Message-ID: Greetings, The "Unicode Consortium Whistleblower Policy" listed on the Consortium's policies page references the "Unicode Consortium's Code of Ethics and Conduct." I'm having trouble finding this code, and was wondering if someone could *point me to a copy of that code of ethics*. The closest reference I've found is a small section in the bylaws for the board members: "Article 3. Sec 11. Standard of Conduct." Thanks! --- Maggie Oates Societal Computing, PhD student Carnegie Mellon University moates at cmu.edu she/her/hers -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri May 11 12:37:27 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 11 May 2018 18:37:27 +0100 Subject: Choosing the Set of Renderable Strings Message-ID: <20180511183727.54a341bf@JRWUBU2> For assembling a rendering system for a script with combining marks, is there a guide as to how to decide what strings one should exclude, and which one should strive to support? There will also be characters outside the script that should be supported. For a font, there are lists of characters for Microsoft Word and for the Universal Scripting Engine, and it is frequently desirable for a font to be able to display its own name. There are also various control and formatting characters, and punctuation characters from outside the script. I believe compromises are necessary. There are issues with stacking combining marks - at what point does one throw oneself on the mercy of the application? Making characters small enough to accommodate a cross-line stack of 20 within the nominal line separation is usually not acceptable! (There are Sanskrit manuscripts where a stack extends across several lines.) There are also problems if glyphs cannot simply be stacked - it is not unknown for a 'subscript' glyph to obligatorily have a part on the baseline - preposed 'subscript' RA can require different glyphs depending on how deeply it is stacked. If canonical equivalence does not eliminate homographs, there is the question of which homographs to tolerate. I have hit this issue with Tai Tham. The essence of the problem is that a CVCV word with identical consonants can be abbreviated to CVV, as in some other scripts, and dependent vowels can be written using several vowel symbols. All vowels have ccc=0. Now, the accepted proposal (i.e. the one accepted by the UTC for the ISO process) gave an order for the vowels in such polygraphs, and most combinations resulting from such contraction comply with this order. The existence of such a contraction can be indicated in writing by the (ambiguous) mark MAI SAM, and in such cases the proposed encoding of Tai Tham text is of the form CVxV where 'x' is MAI SAM. In such cases I allow the constraint on vowel order to apply to each vowel separately. This allows homographs, but I take the view that I am rejecting homographs to facilitate searching, not to prevent spoofing. The prevention of spoofing would use stricter rules, which would ban some words, just as the English word "café" is prohibited in British domain names. (The doublet "cafe" refers to a lower class of establishment in British English.) However, the mark MAI SAM is not always used.
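A rough Python sketch of the vowel-order constraint just described, accepting a contracted cluster's vowel signs only if they respect a given order, and applying the check to each part separately when MAI SAM splits the cluster. The permitted-order list is left as a placeholder, and the MAI SAM code point is an assumption to be verified against the code charts; neither is quoted from this message.

    MAI_SAM = "\u1A7B"      # assumed code point for TAI THAM SIGN MAI SAM; verify against the UCD
    PERMITTED_ORDER = []    # placeholder: vowel signs in the order the accepted proposal gives

    def vowels_in_order(vowels):
        # True if the vowel signs occur in non-decreasing position of PERMITTED_ORDER.
        positions = [PERMITTED_ORDER.index(v) for v in vowels if v in PERMITTED_ORDER]
        return positions == sorted(positions)

    def accept_vowel_run(run):
        # CVxV with x = MAI SAM: constrain each vowel group on its own;
        # otherwise constrain the whole run.
        parts = run.split(MAI_SAM) if MAI_SAM in run else [run]
        return all(vowels_in_order(list(part)) for part in parts)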
Now, if Tai Tham vowels had non-zero combining marks, I would separate the vowels from the two phonetic syllables by the general disruptor, CGJ, to facilitate sorting. At the very least the word should then be sorted with other words starting with the same CV, and with preprocessing, the CGJ could be replaced by the omitted consonant. Now, Tai Tham vowels have ccc=0, but I favour retaining the CGJ to mark the location of the repeated consonant. This CGJ also enables me to make some check as to whether the individual phonetic syllables' vowel symbols are in the correct order. So: (a) If the vowel symbols in CVV are in the permitted order, the string is accepted. (b) If the word is typed as CVV and the vowels on either side of CGJ are in the correct order, the string is accepted. (c) If the word is typed as CVV and the vowel symbols are not in the permitted order, and I can detect this, I allow the implementation of the Universal Script Engine (be it Microsoft, AAT or HarfBuzz) to insert its dotted circles. More precisely, I don't remove them. Is this a reasonable approach to allowing both collation and suppressing needless homographs? My contribution to the rendering is only the provision of a font. Richard. From unicode at unicode.org Sat May 12 09:01:44 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 12 May 2018 15:01:44 +0100 Subject: Lack of ulUnicodeRange Bit for Adlam In-Reply-To: References: Message-ID: <20180512150144.46092c59@JRWUBU2> On Tue, 27 Feb 2018 11:45:36 -0500 Neil Patel via Unicode wrote, under topic heading "Re: Unicode Digest, Vol 50, Issue 20": > Does the ulUnicodeRange bits get used to dictate rendering behavior or > script recognition? > > I am just wondering about whether the lack of bits to indicate an > Adlam charset can cause other issues in applications. (Answering in case the problem is still relevant - I had misfiled this post.) The lack is unlikely to cause any problems in anything recent enough to understand the concept of "Adlam" - the bits were not added for blocks newer than Unicode 5.1. As Adlam was only added in Version 9.0, there is a significant risk of rendering engines not supporting cursive joining. On the other hand, the Adlam characters have been right-to-left since Version 5.2. (It shouldn't matter whether one of the characters switched from right to left to NSM.) Richard. From unicode at unicode.org Mon May 14 01:15:10 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 13 May 2018 22:15:10 -0800 Subject: Choosing the Set of Renderable Strings In-Reply-To: <20180511183727.54a341bf@JRWUBU2> References: <20180511183727.54a341bf@JRWUBU2> Message-ID: Richard Wordingham asked, ? Is this a reasonable approach to allowing both collation ? and suppressing needless homographs? My contribution to ? the rendering is only the provision of a font. If anything about this approach was unreasonable, one of the experts on this list would probably have pointed it out by now. Trailblazers such as yourself will help to establish the guidelines you seek. One does the best that one can in anticipating the character strings the font will be expected to support, follows the font specs, and puts the results out there for the public. Then, the user community, if any, may provide appropriate feedback to the developers so that adjustments can be made. 
Riding along with the insertion of the dotted circles by the USE enables the actual users to see immediately that the text needs to be modified in order to render reasonably on that system with the shaping engine and font selected. If users consider any such insertion inappropriate, then it's feedback time. ? ... and it is frequently desirable for a font to be able ? to display its own name. Does the font name have to be in a Latin-based script? From unicode at unicode.org Mon May 14 02:47:55 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 14 May 2018 08:47:55 +0100 Subject: Choosing the Set of Renderable Strings In-Reply-To: References: <20180511183727.54a341bf@JRWUBU2> Message-ID: <20180514084755.7da895d7@JRWUBU2> On Sun, 13 May 2018 22:15:10 -0800 James Kass via Unicode wrote: > Richard Wordingham asked, > > ? Is this a reasonable approach to allowing both collation > ? and suppressing needless homographs? My contribution to > ? the rendering is only the provision of a font. > > If anything about this approach was unreasonable, one of the experts > on this list would probably have pointed it out by now. Not necessarily; some may still be recovering from the recent UTC meeting. Moreover, it took many years before we were told that there was no character to suppress word boundaries wrongly deduced by Thai breaking algorithms. The character we had been using, U+2060 WORD JOINER, is apparently only for suppressing line breaks. > Riding along with the insertion of the dotted circles by the USE > enables the actual users to see immediately that the text needs to be > modified in order to render reasonably on that system with the shaping > engine and font selected. If users consider any such insertion > inappropriate, then it's feedback time. The massive failure of USE was reported within hours of USE being announced on the Unicode forum. So far there has only been tinkering, and an encouragement of bad spelling. For example, at least about 23% of Northern Thai monosyllables can be rendered only by clear misspelling - see the results in http://www.wrdingham.co.uk/lanna/random_test.htm. The USE specification brushes over this with the statement, "Note: Tai Tham support is limited to mono-syllabic clusters", which gives the misleading impression that mono-syllabic clusters are supported. Basically, support is limited to (C)+(V)* clusters with a liberal interpretation of C and V. Crw and Cry aren't supported either. At the moment, one is generally better off using a Thai hack font that uses paiyannoi to toggle between the various forms and placements of Tai Tham characters. That has the advantage that the text is still intelligible when you have no font that renders it as Tai Tham. The main limitation of such schemes is in plain text. > ? ... and it is frequently desirable for a font to be able > ? to display its own name. > > Does the font name have to be in a Latin-based script? Postscript certainly gets unhappy if there isn't an ASCII name for it; I don't know the requirements for the various PDF generators. Richard. 
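Whether a given font plus shaper will insert dotted circles for a particular string - the signal described above - can be checked in bulk while building a font. A rough sketch using the uharfbuzz Python bindings; the font path and the test strings are placeholders, and the binding names used here should be checked against the uharfbuzz documentation.

    import uharfbuzz as hb

    def load_font(path):
        with open(path, "rb") as f:
            return hb.Font(hb.Face(hb.Blob(f.read())))

    def shaped_glyph_ids(font, text):
        buf = hb.Buffer()
        buf.add_str(text)
        buf.guess_segment_properties()
        hb.shape(font, buf, {})
        return [info.codepoint for info in buf.glyph_infos]   # glyph indices after shaping

    font = load_font("Lamphun.ttf")                      # placeholder font path
    dotted_circle = shaped_glyph_ids(font, "\u25CC")[0]  # the font's glyph for U+25CC

    for s in ("\u1A36\u1A63\u1A74", "\u1A36\u1A74\u1A63"):   # placeholder test strings
        if dotted_circle in shaped_glyph_ids(font, s):
            print("dotted circle inserted for", s.encode("unicode_escape").decode())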
From unicode at unicode.org Mon May 14 07:12:56 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 14 May 2018 04:12:56 -0800 Subject: Choosing the Set of Renderable Strings In-Reply-To: <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> Message-ID: In response to William Overington's post, it's easier to transcode data from a PUA scheme into Unicode than it is to enter the data from scratch. (The same could be said for a customized ASCII font.) Some users may not wish to wait even the handful of years it took for mainstream Indic complex scripts to be rendered properly. At this phase of Unicode's progress, however, we shouldn't encourage the interchange of such PUA data. Since it's simple to transcode, any such data should be transcoded prior to interchange or permanent storage. Recipients lacking systems supporting proper Unicode rendering for complex scripts such as Tai Tham could then transcode it to the PUA scheme for display/printing purposes. An OpenType font, a keyboard driver, and a text conversion utility might go a long way towards supporting complex scripts for users whose systems cannot otherwise currently support them. A good keyboard driver should be able to remove some of the burden off of the OpenType tables, enabling multiple fonts covering the same script to be used without having bloated and redundant OpenType tables, by offering some degree of control over the actual character strings which are being stored (and presented to the font for rendering). (Many font developers might consider that any kind of normalization should be handled at input rather than left up to the font. Keyboard developers might have a different idea, though.) A hundred years from now, properly encoded Tai Tham text should be legible. But the ability to display data using temporary PUA schemes which were set up in lieu of proper rendering support appears to fade away over time. From unicode at unicode.org Mon May 14 02:55:05 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Mon, 14 May 2018 08:55:05 +0100 (BST) Subject: Choosing the Set of Renderable Strings In-Reply-To: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> Message-ID: <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> One possibility that might be worth consideration is to map each otherwise unmapped glyph in the font each to a distinct code point in the Private Use Area. This being as well as all of the automated glyph substitution, not instead of it. This is not an ideal solution and may be regarded by some people as a wrong approach but if an end user of the font is trying to produce a hard copy print out or a PDF (Portable Document Format) document and is stuck because he or she cannot otherwise get the desired glyphs for the desired printable display from the font, the facility of being able to insert a desired glyph from the Private Use Area can get the desired result produced. Certainly, an end user could follow that up with feedback to the font producer so that in due course the display can become producible without needing to use a Private Use Area code point, yet having the glyphs available in the Private Use Area could sometimes be useful when a result is needed straightaway. 
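The transcoding step mentioned above is, at its simplest, a small mapping exercise. A sketch in Python; the two PUA assignments are made up for illustration rather than taken from any actual font, and a scheme that encodes presentation forms or reorderings would need string-to-string rules rather than a one-to-one table.

    PUA_TO_UNICODE = {
        0xF000: 0x1A20,   # hypothetical PUA slot -> U+1A20 TAI THAM LETTER HIGH KA
        0xF001: 0x1A63,   # hypothetical PUA slot -> U+1A63 TAI THAM SIGN AA
    }
    UNICODE_TO_PUA = {v: k for k, v in PUA_TO_UNICODE.items()}

    def to_unicode(text):
        """Run before interchange or permanent storage."""
        return text.translate(PUA_TO_UNICODE)

    def to_pua(text):
        """Run for display or printing on a system without proper shaping support."""
        return text.translate(UNICODE_TO_PUA)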
William Overington Monday 14 May 2018 ----Original message---- >From : unicode at unicode.org Date : 2018/05/14 - 07:15 (GMTDT) To : richard.wordingham at ntlworld.com Cc : unicode at unicode.org Subject : Re: Choosing the Set of Renderable Strings Richard Wordingham asked, ? Is this a reasonable approach to allowing both collation ? and suppressing needless homographs? My contribution to ? the rendering is only the provision of a font. If anything about this approach was unreasonable, one of the experts on this list would probably have pointed it out by now. Trailblazers such as yourself will help to establish the guidelines you seek. One does the best that one can in anticipating the character strings the font will be expected to support, follows the font specs, and puts the results out there for the public. Then, the user community, if any, may provide appropriate feedback to the developers so that adjustments can be made. Riding along with the insertion of the dotted circles by the USE enables the actual users to see immediately that the text needs to be modified in order to render reasonably on that system with the shaping engine and font selected. If users consider any such insertion inappropriate, then it's feedback time. ? ... and it is frequently desirable for a font to be able ? to display its own name. Does the font name have to be in a Latin-based script? From unicode at unicode.org Mon May 14 11:47:11 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 14 May 2018 17:47:11 +0100 Subject: Choosing the Set of Renderable Strings In-Reply-To: <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> Message-ID: <20180514174711.7d8109b6@JRWUBU2> On Mon, 14 May 2018 08:55:05 +0100 (BST) William_J_G Overington via Unicode wrote: > One possibility that might be worth consideration is to map each > otherwise unmapped glyph in the font each to a distinct code point in > the Private Use Area. This being as well as all of the automated > glyph substitution, not instead of it. That's what the Xishuangbanna News does for final consonants. My issues are generally not with producing the right image, but rather with enabling the semantically correct sequence of characters. (It would be daft to impose phonetic order on the users and then prohibit it piecemeal.) I can overcome the USE, the question is which battles the font is to fight. Richard. From unicode at unicode.org Mon May 14 14:31:15 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 14 May 2018 20:31:15 +0100 Subject: Choosing the Set of Renderable Strings In-Reply-To: References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> Message-ID: <20180514203115.5c093920@JRWUBU2> On Mon, 14 May 2018 04:12:56 -0800 James Kass via Unicode wrote: > In response to William Overington's post, it's easier to transcode > data from a PUA scheme into Unicode than it is to enter the data from > scratch. (The same could be said for a customized ASCII font.) Some > users may not wish to wait even the handful of years it took for > mainstream Indic complex scripts to be rendered properly. > > At this phase of Unicode's progress, however, we shouldn't encourage > the interchange of such PUA data. 
Since it's simple to transcode, any > such data should be transcoded prior to interchange or permanent > storage. > Recipients lacking systems supporting proper Unicode > rendering for complex scripts such as Tai Tham could then transcode it > to the PUA scheme for display/printing purposes. The PUA scheme would be roughly equivalent to the glyph sequence produced by the shaper. (The ccmp feature is in general not available for the PUA, though CSS allows its use to be forced.) However, there would be no extra channels, such as the component-mark association often needed for some cursive scripts. For example, in ???? 'to direct', SIGN U may be realised as a mark below left, a mark below , or a spacing mark on the right of . One could argue that the three positions require different glyphs for SIGN U. Each font would need its own PUA. > An OpenType font, a keyboard driver, and a text conversion utility > might go a long way towards supporting complex scripts for users whose > systems cannot otherwise currently support them. This is where Apple had the right idea, but difficult of implementation, and the OTL paradigm is deficient. There are several places in Tai Tham layout where I want to swap glyphs round, but for the layout engine to do so for me would cause grief for other Tai Tham fonts. This rearrangement cannot be delegated to the rendering engine. There are Tai Tham fonts which handle Indic rearrangement in the ccmp feature, but they are then totally defeated by either ccmp not being enabled or by the USE doing basic Indic shaping. There are now two approaches for Tai Tham - (1) fix USE or restore/create a separate shaper for scripts with CVC... aksharas, and (2) overcome the USE in the font. For the latter I need to make the work-arounds in Da Lekh easier to copy. I have transferred them to Ed Trager's Hariphunchai font, yielding Lamphun, but Lamphun still needs some further revision to the positioning logic. It wasn't as complete as I'd hoped. I've done a quick fix for the vowels below, but I suspect much more work is needed to conform to the spirit of the Hariphunchai font. I could do with someone artistic to help with the combinations of NYA and subscript consonant such as NY.CA, and Pali LL.HA is currently a disaster. On Track 1, there's also more tinkering to do, such as making MEDIAL LA and MEDIAL RA 'consonant subscript' rather than 'consonant medial' /lw/ is an allowed onset in the Tai languages using the Tai Tham script, so we get orthographic onset with MEDIAL LA in the West. The main problem is that we do not have characters *MEDIAL WA and *MEDIAL YA - the general subscript WA and YA are used instead, and these can function as matres lectionis. (In Unicode Khmer, the matres lectionis have been reanalysed as vowels.) I think it would also help to make SIGN AA and SIGN TALL AA into letters as far as the USE is concerned. The default grapheme segmentation rules already treat them as consonants. The possible downside is that so doing might mess up some fonts. > A good keyboard driver should be able to remove some of the burden off > of the OpenType tables, enabling multiple > fonts covering the same script to be used without having bloated and > redundant OpenType tables, by offering some degree of control over the > actual character strings which are being stored (and presented to the > font for rendering). It won't work. The text input delivered by X still needs to be supported, and without modifying the application, X can only input one character at a time. 
Not everyone uses an 'input method'. > (Many font developers might consider that any kind of normalization > should be handled at input rather than left up to the font. Keyboard > developers might have a different idea, though.) Apparently, Hangul input should not be canonically normalised in South Korea. I've seen an implementation of the USE render canonically equivalent strings differently. It wouldn't be HarfBuzz - it normalises, as we saw when it briefly messed up Tai Tham rendering when it swapped to . That was rapidly fixed to normalise the other way round. I'd completely forgotten that Thai, Lao and Tai Tham tone marks had different combining classes. However, in Northern Thai, and seem to render the same, so normalisation might not be relevant. Unsurprisingly, that's the only pair of tone-marks I've seen in the same akshara, so I don't know how the other pairs of distinct tone marks combine. A pair arises when two chained syllables have different tone marks. If they have the same tone mark, one is suppressed. Richard. From unicode at unicode.org Tue May 15 05:18:11 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 15 May 2018 02:18:11 -0800 Subject: Choosing the Set of Renderable Strings In-Reply-To: <20180514203115.5c093920@JRWUBU2> References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> <20180514203115.5c093920@JRWUBU2> Message-ID: Richard Wordingham replied, ?? ...Private Use Area... ? ? That's what the Xishuangbanna News does for final consonants. I failed to find a link for their web site, but only spent about an hour and a half searching for it. There is a web site for "Xishuangbanna Daily", but the pages I saw there were all in Chinese. If Xishuangbanna News is publishing using PUA, then they probably offer a font for download. I was just curious to see what their web pages looked like, and wondered how pervasive the PUA use is. If their site only resorts to PUA for final consonants, then a presumption would be that the USE supports all other shaping requirements for the script. ? My issues are generally not with producing the right image, ? but rather with enabling the semantically correct sequence ? of characters. Because you started out with all the Tai Tham glyphs mapped to the PUA, and are now trying to produce a working font using the standard encoding? From unicode at unicode.org Tue May 15 07:19:42 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 15 May 2018 04:19:42 -0800 Subject: Choosing the Set of Renderable Strings In-Reply-To: <20180514203115.5c093920@JRWUBU2> References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> <20180514203115.5c093920@JRWUBU2> Message-ID: On Mon, May 14, 2018 at 11:31 AM, Richard Wordingham via Unicode wrote: > ... One could argue that the three positions require > different glyphs for SIGN U. Each font would need its own PUA. Or a consensus. > ... There are several > places in Tai Tham layout where I want to swap glyphs round, but for > the layout engine to do so for me would cause grief for other Tai Tham > fonts. This rearrangement cannot be delegated to the rendering > engine. There are Tai Tham fonts which handle Indic rearrangement in > the ccmp feature, but they are then totally defeated by either ccmp not > being enabled or by the USE doing basic Indic shaping. 
Suppose the OpenType specs were revised to include a bit which could be set for disabling basic Indic shaping by the USE? I wouldn't set it if I were just starting out to make a font for a complex script requiring basic Indic shaping, and cannot imagine why anyone else just starting out would. > ... > > I think it would also help to make SIGN AA and SIGN TALL AA into > letters as far as the USE is concerned. The default grapheme > segmentation rules already treat them as consonants. The possible > downside is that so doing might mess up some fonts. The possibility of messing up some fonts has seldom (if ever) stopped needed revisions to shaping engines before. I should know. >> A good keyboard driver ... > > It won't work. The text input delivered by X still needs to be > supported, and without modifying the application, X can only input one > character at a time. Not everyone uses an 'input method'. Every keyboard uses a driver, though. I can't speak for "X", but my understanding is that the keyboard driver acts as sort of a buffer between the user's key strokes and the application. > Apparently, Hangul input should not be canonically normalised in South > Korea. I've seen an implementation of the USE render canonically > equivalent strings differently. ... Because the USE failed or because the font provided look-ups for each of those strings to different glyphs? Best regards, James Kass From unicode at unicode.org Tue May 15 09:04:45 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 15 May 2018 06:04:45 -0800 Subject: Choosing the Set of Renderable Strings In-Reply-To: References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> <20180514203115.5c093920@JRWUBU2> Message-ID: Display behaviour which is script-specific should be handled by the rendering/shaping engine. Only that which is font-specific should be handled by the font. The font's OpenType tables will include pointers to presentation forms which aren't directly encoded, the location and repertoire of which would naturally differ from font to font. Likewise, the font's GPOS tables will handle things such as mark positioning, because each font's metrics are going to be different. Because the USE apparently accesses current on-line Unicode data, the USE will re-order anything which needs to be moved around. From unicode at unicode.org Tue May 15 09:15:07 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Tue, 15 May 2018 15:15:07 +0100 (BST) Subject: Colours - both for emoji and otherwise Message-ID: <26212264.32041.1526393707864.JavaMail.defaultUser@defaultHost> Years ago this mailing list had some wonderful long discussions. A similar such discussion may be interesting now on the topic of Colours - both for emoji and otherwise, as recent developments could possibly be leading towards a major change in Unicode. A few days - including a weekend - before the recent UTC (Unicode Technical Committee) meeting there appeared in the Current UTC Document Register for 2018 the following document. http://www.unicode.org/L2/L2018/18141-emoji-colors.pdf I wrote some comments and sent them in as feedback. They are available as the last listed item in the Encoding Feedback for that particular UTC meeting. http://www.unicode.org/L2/L2018/18117-pubrev.html#Encoding_Feedback However, the original 18141-emoji-colors.pdf document has been revised twice since that feedback and the following is the present version. 
http://www.unicode.org/L2/L2018/18141r2-emoji-colors.pdf It seems to me that there are, in a Unicode context, at least two possible ways that the use of a white square next to an emoji of a brown bear could "indicate that an emoji has a different color". One way is that the person viewing the white square next to an emoji of a brown bear 'knows' that a white bear is intended and 'understands' that that is the intended meaning - that could be useful as it is language-independent so communication through the language barrier of mention of a white bear is possible. I just wrote language-independent but I am wondering if 'knowing' that and 'understanding' that mean that the use of those characters in that way is part of an emoji-based language. I am not a linguist and maybe some people who are linguists might like to comment on that and also maybe on the whole notion of emoji characters being used to produce languages - not necessarily constructed languages but also languages that are arising and evolving naturally but at a much faster rate than natural languages evolved historically. Another way is that the rendering system displays an emoji of a white bear instead of the white square next to an emoji of a brown bear. Yet would what I have just referred to as an emoji of a white bear actually be an emoji as such or would it be a "just" a picture glyph and not an emoji as such as it is not a separately encoded character? What makes the present situation interesting though and thus worth a discussion is the following. The new characters about colours are listed in sections 5 and 6 of the following document. http://www.unicode.org/L2/L2018/18176-future-adds.pdf Yet the minutes of the UTC meeting, http://www.unicode.org/L2/L2018/18115.htm has the following. > Discussion. UTC took no action at this time. Now maybe that was later overridden by later discussions yet not listed in the minutes under Emoji Colors as such, but I am wondering if that refers to whether, and if so, how, a white square next to an emoji of a brown bear could be specified within The Unicode Standard so that such a sequence were to become rendered as a glyph of a white bear. Yet I am wondering if another set of characters, colour operators, should be defined for such an automated purpose, yet also have a displayable glyph for graceful fall-back display when automated rendering is not possible: the colour operators being encoded in plane 14; yet also having a mode where the colour operator could be displayed as a zero-width space as an alternative graceful fall-back display. Yet colours are being talked about in relation to emoji. What about with other characters, such as letters of the alphabet? The encoding of colours is fascinating and may be the next big thing with Unicode, so a discussion in this mailing list as to what is possible and what is desirable could be of importance. William Overington Tuesday 15 May 2018 From unicode at unicode.org Tue May 15 12:47:51 2018 From: unicode at unicode.org (Johnny Farraj via Unicode) Date: Tue, 15 May 2018 13:47:51 -0400 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols Message-ID: Dear Unicode list members, I wish to get feedback about a new symbol submission proposal. Currently the Miscellaneous Symbols table (2600-26FF) includes the following characters: 266D ? MUSIC FLAT SIGN 266F ? MUSIC SHARP SIGN while the Musical Symbols table (1D100 - 1D1FF) includes the following characters: 1D12A ?? MUSICAL SYMBOL DOUBLE SHARP 1D12B ?? 
MUSICAL SYMBOL DOUBLE FLAT 1D12C ?? MUSICAL SYMBOL FLAT UP 1D12D ?? MUSICAL SYMBOL FLAT DOWN 1D130 ?? MUSICAL SYMBOL SHARP UP 1D131 ?? MUSICAL SYMBOL SHARP DOWN 1D132 ?? MUSICAL SYMBOL QUARTER TONE SHARP 1D133 ?? MUSICAL SYMBOL QUARTER TONE FLAT None of these matches what's used in Arabic music notation. I am proposing the addition of 2 new characters to the Musical Symbols table: - the half-flat sign (lowers a note by a quarter tone) - the half-sharp sign (raises a note by a quarter tone) [image: Inline image] [image: Inline image] These are the correct symbols for Arabic music notation, and they express intervals that are multiples of quarter tones. it would be really nice to be able to include them directly in an HTML page or rich text document using a native code rather than an image. I am the primary sponsor of this proposal. As far as my credentials, I am the owner of http://maqamworld.com, the most widely used online resource on Arabic music theory, in English. My co-sponsor is Sami Abu Shumays, author of http://maqamlessons.com, another important online reference for Arabic music theory. Together, we are in the process of publishing a book on Arabic music theory and performance with Oxford University Press, coming out late 2018. I can also enlist the support of many academics in the music theory field who specialize in Arabic music. I welcome any feedback on this proposal. thanks Johnny Farraj -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: half-sharp sign.png Type: image/png Size: 2754 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: half-flat sign.png Type: image/png Size: 3617 bytes Desc: not available URL: From unicode at unicode.org Tue May 15 15:29:49 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 15 May 2018 21:29:49 +0100 Subject: Choosing the Set of Renderable Strings In-Reply-To: References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> <20180514203115.5c093920@JRWUBU2> Message-ID: <20180515212949.73568f11@JRWUBU2> On Tue, 15 May 2018 02:18:11 -0800 James Kass via Unicode wrote: > Richard Wordingham replied, > > ?? ...Private Use Area... > ? > ? That's what the Xishuangbanna News does for final consonants. > > I failed to find a link for their web site, but only spent about an > hour and a half searching for it. There is a web site for > "Xishuangbanna Daily", but the pages I saw there were all in Chinese. There's a sample at New sample: http://www.dw12.com/DigitalNewspaper/xsbnbold/content/20180325/ArticelA04001DK.htm . I'd have added a link, but the sample page wasn't working. The page is currently suffering an attack of dittography (seen on both IE on windows 7 and Firefox on Ubuntu). > If Xishuangbanna News is publishing using PUA, then they probably > offer a font for download. I was just curious to see what their web > pages looked like, and wondered how pervasive the PUA use is. If > their site only resorts to PUA for final consonants, then a > presumption would be that the USE supports all other shaping > requirements for the script. > > ? My issues are generally not with producing the right image, > ? but rather with enabling the semantically correct sequence > ? of characters. 
> > Because you started out with all the Tai Tham glyphs mapped to the > PUA, and are now trying to produce a working font using the standard > encoding? No. The problem is a grammar Nazi of a rendering engine. I have been working from a set of characters, and what has happened is that some glyphs (in the ISO sense, not in the sense used for fonts) that looked as though they may have needed variation sequences have been split off as formally unrelated characters - MEDIAL LA, MEDIAL RA, SIGN SA and SIGN LOW PA OR HIGH RATHA. What do you mean by 'standard encoding'? It is agreed that there is a standard coding for *characters*. I have been using the encoding proposal accepted by the Unicode Technical Committee as the definition of the encoding of text; that, interpreted in the light of the changes to the encoding for characters, is what I have been using as the definition of the encoding of characters. A problem is that it seems that Unicode does not specify the encoding of text. HarfBuzz used to more or less implement the rules in the proposal, and rendering generally worked. Then HarfBuzz switched to USE. For example, what prompted my question was the encoding of the words /t??n t??/ and /t?? t??n/, both meaning 'hornet'. If the subscript consonant representing /n/ and the vowel /??/ form a ligature which is ambiguous as to the order of the phonemes, or the vowel truly falls through below the consonant, then the contracted form is the same for both words, and will be rendered if I type it as ??????? . However, the logical reading of that spelling is /ta?n?? ta?n??/, which sounds like a slightly unusual intensifier. If we follow the principle of using phonetic order, then /t??n t??/ will be encoded ??????? and /t?? t??n/ will be encoded ??????? . Both get a dotted circle because of the sequence . The second one gets a dotted circle because of tone before vowel; misapplying the single subsyllable rule from the proposal, the offence is having a tone mark before a vowel not on the right. Without the tone mark or MAI KANG, the offence would be having a below matra (SIGN OA BELOW) before a left matra (SIGN AE). When MAI KANG was a vowel, back in Unicode 9.0, a USE implementation would detect two different offences: (a) Having a top matra (MAI KANG) before a left matra (SIGN AE) and (b) Following the accepted proposal for Tai Tham and having a bottom matra (SIGN OA BELOW) before a top matra (MAI KANG). A fastidious writer would separate the two subsyllables with MAI SAM, which is a visible mark. My specific question was whether, in the absence of MAI SAM, it was in order to use CGJ to separate the two subsyllables, so that a grammar checker would know where the boundary between the subsyllables lay. The issue is that the TUS says that CGJ does not affect rendering, just after an example of it affecting rendering in Hebrew. Now, a possible argument is that it may affect whether rendering occurs; the insertion of a dotted circle is to be interpreted as meaning that the renderer has refused to render the string. Richard.
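On the searching and sorting side of this, the preprocessing mentioned earlier in the thread (replacing CGJ by the omitted consonant before collation) and the folding needed for plain matching are both simple string operations. A sketch, with the repeated consonant supplied by the caller; only the string handling is shown, and the result would then be fed to whatever collator or matcher is in use.

    CGJ = "\u034F"   # COMBINING GRAPHEME JOINER, marking where the repeated consonant was omitted

    def collation_source(cluster, repeated_consonant):
        """Preprocess CV<CGJ>V to CVCV so the word sorts with other words starting with the same CV."""
        return cluster.replace(CGJ, repeated_consonant)

    def search_key(cluster):
        """Fold CGJ away for plain matching, so CV<CGJ>V and CVV compare equal."""
        return cluster.replace(CGJ, "")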
From unicode at unicode.org Tue May 15 16:40:11 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 15 May 2018 22:40:11 +0100 Subject: Choosing the Set of Renderable Strings In-Reply-To: References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> <20180514203115.5c093920@JRWUBU2> Message-ID: <20180515224011.13c4b348@JRWUBU2> On Tue, 15 May 2018 04:19:42 -0800 James Kass via Unicode wrote: > On Mon, May 14, 2018 at 11:31 AM, Richard Wordingham via Unicode > wrote: > > > ... One could argue that the three positions require > > different glyphs for SIGN U. Each font would need its own PUA. > > Or a consensus. One would end up with a large glyph list to accommodate all designs. Imagine applying this approach to Devanagari, with all its Sanskrit conjuncts to be supported although some converters would only target a small subset. > > ... There are several > > places in Tai Tham layout where I want to swap glyphs round, but for > > the layout engine to do so for me would cause grief for other Tai > > Tham fonts. This rearrangement cannot be delegated to the rendering > > engine. There are Tai Tham fonts which handle Indic rearrangement > > in the ccmp feature, but they are then totally defeated by either > > ccmp not being enabled or by the USE doing basic Indic shaping. > > Suppose the OpenType specs were revised to include a bit which could > be set for disabling basic Indic shaping by the USE? I wouldn't set > it if I were just starting out to make a font for a complex script > requiring basic Indic shaping, and cannot imagine why anyone else just > starting out would. One would need to set the bit while the script was not yet in Unicode, and then you may well need to set it when the USE bites. As another concrete example, one couldn't use USE for the Khmer script - it too has CVC syllables. I believe there are also lurking problems with the ordering of the rarer marks. You'd come unstuck if you found your script had both preposed subscripts and optionally preposed matras. The USE can't handle both in the same syllable. One might need to ignore syllable boundaries before Indic re-ordering, though that's probably a preference rather than a requirement. Tai Tham has a troublesome mark, U+1A58 TAI THAM SIGN MAI KANG LAI. In the West, it's 'Consonant final' and is a mark above or above right. In the East, it works like Burmese kinzi, and acts like a repha. Revision 1 of the Maefahluang Dictionary of Northern Thai sits on the border. In its text, it behaves one way in some environments, and the other way in others. Finally, many scripts had fonts before windows supported them. Indeed, isn't significant Tai Tham renderer support on Windows 7 restricted to HarfBuzz clients? (I don't believe M17n is significant, and I fear my interfacing set-up only works for my fonts.) > >> A good keyboard driver ... > > > > It won't work. The text input delivered by X still needs to be > > supported, and without modifying the application, X can only input > > one character at a time. Not everyone uses an 'input method'. > > Every keyboard uses a driver, though. I can't speak for "X", but my > understanding is that the keyboard driver acts as sort of a buffer > between the user's key strokes and the application. X attempts to present the key strokes to the application. The application may chose to present these key stroke to an input method to handle, but these input methods are not reliable. 
I have a battery of three inputs methods for most applications on Ubuntu - raw X keyboard mapping, ibus using Keyman for Linux, and fcitx using M17n. Additionally, I find Emacs is easier to use if I talk to it in ASCII and use its input methods for other character sets. The advantage there is that Emacs knows whether I am entering a command, which must be in ASCII, or text, for which it uses the active input method. Another issue is that normalised text can be highly inconvenient for a font. HarfBuzz chooses a non-standard normalisation for several scripts simply because that makes things easier for a font. > > I've seen an implementation of the USE render > > canonically equivalent strings differently. ... > > Because the USE failed or because the font provided look-ups for each > of those strings to different glyphs? Remember that the USE changes the string presented to the font by inserting dotted circles. Essentially, and can be penalised differently - Microsoft inserts more dotted circles than does HarfBuzz. Richard. From unicode at unicode.org Tue May 15 16:46:05 2018 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Tue, 15 May 2018 14:46:05 -0700 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: References: Message-ID: On Tue, May 15, 2018 at 10:47 AM, Johnny Farraj via Unicode < unicode at unicode.org> wrote: > Dear Unicode list members, > > I wish to get feedback about a new symbol submission proposal. > Just to clarify, this is a discussion list where you may get some useful feedback. This is not where you would submit an actual proposal. See https://www.unicode.org/pending/proposals.html I am proposing the addition of 2 new characters to the Musical Symbols > table: > > - the half-flat sign (lowers a note by a quarter tone) > - the half-sharp sign (raises a note by a quarter tone) > In an actual proposal, I would expect a discussion of whether you are proposing to encode established symbols, or whether you are proposing new symbols to be adopted by the community (in which case Unicode would probably wait & see if they get established). A proposal should also show evidence of usage and glyph variations. Best regards, markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 15 17:48:14 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Tue, 15 May 2018 15:48:14 -0700 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: References: Message-ID: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> On 5/15/2018 2:46 PM, Markus Scherer via Unicode wrote: > > I am proposing the addition of 2 new characters to the Musical > Symbols table: > > - the half-flat sign (lowers a note by a quarter tone) > - the half-sharp sign (raises a note by a quarter tone) > > > In an actual proposal, I would expect a discussion of whether you are > proposing to encode established symbols, or whether you are proposing > new symbols to be adopted by the community (in which case Unicode > would probably wait & see if they get established). > > A proposal should also show evidence of usage and glyph variations. > And should probably refer to the relationship between these signs and the existing: U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT which are also half-sharp or half-flat accidentals. 
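For what it is worth, the quarter-tone characters already encoded (and any characters that may yet be encoded for the Arabic half-flat and half-sharp) can be placed in plain text or HTML by code point, which is what the original request was after. A small sketch:

    QUARTER_TONE_SHARP = "\U0001D132"   # U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP
    QUARTER_TONE_FLAT = "\U0001D133"    # U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT

    def html_ref(ch):
        # Numeric character reference for use in an HTML page.
        return f"&#x{ord(ch):X};"

    print(html_ref(QUARTER_TONE_SHARP))   # &#x1D132;
    print(html_ref(QUARTER_TONE_FLAT))    # &#x1D133;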
The wiki on flat signs shows this flat with a crossbar, as well as a reversed flat symbol, to represent the half-flat. And the wiki on sharp signs shows this sharp minus one vertical bar to represent the half-sharp. So there may be some use of these signs in microtonal notation, outside of an Arabic context, as well. See: https://en.wikipedia.org/wiki/Accidental_(music)#Microtonal_notation --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 15 17:51:35 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 15 May 2018 23:51:35 +0100 Subject: Choosing the Set of Renderable Strings In-Reply-To: References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> <20180514203115.5c093920@JRWUBU2> Message-ID: <20180515235135.7df264c2@JRWUBU2> On Tue, 15 May 2018 06:04:45 -0800 James Kass via Unicode wrote: > Display behaviour which is script-specific should be handled by the > rendering/shaping engine. Only that which is font-specific should be > handled by the font. That makes a lot of sense. Unfortunately, script-specific behaviour often needs to be fixed or is completely absent. It annoys me that my font has to redo the bits of basic Indic shaping that are left undone because the USE chops the aksharas up. > The font's OpenType tables will include pointers to presentation forms > which aren't directly encoded, the location and repertoire of which > would naturally differ from font to font. Likewise, the font's GPOS > tables will handle things such as mark positioning, because each > font's metrics are going to be different. > > Because the USE apparently accesses current on-line Unicode data, the > USE will re-order anything which needs to be moved around. In Thai, the sequence is converted to . Please tell me where in the on-line Unicode data it says that: 1) Tai Tham is reordered to : (a) When the base consonant is NA; and also (b) in a typical Northern Thai font, but not a Lao*, Tai Lue or Tai Khuen font. *Some claim that Lao Tham doesn't use tone marks, but some version at least does, or Gregory Kourilsky wouldn't have included them in his encoding of the Tham script. **The placement may be different to that of MAI KANG in /b?? wa?/ ?????? or ?????? - I don't know whether the first or the second tone mark is dropped. (Getting the tone and MAI KANG to interact after has formed the NAA ligature from seems impossible. I assume this is because such interaction is undesirable for Arabic.) 2) needs to be rearranged to (or equivalent). And how am I supposed to position MAI SAM to the right of the rightmost of the level 1 marks above? Is this a standard positioning as opposed to a stylistic decision? Incidentally, how does Unicode document the handling of a tone mark before U+0E33 THAI CHARACTER SARA AM? Richard. From unicode at unicode.org Tue May 15 19:19:58 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 16 May 2018 01:19:58 +0100 Subject: Complete Definition of Each Supported Script Message-ID: <20180516011958.11fb1276@JRWUBU2> I just found this assertion in https://en.wikipedia.org/wiki/Uniscribe: "Microsoft worked with the Unicode Technical Committee to make shaping requirements available in a machine readable format, so a complete definition of each supported script will be included in the Unicode standard and updating or adding new scripts will be significantly simplified." 
It was added on 14 August 2016. Apart from the holding of discussions and the consequences for Uniscribe/DirectWrite, is this true or is someone adding two and two together and making five? 1. In particular, is this anything more than reading too much into the General_Category, Indic_syllabic_Category and Indic_Positional_Category? They could only work if the regular expressions in the documentation of the Universal Script Engine were correct (we know they aren't), and there are many shaping requirements that font developers have to discover from other sources. If there is more to it: 2. Are there "shaping requirements available in machine readable format", and if so, how can one obtain them? 3. When will these "complete definition[s] of each supporting script [...] be included in the Unicode standard"? How will they be checked? (There is a very good chance that many of them would be wrong.) Richard. From unicode at unicode.org Tue May 15 23:32:24 2018 From: unicode at unicode.org (Garth Wallace via Unicode) Date: Tue, 15 May 2018 21:32:24 -0700 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: References: Message-ID: What happened to the previous proposal? As I recall, there was some good discussion after an email from you back in 2015 < http://www.unicode.org/mail-arch/unicode-ml/y2015-m03/0118.html> and Michael Everson offered assistance, but no formal proposal has been submitted to the Documents Register since then. These symbols are also used in Turkish notation and Western microtonal notation. They are far more common than MUSICAL SYMBOL QUARTER TONE SHARP and MUSICAL SYMBOL QUARTER TONE FLAT, which AFAICT only appear in the Unicode code charts and nowhere else. On Tue, May 15, 2018 at 10:47 AM, Johnny Farraj via Unicode < unicode at unicode.org> wrote: > > Dear Unicode list members, > > I wish to get feedback about a new symbol submission proposal. > > Currently the Miscellaneous Symbols table (2600-26FF) includes the > following characters: > > 266D ? MUSIC FLAT SIGN > 266F ? MUSIC SHARP SIGN > > while the Musical Symbols table (1D100 - 1D1FF) includes the following > characters: > > 1D12A ?? MUSICAL SYMBOL DOUBLE SHARP > 1D12B ?? MUSICAL SYMBOL DOUBLE FLAT > 1D12C ?? MUSICAL SYMBOL FLAT UP > 1D12D ?? MUSICAL SYMBOL FLAT DOWN > 1D130 ?? MUSICAL SYMBOL SHARP UP > 1D131 ?? MUSICAL SYMBOL SHARP DOWN > 1D132 ?? MUSICAL SYMBOL QUARTER TONE SHARP > 1D133 ?? MUSICAL SYMBOL QUARTER TONE FLAT > > None of these matches what's used in Arabic music notation. > > I am proposing the addition of 2 new characters to the Musical Symbols > table: > > - the half-flat sign (lowers a note by a quarter tone) > - the half-sharp sign (raises a note by a quarter tone) > > [image: Inline image] > [image: Inline image] > > > These are the correct symbols for Arabic music notation, and they express > intervals that are multiples of quarter tones. it would be really nice to > be able to include them directly in an HTML page or rich text document > using a native code rather than an image. > > I am the primary sponsor of this proposal. As far as my credentials, I am > the owner of http://maqamworld.com, the most widely used online resource > on Arabic music theory, in English. > > My co-sponsor is Sami Abu Shumays, author of http://maqamlessons.com, > another important online reference for Arabic music theory. 
> > Together, we are in the process of publishing a book on Arabic music > theory and performance with Oxford University Press, coming out late 2018. > > I can also enlist the support of many academics in the music theory field > who specialize in Arabic music. > > I welcome any feedback on this proposal. > > thanks > > Johnny Farraj > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: half-flat sign.png Type: image/png Size: 3617 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: half-sharp sign.png Type: image/png Size: 2754 bytes Desc: not available URL: From unicode at unicode.org Wed May 16 02:42:31 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Wed, 16 May 2018 09:42:31 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> Message-ID: <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> > On 16 May 2018, at 00:48, Ken Whistler via Unicode wrote: > > On 5/15/2018 2:46 PM, Markus Scherer via Unicode wrote: >> I am proposing the addition of 2 new characters to the Musical Symbols table: >> >> - the half-flat sign (lowers a note by a quarter tone) >> - the half-sharp sign (raises a note by a quarter tone) >> >> In an actual proposal, I would expect a discussion of whether you are proposing to encode established symbols, or whether you are proposing new symbols to be adopted by the community (in which case Unicode would probably wait & see if they get established). >> >> A proposal should also show evidence of usage and glyph variations. > > And should probably refer to the relationship between these signs and the existing: It would be best to encode the SMuFL symbols, which is rather comprehensive and include those: https://www.smufl.org http://www.smufl.org/version/latest/ > U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP > U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT > > which are also half-sharp or half-flat accidentals. > > The wiki on flat signs shows this flat with a crossbar, as well as a reversed flat symbol, to represent the half-flat. > > And the wiki on sharp signs shows this sharp minus one vertical bar to represent the half-sharp. > > So there may be some use of these signs in microtonal notation, outside of an Arabic context, as well. See: > > https://en.wikipedia.org/wiki/Accidental_(music)#Microtonal_notation These are otherwise originally the same, but has since drifted. So whether to unify them or having them separate might be best to see what SMuFL does, as they are experts on the issue. From unicode at unicode.org Wed May 16 07:37:06 2018 From: unicode at unicode.org (Johnny Farraj via Unicode) Date: Wed, 16 May 2018 08:37:06 -0400 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: References: Message-ID: Hi Garth, You are right, I sent a similar posting to the list 3 years ago. at that time I was hoping get help from some of the more experienced members on the list to write a proposal. this is a very specialized job and it could take me months to figure out the process and learn the language. But no one was able to help. so I'm trying again. 
My motivation is being able to type these symbols directly into a MS-Word document or HTML page, just like you would type a Western flat or sharp accidental symbol today. My motivation is NOT to make these symbols available in sheet music notation software; there are solutions for that today and it's a whole different problem domain. About the existing symbols U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT I don't know what musical tradition these belong to, as far as I know no one uses them in real life. I need to make the case for new symbols called Arabic Half Flat / Sharp. I don't see my proposal really as a duplication of existing symbols for the following reason: there is no universal way to notate such accidentals, and every musical tradition with concepts such as half-flat and half-sharp has its own standard. I am not an expert in any tradition other than Arabic. therefore all I'm trying to do is add the Arabic version of these symbols. the Arabic symbols I'm proposing are established, and have been the standard for a good 75 years. Any Arabic notation (except for a few remnants from the 1930s before this standard in use) uses the symbols I'm proposing to add. I do not foresee any disagreement over what the half-sharp/half-flat Arabic symbols look like, and I can include tons of evidence in my proposal. Can someone on the list volunteer to guide with writing a proposal? I'm willing to do all the work, I just don't know how to start. I need a template, and I will be happy to complete it with all the required information. thanks Johnny Farraj On Wed, May 16, 2018 at 12:32 AM, Garth Wallace wrote: > What happened to the previous proposal? As I recall, there was some good > discussion after an email from you back in 2015 < > http://www.unicode.org/mail-arch/unicode-ml/y2015-m03/0118.html> and > Michael Everson offered assistance, but no formal proposal has been > submitted to the Documents Register since then. > > These symbols are also used in Turkish notation and Western microtonal > notation. They are far more common than MUSICAL SYMBOL QUARTER TONE SHARP > and MUSICAL SYMBOL QUARTER TONE FLAT, which AFAICT only appear in the > Unicode code charts and nowhere else. > > On Tue, May 15, 2018 at 10:47 AM, Johnny Farraj via Unicode < > unicode at unicode.org> wrote: > >> >> Dear Unicode list members, >> >> I wish to get feedback about a new symbol submission proposal. >> >> Currently the Miscellaneous Symbols table (2600-26FF) includes the >> following characters: >> >> 266D ? MUSIC FLAT SIGN >> 266F ? MUSIC SHARP SIGN >> >> while the Musical Symbols table (1D100 - 1D1FF) includes the following >> characters: >> >> 1D12A ?? MUSICAL SYMBOL DOUBLE SHARP >> 1D12B ?? MUSICAL SYMBOL DOUBLE FLAT >> 1D12C ?? MUSICAL SYMBOL FLAT UP >> 1D12D ?? MUSICAL SYMBOL FLAT DOWN >> 1D130 ?? MUSICAL SYMBOL SHARP UP >> 1D131 ?? MUSICAL SYMBOL SHARP DOWN >> 1D132 ?? MUSICAL SYMBOL QUARTER TONE SHARP >> 1D133 ?? MUSICAL SYMBOL QUARTER TONE FLAT >> >> None of these matches what's used in Arabic music notation. >> >> I am proposing the addition of 2 new characters to the Musical Symbols >> table: >> >> - the half-flat sign (lowers a note by a quarter tone) >> - the half-sharp sign (raises a note by a quarter tone) >> >> [image: Inline image] >> [image: Inline image] >> >> >> These are the correct symbols for Arabic music notation, and they express >> intervals that are multiples of quarter tones. 
it would be really nice to >> be able to include them directly in an HTML page or rich text document >> using a native code rather than an image. >> >> I am the primary sponsor of this proposal. As far as my credentials, I am >> the owner of http://maqamworld.com, the most widely used online resource >> on Arabic music theory, in English. >> >> My co-sponsor is Sami Abu Shumays, author of http://maqamlessons.com, >> another important online reference for Arabic music theory. >> >> Together, we are in the process of publishing a book on Arabic music >> theory and performance with Oxford University Press, coming out late 2018. >> >> I can also enlist the support of many academics in the music theory field >> who specialize in Arabic music. >> >> I welcome any feedback on this proposal. >> >> thanks >> >> Johnny Farraj >> >> >> >> >> > -- Johnny -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: half-sharp sign.png Type: image/png Size: 2754 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: half-flat sign.png Type: image/png Size: 3617 bytes Desc: not available URL: From unicode at unicode.org Wed May 16 08:23:08 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 16 May 2018 05:23:08 -0800 Subject: Choosing the Set of Renderable Strings In-Reply-To: <20180515235135.7df264c2@JRWUBU2> References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> <20180514203115.5c093920@JRWUBU2> <20180515235135.7df264c2@JRWUBU2> Message-ID: In response to Richard Wordingham, Sorry I can't answer many of your questions. Hoping someone who can does. Note that although the proposal gave canonical combining class zero to both the tone marks and the vowel signs, the on-line Unicode data gives canonical combining class 230 to the tone marks. > **The placement may be different to that of MAI KANG > in /b?? wa?/ ?????? SIGN AA> or ?????? SIGN AA> - I don't know whether the first or the second > tone mark is dropped. FWIW, neither is dropped in the display here, although they don't display identically. The first string shows TONE-1 positioned to the right of MAI KANG, the second string superimposes them. (Windows 7 running LibreOffice in order to enable the USE from HarfBuzz.) > (Getting the tone and MAI KANG to interact after SIGN AA, MAI KANG> has formed the NAA ligature from > seems impossible. Substituting U+1A36 TAI THAM LETTER NA for BA in the above strings, ?????? ??????, and trying to get the ligature are in the attached *.PNG file. Here's the four strings for the PNG: \u1A36\u1A74\u1A75\u1A60\u1A45\u1A63 \u1A36\u1A74\u1A60\u1A45\u1A75\u1A63 \u1A36\u1A75\u1A63\u1A74 \u1A36\u1A63\u1A74\u1A75 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: TaiTham_20180516.PNG Type: image/png Size: 2363 bytes Desc: not available URL: From unicode at unicode.org Wed May 16 08:25:59 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Wed, 16 May 2018 15:25:59 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> Message-ID: <9ADA9726-54A3-4642-936F-FFB9A6DDC757@telia.com> > On 16 May 2018, at 09:42, Hans ?berg via Unicode wrote: > >> On 16 May 2018, at 00:48, Ken Whistler via Unicode wrote: >> >>> A proposal should also show evidence of usage and glyph variations. >> >> And should probably refer to the relationship between these signs and the existing: > > It would be best to encode the SMuFL symbols, which is rather comprehensive and include those: > https://www.smufl.org > http://www.smufl.org/version/latest/ > >> U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP >> U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT >> >> which are also half-sharp or half-flat accidentals. >> >> The wiki on flat signs shows this flat with a crossbar, as well as a reversed flat symbol, to represent the half-flat. >> >> And the wiki on sharp signs shows this sharp minus one vertical bar to represent the half-sharp. >> >> So there may be some use of these signs in microtonal notation, outside of an Arabic context, as well. See: >> >> https://en.wikipedia.org/wiki/Accidental_(music)#Microtonal_notation > > These are otherwise originally the same, but has since drifted. So whether to unify them or having them separate might be best to see what SMuFL does, as they are experts on the issue. Clarification: The Arabic accidentals, listed here as separate entities http://www.smufl.org/version/latest/range/arabicAccidentals/ appear in LilyPond as ordinary microtonal accidentals: http://lilypond.org/doc/v2.18/Documentation/notation/the-feta-font#accidental-glyphs So what I meant above is that originally, they were the same, i.e., when starting to use them in Arabic music, one took some Western microtonal accidentals. Now they mean microtones in the style of Arabic music, and the musical interpretation varies. From unicode at unicode.org Wed May 16 15:46:22 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 16 May 2018 13:46:22 -0700 Subject: L2/18-181 Message-ID: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> http://www.unicode.org/L2/L2018/18181-n4947-assamese.pdf This is a fascinating proposal to disunify the Assamese script from Bengali on the following bases: 1. The identity of Assamese as a script distinct from Bengali is in jeopardy. 2. Collation is different between the Assamese and Bengali languages, and code point order should reflect collation order. 3. Keyboard design is more difficult because consonants like ??? are encoded as conjunct forms instead of atomic characters. 4. The use of a single encoded script to write two languages forces users to use language identifiers to identify the language. 5. Transliteration of Assamese into a different script is problematic because letters have different phonological value in Assamese and Bengali. It will be interesting to see where this proposal goes. 
Given that all or most of these issues can be claimed for English, French, German, Spanish, and hundreds of other languages written in the Latin script, if the Assamese proposal is approved we can expect similar disunification of the Latin script into language-specific alphabets in the future. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed May 16 16:39:36 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 16 May 2018 22:39:36 +0100 Subject: Choosing the Set of Renderable Strings In-Reply-To: References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> <20180514203115.5c093920@JRWUBU2> <20180515235135.7df264c2@JRWUBU2> Message-ID: <20180516223936.32a843d1@JRWUBU2> On Wed, 16 May 2018 05:23:08 -0800 James Kass via Unicode wrote: > Note that although the proposal gave canonical combining class > zero to both the tone marks and the vowel signs, the on-line Unicode > data gives canonical combining class 230 to the tone marks. There were several changes from ccc=0 to non-zero that were sneaked in between the UTC agreeing to proceed with the proposal and Unicode 5.2 being published. That may have been a test of vigialnce; we failed. I have seen no benefit from the changes - U+A160 TAI THAM SIGN SAKOT is not a virama (it should not appear in valid text), and having the tone marks and the invisible stacker have distinct non-zero classes has caused lots of irritation. We should probably have risked Tai Tham being excluded from the BMP and gone for the Tibetan model; normalised would not then damage Tai tham text. > > **The placement may be different to that of MAI KANG > > in /b?? wa?/ ?????? > SIGN AA> or ?????? > SIGN AA> - I don't know whether the first or the second > > tone mark is dropped. > FWIW, neither is dropped in the display here, although they don't > display identically. The first string shows TONE-1 positioned to the > right of MAI KANG, the second string superimposes them. (Windows 7 > running LibreOffice in order to enable the USE from HarfBuzz.) The full uncontracted writing is . Both syllables have TONE-1, but I have not seen two identical tone marks from different phonetic syllables in the same stack. The person typing the contraction drops a tone mark, not the rendering system. > Substituting U+1A36 TAI THAM LETTER NA for BA in the above strings, > ?????? ??????, and trying to get the ligature are in the attached > *.PNG file. Here's the four strings for the PNG: > > \u1A36\u1A74\u1A75\u1A60\u1A45\u1A63 > \u1A36\u1A74\u1A60\u1A45\u1A75\u1A63 > \u1A36\u1A75\u1A63\u1A74 > \u1A36\u1A63\u1A74\u1A75 A lot of fonts have trouble ligating NA and AA when there is material between them. (Hint: Classify all non-spacing subscript consonants as marks, and spacing subscript consonants as bases, and set the ligating lookup to ignore marks.) Your example appears to be using the font called 'A Tai Tham KH New'. While the only way to type Pali _bho_ 'O' after other text in this font or 'A Tai Tham KH' is to enter the correct sequence , the former font cannot render Pali _mano_ 'mind' (also used in Northern Thai and probably also Tai Khuen) if one types the correct sequence . One has to type ! The *older* font 'A Tai Tham KH (at Version 2.0) does render the correct spelling properly. As an example of correct rendering, I include the Pali for 'O mind!', _bho mano_, encoded , as rendered by the Lamphun font. Richard. 
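The effect of those distinct non-zero classes can be checked directly. A minimal sketch, assuming Python's standard unicodedata module (which carries the published UCD property values): the invisible stacker and the tone mark receive different classes, so canonical normalization reorders a tone mark that was typed before SAKOT.

import unicodedata

SAKOT = "\u1A60"   # TAI THAM SIGN SAKOT, the invisible stacker
TONE1 = "\u1A75"   # TAI THAM SIGN TONE-1

# Distinct non-zero canonical combining classes, as discussed above.
print(unicodedata.combining(SAKOT))   # 9, the class normally given to viramas
print(unicodedata.combining(TONE1))   # 230, above-base marks

# Canonical ordering sorts adjacent non-zero classes in ascending order,
# so a tone mark entered before the stacker is moved behind it by NFC/NFD.
reordered = unicodedata.normalize("NFC", TONE1 + SAKOT)
print(["U+%04X" % ord(c) for c in reordered])   # ['U+1A60', 'U+1A75']

The same swap happens under NFD, leaving the stacker separated from the consonant that follows it, which is the kind of damage to Tai Tham text referred to above.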
-------------- next part -------------- A non-text attachment was scrubbed... Name: o_mind.png Type: image/png Size: 2049 bytes Desc: not available URL: From unicode at unicode.org Wed May 16 17:01:10 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 16 May 2018 23:01:10 +0100 Subject: L2/18-181 In-Reply-To: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> Message-ID: <20180516230110.31f9efa2@JRWUBU2> On Wed, 16 May 2018 13:46:22 -0700 Doug Ewell via Unicode wrote: > http://www.unicode.org/L2/L2018/18181-n4947-assamese.pdf > > This is a fascinating proposal to disunify the Assamese script from > Bengali on the following bases: > 3. Keyboard design is more difficult because consonants like ??? > are encoded as conjunct forms instead of atomic characters. Users of X do have a valid gripe here. An X keyboard mapping can only accept single codepoints; sequences require explicit support by the application. Advanced applications get round this by using an input method, but they can be unreliable, particularly over networks. (I ended up creating an X keyboad mapping as back-up, but when I use it I lose all my 'ligature' keys.) However, that seems to be an argument for deprecating Bengali, rather than for disunifying Bengali and Assamese. I think simple Windows keyboards have a limit of 4 16-bit code units; for an Indic SMP script, one couldn't map to a single key, as it would require 6 code units. It would be handy to have characters whose only use was to input text; adding characters that are subject to composition exclusions would not change whether text is in NFC, in NFD, or neither. Of course, if the scripts were disunified, would we have to ban Assamese domain names in the new 'Assamese script' as they would be ambiguous with Bengali names. Richard. From unicode at unicode.org Wed May 16 17:41:12 2018 From: unicode at unicode.org (Anshuman Pandey via Unicode) Date: Wed, 16 May 2018 17:41:12 -0500 Subject: Fwd: L2/18-181 In-Reply-To: <68766D80-8411-4FDF-8323-DC6C76116642@umich.edu> References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> <68766D80-8411-4FDF-8323-DC6C76116642@umich.edu> Message-ID: > On May 16, 2018, at 3:46 PM, Doug Ewell via Unicode wrote: > > http://www.unicode.org/L2/L2018/18181-n4947-assamese.pdf > > This is a fascinating proposal to disunify the Assamese script from > Bengali on the following bases: ?Fascinating? is a not a term I?d use for this proposal. If folks are interested in a valid proposal for disunification of Bengali, please look at the proposal for Tirhuta. > 1. The identity of Assamese as a script distinct from Bengali is in > jeopardy. This is not a technical matter. Moreover, its typical rhetoric used by various language communities in South Asia. Fairly standard fare for those familiar with such issues. The proposal needs to show how the two scripts differ, ie. conjuncts, CV ligatures, etc. The number forms are similar to those already encoded. Again, cf. Tirhuta. > 2. Collation is different between the Assamese and Bengali languages, > and code point order should reflect collation order. The same issue applies to dictionary order for Hindi, Marathi, which differ from the conventional Sanskrit order for Devanagari. Orthographies for various languages put conjuncts and other things at the end, which are not considered atomic letters. 
Nothing special in this regard for Assamese and Bengali. > 3. Keyboard design is more difficult because consonants like ??? > are encoded as conjunct forms instead of atomic characters. Ignorant question on my part: is it difficult to use character sequences as labels for keys? I see keys for both ??? and ??? on the iOS Hindi keyboard, and ??? is tucked away under ?. > 4. The use of a single encoded script to write two languages forces > users to use language identifiers to identify the language. Same applies to each of the 40+ varieties of Hindi, as well as Marathi, etc. Another ignorant question: how to identify the various languages that use Arabic and Cyrillic? > 5. Transliteration of Assamese into a different script is problematic > because letters have different phonological value in Assamese and > Bengali. Transliteration or transcription? In any case, this applies to other languages written using similar scripts: a Marathi speaker pronounces ? and ? differently than a Hindi speaker does. > It will be interesting to see where this proposal goes. Hopefully, it does not go too far. What it proposes is contrary to Unicode and redundant. > Given that all > or most of these issues can be claimed for English, French, German, > Spanish, and hundreds of other languages written in the Latin script, if > the Assamese proposal is approved we can expect similar disunification > of the Latin script into language-specific alphabets in the future. Fascinating. I mean, terrible. All my best, Anshuman From unicode at unicode.org Wed May 16 18:34:35 2018 From: unicode at unicode.org (Michael Everson via Unicode) Date: Thu, 17 May 2018 00:34:35 +0100 Subject: L2/18-181 In-Reply-To: <20180516230110.31f9efa2@JRWUBU2> References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> <20180516230110.31f9efa2@JRWUBU2> Message-ID: <8A5B21CA-5241-43A9-B537-5F6525198BD5@evertype.com> This is not a fault of the encoding. > On 16 May 2018, at 23:01, Richard Wordingham via Unicode wrote: > > I think simple Windows keyboards have a limit of 4 16-bit code units; > for an Indic SMP script, one couldn't map to a single key, as it > would require 6 code units. From unicode at unicode.org Wed May 16 18:38:24 2018 From: unicode at unicode.org (Michael Everson via Unicode) Date: Thu, 17 May 2018 00:38:24 +0100 Subject: L2/18-181 In-Reply-To: References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> <68766D80-8411-4FDF-8323-DC6C76116642@umich.edu> Message-ID: <49C16312-F60E-4495-B3E5-D74079FE5F9B@evertype.com> And Icelandic. And Irish. And so on. > On 16 May 2018, at 23:41, Anshuman Pandey via Unicode wrote: > >> 2. Collation is different between the Assamese and Bengali languages, >> and code point order should reflect collation order. > > The same issue applies to dictionary order for Hindi, Marathi, which > differ from the conventional Sanskrit order for Devanagari. From unicode at unicode.org Wed May 16 19:20:43 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 17 May 2018 01:20:43 +0100 Subject: L2/18-181 In-Reply-To: <8A5B21CA-5241-43A9-B537-5F6525198BD5@evertype.com> References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> <20180516230110.31f9efa2@JRWUBU2> <8A5B21CA-5241-43A9-B537-5F6525198BD5@evertype.com> Message-ID: <20180517012043.2d5a2f7d@JRWUBU2> On Thu, 17 May 2018 00:34:35 +0100 Michael Everson via Unicode wrote: > This is not a fault of the encoding. 
> > > On 16 May 2018, at 23:01, Richard Wordingham via Unicode > > wrote: > > > > I think simple Windows keyboards have a limit of 4 16-bit code > > units; for an Indic SMP script, one couldn't map to a single > > key, as it would require 6 code units. It is a consequence of the policy of avoiding precomposed characters. If there were a precomposed character for , the keyboard could emit that character - job done. One objection is that one would need a sequence of decompositions: = = Some people are vehemently opposed to unnatural characters like . Presumable the official view is that Windows Text Services have taken us beyond that point, and the likes of above are not needed. If X persists, perhaps named sequences should be assigned numbers so that X can make a generic allocation of keysym codes to named sequences. Richard. From unicode at unicode.org Wed May 16 19:24:11 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 17 May 2018 01:24:11 +0100 Subject: L2/18-181 In-Reply-To: References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> <68766D80-8411-4FDF-8323-DC6C76116642@umich.edu> Message-ID: <20180517012411.4dfcdaac@JRWUBU2> On Wed, 16 May 2018 17:41:12 -0500 Anshuman Pandey via Unicode wrote: > > 3. Keyboard design is more difficult because consonants like ??? > > are encoded as conjunct forms instead of atomic characters. > > Ignorant question on my part: is it difficult to use character > sequences as labels for keys? I see keys for both ??? and ??? on the > iOS Hindi keyboard, and ??? is tucked away under ?. It can be. It depends on the technology. Pure X seems to be the worst. At the basic level, one has a bewildering map of key plus active modifier key to a single Unicode character. (The space also include function keys.) An *application* can map keys to strings, but I know of no way of doing that to all of a user's applications, both those running and those that will run. Even the logic for dead keys has to be applied by the application, though I believe there are standard libraries that will handle that. The old method on Windows uses sets of data tables that may be termed keyboards. Populated sets are saved as DLLs, and there are limits on what they can contain. Windows' Microsoft Keyboard Layout Creator (MSKLC) is a popular tool for creating and packaging these DLLs. A key plus it modifiers can be mapped to: 1) A sequence of UTF-16 code units. The documented limit is, I believe 4, but there are reports of people being able to use 6. The four sequences listed above each constitute a sequence of 3 code units, so they can be readily accommodated. This technique may well not work for a script in the SMP, and I think one cannot use the MSKLC simply to create the DLLs storing long sequences. So here is an added layer of complexity, though not relevant to the Bengali script. 2) A key can be designated a 'dead key'. I think it has to have a fallback to a BMP character, or rather, a single UTF-16 code unit. On then pressing a key that maps to a single code unit, this is converted to another single code unit, which is the character that the combination types. The restriction is built into the data structure. There is a technique to chain dead keys, but that is not relevant to the difficulty or ease of typing ligatures. The next level up I am acquainted with is the level of input methods. 
Here, one types a sequence of characters on a 'simple' keyboard, and this sequence controls the derivation of characters being input to the application. Modifier keys may be available to influence this derivation. Now, some of these input methods may be unreliable, and there may be problems for users who can switch between simple keyboards, e.g. US and British, or US and Hindi. If this type of method works, then inputting sequences in response to a single keystroke is not a problem. Multiple key strokes can be a different matter, as the interface with applications may be ill-defined or broken. I have found this a problem with using the backslash key to cycle through candidate characters, and deleting SMP characters in LibreOffice has in the past resulted in the creation of lone surrogates. Now, writing these input methods can be easy. I have fairly simple input methods for inputting both true characters and sequences perceived as characters for Emacs, ibus (using KMfL) and fcitx (using M17n). However, the ibus method has been unreliable in the past, and I have fallen back to a simple X keyboard map. When I do that, I lose the ability to input sequences by a single keystroke. Richard. From unicode at unicode.org Wed May 16 19:24:09 2018 From: unicode at unicode.org (Michael Everson via Unicode) Date: Thu, 17 May 2018 01:24:09 +0100 Subject: L2/18-181 In-Reply-To: <20180517012043.2d5a2f7d@JRWUBU2> References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> <20180516230110.31f9efa2@JRWUBU2> <8A5B21CA-5241-43A9-B537-5F6525198BD5@evertype.com> <20180517012043.2d5a2f7d@JRWUBU2> Message-ID: It sounds to me like a fault in the keyboard software, which could be fixed by the people who own and maintain that software. > On 17 May 2018, at 01:20, Richard Wordingham via Unicode wrote: > > On Thu, 17 May 2018 00:34:35 +0100 > Michael Everson via Unicode wrote: > >> This is not a fault of the encoding. >> >>> On 16 May 2018, at 23:01, Richard Wordingham via Unicode >>> wrote: >>> >>> I think simple Windows keyboards have a limit of 4 16-bit code >>> units; for an Indic SMP script, one couldn't map to a single >>> key, as it would require 6 code units. > > It is a consequence of the policy of avoiding precomposed characters. > If there were a precomposed character for , the keyboard could emit > that character - job done. > > One objection is that one would need a sequence of decompositions: > > = > = > > Some people are vehemently opposed to unnatural characters like > . > > Presumable the official view is that Windows Text Services have taken us > beyond that point, and the likes of above are not needed. > > If X persists, perhaps named sequences should be assigned numbers so > that X can make a generic allocation of keysym codes to named > sequences. > > Richard. From unicode at unicode.org Wed May 16 19:49:21 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 17 May 2018 01:49:21 +0100 Subject: L2/18-181 In-Reply-To: References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> <20180516230110.31f9efa2@JRWUBU2> <8A5B21CA-5241-43A9-B537-5F6525198BD5@evertype.com> <20180517012043.2d5a2f7d@JRWUBU2> Message-ID: <20180517014921.7be07a44@JRWUBU2> On Thu, 17 May 2018 01:24:09 +0100 Michael Everson via Unicode wrote: > It sounds to me like a fault in the keyboard software, which could be > fixed by the people who own and maintain that software. We had this discussion a few years ago. 
See http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0036.html. Richard. From unicode at unicode.org Thu May 17 01:47:45 2018 From: unicode at unicode.org (Garth Wallace via Unicode) Date: Wed, 16 May 2018 23:47:45 -0700 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> Message-ID: On Wed, May 16, 2018 at 12:42 AM, Hans ?berg via Unicode < unicode at unicode.org> wrote: > > > On 16 May 2018, at 00:48, Ken Whistler via Unicode > wrote: > > > > On 5/15/2018 2:46 PM, Markus Scherer via Unicode wrote: > >> I am proposing the addition of 2 new characters to the Musical Symbols > table: > >> > >> - the half-flat sign (lowers a note by a quarter tone) > >> - the half-sharp sign (raises a note by a quarter tone) > >> > >> In an actual proposal, I would expect a discussion of whether you are > proposing to encode established symbols, or whether you are proposing new > symbols to be adopted by the community (in which case Unicode would > probably wait & see if they get established). > >> > >> A proposal should also show evidence of usage and glyph variations. > > > > And should probably refer to the relationship between these signs and > the existing: > > It would be best to encode the SMuFL symbols, which is rather > comprehensive and include those: > https://www.smufl what should be unified.org > http://www.smufl.org/version/latest/ If you want to write up a proposal for that entire set of characters, godspeed and good luck. > U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP > > U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT > > > > which are also half-sharp or half-flat accidentals. > > > > The wiki on flat signs shows this flat with a crossbar, as well as a > reversed flat symbol, to represent the half-flat. > > > > And the wiki on sharp signs shows this sharp minus one vertical bar to > represent the half-sharp. > > > > So there may be some use of these signs in microtonal notation, outside > of an Arabic context, as well. See: > > > > https://en.wikipedia.org/wiki/Accidental_(music)#Microtonal_notation > > These are otherwise originally the same, but has since drifted. So whether > to unify them or having them separate might be best to see what SMuFL does, > as they are experts on the issue. > SMuFL's standards on unification are not the same as Unicode's. For one thing, they re-encode Latin letters and Arabic digits multiple times for various different uses (such as numbers used in tuplets and those used in time signatures). There are duplicates all over the place, like how the half-sharp symbol is encoded at U+E282 as "accidentalQuarterToneSharpStein", at U+E422 as "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as "accidentalQuarterToneSharpArabic", and at U+E444 as "accidentalKomaSharp". They are graphically identical, and the first three even all mean the same thing, a quarter tone sharp! The last, though meaning something different in Turkish context (Turkish theory divides tones into 1/9-tones), is still clearly the same symbol. The "Arabic accidentals" section even re-encodes all of the non-microtonal accidentals (basic sharp, flat, natural, etc.) for no reason that I can determine. There are definitely many things in SMuFL where you could make a claim that they should be in Unicode proper. But not all, and the standard itself is a bit of a mess. 
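It may also be worth noting that the SMuFL code points quoted above are all Private Use Area assignments, so duplication there carries none of the costs it would in Unicode proper. A quick check, assuming Python's standard unicodedata module:

import unicodedata

# U+E282, U+E422, U+ED35 and U+E444 all fall in the BMP Private Use Area
# (U+E000..U+F8FF); their general category is Co, "Other, Private Use".
for cp in (0xE282, 0xE422, 0xED35, 0xE444):
    print("U+%04X" % cp, unicodedata.category(chr(cp)))   # each prints 'Co'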
-------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 17 02:40:54 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Thu, 17 May 2018 09:40:54 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> Message-ID: > On 17 May 2018, at 08:47, Garth Wallace via Unicode wrote: > >> On Wed, May 16, 2018 at 12:42 AM, Hans ?berg via Unicode wrote: >> >> It would be best to encode the SMuFL symbols, which is rather comprehensive and include those: >> https://www.smufl what should be unified.org >> http://www.smufl.org/version/latest/ >> ... >> >> These are otherwise originally the same, but has since drifted. So whether to unify them or having them separate might be best to see what SMuFL does, as they are experts on the issue. >> > SMuFL's standards on unification are not the same as Unicode's. For one thing, they re-encode Latin letters and Arabic digits multiple times for various different uses (such as numbers used in tuplets and those used in time signatures). The reason is probably because it is intended for use with music engraving, and they should then be rendered differently. > There are duplicates all over the place, like how the half-sharp symbol is encoded at U+E282 as "accidentalQuarterToneSharpStein", at U+E422 as "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as "accidentalQuarterToneSharpArabic", and at U+E444 as "accidentalKomaSharp". They are graphically identical, and the first three even all mean the same thing, a quarter tone sharp! But the tuning system is different, E24 and Pythagorean. Some Latin and Greek uppercase letters are exactly the same but have different encodings. > The last, though meaning something different in Turkish context (Turkish theory divides tones into 1/9-tones), is still clearly the same symbol. The "Arabic accidentals" section even re-encodes all of the non-microtonal accidentals (basic sharp, flat, natural, etc.) for no reason that I can determine. In Turkish AEU (Arel-Ezgi-Uzdilek) notation the sharp # is a microtonal symbol, not the ordinary sharp, so it should be different. In Arabic music, they are the same though, so they can be unified. > There are definitely many things in SMuFL where you could make a claim that they should be in Unicode proper. But not all, and the standard itself is a bit of a mess. You need to work through those little details to see what fits. Should it help with music engraving, or merely be used in plain text? Should symbols that that look alike but have different musical meaning be unified? From unicode at unicode.org Thu May 17 03:51:55 2018 From: unicode at unicode.org (Otto Stolz via Unicode) Date: Thu, 17 May 2018 10:51:55 +0200 Subject: L2/18-181 In-Reply-To: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> Message-ID: <6063e7e0-fa1e-ce61-2834-01ac36c4dadb@uni-konstanz.de> Am 2018-05-16 um 22:46 Uhr hat Doug Ewell geschrieben: > 2. Collation is different between the Assamese and Bengali languages, > and code point order should reflect collation order. ? > 4. The use of a single encoded script to write two languages forces > users to use language identifiers to identify the language. 
I wonder how English and French ever could be made to use a single script, let alone German (???), Icelandic (???), Swedish (???), Latvian (???), Chech (???) or ? you name it. Best wishes, Otto Stolz From unicode at unicode.org Thu May 17 01:49:55 2018 From: unicode at unicode.org (dinar qurbanov via Unicode) Date: Thu, 17 May 2018 09:49:55 +0300 Subject: how to make custom combining diacritical marks for arabic letters? Message-ID: how to make custom combining diacritical marks for arabic letters? should only font drivers and programs support it, or should also unicode support it, for example, have special area for them? as i know, private use area can be used to make combining diacritical marks for latin script without problems. but when i tried, several years ago, to make that for arabic script, with fontforge, i had to use right to left override mark, and manually insert beginning, middle, ending forms of arabic letters, and even then, my custom marks were not located very properly above letters. From unicode at unicode.org Thu May 17 05:04:25 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Thu, 17 May 2018 11:04:25 +0100 (BST) Subject: L2/18-181 In-Reply-To: <6063e7e0-fa1e-ce61-2834-01ac36c4dadb@uni-konstanz.de> References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> <6063e7e0-fa1e-ce61-2834-01ac36c4dadb@uni-konstanz.de> Message-ID: <22771476.14619.1526551465700.JavaMail.defaultUser@defaultHost> Otto Stolz wrote: > I wonder how English and French ever could be made to use a single script, let alone German (???), Icelandic (???), Swedish (???), Latvian (???), Chech (???) or ? you name it. Years ago I used to hand set metal type - letterpress printing was a family hobby. For a fount of type of a particular style and size and case, there was a typecase, subdivided into areas of various sizes and there was a more or less standard lay of the typecase so that, for example, a lowercase e was in a larger area than a lowercase q, because there were more pieces of type of a lowercase e than of a lowercase q, and e and q were in a known place within the typecase so that a lowercase e in any of the typecases was in the same place within the typecase. There were a number of extra small areas near the edge of the typecase which were unspecified and could be used for extra sorts as they were known. I had become interested in Esperanto and bought some sorts, some of each of twelve sorts, so as to augment a fount used for printing in English be able to print in Esperanto as well. These sorts were placed in some of the small areas near the edge of the typecase. Had I wanted to print in French I could have bought the accented sorts needed for French. Indeed the type catalogue from the typefounder had a list of which sorts were needed for each of various European languages. I learned most of that list. This has proved useful at times, such as in the early 1970s when two researchers were trying to translate a research paper from what they thought was Spanish into English and were having problems and I was able to point out that it was not Spanish but Portuguese as there was an a tilde in the text, even though I do not know Portuguese. There was a publication by the Monotype Corporation, published in 1963. Languages of the world that can be set on 'Monotype' machines / compiled by R.A. Downie. I have just looked it up in the British Library online catalogue. I bought a copy of the publication in the 1960s. 
I do not have it immediately to hand. Does anyone have a copy readily available and can say what is said about Assamese in that book please? Going back to look at what was done in relation to Assamese with metal type - not just the Monotype brand - could be an interesting insight. I notice that Otto Stolz mentions the following. > Icelandic (???), Yet the thorn character was part of English too. Yet it was lost from English. Was that because William Caxton got his founts of metal type from the European mainland and the necessary sort was not in the font? Is the same sort of thing happening now, over five hundred years later, in relation to Assamese? Maybe people should be helping to get this resolved to the satisfaction of all and helping rather than criticising. By the way, in relation to language identification, Unicode has a perfectly good plain text mechanism for language identification built into it, using the character U+E0001 LANGUAGE TAG and other tag characters. All of the tag characters were deprecated years ago, against opposition by at least two of the contributors to this present thread, then all except U+E0001 have been undeprecated more recently. There is a note in the code chart. >> This character is deprecated, and its use is strongly discouraged. It does not say by whom it is discouraged though nor why. www.unicode.org/charts/PDF/UE0000.pdf I opine that it time for a rethink on this and that U+E0001 should be undeprecated and its application be encouraged instead of all the stuff about using higher level protocols all the time - after all, higher level protocols are not encouraged instead when people want to send emoji. William Overington Thursday 17 May 2018 From unicode at unicode.org Thu May 17 09:47:25 2018 From: unicode at unicode.org (Garth Wallace via Unicode) Date: Thu, 17 May 2018 07:47:25 -0700 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> Message-ID: On Thu, May 17, 2018 at 12:41 AM Hans ?berg wrote: > > > On 17 May 2018, at 08:47, Garth Wallace via Unicode > wrote: > > > >> On Wed, May 16, 2018 at 12:42 AM, Hans ?berg via Unicode < > unicode at unicode.org> wrote: > >> > >> It would be best to encode the SMuFL symbols, which is rather > comprehensive and include those: > >> https://www.smufl what should be unified.org > >> http://www.smufl.org/version/latest/ > >> ... > >> > >> These are otherwise originally the same, but has since drifted. So > whether to unify them or having them separate might be best to see what > SMuFL does, as they are experts on the issue. > >> > > SMuFL's standards on unification are not the same as Unicode's. For one > thing, they re-encode Latin letters and Arabic digits multiple times for > various different uses (such as numbers used in tuplets and those used in > time signatures). > > The reason is probably because it is intended for use with music > engraving, and they should then be rendered differently. Exactly. But Unicode would consider these a matter for font switching in rich text. > There are duplicates all over the place, like how the half-sharp symbol > is encoded at U+E282 as "accidentalQuarterToneSharpStein", at U+E422 as > "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as > "accidentalQuarterToneSharpArabic", and at U+E444 as "accidentalKomaSharp". 
> They are graphically identical, and the first three even all mean the same > thing, a quarter tone sharp! > > But the tuning system is different, E24 and Pythagorean. Some Latin and > Greek uppercase letters are exactly the same but have different encodings. Tuning systems are not scripts. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 17 09:51:54 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 17 May 2018 06:51:54 -0800 Subject: L2/18-181 In-Reply-To: <22771476.14619.1526551465700.JavaMail.defaultUser@defaultHost> References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> <6063e7e0-fa1e-ce61-2834-01ac36c4dadb@uni-konstanz.de> <22771476.14619.1526551465700.JavaMail.defaultUser@defaultHost> Message-ID: William Overington offered a suggestion, ? Maybe people should be helping to get this resolved ? to the satisfaction of all and helping rather than ? criticising. That's a noble thought, but as long as Assamese continues to be written using the Eastern Nagari script, which is referred to as "BENGALI" in the Unicode naming tables, any disunification proposal will be a non-starter. Hence the criticism. We should strive to keep any criticism constructive rather than derisive. If I'm not mistaken, the character naming for this script was inherited from the ISCII standard, so it was the Indian government's convention. I believe most English speakers aware of the script call it Bengali. https://en.wikipedia.org/wiki/Eastern_Nagari_script ? U+E0001 LANGUAGE TAG ? ? ... ? ? There is a note in the code chart. ? ? >> This character is deprecated, and its use is strongly ? discouraged. ? ? It does not say by whom it is discouraged though nor why. The reason people shouldn't use it is because it is deprecated. It was originally deprecated because people shouldn't use it. Arguably, a plain-text computer character encoding standard which is language-neutral does not need a language tagging mechanism. By encoding scripts rather than languages, Unicode ensures that the data is legible in plain-text. If the recipient of an untagged plain-text file doesn't know the language well enough to recognize it, then a tag won't help. If the recipient wants to translate it anyway, various on-line translators are fairly sophisticated in language identification. If that fails, it's a mystery. Everybody loves a mystery. From unicode at unicode.org Thu May 17 10:08:19 2018 From: unicode at unicode.org (Martinho Fernandes via Unicode) Date: Thu, 17 May 2018 17:08:19 +0200 Subject: The Unicode Standard and ISO Message-ID: Hello, There are several mentions of synchronization with related standards in unicode.org, e.g. in https://www.unicode.org/versions/index.html, and https://www.unicode.org/faq/unicode_iso.html. However, all such mentions never mention anything other than ISO 10646. I was wondering which ISO standards other than ISO 10646 specify the same things as the Unicode Standard, and of those, which ones are actively kept in sync. This would be of importance for standardization of Unicode facilities in the C++ language (ISO 14882), as reference to ISO standards is generally preferred in ISO standards. -- Martinho -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 488 bytes Desc: OpenPGP digital signature URL: From unicode at unicode.org Thu May 17 10:48:40 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Thu, 17 May 2018 17:48:40 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> Message-ID: > On 17 May 2018, at 16:47, Garth Wallace via Unicode wrote: > > On Thu, May 17, 2018 at 12:41 AM Hans ?berg wrote: > > > On 17 May 2018, at 08:47, Garth Wallace via Unicode wrote: > > > >> On Wed, May 16, 2018 at 12:42 AM, Hans ?berg via Unicode wrote: > >> > >> It would be best to encode the SMuFL symbols, which is rather comprehensive and include those: > >> https://www.smufl what should be unified.org > >> http://www.smufl.org/version/latest/ > >> ... > >> > >> These are otherwise originally the same, but has since drifted. So whether to unify them or having them separate might be best to see what SMuFL does, as they are experts on the issue. > >> > > SMuFL's standards on unification are not the same as Unicode's. For one thing, they re-encode Latin letters and Arabic digits multiple times for various different uses (such as numbers used in tuplets and those used in time signatures). > > The reason is probably because it is intended for use with music engraving, and they should then be rendered differently. > > Exactly. But Unicode would consider these a matter for font switching in rich text. One original principle was ensure different encodings, so if the practise in music engraving is to keep them different, they might be encoded differently. > > There are duplicates all over the place, like how the half-sharp symbol is encoded at U+E282 as "accidentalQuarterToneSharpStein", at U+E422 as "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as "accidentalQuarterToneSharpArabic", and at U+E444 as "accidentalKomaSharp". They are graphically identical, and the first three even all mean the same thing, a quarter tone sharp! > > But the tuning system is different, E24 and Pythagorean. Some Latin and Greek uppercase letters are exactly the same but have different encodings. > > Tuning systems are not scripts. That seems obvious. As I pointed out above, the Arabic glyphs were originally taken from Western ones, but have a different musical meaning, also when played using E12, as some do. From unicode at unicode.org Thu May 17 11:43:28 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 17 May 2018 09:43:28 -0700 Subject: The Unicode Standard and ISO In-Reply-To: References: Message-ID: On 5/17/2018 8:08 AM, Martinho Fernandes via Unicode wrote: > Hello, > > There are several mentions of synchronization with related standards in > unicode.org, e.g. in https://www.unicode.org/versions/index.html, and > https://www.unicode.org/faq/unicode_iso.html. However, all such mentions > never mention anything other than ISO 10646. Because that is the standard for which there is an explicit understanding by all involved relating to synchronization. There have been occasionally some challenging differences in the process and procedures, but generally the synchronization is being maintained, something that's helped by the fact that so many people are active in both arenas. There are really no other standards where the same is true to the same extent. 
> > I was wondering which ISO standards other than ISO 10646 specify the > same things as the Unicode Standard, and of those, which ones are > actively kept in sync. This would be of importance for standardization > of Unicode facilities in the C++ language (ISO 14882), as reference to > ISO standards is generally preferred in ISO standards. > One of the areas the Unicode Standard differs from ISO 10646 is that its conception of a character's identity implicitly contains that character's properties - and those are standardized as well and alongside of just name and serial number. Many of these properties have associated with them algorithms, e.g. the bidi algorithm, that are an essential element of data interchange: if you don't know which order in the backing store is expected by the recipient to produce a certain display order, you cannot correctly prepare your data. There is one area where standardization in ISO relates to work in Unicode that I can think of, and that is sorting. However, sorting, beyond the underlying framework, ultimately relates to languages, and language-specific data is now housed in CLDR. Early attempts by ISO to standardize a similar framework for locale data failed, in part because the framework alone isn't the interesting challenge for a repository, instead it is the collection, vetting and management of the data. The reality is that the ISO model and its organizational structures are not well suited to the needs of many important area where some form of standardization is needed. That's why we have organization like IETF, W3C, Unicode etc.. Duplicating all or even part of their effort inside ISO really serves nobody's purpose. A./ From unicode at unicode.org Thu May 17 11:43:35 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 17 May 2018 08:43:35 -0800 Subject: how to make custom combining diacritical marks for arabic letters? In-Reply-To: References: Message-ID: This page describes the essentials of OpenType Arabic font development: https://docs.microsoft.com/en-us/typography/script-development/arabic From unicode at unicode.org Thu May 17 11:46:16 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 17 May 2018 09:46:16 -0700 Subject: Fwd: L2/18-181 In-Reply-To: References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> <68766D80-8411-4FDF-8323-DC6C76116642@umich.edu> Message-ID: <0fc8e094-ea1a-4aee-fc84-9ac63f7d7d0e@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu May 17 12:47:28 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 17 May 2018 10:47:28 -0700 Subject: L2/18-181 Message-ID: <20180517104728.665a7a7059d7ee80bb4d670165c8327d.5c16fa60d3.wbe@email03.godaddy.com> Everyone, I was not serious about this proposal being "fascinating" or in any way a model for what should happen with the Bengali script. Please imagine a tongue-in-cheek expression as you re-read my post. Maybe there is an emoji that depicts this. Maybe I've just been away from the list too long and forgot that plain text often does not communicate dry humor effectively. James Kass wrote: > We should strive to keep any criticism constructive rather than > derisive. Fair enough. My constructive suggestion would be to press vendors to support Assamese language tools, so that spell-checking, sorting, transcription, and other language-dependent operations will work properly, whether or not that was the goal of the proposal. 
A language with 15 million native speakers deserves no less. Regarding keyboards, ?? is a conjunct consisting of three code points (U+0995, U+09CD, U+09B7) and fits comfortably on a single key within a standard Windows layout. Indeed, the Assamese keyboards shipped with Windows since at least 7 already have this key (E06, level 2). Systems that limit a keystroke to one code point have problems that go well beyond Assamese. > If I'm not mistaken, the character naming for this script was > inherited from the ISCII standard, so it was the Indian government's > convention. BIS made a mistake here in failing to distinguish languages, or language-specific alphabets, from scripts, but it only cost them a single attribute byte assignment in ISCII. Disunifying Assamese from Bengali in Unicode would have a much greater impact. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Thu May 17 12:53:07 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 17 May 2018 10:53:07 -0700 Subject: L2/18-181 Message-ID: <20180517105307.665a7a7059d7ee80bb4d670165c8327d.1b0a3c7241.wbe@email03.godaddy.com> I wrote: > ?? is a conjunct consisting of three code points s/??/???/ -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Thu May 17 13:04:21 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 17 May 2018 19:04:21 +0100 Subject: Choosing the Set of Renderable Strings In-Reply-To: <20180516223936.32a843d1@JRWUBU2> References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> <20180514203115.5c093920@JRWUBU2> <20180515235135.7df264c2@JRWUBU2> <20180516223936.32a843d1@JRWUBU2> Message-ID: <20180517190421.30f4041f@JRWUBU2> On Wed, 16 May 2018 22:39:36 +0100 Richard Wordingham via Unicode wrote: > As an > example of correct rendering, I include the Pali for 'O mind!', _bho > mano_, encoded , > as rendered by the Lamphun font. Sorry, wrong sequence, wrong font. The correct sequence is , which is rendered by the Lamphun font as shown in the attached PNG file. Richard. -------------- next part -------------- A non-text attachment was scrubbed... Name: omind2.png Type: image/png Size: 3129 bytes Desc: not available URL: From unicode at unicode.org Thu May 17 13:12:40 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 17 May 2018 19:12:40 +0100 Subject: how to make custom combining diacritical marks for arabic letters? In-Reply-To: References: Message-ID: <20180517191240.698aac75@JRWUBU2> On Thu, 17 May 2018 08:43:35 -0800 James Kass via Unicode wrote: > This page describes the essentials of OpenType Arabic font > development: > > https://docs.microsoft.com/en-us/typography/script-development/arabic But isn't the problem that PUA diacritics won't reach most Arabic shapers? I think we're back to the vexed issue of defining Unicode properties for PUA characters to applications. Richard. From unicode at unicode.org Thu May 17 13:43:00 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 17 May 2018 11:43:00 -0700 Subject: L2/18-181 Message-ID: <20180517114300.665a7a7059d7ee80bb4d670165c8327d.299cb7b1c0.wbe@email03.godaddy.com> Otto Stolz wrote: > I wonder how English and French ever could > be made to use a single script, let alone > German (???), Icelandic (???), Swedish (???), > Latvian (???), Chech (???) or ? you name it. They do use the same script, Latin. They do not use the same alphabet. 
Each language has its own language-specific alphabet. It is the same for Bengali and Assamese, although the language-specific subsets are called abugidas instead of alphabets. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Thu May 17 14:12:55 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 17 May 2018 20:12:55 +0100 Subject: how to make custom combining diacritical marks for arabic letters? In-Reply-To: References: Message-ID: <20180517201255.5da51fa5@JRWUBU2> On Thu, 17 May 2018 09:49:55 +0300 dinar qurbanov via Unicode wrote: > how to make custom combining diacritical marks for arabic letters? > should only font drivers and programs support it, or should also > unicode support it, for example, have special area for them? > > as i know, private use area can be used to make combining diacritical > marks for latin script without problems. > > but when i tried, several years ago, to make that for arabic script, > with fontforge, i had to use right to left override mark, and manually > insert beginning, middle, ending forms of arabic letters, and even > then, my custom marks were not located very properly above letters. I'm offering suggestions, but I don't that they will work. The one thing that may help you is that these marks cannot appear in plain text. There are a number of things you need to do: 1) Persuade the renderer to treat your character as being a run in a single script. You might be able to do this by: a) Not having any lookups for the Arabic script. b) Using RLM to persuade the renderer that you have a right-to-left run. It is just possible that his may fail with OpenType fonts but work with Graphite or AAT fonts. If it works, you will then have to implement all the Arabic shaping yourself. 2) If OpenType fonts will treat the data as a single script run, you will need to ensure that there is an OpenType substitution feature that the renderer will support. Fortunately, many modern text applications will allow you to force the ccmp feature to be enabled - I have used such feature forcing with OpenType in LibreOffice and also in HTML, which renders accordingly in all the modern browsers I have tested - MS Edge on Windows 10, Firefox and, on iPhones, Safari. While the ccmp feature is enabled for the PUA in Firefox, it is disabled in MS Edge on Windows 10. 3) I believe AAT will soon be available for products using the HarfBuzz layout engine, so it is likely to become available on Firefox and LibreOffice. If AAT looks like a solution, you may need to research the attitudes of Chrome and OpenOffice, for I believe they have chosen not to support Graphite. A totally different solution would be to recompile your application so that it believes that your diacritics are in the Arabic script. Richard. From unicode at unicode.org Thu May 17 14:18:08 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 17 May 2018 20:18:08 +0100 Subject: L2/18-181 In-Reply-To: <20180517114300.665a7a7059d7ee80bb4d670165c8327d.299cb7b1c0.wbe@email03.godaddy.com> References: <20180517114300.665a7a7059d7ee80bb4d670165c8327d.299cb7b1c0.wbe@email03.godaddy.com> Message-ID: <20180517201808.5b0d08e3@JRWUBU2> On Thu, 17 May 2018 11:43:00 -0700 Doug Ewell via Unicode wrote: > It is the same for Bengali and Assamese, although the > language-specific subsets are called abugidas instead of alphabets. 
If we allow an abugida to be different to an alphasyllabary, then, in Thailand, Pali has a low brow *alphabet* which is a subset of the Thai *abugida*. Richard. From unicode at unicode.org Thu May 17 14:23:12 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 17 May 2018 20:23:12 +0100 Subject: L2/18-181 In-Reply-To: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> Message-ID: <20180517202312.32a61900@JRWUBU2> On Wed, 16 May 2018 13:46:22 -0700 Doug Ewell via Unicode wrote: > http://www.unicode.org/L2/L2018/18181-n4947-assamese.pdf > > This is a fascinating proposal to disunify the Assamese script from > Bengali on the following bases: According to the proposal, the encoding for the Assamese writing system *must* be in the BMP. As it needs over a 100 characters, the only way to satisfy the need to be in the BMP is for it to share Bengali characters. Hey, that solution is already implemented! Richard. From unicode at unicode.org Thu May 17 17:26:15 2018 From: unicode at unicode.org (Peter Constable via Unicode) Date: Thu, 17 May 2018 22:26:15 +0000 Subject: The Unicode Standard and ISO In-Reply-To: References: Message-ID: ISO character encoding standards are primarily focused on identifying a repertoire of character elements and their code point assignments in some encoding form. ISO developed other, legacy character-encoding standards in the past, but has not done so for over 20 years. All of those legacy standards can be mapped as a bijection to ISO 10646; in regard to character repertoires, they are all proper subsets of ISO 10646. Hence, from an ISO perspective, ISO 10646 is the only standard for which on-going synchronization with Unicode is needed or relevant. Peter -----Original Message----- From: Unicode On Behalf Of Martinho Fernandes via Unicode Sent: Thursday, May 17, 2018 8:08 AM To: unicode at unicode.org Subject: The Unicode Standard and ISO Hello, There are several mentions of synchronization with related standards in unicode.org, e.g. in https://www.unicode.org/versions/index.html, and https://www.unicode.org/faq/unicode_iso.html. However, all such mentions never mention anything other than ISO 10646. I was wondering which ISO standards other than ISO 10646 specify the same things as the Unicode Standard, and of those, which ones are actively kept in sync. This would be of importance for standardization of Unicode facilities in the C++ language (ISO 14882), as reference to ISO standards is generally preferred in ISO standards. -- Martinho From unicode at unicode.org Thu May 17 18:29:36 2018 From: unicode at unicode.org (Michael Everson via Unicode) Date: Fri, 18 May 2018 00:29:36 +0100 Subject: The Unicode Standard and ISO In-Reply-To: References: Message-ID: It would be great if mutual synchronization were considered to be of benefit. Some of us in SC2 are not happy that the Unicode Consortium has published characters which are still under Technical ballot. And this did not happen only once. > On 17 May 2018, at 23:26, Peter Constable via Unicode wrote: > > Hence, from an ISO perspective, ISO 10646 is the only standard for which on-going synchronization with Unicode is needed or relevant. 
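Peter's description of the legacy ISO character sets as mapping bijectively into ISO 10646, with repertoires that are proper subsets of it, is straightforward to illustrate. A minimal sketch, assuming Python's standard codecs, with ISO/IEC 8859-1 standing in for the legacy standards:

# Python's iso8859-1 codec maps all 256 byte values directly onto
# U+0000..U+00FF, and the mapping round-trips without loss.
data = bytes(range(256))
text = data.decode("iso8859-1")
assert text.encode("iso8859-1") == data
assert all(ord(ch) <= 0xFF for ch in text)
print(len(text), "code points, every one of them in ISO 10646 / Unicode")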
From unicode at unicode.org Thu May 17 21:59:05 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 17 May 2018 18:59:05 -0800 Subject: Fwd: L2/18-181 In-Reply-To: <0fc8e094-ea1a-4aee-fc84-9ac63f7d7d0e@ix.netcom.com> References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com> <68766D80-8411-4FDF-8323-DC6C76116642@umich.edu> <0fc8e094-ea1a-4aee-fc84-9ac63f7d7d0e@ix.netcom.com> Message-ID: On Thu, May 17, 2018 at 8:46 AM, Asmus Freytag via Unicode wrote: > On 5/16/2018 3:41 PM, Anshuman Pandey via Unicode wrote: > > If folks are interested in a valid proposal for disunification of > Bengali, please look at the proposal for Tirhuta. > > Location? https://www.unicode.org/L2/L2011/11175r-tirhuta.pdf I think that's the one, and Tirhuta is now in Unicode. From unicode at unicode.org Fri May 18 00:50:38 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 17 May 2018 21:50:38 -0800 Subject: Choosing the Set of Renderable Strings In-Reply-To: <20180517190421.30f4041f@JRWUBU2> References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> <20180514203115.5c093920@JRWUBU2> <20180515235135.7df264c2@JRWUBU2> <20180516223936.32a843d1@JRWUBU2> <20180517190421.30f4041f@JRWUBU2> Message-ID: Richard Wordingham wrote, ? Your example appears to be using the font called 'A Tai Tham KH New'. Exactly. The black boxes in the display were becoming tiresome. The font package is available from this Tai Tham web page: http://www.kengtung.org/download-font/ (I'd downloaded a copy of "lamphun.otf", but the installer failed, so I had to go a-hunting.) Is it correct to say that the average daily Tai Tham use is already being more-or-less served by the current state of the fonts and the USE? And that many of the problems you are reporting with respect to things such as mark-to-mark positioning are happening with more exotic uses of the script, such as the input and display of Pali texts using the Tai Tham script? ? And how am I supposed to position MAI SAM to the right of the ? rightmost of the level 1 marks above? Beats me, it's not happening here. If the GPOS look-up is for (e.g.) TONE-1 plus MAI SAM, and the string is being re-ordered by the system to MAI SAM plus TONE-1 before being submitted to the font, then *that* look-up won't happen. In which case, change the look-up to accomodate the re-ordered string. I suppose you've already tried that. ? The correct sequence is , which is rendered by the Lamphun font as shown in the ? attached PNG file. ??????? To confirm, the NAA ligature isn't happening with the 'A Tai Tham KH New' font. Changing the entry order to: ??????? ... forms the NAA ligature and the vowel re-ordering matches the Lamphun graphic you sent. But that kludge probably breaks the preferred encoding model/order. From unicode at unicode.org Fri May 18 02:38:27 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 17 May 2018 23:38:27 -0800 Subject: Choosing the Set of Renderable Strings In-Reply-To: References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> <20180514203115.5c093920@JRWUBU2> <20180515235135.7df264c2@JRWUBU2> <20180516223936.32a843d1@JRWUBU2> <20180517190421.30f4041f@JRWUBU2> Message-ID: I wrote, > Changing the entry order to: > ??????? > > ... forms the NAA ligature and the vowel re-ordering matches the > Lamphun graphic you sent. 
But that kludge probably breaks the > preferred encoding model/order. On the other hand, do the script users normally input the NAA ligature sequence first and then add any additional signs or marks? If the users consider NAA to be a distinct "letter", then that might explain why a font developed by a user accomodates the ligation for the string "NA" + "AA" only when nothing else appears between them. If, for example, there's a popular input method or keyboard driver which puts "NAA" on its own key, then the users will be producing data which is "NA" plus "AA" plus anything else. From unicode at unicode.org Fri May 18 02:57:18 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 18 May 2018 08:57:18 +0100 Subject: Choosing the Set of Renderable Strings In-Reply-To: References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> <20180514203115.5c093920@JRWUBU2> <20180515235135.7df264c2@JRWUBU2> <20180516223936.32a843d1@JRWUBU2> <20180517190421.30f4041f@JRWUBU2> Message-ID: <20180518085718.35402f71@JRWUBU2> On Thu, 17 May 2018 21:50:38 -0800 James Kass via Unicode wrote: > Richard Wordingham wrote, > > ? Your example appears to be using the font called 'A Tai Tham KH > New'. > > Exactly. The black boxes in the display were becoming tiresome. The > font package is available from this Tai Tham web page: > http://www.kengtung.org/download-font/ > > (I'd downloaded a copy of "lamphun.otf", but the installer failed, so > I had to go a-hunting.) That threatens a long struggle. The WOFF files work on MS Edge on Windows 10. Lamphun (and Da Lekh) depends on the rendering engine for Indic reordering; more precisely, it relies on dotted circles to know when reordering has failed. I don't think it can work possibly via Uniscribe and DirectWrite. > Is it correct to say that the average daily Tai Tham use is already > being more-or-less served by the current state of the fonts and the > USE? Many fonts depend on bypassing the USE. I also have a strong suspicion that they depend on HarfBuzz, though I'll have to recheck what is happening on iPhones. I'm only set up there to check what happens with Safari. > And that many of the problems you are reporting with respect to > things such as mark-to-mark positioning are happening with more exotic > uses of the script, such as the input and display of Pali texts using > the Tai Tham script? Since changes to Indic Syllabic category for Unicode 10 unbanned talk about nirvana (?????? , or with TALL AA instead; the vernaculars usually inserts SAKOT before the second NA) and Tai Khuen (and Tau Lue?) monks' names -in -dhammo (-?????), the USE should have supported uncontracted Pali. Pali is simple, though inter-Indic is complicated by subscript forms not encoded with SAKOT. (There may be a similar complication with the Myanmar script.) The complications primarily come with writing the vernacular. > ? And how am I supposed to position MAI SAM to the right of the > ? rightmost of the level 1 marks above? > Beats me, it's not happening here. If the GPOS look-up is for (e.g.) > TONE-1 plus MAI SAM, and the string is being re-ordered by the system > to MAI SAM plus TONE-1 before being submitted to the font, then *that* > look-up won't happen. In which case, change the look-up to accomodate > the re-ordered string. I suppose you've already tried that. What makes you think the USE tries to address such matters? 
If the developers had made the time to find out about such details (I think their money tree must have died), we wouldn't have a problem with CVC askharas. Also, the USE prohibits spelling where this rearrangement is desirable. Hariphunchai and therefore Lamphun addresses the positioning by having a separate position for MAI SAM, but that doesn't work well when there is a top vowel in the syllable. Now, the use of MAI SAM to indicate elision, as opposed to duplication at the word or syllable level, is somewhat 'exotic'; many writers don't do it. It's the use to indicate elision, written in accordance with the accepted proposal, that the USE prohibits. > ? The correct sequence is ? SIGN AA>, which is rendered by the Lamphun font as shown in the > ? attached PNG file. > > ??????? > To confirm, the NAA ligature isn't happening with the 'A Tai Tham KH > New' font. > > Changing the entry order to: > ??????? > > ... forms the NAA ligature and the vowel re-ordering matches the > Lamphun graphic you sent. But that kludge probably breaks the > preferred encoding model/order. Exactly. Richard. From unicode at unicode.org Fri May 18 17:06:18 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 18 May 2018 23:06:18 +0100 Subject: Choosing the Set of Renderable Strings In-Reply-To: References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> <20180514203115.5c093920@JRWUBU2> <20180515235135.7df264c2@JRWUBU2> <20180516223936.32a843d1@JRWUBU2> <20180517190421.30f4041f@JRWUBU2> Message-ID: <20180518230618.4ba5a033@JRWUBU2> On Thu, 17 May 2018 23:38:27 -0800 James Kass via Unicode wrote: > I wrote, > > > Changing the entry order to: > > ??????? > > > > ... forms the NAA ligature and the vowel re-ordering matches the > > Lamphun graphic you sent. But that kludge probably breaks the > > preferred encoding model/order. > > On the other hand, do the script users normally input the NAA ligature > sequence first and then add any additional signs or marks? If the > users consider NAA to be a distinct "letter", then that might explain > why a font developed by a user accomodates the ligation for the string > "NA" + "AA" only when nothing else appears between them. If, for > example, there's a popular input method or keyboard driver which puts > "NAA" on its own key, then the users will be producing data which is > "NA" plus "AA" plus anything else. There was a keyboard map in the zip file that you may have got the font from, http://www.kengtung.org/font-download/Tai-Tham-Unicode-for-PC.zip . It has three key symbols per key - plan, shift and capslock. All the combinations correspond to a single character. There's also a zip file for a non-Unicode font, http://www.kengtung.org/font-download/Tai-Tham-Non-Unicode-for-PC.zip and that has a corresponding keyboard. Now, while I haven't looked at the font, it looks like a direct key to glyph mapping, and as I would have expected from the pre-Unicode Wat Inn hack encoding, the English key stroke for 'o' (the key stroke for THAI CHARACTER NO NU) yields NA and the key stroke for 'O' yields the NAA ligature. I may be wrong about the relationship - the top vowel + tone ligatures seem to be missing from the keyboard. So, the evidence is ambiguous. The dictionaries I have seen do not treat NAA as an indivisible character - NAA plus subscript is treated differently depending on whether the subscript phonetically precedes or follows the subscript consonant. 
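Whether a layout engine or normalizer is ever entitled to fix up the order of these marks comes down to what the Unicode Character Database says about them, so a quick way to ground the discussion is to ask the UCD directly. A minimal Python sketch, not part of the original exchange; the Tai Tham code points below are assumed illustrative examples:

import unicodedata

# Assumed examples: U+1A60 SAKOT, U+1A63 VOWEL SIGN AA, U+1A6B VOWEL SIGN O, U+1A75 TONE-1.
for cp in (0x1A60, 0x1A63, 0x1A6B, 0x1A75):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch, '<unnamed>')}: "
          f"category={unicodedata.category(ch)}, ccc={unicodedata.combining(ch)}")

# For marks that report ccc=0, normalization never reorders them, so two keying
# orders that differ only in such marks stay distinct, non-equivalent strings:
a = "\u1A63\u1A6B"
b = "\u1A6B\u1A63"
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # False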
However, the rule that homorganic subscript precedes and others follow the vowel works pretty well. Now, the chanting of Pali declensions, if related to writing, should bring home via the participles in -nt- that there is a close relationship between and . It would be interesting to see how often ligation fails in participles. However, I think there is a different explanation for the sequence. There are suggestions around that aksharas should be encoded with left matras in second place. This makes it easier for fonts. I think we're seeing an encoding based on ease of font design. Now, one doesn't need this. If feature ss02 is enabled, the fonts of my Da Lekh family will convert a transliteration of Tai Tham letters, numbers and marks to ASCII back to the original Tai Tham text. All I need is a feature activation, which ASCII is normally has the privilege of receiving. I believe I could do it all by ccmp, but this feature is a fall back for when the renderer does not support Tai Tham. At present, Tai Tham seems to be in grave danger of breaking up into a number of font encodings - one chooses the rendering system, and that determines the allowed sequences, even for fairly simple words. The Xishuangbanna News appears to be using a visual order encoding. I suspect this works because syllables are separated by spaces, so they don't have to worry about Indic rearrangement being applied despite the lack of lookups for OTL script "lana". Richard. From unicode at unicode.org Fri May 18 22:00:20 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 19 May 2018 04:00:20 +0100 Subject: Choosing the Set of Renderable Strings In-Reply-To: References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> <20180514203115.5c093920@JRWUBU2> Message-ID: <20180519040020.5f223dd8@JRWUBU2> On Tue, 15 May 2018 04:19:42 -0800 James Kass via Unicode wrote: > On Mon, May 14, 2018 at 11:31 AM, Richard Wordingham via Unicode > wrote: > > I've seen an implementation of the USE render > > canonically equivalent strings differently. ... > Because the USE failed or because the font provided look-ups for each > of those strings to different glyphs? Unless I haven't picked up a recent change, neither Microsoft (by evidence of MS Edge) nor Apple (by evidence of Safari in iOS 10.3.2) normalises Tai Tham text. gets just one dotted circle, while Apple and Microsoft award a dotted circle to each mark in the canonically equivalent . Not many fonts handle two dotted circles - subscript formation has to work in the context . There's also the formal problem that is actually a legitimate sequence in the backing store. The defence to a charge of violating the character identity of DOTTED CIRCLE would be to say that such sequences are not supported - a renderer is not required to support all strings! Incidentally, I've fixed the Lamphun font; it will now install in Windows 10. TTX found ways to reduce its size by 10%. While it should work for most text, there are a few sequences that aren't handled properly. These are issues that pertain to the font domain, not the domain of the rendering engine. Richard. From unicode at unicode.org Sat May 19 05:22:45 2018 From: unicode at unicode.org (dinar qurbanov via Unicode) Date: Sat, 19 May 2018 13:22:45 +0300 Subject: how to make custom combining diacritical marks for arabic letters? 
In-Reply-To: <20180517201255.5da51fa5@JRWUBU2> References: <20180517201255.5da51fa5@JRWUBU2> Message-ID: this is a test i made that time: http://tmf.org.ru/arabic.html . look at second line. my custom mark is located too left on the most left "B", and is located too right on the middle (that is of middle form of B) and on the most righ "B" (that is of starter form of B). it should be located right above the below dot. - this was the problem that i could not solve. also there are problems that i could solve by using 1) rtl override mark; 2) and using start, middle, end, separate B characters instead of using simple arabic B, that would be easier. (you can see in the example that that characters are used). (using different forms of letter can also be achieved by using php or javascript, etc). 2018-05-17 22:12 GMT+03:00 Richard Wordingham via Unicode : > On Thu, 17 May 2018 09:49:55 +0300 > dinar qurbanov via Unicode wrote: > >> how to make custom combining diacritical marks for arabic letters? >> should only font drivers and programs support it, or should also >> unicode support it, for example, have special area for them? >> >> as i know, private use area can be used to make combining diacritical >> marks for latin script without problems. >> >> but when i tried, several years ago, to make that for arabic script, >> with fontforge, i had to use right to left override mark, and manually >> insert beginning, middle, ending forms of arabic letters, and even >> then, my custom marks were not located very properly above letters. > > I'm offering suggestions, but I don't that they will work. > > The one thing that may help you is that these marks cannot appear in > plain text. There are a number of things you need to do: > > 1) Persuade the renderer to treat your character as being a run in a > single script. You might be able to do this by: > > a) Not having any lookups for the Arabic script. > > b) Using RLM to persuade the renderer that you have a right-to-left run. > > It is just possible that his may fail with OpenType fonts but work > with Graphite or AAT fonts. If it works, you will then have to > implement all the Arabic shaping yourself. > > 2) If OpenType fonts will treat the data as a single script run, you > will need to ensure that there is an OpenType substitution feature that > the renderer will support. Fortunately, many modern text applications > will allow you to force the ccmp feature to be enabled - I have used > such feature forcing with OpenType in LibreOffice and also in HTML, > which renders accordingly in all the modern browsers I have tested - MS > Edge on Windows 10, Firefox and, on iPhones, Safari. While the ccmp > feature is enabled for the PUA in Firefox, it is disabled in MS Edge on > Windows 10. > > 3) I believe AAT will soon be available for products using the HarfBuzz > layout engine, so it is likely to become available on Firefox and > LibreOffice. If AAT looks like a solution, you may need to research the > attitudes of Chrome and OpenOffice, for I believe they have chosen not > to support Graphite. > > A totally different solution would be to recompile your application so > that it believes that your diacritics are in the Arabic script. > > Richard. 
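For anyone wanting to reproduce this kind of experiment, here is a minimal Python sketch that only builds a test string along the lines described above: an arbitrary Private Use code point (U+E000, an assumption) stands in for the custom diacritic, attached to the explicit Arabic presentation forms of BEH, with RLM keeping the run right-to-left. It constructs the text and nothing more; whether a given font and renderer then position the mark sensibly is exactly the open question in this thread.

RLM = "\u200F"           # RIGHT-TO-LEFT MARK
CUSTOM_MARK = "\uE000"   # assumed PUA code point standing in for the custom mark

# Arabic Presentation Forms-B code points for BEH: isolated, final, initial, medial.
BEH_FORMS = ("\uFE8F", "\uFE90", "\uFE91", "\uFE92")

# One test line: each presentation form of BEH followed by the custom mark.
test_line = RLM + "".join(form + CUSTOM_MARK for form in BEH_FORMS)
print(" ".join(f"U+{ord(c):04X}" for c in test_line))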
From unicode at unicode.org Sat May 19 22:33:20 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 19 May 2018 19:33:20 -0800 Subject: Choosing the Set of Renderable Strings In-Reply-To: <20180519040020.5f223dd8@JRWUBU2> References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost> <20180514203115.5c093920@JRWUBU2> <20180519040020.5f223dd8@JRWUBU2> Message-ID: Richard Wordingham wrote, > Incidentally, I've fixed the Lamphun font; it will now install in Windows 10. Confirming successful installation on Windows 7. From unicode at unicode.org Tue May 22 05:51:33 2018 From: unicode at unicode.org (Martinho Fernandes via Unicode) Date: Tue, 22 May 2018 12:51:33 +0200 Subject: Extended grapheme cluster stability Message-ID: Hello, None of the *_Break properties are stable, as far as I can see in https://www.unicode.org/policies/stability_policy.html. If I understand correctly, this means that, at least in theory, it is possible that in Unicode version X a sequence of characters AB forms an extended grapheme cluster, i.e. A × B in the notation used in the algorithm description and in the test data, but then in Unicode version X+1, that changes to A ÷ B. Am I reading this correctly or is this not possible? Or is it possible in theory but not in practice? Or maybe it has happened before? -- Martinho -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 488 bytes Desc: OpenPGP digital signature URL: From unicode at unicode.org Tue May 22 07:43:23 2018 From: unicode at unicode.org (Martinho Fernandes via Unicode) Date: Tue, 22 May 2018 14:43:23 +0200 Subject: Extended grapheme cluster stability In-Reply-To: References: Message-ID: <8ff7f43b-f7dd-35ec-9da4-c7d770e18414@rmf.io> On 22.05.18 12:51, Martinho Fernandes via Unicode wrote: > Hello, > > None of the *_Break properties are stable, as far as I can see in > https://www.unicode.org/policies/stability_policy.html. If I understand > correctly, this means that, at least in theory, it is possible that in > Unicode version X a sequence of characters AB forms an extended grapheme > cluster, i.e. A × B in the notation used in the algorithm description > and in the test data, but then in Unicode version X+1, that changes to A > ÷ B. > > Am I reading this correctly or is this not possible? Or is it possible > in theory but not in practice? Or maybe it has happened before? > Hmm, to answer my own question, yes, this has happened before. In Unicode 8 there were no breaks between regional indicators. In Unicode 9 now there are no breaks "between regional indicator (RI) symbols if there is an odd number of RI characters before the break point". It has also happened in the direction break=>no break, when emoji ZWJ sequences were introduced. -- Martinho -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 488 bytes Desc: OpenPGP digital signature URL: From unicode at unicode.org Tue May 22 13:27:17 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 22 May 2018 19:27:17 +0100 Subject: Extended grapheme cluster stability In-Reply-To: <8ff7f43b-f7dd-35ec-9da4-c7d770e18414@rmf.io> References: <8ff7f43b-f7dd-35ec-9da4-c7d770e18414@rmf.io> Message-ID: <20180522192717.52a29289@JRWUBU2> On Tue, 22 May 2018 14:43:23 +0200 Martinho Fernandes via Unicode wrote: > On 22.05.18 12:51, Martinho Fernandes via Unicode wrote: > > > Hello, > > > > None of the *_Break properties are stable, as far as I can see in > > https://www.unicode.org/policies/stability_policy.html. If I > > understand correctly, this means that, at least in theory, it is > > possible that in Unicode version X a sequence of characters AB > > forms an extended grapheme cluster, i.e. A ? B in the notation used > > in the algorithm description and in the test data, but then in > > Unicode version X+1, that changes to A ? B. > > > > Am I reading this correctly or is this not possible? Or is it > > possible in theory but not in practice? Or maybe it has happened > > before? > Hmm, to answer my own question, yes, this has happened before. In > Unicode 8 there were no breaks between regional indicators. In > Unicode 9 now there are no breaks "between regional indicator (RI) > symbols if there is an odd number of RI characters before the break > point". I has also happened in the direction break=>no break, with > when emoji ZWJ sequences were introduced. These are more refinements of the algorithm than fundamental changes. However, many of the breaks are inherently uncertain and may therefore be tailored. English has uncertainties as to word boundaries, but the author's decision is represented in writing, e.g. 'beam width' v. 'beamwidth'. In writing systems without visible boundaries between words, such as Thai, such vacillation could occur between software versions rather than between version of Unicode. Line break opportunities can in practice vacillate in such writing systems, e.g. between breaks at syllable boundaries and breaks at word boundaries. Formal extended grapheme cluster boundaries have varied in normal, well established text. In Thai, left matras and consonants were briefly part of the same grapheme cluster. When that formal property was implemented in editors, there were howls of pain from Thailand, and the change was promptly reversed. I do not believe one rules suits all Indic consonant clusters. While splitting X virama | Y makes sense for Devanagari with its half-forms, X | coeng Y makes no sense for scripts where it is the second consonant that changes shape. It makes even less sense when some combinations of 'coeng Y' are encoded separately, as in mainland SE Asia. These combinations are categorised as marks. In Burma, the syllable boundary comes after U+1A58 TAI THAM SIGN MAI KANG LAI. In Laos, it comes before it. We came very close to extended grapheme clusters being extended to whole aksharas in Unicode 11.0. My view is that Unicode has attempted to conflate several concepts in grapheme cluster, and it just doesn't work. Richard. 
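A minimal Python sketch of the stability point above, assuming the third-party regex module (which implements \X for extended grapheme clusters against whichever version of the UCD it was built with): the same string can come back as a different number of clusters as the rules evolve, the regional-indicator change between Unicode 8 and 9 being the example already given.

import regex  # third-party module, assumed installed (pip install regex)

samples = [
    "e\u0301galite\u0301",                       # base letters plus combining acute accents
    "\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA",  # four regional indicators: FR then DE
    "\U0001F469\u200D\U0001F4BB",                # an emoji ZWJ sequence
]

for s in samples:
    clusters = regex.findall(r"\X", s)  # split into extended grapheme clusters
    print(len(clusters), ["+".join(f"U+{ord(c):04X}" for c in cluster) for cluster in clusters])

Pinning the segmenter (here, the version of the regex module) is therefore part of pinning the segmentation, which is the trade-off being discussed in this thread.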
From unicode at unicode.org Tue May 22 16:48:56 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 22 May 2018 14:48:56 -0700 Subject: Extended grapheme cluster stability In-Reply-To: <8ff7f43b-f7dd-35ec-9da4-c7d770e18414@rmf.io> References: <8ff7f43b-f7dd-35ec-9da4-c7d770e18414@rmf.io> Message-ID: One thing to bear in mind about breaks: Unicode is plain-text and not "final rendered text". Many types of breaks depend on things like actual font selection, column width and other factors determined by styling. They are therefore not necessarily stable from a plain text perspective (the same goes for things not specified by Unicode, like hyphenation, because hyphenation, for example, depends on the actual language associated with a text, something not part of the plain text back-bone). The moral is that if you need a frozen representation of text that does not behave differently if accessed, iterated, viewed etc. at different times, you need to have some kind of rich-text format that can represent all segmentation choices. If, on the other hand, you are doing a live interaction with the text, then Unicode segmentation gives you the "best available" algorithm - which may change over time as new information becomes available about what constitutes best practice. For many writing systems, the understanding of best practice is still quite limited at this point - in the sense that even if it is known, it is not widely available and therefore there has not yet been a chance to validate and standardize it. (Setting aside areas of actual innovation, like emoji). For these reasons, it would be outright detrimental if any of these algorithms are "frozen" -- however, the hope is that updates are handled with some sensitivity to avoid unnecessary disruption of settled practice. A./ On 5/22/2018 5:43 AM, Martinho Fernandes via Unicode wrote: > On 22.05.18 12:51, Martinho Fernandes via Unicode wrote: > >> Hello, >> >> None of the *_Break properties are stable, as far as I can see in >> https://www.unicode.org/policies/stability_policy.html. If I understand >> correctly, this means that, at least in theory, it is possible that in >> Unicode version X a sequence of characters AB forms an extended grapheme >> cluster, i.e. A × B in the notation used in the algorithm description >> and in the test data, but then in Unicode version X+1, that changes to A >> ÷ B. >> >> Am I reading this correctly or is this not possible? Or is it possible >> in theory but not in practice? Or maybe it has happened before? >> > Hmm, to answer my own question, yes, this has happened before. In > Unicode 8 there were no breaks between regional indicators. In Unicode 9 > now there are no breaks "between regional indicator (RI) symbols if > there is an odd number of RI characters before the break point". It has > also happened in the direction break=>no break, when emoji ZWJ > sequences were introduced. > From unicode at unicode.org Wed May 23 10:53:35 2018 From: unicode at unicode.org (Abe Voelker via Unicode) Date: Wed, 23 May 2018 10:53:35 -0500 Subject: =?UTF-8?Q?Major_vendors_changing_U=2B1F52B_PISTOL_=F0=9F=94=AB_depiction?= =?UTF-8?Q?_from_firearm_to_squirt_gun?= Message-ID: Hello, I'm curious if there has been any discussion on all the major vendors changing this emoji's depiction?
( https://blog.emojipedia.org/all-major-vendors-commit-to-gun-redesign/) As a user I find it troublesome because previous messages I've sent using this character on these platforms may now be interpreted differently due to the changed representation. That aspect has me wondering if this change is in line with Unicode standard conformance requirements. Regards, Abe Voelker -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 23 12:08:31 2018 From: unicode at unicode.org (via Unicode) Date: Wed, 23 May 2018 20:08:31 +0300 Subject: =?UTF-8?Q?VS:_Major_vendors_changing_U+1F5?= =?UTF-8?Q?2B_PISTOL_=F0=9F=94=AB_depiction_from_fire?= =?UTF-8?Q?arm_to_squirt_gun?= In-Reply-To: References: Message-ID: <002301d3f2b8$ae413690$0ac3a3b0$@iki.fi> I'd treat these as glyph changes within fonts. Sincerely Erkki I. Kolehmainen Lähettäjä: Unicode Puolesta Abe Voelker via Unicode Lähetetty: keskiviikko 23. toukokuuta 2018 18.54 Vastaanottaja: unicode at unicode.org Aihe: Major vendors changing U+1F52B PISTOL 🔫 depiction from firearm to squirt gun Hello, I'm curious if there has been any discussion on all the major vendors changing this emoji's depiction? (https://blog.emojipedia.org/all-major-vendors-commit-to-gun-redesign/) As a user I find it troublesome because previous messages I've sent using this character on these platforms may now be interpreted differently due to the changed representation. That aspect has me wondering if this change is in line with Unicode standard conformance requirements. Regards, Abe Voelker -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 23 12:49:31 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 23 May 2018 18:49:31 +0100 Subject: Major vendors changing U+1F52B PISTOL =?UTF-8?B?8J+Uqw==?= depiction from firearm to squirt gun In-Reply-To: <002301d3f2b8$ae413690$0ac3a3b0$@iki.fi> References: <002301d3f2b8$ae413690$0ac3a3b0$@iki.fi> Message-ID: <20180523184931.2c5840f7@JRWUBU2> On Wed, 23 May 2018 20:08:31 +0300 via Unicode wrote: > I'd treat these as glyph changes within fonts. I'd treat them as gross violations of character identity. Richard. From unicode at unicode.org Wed May 23 12:59:02 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Wed, 23 May 2018 10:59:02 -0700 Subject: =?UTF-8?Q?Re:_Major_vendors_changing_U+1F52B_PISTOL_=f0=9f=94=ab_de?= =?UTF-8?Q?piction_from_firearm_to_squirt_gun?= In-Reply-To: References: Message-ID: <0503725b-afc9-73dd-e62e-fe2d3740f7c6@att.net> On 5/23/2018 8:53 AM, Abe Voelker via Unicode wrote: > As a user I find it troublesome because previous messages I've sent > using this character on these platforms may now be interpreted > differently due to the changed representation. That aspect has me > wondering if this change is in line with Unicode standard conformance > requirements. > The Unicode Standard publishes only *text presentation* (black and white) representative glyphs for emoji characters. And those text presentation glyphs have been quite stable in the standard. For U+1F52B PISTOL, the glyph currently published in Unicode 10.0 (and the one which will be published imminently in Unicode 11.0) is precisely the same as the glyph that was initially published nearly 8 years ago in Unicode 6.0. Care to check up on that?
Unicode 6.0: https://www.unicode.org/charts/PDF/Unicode-6.0/U60-1F300.pdf Unicode 11.0: https://www.unicode.org/charts/PDF/Unicode-11.0/U110-1F300.pdf What vendors do for their colorful *emoji presentation* glyphs is basically outside the scope of the Unicode Standard. Technically, it is outside the scope even of the separate Unicode Technical Standard #51, Unicode Emoji, which specifies data, behavior, and other mechanisms for promoting interoperability and valid interchange of emoji characters and emoji sequences, but which does *not* try to constrain vendors in their emoji glyph designs. Now, sure, nobody wants their emoji for an avocado, to willy-nilly turn into a completely unrelated emoji for a crying face. But many emoji are deliberately vague in their scope of denotation and connotation, and the vendors have a lot a leeway to design little images that they like and their customers like. And the Unicode Standard does not now and probably never will try to define and enforce precise semantics and usage rules for every single emoji character. Basically, it is a fool's game to be using emoji as if they were a well-defined and standardized pictographic orthography with unchanging semantics. If you want stable presentation of content, use a pdf document or an image. If you want stable and accurate conveyance of particular meaning -- well, write it out in the standard orthography of a particular language. If you want playful and emotional little pictographs accompanying text, well, then don't expect either stability of the images or the meaning, because that isn't how emoji work. Case in point: if you are using U+1F351 PEACH for its well-known resemblance to a bum, well, don't complain to the Unicode Consortium if a phone vendor changes the meaning of your message by redesigning its emoji glyph for U+1F351 to a cut peach slice that more resembles a smile. --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed May 23 13:00:33 2018 From: unicode at unicode.org (Michael Everson via Unicode) Date: Wed, 23 May 2018 19:00:33 +0100 Subject: =?utf-8?Q?Re=3A_Major_vendors_changing_U+1F52B_PISTOL_?= =?utf-8?Q?=F0=9F=94=AB_depiction_from_firearm_to_squirt_gun?= In-Reply-To: <002301d3f2b8$ae413690$0ac3a3b0$@iki.fi> References: <002301d3f2b8$ae413690$0ac3a3b0$@iki.fi> Message-ID: <3F9180F7-03CA-4636-9816-742589F63720@evertype.com> I consider it a significant semantic shift from the intended meaning of the character in the source Japanese character set. Michael Everson From unicode at unicode.org Wed May 23 14:55:17 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 23 May 2018 20:55:17 +0100 Subject: Major vendors changing U+1F52B PISTOL =?UTF-8?B?8J+Uqw==?= depiction from firearm to squirt gun In-Reply-To: <0503725b-afc9-73dd-e62e-fe2d3740f7c6@att.net> References: <0503725b-afc9-73dd-e62e-fe2d3740f7c6@att.net> Message-ID: <20180523205517.1ad64f0a@JRWUBU2> On Wed, 23 May 2018 10:59:02 -0700 Ken Whistler via Unicode wrote: > If you want stable and accurate > conveyance of particular meaning -- well, write it out in the > standard orthography of a particular language. Preferably not of a living language, though even the semantics of a dead language can wobble. Richard. 
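One plain-text lever that does exist for the text-versus-emoji distinction described above is the standardized variation sequences. A minimal Python sketch, assuming U+1F52B participates in emoji-variation-sequences.txt (worth checking against the current data file):

PISTOL = "\U0001F52B"
TEXT_STYLE = PISTOL + "\uFE0E"    # VS15 requests the black-and-white text presentation
EMOJI_STYLE = PISTOL + "\uFE0F"   # VS16 requests the colorful emoji presentation

for label, s in (("text", TEXT_STYLE), ("emoji", EMOJI_STYLE)):
    print(label, [f"U+{ord(c):04X}" for c in s])

The variation selector only constrains which style of glyph is requested, not what a vendor chooses to draw for that style, which is the crux of the complaint in this thread.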
From unicode at unicode.org Wed May 23 20:18:10 2018 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Thu, 24 May 2018 10:18:10 +0900 Subject: =?UTF-8?Q?Re:_Major_vendors_changing_U+1F52B_PISTOL_=f0=9f=94=ab_de?= =?UTF-8?Q?piction_from_firearm_to_squirt_gun?= In-Reply-To: <3F9180F7-03CA-4636-9816-742589F63720@evertype.com> References: <002301d3f2b8$ae413690$0ac3a3b0$@iki.fi> <3F9180F7-03CA-4636-9816-742589F63720@evertype.com> Message-ID: <90d61d43-db51-89dc-82d8-d2b6de8b2dba@it.aoyama.ac.jp> On 2018/05/24 03:00, Michael Everson via Unicode wrote: > I consider it a significant semantic shift from the intended meaning of the character in the source Japanese character set. Yes and no. I'd consider the semantic shift from a real pistol in a Japanese message to a real pistol in a message in the US quite significant. The former, except for some extremely small and marginal segment of Japanese society, essentially has no "I might shoot you" implications at all. In the latter case, that may be quite a bit different. I'm not saying the (glyph or whatever you call it) change was okay. But when talking about semantics, it's important to not only consider surface semantics, but also the overall context. Regards, Martin. From unicode at unicode.org Thu May 24 15:28:51 2018 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Thu, 24 May 2018 22:28:51 +0200 (CEST) Subject: =?UTF-8?Q?Re:_Major_vendors_changing_U+1F52B_PISTO?= =?UTF-8?Q?L_=F0=9F=94=AB_depiction_from_firearm_to_squirt_gun?= In-Reply-To: References: Message-ID: <1638057770.543374.1527193731416@ox.hosteurope.de> Abe Voelker: > > I'm curious if there has been any discussion on all the major vendors > changing this emoji's depiction? ( > https://blog.emojipedia.org/all-major-vendors-commit-to-gun-redesign/) Curiously, this happened right before UTC 155 in a possibly concerted (but at least not independent) manner by Twitter and Google at least. My comments on PRI 356 (UTS 51.11) from 17 April already seem outdated. From the single-line feedback I've received, it seems the issue has not been discussed at the meeting in late April. (I've yet to review the minutes.) I'm suggesting ZWJ sequences to distinguish between a firearm (????) and a toy (????) for PISTOL. This does not solve the valid compatibility concerns. > As a user I find it troublesome because previous messages I've sent using > this character on these platforms may now be interpreted differently due to > the changed representation. We must discourage the perception that emojis are only used in volatile text messages (often in walled-garden systems) and tweets. They also appear in texts that are meant to be read in the future as well. From unicode at unicode.org Sat May 26 16:58:54 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 26 May 2018 23:58:54 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> Message-ID: Even flat notes or rhythmic and pause symbols in Western musical notations have different contextual meaning depending on musical keys at start of scores, and other notations or symbols added above the score. So their interpretation is also variable according to context, just like tuning in an Arabic musical score, which is also keyed and annotated differently.
These keys can also change within the same partition score. So both the E12 vs. E24 systems (which are not incompatible) may also be used in Western and Arabic music notations. The score keys will give the interpretation. Tone marks taken isolately mean absolutely nothing in both systems outside the keyed scores in which they are inserted, except that they are just glyphs, which may be used to mean something else (e.g. a note in a comics artwork could be used to denote someone whistling, without actually encoding any specific tone, or rythmic). 2018-05-17 17:48 GMT+02:00 Hans ?berg via Unicode : > > > > On 17 May 2018, at 16:47, Garth Wallace via Unicode > wrote: > > > > On Thu, May 17, 2018 at 12:41 AM Hans ?berg wrote: > > > > > On 17 May 2018, at 08:47, Garth Wallace via Unicode < > unicode at unicode.org> wrote: > > > > > >> On Wed, May 16, 2018 at 12:42 AM, Hans ?berg via Unicode < > unicode at unicode.org> wrote: > > >> > > >> It would be best to encode the SMuFL symbols, which is rather > comprehensive and include those: > > >> https://www.smufl what should be unified.org > > >> http://www.smufl.org/version/latest/ > > >> ... > > >> > > >> These are otherwise originally the same, but has since drifted. So > whether to unify them or having them separate might be best to see what > SMuFL does, as they are experts on the issue. > > >> > > > SMuFL's standards on unification are not the same as Unicode's. For > one thing, they re-encode Latin letters and Arabic digits multiple times > for various different uses (such as numbers used in tuplets and those used > in time signatures). > > > > The reason is probably because it is intended for use with music > engraving, and they should then be rendered differently. > > > > Exactly. But Unicode would consider these a matter for font switching in > rich text. > > One original principle was ensure different encodings, so if the practise > in music engraving is to keep them different, they might be encoded > differently. > > > > There are duplicates all over the place, like how the half-sharp > symbol is encoded at U+E282 as "accidentalQuarterToneSharpStein", at > U+E422 as "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as " > accidentalQuarterToneSharpArabic", and at U+E444 as > "accidentalKomaSharp". They are graphically identical, and the first three > even all mean the same thing, a quarter tone sharp! > > > > But the tuning system is different, E24 and Pythagorean. Some Latin and > Greek uppercase letters are exactly the same but have different encodings. > > > > Tuning systems are not scripts. > > That seems obvious. As I pointed out above, the Arabic glyphs were > originally taken from Western ones, but have a different musical meaning, > also when played using E12, as some do. > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun May 27 06:02:24 2018 From: unicode at unicode.org (Ivan Panchenko via Unicode) Date: Sun, 27 May 2018 13:02:24 +0200 Subject: =?UTF-8?Q?Re:_Major_vendors_changing_U+1F52B_PISTOL_=f0=9f=94=ab_de?= =?UTF-8?Q?piction_from_firearm_to_squirt_gun?= Message-ID: <1222dee2-7468-972d-dd1f-12724ec22924@gmail.com> On another note, the ?crocodile shot by police? (??????) example in UTS #51 appears with a water gun glyph (taken from Apple) now. If the pistol is to be gotten rid of, would it not be more sensible to stop supporting the emoji rather than to corrupt its meaning? 
NTT DOCOMO apparently did not change to a squirt gun: https://www.nttdocomo.co.jp/binary/pdf/service/developer/smart_phone/make_contents/pictograph/pictograph_list.pdf Best regards Ivan From unicode at unicode.org Sun May 27 15:18:43 2018 From: unicode at unicode.org (Garth Wallace via Unicode) Date: Sun, 27 May 2018 13:18:43 -0700 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> Message-ID: Philippe is entirely correct here. The fact that a symbol has somewhat different meanings in different contexts does not mean that it is actually multiple visually identical symbols. Otherwise Unicode would be re-encoding the Latin alphabet many, many times over. During most of Bach's career, the prevailing tuning system was meantone. He wrote the Well-Tempered Clavier to explore the possibilities afforded by a new tuning system called well temperament. In the modern era, his work has typically been played in 12-tone equal temperament. That does not mean that the ? that Bach used in his score for the Well-Tempered Clavier was not the same symbol as the ? in his other scores, or that they somehow invisibly became yet another symbol when the score is opened on the music desk of a modern Steinway. On Sat, May 26, 2018 at 2:58 PM, Philippe Verdy wrote: > Even flat notes or rythmic and pause symbols in Western musical notations > have different contextual meaning depending on musical keys at start of > scores, and other notations or symbols added above the score. So their > interpretation are also variable according to context, just like tuning in > a Arabic musical score, which is also keyed and annotated differently. > These keys can also change within the same partition score. > So both the E12 vs. E24 systems (which are not incompatible) may also be > used in Western and Arabic music notations. The score keys will give the > interpretation. > Tone marks taken isolately mean absolutely nothing in both systems outside > the keyed scores in which they are inserted, except that they are just > glyphs, which may be used to mean something else (e.g. a note in a comics > artwork could be used to denote someone whistling, without actually > encoding any specific tone, or rythmic). > > > 2018-05-17 17:48 GMT+02:00 Hans ?berg via Unicode : > >> >> >> > On 17 May 2018, at 16:47, Garth Wallace via Unicode < >> unicode at unicode.org> wrote: >> > >> > On Thu, May 17, 2018 at 12:41 AM Hans ?berg wrote: >> > >> > > On 17 May 2018, at 08:47, Garth Wallace via Unicode < >> unicode at unicode.org> wrote: >> > > >> > >> On Wed, May 16, 2018 at 12:42 AM, Hans ?berg via Unicode < >> unicode at unicode.org> wrote: >> > >> >> > >> It would be best to encode the SMuFL symbols, which is rather >> comprehensive and include those: >> > >> https://www.smufl what should be unified.org >> > >> http://www.smufl.org/version/latest/ >> > >> ... >> > >> >> > >> These are otherwise originally the same, but has since drifted. So >> whether to unify them or having them separate might be best to see what >> SMuFL does, as they are experts on the issue. >> > >> >> > > SMuFL's standards on unification are not the same as Unicode's. For >> one thing, they re-encode Latin letters and Arabic digits multiple times >> for various different uses (such as numbers used in tuplets and those used >> in time signatures). 
>> > >> > The reason is probably because it is intended for use with music >> engraving, and they should then be rendered differently. >> > >> > Exactly. But Unicode would consider these a matter for font switching >> in rich text. >> >> One original principle was ensure different encodings, so if the practise >> in music engraving is to keep them different, they might be encoded >> differently. >> >> > > There are duplicates all over the place, like how the half-sharp >> symbol is encoded at U+E282 as "accidentalQuarterToneSharpStein", at >> U+E422 as "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as >> "accidentalQuarterToneSharpArabic", and at U+E444 as >> "accidentalKomaSharp". They are graphically identical, and the first three >> even all mean the same thing, a quarter tone sharp! >> > >> > But the tuning system is different, E24 and Pythagorean. Some Latin and >> Greek uppercase letters are exactly the same but have different encodings. >> > >> > Tuning systems are not scripts. >> >> That seems obvious. As I pointed out above, the Arabic glyphs were >> originally taken from Western ones, but have a different musical meaning, >> also when played using E12, as some do. >> >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun May 27 15:33:00 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 27 May 2018 22:33:00 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> Message-ID: Thanks! Le dim. 27 mai 2018 22:18, Garth Wallace a ?crit : > Philippe is entirely correct here. The fact that a symbol has somewhat > different meanings in different contexts does not mean that it is actually > multiple visually identical symbols. Otherwise Unicode would be re-encoding > the Latin alphabet many, many times over. > > During most of Bach's career, the prevailing tuning system was meantone. > He wrote the Well-Tempered Clavier to explore the possibilities afforded by > a new tuning system called well temperament. In the modern era, his work > has typically been played in 12-tone equal temperament. That does not mean > that the ? that Bach used in his score for the Well-Tempered Clavier was > not the same symbol as the ? in his other scores, or that they somehow > invisibly became yet another symbol when the score is opened on the music > desk of a modern Steinway. > > On Sat, May 26, 2018 at 2:58 PM, Philippe Verdy > wrote: > >> Even flat notes or rythmic and pause symbols in Western musical notations >> have different contextual meaning depending on musical keys at start of >> scores, and other notations or symbols added above the score. So their >> interpretation are also variable according to context, just like tuning in >> a Arabic musical score, which is also keyed and annotated differently. >> These keys can also change within the same partition score. >> So both the E12 vs. E24 systems (which are not incompatible) may also be >> used in Western and Arabic music notations. The score keys will give the >> interpretation. >> Tone marks taken isolately mean absolutely nothing in both systems >> outside the keyed scores in which they are inserted, except that they are >> just glyphs, which may be used to mean something else (e.g. 
a note in a >> comics artwork could be used to denote someone whistling, without actually >> encoding any specific tone, or rythmic). >> >> >> 2018-05-17 17:48 GMT+02:00 Hans ?berg via Unicode : >> >>> >>> >>> > On 17 May 2018, at 16:47, Garth Wallace via Unicode < >>> unicode at unicode.org> wrote: >>> > >>> > On Thu, May 17, 2018 at 12:41 AM Hans ?berg >>> wrote: >>> > >>> > > On 17 May 2018, at 08:47, Garth Wallace via Unicode < >>> unicode at unicode.org> wrote: >>> > > >>> > >> On Wed, May 16, 2018 at 12:42 AM, Hans ?berg via Unicode < >>> unicode at unicode.org> wrote: >>> > >> >>> > >> It would be best to encode the SMuFL symbols, which is rather >>> comprehensive and include those: >>> > >> https://www.smufl what should be unified.org >>> > >> http://www.smufl.org/version/latest/ >>> > >> ... >>> > >> >>> > >> These are otherwise originally the same, but has since drifted. So >>> whether to unify them or having them separate might be best to see what >>> SMuFL does, as they are experts on the issue. >>> > >> >>> > > SMuFL's standards on unification are not the same as Unicode's. For >>> one thing, they re-encode Latin letters and Arabic digits multiple times >>> for various different uses (such as numbers used in tuplets and those used >>> in time signatures). >>> > >>> > The reason is probably because it is intended for use with music >>> engraving, and they should then be rendered differently. >>> > >>> > Exactly. But Unicode would consider these a matter for font switching >>> in rich text. >>> >>> One original principle was ensure different encodings, so if the >>> practise in music engraving is to keep them different, they might be >>> encoded differently. >>> >>> > > There are duplicates all over the place, like how the half-sharp >>> symbol is encoded at U+E282 as "accidentalQuarterToneSharpStein", at U+E422 >>> as "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as >>> "accidentalQuarterToneSharpArabic", and at U+E444 as "accidentalKomaSharp". >>> They are graphically identical, and the first three even all mean the same >>> thing, a quarter tone sharp! >>> > >>> > But the tuning system is different, E24 and Pythagorean. Some Latin >>> and Greek uppercase letters are exactly the same but have different >>> encodings. >>> > >>> > Tuning systems are not scripts. >>> >>> That seems obvious. As I pointed out above, the Arabic glyphs were >>> originally taken from Western ones, but have a different musical meaning, >>> also when played using E12, as some do. >>> >>> >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun May 27 17:36:02 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Mon, 28 May 2018 00:36:02 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> Message-ID: <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> The flats and sharps of Arabic music are semantically the same as in Western music, departing from Pythagorean tuning, then, but the microtonal accidentals are different: they simply reused some that were available. By contrast, Persian music notation invented new microtonal accidentals, called the koron and sori, and my impression is that their average value, as measured by Hormoz Farhat in his thesis, is also usable in Arabic music. 
For comparison, I have posted the Arabic maqam in Helmholtz-Ellis notation [1] using this value; note that one actually needs two extra microtonal accidentals?Arabic microtonal notation is in fact not complete. The E24 exact quarter-tones are suitable for making a piano sound badly out of tune. Compare that with the accordion in [2], Farid El Atrache - "Noura-Noura". 1. https://lists.gnu.org/archive/html/lilypond-user/2016-02/msg00607.html 2. https://www.youtube.com/watch?v=fvp6fo7tfpk > On 27 May 2018, at 22:33, Philippe Verdy wrote: > > Thanks! > > Le dim. 27 mai 2018 22:18, Garth Wallace a ?crit : > Philippe is entirely correct here. The fact that a symbol has somewhat different meanings in different contexts does not mean that it is actually multiple visually identical symbols. Otherwise Unicode would be re-encoding the Latin alphabet many, many times over. > > During most of Bach's career, the prevailing tuning system was meantone. He wrote the Well-Tempered Clavier to explore the possibilities afforded by a new tuning system called well temperament. In the modern era, his work has typically been played in 12-tone equal temperament. That does not mean that the ? that Bach used in his score for the Well-Tempered Clavier was not the same symbol as the ? in his other scores, or that they somehow invisibly became yet another symbol when the score is opened on the music desk of a modern Steinway. > > On Sat, May 26, 2018 at 2:58 PM, Philippe Verdy wrote: > Even flat notes or rythmic and pause symbols in Western musical notations have different contextual meaning depending on musical keys at start of scores, and other notations or symbols added above the score. So their interpretation are also variable according to context, just like tuning in a Arabic musical score, which is also keyed and annotated differently. These keys can also change within the same partition score. > So both the E12 vs. E24 systems (which are not incompatible) may also be used in Western and Arabic music notations. The score keys will give the interpretation. > Tone marks taken isolately mean absolutely nothing in both systems outside the keyed scores in which they are inserted, except that they are just glyphs, which may be used to mean something else (e.g. a note in a comics artwork could be used to denote someone whistling, without actually encoding any specific tone, or rythmic). > > > 2018-05-17 17:48 GMT+02:00 Hans ?berg via Unicode : > > > > On 17 May 2018, at 16:47, Garth Wallace via Unicode wrote: > > > > On Thu, May 17, 2018 at 12:41 AM Hans ?berg wrote: > > > > > On 17 May 2018, at 08:47, Garth Wallace via Unicode wrote: > > > > > >> On Wed, May 16, 2018 at 12:42 AM, Hans ?berg via Unicode wrote: > > >> > > >> It would be best to encode the SMuFL symbols, which is rather comprehensive and include those: > > >> https://www.smufl what should be unified.org > > >> http://www.smufl.org/version/latest/ > > >> ... > > >> > > >> These are otherwise originally the same, but has since drifted. So whether to unify them or having them separate might be best to see what SMuFL does, as they are experts on the issue. > > >> > > > SMuFL's standards on unification are not the same as Unicode's. For one thing, they re-encode Latin letters and Arabic digits multiple times for various different uses (such as numbers used in tuplets and those used in time signatures). > > > > The reason is probably because it is intended for use with music engraving, and they should then be rendered differently. > > > > Exactly. 
But Unicode would consider these a matter for font switching in rich text. > > One original principle was ensure different encodings, so if the practise in music engraving is to keep them different, they might be encoded differently. > > > > There are duplicates all over the place, like how the half-sharp symbol is encoded at U+E282 as "accidentalQuarterToneSharpStein", at U+E422 as "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as "accidentalQuarterToneSharpArabic", and at U+E444 as "accidentalKomaSharp". They are graphically identical, and the first three even all mean the same thing, a quarter tone sharp! > > > > But the tuning system is different, E24 and Pythagorean. Some Latin and Greek uppercase letters are exactly the same but have different encodings. > > > > Tuning systems are not scripts. > > That seems obvious. As I pointed out above, the Arabic glyphs were originally taken from Western ones, but have a different musical meaning, also when played using E12, as some do. > > > > > From unicode at unicode.org Sun May 27 20:39:52 2018 From: unicode at unicode.org (Garth Wallace via Unicode) Date: Sun, 27 May 2018 18:39:52 -0700 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> Message-ID: On Sun, May 27, 2018 at 3:36 PM, Hans ?berg wrote: > The flats and sharps of Arabic music are semantically the same as in > Western music, departing from Pythagorean tuning, then, but the microtonal > accidentals are different: they simply reused some that were available. But they aren't different! They are the same symbols. They are, as you yourself say, reused. The fact that they do not denote the same width in cents in Arabic music as they do in Western modern classical does not matter. That sort of precision is not inherent to the written symbols. > By contrast, Persian music notation invented new microtonal accidentals, > called the koron and sori, and my impression is that their average value, > as measured by Hormoz Farhat in his thesis, is also usable in Arabic music. > For comparison, I have posted the Arabic maqam in Helmholtz-Ellis notation > [1] using this value; note that one actually needs two extra microtonal > accidentals?Arabic microtonal notation is in fact not complete. > > The E24 exact quarter-tones are suitable for making a piano sound badly > out of tune. Compare that with the accordion in [2], Farid El Atrache - > "Noura-Noura". > > 1. https://lists.gnu.org/archive/html/lilypond-user/2016-02/msg00607.html > 2. https://www.youtube.com/watch?v=fvp6fo7tfpk > > > > I don't really see how this is relevant. Nobody is claiming that the koron and sori accidentals are the same symbols as the Arabic half-sharp and flat with crossbar. They look entirely different. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun May 27 14:27:03 2018 From: unicode at unicode.org (SundaraRaman R via Unicode) Date: Mon, 28 May 2018 00:57:03 +0530 Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic? 
Message-ID: Hi, In languages like Ruby or Java (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)), functions to check if a character is alphabetic do that by looking for the 'Alphabetic' property (defined true if it's in one of the L categories, or Nl, or has 'Other_Alphabetic' property). When parsing Tamil text, this works out well for independent vowels and consonants (which are in Lo), and for most dependent signs (which are in Mc or Mn but have the 'Other_Alphabetic' property), but the very common pulli (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to concluding any string containing it to be non-alphabetic. This doesn't make sense to me since the Virama ???? is as much of an alphabetic character as any of the "Dependent Vowel" characters which have been given the 'Other_Alphabetic' property. Is there a rationale behind this difference, or is it an oversight to be corrected? Cheers, Sundar From unicode at unicode.org Mon May 28 03:08:30 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Mon, 28 May 2018 10:08:30 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> Message-ID: <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> > On 28 May 2018, at 03:39, Garth Wallace wrote: > >> On Sun, May 27, 2018 at 3:36 PM, Hans Åberg wrote: >> The flats and sharps of Arabic music are semantically the same as in Western music, departing from Pythagorean tuning, then, but the microtonal accidentals are different: they simply reused some that were available. >> > But they aren't different! They are the same symbols. They are, as you yourself say, reused. Historically, yes, but not necessarily now. > The fact that they do not denote the same width in cents in Arabic music as they do in Western modern classical does not matter. That sort of precision is not inherent to the written symbols. It is not about precision, but concepts. Like B, Β, and В, which could have been unified, but are not. > By contrast, Persian music notation invented new microtonal accidentals, called the koron and sori, and my impression is that their average value, as measured by Hormoz Farhat in his thesis, is also usable in Arabic music. For comparison, I have posted the Arabic maqam in Helmholtz-Ellis notation [1] using this value; note that one actually needs two extra microtonal accidentals; Arabic microtonal notation is in fact not complete. > > The E24 exact quarter-tones are suitable for making a piano sound badly out of tune. Compare that with the accordion in [2], Farid El Atrache - "Noura-Noura". > > 1. https://lists.gnu.org/archive/html/lilypond-user/2016-02/msg00607.html > 2. https://www.youtube.com/watch?v=fvp6fo7tfpk > > > > I don't really see how this is relevant. Nobody is claiming that the koron and sori accidentals are the same symbols as the Arabic half-sharp and flat with crossbar. They look entirely different. Arabic music simply happens to use Western-style accidentals for concepts similar to Persian music rather than Western music. 
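As a concrete illustration of the isAlphabetic() behaviour SundaraRaman describes above, here is a minimal Java sketch (the Tamil sample word, the class name, and the comments on the expected result are mine, purely for illustration; they are not from the thread):

    public class NaiveAlphabeticCheck {
        public static void main(String[] args) {
            // Tamil KA + PULLI + KA + VOWEL SIGN AA (U+0B95 U+0BCD U+0B95 U+0BBE)
            String word = "\u0B95\u0BCD\u0B95\u0BBE";
            word.codePoints().forEach(cp ->
                    System.out.printf("U+%04X alphabetic=%b%n",
                            cp, Character.isAlphabetic(cp)));
            // U+0BCD TAMIL SIGN VIRAMA is gc=Mn without Other_Alphabetic, so
            // Character.isAlphabetic() reports false for it, and the whole word
            // fails a test that requires every code point to be Alphabetic.
            System.out.println("all alphabetic: "
                    + word.codePoints().allMatch(Character::isAlphabetic));
        }
    }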
From unicode at unicode.org Mon May 28 04:05:42 2018 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Mon, 28 May 2018 10:05:42 +0100 (BST) Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> Message-ID: On 2018-05-28, Hans ?berg via Unicode wrote: >> On 28 May 2018, at 03:39, Garth Wallace wrote: >>> On Sun, May 27, 2018 at 3:36 PM, Hans ?berg wrote: >>> The flats and sharps of Arabic music are semantically the same as in Western music, departing from Pythagorean tuning, then, but the microtonal accidentals are different: they simply reused some that were available. ... >> The fact that they do not denote the same width in cents in Arabic music as they do in Western modern classical does not matter. That sort of precision is not inherent to the written symbols. > > It is not about precision, but concepts. Like B, ?, and ?, which could have been unified, but are not. Latin, Greek, Cyrillic etc. could not have been unified, because of the requirement to have round-trip compatibility with previous encodings. It is also, of course, convenient for many reasons to have the notion of "script" hard-coded into unicode code-points, instead of in higher-level mark-up where it arguably belongs - just as, when copyright finally expires, it will be convenient to have Tolkien's runes disunified from historical runes (which is the line taken by the proposal waiting for that day). Whether it is so convenient to have a "music script" notion hard-coded is presumably what this argument is about. It's not obvious to me that musical notation is something that carries the "script" baggage in the same way that writing systems do. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Mon May 28 05:43:10 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Mon, 28 May 2018 12:43:10 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> Message-ID: <88FD19F5-F401-4CED-A397-5D7BAE4EFDB1@telia.com> > On 28 May 2018, at 11:05, Julian Bradfield via Unicode wrote: > > On 2018-05-28, Hans ?berg via Unicode wrote: >>> On 28 May 2018, at 03:39, Garth Wallace wrote: >>>> On Sun, May 27, 2018 at 3:36 PM, Hans ?berg wrote: >>>> The flats and sharps of Arabic music are semantically the same as in Western music, departing from Pythagorean tuning, then, but the microtonal accidentals are different: they simply reused some that were available. > ... >>> The fact that they do not denote the same width in cents in Arabic music as they do in Western modern classical does not matter. That sort of precision is not inherent to the written symbols. >> >> It is not about precision, but concepts. Like B, ?, and ?, which could have been unified, but are not. > > Latin, Greek, Cyrillic etc. could not have been unified, because of the > requirement to have round-trip compatibility with previous encodings. Indeed, in Unicode because of that, which I pointed out. 
> It is also, of course, convenient for many reasons to have the notion > of "script" hard-coded into unicode code-points, instead of in > higher-level mark-up where it arguably belongs - just as, when > copyright finally expires, it will be convenient to have Tolkien's > runes disunified from historical runes (which is the line taken by the > proposal waiting for that day). Whether it is so convenient to have a > "music script" notion hard-coded is presumably what this argument is > about. It's not obvious to me that musical notation is something that > carries the "script" baggage in the same way that writing systems do. Indeed, that is what I also pointed out. So I suggested contacting the SMuFL people, who might inform us about the underlying reasoning, and then making a decision about what might be suitable for Unicode. They probably have them separate for the same reason as for scripts: originally different font encodings, but those are not official, and in addition it is for music engraving, not for writing in text files. From unicode at unicode.org Mon May 28 07:57:26 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 28 May 2018 13:57:26 +0100 Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic? In-Reply-To: References: Message-ID: <20180528135726.7759c425@JRWUBU2> On Mon, 28 May 2018 00:57:03 +0530 SundaraRaman R via Unicode wrote: > Hi, > > In languages like Ruby or Java > (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)), > functions to check if a character is alphabetic do that by looking for > the 'Alphabetic' property (defined true if it's in one of the L > categories, or Nl, or has 'Other_Alphabetic' property). When parsing > Tamil text, this works out well for independent vowels and consonants > (which are in Lo), and for most dependent signs (which are in Mc or Mn > but have the 'Other_Alphabetic' property), but the very common pulli > (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to > concluding any string containing it to be non-alphabetic. > > This doesn't make sense to me since the Virama ???? is as much of an > alphabetic character as any of the "Dependent Vowel" characters which > have been given the 'Other_Alphabetic' property. Is there a rationale > behind this difference, or is it an oversight to be corrected? There is only one character with a canonical combining class of 9 that is included as other_alphabetic, namely U+0E3A THAI CHARACTER PHINTHU. That last had any of the other properties of viramas back in Unicode 1.0; the characters that triggered such behaviours were permanently removed in Unicode 1.1. There are some notable absences from the combining marks included. Significant absences include ZWJ, ZWNJ and CGJ. However, a non-erroneous *conformant* Unicode process cannot always determine whether a string, given only that it is a string, is composed only of alphabetic characters. The answer would be 'yes' for but 'no' for the canonically equivalent ! (U+0327 is not included as alphabetic either.) There is at least one combination of Latin letter and combining mark that occurs in the normal orthography of a natural language and does not have a precomposed equivalent. I fear that the correct test for what you want is to split text into words and check that each word begins with an alphabetic character. That test can be made by a conformant process. I think, but have not checked, that the test can be simplified to: (a) Check that the first character is alphabetic. 
(b) Ignore every character with a WordBreak property of Extend or ZWJ (c) Check that all other characters are alphabetic. Richard. From unicode at unicode.org Mon May 28 08:10:23 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 28 May 2018 14:10:23 +0100 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> Message-ID: <20180528141023.24d2231e@JRWUBU2> On Mon, 28 May 2018 10:08:30 +0200 Hans Åberg via Unicode wrote: > > On 28 May 2018, at 03:39, Garth Wallace wrote: > > The fact that they do not denote the same width in cents in Arabic > > music as they do in Western modern classical does not matter. That > > sort of precision is not inherent to the written symbols. > > It is not about precision, but concepts. Like B, Β, and В, which > could have been unified, but are not. Unifying these would make a real mess of lower casing! What is the context in which the Arab use would benefit from having a different encoding? Richard. From unicode at unicode.org Mon May 28 08:30:55 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Mon, 28 May 2018 15:30:55 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: <20180528141023.24d2231e@JRWUBU2> References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> <20180528141023.24d2231e@JRWUBU2> Message-ID: > On 28 May 2018, at 15:10, Richard Wordingham via Unicode wrote: > > On Mon, 28 May 2018 10:08:30 +0200 > Hans Åberg via Unicode wrote: > >> It is not about precision, but concepts. Like B, Β, and В, which >> could have been unified, but are not. > > Unifying these would make a real mess of lower casing! German has a special sign ß for "ss", without upper capital version. > What is the context in which the Arab use would benefit from having a > different encoding? Maybe if they decide to change the glyph, then what already is encoded would get the right appearance. But SMuFL might have had other reasons: the glyphs should probably be designed together. And it is simple, as one does not need to investigate their uses too much. For example, the Turkish AEU sharps are microtonal, not the ordinary ones. So if the Turkish accidentals have their own code points, one can change that later. From unicode at unicode.org Mon May 28 09:33:11 2018 From: unicode at unicode.org (SundaraRaman R via Unicode) Date: Mon, 28 May 2018 20:03:11 +0530 Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic? In-Reply-To: <20180528135726.7759c425@JRWUBU2> References: <20180528135726.7759c425@JRWUBU2> Message-ID: Hi, thanks for your reply. > There is only one character with a canonical combining class of 9 that > is included as other_alphabetic, namely U+0E3A THAI CHARACTER PHINTHU. > That last had any of the other properties of viramas back in Unicode > 1.0; the characters that triggered such behaviours were permanently > removed in Unicode 1.1. I didn't understand the second sentence here, could you clarify? What do you mean by "any of the other properties" here? 
And "triggered such behaviours" seems to imply having them in other_alphabetic had negative consequences, could you give an example of what that might be? > There are some notable absences from the combining marks included. > Significant absences include ZWJ, ZWNJ and CGJ. > > However, a non-erroneous *conformant* Unicode process cannot > always determine whether a string, given only that it is a string, is > composed only of alphabetic characters. The answer would be 'yes' for > but 'no' for the canonically > equivalent ! > (U+0327 is not included as alphabetic either.) > > There is at least one combination of Latin letter and combining mark > that occurs in the normal orthography of a natural language and does not > have a precomposed equivalent. Ah, that's somewhat unfortunate that such a quick and easy alphabetic check is not possible in the general case, but I can understand how it might be weird to give the Alphabetic property to a ZWJ or ZWNJ. But in the case of Tamil, I'm curious why most other combining Tamil marks go in class 0, whereas pulli goes in 9. Even u0B82 Anusvara, a character barely used in Tamil text, has combining class 0 and is included in Other_Alphabetic, but the visually similar and similarly positioned pulli is not. In this particular case, is it a historical accident that these got assigned this way, or is there a rationale behind these? Would it at all be possible to get this changed in the upcoming Unicode standard? (By the way, I'm happy to get a link to read through for any of my questions here. I just find it quite hard to search for and find past discussions and decision rationales regarding these, not knowing how and where to search for them.) > I fear that the correct test for what you want is to split text into > words and check that each word begins with an alphabetic character. Do you mean "each grapheme cluster begins with an alphabetic character" here? It seems to me (in my very limited Unicode knowledge) that such a test, going through grapheme clusters and checking the first codepoint in each, would also ensure the text is full alphabetic. And it has the advantage that more languages have a (relatively) easy way for splitting text into grapheme clusters, than for checking minor Unicode properties like WordBreak, so this one might be easier to implement. Does this test anywhere in the ballpark of being right? Regards, Sundar From unicode at unicode.org Mon May 28 10:00:37 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 28 May 2018 16:00:37 +0100 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> <20180528141023.24d2231e@JRWUBU2> Message-ID: <20180528160037.6b3689e0@JRWUBU2> On Mon, 28 May 2018 15:30:55 +0200 Hans ?berg via Unicode wrote: > > On 28 May 2018, at 15:10, Richard Wordingham via Unicode > > wrote: > > > > On Mon, 28 May 2018 10:08:30 +0200 > > Hans ?berg via Unicode wrote: > > > >> It is not about precision, but concepts. Like B, ?, and ?, which > >> could have been unified, but are not. > > > > Unifying these would make a real mess of lower casing! > > German has a special sign ? for "ss", without upper capital version. That doesn't prevent upper-casing - you just have to know your audience. 
The three letters like 'B' have very different lower case forms, and very few would agree that they were the same letter. For the same reason, there are two utter confusables in THE Latin SCRIPT for 00D0 LATIN CAPITAL LETTER ETH. More notably though, one just has to run the risk of getting a culturally incorrect upper case when rendering U+014A LATIN CAPITAL LETTER ENG; whether the three alternatives are the same letter is debatable. Richard. From unicode at unicode.org Mon May 28 10:54:47 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Mon, 28 May 2018 17:54:47 +0200 Subject: Unicode characters unification In-Reply-To: <20180528160037.6b3689e0@JRWUBU2> References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> <20180528141023.24d2231e@JRWUBU2> <20180528160037.6b3689e0@JRWUBU2> Message-ID: > On 28 May 2018, at 17:00, Richard Wordingham via Unicode wrote: > > On Mon, 28 May 2018 15:30:55 +0200 > Hans Åberg via Unicode wrote: > >>> On 28 May 2018, at 15:10, Richard Wordingham via Unicode >>> wrote: >>> >>> On Mon, 28 May 2018 10:08:30 +0200 >>> Hans Åberg via Unicode wrote: >>> >>>> It is not about precision, but concepts. Like B, Β, and В, which >>>> could have been unified, but are not. >>> >>> Unifying these would make a real mess of lower casing! >> >> German has a special sign ß for "ss", without upper capital version. > > That doesn't prevent upper-casing - you just have to know your > audience. That would be the same if the Greek and Latin uppercase letters would have been unified: One would need to know the context. > The three letters like 'B' have very different lower case > forms, and very few would agree that they were the same letter. They were the same in the Uncial script, but evolved to be viewed as different. That is common with math symbols: something available evolving into separate symbols. > For the > same reason, there are two utter confusables in THE Latin SCRIPT for > 00D0 LATIN CAPITAL LETTER ETH. The stuff is likely added for computer legacy, if there were separate encodings for those. > More notably though, one just has to run > the risk of getting a culturally incorrect upper case when rendering > U+014A LATIN CAPITAL LETTER ENG; whether the three alternatives are the > same letter is debatable. Unified CJK Ideographs differ by stroke order. From unicode at unicode.org Mon May 28 11:45:39 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 28 May 2018 17:45:39 +0100 Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic? In-Reply-To: References: <20180528135726.7759c425@JRWUBU2> Message-ID: <20180528174539.29acf556@JRWUBU2> On Mon, 28 May 2018 20:03:11 +0530 SundaraRaman R via Unicode wrote: > Hi, thanks for your reply. > > > There is only one character with a canonical combining class of 9 > > that is included as other_alphabetic, namely U+0E3A THAI CHARACTER > > PHINTHU. That last had any of the other properties of viramas back > > in Unicode 1.0; the characters that triggered such behaviours were > > permanently removed in Unicode 1.1. > > I didn't understand the second sentence here, could you clarify? Sorry, I messed that sentence up. It should have read, "The last time that that had any of the other properties of viramas was back in Unicode 1.0;" > What > do you mean by "any of the other properties" here? 
The effects of virama that spring to mind are: (a) Causing one or both letters on either side to change or combine to indicate combination; (b) Appearing as a mark only if it does not affect one of the letters on either side; (c) Causing a left matra to appear on the left of the sequence of consonants joined by a sequence of non-visible viramas. > And "triggered such > behaviours" seems to imply having them in other_alphabetic had > negative consequences, could you give an example of what that might > be? Nowadays, the Thai syllable ???, normatively pronounced /trai/, is only encoded , and the character U+0E3A is always visible when used; for most routine purposes it is little different to U+0E38 THAI CHARACTER SARA U. However, in Unicode 1.0, while was rendered as at present, the same visible string could also be encoded as - no glyph would be rendered for U+0E3A. If one wanted the official Sanskritised Pali version, one could type ???? as at present. One could also encode it as . Weirdly, I couldn't have used the phonetically ordered vowel to type a monk's name ending in ???? , as would have been rendered as ????. As the non-phonetic virama-like behaviours of U+0E3A are only mentioned under the heading 'Alternate Ordering', I can only presume that they were triggered by the phonetic order vowel signs, U+0E70 to U+0E74. It is possible that U+0E3A acquired the alphabetic property because it ceased to behave like a virama. Alternatively, it may have acquired the alphabetic property because of its use in the compound vowels of minority languages. > But in the case of Tamil, I'm curious why most other combining Tamil > marks go in class 0, whereas pulli goes in 9. Even u0B82 Anusvara, a > character barely used in Tamil text, has combining class 0 and is > included in Other_Alphabetic, but the visually similar and similarly > positioned pulli is not. In this particular case, is it a historical > accident that these got assigned this way, or is there a rationale > behind these? Would it at all be possible to get this changed in the > upcoming Unicode standard? Tamil has usually been treated as just another Indian Indic script. U+0E3A is the only virama-like character with the property of being 'alphabetic'. I can't see a change making it into Unicode 11.0. It requires too much careful thought. Besides, anything that considered as alphabetic should also considerer as alphabetic - they should be mostly interchangeable in Tamil. > > I fear that the correct test for what you want is to split text into > > words and check that each word begins with an alphabetic > > character. > > Do you mean "each grapheme cluster begins with an alphabetic > character" here? It seems to me (in my very limited Unicode knowledge) > that such a test, going through grapheme clusters and checking the > first codepoint in each, would also ensure the text is full > alphabetic. Not directly. Is the string "mark2mark" alphabetic? It constitutes a single word. My suggested simplification would say 'no', as it contains '2'; perhaps my simplification is wrong. > And it has the advantage that more languages have a > (relatively) easy way for splitting text into grapheme clusters, than > for checking minor Unicode properties like WordBreak, so this one > might be easier to implement. Does this test anywhere in the ballpark > of being right? Yes, it's close to being right. Note that simple approximations for SE Asian word-breaking (e.g. treating SE Asian characters as alphabetic) should work well for your application. Richard. 
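A rough sketch of the word-based test Richard outlines above ((a) the word starts with an alphabetic character, (b) ignore extending characters, (c) everything else must be alphabetic), using only the standard Java BreakIterator. The class name, the sample strings, and the use of general-category marks plus ZWJ/ZWNJ as a crude stand-in for the "WordBreak property of Extend or ZWJ" rule are my own simplifications, not part of his proposal:

    import java.text.BreakIterator;
    import java.util.Locale;

    public class WordBasedAlphabeticCheck {

        static boolean looksAlphabetic(String text, Locale locale) {
            BreakIterator words = BreakIterator.getWordInstance(locale);
            words.setText(text);
            int start = words.first();
            for (int end = words.next(); end != BreakIterator.DONE;
                    start = end, end = words.next()) {
                String word = text.substring(start, end);
                // Skip segments with no letters or digits (spaces, punctuation).
                if (!word.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    continue;
                }
                // (a) The word must begin with an alphabetic code point.
                if (!Character.isAlphabetic(word.codePointAt(0))) {
                    return false;
                }
                // (b)/(c) Remaining code points: alphabetic, a combining mark,
                // or ZWNJ/ZWJ (crude approximation of "Extend or ZWJ").
                boolean ok = word.codePoints().allMatch(cp ->
                        Character.isAlphabetic(cp)
                        || Character.getType(cp) == Character.NON_SPACING_MARK
                        || Character.getType(cp) == Character.COMBINING_SPACING_MARK
                        || cp == 0x200C || cp == 0x200D);
                if (!ok) {
                    return false;
                }
            }
            return true;
        }

        public static void main(String[] args) {
            // The Tamil word with pulli should pass; "mark2mark" should fail on '2'.
            System.out.println(looksAlphabetic("\u0B95\u0BCD\u0B95\u0BBE",
                    Locale.forLanguageTag("ta")));
            System.out.println(looksAlphabetic("mark2mark", Locale.ENGLISH));
        }
    }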
From unicode at unicode.org Mon May 28 12:18:52 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 28 May 2018 18:18:52 +0100 Subject: Unicode characters unification In-Reply-To: References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> <20180528141023.24d2231e@JRWUBU2> <20180528160037.6b3689e0@JRWUBU2> Message-ID: <20180528181852.7ce84e52@JRWUBU2> On Mon, 28 May 2018 17:54:47 +0200 Hans ?berg via Unicode wrote: > > On 28 May 2018, at 17:00, Richard Wordingham via Unicode > > wrote: > > > > On Mon, 28 May 2018 15:30:55 +0200 > > Hans ?berg via Unicode wrote: > >> German has a special sign ? for "ss", without upper capital > >> version. > > > > That doesn't prevent upper-casing - you just have to know your > > audience. > > That would be the same if the Greek and Latin uppercase letters would > have been unified: One would need to know the context. I've seen a commutation diagram with both U+004D LATIN CAPITAL LETTER M and U+039C GREEK CAPITAL LETTER MU on it. I only knew the difference because I listened to what the lecturer said. > > For the > > same reason, there are two utter confusables in THE Latin SCRIPT for > > 00D0 LATIN CAPITAL LETTER ETH. > The stuff is likely added for computer legacy, if there were separate > encodings for those. Unlikely. U+00F0 LATIN SMALL LETTER ETH and U+0256 LATIN SMALL LETTER D WITH TAIL contrast in the IPA. The difference between U+0111 LATIN SMALL LETTER D WITH STROKE and U+00F0 LATIN SMALL LETTER ETH may have been debated. Richard. From unicode at unicode.org Mon May 28 13:19:09 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Mon, 28 May 2018 20:19:09 +0200 Subject: Unicode characters unification In-Reply-To: <20180528181852.7ce84e52@JRWUBU2> References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> <20180528141023.24d2231e@JRWUBU2> <20180528160037.6b3689e0@JRWUBU2> <20180528181852.7ce84e52@JRWUBU2> Message-ID: <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com> > On 28 May 2018, at 19:18, Richard Wordingham via Unicode wrote: > > On Mon, 28 May 2018 17:54:47 +0200 > Hans ?berg via Unicode wrote: > >>> On 28 May 2018, at 17:00, Richard Wordingham via Unicode >>> wrote: >>> >>> On Mon, 28 May 2018 15:30:55 +0200 >>> Hans ?berg via Unicode wrote: > >>>> German has a special sign ? for "ss", without upper capital >>>> version. >>> >>> That doesn't prevent upper-casing - you just have to know your >>> audience. >> >> That would be the same if the Greek and Latin uppercase letters would >> have been unified: One would need to know the context. > > I've seen a commutation diagram with both U+004D LATIN CAPITAL LETTER > M and U+039C GREEK CAPITAL LETTER MU on it. I only knew the difference > because I listened to what the lecturer said. Indistinguishable math styles Latin and Greek uppercase letters have been added, even though that was not so in for example TeX, and thus no encoding legacy to consider. 
From unicode at unicode.org Mon May 28 14:01:39 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 28 May 2018 20:01:39 +0100 Subject: Unicode characters unification In-Reply-To: <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com> References: <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> <20180528141023.24d2231e@JRWUBU2> <20180528160037.6b3689e0@JRWUBU2> <20180528181852.7ce84e52@JRWUBU2> <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com> Message-ID: <20180528200139.744ee706@JRWUBU2> On Mon, 28 May 2018 20:19:09 +0200 Hans ?berg via Unicode wrote: > Indistinguishable math styles Latin and Greek uppercase letters have > been added, even though that was not so in for example TeX, and thus > no encoding legacy to consider. They sort differently - one can have vaguely alphabetical indexes of mathematical symbols. They also have quite different compatibility decompositions. Does sorting offer an argument for encoding these symbols differently. I'm not sure it's a strong arguments - how likely is one to have a list where the difference matters? Richard. From unicode at unicode.org Mon May 28 14:14:58 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Mon, 28 May 2018 21:14:58 +0200 Subject: Unicode characters unification In-Reply-To: <20180528200139.744ee706@JRWUBU2> References: <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> <20180528141023.24d2231e@JRWUBU2> <20180528160037.6b3689e0@JRWUBU2> <20180528181852.7ce84e52@JRWUBU2> <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com> <20180528200139.744ee706@JRWUBU2> Message-ID: <323E9648-7103-49B0-8CB0-5544759CCFBB@telia.com> > On 28 May 2018, at 21:01, Richard Wordingham via Unicode wrote: > > On Mon, 28 May 2018 20:19:09 +0200 > Hans ?berg via Unicode wrote: > >> Indistinguishable math styles Latin and Greek uppercase letters have >> been added, even though that was not so in for example TeX, and thus >> no encoding legacy to consider. > > They sort differently - one can have vaguely alphabetical indexes of > mathematical symbols. They also have quite different compatibility > decompositions. > > Does sorting offer an argument for encoding these symbols differently. > I'm not sure it's a strong arguments - how likely is one to have a list > where the difference matters? The main point is that they are not likely to be distinguishable when used side-by-side in the same formula. They could be of significance if using Greek names instead of letters, of length greater than one, then. But it is not wrong to add them, because it is easier than having to think through potential uses. 
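To make Richard's remark about compatibility decompositions concrete, a small sketch using the standard Java Normalizer (the two mathematical code points are examples I picked; they are not mentioned in the thread):

    import java.text.Normalizer;

    public class MathLetterDecompositions {
        public static void main(String[] args) {
            // U+1D40C MATHEMATICAL BOLD CAPITAL M and U+1D6B3 MATHEMATICAL BOLD
            // CAPITAL MU look alike, but NFKC maps them to letters of different
            // scripts: U+004D LATIN CAPITAL LETTER M and U+039C GREEK CAPITAL LETTER MU.
            String boldM = new String(Character.toChars(0x1D40C));
            String boldMu = new String(Character.toChars(0x1D6B3));
            System.out.println(Normalizer.normalize(boldM, Normalizer.Form.NFKC));
            System.out.println(Normalizer.normalize(boldMu, Normalizer.Form.NFKC));
        }
    }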
From unicode at unicode.org Mon May 28 14:38:27 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 28 May 2018 20:38:27 +0100 Subject: Unicode characters unification In-Reply-To: <323E9648-7103-49B0-8CB0-5544759CCFBB@telia.com> References: <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> <20180528141023.24d2231e@JRWUBU2> <20180528160037.6b3689e0@JRWUBU2> <20180528181852.7ce84e52@JRWUBU2> <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com> <20180528200139.744ee706@JRWUBU2> <323E9648-7103-49B0-8CB0-5544759CCFBB@telia.com> Message-ID: <20180528203827.7c073b30@JRWUBU2> On Mon, 28 May 2018 21:14:58 +0200 Hans ?berg via Unicode wrote: > > On 28 May 2018, at 21:01, Richard Wordingham via Unicode > > wrote: > > > > On Mon, 28 May 2018 20:19:09 +0200 > > Hans ?berg via Unicode wrote: > > > >> Indistinguishable math styles Latin and Greek uppercase letters > >> have been added, even though that was not so in for example TeX, > >> and thus no encoding legacy to consider. > > > > They sort differently - one can have vaguely alphabetical indexes of > > mathematical symbols. They also have quite different compatibility > > decompositions. > > > > Does sorting offer an argument for encoding these symbols > > differently. I'm not sure it's a strong arguments - how likely is > > one to have a list where the difference matters? > > The main point is that they are not likely to be distinguishable when > used side-by-side in the same formula. They could be of significance > if using Greek names instead of letters, of length greater than one, > then. But it is not wrong to add them, because it is easier than > having to think through potential uses. By these symbols, I meant the quarter-tone symbols. Capital em and capital mu, as symbols, need to be encoded separately for proper sorting. Richard. From unicode at unicode.org Mon May 28 15:23:54 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Mon, 28 May 2018 22:23:54 +0200 Subject: Unicode characters unification In-Reply-To: <20180528203827.7c073b30@JRWUBU2> References: <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> <20180528141023.24d2231e@JRWUBU2> <20180528160037.6b3689e0@JRWUBU2> <20180528181852.7ce84e52@JRWUBU2> <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com> <20180528200139.744ee706@JRWUBU2> <323E9648-7103-49B0-8CB0-5544759CCFBB@telia.com> <20180528203827.7c073b30@JRWUBU2> Message-ID: <8ABECE66-4FB0-4434-8F9D-CB8AA889621D@telia.com> > On 28 May 2018, at 21:38, Richard Wordingham wrote: > > On Mon, 28 May 2018 21:14:58 +0200 > Hans ?berg via Unicode wrote: > >>> On 28 May 2018, at 21:01, Richard Wordingham via Unicode >>> wrote: >>> >>> On Mon, 28 May 2018 20:19:09 +0200 >>> Hans ?berg via Unicode wrote: >>> >>>> Indistinguishable math styles Latin and Greek uppercase letters >>>> have been added, even though that was not so in for example TeX, >>>> and thus no encoding legacy to consider. >>> >>> They sort differently - one can have vaguely alphabetical indexes of >>> mathematical symbols. They also have quite different compatibility >>> decompositions. >>> >>> Does sorting offer an argument for encoding these symbols >>> differently. I'm not sure it's a strong arguments - how likely is >>> one to have a list where the difference matters? >> >> The main point is that they are not likely to be distinguishable when >> used side-by-side in the same formula. 
They could be of significance >> if using Greek names instead of letters, of length greater than one, >> then. But it is not wrong to add them, because it is easier than >> having to think through potential uses. > > By these symbols, I meant the quarter-tone symbols. Capital em and > capital mu, as symbols, need to be encoded separately for proper > sorting. Some of the math style letters are out of order for legacy reasons, so sorting may not work well. SMuFL have different fonts for text and music engraving, but I can't think of any use of sorting them. From unicode at unicode.org Mon May 28 17:13:43 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 28 May 2018 16:13:43 -0600 Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic? Message-ID: <06D5CE553B39491DA2B6E0C05B63D53F@DougEwell> SundaraRaman R wrote: > but the very common pulli (VIRAMA) > is neither in Lo nor has 'Other_Alphabetic', and so leads to > concluding any string containing it to be non-alphabetic. Is this definition part of Unicode? I thought the use of General Category to answer questions like "this sequence is a word" or "this string is alphabetic" was much more complex than that. (I'm not even sure what the latter means, for any script with any sort of combining mark.) Richard Wordingham wrote: > The effects of virama that spring to mind are: > > (a) Causing one or both letters on either side to change or combine to > indicate combination; > > (b) Appearing as a mark only if it does not affect one of the letters > on either side; > > (c) Causing a left matra to appear on the left of the sequence of > consonants joined by a sequence of non-visible viramas. Most of these don't apply to Tamil, of course. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Mon May 28 23:23:13 2018 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Tue, 29 May 2018 13:23:13 +0900 Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic? In-Reply-To: References: Message-ID: <6fbc337a-32a1-5b92-fc1b-6a9dc52f42db@it.aoyama.ac.jp> Hello Sundar, On 2018/05/28 04:27, SundaraRaman R via Unicode wrote: > Hi, > > In languages like Ruby or Java > (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)), > functions to check if a character is alphabetic do that by looking for > the 'Alphabetic' property (defined true if it's in one of the L > categories, or Nl, or has 'Other_Alphabetic' property). When parsing > Tamil text, this works out well for independent vowels and consonants > (which are in Lo), and for most dependent signs (which are in Mc or Mn > but have the 'Other_Alphabetic' property), but the very common pulli (VIRAMA) > is neither in Lo nor has 'Other_Alphabetic', and so leads to > concluding any string containing it to be non-alphabetic. > > This doesn't make sense to me since the Virama ???? as much of an > alphabetic character as any of the "Dependent Vowel" characters which > have been given the 'Other_Alphabetic' property. Is there a rationale > behind this difference, or is it an oversight to be corrected? I suggest submitting an error report via https://www.unicode.org/reporting.html. I haven't studied the issue in detail (sorry, just no time this week), but it sounds reasonable to give the VIRAMA the 'Other_Alphabetic' property. I'd recommend to mention examples other than Tamil in your report (assuming they exist). BTW, what's the method you are using in Ruby? 
If there's a problem in Ruby (which I don't think; it's just using Unicode data), then please make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I should be able to follow up on that. Regards, Martin. From unicode at unicode.org Mon May 28 23:40:49 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 28 May 2018 21:40:49 -0700 Subject: Unicode characters unification In-Reply-To: <20180528200139.744ee706@JRWUBU2> References: <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> <20180528141023.24d2231e@JRWUBU2> <20180528160037.6b3689e0@JRWUBU2> <20180528181852.7ce84e52@JRWUBU2> <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com> <20180528200139.744ee706@JRWUBU2> Message-ID: <8e0a34b8-c074-a152-42d0-bc55b9a132ff@ix.netcom.com> In the discussion leading up to this it has been implied that Unicode encodes / should encode concepts or pure shape. And there's been some confusion as to were concerns about sorting or legacy encodings fit in. Time to step back a bit: Primarily the Unicode Standard encodes by character identity - something that is different from either the pure shape or the "concept denoted by the character". For example, for most alphabetic characters, you could say that they stand for a more-or-less well-defined phonetic value. But Unicode does not encode such values directly, instead it encodes letters - which in turn get re-purposed for different sound values in each writing system. Likewise, the various uses of period or comma are not separately encoded - potentially these marks are given mappings to specific functions for each writing system or notation using them. Clearly these are not encoded to represent a single mapping to an external concept, and, as we will see, they are not necessarily encoded directly by shape. Instead, the Unicode Standard encodes character identity; but there are a number of principled and some ad-hoc deviations from a purist implementation of that approach. The first one is that of forcing a disunification by script. What constitutes a script can be argued over, especially as they all seem to have evolved from (or been created based on) predecessor scripts, so there are always pairs of scripts that have a lot in common. While an "Alpha" and an "A" do have much in common, it is best to recognize that their membership in different scripts leads to important differences so that it's not a stretch to say that they no longer share the same identity. The next principled deviation is that of requiring case pairs to be unique. Bicameral scripts, (and some of the characters in them), acquired their lowercase at different times, so that the relation between the upper cases and the lower cases are different across scripts, and gives rise to some exceptional cases inside certain scripts. This is one of the reasons to disunify certain bicameral scripts. But even inside scripts, there are case pairs that may share lowercase forms or may share uppercase forms, but said forms are disunified to make the pairs separate. The two first principles match users expectations in that case changes (largely) work as expected in plain text and that sorting also (largely) matches user expectation by default. The third principle is to disunify characters based on line-breaking or line-layout properties. Implicit in that is the idea that plain text, and not markup, is the place to influence basic algorithms such as line-breaking and bidi layout (hence two sets of Arabic-Indic digits). 
One can argue with that decision, but the fact is, there are too many places where text exists without the ability to apply markup to go entirely without that support. The fourth principle is that of differential variability of appearance. For letters proper, their identity can be associated with a wide range of appearances from sparse to fanciful glyphs. If an entire piece of text (or even a given word) is set using a particular font style, context will enable the reader to identify the underlying letter, even if the shape is almost unrelated to the "archetypical shape" documented in the Standard. When letters or marks get re-used in notational systems, though, the permissible range of variability changes dramatically - variations that do not change the meaning of a word in styled text suddenly change the meaning of text in a certain notational system. Hence the disunification of certain letters or marks (but not all of them) in support of mathematical notation. The fifth principle appears to be to disunify only as far as and only when necessary. The biggest downside of this principle is that it leads to "late" disunifications; some characters get disunified as the committee becomes aware of some issue, leading to the problem of legacy data. But it has usefully somewhat limited the further proliferation of characters of identical appearance. The final principle is compatibility. This covers being able to round-trip from certain legacy encodings. This principle may force some disunifications that otherwise might not have happened, but it also isn't a panacea: there are legacy encodings that are mutually incompatible, so that one needs to make a choice which one to support. TeX, being a "glyph based" system, loses out here in comparison to legacy plain-text character encoding systems such as the 8859 series of ISO/IEC standards. Some unifications among punctuation marks in particular seem to have been made on a more ad-hoc basis. This issue is exacerbated by the fact that many such systems lack both the wide familiarity of standard writing systems (with their tolerance for glyph variation) and the rigor of something like mathematical notation. This leads to the pragmatic choice of letting users select either "shape" or "concept" rather than "identity"; generally, such ad-hoc solutions should be resisted -- they are certainly not to be seen as a precedent for "encoding concepts" generally. But such exceptions prove the rule, which leads back to where we started: the default position is that Unicode encodes a character identity that is not the same as encoding the concept that said character is used to represent in writing. A./ From unicode at unicode.org Mon May 28 23:44:11 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 28 May 2018 21:44:11 -0700 Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic? In-Reply-To: <6fbc337a-32a1-5b92-fc1b-6a9dc52f42db@it.aoyama.ac.jp> References: <6fbc337a-32a1-5b92-fc1b-6a9dc52f42db@it.aoyama.ac.jp> Message-ID: One of the general principles is that combining marks inherit the property of their base character. Normally, "inherited" should be the only property value for combining marks. There have been some deviations from this over the years, for various reasons, and there are some properties (such as general category) where it is necessary to recognize the character as combining, but the general principle still holds. 
Therefore, if you are trying to see whether a string is alphabetic, combining marks should be "transparent" to such an algorithm. A./ On 5/28/2018 9:23 PM, Martin J. D?rst via Unicode wrote: > Hello Sundar, > > On 2018/05/28 04:27, SundaraRaman R via Unicode wrote: >> Hi, >> >> In languages like Ruby or Java >> (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)), >> >> functions to check if a character is alphabetic do that by looking for >> the 'Alphabetic'? property (defined true if it's in one of the L >> categories, or Nl, or has 'Other_Alphabetic' property). When parsing >> Tamil text, this works out well for independent vowels and consonants >> (which are in Lo), and for most dependent signs (which are in Mc or Mn >> but have the 'Other_Alphabetic' property), but the very common pulli >> (VIRAMA) >> is neither in Lo nor has 'Other_Alphabetic', and so leads to >> concluding any string containing it to be non-alphabetic. >> >> This doesn't make sense to me since the Virama? ???? as much of an >> alphabetic character as any of the "Dependent Vowel" characters which >> have been given the 'Other_Alphabetic' property. Is there a rationale >> behind this difference, or is it an oversight to be corrected? > > I suggest submitting an error report via > https://www.unicode.org/reporting.html. I haven't studied the issue in > detail (sorry, just no time this week), but it sounds reasonable to > give the VIRAMA the 'Other_Alphabetic' property. > > I'd recommend to mention examples other than Tamil in your report > (assuming they exist). > > BTW, what's the method you are using in Ruby? If there's a problem in > Ruby (which I don't think; it's just using Unicode data), then please > make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I > should be able to follow up on that. > > Regards,?? Martin. > From unicode at unicode.org Mon May 28 23:45:04 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Mon, 28 May 2018 21:45:04 -0700 Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic? In-Reply-To: <6fbc337a-32a1-5b92-fc1b-6a9dc52f42db@it.aoyama.ac.jp> References: <6fbc337a-32a1-5b92-fc1b-6a9dc52f42db@it.aoyama.ac.jp> Message-ID: On 5/28/2018 9:23 PM, Martin J. D?rst via Unicode wrote: > Hello Sundar, > > On 2018/05/28 04:27, SundaraRaman R via Unicode wrote: >> Hi, >> >> In languages like Ruby or Java >> (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)), >> >> functions to check if a character is alphabetic do that by looking for >> the 'Alphabetic'? property (defined true if it's in one of the L >> categories, or Nl, or has 'Other_Alphabetic' property). When parsing >> Tamil text, this works out well for independent vowels and consonants >> (which are in Lo), and for most dependent signs (which are in Mc or Mn >> but have the 'Other_Alphabetic' property), but the very common pulli >> (VIRAMA) >> is neither in Lo nor has 'Other_Alphabetic', and so leads to >> concluding any string containing it to be non-alphabetic. >> >> This doesn't make sense to me since the Virama? ???? as much of an >> alphabetic character as any of the "Dependent Vowel" characters which >> have been given the 'Other_Alphabetic' property. Is there a rationale >> behind this difference, or is it an oversight to be corrected? > > I suggest submitting an error report via > https://www.unicode.org/reporting.html. 
I haven't studied the issue in > detail (sorry, just no time this week), but it sounds reasonable to > give the VIRAMA the 'Other_Alphabetic' property. Please don't. This is not an error in the Unicode property assignments, which have been stable in scope for Alphabetic for some time now. The problem is in assuming that the Java or Ruby isAlphabetic() API, which simply reports the Unicode property value Alphabetic for a character, suffices for identifying a string as somehow "wordlike". It doesn't. The approximation you are looking for is to add Diacritic to Alphabetic. That will automatically pull in all the nuktas and viramas/killers for Brahmi-derived scripts. It also will pull in the harakat for Arabic and similar abjads, which are also not Alphabetic in the property values. And it will pull in tone marks for various writing systems. For good measure, also add Extender, which will pick up length marks and iteration marks. Please do not assume that the Alphabetic property just automatically equates to "what I would write in a word". Or that it should be adjusted to somehow make that happen. It would be highly advisable to study *all* the UCD properties in more depth, before starting to report bugs in one or another simply because using a single property doesn't produce the string classification one assumes should be correct in a particular case. Of course, to get a better approximation of what actually constitutes a "word" in a particular writing system, instead of using raw property APIs, one should be using a WordBreak iterator, preferably one tailored for the language in question. --Ken > > I'd recommend to mention examples other than Tamil in your report > (assuming they exist). > > BTW, what's the method you are using in Ruby? If there's a problem in > Ruby (which I don't think; it's just using Unicode data), then please > make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I > should be able to follow up on that. > > Regards, Martin. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 29 00:02:15 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Mon, 28 May 2018 22:02:15 -0700 Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic? In-Reply-To: References: <6fbc337a-32a1-5b92-fc1b-6a9dc52f42db@it.aoyama.ac.jp> Message-ID: On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote: > One of the general principles is that combining marks inherit the > property of their base character. > > Normally, "inherited" should be the only property value for combining > marks. > > There have been some deviations from this over the years, for various > reasons, and there are some properties (such as general category) > where it is necessary to recognize the character as combining, but the > general principle still holds. > > Therefore, if you are trying to see whether a string is alphabetic, > combining marks should be "transparent" to such an algorithm. Generally, good advice. But there are clear exceptions. For example, the enclosing combining marks for symbols are intended (basically) to make symbols of a sort. And many combining marks have explicit script assignments, so they cannot simply willy-nilly inherit the script of a base letter if they are misapplied, for example. This is why I recommend simply adding the Diacritic property into the mix for testing a string. That is a closer approximation to the kind of naive "Is this string alphabetic?" 
question that SunaraRaman was asking about -- it picks up the correct subset of combining marks to union with the set of actual isAlphabetic characters, to produce more expected results. (Including, of course, the correct classification of all the viramas, stackers, and killers, as well as picking up all the nuktas.). Folks, please examine the set of character for Diacritic and for Extender in: http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt to see what I'm talking about. The stuff you are looking for is already there. --Ken P.S. And please don't start an argument about the fact that a "virama" isn't really a "diacritic". We know that, too. ;-) From unicode at unicode.org Tue May 29 00:30:22 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 28 May 2018 22:30:22 -0700 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols In-Reply-To: References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> <20180528141023.24d2231e@JRWUBU2> Message-ID: <07f69661-ef7e-3f9c-6d5d-7eeae429c56e@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue May 29 02:49:52 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 29 May 2018 08:49:52 +0100 Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic? In-Reply-To: References: <6fbc337a-32a1-5b92-fc1b-6a9dc52f42db@it.aoyama.ac.jp> Message-ID: <20180529084952.7856023d@JRWUBU2> On Mon, 28 May 2018 22:02:15 -0700 Ken Whistler via Unicode wrote: > On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote: > > One of the general principles is that combining marks inherit the > > property of their base character. > > > > Normally, "inherited" should be the only property value for > > combining marks. > > > > There have been some deviations from this over the years, for > > various reasons, and there are some properties (such as general > > category) where it is necessary to recognize the character as > > combining, but the general principle still holds. > > > > Therefore, if you are trying to see whether a string is alphabetic, > > combining marks should be "transparent" to such an algorithm. > > Generally, good advice. But there are clear exceptions. For example, > the enclosing combining marks for symbols are intended (basically) to > make symbols of a sort. And many combining marks have explicit script > assigments, so they cannot simply willy-nilly inherit the script of a > base letter if they are misapplied, for example. How would one know that they are misapplied? And what if the author of the text has broken your rules? Are such texts never to be transcribed to pukka Unicode? > This is why I recommend simply adding the Diacritic property into the > mix for testing a string. That is a closer approximation to the kind > of naive "Is this string alphabetic?" question that SunaraRaman was > asking about -- it picks up the correct subset of combining marks to > union with the set of actual isAlphabetic characters, to produce more > expected results. (Including, of course, the correct classification > of all the viramas, stackers, and killers, as well as picking up all > the nuktas.). > > Folks, please examine the set of character for Diacritic and for > Extender in: > > http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt > > to see what I'm talking about. 
> The stuff you are looking for is already there. Even without knowing exactly what is wanted, it looks to me as though it isn't. If he wants to allow as a substring, which he should, then that fails because there is no overlap between \p{extender} and \p{gc=Cf} or between \p{diacritic} and \p{gc=Cf}. U+034F COMBINING GRAPHEME JOINER is also missing, apparently deliberately in the case of 'diacritic'. If one uses the definition of words in the word break algorithm, one will end up accepting combinations of letter plus enclosing circle or keycap. (A fix to the word break algorithm for that would be unpleasant.) One hopes that the requirement doesn't include accepting all single words. Every properly spelt word containing U+0E46 THAI CHARACTER MAIYAMOK will be rejected, as it will contain a space before the U+0E46. (I assume there are such words; certainly there are dictionary entries with no corresponding entries without U+0E46, such as "???? ?".) At a lesser level, even English has a very few words with spaces in them, and there is no solution but to list them. Richard. From unicode at unicode.org Tue May 29 03:08:48 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 29 May 2018 09:08:48 +0100 Subject: Unicode characters unification In-Reply-To: <8e0a34b8-c074-a152-42d0-bc55b9a132ff@ix.netcom.com> References: <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> <20180528141023.24d2231e@JRWUBU2> <20180528160037.6b3689e0@JRWUBU2> <20180528181852.7ce84e52@JRWUBU2> <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com> <20180528200139.744ee706@JRWUBU2> <8e0a34b8-c074-a152-42d0-bc55b9a132ff@ix.netcom.com> Message-ID: <20180529090848.5ffae27a@JRWUBU2> On Mon, 28 May 2018 21:40:49 -0700 Asmus Freytag via Unicode wrote: > But such exceptions prove the rule, which leads back to where we > started: the default position is that Unicode encodes a character > identity that is not the same as encoding the concept that said > character is used to represent in writing. And the problem remains that of determining the 'identity'. It is rather like distinguishing species - biologists have dozens of different concepts. Richard. From unicode at unicode.org Tue May 29 03:15:42 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 29 May 2018 10:15:42 +0200 Subject: =?utf-8?Q?Re=3A_Uppercase_=C3=9F?= In-Reply-To: <07f69661-ef7e-3f9c-6d5d-7eeae429c56e@ix.netcom.com> References: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net> <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com> <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com> <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com> <20180528141023.24d2231e@JRWUBU2> <07f69661-ef7e-3f9c-6d5d-7eeae429c56e@ix.netcom.com> Message-ID: <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com> > On 29 May 2018, at 07:30, Asmus Freytag via Unicode wrote: > > On 5/28/2018 6:30 AM, Hans Åberg via Unicode wrote: >>> Unifying these would make a real mess of lower casing! >>> >> German has a special sign ß for "ss", without upper capital version. >> >> > You may want to retract the second part of that sentence. > > An uppercase exists and it has formally been ruled as acceptable way to write this letter (mostly an issue for ALL CAPS as ß does not occur in word-initial position). > A./ Duden used one in 1957, but stated in 1984 that there is no uppercase version [1]. So a reference to an official version would be interesting. 1. https://en.wikipedia.org/wiki/ß 
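Going back to Ken Whistler's suggestion earlier in this digest of testing the union of Alphabetic, Diacritic and Extender rather than Alphabetic alone: a minimal sketch of what that could look like, assuming ICU4J is available (UCharacter.hasBinaryProperty with UProperty constants); the class name and sample word are mine, chosen for illustration:

    import com.ibm.icu.lang.UCharacter;
    import com.ibm.icu.lang.UProperty;

    public class AlphabeticPlusDiacritic {

        // The union Ken suggests as a better approximation of "word-like".
        static boolean wordLike(int cp) {
            return UCharacter.hasBinaryProperty(cp, UProperty.ALPHABETIC)
                    || UCharacter.hasBinaryProperty(cp, UProperty.DIACRITIC)
                    || UCharacter.hasBinaryProperty(cp, UProperty.EXTENDER);
        }

        public static void main(String[] args) {
            // Tamil KA + PULLI + KA + VOWEL SIGN AA again: U+0BCD fails the plain
            // Alphabetic test but passes here, since viramas carry Diacritic.
            String word = "\u0B95\u0BCD\u0B95\u0BBE";
            System.out.println(word.codePoints()
                    .allMatch(AlphabeticPlusDiacritic::wordLike));
        }
    }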
From unicode at unicode.org Tue May 29 03:54:27 2018
From: unicode at unicode.org (Martin J. Dürst via Unicode)
Date: Tue, 29 May 2018 17:54:27 +0900
Subject: Re: Uppercase ß

On 2018/05/29 17:15, Hans Åberg via Unicode wrote:
>
>> On 29 May 2018, at 07:30, Asmus Freytag via Unicode wrote:
>> An uppercase exists and it has formally been ruled an acceptable way to write this letter (mostly an issue for ALL CAPS, as ß does not occur in word-initial position).
>> A./
>
> Duden used one in 1957, but stated in 1984 that there is no uppercase version [1]. So it would be interesting to see a reference to an official version.
>
> 1. https://en.wikipedia.org/wiki/ß

The English Wikipedia may not be fully up to date. See https://de.wikipedia.org/wiki/Großes_ß (second paragraph):

"Seit dem 29. Juni 2017 ist das ẞ Bestandteil der amtlichen deutschen Rechtschreibung.[2][3]"

Translated to English: "Since June 29, 2017, the ẞ is part of the official German orthography."

(As far as I remember the discussion (on this list?) last year, the ẞ (uppercase ß) is allowed, but not required.)

Regards, Martin.

From unicode at unicode.org Tue May 29 04:04:17 2018
From: unicode at unicode.org (Hans Åberg via Unicode)
Date: Tue, 29 May 2018 11:04:17 +0200
Subject: Re: Uppercase ß
Message-ID: <2BD47BB7-F4B1-4A87-871E-499101C19AE6@telia.com>

> On 29 May 2018, at 10:54, Martin J. Dürst wrote:
>
> On 2018/05/29 17:15, Hans Åberg via Unicode wrote:
>>> On 29 May 2018, at 07:30, Asmus Freytag via Unicode wrote:
>>> An uppercase exists and it has formally been ruled an acceptable way to write this letter (mostly an issue for ALL CAPS, as ß does not occur in word-initial position).
>>> A./
>> Duden used one in 1957, but stated in 1984 that there is no uppercase version [1]. So it would be interesting to see a reference to an official version.
>> 1. https://en.wikipedia.org/wiki/ß
>
> The English Wikipedia may not be fully up to date. See https://de.wikipedia.org/wiki/Großes_ß (second paragraph):
>
> "Seit dem 29. Juni 2017 ist das ẞ Bestandteil der amtlichen deutschen Rechtschreibung.[2][3]"
>
> Translated to English: "Since June 29, 2017, the ẞ is part of the official German orthography."
>
> (As far as I remember the discussion (on this list?) last year, the ẞ (uppercase ß) is allowed, but not required.)

And it is already in Unicode as ẞ LATIN CAPITAL LETTER SHARP S U+1E9E. When looking for the lowercase ß LATIN SMALL LETTER SHARP S U+00DF in a MacOS Character Viewer, it does not give the uppercase version, for some reason.
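[Aside: a minimal sketch, assuming ICU4J is available, of the casing behaviour discussed here; the class name and sample words are invented for illustration, and this is not part of the original message.]

import com.ibm.icu.lang.UCharacter;

public class SharpSCasing {
    public static void main(String[] args) {
        // The default full uppercase mapping of U+00DF LATIN SMALL LETTER SHARP S is "SS".
        System.out.println(UCharacter.toUpperCase("straße"));   // STRASSE
        // U+1E9E LATIN CAPITAL LETTER SHARP S exists and lowercases back to U+00DF.
        System.out.println(UCharacter.toLowerCase("STRAẞE"));   // straße
        // Full case folding maps both ß and ẞ to "ss", so these compare equal case-insensitively.
        System.out.println(UCharacter.foldCase("Maße", true)
                .equals(UCharacter.foldCase("MASSE", true)));    // true
    }
}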
The equivalence with "ss" shows up ICU Regular Expressions that do case insensitive matching where the cases have different length, so it should do that for the new character to, I gather. http://userguide.icu-project.org/strings/regexp From unicode at unicode.org Tue May 29 04:17:53 2018 From: unicode at unicode.org (Werner LEMBERG via Unicode) Date: Tue, 29 May 2018 11:17:53 +0200 (CEST) Subject: Uppercase =?utf-8?B?w58=?= In-Reply-To: <2BD47BB7-F4B1-4A87-871E-499101C19AE6@telia.com> References: <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com> <2BD47BB7-F4B1-4A87-871E-499101C19AE6@telia.com> Message-ID: <20180529.111753.1271341346199266089.wl@gnu.org> > When looking for the lowercase ? LATIN SMALL LETTER SHARP S U+00DF > in a MacOS Character Viewer, it does not give the uppercase version, > for some reason. Yes, and it will stay so, AFAIK. The uppercase variant of `?' is `SS'. `?' is to be used mainly for names that contain `?', and which must be printed uppercase, for example in passports. Here the distinction is important, cf. Strau? vs. Strauss ? STRAU? vs. STRAUSS Since uppercasing is not common in typesetting German text (in particular headers), the need to make a distinction between words like `Masse' (mass) and `Ma?e' (dimensions) if written uppercase is rarely necessary because it can usually deduced by context. Werner From unicode at unicode.org Tue May 29 05:39:41 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 29 May 2018 12:39:41 +0200 Subject: =?utf-8?Q?Re=3A_Uppercase_=C3=9F?= In-Reply-To: <20180529.111753.1271341346199266089.wl@gnu.org> References: <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com> <2BD47BB7-F4B1-4A87-871E-499101C19AE6@telia.com> <20180529.111753.1271341346199266089.wl@gnu.org> Message-ID: <19F85948-56B4-4B7A-B2CF-7E47373B8F95@telia.com> > On 29 May 2018, at 11:17, Werner LEMBERG wrote: > >> When looking for the lowercase ? LATIN SMALL LETTER SHARP S U+00DF >> in a MacOS Character Viewer, it does not give the uppercase version, >> for some reason. > > Yes, and it will stay so, AFAIK. The uppercase variant of `?' is > `SS'. `?' is to be used mainly for names that contain `?', and which > must be printed uppercase, for example in passports. Here the > distinction is important, cf. > > Strau? vs. Strauss ? STRAU? vs. STRAUSS > > Since uppercasing is not common in typesetting German text (in > particular headers), the need to make a distinction between words like > `Masse' (mass) and `Ma?e' (dimensions) if written uppercase is rarely > necessary because it can usually deduced by context. If uppercasing is not common, one would think that setting it too ? would pose no problems, no that it is available. From unicode at unicode.org Tue May 29 07:20:26 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 29 May 2018 14:20:26 +0200 Subject: =?utf-8?Q?Re=3A_Uppercase_=C3=9F?= In-Reply-To: <20180529105516.GA4094094@phare.normalesup.org> References: <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com> <2BD47BB7-F4B1-4A87-871E-499101C19AE6@telia.com> <20180529.111753.1271341346199266089.wl@gnu.org> <19F85948-56B4-4B7A-B2CF-7E47373B8F95@telia.com> <20180529105516.GA4094094@phare.normalesup.org> Message-ID: <32E1D042-96B0-40BA-B67D-29D5FE84B657@telia.com> > On 29 May 2018, at 12:55, Arthur Reutenauer wrote: > >> If uppercasing is not common, one would think that setting it too ? would pose no problems, no that it is available. > > It would, for reasons of stability. The main point is what users of ? 
and ẞ would think, and for Unicode to adjust accordingly.

From unicode at unicode.org Tue May 29 07:57:57 2018
From: unicode at unicode.org (Hans Åberg via Unicode)
Date: Tue, 29 May 2018 14:57:57 +0200
Subject: Re: Uppercase ß
Message-ID: <974E56E1-9C39-40AB-8DA4-B0012672B288@telia.com>

> On 29 May 2018, at 14:47, Arthur Reutenauer wrote:
>
>> The main point is what users of ß and ẞ would think, and for Unicode to adjust accordingly.
>
> Since users of ß would think that in the vast majority of cases, it ought to be uppercased to SS, I think you're missing the main point.

No, you missed the point.

From unicode at unicode.org Tue May 29 09:27:21 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Tue, 29 May 2018 07:27:21 -0700
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
Message-ID: <6e7d0393-7a99-55c9-7a73-5b3b2fe52e1c@att.net>

On 5/29/2018 12:49 AM, Richard Wordingham via Unicode wrote:
> How would one know that they are misapplied? And what if the author of the text has broken your rules? Are such texts never to be transcribed to pukka Unicode?

Applying Tamil -ii (0BC0, Script=Tamil) to the Latin letter a (0061, Script=Latin) doesn't automatically make the Tamil vowel "inherit" the Latin script property value, nor should it.

That said, if someone decides they want that sequence, and their text has "broken my rules", so be it. I'm just not going to assume anything particular about that text. Note that in terms of trying to determine whether such a string is (naively) alphabetic, such a sequence doesn't interfere with the determination. On the other hand, a process concerned about text runs, script assignment, validity for domains, or other such issues *will* be sensitive to such a boundary -- and should not be overruled by some generic determination that combining marks inherit all the properties of their base.

> Even without knowing exactly what is wanted, it looks to me as though it isn't. If he wants to allow as a substring, which he should, then that fails because there is no overlap between p{extender} and p{gc=Cf} or between p{diacritic} and p{gc=Cf}.

Yes, so if you are working with strings for Indic scripts (or for that matter, Arabic), you add Join_Control to the mix:

  Alphabetic ∪ Diacritic ∪ Extender ∪ Join_Control

gets you a decent approximation of what is (naively) expected to fall within an "alphabetic" string for most scripts.

For those following along, Alphabetic is roughly meant to cover the ABC, ?????, ... plus ideographic elements of most scripts. Diacritic picks up most of the applied combining marks, including nuktas, viramas, and tone marks. Extender picks up spacing elements that indicate length, reduplication, iteration, etc. And joiners are, well, joiners.
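[Aside: one possible rendering of the property union described above, sketched with ICU4J's binary-property API; the method name and the sample strings are invented for illustration, and this is not part of the original message.]

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UProperty;

public class NaiveAlphabeticCheck {
    // True if every code point is Alphabetic, Diacritic, Extender or Join_Control.
    static boolean isNaivelyAlphabetic(String s) {
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            boolean ok = UCharacter.hasBinaryProperty(cp, UProperty.ALPHABETIC)
                      || UCharacter.hasBinaryProperty(cp, UProperty.DIACRITIC)
                      || UCharacter.hasBinaryProperty(cp, UProperty.EXTENDER)
                      || UCharacter.hasBinaryProperty(cp, UProperty.JOIN_CONTROL);
            if (!ok) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isNaivelyAlphabetic("நன்றி"));   // Tamil word containing a pulli: true
        System.out.println(isNaivelyAlphabetic("word!"));   // punctuation fails the test: false
    }
}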
If one wants finer categorization specifically for Indic scripts, then I would suggest turning to the Indic_Syllabic_Category property instead of a union of PropList.txt properties and/or some twiddling with General_Category values.

--Ken

From unicode at unicode.org Tue May 29 05:55:16 2018
From: unicode at unicode.org (Arthur Reutenauer via Unicode)
Date: Tue, 29 May 2018 12:55:16 +0200
Subject: Uppercase ß
Message-ID: <20180529105516.GA4094094@phare.normalesup.org>

> If uppercasing is not common, one would think that setting it to ẞ would pose no problems, now that it is available.

It would, for reasons of stability.

Arthur

From unicode at unicode.org Tue May 29 07:47:59 2018
From: unicode at unicode.org (Arthur Reutenauer via Unicode)
Date: Tue, 29 May 2018 14:47:59 +0200
Subject: Uppercase ß
Message-ID: <20180529124759.GA4141011@phare.normalesup.org>

> The main point is what users of ß and ẞ would think, and for Unicode to adjust accordingly.

Since users of ß would think that in the vast majority of cases, it ought to be uppercased to SS, I think you're missing the main point.

Arthur

From unicode at unicode.org Tue May 29 12:27:17 2018
From: unicode at unicode.org (Asmus Freytag (c) via Unicode)
Date: Tue, 29 May 2018 10:27:17 -0700
Subject: Unicode characters unification

On 5/29/2018 1:08 AM, Richard Wordingham wrote:
> On Mon, 28 May 2018 21:40:49 -0700 Asmus Freytag via Unicode wrote:
>
>> But such exceptions prove the rule, which leads back to where we started: the default position is that Unicode encodes a character identity that is not the same as encoding the concept that said character is used to represent in writing.
> And the problem remains that of determining the 'identity'. It is rather like distinguishing species - biologists have dozens of different concepts.
>
> Richard.

Totally. Never said that encoding is a simple algorithmic process. :)

A./

From unicode at unicode.org Tue May 29 12:33:35 2018
From: unicode at unicode.org (Otto Stolz via Unicode)
Date: Tue, 29 May 2018 19:33:35 +0200
Subject: Re: Uppercase ß
Message-ID: <72ae3932-5299-d55f-3d45-d84c542274d1@uni-konstanz.de>

Hello,

on 2018-05-29 at 10:15, Hans Åberg wrote:
> Duden used one in 1957, but stated in 1984 that there is no uppercase version [1].

There used to be two different orthographic dictionaries, both called "Duden":

- The Duden from Leipzig (DDR) had a capital "ẞ" on the cover page of its 1957 edition.
- The Duden from Mannheim (FRG) has never featured a capital "ẞ", IIRC.

> So it would be interesting to see a reference to an official version.

Neither Duden has been anything like an "official version" -- never ever. Until 1996, the only official German orthography was somewhat loosely defined by a common decision of the Ministers of Education of the FRG, with an additional remark saying: "In case of doubt, the spelling of the latest edition of the Duden (i. e. the Mannheim version) will take effect."

Nowadays, the official version of the orthographic rules can be found in: ; the uppercase-ẞ rule, particularly, is discussed in , under § 25 (E3); the latest version of the rule reads thusly:

> E3: Bei Schreibung mit Großbuchstaben schreibt man SS. Daneben ist auch die Verwendung des Großbuchstabens ẞ möglich.

which means: When writing in all caps, you write SS. Alternatively, the capital ẞ may be used.

So, the normal upper-case equivalent of German sharp S is still the double S. The recently introduced capital sharp S is an optional alternative, but not the normal way of uppercasing the sharp S.

Best wishes,
Otto Stolz

From unicode at unicode.org Tue May 29 12:42:40 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 29 May 2018 18:42:40 +0100
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
Message-ID: <20180529184240.2526ec4e@JRWUBU2>

On Mon, 28 May 2018 16:13:43 -0600 Doug Ewell via Unicode wrote:

> Richard Wordingham wrote:
>
> > The effects of virama that spring to mind are:
> >
> > (a) Causing one or both letters on either side to change or combine to indicate combination;
> >
> > (b) Appearing as a mark only if it does not affect one of the letters on either side;
> >
> > (c) Causing a left matra to appear on the left of the sequence of consonants joined by a sequence of non-visible viramas.
>
> Most of these don't apply to Tamil, of course.

They all apply to க்ஷே TAMIL SYLLABLE KSSEE. There are four other named syllables where they all apply.
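[Aside: an illustrative sketch, assuming ICU4J, and assuming the named sequence TAMIL SYLLABLE KSSEE is <U+0B95, U+0BCD, U+0BB7, U+0BC7>; it only shows how the pulli in such a syllable fails a per-character Alphabetic test while being picked up by Diacritic. Not part of the original message.]

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UProperty;

public class PulliProperties {
    public static void main(String[] args) {
        int pulli = 0x0BCD; // U+0BCD TAMIL SIGN VIRAMA (pulli)
        System.out.println(UCharacter.hasBinaryProperty(pulli, UProperty.ALPHABETIC)); // false
        System.out.println(UCharacter.hasBinaryProperty(pulli, UProperty.DIACRITIC));  // true
        // Assumed composition of TAMIL SYLLABLE KSSEE: KA + pulli + SSA + vowel sign EE.
        String kssee = "\u0B95\u0BCD\u0BB7\u0BC7";
        System.out.println(kssee.codePointCount(0, kssee.length())); // 4 code points, one syllable
    }
}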
Richard

From unicode at unicode.org Tue May 29 14:15:09 2018
From: unicode at unicode.org (Werner LEMBERG via Unicode)
Date: Tue, 29 May 2018 21:15:09 +0200 (CEST)
Subject: Uppercase ß
Message-ID: <20180529.211509.316853494165711301.wl@gnu.org>

> Overlooked in this discussion is the fact that the revised orthography of 1996 introduces for the first time a systematic difference in pronunciation for the vowel preceding SS vs. ß (short vs. long). As users of the old orthography age out, I would not be surprised if the SS fallback were to become less acceptable over time because it would be at odds with how the word is to be pronounced. I'm also confidently expecting the use of ALL CAPS to become (somewhat) more prevalent under the continued influence of English usage.

It's not that simple.

* `ß' is never used in Switzerland; it's always `ss' (and `SS'). Even ambiguous cases like `Masse' are always written like that. This means that for Swiss users `ẞ' is even more alien than for most German and Austrian users. In particular, there doesn't exist a `unity SS' in Swiss German at all! For example, the word `Maße', if capitalized to `MASSE', is hyphenated as `MA-SSE' in Germany and Austria (since `SS' is treated in this case as a unity). However, the word is hyphenated as `MAS-SE' in Switzerland, since `ss', as a replacement for `ß', is *not* treated as a unity.

* There are dialectic differences between northern and southern Germany (and Austria). Example: `Geschoß' vs. `Geschoss', which mean exactly the same thing -- and both orthographies are allowed. For such cases, `GESCHOSS' is a much better uppercase version since it covers both dialectic forms.

I very much dislike the approach that, just for the sake of `simplistic standardization for uppercase', the use of `ẞ' should be enforced in German. It's not the job of a language to fit computer usage. It's rather the job of computers to fit language usage.

Werner

From unicode at unicode.org Tue May 29 15:13:56 2018
From: unicode at unicode.org (Asmus Freytag (c) via Unicode)
Date: Tue, 29 May 2018 13:13:56 -0700
Subject: Re: Uppercase ß

On 5/29/2018 12:15 PM, Werner LEMBERG wrote:
>> Overlooked in this discussion is the fact that the revised orthography of 1996 introduces for the first time a systematic difference in pronunciation for the vowel preceding SS vs. ß (short vs. long). As users of the old orthography age out, I would not be surprised if the SS fallback were to become less acceptable over time because it would be at odds with how the word is to be pronounced. I'm also confidently expecting the use of ALL CAPS to become (somewhat) more prevalent under the continued influence of English usage.
>
> It's not that simple.
>
> * `ß' is never used in Switzerland; it's always `ss' (and `SS'). Even ambiguous cases like `Masse' are always written like that. This means that for Swiss users `ẞ' is even more alien than for most German and Austrian users. In particular, there doesn't exist a `unity SS' in Swiss German at all! For example, the word `Maße', if capitalized to `MASSE', is hyphenated as `MA-SSE' in Germany and Austria (since `SS' is treated in this case as a unity). However, the word is hyphenated as `MAS-SE' in Switzerland, since `ss', as a replacement for `ß', is *not* treated as a unity.

So the Swiss don't have that issue. What do they do for names?

> * There are dialectic differences between northern and southern Germany (and Austria). Example: `Geschoß' vs. `Geschoss', which mean exactly the same thing -- and both orthographies are allowed. For such cases, `GESCHOSS' is a much better uppercase version since it covers both dialectic forms.

I don't see the claimed benefit; if you allow two different spellings in lowercase to track the phonetic difference, then that would rather seem to support my argument that there is now a tension in the orthography (for standard German) that may well resolve itself by greater use of the distinct uppercase form.

Users who will end up "resolving" this would be those who grew up only with the revised orthography. Older users are used to a different principle of selecting between SS and ß, and that isn't tied to the pronunciation of the preceding vowel.

> I very much dislike the approach that, just for the sake of `simplistic standardization for uppercase', the use of `ẞ' should be enforced in German. It's not the job of a language to fit computer usage. It's rather the job of computers to fit language usage.

Hmm, don't see anyone calling for that in this discussion.

A./

> Werner

From unicode at unicode.org Tue May 29 15:43:52 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 29 May 2018 21:43:52 +0100
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
Message-ID: <20180529214352.04906154@JRWUBU2>

On Tue, 29 May 2018 07:27:21 -0700 Ken Whistler via Unicode wrote:

> On 5/29/2018 12:49 AM, Richard Wordingham via Unicode wrote:
> > How would one know that they are misapplied? And what if the author of the text has broken your rules? Are such texts never to be transcribed to pukka Unicode?
>
> Applying Tamil -ii (0BC0, Script=Tamil) to the Latin letter a (0061, Script=Latin) doesn't automatically make the Tamil vowel "inherit" the Latin script property value, nor should it.

It's the sort of process that gave us U+0310 COMBINING CANDRABINDU. However, I see adding SE Asian dependent vowels to Latin letter x (U+0078, Script=Latin) as rather tending to make 'x' Script=Common. Others have disagreed quite vehemently. I see the view that the base character is U+00D7 MULTIPLICATION SIGN (InSC=Consonant_Placeholder) has prevailed. Serifed U+00D7 is quite common in manually typewritten material; I remember it from school. I'm not sure what script the sequence belongs to in OpenType layout. I ought to find out for the benefit of Tai Tham fonts.

> That said, if someone decides they want that sequence, and their text has "broken my rules", so be it. I'm just not going to assume anything particular about that text. Note that in terms of trying to determine whether such a string is (naively) alphabetic, such a sequence doesn't interfere with the determination.
> On the other hand, a process concerned about text runs, script assignment, validity for domains, or other such issues *will* be sensitive to such a boundary -- and should not be overruled by some generic determination that combining marks inherit all the properties of their base.

When it comes to script runs for rendering, such a rule feels oppressive; it is widely unenforced. For example, I have found that if my font treats U+0E4A THAI CHARACTER MAI TRI as a Tai Tham character, it will generally render satisfactorily on a Tai Tham character. Presumably I can now use a few examples of the same Northern Thai syllable on the same page in a published language-teaching book as evidence for adding its clone to the Tai Tham script. There should also be some examples of U+0ECA LAO TONE MAI TI on Lao Tai Tham syllables, but I haven't found any yet. See the chart at the end of "Exemple d'écriture ignorée par Unicode : l'écriture tham du Laos" http://www.laosoftware.com/download/articleTALN.pdf for an implicit claim of existence.

> > Even without knowing exactly what is wanted, it looks to me as though it isn't. If he wants to allow as a substring, which he should, then that fails because there is no overlap between p{extender} and p{gc=Cf} or between p{diacritic} and p{gc=Cf}.
>
> Yes, so if you are working with strings for Indic scripts (or for that matter, Arabic), you add Join_Control to the mix:
>
>   Alphabetic ∪ Diacritic ∪ Extender ∪ Join_Control
>
> gets you a decent approximation of what is (naively) expected to fall within an "alphabetic" string for most scripts.

but won't work for collatable Welsh 'Llan?gollen'! (There's a CGJ between the 'n' and the 'g'.) One also needs Join_Control for fraktur German and, to my mind, English 'Ca?esar'.

> For those following along, Alphabetic is roughly meant to cover the ABC, ?????, ... plus ideographic elements of most scripts. Diacritic picks up most of the applied combining marks, including nuktas, viramas, and tone marks. Extender picks up spacing elements that indicate length, reduplication, iteration, etc. And joiners are, well, joiners.

'Diacritic' mostly includes marks with secondary collation weight; those with primary weights, such as Indic dependent vowels, are mopped up in Alphabetic. (Removing diacritics is very much not the same as removing combining marks.)

> If one wants finer categorization specifically for Indic scripts, then I would suggest turning to the Indic_Syllabic_Category property instead of a union of PropList.txt properties and/or some twiddling with General_Category values.

You'd still need to add gc=L to catch things like U+0971 DEVANAGARI SIGN HIGH SPACING DOT (which starts syllables) and U+A8F4 DEVANAGARI SIGN DOUBLE CANDRABINDU VIRAMA. And you'd still miss U+0303 COMBINING TILDE and U+0331 COMBINING MACRON BELOW from Thai script Pattani Malay - I need to make another attempt to get them appropriate Indic syllabic category values.

Richard.
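[Aside: a small check, assuming ICU4J, of the coverage point made above: ZWJ and ZWNJ fall under Join_Control, while U+034F CGJ is caught by none of the four properties; the class and method names are invented for illustration and this is not part of the original message.]

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UProperty;

public class JoinerCoverage {
    // The same four-property union discussed earlier in the thread.
    static boolean inNaiveUnion(int cp) {
        return UCharacter.hasBinaryProperty(cp, UProperty.ALPHABETIC)
            || UCharacter.hasBinaryProperty(cp, UProperty.DIACRITIC)
            || UCharacter.hasBinaryProperty(cp, UProperty.EXTENDER)
            || UCharacter.hasBinaryProperty(cp, UProperty.JOIN_CONTROL);
    }

    public static void main(String[] args) {
        System.out.println(inNaiveUnion(0x200D)); // U+200D ZWJ: true (Join_Control)
        System.out.println(inNaiveUnion(0x200C)); // U+200C ZWNJ: true (Join_Control)
        System.out.println(inNaiveUnion(0x034F)); // U+034F CGJ: false (none of the four)
    }
}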
From unicode at unicode.org Tue May 29 16:03:25 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Tue, 29 May 2018 14:03:25 -0700
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
Message-ID: <20180529140325.665a7a7059d7ee80bb4d670165c8327d.e20e66a5e5.wbe@email03.godaddy.com>

Richard Wordingham wrote:

> > > The effects of virama that spring to mind are:
> > >
> > > (a) Causing one or both letters on either side to change or combine to indicate combination;
> > >
> > > (b) Appearing as a mark only if it does not affect one of the letters on either side;
> > >
> > > (c) Causing a left matra to appear on the left of the sequence of consonants joined by a sequence of non-visible viramas.
> >
> > Most of these don't apply to Tamil, of course.
>
> They all apply to க்ஷே TAMIL SYLLABLE KSSEE. There are four other named syllables where they all apply.

And several others where they do not. TUS explains that visible puḷḷi is the general rule in Tamil, and conjunct ligatures are the exception. I should have written "These mostly don't apply to Tamil, of course."

In any case, Ken has answered the real underlying question: a process that checks whether each character in a sequence is "alphabetic" is inappropriate for determining whether the sequence constitutes a word.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Tue May 29 16:46:21 2018
From: unicode at unicode.org (Werner LEMBERG via Unicode)
Date: Tue, 29 May 2018 23:46:21 +0200 (CEST)
Subject: Uppercase ß
Message-ID: <20180529.234621.2273334688029588499.wl@gnu.org>

>> * `ß' is never used in Switzerland; it's always `ss' (and `SS'). [...]
>
> So the Swiss don't have that issue. What do they do for names?

Foreign names containing `ß' are treated as-is, AFAIK. It's similar to using, say, accents in some foreign names in English.

>> For such cases, `GESCHOSS' is a much better uppercase version since it covers both dialectic forms.

... and Swiss people would use the same uppercase version...

> I don't see the claimed benefit; [...]
>
> Users who will end up "resolving" this would be those who grew up only with the revised orthography.

Indeed.

>> I very much dislike the approach that, just for the sake of `simplistic standardization for uppercase', the use of `ẞ' should be enforced in German. [...]
>
> Hmm, don't see anyone calling for that in this discussion.

Well, I hear an implicit "Great, there is now an `ẞ' character! Let's use it as the uppercase version of `ß' everywhere so that this nasty German peculiarity is finally gone." Maybe it's only me...

Werner

From unicode at unicode.org Tue May 29 16:49:06 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 29 May 2018 22:49:06 +0100
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
Message-ID: <20180529224906.400d0346@JRWUBU2>

On Tue, 29 May 2018 14:03:25 -0700 Doug Ewell via Unicode wrote:

> In any case, Ken has answered the real underlying question: a process that checks whether each character in a sequence is "alphabetic" is inappropriate for determining whether the sequence constitutes a word.
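[Aside: a hedged sketch, assuming ICU4J, of the alternative implied by the quotation above: let the word-break algorithm segment the text rather than testing characters one by one; the class name and sample string are invented for illustration, and this is not part of the original message.]

import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;

public class WordSegmentation {
    public static void main(String[] args) {
        // Word boundaries come from UAX #29 plus dictionary data (needed for Thai),
        // not from a per-character "is this alphabetic?" test.
        BreakIterator bi = BreakIterator.getWordInstance(new ULocale("th"));
        String text = "สวัสดีครับ hello, world";
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            System.out.println("[" + text.substring(start, end) + "]");
        }
    }
}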
Back in the second post of the thread, I made the point that a conformant Unicode process cannot always give a yes/no answer to the question of whether all characters in a string are alphabetic. What we seem to have established is that Unicode properties are not set up to facilitate the identification of words. Given that spell-checkers work, we have taken a wrong turn.

Perhaps we should reconsider "b⃝e⃝", which consists of two letters each inside its own enclosing circle. The spell-checker I'm using considers it a misspelt word, rather than two symbols side by side.

Richard.

From unicode at unicode.org Tue May 29 18:32:19 2018
From: unicode at unicode.org (Asmus Freytag (c) via Unicode)
Date: Tue, 29 May 2018 16:32:19 -0700
Subject: Re: Uppercase ß
Message-ID: <4f890d1c-34f2-a6c5-930d-f650e4184601@ix.netcom.com>

On 5/29/2018 2:46 PM, Werner LEMBERG wrote:
>>> I very much dislike the approach that, just for the sake of `simplistic standardization for uppercase', the use of `ẞ' should be enforced in German. [...]
>>
>> Hmm, don't see anyone calling for that in this discussion.
>
> Well, I hear an implicit "Great, there is now an `ẞ' character! Let's use it as the uppercase version of `ß' everywhere so that this nasty German peculiarity is finally gone."

The ALL-CAPS "SS" really has little to recommend it, intrinsically. It is de-facto a fall-back; one that competed with "SZ" as used in telegrams (while they still were a thing). Not being able to know how to hyphenate MASSE without knowing the meaning of the word is also not something that I consider a "benefit".

Uppercase forms for `ß' have been kicking around in fonts for a long time, as was documented around the time that the character was encoded. It is possible mainly because running text in ALL CAPS is indeed uncommon (and in the time of Fraktur was effectively not viable, because the Fraktur capitals don't lend themselves to it). (If SS had ever occurred in Title-Case, I doubt it would have survived as long, other than the "Swiss solution" of making it the only form, also in lower case.)

Saving an uppercase form for a non-initial letter was a godsend on typewriters -- adding to the factors that made the "SS" solution acceptable. But sign writers, type designers and typesetters did not find it so universally attractive - also documented exhaustively. With the changing environment (starting with influence from Anglo-Saxon use of type and not ending with the way the character is treated in relation to phonetics) I've been expecting to see usage evolving; and not necessarily driven by software engineers.

A./

From unicode at unicode.org Wed May 30 00:45:48 2018
From: unicode at unicode.org (Werner LEMBERG via Unicode)
Date: Wed, 30 May 2018 07:45:48 +0200 (CEST)
Subject: Uppercase ß
Message-ID: <20180530.074548.66833950861253245.wl@gnu.org>

> The ALL-CAPS "SS" really has little to recommend it, intrinsically. It is de-facto a fall-back; one that competed with "SZ" as used in telegrams (while they still were a thing).
Well, the status of `ß' is indeed complicated, and the radical solution used in Switzerland certainly has benefits.

> Not being able to know how to hyphenate MASSE without knowing the meaning of the word is also not something that I consider a "benefit".

I don't see much difference from the English example of `re-cord' vs. `rec-ord'. And Swiss people won't start to use `ß' just to get the right meaning...

> Uppercase forms for `ß' have been kicking around in fonts for a long time, as was documented around the time that the character was encoded.

Yes, and it was never successful. The introduction of `ẞ' into Unicode a few years ago was mainly driven by experts, not something that had big popularity before.

> With the changing environment (starting with influence from Anglo-Saxon use of type and not ending with the way the character is treated in relation to phonetics) I've been expecting to see usage evolving; and not necessarily driven by software engineers.

Yes, let's see how everything will evolve. Regardless of that, software should support the status quo as well as possible.

Werner

From unicode at unicode.org Thu May 31 11:59:30 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 31 May 2018 17:59:30 +0100
Subject: Character Boundaries - Who is to choose?
Message-ID: <20180531175930.21b0d862@JRWUBU2>

This has nothing to do with grapheme boundaries.

A few days ago, I remarked that deciding whether two character usages were of the same character was akin to deciding whether two populations were of the same species. It can also be difficult to decide where the boundary between two species lies. Is it the job of Unicode to prescribe the boundary between two characters, or should it prefer to describe the boundary that users largely follow? A good example of an unobvious boundary is U+02BC MODIFIER LETTER APOSTROPHE v. U+2019 RIGHT SINGLE QUOTATION MARK.

I am seeing a boundary issue between U+1A7A TAI THAM SIGN RA HAAM and U+1A7C TAI THAM SIGN KHUEN-LUE KARAN. Between them, they have two different functions, namely as the superscript final consonant form of RA and as a killer. My understanding of the difference was that it was based on the glyph shape. The function of final consonant would always be performed by U+1A7A, and U+1A7C would always have the function of killer. The 'HAAM' in 'RA HAAM' means 'to prohibit'. KARAN seems to be a loanword from Siamese, where it originally seems to just mean 'final letter', which is the only meaning I could find for it in Pali (as _kāranta_); nowadays, in Siamese it means 'a letter bearing the mark U+0E4C THAI CHARACTER THANTHAKHAT', which Siamese mark is known as _mai wanchakan_ when it just kills the vowel.

In older Tai Khuen (1930s), both functions are performed by the RA HAAM glyph. The glyph used is relatively large. What I have been seeing a lot of recently is Northern Thai text where the killer function is encoded as U+1A7C. This does not strike me as unreasonable; the usage expresses the view that the difference between U+1A7C, which typically has a small glyph, and the Northern Thai glyph for the killer function, which also tends to be small, is simply glyph variation. (I have no evidence of Northern Thai using superscript final RA.) The idea of encoding the two functions differently was abandoned because of the principle that combining marks are encoded on the basis of form; encoding them separately would, on the face of the evidence, have been like encoding diaeresis and the mark of umlaut separately.
Richard.