From unicode at unicode.org  Fri May 11 05:14:16 2018
From: unicode at unicode.org (Maggie Oates via Unicode)
Date: Fri, 11 May 2018 06:14:16 -0400
Subject: code of ethics and conduct
Message-ID: <CAKh=ZK1i42vaHwW6kji56gYS810qN+3N8xFPU9Mh9vYzOCNtTQ@mail.gmail.com>

Greetings,

The "Unicode Consortium Whistleblower Policy" listed on the Consortium's
policies page <https://www.unicode.org/policies/policies.html> references
the "Unicode Consortium's Code of Ethics and Conduct." I'm having trouble
finding this code, and was wondering if someone could *point me to a copy
of that code of ethics*. The closest reference I've found is a small
section in the bylaws for the board members: "Article 3. Sec 11. Standard
of Conduct."

Thanks!
---
Maggie Oates

Societal Computing, PhD student
Carnegie Mellon University
moates at cmu.edu
she/her/hers
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180511/5568a186/attachment.html>

From unicode at unicode.org  Fri May 11 12:37:27 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 11 May 2018 18:37:27 +0100
Subject: Choosing the Set of Renderable Strings
Message-ID: <20180511183727.54a341bf@JRWUBU2>

For assembling a rendering system for a script with combining marks,
is there a guide as to how to decide what strings one should exclude,
and which one should strive to support?

There will also be characters outside the script that should be
supported.  For a font, there are lists of characters for Microsoft Word
and for the Universal Scripting Engine, and it is frequently desirable
for a font to be able to display its own name.  There are also various
control and formatting characters, and punctuation characters from
outside the script.

I believe compromises are necessary.

There are issues with stacking combining marks - at one point does one
throw oneself on the mercy of the application?  Making characters small
enough to accommodate a cross-line stack of 20 within the nominal line
separation is usually not acceptable!  (There are Sanskrit manuscripts
where a stack extends across several lines.)  There are also problems
if glyphs cannot simply be stacked - it is not unknown for a
'subscript' glyph to obligatorily have a part on the baseine - preposed
'subscript' RA can required different glyphs depending on how deeply it
is stacked.

If canonical equivalence does not eliminate homographs, there is the
question of which homographs to tolerate.

I have hit this issue with Tai Tham.  The essence of the problem is
that a CVCV word with identical consonants can be abbreviated to CVV,
as in some other scripts, and dependent vowels can be written using
several vowel symbols.  All vowels have ccc=0.  Now, the accepted
proposal (i.e. the one accepted by the UTC for the ISO process) gave an
order for the vowels in such polygraphs, and most combinations
resulting from such contraction comply with this order.

The existence of such a contraction can be indicated in writing by the
(ambiguous) mark MAI SAM, and in such cases the proposed encoding of
Tai Tham text is of the form CVxV where 'x' is MAI SAM.  In such cases
I allow the constraint on vowel order to apply to each vowel
separately. This allows homographs, but I take the view that I am
rejecting homographs to facilitate searching, not to prevent spoofing.
The prevention of spoofing would use stricter rules, which would ban
some words, just as the English word "caf?" is prohibited in British
domain names.  (The doublet "cafe" refers to a lower class of
establishment in British English.)

However, the mark MAI SAM is not always used.  Now, if Tai Tham vowels
had non-zero combining marks, I would separate the vowels from the two
phonetic syllables by the general disruptor, CGJ, to facilitate
sorting.  At the very least the word should then be sorted with other
words starting with the same CV, and with preprocessing, the CGJ could
be replaced by the omitted consonant.  Now, Tai Tham vowels have
ccc=0, but I favour retaining the CGJ to mark the location of the
repeated consonant.

This CGJ also enables me to make some check as to whether the
individual phonetic syllables' vowel symbols are in the correct
order.  So:

(a) If the vowel symbols in CVV are in the permitted order, the string
is accepted.

(b) If the word is typed as CV<CGJ>V and the vowels on either side of
CGJ are in the correct order, the string is accepted.

(c) If the word is typed as CVV and the vowel symbols are not in the
permitted order, and I can detect this, I allow the implementation of
the Universal Script Engine (be it Microsoft, AAT or HarfBuzz) to insert
its dotted circles.  More precisely, I don't remove them.

Is this a reasonable approach to allowing both collation and
suppressing needless homographs?  My contribution to the rendering is
only the provision of a font.

Richard.  
 

From unicode at unicode.org  Sat May 12 09:01:44 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 12 May 2018 15:01:44 +0100
Subject: Lack of ulUnicodeRange Bit for Adlam
In-Reply-To: <CABSPNv=LUC4j6=SOv+4kpJAR1zR5_EL+isPtaRpL+shEm0UvHw@mail.gmail.com>
References: <mailman.0.1519495201.26397.unicode@unicode.org>
 <CABSPNv=LUC4j6=SOv+4kpJAR1zR5_EL+isPtaRpL+shEm0UvHw@mail.gmail.com>
Message-ID: <20180512150144.46092c59@JRWUBU2>

On Tue, 27 Feb 2018 11:45:36 -0500
Neil Patel via Unicode <unicode at unicode.org> wrote, under topic heading
"Re: Unicode Digest, Vol 50, Issue 20":

> Does the ulUnicodeRange bits get used to dictate rendering behavior or
> script recognition?
> 
> I am just wondering about whether the lack of bits to indicate an
> Adlam charset can cause other issues in applications.

(Answering in case the problem is still relevant - I had misfiled this
post.)

The lack is unlikely to cause any problems in anything recent enough to
understand the concept of "Adlam" - the bits were not added for blocks
newer than Unicode 5.1.

As Adlam was only added in Version 9.0, there is a significant risk of
rendering engines not supporting cursive joining.  On the other hand,
the Adlam characters have been right-to-left since Version 5.2.  (It
shouldn't matter whether one of the characters switched from right to
left to NSM.)

Richard.

From unicode at unicode.org  Mon May 14 01:15:10 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 13 May 2018 22:15:10 -0800
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <20180511183727.54a341bf@JRWUBU2>
References: <20180511183727.54a341bf@JRWUBU2>
Message-ID: <CABPY6Z3uBbj7LAazuBrobakHJvDyv6NYi1SAiya6EqxG3zfVSQ@mail.gmail.com>

Richard Wordingham asked,

? Is this a reasonable approach to allowing both collation
? and suppressing needless homographs?  My contribution to
? the rendering is only the provision of a font.

If anything about this approach was unreasonable, one of the experts
on this list would probably have pointed it out by now.

Trailblazers such as yourself will help to establish the guidelines you seek.

One does the best that one can in anticipating the character strings
the font will be expected to support, follows the font specs, and puts
the results out there for the public.  Then, the user community, if
any, may provide appropriate feedback to the developers so that
adjustments can be made.

Riding along with the insertion of the dotted circles by the USE
enables the actual users to see immediately that the text needs to be
modified in order to render reasonably on that system with the shaping
engine and font selected.  If users consider any such insertion
inappropriate, then it's feedback time.

? ... and it is frequently desirable for a font to be able
? to display its own name.

Does the font name have to be in a Latin-based script?


From unicode at unicode.org  Mon May 14 02:47:55 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 14 May 2018 08:47:55 +0100
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <CABPY6Z3uBbj7LAazuBrobakHJvDyv6NYi1SAiya6EqxG3zfVSQ@mail.gmail.com>
References: <20180511183727.54a341bf@JRWUBU2>
 <CABPY6Z3uBbj7LAazuBrobakHJvDyv6NYi1SAiya6EqxG3zfVSQ@mail.gmail.com>
Message-ID: <20180514084755.7da895d7@JRWUBU2>

On Sun, 13 May 2018 22:15:10 -0800
James Kass via Unicode <unicode at unicode.org> wrote:

> Richard Wordingham asked,
> 
> ? Is this a reasonable approach to allowing both collation
> ? and suppressing needless homographs?  My contribution to
> ? the rendering is only the provision of a font.
> 
> If anything about this approach was unreasonable, one of the experts
> on this list would probably have pointed it out by now.

Not necessarily; some may still be recovering from the recent UTC
meeting.  Moreover, it took many years before we were told that there
was no character to suppress word boundaries wrongly deduced by Thai
breaking algorithms.  The character we had been using, U+2060 WORD
JOINER, is apparently only for suppressing line breaks. 

> Riding along with the insertion of the dotted circles by the USE
> enables the actual users to see immediately that the text needs to be
> modified in order to render reasonably on that system with the shaping
> engine and font selected.  If users consider any such insertion
> inappropriate, then it's feedback time.

The massive failure of USE was reported within hours of USE being
announced on the Unicode forum.  So far there has only been tinkering,
and an encouragement of bad spelling.  For example, at least about 23%
of Northern Thai monosyllables can be rendered only by clear
misspelling - see the results in
http://www.wrdingham.co.uk/lanna/random_test.htm. The USE specification
brushes over this with the statement, "Note: Tai Tham support is
limited to mono-syllabic clusters", which gives the misleading
impression that mono-syllabic clusters are supported.  Basically,
support is limited to (C)+(V)* clusters with a liberal interpretation
of C and V.  Crw and Cry aren't supported either.

At the moment, one is generally better off using a Thai hack font that
uses paiyannoi to toggle between the various forms and placements of
Tai Tham characters.  That has the advantage that the text is still
intelligible when you have no font that renders it as Tai Tham.  The
main limitation of such schemes is in plain text.

> ? ... and it is frequently desirable for a font to be able
> ? to display its own name.
> 
> Does the font name have to be in a Latin-based script?

Postscript certainly gets unhappy if there isn't an ASCII name for it;
I don't know the requirements for the various PDF generators.

Richard.


From unicode at unicode.org  Mon May 14 07:12:56 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Mon, 14 May 2018 04:12:56 -0800
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
Message-ID: <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>

In response to William Overington's post, it's easier to transcode
data from a PUA scheme into Unicode than it is to enter the data from
scratch.  (The same could be said for a customized ASCII font.)  Some
users may not wish to wait even the handful of years it took for
mainstream Indic complex scripts to be rendered properly.

At this phase of Unicode's progress, however, we shouldn't encourage
the interchange of such PUA data.  Since it's simple to transcode, any
such data should be transcoded prior to interchange or permanent
storage.  Recipients lacking systems supporting proper Unicode
rendering for complex scripts such as Tai Tham could then transcode it
to the PUA scheme for display/printing purposes.

An OpenType font, a keyboard driver, and a text conversion utility
might go a long way towards supporting complex scripts for users whose
systems cannot otherwise currently support them.

A good keyboard driver should be able to remove some of the burden off
of the OpenType tables, enabling multiple
fonts covering the same script to be used without having bloated and
redundant OpenType tables, by offering some degree of control over the
actual character strings which are being stored (and presented to the
font for rendering).

(Many font developers might consider that any kind of normalization
should be handled at input rather than left up to the font.  Keyboard
developers might have a different idea, though.)

A hundred years from now, properly encoded Tai Tham text should be
legible.  But the ability to display data using temporary PUA schemes
which were set up in lieu of proper rendering support appears to fade
away over time.

From unicode at unicode.org  Mon May 14 02:55:05 2018
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Mon, 14 May 2018 08:55:05 +0100 (BST)
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
Message-ID: <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>

One possibility that might be worth consideration is to map each otherwise unmapped glyph in the font each to a distinct code point in the Private Use Area. This being as well as all of the automated glyph substitution, not instead of it.

This is not an ideal solution and may be regarded by some people as a wrong approach but if an end user of the font is trying to produce a hard copy print out or a PDF (Portable Document Format) document and is stuck because he or she cannot otherwise get the desired glyphs for the desired printable display from the font, the facility of being able to insert a desired glyph from the Private Use Area can get the desired result produced. Certainly, an end user could follow that up with feedback to the font producer so that in due course the display can become producible without needing to use a Private Use Area code point, yet having the glyphs available in the Private Use Area could sometimes be useful when a result is needed straightaway.

William Overington

Monday 14 May 2018

----Original message----
>From : unicode at unicode.org
Date : 2018/05/14 - 07:15 (GMTDT)
To : richard.wordingham at ntlworld.com
Cc : unicode at unicode.org
Subject : Re: Choosing the Set of Renderable Strings

Richard Wordingham asked,

? Is this a reasonable approach to allowing both collation
? and suppressing needless homographs?  My contribution to
? the rendering is only the provision of a font.

If anything about this approach was unreasonable, one of the experts
on this list would probably have pointed it out by now.

Trailblazers such as yourself will help to establish the guidelines you seek.

One does the best that one can in anticipating the character strings
the font will be expected to support, follows the font specs, and puts
the results out there for the public.  Then, the user community, if
any, may provide appropriate feedback to the developers so that
adjustments can be made.

Riding along with the insertion of the dotted circles by the USE
enables the actual users to see immediately that the text needs to be
modified in order to render reasonably on that system with the shaping
engine and font selected.  If users consider any such insertion
inappropriate, then it's feedback time.

? ... and it is frequently desirable for a font to be able
? to display its own name.

Does the font name have to be in a Latin-based script?


From unicode at unicode.org  Mon May 14 11:47:11 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 14 May 2018 17:47:11 +0100
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
Message-ID: <20180514174711.7d8109b6@JRWUBU2>

On Mon, 14 May 2018 08:55:05 +0100 (BST)
William_J_G Overington via Unicode <unicode at unicode.org> wrote:

> One possibility that might be worth consideration is to map each
> otherwise unmapped glyph in the font each to a distinct code point in
> the Private Use Area. This being as well as all of the automated
> glyph substitution, not instead of it.

That's what the Xishuangbanna News does for final consonants.  My
issues are generally not with producing the right image, but
rather with enabling the semantically correct sequence of
characters.  (It would be daft to impose phonetic order on the users
and then prohibit it piecemeal.)  I can overcome the USE, the question
is which battles the font is to fight.

Richard.

From unicode at unicode.org  Mon May 14 14:31:15 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 14 May 2018 20:31:15 +0100
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
 <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
Message-ID: <20180514203115.5c093920@JRWUBU2>

On Mon, 14 May 2018 04:12:56 -0800
James Kass via Unicode <unicode at unicode.org> wrote:

> In response to William Overington's post, it's easier to transcode
> data from a PUA scheme into Unicode than it is to enter the data from
> scratch.  (The same could be said for a customized ASCII font.)  Some
> users may not wish to wait even the handful of years it took for
> mainstream Indic complex scripts to be rendered properly.
> 
> At this phase of Unicode's progress, however, we shouldn't encourage
> the interchange of such PUA data.  Since it's simple to transcode, any
> such data should be transcoded prior to interchange or permanent
> storage.  

> Recipients lacking systems supporting proper Unicode
> rendering for complex scripts such as Tai Tham could then transcode it
> to the PUA scheme for display/printing purposes.

The PUA scheme would be roughly equivalent to the glyph sequence
produced by the shaper. (The ccmp feature is in general not available
for the PUA, though CSS allows its use to be forced.)  However, there
would be no extra channels, such as the component-mark association often
needed for some cursive scripts. For example, in ???? <LOW KA, SIGN U,
SAKOT, LOW YA> 'to direct', SIGN U may be realised as a mark below
left, a mark below <SAKOT, LOW YA>, or a spacing mark on the right of
<SAKOT, YA>.  One could argue that the three positions require
different glyphs for SIGN U.  Each font would need its own PUA.

> An OpenType font, a keyboard driver, and a text conversion utility
> might go a long way towards supporting complex scripts for users whose
> systems cannot otherwise currently support them.

This is where Apple had the right idea, but difficult of
implementation, and the OTL paradigm is deficient. There are several
places in Tai Tham layout where I want to swap glyphs round, but for
the layout engine to do so for me would cause grief for other Tai Tham
fonts. This rearrangement cannot be delegated to the rendering
engine.  There are Tai Tham fonts which handle Indic rearrangement in
the ccmp feature, but they are then totally defeated by either ccmp not
being enabled or by the USE doing basic Indic shaping.

There are now two approaches for Tai Tham - (1) fix USE or
restore/create a separate shaper for scripts with CVC... aksharas, and
(2) overcome the USE in the font. For the latter I need to make the
work-arounds in Da Lekh easier to copy.  I have transferred them to Ed
Trager's Hariphunchai font, yielding Lamphun, but Lamphun still needs
some further revision to the positioning logic. It wasn't as
complete as I'd hoped.  I've done a quick fix for the vowels below, but
I suspect much more work is needed to conform to the spirit of the
Hariphunchai font. I could do with someone artistic to help with the
combinations of NYA and subscript consonant such as NY.CA, and Pali
LL.HA is currently a disaster.

On Track 1, there's also more tinkering to do, such as making MEDIAL LA
and MEDIAL RA 'consonant subscript' rather than 'consonant medial'
/lw/ is an allowed onset in the Tai languages using the Tai Tham
script, so we get orthographic onset <hlw-> with MEDIAL LA in the West.
The main problem is that we do not have characters *MEDIAL WA and
*MEDIAL YA - the general subscript WA and YA are used instead, and these can
function as matres lectionis.  (In Unicode Khmer, the matres lectionis
have been reanalysed as vowels.)

I think it would also help to make SIGN AA and SIGN TALL AA into
letters as far as the USE is concerned. The default grapheme
segmentation rules already treat them as consonants. The possible
downside is that so doing might mess up some fonts.

> A good keyboard driver should be able to remove some of the burden off
> of the OpenType tables, enabling multiple
> fonts covering the same script to be used without having bloated and
> redundant OpenType tables, by offering some degree of control over the
> actual character strings which are being stored (and presented to the
> font for rendering).

It won't work.  The text input delivered by X still needs to be
supported, and without modifying the application, X can only input one
character at a time.  Not everyone uses an 'input method'.

> (Many font developers might consider that any kind of normalization
> should be handled at input rather than left up to the font.  Keyboard
> developers might have a different idea, though.)

Apparently, Hangul input should not be canonically normalised in South
Korea. I've seen an implementation of the USE render canonically
equivalent strings differently.  It wouldn't be HarfBuzz - it
normalises, as we saw when it briefly messed up Tai Tham rendering when
it swapped <tone, SAKOT> to <SAKOT, tone>.  That was rapidly fixed to normalise the other way round.

I'd completely forgotten that Thai, Lao and Tai Tham tone marks had
different combining classes.  However, in Northern Thai,
<TONE-1...TONE-2> and <TONE-2...TONE-1> seem to render the same, so
normalisation might not be relevant.  Unsurprisingly, that's the only
pair of tone-marks I've seen in the same akshara, so I don't know how
the other pairs of distinct tone marks combine.  A pair arises when two
chained syllables have different tone marks.  If they have the same
tone mark, one is suppressed.

Richard.


From unicode at unicode.org  Tue May 15 05:18:11 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Tue, 15 May 2018 02:18:11 -0800
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <20180514203115.5c093920@JRWUBU2>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
 <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
 <20180514203115.5c093920@JRWUBU2>
Message-ID: <CABPY6Z13j8Z8eD9D-qakQzvqUK4E7RUm72cw_8YVR3poipKcNA@mail.gmail.com>

Richard Wordingham replied,

?? ...Private Use Area...
?
? That's what the Xishuangbanna News does for final consonants.

I failed to find a link for their web site, but only spent about an
hour and a half searching for it.  There is a web site for
"Xishuangbanna Daily", but the pages I saw there were all in Chinese.

If Xishuangbanna News is publishing using PUA, then they probably
offer a font for download.  I was just curious to see what their web
pages looked like, and wondered how pervasive the PUA use is.  If
their site only resorts to PUA for final consonants, then a
presumption would be that the USE supports all other shaping
requirements for the script.

? My issues are generally not with producing the right image,
? but rather with enabling the semantically correct sequence
? of characters.

Because you started out with all the Tai Tham glyphs mapped to the
PUA, and are now trying to produce a working font using the standard
encoding?


From unicode at unicode.org  Tue May 15 07:19:42 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Tue, 15 May 2018 04:19:42 -0800
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <20180514203115.5c093920@JRWUBU2>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
 <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
 <20180514203115.5c093920@JRWUBU2>
Message-ID: <CABPY6Z2QLQd9dewsNuOpEpYUqSWrSKdT-fbESkE402Ux=qcdNw@mail.gmail.com>

On Mon, May 14, 2018 at 11:31 AM, Richard Wordingham via Unicode
<unicode at unicode.org> wrote:

> ...  One could argue that the three positions require
> different glyphs for SIGN U.  Each font would need its own PUA.

Or a consensus.

> ... There are several
> places in Tai Tham layout where I want to swap glyphs round, but for
> the layout engine to do so for me would cause grief for other Tai Tham
> fonts. This rearrangement cannot be delegated to the rendering
> engine.  There are Tai Tham fonts which handle Indic rearrangement in
> the ccmp feature, but they are then totally defeated by either ccmp not
> being enabled or by the USE doing basic Indic shaping.

Suppose the OpenType specs were revised to include a bit which could
be set for disabling basic Indic shaping by the USE?  I wouldn't set
it if I were just starting out to make a font for a complex script
requiring basic Indic shaping, and cannot imagine why anyone else just
starting out would.

> ...
>
> I think it would also help to make SIGN AA and SIGN TALL AA into
> letters as far as the USE is concerned. The default grapheme
> segmentation rules already treat them as consonants. The possible
> downside is that so doing might mess up some fonts.

The possibility of messing up some fonts has seldom (if ever) stopped
needed revisions to shaping engines before.  I should know.

>> A good keyboard driver ...
>
> It won't work.  The text input delivered by X still needs to be
> supported, and without modifying the application, X can only input one
> character at a time.  Not everyone uses an 'input method'.

Every keyboard uses a driver, though.  I can't speak for "X", but my
understanding is that the keyboard driver acts as sort of a buffer
between the user's key strokes and the application.

> Apparently, Hangul input should not be canonically normalised in South
> Korea. I've seen an implementation of the USE render canonically
> equivalent strings differently.  ...

Because the USE failed or because the font provided look-ups for each
of those strings to different glyphs?

Best regards,

James Kass

From unicode at unicode.org  Tue May 15 09:04:45 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Tue, 15 May 2018 06:04:45 -0800
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <CABPY6Z2QLQd9dewsNuOpEpYUqSWrSKdT-fbESkE402Ux=qcdNw@mail.gmail.com>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
 <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
 <20180514203115.5c093920@JRWUBU2>
 <CABPY6Z2QLQd9dewsNuOpEpYUqSWrSKdT-fbESkE402Ux=qcdNw@mail.gmail.com>
Message-ID: <CABPY6Z1y2ofnAniGR1tmFzOkwio=s=BLg77MBzSJroU-yeYi+g@mail.gmail.com>

Display behaviour which is script-specific should be handled by the
rendering/shaping engine.  Only that which is font-specific should be
handled by the font.

The font's OpenType tables will include pointers to presentation forms
which aren't directly encoded, the location and repertoire of which
would naturally differ from font to font.  Likewise, the font's GPOS
tables will handle things such as mark positioning, because each
font's metrics are going to be different.

Because the USE apparently accesses current on-line Unicode data, the
USE will re-order anything which needs to be moved around.

From unicode at unicode.org  Tue May 15 09:15:07 2018
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Tue, 15 May 2018 15:15:07 +0100 (BST)
Subject: Colours - both for emoji and otherwise
Message-ID: <26212264.32041.1526393707864.JavaMail.defaultUser@defaultHost>

Years ago this mailing list had some wonderful long discussions.

A similar such discussion may be interesting now on the topic of Colours - both for emoji and otherwise, as recent developments could possibly be leading towards a major change in Unicode.

A few days - including a weekend - before the recent UTC (Unicode Technical Committee) meeting there appeared in the Current UTC Document Register for 2018 the following document.

http://www.unicode.org/L2/L2018/18141-emoji-colors.pdf

I wrote some comments and sent them in as feedback. They are available as the last listed item in the Encoding Feedback for that particular UTC meeting.

http://www.unicode.org/L2/L2018/18117-pubrev.html#Encoding_Feedback

However, the original 18141-emoji-colors.pdf document has been revised twice since that feedback and the following is the present version.

http://www.unicode.org/L2/L2018/18141r2-emoji-colors.pdf

It seems to me that there are, in a Unicode context, at least two possible ways that the use of a white square next to an emoji of a brown bear could "indicate that an emoji has a different color".

One way is that the person viewing the white square next to an emoji of a brown bear 'knows' that a white bear is intended and 'understands' that that is the intended meaning - that could be useful as it is language-independent so communication through the language barrier of mention of a white bear is possible. I just wrote language-independent but I am wondering if 'knowing' that and 'understanding' that mean that the use of those characters in that way is part of an emoji-based language. I am not a linguist and maybe some people who are linguists might like to comment on that and also maybe on the whole notion of emoji characters being used to produce languages - not necessarily constructed languages but also languages that are arising and evolving naturally but at a much faster rate than natural languages evolved historically. 

Another way is that the rendering system displays an emoji of a white bear instead of the white square next to an emoji of a brown bear.

Yet would what I have just referred to as an emoji of a white bear actually be an emoji as such or would it be a "just" a picture glyph and not an emoji as such as it is not a separately encoded character?

What makes the present situation interesting though and thus worth a discussion is the following.

The new characters about colours are listed in sections 5 and 6 of the following document.

http://www.unicode.org/L2/L2018/18176-future-adds.pdf

Yet the minutes of the UTC meeting,

http://www.unicode.org/L2/L2018/18115.htm

has the following.

> Discussion. UTC took no action at this time.

Now maybe that was later overridden by later discussions yet not listed in the minutes under Emoji Colors as such, but I am wondering if that refers to whether, and if so, how, a white square next to an emoji of a brown bear could be specified within The Unicode Standard so that such a sequence were to become rendered as a glyph of a white bear.

Yet I am wondering if another set of characters, colour operators, should be defined for such an automated purpose, yet also have a displayable glyph for graceful fall-back display when automated rendering is not possible: the colour operators being encoded in plane 14;  yet also having a mode where the colour operator could be displayed as a zero-width space as an alternative graceful fall-back display.

Yet colours are being talked about in relation to emoji. What about with other characters, such as letters of the alphabet?

The encoding of colours is fascinating and may be the next big thing with Unicode, so a discussion in this mailing list as to what is possible and what is desirable could be of importance.

William Overington

Tuesday 15 May 2018


From unicode at unicode.org  Tue May 15 12:47:51 2018
From: unicode at unicode.org (Johnny Farraj via Unicode)
Date: Tue, 15 May 2018 13:47:51 -0400
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
Message-ID: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>

Dear Unicode list members,

I wish to get feedback about a new symbol submission proposal.

Currently the Miscellaneous Symbols table (2600-26FF) includes the
following characters:

266D ? MUSIC FLAT SIGN
266F ? MUSIC SHARP SIGN

while the Musical Symbols table (1D100 - 1D1FF) includes the following
characters:

1D12A ?? MUSICAL SYMBOL DOUBLE SHARP
1D12B ?? MUSICAL SYMBOL DOUBLE FLAT
1D12C ?? MUSICAL SYMBOL FLAT UP
1D12D ?? MUSICAL SYMBOL FLAT DOWN
1D130 ?? MUSICAL SYMBOL SHARP UP
1D131 ?? MUSICAL SYMBOL SHARP DOWN
1D132 ?? MUSICAL SYMBOL QUARTER TONE SHARP
1D133 ?? MUSICAL SYMBOL QUARTER TONE FLAT

None of these matches what's used in Arabic music notation.

I am proposing the addition of 2 new characters to the Musical Symbols
table:

- the half-flat sign (lowers a note by a quarter tone)
- the half-sharp sign (raises a note by a quarter tone)

[image: Inline image]
[image: Inline image]


These are the correct symbols for Arabic music notation, and they express
intervals that are multiples of quarter tones. it would be really nice to
be able to include them directly in an HTML page or rich text document
using a native code rather than an image.

I am the primary sponsor of this proposal. As far as my credentials, I am
the owner of http://maqamworld.com, the most widely used online resource on
Arabic music theory, in English.

My co-sponsor is Sami Abu Shumays, author of http://maqamlessons.com,
another important online reference for Arabic music theory.

Together, we are in the process of publishing a book on Arabic music theory
and performance with Oxford University Press, coming out late 2018.

I can also enlist the support of many academics in the music theory field
who specialize in Arabic music.

I welcome any feedback on this proposal.

thanks

Johnny Farraj
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180515/9987d22f/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: half-sharp sign.png
Type: image/png
Size: 2754 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20180515/9987d22f/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: half-flat sign.png
Type: image/png
Size: 3617 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20180515/9987d22f/attachment-0001.png>

From unicode at unicode.org  Tue May 15 15:29:49 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 15 May 2018 21:29:49 +0100
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <CABPY6Z13j8Z8eD9D-qakQzvqUK4E7RUm72cw_8YVR3poipKcNA@mail.gmail.com>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
 <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
 <20180514203115.5c093920@JRWUBU2>
 <CABPY6Z13j8Z8eD9D-qakQzvqUK4E7RUm72cw_8YVR3poipKcNA@mail.gmail.com>
Message-ID: <20180515212949.73568f11@JRWUBU2>

On Tue, 15 May 2018 02:18:11 -0800
James Kass via Unicode <unicode at unicode.org> wrote:

> Richard Wordingham replied,
> 
> ?? ...Private Use Area...
> ?
> ? That's what the Xishuangbanna News does for final consonants.
> 
> I failed to find a link for their web site, but only spent about an
> hour and a half searching for it.  There is a web site for
> "Xishuangbanna Daily", but the pages I saw there were all in Chinese.

There's a sample at New sample:
http://www.dw12.com/DigitalNewspaper/xsbnbold/content/20180325/ArticelA04001DK.htm .
I'd have added a link, but the sample page wasn't working.  The page is
currently suffering an attack of dittography (seen on both IE on windows
7 and Firefox on Ubuntu).

> If Xishuangbanna News is publishing using PUA, then they probably
> offer a font for download.  I was just curious to see what their web
> pages looked like, and wondered how pervasive the PUA use is.  If
> their site only resorts to PUA for final consonants, then a
> presumption would be that the USE supports all other shaping
> requirements for the script.
> 
> ? My issues are generally not with producing the right image,
> ? but rather with enabling the semantically correct sequence
> ? of characters.
> 
> Because you started out with all the Tai Tham glyphs mapped to the
> PUA, and are now trying to produce a working font using the standard
> encoding?

No.  The problem is a grammar Nazi of a rendering engine.  I have been
working from a set of characters, and what has happened is that some
glyphs (in the ISO sense, not in the sense used for fonts) that
looked as though they may have needed variation sequences have been
split off as formally unrelated characters - MEDIAL LA, MEDIAL RA,
SIGN SA and SIGN LOW PA OR HIGH RATHA.

What do you mean by 'standard encoding'?  It is agreed that there is a
standard coding for *characters*.  I have been using the
encoding proposal accepted by the Unicode Technical Committee as the
definition of the encoding of text; that, interpreted in the light of
the changes to the encoding for characters, is what I have been using
as the definition of the encoding of characters. A problem is
that it seems that Unicode does not specify the encoding of text.
HarfBuzz used to more or less implement the rules in the proposal, and
rendering generally worked. Then HarfBuzz switched to USE.

For example, what prompted my question was the encoding of the
words /t??n t??/ and /t?? t??n/, both meaning 'hornet'.  If the
subscript consonant representing /n/ and and the vowel /??/ form a
ligature which is ambiguous as to the order of the phonemes, or the
vowel truly falls through below the consonant, then the contracted form
is the same for both words, and will be rendered if I type it as
??????? <HIGH TA, SAKOT, NA, SIGN AE, SIGN OA BELOW, MAI KANG,
TONE-1>.  However, the logical reading of that spelling is /ta?n??
ta?n??/, which sounds like a slightly unusual intensifier.   If we
follow the principal of using phonetic order, then  /t??n t??/ will be
encoded ??????? <HIGH TA, SIGN AE, SAKOT, NA, SIGN OA BELOW, MAI KANG,
TONE-1> and /t?? t??n/ will be encoded ??????? <HIGH TA, SIGN OA BELOW,
MAI KANG, TONE-1, SIGN AE, SAKOT, NA>.  Both get a dotted circle
because of the sequence <dependent vowel, SAKOT>.  The second one gets
a dotted circle because of tone before vowel; misapplying the single
subsyllable rule from the proposal, the offence is having a tone mark
before a vowel not on the right.  Without the tone mark or MAI KANG,
the offence would be having a below matra (SIGN OA BELOW) before a left
matra (SIGN AE). When MAI KANG was a vowel, back in Unicode 9.0, a USE
implementation would detect two different offences:

(a) Having a top matra (MAI KANG) before a left matra (SIGN AE) and
(b) Following the accepted proposal for Tai Tham and having a bottom
matra (SIGN OA BELOW) before a top matra (MAI KANG).

A fastidious writer would separate the two subsyllables with MAI SAM,
which is a visible mark.  My specific question was whether, in the
absence of MAI SAM, it was in order to use CGJ to separate the two
subsyllables, so that a grammar checker would know where the boundary
between the subsyllables lay. The issue is that the TUS says that CGJ
does not affect rendering, just after an example of it affecting
rendering in Hebrew.  Now, a possible argument is that it may affect
whether rendering occurs; the insertion of a dotted circle is to be
interpreted as meaning that the renderer has refused to render the
string.

Richard.


From unicode at unicode.org  Tue May 15 16:40:11 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 15 May 2018 22:40:11 +0100
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <CABPY6Z2QLQd9dewsNuOpEpYUqSWrSKdT-fbESkE402Ux=qcdNw@mail.gmail.com>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
 <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
 <20180514203115.5c093920@JRWUBU2>
 <CABPY6Z2QLQd9dewsNuOpEpYUqSWrSKdT-fbESkE402Ux=qcdNw@mail.gmail.com>
Message-ID: <20180515224011.13c4b348@JRWUBU2>

On Tue, 15 May 2018 04:19:42 -0800
James Kass via Unicode <unicode at unicode.org> wrote:

> On Mon, May 14, 2018 at 11:31 AM, Richard Wordingham via Unicode
> <unicode at unicode.org> wrote:
> 
> > ...  One could argue that the three positions require
> > different glyphs for SIGN U.  Each font would need its own PUA.  
> 
> Or a consensus.

One would end up with a large glyph list to accommodate all designs.
Imagine applying this approach to Devanagari, with all its Sanskrit
conjuncts to be supported although some converters would only target a
small subset.

> > ... There are several
> > places in Tai Tham layout where I want to swap glyphs round, but for
> > the layout engine to do so for me would cause grief for other Tai
> > Tham fonts. This rearrangement cannot be delegated to the rendering
> > engine.  There are Tai Tham fonts which handle Indic rearrangement
> > in the ccmp feature, but they are then totally defeated by either
> > ccmp not being enabled or by the USE doing basic Indic shaping.  
> 
> Suppose the OpenType specs were revised to include a bit which could
> be set for disabling basic Indic shaping by the USE?  I wouldn't set
> it if I were just starting out to make a font for a complex script
> requiring basic Indic shaping, and cannot imagine why anyone else just
> starting out would.

One would need to set the bit while the script was not yet in Unicode,
and then you may well need to set it when the USE bites.  As another
concrete example, one couldn't use USE for the Khmer script - it too
has CVC syllables.  I believe there are also lurking problems with the
ordering of the rarer marks.

You'd come unstuck if you found your script had both preposed
subscripts and optionally preposed matras.  The USE can't handle both
in the same syllable.

One might need to ignore syllable boundaries before Indic re-ordering,
though that's probably a preference rather than a requirement.  Tai
Tham has a troublesome mark, U+1A58 TAI THAM SIGN MAI KANG LAI.  In the
West, it's 'Consonant final' and is a mark above or above right.  In
the East, it works like Burmese kinzi, and acts like a repha.  Revision
1 of the Maefahluang Dictionary of Northern Thai sits on the border.
In its text, it behaves one way in some environments, and the other
way in others.

Finally, many scripts had fonts before windows supported them.  Indeed,
isn't significant Tai Tham renderer support on Windows 7 restricted to
HarfBuzz clients?  (I don't believe M17n is significant, and I fear my
interfacing set-up only works for my fonts.) 

> >> A good keyboard driver ...  
> >
> > It won't work.  The text input delivered by X still needs to be
> > supported, and without modifying the application, X can only input
> > one character at a time.  Not everyone uses an 'input method'.  
> 
> Every keyboard uses a driver, though.  I can't speak for "X", but my
> understanding is that the keyboard driver acts as sort of a buffer
> between the user's key strokes and the application.

X attempts to present the key strokes to the application.  The
application may chose to present these key stroke to an input method to
handle, but these input methods are not reliable.  I have a battery of
three inputs methods for most applications on Ubuntu - raw X keyboard
mapping, ibus using Keyman for Linux, and fcitx using M17n.
Additionally, I find Emacs is easier to use if I talk to it in ASCII
and use its input methods for other character sets.  The advantage
there is that Emacs knows whether I am entering a command, which must
be in ASCII, or text, for which it uses the active input method.

Another issue is that normalised text can be highly inconvenient for a
font.  HarfBuzz chooses a non-standard normalisation for several
scripts simply because that makes things easier for a font. 

> > I've seen an implementation of the USE render
> > canonically equivalent strings differently.  ...  
> 
> Because the USE failed or because the font provided look-ups for each
> of those strings to different glyphs?

Remember that the USE changes the string presented to the font by
inserting dotted circles.  Essentially, <tone, SAKOT, consonant> and
<SAKOT, tone, consonant> can be penalised differently - Microsoft
inserts more dotted circles than does HarfBuzz.

Richard.

From unicode at unicode.org  Tue May 15 16:46:05 2018
From: unicode at unicode.org (Markus Scherer via Unicode)
Date: Tue, 15 May 2018 14:46:05 -0700
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
Message-ID: <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>

On Tue, May 15, 2018 at 10:47 AM, Johnny Farraj via Unicode <
unicode at unicode.org> wrote:

> Dear Unicode list members,
>
> I wish to get feedback about a new symbol submission proposal.
>

Just to clarify, this is a discussion list where you may get some useful
feedback. This is not where you would submit an actual proposal.

See https://www.unicode.org/pending/proposals.html

I am proposing the addition of 2 new characters to the Musical Symbols
> table:
>
> - the half-flat sign (lowers a note by a quarter tone)
> - the half-sharp sign (raises a note by a quarter tone)
>

In an actual proposal, I would expect a discussion of whether you are
proposing to encode established symbols, or whether you are proposing new
symbols to be adopted by the community (in which case Unicode would
probably wait & see if they get established).

A proposal should also show evidence of usage and glyph variations.

Best regards,
markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180515/200f7946/attachment.html>

From unicode at unicode.org  Tue May 15 17:48:14 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Tue, 15 May 2018 15:48:14 -0700
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
Message-ID: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>


On 5/15/2018 2:46 PM, Markus Scherer via Unicode wrote:
>
>     I am proposing the addition of 2 new characters to the Musical
>     Symbols table:
>
>     - the half-flat sign (lowers a note by a quarter tone)
>     - the half-sharp sign (raises a note by a quarter tone)
>
>
> In an actual proposal, I would expect a discussion of whether you are 
> proposing to encode established symbols, or whether you are proposing 
> new symbols to be adopted by the community (in which case Unicode 
> would probably wait & see if they get established).
>
> A proposal should also show evidence of usage and glyph variations.
>

And should probably refer to the relationship between these signs and 
the existing:

U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP
U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT

which are also half-sharp or half-flat accidentals.

The wiki on flat signs shows this flat with a crossbar, as well as a 
reversed flat symbol, to represent the half-flat.

And the wiki on sharp signs shows this sharp minus one vertical bar to 
represent the half-sharp.

So there may be some use of these signs in microtonal notation, outside 
of an Arabic context, as well. See:

https://en.wikipedia.org/wiki/Accidental_(music)#Microtonal_notation

--Ken

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180515/568b1cdb/attachment.html>

From unicode at unicode.org  Tue May 15 17:51:35 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 15 May 2018 23:51:35 +0100
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <CABPY6Z1y2ofnAniGR1tmFzOkwio=s=BLg77MBzSJroU-yeYi+g@mail.gmail.com>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
 <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
 <20180514203115.5c093920@JRWUBU2>
 <CABPY6Z2QLQd9dewsNuOpEpYUqSWrSKdT-fbESkE402Ux=qcdNw@mail.gmail.com>
 <CABPY6Z1y2ofnAniGR1tmFzOkwio=s=BLg77MBzSJroU-yeYi+g@mail.gmail.com>
Message-ID: <20180515235135.7df264c2@JRWUBU2>

On Tue, 15 May 2018 06:04:45 -0800
James Kass via Unicode <unicode at unicode.org> wrote:

> Display behaviour which is script-specific should be handled by the
> rendering/shaping engine.  Only that which is font-specific should be
> handled by the font.

That makes a lot of sense.  Unfortunately, script-specific behaviour
often needs to be fixed or is completely absent.  It annoys me that my
font has to redo the bits of basic Indic shaping that are left undone
because the USE chops the aksharas up.

> The font's OpenType tables will include pointers to presentation forms
> which aren't directly encoded, the location and repertoire of which
> would naturally differ from font to font.  Likewise, the font's GPOS
> tables will handle things such as mark positioning, because each
> font's metrics are going to be different.
> 
> Because the USE apparently accesses current on-line Unicode data, the
> USE will re-order anything which needs to be moved around.

In Thai, the sequence <consonant, tone, SARA AM> is converted to
<consonant, NIKHAHIT, tone, SARA AA>.  Please tell me where in the
on-line Unicode data it says that:

1) Tai Tham <consonant, tone, SIGN AA, MAI KANG> is reordered to
<consonant, MAI KANG-am**, tone, SIGN AA>:
(a) When the base consonant is NA; and also
(b) in a typical Northern Thai font, but not a Lao*, Tai Lue or Tai
Khuen font.

*Some claim that Lao Tham doesn't use tone marks, but some version at
least does, or Gregory Kourilsky wouldn't have included them in his
encoding of the Tham script.

**The placement may be different to that of MAI KANG in /b?? wa?/
?????? <BA, MAI KANG, TONE-1, SAKOT, WA, SIGN AA> or ?????? <BA, MAI
KANG, SAKOT, WA, TONE-1, SIGN AA> - I don't know whether the first or
the second tone mark is dropped.

(Getting the tone and MAI KANG to interact after <NA, tone, SIGN AA,
MAI KANG> has formed the NAA ligature from <NA, SIGN AA> seems
impossible.  I assume this is because such interaction is undesirable
for Arabic.) 

2) <tone, (subscript consonant or sign, or stand-in)+, top matra> needs
to be rearranged to <top matra, tone, (subscript ...)+> (or equivalent).

And how am I supposed to position MAI SAM to the right of the rightmost
of the level 1 marks above?  Is this a standard positioning as opposed
to a stylistic decision?

Incidentally, how does Unicode document the handling of a tone mark
before U+0E33 THAI CHARACTER SARA AM?

Richard. 


From unicode at unicode.org  Tue May 15 19:19:58 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 16 May 2018 01:19:58 +0100
Subject: Complete Definition of Each Supported Script
Message-ID: <20180516011958.11fb1276@JRWUBU2>

I just found this assertion in https://en.wikipedia.org/wiki/Uniscribe:

"Microsoft worked with the Unicode Technical Committee to make shaping
requirements available in a machine readable format, so a complete
definition of each supported script will be included in the Unicode
standard and updating or adding new scripts will be significantly
simplified."

It was added on 14 August 2016.

Apart from the holding of discussions and the consequences for
Uniscribe/DirectWrite, is this true or is someone adding two and two
together and making five?

1. In particular, is this anything more than reading too much into the
General_Category, Indic_syllabic_Category and
Indic_Positional_Category?  They could only work if the regular
expressions in the documentation of the Universal Script Engine were
correct (we know they aren't), and there are many shaping requirements
that font developers have to discover from other sources.

If there is more to it:

2. Are there "shaping requirements available in machine readable
format", and if so, how can one obtain them?

3. When will these "complete definition[s] of each supporting script
[...] be included in the Unicode standard"?  How will they be checked?
(There is a very good chance that many of them would be wrong.)


Richard.

From unicode at unicode.org  Tue May 15 23:32:24 2018
From: unicode at unicode.org (Garth Wallace via Unicode)
Date: Tue, 15 May 2018 21:32:24 -0700
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
Message-ID: <CA+p4_H3T++NDjy1MbhmeWCboo=4NpGbJTY6f+UHeEbO_MbUdRg@mail.gmail.com>

What happened to the previous proposal? As I recall, there was some good
discussion after an email from you back in 2015 <
http://www.unicode.org/mail-arch/unicode-ml/y2015-m03/0118.html> and
Michael Everson offered assistance, but no formal proposal has been
submitted to the Documents Register since then.

These symbols are also used in Turkish notation and Western microtonal
notation. They are far more common than MUSICAL SYMBOL QUARTER TONE SHARP
and MUSICAL SYMBOL QUARTER TONE FLAT, which AFAICT only appear in the
Unicode code charts and nowhere else.

On Tue, May 15, 2018 at 10:47 AM, Johnny Farraj via Unicode <
unicode at unicode.org> wrote:

>
> Dear Unicode list members,
>
> I wish to get feedback about a new symbol submission proposal.
>
> Currently the Miscellaneous Symbols table (2600-26FF) includes the
> following characters:
>
> 266D ? MUSIC FLAT SIGN
> 266F ? MUSIC SHARP SIGN
>
> while the Musical Symbols table (1D100 - 1D1FF) includes the following
> characters:
>
> 1D12A ?? MUSICAL SYMBOL DOUBLE SHARP
> 1D12B ?? MUSICAL SYMBOL DOUBLE FLAT
> 1D12C ?? MUSICAL SYMBOL FLAT UP
> 1D12D ?? MUSICAL SYMBOL FLAT DOWN
> 1D130 ?? MUSICAL SYMBOL SHARP UP
> 1D131 ?? MUSICAL SYMBOL SHARP DOWN
> 1D132 ?? MUSICAL SYMBOL QUARTER TONE SHARP
> 1D133 ?? MUSICAL SYMBOL QUARTER TONE FLAT
>
> None of these matches what's used in Arabic music notation.
>
> I am proposing the addition of 2 new characters to the Musical Symbols
> table:
>
> - the half-flat sign (lowers a note by a quarter tone)
> - the half-sharp sign (raises a note by a quarter tone)
>
> [image: Inline image]
> [image: Inline image]
>
>
> These are the correct symbols for Arabic music notation, and they express
> intervals that are multiples of quarter tones. it would be really nice to
> be able to include them directly in an HTML page or rich text document
> using a native code rather than an image.
>
> I am the primary sponsor of this proposal. As far as my credentials, I am
> the owner of http://maqamworld.com, the most widely used online resource
> on Arabic music theory, in English.
>
> My co-sponsor is Sami Abu Shumays, author of http://maqamlessons.com,
> another important online reference for Arabic music theory.
>
> Together, we are in the process of publishing a book on Arabic music
> theory and performance with Oxford University Press, coming out late 2018.
>
> I can also enlist the support of many academics in the music theory field
> who specialize in Arabic music.
>
> I welcome any feedback on this proposal.
>
> thanks
>
> Johnny Farraj
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180515/4b39a10d/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: half-flat sign.png
Type: image/png
Size: 3617 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20180515/4b39a10d/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: half-sharp sign.png
Type: image/png
Size: 2754 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20180515/4b39a10d/attachment-0001.png>

From unicode at unicode.org  Wed May 16 02:42:31 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Wed, 16 May 2018 09:42:31 +0200
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
Message-ID: <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>


> On 16 May 2018, at 00:48, Ken Whistler via Unicode <unicode at unicode.org> wrote:
> 
> On 5/15/2018 2:46 PM, Markus Scherer via Unicode wrote:
>> I am proposing the addition of 2 new characters to the Musical Symbols table:
>> 
>> - the half-flat sign (lowers a note by a quarter tone) 
>> - the half-sharp sign (raises a note by a quarter tone)
>> 
>> In an actual proposal, I would expect a discussion of whether you are proposing to encode established symbols, or whether you are proposing new symbols to be adopted by the community (in which case Unicode would probably wait & see if they get established).
>> 
>> A proposal should also show evidence of usage and glyph variations.
> 
> And should probably refer to the relationship between these signs and the existing:

It would be best to encode the SMuFL symbols, which is rather comprehensive and include those:
 https://www.smufl.org
 http://www.smufl.org/version/latest/

> U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP
> U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT
> 
> which are also half-sharp or half-flat accidentals.
> 
> The wiki on flat signs shows this flat with a crossbar, as well as a reversed flat symbol, to represent the half-flat.
> 
> And the wiki on sharp signs shows this sharp minus one vertical bar to represent the half-sharp.
> 
> So there may be some use of these signs in microtonal notation, outside of an Arabic context, as well. See:
> 
> https://en.wikipedia.org/wiki/Accidental_(music)#Microtonal_notation

These are otherwise originally the same, but has since drifted. So whether to unify them or having them separate might be best to see what SMuFL does, as they are experts on the issue.


From unicode at unicode.org  Wed May 16 07:37:06 2018
From: unicode at unicode.org (Johnny Farraj via Unicode)
Date: Wed, 16 May 2018 08:37:06 -0400
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <CA+p4_H3T++NDjy1MbhmeWCboo=4NpGbJTY6f+UHeEbO_MbUdRg@mail.gmail.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CA+p4_H3T++NDjy1MbhmeWCboo=4NpGbJTY6f+UHeEbO_MbUdRg@mail.gmail.com>
Message-ID: <CABJUkCeuwd=r5ccnte1r2fd=UA7Twrbq9gQ_X74NX-SQPFJ5ig@mail.gmail.com>

Hi Garth,

You are right, I sent a similar posting to the list 3 years ago. at that
time I was hoping get help from some of the more experienced members on the
list to write a proposal. this is a very specialized job and it could take
me months to figure out the process and learn the language. But no one was
able to help. so I'm trying again.

My motivation is being able to type these symbols directly into a MS-Word
document or HTML page, just like you would type a Western flat or sharp
accidental symbol today. My motivation is NOT to make these symbols
available in sheet music notation software; there are solutions for that
today and it's a whole different problem domain.

About the existing symbols

U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP
U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT

I don't know what musical tradition these belong to, as far as I know no
one uses them in real life.

I need to make the case for new symbols called Arabic Half Flat / Sharp. I
don't see my proposal really as a duplication of existing symbols for the
following reason: there is no universal way to notate such accidentals, and
every musical tradition with concepts such as half-flat and half-sharp has
its own standard. I am not an expert in any tradition other than Arabic.
therefore all I'm trying to do is add the Arabic version of these symbols.

the Arabic symbols I'm proposing are established, and have been the
standard for a good 75 years. Any Arabic notation (except for a few
remnants from the 1930s before this standard in use) uses the symbols I'm
proposing to add. I do not foresee any disagreement over what the
half-sharp/half-flat Arabic symbols look like, and I can include tons of
evidence in my proposal.

Can someone on the list volunteer to guide with writing a proposal?

I'm willing to do all the work, I just don't know how to start. I need a
template, and I will be happy to complete it with all the required
information.

thanks

Johnny Farraj


On Wed, May 16, 2018 at 12:32 AM, Garth Wallace <gwalla at gmail.com> wrote:

> What happened to the previous proposal? As I recall, there was some good
> discussion after an email from you back in 2015 <
> http://www.unicode.org/mail-arch/unicode-ml/y2015-m03/0118.html> and
> Michael Everson offered assistance, but no formal proposal has been
> submitted to the Documents Register since then.
>
> These symbols are also used in Turkish notation and Western microtonal
> notation. They are far more common than MUSICAL SYMBOL QUARTER TONE SHARP
> and MUSICAL SYMBOL QUARTER TONE FLAT, which AFAICT only appear in the
> Unicode code charts and nowhere else.
>
> On Tue, May 15, 2018 at 10:47 AM, Johnny Farraj via Unicode <
> unicode at unicode.org> wrote:
>
>>
>> Dear Unicode list members,
>>
>> I wish to get feedback about a new symbol submission proposal.
>>
>> Currently the Miscellaneous Symbols table (2600-26FF) includes the
>> following characters:
>>
>> 266D ? MUSIC FLAT SIGN
>> 266F ? MUSIC SHARP SIGN
>>
>> while the Musical Symbols table (1D100 - 1D1FF) includes the following
>> characters:
>>
>> 1D12A ?? MUSICAL SYMBOL DOUBLE SHARP
>> 1D12B ?? MUSICAL SYMBOL DOUBLE FLAT
>> 1D12C ?? MUSICAL SYMBOL FLAT UP
>> 1D12D ?? MUSICAL SYMBOL FLAT DOWN
>> 1D130 ?? MUSICAL SYMBOL SHARP UP
>> 1D131 ?? MUSICAL SYMBOL SHARP DOWN
>> 1D132 ?? MUSICAL SYMBOL QUARTER TONE SHARP
>> 1D133 ?? MUSICAL SYMBOL QUARTER TONE FLAT
>>
>> None of these matches what's used in Arabic music notation.
>>
>> I am proposing the addition of 2 new characters to the Musical Symbols
>> table:
>>
>> - the half-flat sign (lowers a note by a quarter tone)
>> - the half-sharp sign (raises a note by a quarter tone)
>>
>> [image: Inline image]
>> [image: Inline image]
>>
>>
>> These are the correct symbols for Arabic music notation, and they express
>> intervals that are multiples of quarter tones. it would be really nice to
>> be able to include them directly in an HTML page or rich text document
>> using a native code rather than an image.
>>
>> I am the primary sponsor of this proposal. As far as my credentials, I am
>> the owner of http://maqamworld.com, the most widely used online resource
>> on Arabic music theory, in English.
>>
>> My co-sponsor is Sami Abu Shumays, author of http://maqamlessons.com,
>> another important online reference for Arabic music theory.
>>
>> Together, we are in the process of publishing a book on Arabic music
>> theory and performance with Oxford University Press, coming out late 2018.
>>
>> I can also enlist the support of many academics in the music theory field
>> who specialize in Arabic music.
>>
>> I welcome any feedback on this proposal.
>>
>> thanks
>>
>> Johnny Farraj
>>
>>
>>
>>
>>
>


-- 

Johnny
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180516/1a593529/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: half-sharp sign.png
Type: image/png
Size: 2754 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20180516/1a593529/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: half-flat sign.png
Type: image/png
Size: 3617 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20180516/1a593529/attachment-0001.png>

From unicode at unicode.org  Wed May 16 08:23:08 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Wed, 16 May 2018 05:23:08 -0800
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <20180515235135.7df264c2@JRWUBU2>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
 <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
 <20180514203115.5c093920@JRWUBU2>
 <CABPY6Z2QLQd9dewsNuOpEpYUqSWrSKdT-fbESkE402Ux=qcdNw@mail.gmail.com>
 <CABPY6Z1y2ofnAniGR1tmFzOkwio=s=BLg77MBzSJroU-yeYi+g@mail.gmail.com>
 <20180515235135.7df264c2@JRWUBU2>
Message-ID: <CABPY6Z0a8T65-V5pZv=ao+b7MyHEg9=mdq1zrtCqzKWdntpjNg@mail.gmail.com>

In response to Richard Wordingham,

Sorry I can't answer many of your questions.  Hoping someone who can
does.  Note that although the proposal gave canonical combining class
zero to both the tone marks and the vowel signs, the on-line Unicode
data gives canonical combining class 230 to the tone marks.

> **The placement may be different to that of MAI KANG
> in /b?? wa?/ ?????? <BA, MAI KANG, TONE-1, SAKOT, WA,
> SIGN AA> or ?????? <BA, MAI KANG, SAKOT, WA, TONE-1,
> SIGN AA> - I don't know whether the first or the second
> tone mark is dropped.

FWIW, neither is dropped in the display here, although they don't
display identically.  The first string shows TONE-1 positioned to the
right of MAI KANG, the second string superimposes them.  (Windows 7
running LibreOffice in order to enable the USE from HarfBuzz.)

> (Getting the tone and MAI KANG to interact after <NA, tone,
> SIGN AA, MAI KANG> has formed the NAA ligature from
> <NA, SIGN AA> seems impossible.

Substituting U+1A36 TAI THAM LETTER NA for BA in the above strings,
??????  ??????, and trying to get the ligature are in the attached
*.PNG file. Here's the four strings for the PNG:

\u1A36\u1A74\u1A75\u1A60\u1A45\u1A63
\u1A36\u1A74\u1A60\u1A45\u1A75\u1A63
\u1A36\u1A75\u1A63\u1A74
\u1A36\u1A63\u1A74\u1A75
-------------- next part --------------
A non-text attachment was scrubbed...
Name: TaiTham_20180516.PNG
Type: image/png
Size: 2363 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20180516/da5a4aa4/attachment.png>

From unicode at unicode.org  Wed May 16 08:25:59 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Wed, 16 May 2018 15:25:59 +0200
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
Message-ID: <9ADA9726-54A3-4642-936F-FFB9A6DDC757@telia.com>


> On 16 May 2018, at 09:42, Hans ?berg via Unicode <unicode at unicode.org> wrote:
> 
>> On 16 May 2018, at 00:48, Ken Whistler via Unicode <unicode at unicode.org> wrote:
>> 
>>> A proposal should also show evidence of usage and glyph variations.
>> 
>> And should probably refer to the relationship between these signs and the existing:
> 
> It would be best to encode the SMuFL symbols, which is rather comprehensive and include those:
> https://www.smufl.org
> http://www.smufl.org/version/latest/
> 
>> U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP
>> U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT
>> 
>> which are also half-sharp or half-flat accidentals.
>> 
>> The wiki on flat signs shows this flat with a crossbar, as well as a reversed flat symbol, to represent the half-flat.
>> 
>> And the wiki on sharp signs shows this sharp minus one vertical bar to represent the half-sharp.
>> 
>> So there may be some use of these signs in microtonal notation, outside of an Arabic context, as well. See:
>> 
>> https://en.wikipedia.org/wiki/Accidental_(music)#Microtonal_notation
> 
> These are otherwise originally the same, but has since drifted. So whether to unify them or having them separate might be best to see what SMuFL does, as they are experts on the issue.

Clarification: The Arabic accidentals, listed here as separate entities
  http://www.smufl.org/version/latest/range/arabicAccidentals/
appear in LilyPond as ordinary microtonal accidentals:
  http://lilypond.org/doc/v2.18/Documentation/notation/the-feta-font#accidental-glyphs

So what I meant above is that originally, they were the same, i.e., when starting to use them in Arabic music, one took some Western microtonal accidentals. Now they mean microtones in the style of Arabic music, and the musical interpretation varies.


From unicode at unicode.org  Wed May 16 15:46:22 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Wed, 16 May 2018 13:46:22 -0700
Subject: L2/18-181
Message-ID: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>

http://www.unicode.org/L2/L2018/18181-n4947-assamese.pdf

This is a fascinating proposal to disunify the Assamese script from
Bengali on the following bases:

1. The identity of Assamese as a script distinct from Bengali is in
jeopardy.

2. Collation is different between the Assamese and Bengali languages,
and code point order should reflect collation order.

3. Keyboard design is more difficult because consonants like ???
are encoded as conjunct forms instead of atomic characters.

4. The use of a single encoded script to write two languages forces
users to use language identifiers to identify the language.

5. Transliteration of Assamese into a different script is problematic
because letters have different phonological value in Assamese and
Bengali.

It will be interesting to see where this proposal goes. Given that all
or most of these issues can be claimed for English, French, German,
Spanish, and hundreds of other languages written in the Latin script, if
the Assamese proposal is approved we can expect similar disunification
of the Latin script into language-specific alphabets in the future.

--
Doug Ewell | Thornton, CO, US | ewellic.org


From unicode at unicode.org  Wed May 16 16:39:36 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 16 May 2018 22:39:36 +0100
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <CABPY6Z0a8T65-V5pZv=ao+b7MyHEg9=mdq1zrtCqzKWdntpjNg@mail.gmail.com>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
 <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
 <20180514203115.5c093920@JRWUBU2>
 <CABPY6Z2QLQd9dewsNuOpEpYUqSWrSKdT-fbESkE402Ux=qcdNw@mail.gmail.com>
 <CABPY6Z1y2ofnAniGR1tmFzOkwio=s=BLg77MBzSJroU-yeYi+g@mail.gmail.com>
 <20180515235135.7df264c2@JRWUBU2>
 <CABPY6Z0a8T65-V5pZv=ao+b7MyHEg9=mdq1zrtCqzKWdntpjNg@mail.gmail.com>
Message-ID: <20180516223936.32a843d1@JRWUBU2>

On Wed, 16 May 2018 05:23:08 -0800
James Kass via Unicode <unicode at unicode.org> wrote:

> Note that although the proposal gave canonical combining class
> zero to both the tone marks and the vowel signs, the on-line Unicode
> data gives canonical combining class 230 to the tone marks.

There were several changes from ccc=0 to non-zero that were sneaked in
between the UTC agreeing to proceed with the proposal and Unicode 5.2
being published.  That may have been a test of vigialnce; we failed.  I
have seen no benefit from the changes - U+A160 TAI THAM SIGN SAKOT is
not a virama (it should not appear in valid text), and having the tone
marks and the invisible stacker have distinct non-zero classes has
caused lots of irritation.

We should probably have risked Tai Tham being excluded from the BMP and
gone for the Tibetan model; normalised would not then damage Tai tham
text.

> > **The placement may be different to that of MAI KANG
> > in /b?? wa?/ ?????? <BA, MAI KANG, TONE-1, SAKOT, WA,
> > SIGN AA> or ?????? <BA, MAI KANG, SAKOT, WA, TONE-1,
> > SIGN AA> - I don't know whether the first or the second
> > tone mark is dropped.  

> FWIW, neither is dropped in the display here, although they don't
> display identically.  The first string shows TONE-1 positioned to the
> right of MAI KANG, the second string superimposes them.  (Windows 7
> running LibreOffice in order to enable the USE from HarfBuzz.)

The full uncontracted writing is <BA, MAI KANG, TONE-1, WA, TONE-1, SIGN
AA>.  Both syllables have TONE-1, but I have not seen two identical
tone marks from different phonetic syllables in the same stack.  The
person typing the contraction drops a tone mark, not the rendering
system.

> Substituting U+1A36 TAI THAM LETTER NA for BA in the above strings,
> ??????  ??????, and trying to get the ligature are in the attached
> *.PNG file. Here's the four strings for the PNG:
> 
> \u1A36\u1A74\u1A75\u1A60\u1A45\u1A63
> \u1A36\u1A74\u1A60\u1A45\u1A75\u1A63
> \u1A36\u1A75\u1A63\u1A74
> \u1A36\u1A63\u1A74\u1A75

A lot of fonts have trouble ligating NA and AA when there is material
between them.  (Hint: Classify all non-spacing subscript consonants as
marks, and spacing subscript consonants as bases, and set the ligating
lookup to ignore marks.)

Your example appears to be using the font called 'A Tai Tham KH New'.
While the only way to type Pali _bho_ 'O' after other text in this font
or 'A Tai Tham KH' is to enter the correct sequence <LOW PHA, SIGN E,
SIGN AA>, the former font cannot render Pali _mano_ 'mind' (also used in
Northern Thai and probably also Tai Khuen) if one types the correct
sequence <MA, NA, SIGN E, SIGN AA>.  One has to type <MA, NA, SIGN AA,
SIGN E>!  The *older* font 'A Tai Tham KH (at Version 2.0) does render the
correct spelling properly.  As an example of correct rendering, I
include the Pali for 'O mind!', _bho mano_, encoded  <LOW PHA, SIGN E,
SIGN AA, MA, NA, SIGN AA, SIGN E>, as rendered by the Lamphun font.

Richard.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: o_mind.png
Type: image/png
Size: 2049 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20180516/20d31a62/attachment.png>

From unicode at unicode.org  Wed May 16 17:01:10 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 16 May 2018 23:01:10 +0100
Subject: L2/18-181
In-Reply-To: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
Message-ID: <20180516230110.31f9efa2@JRWUBU2>

On Wed, 16 May 2018 13:46:22 -0700
Doug Ewell via Unicode <unicode at unicode.org> wrote:

> http://www.unicode.org/L2/L2018/18181-n4947-assamese.pdf
> 
> This is a fascinating proposal to disunify the Assamese script from
> Bengali on the following bases:

> 3. Keyboard design is more difficult because consonants like ???
> are encoded as conjunct forms instead of atomic characters.

Users of X do have a valid gripe here.  An X keyboard mapping can only
accept single codepoints; sequences require explicit support by the
application.  Advanced applications get round this by using an input
method, but they can be unreliable, particularly over networks.  (I
ended up creating an X keyboad mapping as back-up, but when I use it
I lose all my 'ligature' keys.) However, that seems to be an argument
for deprecating Bengali, rather than for disunifying Bengali and
Assamese.

I think simple Windows keyboards have a limit of 4 16-bit code units;
for an Indic SMP script, one couldn't map <x> to a single key, as it
would require 6 code units.

It would be handy to have characters whose only use was to input text;
adding characters that are subject to composition exclusions would not
change whether text is in NFC, in NFD, or neither. 

Of course, if the scripts were disunified, would we have to ban Assamese
domain names in the new 'Assamese script' as they would be ambiguous
with Bengali names.

Richard.


From unicode at unicode.org  Wed May 16 17:41:12 2018
From: unicode at unicode.org (Anshuman Pandey via Unicode)
Date: Wed, 16 May 2018 17:41:12 -0500
Subject: Fwd: L2/18-181
In-Reply-To: <68766D80-8411-4FDF-8323-DC6C76116642@umich.edu>
References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
 <68766D80-8411-4FDF-8323-DC6C76116642@umich.edu>
Message-ID: <CAOcx4CRsfudPAK=eGomFA1fQOyeT2mUXMv6oS_aRdu-RDW-6kw@mail.gmail.com>

> On May 16, 2018, at 3:46 PM, Doug Ewell via Unicode <unicode at unicode.org> wrote:
>
> http://www.unicode.org/L2/L2018/18181-n4947-assamese.pdf
>
> This is a fascinating proposal to disunify the Assamese script from
> Bengali on the following bases:

?Fascinating? is a not a term I?d use for this proposal.

If folks are interested in a valid proposal for disunification of
Bengali, please look at the proposal for Tirhuta.

> 1. The identity of Assamese as a script distinct from Bengali is in
> jeopardy.

This is not a technical matter. Moreover, its typical rhetoric used by
various language communities in South Asia. Fairly standard fare for
those familiar with such issues.

The proposal needs to show how the two scripts differ, ie. conjuncts,
CV ligatures, etc. The number forms are similar to those already
encoded. Again, cf. Tirhuta.

> 2. Collation is different between the Assamese and Bengali languages,
> and code point order should reflect collation order.

The same issue applies to dictionary order for Hindi, Marathi, which
differ from the conventional Sanskrit order for Devanagari.
Orthographies for various languages put conjuncts and other things at
the end, which are not considered atomic letters. Nothing special in
this regard for Assamese and Bengali.

> 3. Keyboard design is more difficult because consonants like ???
> are encoded as conjunct forms instead of atomic characters.

Ignorant question on my part: is it difficult to use character
sequences as labels for keys? I see keys for both ??? and ??? on the
iOS Hindi keyboard, and ??? is tucked away under ?.

> 4. The use of a single encoded script to write two languages forces
> users to use language identifiers to identify the language.

Same applies to each of the 40+ varieties of Hindi, as well as
Marathi, etc. Another ignorant question: how to identify the various
languages that use Arabic and Cyrillic?

> 5. Transliteration of Assamese into a different script is problematic
> because letters have different phonological value in Assamese and
> Bengali.

Transliteration or transcription? In any case, this applies to other
languages written using similar scripts: a Marathi speaker pronounces
? and ? differently than a Hindi speaker does.

> It will be interesting to see where this proposal goes.

Hopefully, it does not go too far. What it proposes is contrary to
Unicode and redundant.

> Given that all
> or most of these issues can be claimed for English, French, German,
> Spanish, and hundreds of other languages written in the Latin script, if
> the Assamese proposal is approved we can expect similar disunification
> of the Latin script into language-specific alphabets in the future.

Fascinating. I mean, terrible.

All my best,
Anshuman


From unicode at unicode.org  Wed May 16 18:34:35 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Thu, 17 May 2018 00:34:35 +0100
Subject: L2/18-181
In-Reply-To: <20180516230110.31f9efa2@JRWUBU2>
References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
 <20180516230110.31f9efa2@JRWUBU2>
Message-ID: <8A5B21CA-5241-43A9-B537-5F6525198BD5@evertype.com>

This is not a fault of the encoding.

> On 16 May 2018, at 23:01, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> I think simple Windows keyboards have a limit of 4 16-bit code units;
> for an Indic SMP script, one couldn't map <x> to a single key, as it
> would require 6 code units.


From unicode at unicode.org  Wed May 16 18:38:24 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Thu, 17 May 2018 00:38:24 +0100
Subject: L2/18-181
In-Reply-To: <CAOcx4CRsfudPAK=eGomFA1fQOyeT2mUXMv6oS_aRdu-RDW-6kw@mail.gmail.com>
References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
 <68766D80-8411-4FDF-8323-DC6C76116642@umich.edu>
 <CAOcx4CRsfudPAK=eGomFA1fQOyeT2mUXMv6oS_aRdu-RDW-6kw@mail.gmail.com>
Message-ID: <49C16312-F60E-4495-B3E5-D74079FE5F9B@evertype.com>

And Icelandic. And Irish. And so on. 

> On 16 May 2018, at 23:41, Anshuman Pandey via Unicode <unicode at unicode.org> wrote:
> 
>> 2. Collation is different between the Assamese and Bengali languages,
>> and code point order should reflect collation order.
> 
> The same issue applies to dictionary order for Hindi, Marathi, which
> differ from the conventional Sanskrit order for Devanagari.


From unicode at unicode.org  Wed May 16 19:20:43 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 17 May 2018 01:20:43 +0100
Subject: L2/18-181
In-Reply-To: <8A5B21CA-5241-43A9-B537-5F6525198BD5@evertype.com>
References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
 <20180516230110.31f9efa2@JRWUBU2>
 <8A5B21CA-5241-43A9-B537-5F6525198BD5@evertype.com>
Message-ID: <20180517012043.2d5a2f7d@JRWUBU2>

On Thu, 17 May 2018 00:34:35 +0100
Michael Everson via Unicode <unicode at unicode.org> wrote:

> This is not a fault of the encoding.
> 
> > On 16 May 2018, at 23:01, Richard Wordingham via Unicode
> > <unicode at unicode.org> wrote:
> > 
> > I think simple Windows keyboards have a limit of 4 16-bit code
> > units; for an Indic SMP script, one couldn't map <x> to a single
> > key, as it would require 6 code units.  

It is a consequence of the policy of avoiding precomposed characters.
If there were a precomposed character for <x>, the keyboard could emit
that character - job done.

One objection is that one would need a sequence of decompositions:

<XA> = <KA_PLUS, SSA>
<KA_PLUS> = <KA, VIRAMA>

Some people are vehemently opposed to unnatural characters like
<KA_PLUS>.

Presumable the official view is that Windows Text Services have taken us
beyond that point, and the likes of <XA> above are not needed.

If X persists, perhaps named sequences should be assigned numbers so
that X can make a generic allocation of keysym codes to named
sequences.

Richard. 

From unicode at unicode.org  Wed May 16 19:24:11 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 17 May 2018 01:24:11 +0100
Subject: L2/18-181
In-Reply-To: <CAOcx4CRsfudPAK=eGomFA1fQOyeT2mUXMv6oS_aRdu-RDW-6kw@mail.gmail.com>
References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
 <68766D80-8411-4FDF-8323-DC6C76116642@umich.edu>
 <CAOcx4CRsfudPAK=eGomFA1fQOyeT2mUXMv6oS_aRdu-RDW-6kw@mail.gmail.com>
Message-ID: <20180517012411.4dfcdaac@JRWUBU2>

On Wed, 16 May 2018 17:41:12 -0500
Anshuman Pandey via Unicode <unicode at unicode.org> wrote:

> > 3. Keyboard design is more difficult because consonants like ???
> > are encoded as conjunct forms instead of atomic characters.  
> 
> Ignorant question on my part: is it difficult to use character
> sequences as labels for keys? I see keys for both ??? and ??? on the
> iOS Hindi keyboard, and ??? is tucked away under ?.

It can be.  It depends on the technology.

Pure X seems to be the worst.  At the basic level, one has a
bewildering map of key plus active modifier key to a single
Unicode character. (The space also include function keys.)  An
*application* can map keys to strings, but I know of no way of doing
that to all of a user's applications, both those running and those that
will run.  Even the logic for dead keys has to be applied by the
application, though I believe there are standard libraries that will
handle that.

The old method on Windows uses sets of data tables that may be termed
keyboards.  Populated sets are saved as DLLs, and there are limits on
what they can contain.  Windows' Microsoft Keyboard Layout Creator
(MSKLC) is a popular tool for creating and packaging these DLLs.
A key plus it modifiers can be mapped to:

1) A sequence of UTF-16 code units.  The documented limit is, I believe
4, but there are reports of people being able to use 6.  The four
sequences listed above each constitute a sequence of 3 code units, so
they can be readily accommodated.  This technique may well not work
for a script in the SMP, and I think one cannot use the MSKLC simply to
create the DLLs storing long sequences.  So here is an added layer of
complexity, though not relevant to the Bengali script.

2) A key can be designated a 'dead key'.  I think it has to have a
fallback to a BMP character, or rather, a single UTF-16 code unit.  On
then pressing a key that maps to a single code unit, this is converted
to another single code unit, which is the character that the
combination types.  The restriction is built into the data structure.

There is a technique to chain dead keys, but that is not relevant to
the difficulty or ease of typing ligatures.

The next level up I am acquainted with is the level of input methods.
Here, one types a sequence of characters on a 'simple' keyboard, and
this sequence controls the derivation of characters being input to the
application.  Modifier keys may be available to influence this
derivation.  Now, some of these input methods may be unreliable, and
there may be problems for users who can switch between simple
keyboards, e.g. US and British, or US and Hindi.

If this type of method works, then inputting sequences in response to
a single keystroke is not a problem.  Multiple key strokes can be a
different matter, as the interface with applications may be ill-defined
or broken.  I have found this a problem with using the backslash key to
cycle through candidate characters, and deleting SMP characters in
LibreOffice has in the past resulted in the creation of lone surrogates.

Now, writing these input methods can be easy.  I have fairly simple
input methods for inputting both true characters and sequences
perceived as characters for Emacs, ibus (using KMfL) and fcitx (using
M17n).  However, the ibus method has been unreliable in the past, and I
have fallen back to a simple X keyboard map.  When I do that, I lose
the ability to input sequences by a single keystroke.

Richard.


From unicode at unicode.org  Wed May 16 19:24:09 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Thu, 17 May 2018 01:24:09 +0100
Subject: L2/18-181
In-Reply-To: <20180517012043.2d5a2f7d@JRWUBU2>
References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
 <20180516230110.31f9efa2@JRWUBU2>
 <8A5B21CA-5241-43A9-B537-5F6525198BD5@evertype.com>
 <20180517012043.2d5a2f7d@JRWUBU2>
Message-ID: <E604ACDB-5981-4493-8BD8-A145CE6FF71D@evertype.com>

It sounds to me like a fault in the keyboard software, which could be fixed by the people who own and maintain that software.

> On 17 May 2018, at 01:20, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> On Thu, 17 May 2018 00:34:35 +0100
> Michael Everson via Unicode <unicode at unicode.org> wrote:
> 
>> This is not a fault of the encoding.
>> 
>>> On 16 May 2018, at 23:01, Richard Wordingham via Unicode
>>> <unicode at unicode.org> wrote:
>>> 
>>> I think simple Windows keyboards have a limit of 4 16-bit code
>>> units; for an Indic SMP script, one couldn't map <x> to a single
>>> key, as it would require 6 code units.  
> 
> It is a consequence of the policy of avoiding precomposed characters.
> If there were a precomposed character for <x>, the keyboard could emit
> that character - job done.
> 
> One objection is that one would need a sequence of decompositions:
> 
> <XA> = <KA_PLUS, SSA>
> <KA_PLUS> = <KA, VIRAMA>
> 
> Some people are vehemently opposed to unnatural characters like
> <KA_PLUS>.
> 
> Presumable the official view is that Windows Text Services have taken us
> beyond that point, and the likes of <XA> above are not needed.
> 
> If X persists, perhaps named sequences should be assigned numbers so
> that X can make a generic allocation of keysym codes to named
> sequences.
> 
> Richard. 


From unicode at unicode.org  Wed May 16 19:49:21 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 17 May 2018 01:49:21 +0100
Subject: L2/18-181
In-Reply-To: <E604ACDB-5981-4493-8BD8-A145CE6FF71D@evertype.com>
References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
 <20180516230110.31f9efa2@JRWUBU2>
 <8A5B21CA-5241-43A9-B537-5F6525198BD5@evertype.com>
 <20180517012043.2d5a2f7d@JRWUBU2>
 <E604ACDB-5981-4493-8BD8-A145CE6FF71D@evertype.com>
Message-ID: <20180517014921.7be07a44@JRWUBU2>

On Thu, 17 May 2018 01:24:09 +0100
Michael Everson via Unicode <unicode at unicode.org> wrote:

> It sounds to me like a fault in the keyboard software, which could be
> fixed by the people who own and maintain that software.

We had this discussion a few years ago.  See
http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0036.html.

Richard.

From unicode at unicode.org  Thu May 17 01:47:45 2018
From: unicode at unicode.org (Garth Wallace via Unicode)
Date: Wed, 16 May 2018 23:47:45 -0700
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
Message-ID: <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>

On Wed, May 16, 2018 at 12:42 AM, Hans ?berg via Unicode <
unicode at unicode.org> wrote:

>
> > On 16 May 2018, at 00:48, Ken Whistler via Unicode <unicode at unicode.org>
> wrote:
> >
> > On 5/15/2018 2:46 PM, Markus Scherer via Unicode wrote:
> >> I am proposing the addition of 2 new characters to the Musical Symbols
> table:
> >>
> >> - the half-flat sign (lowers a note by a quarter tone)
> >> - the half-sharp sign (raises a note by a quarter tone)
> >>
> >> In an actual proposal, I would expect a discussion of whether you are
> proposing to encode established symbols, or whether you are proposing new
> symbols to be adopted by the community (in which case Unicode would
> probably wait & see if they get established).
> >>
> >> A proposal should also show evidence of usage and glyph variations.
> >
> > And should probably refer to the relationship between these signs and
> the existing:
>
> It would be best to encode the SMuFL symbols, which is rather
> comprehensive and include those:
>  https://www.smufl what should be unified.org <https://www.smufl.org>
>  http://www.smufl.org/version/latest/


If you want to write up a proposal for that entire set of characters,
godspeed and good luck.

> U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP
> > U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT
> >
> > which are also half-sharp or half-flat accidentals.
> >
> > The wiki on flat signs shows this flat with a crossbar, as well as a
> reversed flat symbol, to represent the half-flat.
> >
> > And the wiki on sharp signs shows this sharp minus one vertical bar to
> represent the half-sharp.
> >
> > So there may be some use of these signs in microtonal notation, outside
> of an Arabic context, as well. See:
> >
> > https://en.wikipedia.org/wiki/Accidental_(music)#Microtonal_notation
>
> These are otherwise originally the same, but has since drifted. So whether
> to unify them or having them separate might be best to see what SMuFL does,
> as they are experts on the issue.
>

SMuFL's standards on unification are not the same as Unicode's. For one
thing, they re-encode Latin letters and Arabic digits multiple times for
various different uses (such as numbers used in tuplets and those used in
time signatures). There are duplicates all over the place, like how the
half-sharp symbol is encoded at U+E282 as
"accidentalQuarterToneSharpStein", at U+E422 as
"accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as
"accidentalQuarterToneSharpArabic", and at U+E444 as "accidentalKomaSharp".
They are graphically identical, and the first three even all mean the same
thing, a quarter tone sharp! The last, though meaning something different
in Turkish context (Turkish theory divides tones into 1/9-tones), is still
clearly the same symbol. The "Arabic accidentals" section even re-encodes
all of the non-microtonal accidentals (basic sharp, flat, natural, etc.)
for no reason that I can determine.

There are definitely many things in SMuFL where you could make a claim that
they should be in Unicode proper. But not all, and the standard itself is a
bit of a mess.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180516/8573e0ea/attachment.html>

From unicode at unicode.org  Thu May 17 02:40:54 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Thu, 17 May 2018 09:40:54 +0200
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
Message-ID: <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>


> On 17 May 2018, at 08:47, Garth Wallace via Unicode <unicode at unicode.org> wrote:
> 
>> On Wed, May 16, 2018 at 12:42 AM, Hans ?berg via Unicode <unicode at unicode.org> wrote:
>> 
>> It would be best to encode the SMuFL symbols, which is rather comprehensive and include those:
>>  https://www.smufl what should be unified.org
>>  http://www.smufl.org/version/latest/
>> ...
>> 
>> These are otherwise originally the same, but has since drifted. So whether to unify them or having them separate might be best to see what SMuFL does, as they are experts on the issue.
>> 
> SMuFL's standards on unification are not the same as Unicode's. For one thing, they re-encode Latin letters and Arabic digits multiple times for various different uses (such as numbers used in tuplets and those used in time signatures).

The reason is probably because it is intended for use with music engraving, and they should then be rendered differently.

> There are duplicates all over the place, like how the half-sharp symbol is encoded at U+E282 as "accidentalQuarterToneSharpStein", at U+E422 as "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as "accidentalQuarterToneSharpArabic", and at U+E444 as "accidentalKomaSharp". They are graphically identical, and the first three even all mean the same thing, a quarter tone sharp!

But the tuning system is different, E24 and Pythagorean. Some Latin and Greek uppercase letters are exactly the same but have different encodings.

> The last, though meaning something different in Turkish context (Turkish theory divides tones into 1/9-tones), is still clearly the same symbol. The "Arabic accidentals" section even re-encodes all of the non-microtonal accidentals (basic sharp, flat, natural, etc.) for no reason that I can determine.

In Turkish AEU (Arel-Ezgi-Uzdilek) notation the sharp # is a microtonal symbol, not the ordinary sharp, so it should be different. In Arabic music, they are the same though, so they can be unified.

> There are definitely many things in SMuFL where you could make a claim that they should be in Unicode proper. But not all, and the standard itself is a bit of a mess.

You need to work through those little details to see what fits. Should it help with music engraving, or merely be used in plain text? Should symbols that that look alike but have different musical meaning be unified?


From unicode at unicode.org  Thu May 17 03:51:55 2018
From: unicode at unicode.org (Otto Stolz via Unicode)
Date: Thu, 17 May 2018 10:51:55 +0200
Subject: L2/18-181
In-Reply-To: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
Message-ID: <6063e7e0-fa1e-ce61-2834-01ac36c4dadb@uni-konstanz.de>

Am 2018-05-16 um 22:46 Uhr hat Doug Ewell geschrieben:
> 2. Collation is different between the Assamese and Bengali languages,
> and code point order should reflect collation order.
?
> 4. The use of a single encoded script to write two languages forces
> users to use language identifiers to identify the language.

I wonder how English and French ever could
be made to use a single script, let alone
German (???), Icelandic (???), Swedish (???),
Latvian (???), Chech (???) or ? you name it.

Best wishes,
   Otto Stolz

From unicode at unicode.org  Thu May 17 01:49:55 2018
From: unicode at unicode.org (dinar qurbanov via Unicode)
Date: Thu, 17 May 2018 09:49:55 +0300
Subject: how to make custom combining diacritical marks for arabic letters?
Message-ID: <CAPzvyQiaefECoWSQLS8ty9wr1g_KvzzmFeFDsPDTZfvwK0AQ+A@mail.gmail.com>

how to make custom combining diacritical marks for arabic letters?
should only font drivers and programs support it, or should also
unicode support it, for example, have special area for them?

as i know, private use area can be used to make combining diacritical
marks for latin script without problems.

but when i tried, several years ago, to make that for arabic script,
with fontforge, i had to use right to left override mark, and manually
insert beginning, middle, ending forms of arabic letters, and even
then, my custom marks were not located very properly above letters.

From unicode at unicode.org  Thu May 17 05:04:25 2018
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Thu, 17 May 2018 11:04:25 +0100 (BST)
Subject: L2/18-181
In-Reply-To: <6063e7e0-fa1e-ce61-2834-01ac36c4dadb@uni-konstanz.de>
References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
 <6063e7e0-fa1e-ce61-2834-01ac36c4dadb@uni-konstanz.de>
Message-ID: <22771476.14619.1526551465700.JavaMail.defaultUser@defaultHost>

Otto Stolz wrote:

> I wonder how English and French ever could be made to use a single script, let alone German (???), Icelandic (???), Swedish (???), Latvian (???), Chech (???) or ? you name it.

Years ago I used to hand set metal type - letterpress printing was a family hobby.

For a fount of type of a particular style and size and case, there was a typecase, subdivided into areas of various sizes and there was a more or less standard lay of the typecase so that, for example, a lowercase e was in a larger area than a lowercase q, because there were more pieces of type of a lowercase e than of a lowercase q, and e and q were in a known place within the typecase so that a lowercase e in any of the typecases was in the same place within the typecase. There were a number of extra small areas near the edge of the typecase which were unspecified and could be used for extra sorts as they were known.

I had become interested in Esperanto and bought some sorts, some of each of twelve sorts, so as to augment a fount used for printing in English be able to print in Esperanto as well.

These sorts were placed in some of the small areas near the edge of the typecase. Had I wanted to print in French I could have bought the accented sorts needed for French. Indeed the type catalogue from the typefounder had a list of which sorts were needed for each of various European languages. I learned most of that list.

This has proved useful at times, such as in the early 1970s when two researchers were trying to translate a research paper from what they thought was Spanish into English and were having problems and I was able to point out that it was not Spanish but Portuguese as there was an a tilde in the text, even though I do not know Portuguese.

There was a publication by the Monotype Corporation, published in 1963.

Languages of the world that can be set on 'Monotype' machines / compiled by R.A. Downie.

I have just looked it up in the British Library online catalogue.

I bought a copy of the publication in the 1960s. I do not have it immediately to hand. Does anyone have a copy readily available and can say what is said about Assamese in that book please?

Going back to look at what was done in relation to Assamese with metal type - not just the Monotype brand -  could be an interesting insight.

I notice that Otto Stolz mentions the following.

> Icelandic (???),

Yet the thorn character was part of English too.

Yet it was lost from English.

Was that because William Caxton got his founts of metal type from the European mainland and the necessary sort was not in the font?

Is the same sort of thing happening now, over five hundred years later, in relation to Assamese?

Maybe people should be helping to get this resolved to the satisfaction of all and helping rather than criticising.

By the way, in relation to language identification, Unicode has a perfectly good plain text mechanism for language identification built into it, using the character

U+E0001 LANGUAGE TAG

and other tag characters.

All of the tag characters were deprecated years ago, against opposition by at least two of the contributors to this present thread, then all except U+E0001 have been undeprecated more recently.

There is a note in the code chart.

>> This character is deprecated, and its use is strongly discouraged.

It does not say by whom it is discouraged though nor why.

www.unicode.org/charts/PDF/UE0000.pdf

I opine that it time for a rethink on this and that U+E0001 should be undeprecated and its application be encouraged instead of all the stuff about using higher level protocols all the time - after all, higher level protocols are not encouraged instead when people want to send emoji.

William Overington

Thursday 17 May 2018


From unicode at unicode.org  Thu May 17 09:47:25 2018
From: unicode at unicode.org (Garth Wallace via Unicode)
Date: Thu, 17 May 2018 07:47:25 -0700
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
Message-ID: <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>

On Thu, May 17, 2018 at 12:41 AM Hans ?berg <haberg-1 at telia.com> wrote:

>
> > On 17 May 2018, at 08:47, Garth Wallace via Unicode <unicode at unicode.org>
> wrote:
> >
> >> On Wed, May 16, 2018 at 12:42 AM, Hans ?berg via Unicode <
> unicode at unicode.org> wrote:
> >>
> >> It would be best to encode the SMuFL symbols, which is rather
> comprehensive and include those:
> >>  https://www.smufl what should be unified.org
> >>  http://www.smufl.org/version/latest/
> >> ...
> >>
> >> These are otherwise originally the same, but has since drifted. So
> whether to unify them or having them separate might be best to see what
> SMuFL does, as they are experts on the issue.
> >>
> > SMuFL's standards on unification are not the same as Unicode's. For one
> thing, they re-encode Latin letters and Arabic digits multiple times for
> various different uses (such as numbers used in tuplets and those used in
> time signatures).
>
> The reason is probably because it is intended for use with music
> engraving, and they should then be rendered differently.


Exactly. But Unicode would consider these a matter for font switching in
rich text.

> There are duplicates all over the place, like how the half-sharp symbol
> is encoded at U+E282 as "accidentalQuarterToneSharpStein", at U+E422 as
> "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as
> "accidentalQuarterToneSharpArabic", and at U+E444 as "accidentalKomaSharp".
> They are graphically identical, and the first three even all mean the same
> thing, a quarter tone sharp!
>
> But the tuning system is different, E24 and Pythagorean. Some Latin and
> Greek uppercase letters are exactly the same but have different encodings.


Tuning systems are not scripts.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180517/55ef1fa5/attachment.html>

From unicode at unicode.org  Thu May 17 09:51:54 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Thu, 17 May 2018 06:51:54 -0800
Subject: L2/18-181
In-Reply-To: <22771476.14619.1526551465700.JavaMail.defaultUser@defaultHost>
References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
 <6063e7e0-fa1e-ce61-2834-01ac36c4dadb@uni-konstanz.de>
 <22771476.14619.1526551465700.JavaMail.defaultUser@defaultHost>
Message-ID: <CABPY6Z21zj2oeSD3-na3euwZP3B-mQJmcoLnYjhapR16AciA9Q@mail.gmail.com>

William Overington offered a suggestion,

? Maybe people should be helping to get this resolved
? to the satisfaction of all and helping rather than
? criticising.

That's a noble thought, but as long as Assamese continues to be
written using the Eastern Nagari script, which is referred to as
"BENGALI" in the Unicode naming tables, any disunification proposal
will be a non-starter.  Hence the criticism.  We should strive to keep
any criticism constructive rather than derisive.  If I'm not mistaken,
the character naming for this script was inherited from the ISCII
standard, so it was the Indian government's convention.  I believe
most English speakers aware of the script call it Bengali.

https://en.wikipedia.org/wiki/Eastern_Nagari_script

?  U+E0001 LANGUAGE TAG
?
? ...
?
? There is a note in the code chart.
?
?  >> This character is deprecated, and its use is strongly
?  discouraged.
?
? It does not say by whom it is discouraged though nor why.

The reason people shouldn't use it is because it is deprecated.  It
was originally deprecated because people shouldn't use it.

Arguably, a plain-text computer character encoding standard which is
language-neutral does not need a language tagging mechanism.  By
encoding scripts rather than languages, Unicode ensures that the data
is legible in plain-text.  If the recipient of an untagged plain-text
file doesn't know the language well enough to recognize it, then a tag
won't help.  If the recipient wants to translate it anyway, various
on-line translators are fairly sophisticated in language
identification.  If that fails, it's a mystery.  Everybody loves a
mystery.


From unicode at unicode.org  Thu May 17 10:08:19 2018
From: unicode at unicode.org (Martinho Fernandes via Unicode)
Date: Thu, 17 May 2018 17:08:19 +0200
Subject: The Unicode Standard and ISO
Message-ID: <da087414-2640-1b69-a6f1-93b520dd993c@rmf.io>

Hello,

There are several mentions of synchronization with related standards in
unicode.org, e.g. in https://www.unicode.org/versions/index.html, and
https://www.unicode.org/faq/unicode_iso.html. However, all such mentions
never mention anything other than ISO 10646.

I was wondering which ISO standards other than ISO 10646 specify the
same things as the Unicode Standard, and of those, which ones are
actively kept in sync. This would be of importance for standardization
of Unicode facilities in the C++ language (ISO 14882), as reference to
ISO standards is generally preferred in ISO standards.

-- 
Martinho


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <http://unicode.org/pipermail/unicode/attachments/20180517/b4b69f05/attachment.asc>

From unicode at unicode.org  Thu May 17 10:48:40 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Thu, 17 May 2018 17:48:40 +0200
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
Message-ID: <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>


> On 17 May 2018, at 16:47, Garth Wallace via Unicode <unicode at unicode.org> wrote:
> 
> On Thu, May 17, 2018 at 12:41 AM Hans ?berg <haberg-1 at telia.com> wrote:
> 
> > On 17 May 2018, at 08:47, Garth Wallace via Unicode <unicode at unicode.org> wrote:
> > 
> >> On Wed, May 16, 2018 at 12:42 AM, Hans ?berg via Unicode <unicode at unicode.org> wrote:
> >> 
> >> It would be best to encode the SMuFL symbols, which is rather comprehensive and include those:
> >>  https://www.smufl what should be unified.org
> >>  http://www.smufl.org/version/latest/
> >> ...
> >> 
> >> These are otherwise originally the same, but has since drifted. So whether to unify them or having them separate might be best to see what SMuFL does, as they are experts on the issue.
> >> 
> > SMuFL's standards on unification are not the same as Unicode's. For one thing, they re-encode Latin letters and Arabic digits multiple times for various different uses (such as numbers used in tuplets and those used in time signatures).
> 
> The reason is probably because it is intended for use with music engraving, and they should then be rendered differently.
> 
> Exactly. But Unicode would consider these a matter for font switching in rich text.

One original principle was ensure different encodings, so if the practise in music engraving is to keep them different, they might be encoded differently.

> > There are duplicates all over the place, like how the half-sharp symbol is encoded at U+E282 as "accidentalQuarterToneSharpStein", at U+E422 as "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as "accidentalQuarterToneSharpArabic", and at U+E444 as "accidentalKomaSharp". They are graphically identical, and the first three even all mean the same thing, a quarter tone sharp!
> 
> But the tuning system is different, E24 and Pythagorean. Some Latin and Greek uppercase letters are exactly the same but have different encodings.
> 
> Tuning systems are not scripts.

That seems obvious. As I pointed out above, the Arabic glyphs were originally taken from Western ones, but have a different musical meaning, also when played using E12, as some do.


From unicode at unicode.org  Thu May 17 11:43:28 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Thu, 17 May 2018 09:43:28 -0700
Subject: The Unicode Standard and ISO
In-Reply-To: <da087414-2640-1b69-a6f1-93b520dd993c@rmf.io>
References: <da087414-2640-1b69-a6f1-93b520dd993c@rmf.io>
Message-ID: <f31f8390-fe0f-e8ab-5769-e8bff72ca78f@ix.netcom.com>

On 5/17/2018 8:08 AM, Martinho Fernandes via Unicode wrote:
> Hello,
>
> There are several mentions of synchronization with related standards in
> unicode.org, e.g. in https://www.unicode.org/versions/index.html, and
> https://www.unicode.org/faq/unicode_iso.html. However, all such mentions
> never mention anything other than ISO 10646.

Because that is the standard for which there is an explicit 
understanding by all involved
relating to synchronization. There have been occasionally some 
challenging differences
in the process and procedures, but generally the synchronization is 
being maintained,
something that's helped by the fact that so many people are active in 
both arenas.

There are really no other standards where the same is true to the same 
extent.
>
> I was wondering which ISO standards other than ISO 10646 specify the
> same things as the Unicode Standard, and of those, which ones are
> actively kept in sync. This would be of importance for standardization
> of Unicode facilities in the C++ language (ISO 14882), as reference to
> ISO standards is generally preferred in ISO standards.
>
One of the areas the Unicode Standard differs from ISO 10646 is that its 
conception
of a character's identity implicitly contains that character's 
properties - and those are
standardized as well and alongside of just name and serial number.

Many of these properties have associated with them algorithms, e.g. the 
bidi algorithm,
that are an essential element of data interchange: if you don't know 
which order in
the backing store is expected by the recipient to produce a certain 
display order, you
cannot correctly prepare your data.

There is one area where standardization in ISO relates to work in 
Unicode that I can
think of, and that is sorting. However, sorting, beyond the underlying 
framework,
ultimately relates to languages, and language-specific data is now 
housed in CLDR.

Early attempts by ISO to standardize a similar framework for locale data 
failed, in
part because the framework alone isn't the interesting challenge for a 
repository,
instead it is the collection, vetting and management of the data.

The reality is that the ISO model and its organizational structures are 
not well suited
to the needs of many important area where some form of standardization 
is needed.
That's why we have organization like IETF, W3C, Unicode etc..

Duplicating all or even part of their effort inside ISO really serves 
nobody's purpose.

A./


From unicode at unicode.org  Thu May 17 11:43:35 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Thu, 17 May 2018 08:43:35 -0800
Subject: how to make custom combining diacritical marks for arabic letters?
In-Reply-To: <CAPzvyQiaefECoWSQLS8ty9wr1g_KvzzmFeFDsPDTZfvwK0AQ+A@mail.gmail.com>
References: <CAPzvyQiaefECoWSQLS8ty9wr1g_KvzzmFeFDsPDTZfvwK0AQ+A@mail.gmail.com>
Message-ID: <CABPY6Z2KVKbFMRGq3Z+nU0u7de87HDn4gzk2QCRe_QSJ_ZQhag@mail.gmail.com>

This page describes the essentials of OpenType Arabic font development:

https://docs.microsoft.com/en-us/typography/script-development/arabic

From unicode at unicode.org  Thu May 17 11:46:16 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Thu, 17 May 2018 09:46:16 -0700
Subject: Fwd: L2/18-181
In-Reply-To: <CAOcx4CRsfudPAK=eGomFA1fQOyeT2mUXMv6oS_aRdu-RDW-6kw@mail.gmail.com>
References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
 <68766D80-8411-4FDF-8323-DC6C76116642@umich.edu>
 <CAOcx4CRsfudPAK=eGomFA1fQOyeT2mUXMv6oS_aRdu-RDW-6kw@mail.gmail.com>
Message-ID: <0fc8e094-ea1a-4aee-fc84-9ac63f7d7d0e@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180517/4ae5e39d/attachment.html>

From unicode at unicode.org  Thu May 17 12:47:28 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Thu, 17 May 2018 10:47:28 -0700
Subject: L2/18-181
Message-ID: <20180517104728.665a7a7059d7ee80bb4d670165c8327d.5c16fa60d3.wbe@email03.godaddy.com>

Everyone,

I was not serious about this proposal being "fascinating" or in any way
a model for what should happen with the Bengali script.

Please imagine a tongue-in-cheek expression as you re-read my post.
Maybe there is an emoji that depicts this. Maybe I've just been away
from the list too long and forgot that plain text often does not
communicate dry humor effectively.

James Kass wrote:

> We should strive to keep any criticism constructive rather than
> derisive.

Fair enough. My constructive suggestion would be to press vendors to
support Assamese language tools, so that spell-checking, sorting,
transcription, and other language-dependent operations will work
properly, whether or not that was the goal of the proposal. A language
with 15 million native speakers deserves no less.

Regarding keyboards, ?? is a conjunct consisting of three code
points (U+0995, U+09CD, U+09B7) and fits comfortably on a single key
within a standard Windows layout. Indeed, the Assamese keyboards shipped
with Windows since at least 7 already have this key (E06, level 2).
Systems that limit a keystroke to one code point have problems that go
well beyond Assamese.

> If I'm not mistaken, the character naming for this script was
> inherited from the ISCII standard, so it was the Indian government's
> convention.

BIS made a mistake here in failing to distinguish languages, or
language-specific alphabets, from scripts, but it only cost them a
single attribute byte assignment in ISCII. Disunifying Assamese from
Bengali in Unicode would have a much greater impact.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From unicode at unicode.org  Thu May 17 12:53:07 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Thu, 17 May 2018 10:53:07 -0700
Subject: L2/18-181
Message-ID: <20180517105307.665a7a7059d7ee80bb4d670165c8327d.1b0a3c7241.wbe@email03.godaddy.com>

I wrote:
 
> ?? is a conjunct consisting of three code points
 
s/??/???/

--
Doug Ewell | Thornton, CO, US | ewellic.org


From unicode at unicode.org  Thu May 17 13:04:21 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 17 May 2018 19:04:21 +0100
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <20180516223936.32a843d1@JRWUBU2>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
 <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
 <20180514203115.5c093920@JRWUBU2>
 <CABPY6Z2QLQd9dewsNuOpEpYUqSWrSKdT-fbESkE402Ux=qcdNw@mail.gmail.com>
 <CABPY6Z1y2ofnAniGR1tmFzOkwio=s=BLg77MBzSJroU-yeYi+g@mail.gmail.com>
 <20180515235135.7df264c2@JRWUBU2>
 <CABPY6Z0a8T65-V5pZv=ao+b7MyHEg9=mdq1zrtCqzKWdntpjNg@mail.gmail.com>
 <20180516223936.32a843d1@JRWUBU2>
Message-ID: <20180517190421.30f4041f@JRWUBU2>

On Wed, 16 May 2018 22:39:36 +0100
Richard Wordingham via Unicode <unicode at unicode.org> wrote:

> As an
> example of correct rendering, I include the Pali for 'O mind!', _bho
> mano_, encoded  <LOW PHA, SIGN E, SIGN AA, MA, NA, SIGN AA, SIGN E>,
> as rendered by the Lamphun font.

Sorry, wrong sequence, wrong font.  The correct sequence is <LOW PHA,
SIGN E, SIGN AA, MA, NA, SIGN E, SIGN AA>, which is rendered by the
Lamphun font as shown in the attached PNG file.

Richard.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: omind2.png
Type: image/png
Size: 3129 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20180517/215cac5e/attachment.png>

From unicode at unicode.org  Thu May 17 13:12:40 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 17 May 2018 19:12:40 +0100
Subject: how to make custom combining diacritical marks for arabic letters?
In-Reply-To: <CABPY6Z2KVKbFMRGq3Z+nU0u7de87HDn4gzk2QCRe_QSJ_ZQhag@mail.gmail.com>
References: <CAPzvyQiaefECoWSQLS8ty9wr1g_KvzzmFeFDsPDTZfvwK0AQ+A@mail.gmail.com>
 <CABPY6Z2KVKbFMRGq3Z+nU0u7de87HDn4gzk2QCRe_QSJ_ZQhag@mail.gmail.com>
Message-ID: <20180517191240.698aac75@JRWUBU2>

On Thu, 17 May 2018 08:43:35 -0800
James Kass via Unicode <unicode at unicode.org> wrote:

> This page describes the essentials of OpenType Arabic font
> development:
> 
> https://docs.microsoft.com/en-us/typography/script-development/arabic

But isn't the problem that PUA diacritics won't reach most Arabic
shapers?  I think we're back to the vexed issue of defining Unicode
properties for PUA characters to applications.

Richard.

From unicode at unicode.org  Thu May 17 13:43:00 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Thu, 17 May 2018 11:43:00 -0700
Subject: L2/18-181
Message-ID: <20180517114300.665a7a7059d7ee80bb4d670165c8327d.299cb7b1c0.wbe@email03.godaddy.com>

Otto Stolz wrote:

> I wonder how English and French ever could
> be made to use a single script, let alone
> German (???), Icelandic (???), Swedish (???),
> Latvian (???), Chech (???) or ? you name it.

They do use the same script, Latin. They do not use the same alphabet.
Each language has its own language-specific alphabet.

It is the same for Bengali and Assamese, although the language-specific
subsets are called abugidas instead of alphabets.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From unicode at unicode.org  Thu May 17 14:12:55 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 17 May 2018 20:12:55 +0100
Subject: how to make custom combining diacritical marks for arabic letters?
In-Reply-To: <CAPzvyQiaefECoWSQLS8ty9wr1g_KvzzmFeFDsPDTZfvwK0AQ+A@mail.gmail.com>
References: <CAPzvyQiaefECoWSQLS8ty9wr1g_KvzzmFeFDsPDTZfvwK0AQ+A@mail.gmail.com>
Message-ID: <20180517201255.5da51fa5@JRWUBU2>

On Thu, 17 May 2018 09:49:55 +0300
dinar qurbanov via Unicode <unicode at unicode.org> wrote:

> how to make custom combining diacritical marks for arabic letters?
> should only font drivers and programs support it, or should also
> unicode support it, for example, have special area for them?
> 
> as i know, private use area can be used to make combining diacritical
> marks for latin script without problems.
> 
> but when i tried, several years ago, to make that for arabic script,
> with fontforge, i had to use right to left override mark, and manually
> insert beginning, middle, ending forms of arabic letters, and even
> then, my custom marks were not located very properly above letters.

I'm offering suggestions, but I don't that they will work.

The one thing that may help you is that these marks cannot appear in
plain text.  There are a number of things you need to do:

1) Persuade the renderer to treat your character as being a run in a
single script.  You might be able to do this by:

a) Not having any lookups for the Arabic script.

b) Using RLM to persuade the renderer that you have a right-to-left run.

It is just possible that his may fail with OpenType fonts but work
with Graphite or AAT fonts.  If it works, you will then have to
implement all the Arabic shaping yourself.

2) If OpenType fonts will treat the data as a single script run, you
will need to ensure that there is an OpenType substitution feature that
the renderer will support.  Fortunately, many modern text applications
will allow you to force the ccmp feature to be enabled - I have used
such feature forcing with OpenType in LibreOffice and also in HTML,
which renders accordingly in all the modern browsers I have tested - MS
Edge on Windows 10, Firefox and, on iPhones, Safari.  While the ccmp
feature is enabled for the PUA in Firefox, it is disabled in MS Edge on
Windows 10.

3) I believe AAT will soon be available for products using the HarfBuzz
layout engine, so it is likely to become available on Firefox and
LibreOffice.  If AAT looks like a solution, you may need to research the
attitudes of Chrome and OpenOffice, for I believe they have chosen not
to support Graphite.

A totally different solution would be to recompile your application so
that it believes that your diacritics are in the Arabic script.

Richard.

From unicode at unicode.org  Thu May 17 14:18:08 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 17 May 2018 20:18:08 +0100
Subject: L2/18-181
In-Reply-To: <20180517114300.665a7a7059d7ee80bb4d670165c8327d.299cb7b1c0.wbe@email03.godaddy.com>
References: <20180517114300.665a7a7059d7ee80bb4d670165c8327d.299cb7b1c0.wbe@email03.godaddy.com>
Message-ID: <20180517201808.5b0d08e3@JRWUBU2>

On Thu, 17 May 2018 11:43:00 -0700
Doug Ewell via Unicode <unicode at unicode.org> wrote:

> It is the same for Bengali and Assamese, although the
> language-specific subsets are called abugidas instead of alphabets.

If we allow an abugida to be different to an alphasyllabary, then, in
Thailand, Pali has a low brow *alphabet* which is a subset of the Thai
*abugida*.

Richard.

From unicode at unicode.org  Thu May 17 14:23:12 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 17 May 2018 20:23:12 +0100
Subject: L2/18-181
In-Reply-To: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
Message-ID: <20180517202312.32a61900@JRWUBU2>

On Wed, 16 May 2018 13:46:22 -0700
Doug Ewell via Unicode <unicode at unicode.org> wrote:

> http://www.unicode.org/L2/L2018/18181-n4947-assamese.pdf
> 
> This is a fascinating proposal to disunify the Assamese script from
> Bengali on the following bases:

According to the proposal, the encoding for the Assamese writing system
*must* be in the BMP.  As it needs over a 100 characters, the only way
to satisfy the need to be in the BMP is for it to share Bengali
characters.  Hey, that solution is already implemented!

Richard.

From unicode at unicode.org  Thu May 17 17:26:15 2018
From: unicode at unicode.org (Peter Constable via Unicode)
Date: Thu, 17 May 2018 22:26:15 +0000
Subject: The Unicode Standard and ISO
In-Reply-To: <da087414-2640-1b69-a6f1-93b520dd993c@rmf.io>
References: <da087414-2640-1b69-a6f1-93b520dd993c@rmf.io>
Message-ID: <DM5PR2101MB09823A62269ED1E3B4465EF4D5910@DM5PR2101MB0982.namprd21.prod.outlook.com>

ISO character encoding standards are primarily focused on identifying a repertoire of character elements and their code point assignments in some encoding form. ISO developed other, legacy character-encoding standards in the past, but has not done so for over 20 years. All of those legacy standards can be mapped as a bijection to ISO 10646; in regard to character repertoires, they are all proper subsets of ISO 10646. 

Hence, from an ISO perspective, ISO 10646 is the only standard for which on-going synchronization with Unicode is needed or relevant.


Peter

-----Original Message-----
From: Unicode <unicode-bounces at unicode.org> On Behalf Of Martinho Fernandes via Unicode
Sent: Thursday, May 17, 2018 8:08 AM
To: unicode at unicode.org
Subject: The Unicode Standard and ISO

Hello,

There are several mentions of synchronization with related standards in unicode.org, e.g. in https://www.unicode.org/versions/index.html, and https://www.unicode.org/faq/unicode_iso.html. However, all such mentions never mention anything other than ISO 10646.

I was wondering which ISO standards other than ISO 10646 specify the same things as the Unicode Standard, and of those, which ones are actively kept in sync. This would be of importance for standardization of Unicode facilities in the C++ language (ISO 14882), as reference to ISO standards is generally preferred in ISO standards.

--
Martinho


From unicode at unicode.org  Thu May 17 18:29:36 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Fri, 18 May 2018 00:29:36 +0100
Subject: The Unicode Standard and ISO
In-Reply-To: <DM5PR2101MB09823A62269ED1E3B4465EF4D5910@DM5PR2101MB0982.namprd21.prod.outlook.com>
References: <da087414-2640-1b69-a6f1-93b520dd993c@rmf.io>
 <DM5PR2101MB09823A62269ED1E3B4465EF4D5910@DM5PR2101MB0982.namprd21.prod.outlook.com>
Message-ID: <FBCCB5B0-D483-4485-B256-20D8D48F6B0F@evertype.com>

It would be great if mutual synchronization were considered to be of benefit. Some of us in SC2 are not happy that the Unicode Consortium has published characters which are still under Technical ballot. And this did not happen only once.

> On 17 May 2018, at 23:26, Peter Constable via Unicode <unicode at unicode.org> wrote:
> 
> Hence, from an ISO perspective, ISO 10646 is the only standard for which on-going synchronization with Unicode is needed or relevant.


From unicode at unicode.org  Thu May 17 21:59:05 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Thu, 17 May 2018 18:59:05 -0800
Subject: Fwd: L2/18-181
In-Reply-To: <0fc8e094-ea1a-4aee-fc84-9ac63f7d7d0e@ix.netcom.com>
References: <20180516134622.665a7a7059d7ee80bb4d670165c8327d.3430066eef.wbe@email03.godaddy.com>
 <68766D80-8411-4FDF-8323-DC6C76116642@umich.edu>
 <CAOcx4CRsfudPAK=eGomFA1fQOyeT2mUXMv6oS_aRdu-RDW-6kw@mail.gmail.com>
 <0fc8e094-ea1a-4aee-fc84-9ac63f7d7d0e@ix.netcom.com>
Message-ID: <CABPY6Z2xqkrazW6A=Jt4zCwxztk7ANS7iUN68qOteivkKktrUw@mail.gmail.com>

On Thu, May 17, 2018 at 8:46 AM, Asmus Freytag via Unicode
<unicode at unicode.org> wrote:
> On 5/16/2018 3:41 PM, Anshuman Pandey via Unicode wrote:
>
> If folks are interested in a valid proposal for disunification of
> Bengali, please look at the proposal for Tirhuta.
>
> Location?

https://www.unicode.org/L2/L2011/11175r-tirhuta.pdf

I think that's the one, and Tirhuta is now in Unicode.

From unicode at unicode.org  Fri May 18 00:50:38 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Thu, 17 May 2018 21:50:38 -0800
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <20180517190421.30f4041f@JRWUBU2>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
 <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
 <20180514203115.5c093920@JRWUBU2>
 <CABPY6Z2QLQd9dewsNuOpEpYUqSWrSKdT-fbESkE402Ux=qcdNw@mail.gmail.com>
 <CABPY6Z1y2ofnAniGR1tmFzOkwio=s=BLg77MBzSJroU-yeYi+g@mail.gmail.com>
 <20180515235135.7df264c2@JRWUBU2>
 <CABPY6Z0a8T65-V5pZv=ao+b7MyHEg9=mdq1zrtCqzKWdntpjNg@mail.gmail.com>
 <20180516223936.32a843d1@JRWUBU2> <20180517190421.30f4041f@JRWUBU2>
Message-ID: <CABPY6Z1VPRE5ncyVHgTk+nAOON3rimXHNANwRgFAxCMWS691rA@mail.gmail.com>

Richard Wordingham wrote,

? Your example appears to be using the font called 'A Tai Tham KH New'.

Exactly.  The black boxes in the display were becoming tiresome.  The
font package is available from this Tai Tham web page:
http://www.kengtung.org/download-font/

(I'd downloaded a copy of "lamphun.otf", but the installer failed, so
I had to go a-hunting.)

Is it correct to say that the average daily Tai Tham use is already
being more-or-less served by the current state of the fonts and the
USE?  And that many of the problems you are reporting with respect to
things such as mark-to-mark positioning are happening with more exotic
uses of the script, such as the input and display of Pali texts using
the Tai Tham script?

? And how am I supposed to position MAI SAM to the right of the
? rightmost of the level 1 marks above?

Beats me, it's not happening here.  If the GPOS look-up is for (e.g.)
TONE-1 plus MAI SAM, and the string is being re-ordered by the system
to MAI SAM plus TONE-1 before being submitted to the font, then *that*
look-up won't happen.  In which case, change the look-up to accomodate
the re-ordered string.  I suppose you've already tried that.

? The correct sequence is <LOW PHA, SIGN E, SIGN AA, MA, NA, SIGN E,
? SIGN AA>, which is rendered by the Lamphun font as shown in the
? attached PNG file.

???????
To confirm, the NAA ligature isn't happening with the 'A Tai Tham KH New' font.

Changing the entry order to:
???????
<LOW PHA, SIGN E, SIGN AA, MA, NA, SIGN AA, SIGN E>
... forms the NAA ligature and the vowel re-ordering matches the
Lamphun graphic you sent.  But that kludge probably breaks the
preferred encoding model/order.


From unicode at unicode.org  Fri May 18 02:38:27 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Thu, 17 May 2018 23:38:27 -0800
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <CABPY6Z1VPRE5ncyVHgTk+nAOON3rimXHNANwRgFAxCMWS691rA@mail.gmail.com>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
 <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
 <20180514203115.5c093920@JRWUBU2>
 <CABPY6Z2QLQd9dewsNuOpEpYUqSWrSKdT-fbESkE402Ux=qcdNw@mail.gmail.com>
 <CABPY6Z1y2ofnAniGR1tmFzOkwio=s=BLg77MBzSJroU-yeYi+g@mail.gmail.com>
 <20180515235135.7df264c2@JRWUBU2>
 <CABPY6Z0a8T65-V5pZv=ao+b7MyHEg9=mdq1zrtCqzKWdntpjNg@mail.gmail.com>
 <20180516223936.32a843d1@JRWUBU2> <20180517190421.30f4041f@JRWUBU2>
 <CABPY6Z1VPRE5ncyVHgTk+nAOON3rimXHNANwRgFAxCMWS691rA@mail.gmail.com>
Message-ID: <CABPY6Z1nADL8fNYoE1xww5+CcLeo4Zs6GGhsej278tefZmrA9A@mail.gmail.com>

I wrote,

> Changing the entry order to:
> ???????
> <LOW PHA, SIGN E, SIGN AA, MA, NA, SIGN AA, SIGN E>
> ... forms the NAA ligature and the vowel re-ordering matches the
> Lamphun graphic you sent.  But that kludge probably breaks the
> preferred encoding model/order.

On the other hand, do the script users normally input the NAA ligature
sequence first and then add any additional signs or marks?  If the
users consider NAA to be a distinct "letter", then that might explain
why a font developed by a user accomodates the ligation for the string
"NA" + "AA" only when nothing else appears between them.  If, for
example, there's a popular input method or keyboard driver which puts
"NAA" on its own key, then the users will be producing data which is
"NA" plus "AA" plus anything else.


From unicode at unicode.org  Fri May 18 02:57:18 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 18 May 2018 08:57:18 +0100
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <CABPY6Z1VPRE5ncyVHgTk+nAOON3rimXHNANwRgFAxCMWS691rA@mail.gmail.com>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
 <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
 <20180514203115.5c093920@JRWUBU2>
 <CABPY6Z2QLQd9dewsNuOpEpYUqSWrSKdT-fbESkE402Ux=qcdNw@mail.gmail.com>
 <CABPY6Z1y2ofnAniGR1tmFzOkwio=s=BLg77MBzSJroU-yeYi+g@mail.gmail.com>
 <20180515235135.7df264c2@JRWUBU2>
 <CABPY6Z0a8T65-V5pZv=ao+b7MyHEg9=mdq1zrtCqzKWdntpjNg@mail.gmail.com>
 <20180516223936.32a843d1@JRWUBU2> <20180517190421.30f4041f@JRWUBU2>
 <CABPY6Z1VPRE5ncyVHgTk+nAOON3rimXHNANwRgFAxCMWS691rA@mail.gmail.com>
Message-ID: <20180518085718.35402f71@JRWUBU2>

On Thu, 17 May 2018 21:50:38 -0800
James Kass via Unicode <unicode at unicode.org> wrote:

> Richard Wordingham wrote,
> 
> ? Your example appears to be using the font called 'A Tai Tham KH
> New'.
> 
> Exactly.  The black boxes in the display were becoming tiresome.  The
> font package is available from this Tai Tham web page:
> http://www.kengtung.org/download-font/
> 
> (I'd downloaded a copy of "lamphun.otf", but the installer failed, so
> I had to go a-hunting.)

That threatens a long struggle.  The WOFF files work on MS Edge on
Windows 10.  Lamphun (and Da Lekh) depends on the rendering engine for
Indic reordering; more precisely, it relies on dotted circles to know
when reordering has failed.  I don't think it can work possibly via
Uniscribe and DirectWrite.
 
> Is it correct to say that the average daily Tai Tham use is already
> being more-or-less served by the current state of the fonts and the
> USE?

Many fonts depend on bypassing the USE.  I also have a strong suspicion
that they depend on HarfBuzz, though I'll have to recheck what is
happening on iPhones.  I'm only set up there to check what happens with
Safari.

> And that many of the problems you are reporting with respect to
> things such as mark-to-mark positioning are happening with more exotic
> uses of the script, such as the input and display of Pali texts using
> the Tai Tham script?

Since changes to Indic Syllabic category for Unicode 10 unbanned
talk about nirvana (?????? <NA, SIGN I, LOW PA, SIGN LOW PA OR HIGH
RATHA, SIGN AA, NA>, or with TALL AA instead; the vernaculars usually
inserts SAKOT before the second NA) and Tai Khuen (and Tau Lue?) monks'
names -in -dhammo (-?????), the USE should have supported
uncontracted Pali. Pali is simple, though inter-Indic is complicated by
subscript forms not encoded with SAKOT. (There may be a similar
complication with the Myanmar script.)

The complications primarily come with writing the vernacular.
 
> ? And how am I supposed to position MAI SAM to the right of the
> ? rightmost of the level 1 marks above?

> Beats me, it's not happening here.  If the GPOS look-up is for (e.g.)
> TONE-1 plus MAI SAM, and the string is being re-ordered by the system
> to MAI SAM plus TONE-1 before being submitted to the font, then *that*
> look-up won't happen.  In which case, change the look-up to accomodate
> the re-ordered string.  I suppose you've already tried that.

What makes you think the USE tries to address such matters?  If the
developers had made the time to find out about such details (I think
their money tree must have died), we wouldn't have a problem with CVC
askharas.  Also, the USE prohibits spelling where this rearrangement is
desirable.  Hariphunchai and therefore Lamphun addresses the
positioning by having a separate position for MAI SAM, but that doesn't
work well when there is a top vowel in the syllable.

Now, the use of MAI SAM to indicate elision, as opposed to duplication
at the word or syllable level, is somewhat 'exotic'; many writers don't
do it.  It's the use to indicate elision, written in accordance with
the accepted proposal, that the USE prohibits.

> ? The correct sequence is <LOW PHA, SIGN E, SIGN AA, MA, NA, SIGN E,
> ? SIGN AA>, which is rendered by the Lamphun font as shown in the
> ? attached PNG file.
> 
> ???????
> To confirm, the NAA ligature isn't happening with the 'A Tai Tham KH
> New' font.
> 
> Changing the entry order to:
> ???????
> <LOW PHA, SIGN E, SIGN AA, MA, NA, SIGN AA, SIGN E>
> ... forms the NAA ligature and the vowel re-ordering matches the
> Lamphun graphic you sent.  But that kludge probably breaks the
> preferred encoding model/order.

Exactly.

Richard.


From unicode at unicode.org  Fri May 18 17:06:18 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 18 May 2018 23:06:18 +0100
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <CABPY6Z1nADL8fNYoE1xww5+CcLeo4Zs6GGhsej278tefZmrA9A@mail.gmail.com>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
 <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
 <20180514203115.5c093920@JRWUBU2>
 <CABPY6Z2QLQd9dewsNuOpEpYUqSWrSKdT-fbESkE402Ux=qcdNw@mail.gmail.com>
 <CABPY6Z1y2ofnAniGR1tmFzOkwio=s=BLg77MBzSJroU-yeYi+g@mail.gmail.com>
 <20180515235135.7df264c2@JRWUBU2>
 <CABPY6Z0a8T65-V5pZv=ao+b7MyHEg9=mdq1zrtCqzKWdntpjNg@mail.gmail.com>
 <20180516223936.32a843d1@JRWUBU2> <20180517190421.30f4041f@JRWUBU2>
 <CABPY6Z1VPRE5ncyVHgTk+nAOON3rimXHNANwRgFAxCMWS691rA@mail.gmail.com>
 <CABPY6Z1nADL8fNYoE1xww5+CcLeo4Zs6GGhsej278tefZmrA9A@mail.gmail.com>
Message-ID: <20180518230618.4ba5a033@JRWUBU2>

On Thu, 17 May 2018 23:38:27 -0800
James Kass via Unicode <unicode at unicode.org> wrote:

> I wrote,
> 
> > Changing the entry order to:
> > ???????
> > <LOW PHA, SIGN E, SIGN AA, MA, NA, SIGN AA, SIGN E>
> > ... forms the NAA ligature and the vowel re-ordering matches the
> > Lamphun graphic you sent.  But that kludge probably breaks the
> > preferred encoding model/order.  
> 
> On the other hand, do the script users normally input the NAA ligature
> sequence first and then add any additional signs or marks?  If the
> users consider NAA to be a distinct "letter", then that might explain
> why a font developed by a user accomodates the ligation for the string
> "NA" + "AA" only when nothing else appears between them.  If, for
> example, there's a popular input method or keyboard driver which puts
> "NAA" on its own key, then the users will be producing data which is
> "NA" plus "AA" plus anything else.

There was a keyboard map in the zip file that you may have got the font
from,
http://www.kengtung.org/font-download/Tai-Tham-Unicode-for-PC.zip .  It
has three key symbols per key - plan, shift and capslock.  All the
combinations correspond to a single character.

There's also a zip file for a non-Unicode font,
http://www.kengtung.org/font-download/Tai-Tham-Non-Unicode-for-PC.zip
and that has a corresponding keyboard.  Now, while I haven't looked at
the font, it looks like a direct key to glyph mapping, and as I would
have expected from the pre-Unicode Wat Inn hack encoding, the English
key stroke for 'o' (the key stroke for THAI CHARACTER NO NU) yields NA
and the key stroke for 'O' yields the NAA ligature.  I may be wrong
about the relationship - the top vowel + tone ligatures seem to be
missing from the keyboard.

So, the evidence is ambiguous.

The dictionaries I have seen do not treat NAA as an indivisible
character - NAA plus subscript is treated differently depending on
whether the subscript phonetically precedes or follows the subscript
consonant.  However, the rule that homorganic subscript precedes and
others follow the vowel works pretty well.

Now, the chanting of Pali declensions, if related to writing, should
bring home via the participles in -nt- that there is a close
relationship between <NA, subscript HIGH TA> and <NAA, subscript HIGH
TA>.  It would be interesting to see how often ligation fails in
participles.

However, I think there is a different explanation for the sequence.
There are suggestions around that aksharas should be encoded with left
matras in second place.  This makes it easier for fonts.  I think we're
seeing an encoding based on ease of font design.  Now, one doesn't need
this.  If feature ss02 is enabled, the fonts of my Da Lekh family will
convert a transliteration of Tai Tham letters, numbers and marks to
ASCII back to the original Tai Tham text.  All I need is a feature
activation, which ASCII is normally has the privilege of receiving.  I
believe I could do it all by ccmp, but this feature is a fall back for
when the renderer does not support Tai Tham.

At present, Tai Tham seems to be in grave danger of breaking up into a
number of font encodings - one chooses the rendering system, and that
determines the allowed sequences, even for fairly simple words.  The
Xishuangbanna News appears to be using a visual order encoding.  I
suspect this works because syllables are separated by spaces, so they
don't have to worry about Indic rearrangement being applied despite the
lack of lookups for OTL script "lana".

Richard.


From unicode at unicode.org  Fri May 18 22:00:20 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 19 May 2018 04:00:20 +0100
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <CABPY6Z2QLQd9dewsNuOpEpYUqSWrSKdT-fbESkE402Ux=qcdNw@mail.gmail.com>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
 <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
 <20180514203115.5c093920@JRWUBU2>
 <CABPY6Z2QLQd9dewsNuOpEpYUqSWrSKdT-fbESkE402Ux=qcdNw@mail.gmail.com>
Message-ID: <20180519040020.5f223dd8@JRWUBU2>

On Tue, 15 May 2018 04:19:42 -0800
James Kass via Unicode <unicode at unicode.org> wrote:

> On Mon, May 14, 2018 at 11:31 AM, Richard Wordingham via Unicode
> <unicode at unicode.org> wrote:

> > I've seen an implementation of the USE render
> > canonically equivalent strings differently.  ...  

> Because the USE failed or because the font provided look-ups for each
> of those strings to different glyphs?

Unless I haven't picked up a recent change, neither Microsoft (by
evidence of MS Edge) nor Apple (by evidence of Safari in iOS 10.3.2)
normalises Tai Tham text. <Tone, SAKOT> gets just one dotted circle,
while Apple and Microsoft award a dotted circle to each mark in the
canonically equivalent <SAKOT, tone>.  Not many fonts handle two dotted
circles - subscript formation has to work in the context <DOTTED
CIRCLE, SAKOT, DOTTED CIRCLE, tone, base>.  There's also the formal
problem that <DOTTED CIRCLE, SAKOT, DOTTED CIRCLE, tone> is actually a
legitimate sequence in the backing store. The defence to a charge of
violating the character identity of DOTTED CIRCLE would be to say that
such sequences are not supported - a renderer is not required to
support all strings!

Incidentally, I've fixed the Lamphun font; it will now install in
Windows 10.  TTX found ways to reduce its size by 10%.  While it should
work for most text, there are a few sequences that aren't handled
properly. These are issues that pertain to the font domain, not the
domain of the rendering engine.

Richard.

From unicode at unicode.org  Sat May 19 05:22:45 2018
From: unicode at unicode.org (dinar qurbanov via Unicode)
Date: Sat, 19 May 2018 13:22:45 +0300
Subject: how to make custom combining diacritical marks for arabic letters?
In-Reply-To: <20180517201255.5da51fa5@JRWUBU2>
References: <CAPzvyQiaefECoWSQLS8ty9wr1g_KvzzmFeFDsPDTZfvwK0AQ+A@mail.gmail.com>
 <20180517201255.5da51fa5@JRWUBU2>
Message-ID: <CAPzvyQgMGuhUfuwKM5VS3=1585=uFiyA0CjTyD1WHYK5h2HRuw@mail.gmail.com>

this is a test i made that time: http://tmf.org.ru/arabic.html . look
at second line. my custom mark is located too left on the most left
"B", and is located too right on the middle (that is of middle form of
B) and on the most righ "B" (that is of starter form of B). it should
be located right above the below dot.

- this was the problem that i could not solve.

also there are problems that i could solve by using 1) rtl override
mark; 2) and using start, middle, end, separate B characters instead
of using simple arabic B, that would be easier. (you can see in the
example that that characters are used). (using different forms of
letter can also be achieved by using php or javascript, etc).


2018-05-17 22:12 GMT+03:00 Richard Wordingham via Unicode <unicode at unicode.org>:
> On Thu, 17 May 2018 09:49:55 +0300
> dinar qurbanov via Unicode <unicode at unicode.org> wrote:
>
>> how to make custom combining diacritical marks for arabic letters?
>> should only font drivers and programs support it, or should also
>> unicode support it, for example, have special area for them?
>>
>> as i know, private use area can be used to make combining diacritical
>> marks for latin script without problems.
>>
>> but when i tried, several years ago, to make that for arabic script,
>> with fontforge, i had to use right to left override mark, and manually
>> insert beginning, middle, ending forms of arabic letters, and even
>> then, my custom marks were not located very properly above letters.
>
> I'm offering suggestions, but I don't that they will work.
>
> The one thing that may help you is that these marks cannot appear in
> plain text.  There are a number of things you need to do:
>
> 1) Persuade the renderer to treat your character as being a run in a
> single script.  You might be able to do this by:
>
> a) Not having any lookups for the Arabic script.
>
> b) Using RLM to persuade the renderer that you have a right-to-left run.
>
> It is just possible that his may fail with OpenType fonts but work
> with Graphite or AAT fonts.  If it works, you will then have to
> implement all the Arabic shaping yourself.
>
> 2) If OpenType fonts will treat the data as a single script run, you
> will need to ensure that there is an OpenType substitution feature that
> the renderer will support.  Fortunately, many modern text applications
> will allow you to force the ccmp feature to be enabled - I have used
> such feature forcing with OpenType in LibreOffice and also in HTML,
> which renders accordingly in all the modern browsers I have tested - MS
> Edge on Windows 10, Firefox and, on iPhones, Safari.  While the ccmp
> feature is enabled for the PUA in Firefox, it is disabled in MS Edge on
> Windows 10.
>
> 3) I believe AAT will soon be available for products using the HarfBuzz
> layout engine, so it is likely to become available on Firefox and
> LibreOffice.  If AAT looks like a solution, you may need to research the
> attitudes of Chrome and OpenOffice, for I believe they have chosen not
> to support Graphite.
>
> A totally different solution would be to recompile your application so
> that it believes that your diacritics are in the Arabic script.
>
> Richard.

From unicode at unicode.org  Sat May 19 22:33:20 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sat, 19 May 2018 19:33:20 -0800
Subject: Choosing the Set of Renderable Strings
In-Reply-To: <20180519040020.5f223dd8@JRWUBU2>
References: <25466389.4557.1526283733905.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk>
 <1805937.5572.1526284505886.JavaMail.defaultUser@defaultHost>
 <CABPY6Z2M20iHGN1YX-=88yYG00FPirZW_C55yno7dkVYsaFF3g@mail.gmail.com>
 <20180514203115.5c093920@JRWUBU2>
 <CABPY6Z2QLQd9dewsNuOpEpYUqSWrSKdT-fbESkE402Ux=qcdNw@mail.gmail.com>
 <20180519040020.5f223dd8@JRWUBU2>
Message-ID: <CABPY6Z3_E2p8UYwpO++4suWuxMJNEPA40prC-6DhM4F_vi8cCw@mail.gmail.com>

Richard Wordingham wrote,

> Incidentally, I've fixed the Lamphun font; it will now install in Windows 10.

Confirming successful installation on Windows 7.

From unicode at unicode.org  Tue May 22 05:51:33 2018
From: unicode at unicode.org (Martinho Fernandes via Unicode)
Date: Tue, 22 May 2018 12:51:33 +0200
Subject: Extended grapheme cluster stability
Message-ID: <db1f58b8-210c-75ca-b5f9-a4e3e2204756@rmf.io>

Hello,

None of the *_Break properties are stable, as far as I can see in
https://www.unicode.org/policies/stability_policy.html. If I understand
correctly, this means that, at least in theory, it is possible that in
Unicode version X a sequence of characters AB forms an extended grapheme
cluster, i.e. A ? B in the notation used in the algorithm description
and in the test data, but then in Unicode version X+1, that changes to A
? B.

Am I reading this correctly or is this not possible? Or is it possible
in theory but not in practice? Or maybe it has happened before?

-- 
Martinho


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <http://unicode.org/pipermail/unicode/attachments/20180522/cb6d3b7f/attachment.asc>

From unicode at unicode.org  Tue May 22 07:43:23 2018
From: unicode at unicode.org (Martinho Fernandes via Unicode)
Date: Tue, 22 May 2018 14:43:23 +0200
Subject: Extended grapheme cluster stability
In-Reply-To: <db1f58b8-210c-75ca-b5f9-a4e3e2204756@rmf.io>
References: <db1f58b8-210c-75ca-b5f9-a4e3e2204756@rmf.io>
Message-ID: <8ff7f43b-f7dd-35ec-9da4-c7d770e18414@rmf.io>

On 22.05.18 12:51, Martinho Fernandes via Unicode wrote:

> Hello,
>
> None of the *_Break properties are stable, as far as I can see in
> https://www.unicode.org/policies/stability_policy.html. If I understand
> correctly, this means that, at least in theory, it is possible that in
> Unicode version X a sequence of characters AB forms an extended grapheme
> cluster, i.e. A ? B in the notation used in the algorithm description
> and in the test data, but then in Unicode version X+1, that changes to A
> ? B.
>
> Am I reading this correctly or is this not possible? Or is it possible
> in theory but not in practice? Or maybe it has happened before?
>
Hmm, to answer my own question, yes, this has happened before. In
Unicode 8 there were no breaks between regional indicators. In Unicode 9
now there are no breaks "between regional indicator (RI) symbols if
there is an odd number of RI characters before the break point". I has
also happened in the direction break=>no break, with when emoji ZWJ
sequences were introduced.

-- 
Martinho


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <http://unicode.org/pipermail/unicode/attachments/20180522/c44e1b86/attachment.asc>

From unicode at unicode.org  Tue May 22 13:27:17 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 22 May 2018 19:27:17 +0100
Subject: Extended grapheme cluster stability
In-Reply-To: <8ff7f43b-f7dd-35ec-9da4-c7d770e18414@rmf.io>
References: <db1f58b8-210c-75ca-b5f9-a4e3e2204756@rmf.io>
 <8ff7f43b-f7dd-35ec-9da4-c7d770e18414@rmf.io>
Message-ID: <20180522192717.52a29289@JRWUBU2>

On Tue, 22 May 2018 14:43:23 +0200
Martinho Fernandes via Unicode <unicode at unicode.org> wrote:

> On 22.05.18 12:51, Martinho Fernandes via Unicode wrote:
> 
> > Hello,
> >
> > None of the *_Break properties are stable, as far as I can see in
> > https://www.unicode.org/policies/stability_policy.html. If I
> > understand correctly, this means that, at least in theory, it is
> > possible that in Unicode version X a sequence of characters AB
> > forms an extended grapheme cluster, i.e. A ? B in the notation used
> > in the algorithm description and in the test data, but then in
> > Unicode version X+1, that changes to A ? B.
> >
> > Am I reading this correctly or is this not possible? Or is it
> > possible in theory but not in practice? Or maybe it has happened
> > before? 
> Hmm, to answer my own question, yes, this has happened before. In
> Unicode 8 there were no breaks between regional indicators. In
> Unicode 9 now there are no breaks "between regional indicator (RI)
> symbols if there is an odd number of RI characters before the break
> point". I has also happened in the direction break=>no break, with
> when emoji ZWJ sequences were introduced.

These are more refinements of the algorithm than fundamental changes.
However, many of the breaks are inherently uncertain and may therefore
be tailored.

English has uncertainties as to word boundaries, but the author's
decision is represented in writing, e.g. 'beam width' v. 'beamwidth'.
In writing systems without visible boundaries between words, such as
Thai, such vacillation could occur between software versions rather
than between version of Unicode.

Line break opportunities can in practice vacillate in such writing
systems, e.g. between breaks at syllable boundaries and breaks at word
boundaries.

Formal extended grapheme cluster boundaries have varied in normal,
well established text.  In Thai, left matras and consonants were
briefly part of the same grapheme cluster.  When that formal property
was implemented in editors, there were howls of pain from Thailand,
and the change was promptly reversed.

I do not believe one rules suits all Indic consonant clusters.  While
splitting

X virama | Y

makes sense for Devanagari with its half-forms, 

X | coeng Y

makes no sense for scripts where it is the second consonant that
changes shape.  It makes even less sense when some combinations of
'coeng Y' are encoded separately, as in mainland SE Asia.  These
combinations are categorised as marks.

In Burma, the syllable boundary comes after U+1A58 TAI THAM SIGN MAI
KANG LAI. In Laos, it comes before it.

We came very close to extended grapheme clusters being extended to
whole aksharas in Unicode 11.0.  My view is that Unicode has
attempted to conflate several concepts in grapheme cluster, and it
just doesn't work.   

Richard.


From unicode at unicode.org  Tue May 22 16:48:56 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Tue, 22 May 2018 14:48:56 -0700
Subject: Extended grapheme cluster stability
In-Reply-To: <8ff7f43b-f7dd-35ec-9da4-c7d770e18414@rmf.io>
References: <db1f58b8-210c-75ca-b5f9-a4e3e2204756@rmf.io>
 <8ff7f43b-f7dd-35ec-9da4-c7d770e18414@rmf.io>
Message-ID: <b99572f1-f2a8-26ad-cef4-78e69824a895@ix.netcom.com>

On thing to bear in mind about breaks: Unicode is plain-text and not 
"final rendered text".

Many types of breaks depend on things like actual font selection, column 
width and other factors determined by styling. They are therefore not 
necessarily stable from a plain text perspective (the same goes for 
things not specified by Unicode, like hyphenation, because hyphenation, 
for example, depends on the actual language associated with a text, 
something not part of the plain text back-bone).

The moral is that if you need a frozen representation of text that does 
not behave differently if accessed, iterated, viewed etc. at different 
times, you need to have some kind of rich-text format that can represent 
all segmentation choices. If, on the other hand, you are doing a live 
interaction with the text, then Unicode segmentation gives you the "best 
available" algorithm - which may change over time as new information 
becomes available about what constitutes best practice.

For many writing systems, the understanding of best practice is still 
quite limited at this point - in the sense that even if it is known, it 
is not widely available and therefore there has not yet been a chance to 
validate and standardize it. (Setting aside areas of actual innovation, 
like emoji). For these reasons, it would be outright detrimental if any 
of these algorithms are "frozen" -- however, the hope is that updates 
are handled with some sensitivity to avoid unnecessary disruption of 
settled practice.

A./


On 5/22/2018 5:43 AM, Martinho Fernandes via Unicode wrote:
> On 22.05.18 12:51, Martinho Fernandes via Unicode wrote:
>
>> Hello,
>>
>> None of the *_Break properties are stable, as far as I can see in
>> https://www.unicode.org/policies/stability_policy.html. If I understand
>> correctly, this means that, at least in theory, it is possible that in
>> Unicode version X a sequence of characters AB forms an extended grapheme
>> cluster, i.e. A ? B in the notation used in the algorithm description
>> and in the test data, but then in Unicode version X+1, that changes to A
>> ? B.
>>
>> Am I reading this correctly or is this not possible? Or is it possible
>> in theory but not in practice? Or maybe it has happened before?
>>
> Hmm, to answer my own question, yes, this has happened before. In
> Unicode 8 there were no breaks between regional indicators. In Unicode 9
> now there are no breaks "between regional indicator (RI) symbols if
> there is an odd number of RI characters before the break point". I has
> also happened in the direction break=>no break, with when emoji ZWJ
> sequences were introduced.
>


From unicode at unicode.org  Wed May 23 10:53:35 2018
From: unicode at unicode.org (Abe Voelker via Unicode)
Date: Wed, 23 May 2018 10:53:35 -0500
Subject: =?UTF-8?Q?Major_vendors_changing_U=2B1F52B_PISTOL_=F0=9F=94=AB_depiction?=
 =?UTF-8?Q?_from_firearm_to_squirt_gun?=
Message-ID: <CAGN6hM5Usod=mo8zPVkkUPMR8Od6FZHDip01PAFfAc3JSKQ+bQ@mail.gmail.com>

Hello,

I'm curious if there has been any discussion on all the major vendors
changing this emoji's depiction? (
https://blog.emojipedia.org/all-major-vendors-commit-to-gun-redesign/)

As a user I find it troublesome because previous messages I've sent using
this character on these platforms may now be interpreted differently due to
the changed representation. That aspect has me wondering if this change is
in line with Unicode standard conformance requirements.

Regards,
Abe Voelker
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180523/d05801ba/attachment.html>

From unicode at unicode.org  Wed May 23 12:08:31 2018
From: unicode at unicode.org (via Unicode)
Date: Wed, 23 May 2018 20:08:31 +0300
Subject: =?UTF-8?Q?VS:_Major_vendors_changing_U+1F5?=
 =?UTF-8?Q?2B_PISTOL_=F0=9F=94=AB_depiction_from_fire?=
 =?UTF-8?Q?arm_to_squirt_gun?=
In-Reply-To: <CAGN6hM5Usod=mo8zPVkkUPMR8Od6FZHDip01PAFfAc3JSKQ+bQ@mail.gmail.com>
References: <CAGN6hM5Usod=mo8zPVkkUPMR8Od6FZHDip01PAFfAc3JSKQ+bQ@mail.gmail.com>
Message-ID: <002301d3f2b8$ae413690$0ac3a3b0$@iki.fi>

I?d treat these as glyph changes within fonts.

 
Sincerely

 
Erkki I. Kolehmainen

 
L?hett?j?: Unicode <unicode-bounces at unicode.org> Puolesta Abe Voelker via Unicode
L?hetetty: keskiviikko 23. toukokuuta 2018 18.54
Vastaanottaja: unicode at unicode.org
Aihe: Major vendors changing U+1F52B PISTOL ?? depiction from firearm to squirt gun

 
Hello,

 
I'm curious if there has been any discussion on all the major vendors changing this emoji's depiction? (https://blog.emojipedia.org/all-major-vendors-commit-to-gun-redesign/)

 
As a user I find it troublesome because previous messages I've sent using this character on these platforms may now be interpreted differently due to the changed representation. That aspect has me wondering if this change is in line with Unicode standard conformance requirements.

 
Regards,

Abe Voelker

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180523/1cef0b45/attachment.html>

From unicode at unicode.org  Wed May 23 12:49:31 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 23 May 2018 18:49:31 +0100
Subject: Major vendors changing U+1F52B PISTOL =?UTF-8?B?8J+Uqw==?=
 depiction from firearm to squirt gun
In-Reply-To: <002301d3f2b8$ae413690$0ac3a3b0$@iki.fi>
References: <CAGN6hM5Usod=mo8zPVkkUPMR8Od6FZHDip01PAFfAc3JSKQ+bQ@mail.gmail.com>
 <002301d3f2b8$ae413690$0ac3a3b0$@iki.fi>
Message-ID: <20180523184931.2c5840f7@JRWUBU2>

On Wed, 23 May 2018 20:08:31 +0300
via Unicode <unicode at unicode.org> wrote:

> I?d treat these as glyph changes within fonts.

I'd treat them as gross violations of character identity.

Richard.


From unicode at unicode.org  Wed May 23 12:59:02 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Wed, 23 May 2018 10:59:02 -0700
Subject: =?UTF-8?Q?Re:_Major_vendors_changing_U+1F52B_PISTOL_=f0=9f=94=ab_de?=
 =?UTF-8?Q?piction_from_firearm_to_squirt_gun?=
In-Reply-To: <CAGN6hM5Usod=mo8zPVkkUPMR8Od6FZHDip01PAFfAc3JSKQ+bQ@mail.gmail.com>
References: <CAGN6hM5Usod=mo8zPVkkUPMR8Od6FZHDip01PAFfAc3JSKQ+bQ@mail.gmail.com>
Message-ID: <0503725b-afc9-73dd-e62e-fe2d3740f7c6@att.net>


On 5/23/2018 8:53 AM, Abe Voelker via Unicode wrote:
> As a user I find it troublesome because previous messages I've sent 
> using this character on these platforms may now be interpreted 
> differently due to the changed representation. That aspect has me 
> wondering if this change is in line with Unicode standard conformance 
> requirements.
>

The Unicode Standard publishes only *text presentation* (black and 
white) representative glyphs for emoji characters. And those text 
presentation glyphs have been quite stable in the standard. For U+1F52B 
PISTOL, the glyph currently published in Unicode 10.0 (and the one which 
will be published imminently in Unicode 11.0) is precisely the same as 
the glyph that was initially published nearly 8 years ago in Unicode 
6.0. Care to check up on that?

Unicode 6.0: https://www.unicode.org/charts/PDF/Unicode-6.0/U60-1F300.pdf

Unicode 11.0: https://www.unicode.org/charts/PDF/Unicode-11.0/U110-1F300.pdf

What vendors do for their colorful *emoji presentation* glyphs is 
basically outside the scope of the Unicode Standard. Technically, it is 
outside the scope even of the separate Unicode Technical Standard #51, 
Unicode Emoji, which specifies data, behavior, and other mechanisms for 
promoting interoperability and valid interchange of emoji characters and 
emoji sequences, but which does *not* try to constrain vendors in their 
emoji glyph designs.

Now, sure, nobody wants their emoji for an avocado, to willy-nilly turn 
into a completely unrelated emoji for a crying face. But many emoji are 
deliberately vague in their scope of denotation and connotation, and the 
vendors have a lot a leeway to design little images that they like and 
their customers like. And the Unicode Standard does not now and probably 
never will try to define and enforce precise semantics and usage rules 
for every single emoji character.

Basically, it is a fool's game to be using emoji as if they were a 
well-defined and standardized pictographic orthography with unchanging 
semantics. If you want stable presentation of content, use a pdf 
document or an image. If you want stable and accurate conveyance of 
particular meaning -- well, write it out in the standard orthography of 
a particular language. If you want playful and emotional little 
pictographs accompanying text, well, then don't expect either stability 
of the images or the meaning, because that isn't how emoji work. Case in 
point: if you are using U+1F351 PEACH for its well-known resemblance to 
a bum, well, don't complain to the Unicode Consortium if a phone vendor 
changes the meaning of your message by redesigning its emoji glyph for 
U+1F351 to a cut peach slice that more resembles a smile.

--Ken


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180523/fbc55ea5/attachment.html>

From unicode at unicode.org  Wed May 23 13:00:33 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Wed, 23 May 2018 19:00:33 +0100
Subject: =?utf-8?Q?Re=3A_Major_vendors_changing_U+1F52B_PISTOL_?=
 =?utf-8?Q?=F0=9F=94=AB_depiction_from_firearm_to_squirt_gun?=
In-Reply-To: <002301d3f2b8$ae413690$0ac3a3b0$@iki.fi>
References: <CAGN6hM5Usod=mo8zPVkkUPMR8Od6FZHDip01PAFfAc3JSKQ+bQ@mail.gmail.com>
 <002301d3f2b8$ae413690$0ac3a3b0$@iki.fi>
Message-ID: <3F9180F7-03CA-4636-9816-742589F63720@evertype.com>

I consider it a significant semantic shift from the intended meaning of the character in the source Japanese character set. 

Michael Everson

From unicode at unicode.org  Wed May 23 14:55:17 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 23 May 2018 20:55:17 +0100
Subject: Major vendors changing U+1F52B PISTOL =?UTF-8?B?8J+Uqw==?=
 depiction from firearm to squirt gun
In-Reply-To: <0503725b-afc9-73dd-e62e-fe2d3740f7c6@att.net>
References: <CAGN6hM5Usod=mo8zPVkkUPMR8Od6FZHDip01PAFfAc3JSKQ+bQ@mail.gmail.com>
 <0503725b-afc9-73dd-e62e-fe2d3740f7c6@att.net>
Message-ID: <20180523205517.1ad64f0a@JRWUBU2>

On Wed, 23 May 2018 10:59:02 -0700
Ken Whistler via Unicode <unicode at unicode.org> wrote:

> If you want stable and accurate
> conveyance of particular meaning -- well, write it out in the
> standard orthography of a particular language.

Preferably not of a living language, though even the semantics of a dead
language can wobble.

Richard.

From unicode at unicode.org  Wed May 23 20:18:10 2018
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Thu, 24 May 2018 10:18:10 +0900
Subject: =?UTF-8?Q?Re:_Major_vendors_changing_U+1F52B_PISTOL_=f0=9f=94=ab_de?=
 =?UTF-8?Q?piction_from_firearm_to_squirt_gun?=
In-Reply-To: <3F9180F7-03CA-4636-9816-742589F63720@evertype.com>
References: <CAGN6hM5Usod=mo8zPVkkUPMR8Od6FZHDip01PAFfAc3JSKQ+bQ@mail.gmail.com>
 <002301d3f2b8$ae413690$0ac3a3b0$@iki.fi>
 <3F9180F7-03CA-4636-9816-742589F63720@evertype.com>
Message-ID: <90d61d43-db51-89dc-82d8-d2b6de8b2dba@it.aoyama.ac.jp>

On 2018/05/24 03:00, Michael Everson via Unicode wrote:
> I consider it a significant semantic shift from the intended meaning of the character in the source Japanese character set.

Yes and no. I'd consider the semantic shift from a real pistol in a 
Japanese message to a real pistol in a message in the US quite significant.

The former, except for some extremely small and marginal segment of 
Japanese society, essentially has no "I might shoot you" implications at 
all. In the later case, that may be quite a bit different.

I'm not saying the (glyph or whatever you call it) change was okay. But 
when talking about semantics, it's important to not only consider 
surface semantics, but also the overall context.

Regards,   Martin.

From unicode at unicode.org  Thu May 24 15:28:51 2018
From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode)
Date: Thu, 24 May 2018 22:28:51 +0200 (CEST)
Subject: =?UTF-8?Q?Re:_Major_vendors_changing_U+1F52B_PISTO?=
 =?UTF-8?Q?L_=F0=9F=94=AB_depiction_from_firearm_to_squirt_gun?=
In-Reply-To: <CAGN6hM5Usod=mo8zPVkkUPMR8Od6FZHDip01PAFfAc3JSKQ+bQ@mail.gmail.com>
References: <CAGN6hM5Usod=mo8zPVkkUPMR8Od6FZHDip01PAFfAc3JSKQ+bQ@mail.gmail.com>
Message-ID: <1638057770.543374.1527193731416@ox.hosteurope.de>

Abe Voelker:
> 
> I'm curious if there has been any discussion on all the major vendors
> changing this emoji's depiction? (
> https://blog.emojipedia.org/all-major-vendors-commit-to-gun-redesign/)

Curiously, this happened right before UTC 155 in a possibly concerted (but at least not independent) manner by Twitter and Google at least. My comments on PRI 356 (UTS 51.11) from 17 April already seem outdated. 

<https://www.unicode.org/review/pri356/feedback.html>

>From the single-line feedback I've received, it seems the issue has not been discussed at the meeting in late April. (I've yet to review the minutes.) I'm suggesting ZWJ sequences to distinguish between a firearm  (????) and a toy (????) for PISTOL. This does not solve the valid compatibility concerns.

> As a user I find it troublesome because previous messages I've sent using
> this character on these platforms may now be interpreted differently due to
> the changed representation.

We must discourage the perception that emojis are only used in volatile text messages (often in walled-garden systems) and tweets. They also appear in texts that are meant to be read in the future as well.


From unicode at unicode.org  Sat May 26 16:58:54 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sat, 26 May 2018 23:58:54 +0200
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
Message-ID: <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>

Even flat notes or rythmic and pause symbols in Western musical notations
have different contextual meaning depending on musical keys at start of
scores, and other notations or symbols added above the score. So their
interpretation are also variable according to context, just like tuning in
a Arabic musical score, which is also keyed and annotated differently.
These keys can also change within the same partition score.
So both the E12 vs. E24 systems (which are not incompatible) may also be
used in Western and Arabic music notations. The score keys will give the
interpretation.
Tone marks taken isolately mean absolutely nothing in both systems outside
the keyed scores in which they are inserted, except that they are just
glyphs, which may be used to mean something else (e.g. a note in a comics
artwork could be used to denote someone whistling, without actually
encoding any specific tone, or rythmic).


2018-05-17 17:48 GMT+02:00 Hans ?berg via Unicode <unicode at unicode.org>:

>
>
> > On 17 May 2018, at 16:47, Garth Wallace via Unicode <unicode at unicode.org>
> wrote:
> >
> > On Thu, May 17, 2018 at 12:41 AM Hans ?berg <haberg-1 at telia.com> wrote:
> >
> > > On 17 May 2018, at 08:47, Garth Wallace via Unicode <
> unicode at unicode.org> wrote:
> > >
> > >> On Wed, May 16, 2018 at 12:42 AM, Hans ?berg via Unicode <
> unicode at unicode.org> wrote:
> > >>
> > >> It would be best to encode the SMuFL symbols, which is rather
> comprehensive and include those:
> > >>  https://www.smufl what should be unified.org
> > >>  http://www.smufl.org/version/latest/
> > >> ...
> > >>
> > >> These are otherwise originally the same, but has since drifted. So
> whether to unify them or having them separate might be best to see what
> SMuFL does, as they are experts on the issue.
> > >>
> > > SMuFL's standards on unification are not the same as Unicode's. For
> one thing, they re-encode Latin letters and Arabic digits multiple times
> for various different uses (such as numbers used in tuplets and those used
> in time signatures).
> >
> > The reason is probably because it is intended for use with music
> engraving, and they should then be rendered differently.
> >
> > Exactly. But Unicode would consider these a matter for font switching in
> rich text.
>
> One original principle was ensure different encodings, so if the practise
> in music engraving is to keep them different, they might be encoded
> differently.
>
> > > There are duplicates all over the place, like how the half-sharp
> symbol is encoded at U+E282 as "accidentalQuarterToneSharpStein", at
> U+E422 as "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as "
> accidentalQuarterToneSharpArabic", and at U+E444 as
> "accidentalKomaSharp". They are graphically identical, and the first three
> even all mean the same thing, a quarter tone sharp!
> >
> > But the tuning system is different, E24 and Pythagorean. Some Latin and
> Greek uppercase letters are exactly the same but have different encodings.
> >
> > Tuning systems are not scripts.
>
> That seems obvious. As I pointed out above, the Arabic glyphs were
> originally taken from Western ones, but have a different musical meaning,
> also when played using E12, as some do.
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180526/38322704/attachment.html>

From unicode at unicode.org  Sun May 27 06:02:24 2018
From: unicode at unicode.org (Ivan Panchenko via Unicode)
Date: Sun, 27 May 2018 13:02:24 +0200
Subject: =?UTF-8?Q?Re:_Major_vendors_changing_U+1F52B_PISTOL_=f0=9f=94=ab_de?=
 =?UTF-8?Q?piction_from_firearm_to_squirt_gun?=
Message-ID: <1222dee2-7468-972d-dd1f-12724ec22924@gmail.com>

On another note, the ?crocodile shot by police? (??????) example in UTS 
#51 appears with a water gun glyph (taken from Apple) now.

If the pistol is to be gotten rid of, would it not be more sensible to 
stop supporting the emoji rather than to corrupt its meaning? NTT DOCOMO 
apparently did not change to a squirt gun: 
https://www.nttdocomo.co.jp/binary/pdf/service/developer/smart_phone/make_contents/pictograph/pictograph_list.pdf

Best regards
Ivan

From unicode at unicode.org  Sun May 27 15:18:43 2018
From: unicode at unicode.org (Garth Wallace via Unicode)
Date: Sun, 27 May 2018 13:18:43 -0700
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
Message-ID: <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>

Philippe is entirely correct here. The fact that a symbol has somewhat
different meanings in different contexts does not mean that it is actually
multiple visually identical symbols. Otherwise Unicode would be re-encoding
the Latin alphabet many, many times over.

During most of Bach's career, the prevailing tuning system was meantone. He
wrote the Well-Tempered Clavier to explore the possibilities afforded by a
new tuning system called well temperament. In the modern era, his work has
typically been played in 12-tone equal temperament. That does not mean that
the ? that Bach used in his score for the Well-Tempered Clavier was not the
same symbol as the ? in his other scores, or that they somehow invisibly
became yet another symbol when the score is opened on the music desk of a
modern Steinway.

On Sat, May 26, 2018 at 2:58 PM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> Even flat notes or rythmic and pause symbols in Western musical notations
> have different contextual meaning depending on musical keys at start of
> scores, and other notations or symbols added above the score. So their
> interpretation are also variable according to context, just like tuning in
> a Arabic musical score, which is also keyed and annotated differently.
> These keys can also change within the same partition score.
> So both the E12 vs. E24 systems (which are not incompatible) may also be
> used in Western and Arabic music notations. The score keys will give the
> interpretation.
> Tone marks taken isolately mean absolutely nothing in both systems outside
> the keyed scores in which they are inserted, except that they are just
> glyphs, which may be used to mean something else (e.g. a note in a comics
> artwork could be used to denote someone whistling, without actually
> encoding any specific tone, or rythmic).
>
>
> 2018-05-17 17:48 GMT+02:00 Hans ?berg via Unicode <unicode at unicode.org>:
>
>>
>>
>> > On 17 May 2018, at 16:47, Garth Wallace via Unicode <
>> unicode at unicode.org> wrote:
>> >
>> > On Thu, May 17, 2018 at 12:41 AM Hans ?berg <haberg-1 at telia.com> wrote:
>> >
>> > > On 17 May 2018, at 08:47, Garth Wallace via Unicode <
>> unicode at unicode.org> wrote:
>> > >
>> > >> On Wed, May 16, 2018 at 12:42 AM, Hans ?berg via Unicode <
>> unicode at unicode.org> wrote:
>> > >>
>> > >> It would be best to encode the SMuFL symbols, which is rather
>> comprehensive and include those:
>> > >>  https://www.smufl what should be unified.org
>> > >>  http://www.smufl.org/version/latest/
>> > >> ...
>> > >>
>> > >> These are otherwise originally the same, but has since drifted. So
>> whether to unify them or having them separate might be best to see what
>> SMuFL does, as they are experts on the issue.
>> > >>
>> > > SMuFL's standards on unification are not the same as Unicode's. For
>> one thing, they re-encode Latin letters and Arabic digits multiple times
>> for various different uses (such as numbers used in tuplets and those used
>> in time signatures).
>> >
>> > The reason is probably because it is intended for use with music
>> engraving, and they should then be rendered differently.
>> >
>> > Exactly. But Unicode would consider these a matter for font switching
>> in rich text.
>>
>> One original principle was ensure different encodings, so if the practise
>> in music engraving is to keep them different, they might be encoded
>> differently.
>>
>> > > There are duplicates all over the place, like how the half-sharp
>> symbol is encoded at U+E282 as "accidentalQuarterToneSharpStein", at
>> U+E422 as "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as
>> "accidentalQuarterToneSharpArabic", and at U+E444 as
>> "accidentalKomaSharp". They are graphically identical, and the first three
>> even all mean the same thing, a quarter tone sharp!
>> >
>> > But the tuning system is different, E24 and Pythagorean. Some Latin and
>> Greek uppercase letters are exactly the same but have different encodings.
>> >
>> > Tuning systems are not scripts.
>>
>> That seems obvious. As I pointed out above, the Arabic glyphs were
>> originally taken from Western ones, but have a different musical meaning,
>> also when played using E12, as some do.
>>
>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180527/af48ee40/attachment.html>

From unicode at unicode.org  Sun May 27 15:33:00 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 27 May 2018 22:33:00 +0200
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
Message-ID: <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>

Thanks!

Le dim. 27 mai 2018 22:18, Garth Wallace <gwalla at gmail.com> a ?crit :

> Philippe is entirely correct here. The fact that a symbol has somewhat
> different meanings in different contexts does not mean that it is actually
> multiple visually identical symbols. Otherwise Unicode would be re-encoding
> the Latin alphabet many, many times over.
>
> During most of Bach's career, the prevailing tuning system was meantone.
> He wrote the Well-Tempered Clavier to explore the possibilities afforded by
> a new tuning system called well temperament. In the modern era, his work
> has typically been played in 12-tone equal temperament. That does not mean
> that the ? that Bach used in his score for the Well-Tempered Clavier was
> not the same symbol as the ? in his other scores, or that they somehow
> invisibly became yet another symbol when the score is opened on the music
> desk of a modern Steinway.
>
> On Sat, May 26, 2018 at 2:58 PM, Philippe Verdy <verdy_p at wanadoo.fr>
> wrote:
>
>> Even flat notes or rythmic and pause symbols in Western musical notations
>> have different contextual meaning depending on musical keys at start of
>> scores, and other notations or symbols added above the score. So their
>> interpretation are also variable according to context, just like tuning in
>> a Arabic musical score, which is also keyed and annotated differently.
>> These keys can also change within the same partition score.
>> So both the E12 vs. E24 systems (which are not incompatible) may also be
>> used in Western and Arabic music notations. The score keys will give the
>> interpretation.
>> Tone marks taken isolately mean absolutely nothing in both systems
>> outside the keyed scores in which they are inserted, except that they are
>> just glyphs, which may be used to mean something else (e.g. a note in a
>> comics artwork could be used to denote someone whistling, without actually
>> encoding any specific tone, or rythmic).
>>
>>
>> 2018-05-17 17:48 GMT+02:00 Hans ?berg via Unicode <unicode at unicode.org>:
>>
>>>
>>>
>>> > On 17 May 2018, at 16:47, Garth Wallace via Unicode <
>>> unicode at unicode.org> wrote:
>>> >
>>> > On Thu, May 17, 2018 at 12:41 AM Hans ?berg <haberg-1 at telia.com>
>>> wrote:
>>> >
>>> > > On 17 May 2018, at 08:47, Garth Wallace via Unicode <
>>> unicode at unicode.org> wrote:
>>> > >
>>> > >> On Wed, May 16, 2018 at 12:42 AM, Hans ?berg via Unicode <
>>> unicode at unicode.org> wrote:
>>> > >>
>>> > >> It would be best to encode the SMuFL symbols, which is rather
>>> comprehensive and include those:
>>> > >>  https://www.smufl what should be unified.org
>>> > >>  http://www.smufl.org/version/latest/
>>> > >> ...
>>> > >>
>>> > >> These are otherwise originally the same, but has since drifted. So
>>> whether to unify them or having them separate might be best to see what
>>> SMuFL does, as they are experts on the issue.
>>> > >>
>>> > > SMuFL's standards on unification are not the same as Unicode's. For
>>> one thing, they re-encode Latin letters and Arabic digits multiple times
>>> for various different uses (such as numbers used in tuplets and those used
>>> in time signatures).
>>> >
>>> > The reason is probably because it is intended for use with music
>>> engraving, and they should then be rendered differently.
>>> >
>>> > Exactly. But Unicode would consider these a matter for font switching
>>> in rich text.
>>>
>>> One original principle was ensure different encodings, so if the
>>> practise in music engraving is to keep them different, they might be
>>> encoded differently.
>>>
>>> > > There are duplicates all over the place, like how the half-sharp
>>> symbol is encoded at U+E282 as "accidentalQuarterToneSharpStein", at U+E422
>>> as "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as
>>> "accidentalQuarterToneSharpArabic", and at U+E444 as "accidentalKomaSharp".
>>> They are graphically identical, and the first three even all mean the same
>>> thing, a quarter tone sharp!
>>> >
>>> > But the tuning system is different, E24 and Pythagorean. Some Latin
>>> and Greek uppercase letters are exactly the same but have different
>>> encodings.
>>> >
>>> > Tuning systems are not scripts.
>>>
>>> That seems obvious. As I pointed out above, the Arabic glyphs were
>>> originally taken from Western ones, but have a different musical meaning,
>>> also when played using E12, as some do.
>>>
>>>
>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180527/8c9fb6c7/attachment.html>

From unicode at unicode.org  Sun May 27 17:36:02 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Mon, 28 May 2018 00:36:02 +0200
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
Message-ID: <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>

The flats and sharps of Arabic music are semantically the same as in Western music, departing from Pythagorean tuning, then, but the microtonal accidentals are different: they simply reused some that were available. By contrast, Persian music notation invented new microtonal accidentals, called the koron and sori, and my impression is that their average value, as measured by Hormoz Farhat in his thesis, is also usable in Arabic music. For comparison, I have posted the Arabic maqam in Helmholtz-Ellis notation [1] using this value; note that one actually needs two extra microtonal accidentals?Arabic microtonal notation is in fact not complete.

The E24 exact quarter-tones are suitable for making a piano sound badly out of tune. Compare that with the accordion in [2], Farid El Atrache - "Noura-Noura".

1. https://lists.gnu.org/archive/html/lilypond-user/2016-02/msg00607.html
2. https://www.youtube.com/watch?v=fvp6fo7tfpk


> On 27 May 2018, at 22:33, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> 
> Thanks! 
> 
> Le dim. 27 mai 2018 22:18, Garth Wallace <gwalla at gmail.com> a ?crit :
> Philippe is entirely correct here. The fact that a symbol has somewhat different meanings in different contexts does not mean that it is actually multiple visually identical symbols. Otherwise Unicode would be re-encoding the Latin alphabet many, many times over.
> 
> During most of Bach's career, the prevailing tuning system was meantone. He wrote the Well-Tempered Clavier to explore the possibilities afforded by a new tuning system called well temperament. In the modern era, his work has typically been played in 12-tone equal temperament. That does not mean that the ? that Bach used in his score for the Well-Tempered Clavier was not the same symbol as the ? in his other scores, or that they somehow invisibly became yet another symbol when the score is opened on the music desk of a modern Steinway.
> 
> On Sat, May 26, 2018 at 2:58 PM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> Even flat notes or rythmic and pause symbols in Western musical notations have different contextual meaning depending on musical keys at start of scores, and other notations or symbols added above the score. So their interpretation are also variable according to context, just like tuning in a Arabic musical score, which is also keyed and annotated differently. These keys can also change within the same partition score.
> So both the E12 vs. E24 systems (which are not incompatible) may also be used in Western and Arabic music notations. The score keys will give the interpretation.
> Tone marks taken isolately mean absolutely nothing in both systems outside the keyed scores in which they are inserted, except that they are just glyphs, which may be used to mean something else (e.g. a note in a comics artwork could be used to denote someone whistling, without actually encoding any specific tone, or rythmic).
> 
> 
> 2018-05-17 17:48 GMT+02:00 Hans ?berg via Unicode <unicode at unicode.org>:
> 
> 
> > On 17 May 2018, at 16:47, Garth Wallace via Unicode <unicode at unicode.org> wrote:
> > 
> > On Thu, May 17, 2018 at 12:41 AM Hans ?berg <haberg-1 at telia.com> wrote:
> > 
> > > On 17 May 2018, at 08:47, Garth Wallace via Unicode <unicode at unicode.org> wrote:
> > > 
> > >> On Wed, May 16, 2018 at 12:42 AM, Hans ?berg via Unicode <unicode at unicode.org> wrote:
> > >> 
> > >> It would be best to encode the SMuFL symbols, which is rather comprehensive and include those:
> > >>  https://www.smufl what should be unified.org
> > >>  http://www.smufl.org/version/latest/
> > >> ...
> > >> 
> > >> These are otherwise originally the same, but has since drifted. So whether to unify them or having them separate might be best to see what SMuFL does, as they are experts on the issue.
> > >> 
> > > SMuFL's standards on unification are not the same as Unicode's. For one thing, they re-encode Latin letters and Arabic digits multiple times for various different uses (such as numbers used in tuplets and those used in time signatures).
> > 
> > The reason is probably because it is intended for use with music engraving, and they should then be rendered differently.
> > 
> > Exactly. But Unicode would consider these a matter for font switching in rich text.
> 
> One original principle was ensure different encodings, so if the practise in music engraving is to keep them different, they might be encoded differently.
> 
> > > There are duplicates all over the place, like how the half-sharp symbol is encoded at U+E282 as "accidentalQuarterToneSharpStein", at U+E422 as "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as "accidentalQuarterToneSharpArabic", and at U+E444 as "accidentalKomaSharp". They are graphically identical, and the first three even all mean the same thing, a quarter tone sharp!
> > 
> > But the tuning system is different, E24 and Pythagorean. Some Latin and Greek uppercase letters are exactly the same but have different encodings.
> > 
> > Tuning systems are not scripts.
> 
> That seems obvious. As I pointed out above, the Arabic glyphs were originally taken from Western ones, but have a different musical meaning, also when played using E12, as some do.
> 
> 
> 
> 
> 


From unicode at unicode.org  Sun May 27 20:39:52 2018
From: unicode at unicode.org (Garth Wallace via Unicode)
Date: Sun, 27 May 2018 18:39:52 -0700
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
Message-ID: <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>

On Sun, May 27, 2018 at 3:36 PM, Hans ?berg <haberg-1 at telia.com> wrote:

> The flats and sharps of Arabic music are semantically the same as in
> Western music, departing from Pythagorean tuning, then, but the microtonal
> accidentals are different: they simply reused some that were available.


But they aren't different! They are the same symbols. They are, as you
yourself say, reused. The fact that they do not denote the same width in
cents in Arabic music as they do in Western modern classical does not
matter. That sort of precision is not inherent to the written symbols.


> By contrast, Persian music notation invented new microtonal accidentals,
> called the koron and sori, and my impression is that their average value,
> as measured by Hormoz Farhat in his thesis, is also usable in Arabic music.
> For comparison, I have posted the Arabic maqam in Helmholtz-Ellis notation
> [1] using this value; note that one actually needs two extra microtonal
> accidentals?Arabic microtonal notation is in fact not complete.
>
> The E24 exact quarter-tones are suitable for making a piano sound badly
> out of tune. Compare that with the accordion in [2], Farid El Atrache -
> "Noura-Noura".
>
> 1. https://lists.gnu.org/archive/html/lilypond-user/2016-02/msg00607.html
> 2. https://www.youtube.com/watch?v=fvp6fo7tfpk
> >
>
>
I don't really see how this is relevant. Nobody is claiming that the koron
and sori accidentals are the same symbols as the Arabic half-sharp and flat
with crossbar. They look entirely different.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180527/8b098f79/attachment.html>

From unicode at unicode.org  Sun May 27 14:27:03 2018
From: unicode at unicode.org (SundaraRaman R via Unicode)
Date: Mon, 28 May 2018 00:57:03 +0530
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
Message-ID: <CAOORhHqzt=ZZJwUgYpkX0_44Yq5q+9TbAFxqGM8MM2wJXrjAdA@mail.gmail.com>

Hi,

In languages like Ruby or Java
(https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
functions to check if a character is alphabetic do that by looking for
the 'Alphabetic'  property (defined true if it's in one of the L
categories, or Nl, or has 'Other_Alphabetic' property). When parsing
Tamil text, this works out well for independent vowels and consonants
(which are in Lo), and for most dependent signs (which are in Mc or Mn
but have the 'Other_Alphabetic' property), but the very common pulli (VIRAMA)
is neither in Lo nor has 'Other_Alphabetic', and so leads to
concluding any string containing it to be non-alphabetic.

This doesn't make sense to me since the Virama  ???? as much of an
alphabetic character as any of the "Dependent Vowel" characters which
have been given the 'Other_Alphabetic' property. Is there a rationale
behind this difference, or is it an oversight to be corrected?

Cheers,
Sundar


From unicode at unicode.org  Mon May 28 03:08:30 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Mon, 28 May 2018 10:08:30 +0200
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
Message-ID: <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>


> On 28 May 2018, at 03:39, Garth Wallace <gwalla at gmail.com> wrote:
> 
>> On Sun, May 27, 2018 at 3:36 PM, Hans ?berg <haberg-1 at telia.com> wrote:
>> The flats and sharps of Arabic music are semantically the same as in Western music, departing from Pythagorean tuning, then, but the microtonal accidentals are different: they simply reused some that were available.
>> 
> But they aren't different! They are the same symbols. They are, as you yourself say, reused.

Historically, yes, but not necessarily now.

> The fact that they do not denote the same width in cents in Arabic music as they do in Western modern classical does not matter. That sort of precision is not inherent to the written symbols.

It is not about precision, but concepts. Like B, ?, and ?, which could have been unified, but are not.

> By contrast, Persian music notation invented new microtonal accidentals, called the koron and sori, and my impression is that their average value, as measured by Hormoz Farhat in his thesis, is also usable in Arabic music. For comparison, I have posted the Arabic maqam in Helmholtz-Ellis notation [1] using this value; note that one actually needs two extra microtonal accidentals?Arabic microtonal notation is in fact not complete.
> 
> The E24 exact quarter-tones are suitable for making a piano sound badly out of tune. Compare that with the accordion in [2], Farid El Atrache - "Noura-Noura".
> 
> 1. https://lists.gnu.org/archive/html/lilypond-user/2016-02/msg00607.html
> 2. https://www.youtube.com/watch?v=fvp6fo7tfpk
> > 
> 
> I don't really see how this is relevant. Nobody is claiming that the koron and sori accidentals are the same symbols as the Arabic half-sharp and flat with crossbar. They look entirely different. 

Arabic music simply happens to use Western style accidentals for concepts similar to Persian music rather than Western music.


From unicode at unicode.org  Mon May 28 04:05:42 2018
From: unicode at unicode.org (Julian Bradfield via Unicode)
Date: Mon, 28 May 2018 10:05:42 +0100 (BST)
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
Message-ID: <slrnpgnhj5.8gj.jcb@home.stevens-bradfield.com>

On 2018-05-28, Hans ?berg via Unicode <unicode at unicode.org> wrote:
>> On 28 May 2018, at 03:39, Garth Wallace <gwalla at gmail.com> wrote:
>>> On Sun, May 27, 2018 at 3:36 PM, Hans ?berg <haberg-1 at telia.com> wrote:
>>> The flats and sharps of Arabic music are semantically the same as in Western music, departing from Pythagorean tuning, then, but the microtonal accidentals are different: they simply reused some that were available.
...
>> The fact that they do not denote the same width in cents in Arabic music as they do in Western modern classical does not matter. That sort of precision is not inherent to the written symbols.
>
> It is not about precision, but concepts. Like B, ?, and ?, which could have been unified, but are not.

Latin, Greek, Cyrillic etc. could not have been unified, because of the
requirement to have round-trip compatibility with previous encodings.

It is also, of course, convenient for many reasons to have the notion
of "script" hard-coded into unicode code-points, instead of in
higher-level mark-up where it arguably belongs - just as, when
copyright finally expires, it will be convenient to have Tolkien's
runes disunified from historical runes (which is the line taken by the
proposal waiting for that day). Whether it is so convenient to have a
"music script" notion hard-coded is presumably what this argument is
about. It's not obvious to me that musical notation is something that
carries the "script" baggage in the same way that writing systems do.

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


From unicode at unicode.org  Mon May 28 05:43:10 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Mon, 28 May 2018 12:43:10 +0200
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <slrnpgnhj5.8gj.jcb@home.stevens-bradfield.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <slrnpgnhj5.8gj.jcb@home.stevens-bradfield.com>
Message-ID: <88FD19F5-F401-4CED-A397-5D7BAE4EFDB1@telia.com>


> On 28 May 2018, at 11:05, Julian Bradfield via Unicode <unicode at unicode.org> wrote:
> 
> On 2018-05-28, Hans ?berg via Unicode <unicode at unicode.org> wrote:
>>> On 28 May 2018, at 03:39, Garth Wallace <gwalla at gmail.com> wrote:
>>>> On Sun, May 27, 2018 at 3:36 PM, Hans ?berg <haberg-1 at telia.com> wrote:
>>>> The flats and sharps of Arabic music are semantically the same as in Western music, departing from Pythagorean tuning, then, but the microtonal accidentals are different: they simply reused some that were available.
> ...
>>> The fact that they do not denote the same width in cents in Arabic music as they do in Western modern classical does not matter. That sort of precision is not inherent to the written symbols.
>> 
>> It is not about precision, but concepts. Like B, ?, and ?, which could have been unified, but are not.
> 
> Latin, Greek, Cyrillic etc. could not have been unified, because of the
> requirement to have round-trip compatibility with previous encodings.

Indeed, in Unicode because of that, which I pointed out.

> It is also, of course, convenient for many reasons to have the notion
> of "script" hard-coded into unicode code-points, instead of in
> higher-level mark-up where it arguably belongs - just as, when
> copyright finally expires, it will be convenient to have Tolkien's
> runes disunified from historical runes (which is the line taken by the
> proposal waiting for that day). Whether it is so convenient to have a
> "music script" notion hard-coded is presumably what this argument is
> about. It's not obvious to me that musical notation is something that
> carries the "script" baggage in the same way that writing systems do.

Indeed, that is what I also pointed out. So I suggested to contact the SMuFL people which might inform about the underlying reasoning, and then make a decision about what might be suitable for Unicode. They probably have them separate for the same reason as for scripts: originally different fonts encodings, but those are not official, and in addition it is for music engraving, and not writing in text files.


From unicode at unicode.org  Mon May 28 07:57:26 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 28 May 2018 13:57:26 +0100
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
In-Reply-To: <CAOORhHqzt=ZZJwUgYpkX0_44Yq5q+9TbAFxqGM8MM2wJXrjAdA@mail.gmail.com>
References: <CAOORhHqzt=ZZJwUgYpkX0_44Yq5q+9TbAFxqGM8MM2wJXrjAdA@mail.gmail.com>
Message-ID: <20180528135726.7759c425@JRWUBU2>

On Mon, 28 May 2018 00:57:03 +0530
SundaraRaman R via Unicode <unicode at unicode.org> wrote:

> Hi,
> 
> In languages like Ruby or Java
> (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
> functions to check if a character is alphabetic do that by looking for
> the 'Alphabetic'  property (defined true if it's in one of the L
> categories, or Nl, or has 'Other_Alphabetic' property). When parsing
> Tamil text, this works out well for independent vowels and consonants
> (which are in Lo), and for most dependent signs (which are in Mc or Mn
> but have the 'Other_Alphabetic' property), but the very common pulli
> (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to
> concluding any string containing it to be non-alphabetic.
> 
> This doesn't make sense to me since the Virama  ???? as much of an
> alphabetic character as any of the "Dependent Vowel" characters which
> have been given the 'Other_Alphabetic' property. Is there a rationale
> behind this difference, or is it an oversight to be corrected?

There is only one character with a canonical combining class of 9 that
is included as other_alphabetic, namely U+0E3A THAI CHARACTER PHINTHU.
That last had any of the other properties of viramas back in Unicode
1.0; the characters that triggered such behaviours were permanently
removed in Unicode 1.1.

There are some notable absences from the combining marks included.
Significant absences include ZWJ, ZWNJ and CGJ.

However, a non-erroneous *conformant* Unicode process cannot
always determine whether a string, given only that it is a string, is
composed only of alphabetic characters.  The answer would be 'yes' for
<U+00E7 LATIN SMALL LETTER C WITH CEDILLA> but 'no' for the canonically
equivalent <U+0063 LATIN SMALL LETTER C, U+0327 COMBINING CEDILLA>!
(U+0327 is not included as alphabetic either.)

There is at least one combination of Latin letter and combining mark
that occurs in the normal orthography of a natural language and does not
have a precomposed equivalent.

I fear that the correct test for what you want is to split text into
words and check that each word begins with an alphabetic character.
That test can be made by a conformant process.  I think, but have not
checked, that the test an be simplified to:

(a) Check that the first character is alphabetic.

(b) Ignore every character with a WordBreak property of Extend or ZWJ

(c) Check that all other characters are alphabetic.

Richard.


From unicode at unicode.org  Mon May 28 08:10:23 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 28 May 2018 14:10:23 +0100
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
Message-ID: <20180528141023.24d2231e@JRWUBU2>

On Mon, 28 May 2018 10:08:30 +0200
Hans ?berg via Unicode <unicode at unicode.org> wrote:

> > On 28 May 2018, at 03:39, Garth Wallace <gwalla at gmail.com> wrote:
> > The fact that they do not denote the same width in cents in Arabic
> > music as they do in Western modern classical does not matter. That
> > sort of precision is not inherent to the written symbols.  
> 
> It is not about precision, but concepts. Like B, ?, and ?, which
> could have been unified, but are not.

Unifying these would make a real mess of lower casing!

What is the context in which the Arab use would benefit from having a
different encoding?

Richard.


From unicode at unicode.org  Mon May 28 08:30:55 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Mon, 28 May 2018 15:30:55 +0200
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <20180528141023.24d2231e@JRWUBU2>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
Message-ID: <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>


> On 28 May 2018, at 15:10, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> On Mon, 28 May 2018 10:08:30 +0200
> Hans ?berg via Unicode <unicode at unicode.org> wrote:
> 
>> It is not about precision, but concepts. Like B, ?, and ?, which
>> could have been unified, but are not.
> 
> Unifying these would make a real mess of lower casing!

German has a special sign ? for "ss", without upper capital version.

> What is the context in which the Arab use would benefit from having a
> different encoding?

Maybe if they decide to change the glyph, then what already is encoded would get the right appearance. But SMuFL might have had other reasons: the glyphs should probably be designed together. And it is simple, as one does not need to investigate their uses too much. For example, the Turkish AEU sharps are microtonal, not the ordinary ones. So if the Turkish accidentals have their own code points, one can change that later.


From unicode at unicode.org  Mon May 28 09:33:11 2018
From: unicode at unicode.org (SundaraRaman R via Unicode)
Date: Mon, 28 May 2018 20:03:11 +0530
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
In-Reply-To: <20180528135726.7759c425@JRWUBU2>
References: <CAOORhHqzt=ZZJwUgYpkX0_44Yq5q+9TbAFxqGM8MM2wJXrjAdA@mail.gmail.com>
 <20180528135726.7759c425@JRWUBU2>
Message-ID: <CAOORhHr3B=+Ot5RnKAOGwNDriMHa40DNR0aVayuODUhH1BXAQQ@mail.gmail.com>

Hi, thanks for your reply.

> There is only one character with a canonical combining class of 9 that
> is included as other_alphabetic, namely U+0E3A THAI CHARACTER PHINTHU.
> That last had any of the other properties of viramas back in Unicode
> 1.0; the characters that triggered such behaviours were permanently
> removed in Unicode 1.1.

I didn't understand the second sentence here, could you clarify? What
do you mean by "any of the other properties" here? And "triggered such
behaviours" seems to imply having them in other_alphabetic had
negative consequences, could you give an example of what that might
be?

> There are some notable absences from the combining marks included.
> Significant absences include ZWJ, ZWNJ and CGJ.
>
> However, a non-erroneous *conformant* Unicode process cannot
> always determine whether a string, given only that it is a string, is
> composed only of alphabetic characters.  The answer would be 'yes' for
> <U+00E7 LATIN SMALL LETTER C WITH CEDILLA> but 'no' for the canonically
> equivalent <U+0063 LATIN SMALL LETTER C, U+0327 COMBINING CEDILLA>!
> (U+0327 is not included as alphabetic either.)
>
> There is at least one combination of Latin letter and combining mark
> that occurs in the normal orthography of a natural language and does not
> have a precomposed equivalent.

Ah, that's somewhat unfortunate that such a quick and easy alphabetic
check is not possible in the general case, but I can understand how it
might be weird to give the Alphabetic property to a ZWJ or ZWNJ.

But in the case of Tamil, I'm curious why most other combining Tamil
marks go in class 0, whereas pulli goes in 9. Even u0B82 Anusvara, a
character barely used in Tamil text, has combining class 0 and is
included in Other_Alphabetic, but the visually similar and  similarly
positioned pulli is not. In this particular case, is it a historical
accident that these got assigned this way, or is there a rationale
behind these? Would it at all be possible to get this changed in the
upcoming Unicode standard?

(By the way, I'm happy to get a link to read through for any of my
questions here. I just find it quite hard to search for and find past
discussions and decision rationales regarding these, not knowing how
and where to search for them.)


> I fear that the correct test for what you want is to split text into
> words and check that each word begins with an alphabetic character.

Do you mean "each grapheme cluster begins with an alphabetic
character" here? It seems to me (in my very limited Unicode knowledge)
that such a test, going through grapheme clusters and checking the
first codepoint in each, would also ensure the text is full
alphabetic. And it has the advantage that more languages have a
(relatively) easy way for splitting text into grapheme clusters, than
for checking minor Unicode properties like WordBreak, so this one
might be easier to implement. Does this test anywhere in the ballpark
of being right?

Regards,
Sundar

From unicode at unicode.org  Mon May 28 10:00:37 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 28 May 2018 16:00:37 +0100
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
 <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
Message-ID: <20180528160037.6b3689e0@JRWUBU2>

On Mon, 28 May 2018 15:30:55 +0200
Hans ?berg via Unicode <unicode at unicode.org> wrote:

> > On 28 May 2018, at 15:10, Richard Wordingham via Unicode
> > <unicode at unicode.org> wrote:
> > 
> > On Mon, 28 May 2018 10:08:30 +0200
> > Hans ?berg via Unicode <unicode at unicode.org> wrote:
> >   
> >> It is not about precision, but concepts. Like B, ?, and ?, which
> >> could have been unified, but are not.  
> > 
> > Unifying these would make a real mess of lower casing!  
> 
> German has a special sign ? for "ss", without upper capital version.

That doesn't prevent upper-casing - you just have to know your
audience.  The three letters like 'B' have very different lower case
forms, and very few would agree that they were the same letter.  For the
same reason, there are two utter confusables in THE Latin SCRIPT for
00D0 LATIN CAPITAL LETTER ETH. More notably though, one just has to run
the risk of getting a culturally incorrect upper case when rendering
U+014A LATIN CAPITAL LETTER ENG; whether the three alternatives are the
same letter is debatable.

Richard.


From unicode at unicode.org  Mon May 28 10:54:47 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Mon, 28 May 2018 17:54:47 +0200
Subject: Unicode characters unification
In-Reply-To: <20180528160037.6b3689e0@JRWUBU2>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
 <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
 <20180528160037.6b3689e0@JRWUBU2>
Message-ID: <ECFA3143-3501-40BD-8CB2-764AD28B1162@telia.com>


> On 28 May 2018, at 17:00, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> On Mon, 28 May 2018 15:30:55 +0200
> Hans ?berg via Unicode <unicode at unicode.org> wrote:
> 
>>> On 28 May 2018, at 15:10, Richard Wordingham via Unicode
>>> <unicode at unicode.org> wrote:
>>> 
>>> On Mon, 28 May 2018 10:08:30 +0200
>>> Hans ?berg via Unicode <unicode at unicode.org> wrote:
>>> 
>>>> It is not about precision, but concepts. Like B, ?, and ?, which
>>>> could have been unified, but are not.  
>>> 
>>> Unifying these would make a real mess of lower casing!  
>> 
>> German has a special sign ? for "ss", without upper capital version.
> 
> That doesn't prevent upper-casing - you just have to know your
> audience.  

That would be the same if the Greek and Latin uppercase letters would have been unified: One would need to know the context.

> The three letters like 'B' have very different lower case
> forms, and very few would agree that they were the same letter.  

They were the same in the Uncial script, but evolved to be viewed as different. That is common with math symbols: something available evolving into separate symbols.

> For the
> same reason, there are two utter confusables in THE Latin SCRIPT for
> 00D0 LATIN CAPITAL LETTER ETH.

The stuff is likely added for computer legacy, if there were separate encodings for those.

> More notably though, one just has to run
> the risk of getting a culturally incorrect upper case when rendering
> U+014A LATIN CAPITAL LETTER ENG; whether the three alternatives are the
> same letter is debatable.

Unified CJK Ideographs differ by stroke order.


From unicode at unicode.org  Mon May 28 11:45:39 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 28 May 2018 17:45:39 +0100
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
In-Reply-To: <CAOORhHr3B=+Ot5RnKAOGwNDriMHa40DNR0aVayuODUhH1BXAQQ@mail.gmail.com>
References: <CAOORhHqzt=ZZJwUgYpkX0_44Yq5q+9TbAFxqGM8MM2wJXrjAdA@mail.gmail.com>
 <20180528135726.7759c425@JRWUBU2>
 <CAOORhHr3B=+Ot5RnKAOGwNDriMHa40DNR0aVayuODUhH1BXAQQ@mail.gmail.com>
Message-ID: <20180528174539.29acf556@JRWUBU2>

On Mon, 28 May 2018 20:03:11 +0530
SundaraRaman R via Unicode <unicode at unicode.org> wrote:

> Hi, thanks for your reply.
> 
> > There is only one character with a canonical combining class of 9
> > that is included as other_alphabetic, namely U+0E3A THAI CHARACTER
> > PHINTHU. That last had any of the other properties of viramas back
> > in Unicode 1.0; the characters that triggered such behaviours were
> > permanently removed in Unicode 1.1.  
> 
> I didn't understand the second sentence here, could you clarify?

Sorry, I messed that system up.  It should have read, "The last time
that that had any of the other properties of viramas back
in Unicode 1.0;"

> What
> do you mean by "any of the other properties" here?

The effects of virama that spring to mind are:

(a) Causing one or both letters on either side to change or combine to
indicate combination;

(b) Appearing as a mark only if it does not affect one of the letters
on either side;

(c) Causing a left matra to appear on the left of the sequence of
consonants joined by a sequence of non-visible viramas.

> And "triggered such
> behaviours" seems to imply having them in other_alphabetic had
> negative consequences, could you give an example of what that might
> be?

Nowadays, the Thai syllable ???, normatively pronounced /trai/, is
only encoded <U+0E44 THAI CHARACTER SARA AI MAIMALAI, U+0E15 THAI
CHARACTER TO TAO, U+0E23 THAI CHARACTER RO RUA>, and the character
U+0E3A is always visible when used; for most routine purposes it is
little different to U+0E38 THAI CHARACTER SARA U.  However, in Unicode
1.0, while <U+0E44, U+0E15, U+0E23> was rendered as at present, the same
visible string could also be encoded as <U+0E15, U+0E3A, U+0E23, U+0E74
THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI>  - no glyph would be
rendered for U+0E3A. If one wanted the official Sanskritised Pali
version, one could type ???? <U+0E44, U+0E15, U+0E3A, U+0E23> as at
present.  One could also encode it as <U+0E15, U+0E3A, U+200C, U+0E23,
U+0E74>.

Weirdly, I couldn't have used the phonetically ordered vowel to type a
monk's name ending in ???? <U+0E21 THAI CHARACTER MO MA, U+0E3A, U+0E42
THAI CHARACTER SARA O, U+0E21>, as <U+0E21, U+0E3A, U+200C, U+0E21,
U+0E72 THAI PHONETIC ORDER VOWEL SIGN O> would have been rendered as
????.

As the non-phonetic virama-like behaviours of U+0E3A are only mentioned
under the heading 'Alternate Ordering', I can only presume that they
were triggered by the phonetic order vowel signs, U+0E70 to U+0E74.

It is possible that U+0E3A acquired the alphabetic property because it
ceased to behave like a virama.  Alternatively, it may have acquired
the alphabetic property because of its use in the compound vowels of
minority languages.

> But in the case of Tamil, I'm curious why most other combining Tamil
> marks go in class 0, whereas pulli goes in 9. Even u0B82 Anusvara, a
> character barely used in Tamil text, has combining class 0 and is
> included in Other_Alphabetic, but the visually similar and  similarly
> positioned pulli is not. In this particular case, is it a historical
> accident that these got assigned this way, or is there a rationale
> behind these? Would it at all be possible to get this changed in the
> upcoming Unicode standard?

Tamil has usually been treated as just another Indian Indic script.
U+0E3A is the only virama-like character with the property of being
'alphabetic'.

I can't see a change making it into Unicode 11.0.  It requires too much
careful thought.  Besides, anything that considered <pulli> as
alphabetic should also considerer <pulli, ZWNJ> as alphabetic - they
should be mostly interchangeable in Tamil.

> > I fear that the correct test for what you want is to split text into
> > words and check that each word begins with an alphabetic
> > character.  
> 
> Do you mean "each grapheme cluster begins with an alphabetic
> character" here? It seems to me (in my very limited Unicode knowledge)
> that such a test, going through grapheme clusters and checking the
> first codepoint in each, would also ensure the text is full
> alphabetic.

Not directly.  Is the string "mark2mark" alphabetic?  It constitutes a
single word.  My suggested simplification would say 'no', as it
contains '2'; perhaps my simplification is wrong.

> And it has the advantage that more languages have a
> (relatively) easy way for splitting text into grapheme clusters, than
> for checking minor Unicode properties like WordBreak, so this one
> might be easier to implement. Does this test anywhere in the ballpark
> of being right?

Yes, it's close to being right.  Note that simple approximations for SE
Asian word-breaking (e.g. treating SE Asian characters as
alphabetic) should work well for your application.

Richard.


From unicode at unicode.org  Mon May 28 12:18:52 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 28 May 2018 18:18:52 +0100
Subject: Unicode characters unification
In-Reply-To: <ECFA3143-3501-40BD-8CB2-764AD28B1162@telia.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
 <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
 <20180528160037.6b3689e0@JRWUBU2>
 <ECFA3143-3501-40BD-8CB2-764AD28B1162@telia.com>
Message-ID: <20180528181852.7ce84e52@JRWUBU2>

On Mon, 28 May 2018 17:54:47 +0200
Hans ?berg via Unicode <unicode at unicode.org> wrote:

> > On 28 May 2018, at 17:00, Richard Wordingham via Unicode
> > <unicode at unicode.org> wrote:
> > 
> > On Mon, 28 May 2018 15:30:55 +0200
> > Hans ?berg via Unicode <unicode at unicode.org> wrote:

> >> German has a special sign ? for "ss", without upper capital
> >> version.  
> > 
> > That doesn't prevent upper-casing - you just have to know your
> > audience.    
> 
> That would be the same if the Greek and Latin uppercase letters would
> have been unified: One would need to know the context.

I've seen a commutation diagram with both U+004D LATIN CAPITAL LETTER
M and U+039C GREEK CAPITAL LETTER MU on it.  I only knew the difference
because I listened to what the lecturer said.

> > For the
> > same reason, there are two utter confusables in THE Latin SCRIPT for
> > 00D0 LATIN CAPITAL LETTER ETH.  

> The stuff is likely added for computer legacy, if there were separate
> encodings for those.

Unlikely.  U+00F0 LATIN SMALL LETTER ETH and U+0256 LATIN SMALL LETTER
D WITH TAIL contrast in the IPA.  The difference between U+0111 LATIN
SMALL LETTER D WITH STROKE and U+00F0 LATIN SMALL LETTER ETH may have
been debated.

Richard.


From unicode at unicode.org  Mon May 28 13:19:09 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Mon, 28 May 2018 20:19:09 +0200
Subject: Unicode characters unification
In-Reply-To: <20180528181852.7ce84e52@JRWUBU2>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
 <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
 <20180528160037.6b3689e0@JRWUBU2>
 <ECFA3143-3501-40BD-8CB2-764AD28B1162@telia.com>
 <20180528181852.7ce84e52@JRWUBU2>
Message-ID: <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com>


> On 28 May 2018, at 19:18, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> On Mon, 28 May 2018 17:54:47 +0200
> Hans ?berg via Unicode <unicode at unicode.org> wrote:
> 
>>> On 28 May 2018, at 17:00, Richard Wordingham via Unicode
>>> <unicode at unicode.org> wrote:
>>> 
>>> On Mon, 28 May 2018 15:30:55 +0200
>>> Hans ?berg via Unicode <unicode at unicode.org> wrote:
> 
>>>> German has a special sign ? for "ss", without upper capital
>>>> version.  
>>> 
>>> That doesn't prevent upper-casing - you just have to know your
>>> audience.    
>> 
>> That would be the same if the Greek and Latin uppercase letters would
>> have been unified: One would need to know the context.
> 
> I've seen a commutation diagram with both U+004D LATIN CAPITAL LETTER
> M and U+039C GREEK CAPITAL LETTER MU on it.  I only knew the difference
> because I listened to what the lecturer said.

Indistinguishable math styles Latin and Greek uppercase letters have been added, even though that was not so in for example TeX, and thus no encoding legacy to consider.


From unicode at unicode.org  Mon May 28 14:01:39 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 28 May 2018 20:01:39 +0100
Subject: Unicode characters unification
In-Reply-To: <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
 <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
 <20180528160037.6b3689e0@JRWUBU2>
 <ECFA3143-3501-40BD-8CB2-764AD28B1162@telia.com>
 <20180528181852.7ce84e52@JRWUBU2>
 <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com>
Message-ID: <20180528200139.744ee706@JRWUBU2>

On Mon, 28 May 2018 20:19:09 +0200
Hans ?berg via Unicode <unicode at unicode.org> wrote:


> Indistinguishable math styles Latin and Greek uppercase letters have
> been added, even though that was not so in for example TeX, and thus
> no encoding legacy to consider.

They sort differently - one can have vaguely alphabetical indexes of
mathematical symbols.  They also have quite different compatibility
decompositions.

Does sorting offer an argument for encoding these symbols differently.
I'm not sure it's a strong arguments - how likely is one to have a list
where the difference matters?

Richard.

 
From unicode at unicode.org  Mon May 28 14:14:58 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Mon, 28 May 2018 21:14:58 +0200
Subject: Unicode characters unification
In-Reply-To: <20180528200139.744ee706@JRWUBU2>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
 <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
 <20180528160037.6b3689e0@JRWUBU2>
 <ECFA3143-3501-40BD-8CB2-764AD28B1162@telia.com>
 <20180528181852.7ce84e52@JRWUBU2>
 <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com>
 <20180528200139.744ee706@JRWUBU2>
Message-ID: <323E9648-7103-49B0-8CB0-5544759CCFBB@telia.com>


> On 28 May 2018, at 21:01, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> On Mon, 28 May 2018 20:19:09 +0200
> Hans ?berg via Unicode <unicode at unicode.org> wrote:
> 
>> Indistinguishable math styles Latin and Greek uppercase letters have
>> been added, even though that was not so in for example TeX, and thus
>> no encoding legacy to consider.
> 
> They sort differently - one can have vaguely alphabetical indexes of
> mathematical symbols.  They also have quite different compatibility
> decompositions.
> 
> Does sorting offer an argument for encoding these symbols differently.
> I'm not sure it's a strong arguments - how likely is one to have a list
> where the difference matters?

The main point is that they are not likely to be distinguishable when used side-by-side in the same formula. They could be of significance if using Greek names instead of letters, of length greater than one, then. But it is not wrong to add them, because it is easier than having to think through potential uses.


From unicode at unicode.org  Mon May 28 14:38:27 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 28 May 2018 20:38:27 +0100
Subject: Unicode characters unification
In-Reply-To: <323E9648-7103-49B0-8CB0-5544759CCFBB@telia.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
 <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
 <20180528160037.6b3689e0@JRWUBU2>
 <ECFA3143-3501-40BD-8CB2-764AD28B1162@telia.com>
 <20180528181852.7ce84e52@JRWUBU2>
 <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com>
 <20180528200139.744ee706@JRWUBU2>
 <323E9648-7103-49B0-8CB0-5544759CCFBB@telia.com>
Message-ID: <20180528203827.7c073b30@JRWUBU2>

On Mon, 28 May 2018 21:14:58 +0200
Hans ?berg via Unicode <unicode at unicode.org> wrote:

> > On 28 May 2018, at 21:01, Richard Wordingham via Unicode
> > <unicode at unicode.org> wrote:
> > 
> > On Mon, 28 May 2018 20:19:09 +0200
> > Hans ?berg via Unicode <unicode at unicode.org> wrote:
> >   
> >> Indistinguishable math styles Latin and Greek uppercase letters
> >> have been added, even though that was not so in for example TeX,
> >> and thus no encoding legacy to consider.  
> > 
> > They sort differently - one can have vaguely alphabetical indexes of
> > mathematical symbols.  They also have quite different compatibility
> > decompositions.
> > 
> > Does sorting offer an argument for encoding these symbols
> > differently. I'm not sure it's a strong arguments - how likely is
> > one to have a list where the difference matters?  
> 
> The main point is that they are not likely to be distinguishable when
> used side-by-side in the same formula. They could be of significance
> if using Greek names instead of letters, of length greater than one,
> then. But it is not wrong to add them, because it is easier than
> having to think through potential uses.

By these symbols, I meant the quarter-tone symbols.  Capital em and
capital mu, as symbols, need to be encoded separately for proper
sorting.

Richard. 


From unicode at unicode.org  Mon May 28 15:23:54 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Mon, 28 May 2018 22:23:54 +0200
Subject: Unicode characters unification
In-Reply-To: <20180528203827.7c073b30@JRWUBU2>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
 <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
 <20180528160037.6b3689e0@JRWUBU2>
 <ECFA3143-3501-40BD-8CB2-764AD28B1162@telia.com>
 <20180528181852.7ce84e52@JRWUBU2>
 <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com>
 <20180528200139.744ee706@JRWUBU2>
 <323E9648-7103-49B0-8CB0-5544759CCFBB@telia.com>
 <20180528203827.7c073b30@JRWUBU2>
Message-ID: <8ABECE66-4FB0-4434-8F9D-CB8AA889621D@telia.com>


> On 28 May 2018, at 21:38, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
> 
> On Mon, 28 May 2018 21:14:58 +0200
> Hans ?berg via Unicode <unicode at unicode.org> wrote:
> 
>>> On 28 May 2018, at 21:01, Richard Wordingham via Unicode
>>> <unicode at unicode.org> wrote:
>>> 
>>> On Mon, 28 May 2018 20:19:09 +0200
>>> Hans ?berg via Unicode <unicode at unicode.org> wrote:
>>> 
>>>> Indistinguishable math styles Latin and Greek uppercase letters
>>>> have been added, even though that was not so in for example TeX,
>>>> and thus no encoding legacy to consider.  
>>> 
>>> They sort differently - one can have vaguely alphabetical indexes of
>>> mathematical symbols.  They also have quite different compatibility
>>> decompositions.
>>> 
>>> Does sorting offer an argument for encoding these symbols
>>> differently. I'm not sure it's a strong arguments - how likely is
>>> one to have a list where the difference matters?  
>> 
>> The main point is that they are not likely to be distinguishable when
>> used side-by-side in the same formula. They could be of significance
>> if using Greek names instead of letters, of length greater than one,
>> then. But it is not wrong to add them, because it is easier than
>> having to think through potential uses.
> 
> By these symbols, I meant the quarter-tone symbols.  Capital em and
> capital mu, as symbols, need to be encoded separately for proper
> sorting.

Some of the math style letters are out of order for legacy reasons, so sorting may not work well.

SMuFL have different fonts for text and music engraving, but I can't think of any use of sorting them.


From unicode at unicode.org  Mon May 28 17:13:43 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Mon, 28 May 2018 16:13:43 -0600
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
Message-ID: <06D5CE553B39491DA2B6E0C05B63D53F@DougEwell>

SundaraRaman R wrote:

> but the very common pulli (VIRAMA)
> is neither in Lo nor has 'Other_Alphabetic', and so leads to
> concluding any string containing it to be non-alphabetic.

Is this definition part of Unicode? I thought the use of General 
Category to answer questions like "this sequence is a word" or "this 
string is alphabetic" was much more complex than that. (I'm not even 
sure what the latter means, for any script with any sort of combining 
mark.)

Richard Wordingham wrote:

> The effects of virama that spring to mind are:
>
> (a) Causing one or both letters on either side to change or combine to
> indicate combination;
>
> (b) Appearing as a mark only if it does not affect one of the letters
> on either side;
>
> (c) Causing a left matra to appear on the left of the sequence of
> consonants joined by a sequence of non-visible viramas.

Most of these don't apply to Tamil, of course.

--
Doug Ewell | Thornton, CO, US | ewellic.org


From unicode at unicode.org  Mon May 28 23:23:13 2018
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Tue, 29 May 2018 13:23:13 +0900
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
In-Reply-To: <CAOORhHqzt=ZZJwUgYpkX0_44Yq5q+9TbAFxqGM8MM2wJXrjAdA@mail.gmail.com>
References: <CAOORhHqzt=ZZJwUgYpkX0_44Yq5q+9TbAFxqGM8MM2wJXrjAdA@mail.gmail.com>
Message-ID: <6fbc337a-32a1-5b92-fc1b-6a9dc52f42db@it.aoyama.ac.jp>

Hello Sundar,

On 2018/05/28 04:27, SundaraRaman R via Unicode wrote:
> Hi,
> 
> In languages like Ruby or Java
> (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
> functions to check if a character is alphabetic do that by looking for
> the 'Alphabetic'  property (defined true if it's in one of the L
> categories, or Nl, or has 'Other_Alphabetic' property). When parsing
> Tamil text, this works out well for independent vowels and consonants
> (which are in Lo), and for most dependent signs (which are in Mc or Mn
> but have the 'Other_Alphabetic' property), but the very common pulli (VIRAMA)
> is neither in Lo nor has 'Other_Alphabetic', and so leads to
> concluding any string containing it to be non-alphabetic.
> 
> This doesn't make sense to me since the Virama  ???? as much of an
> alphabetic character as any of the "Dependent Vowel" characters which
> have been given the 'Other_Alphabetic' property. Is there a rationale
> behind this difference, or is it an oversight to be corrected?

I suggest submitting an error report via 
https://www.unicode.org/reporting.html. I haven't studied the issue in 
detail (sorry, just no time this week), but it sounds reasonable to give 
the VIRAMA the 'Other_Alphabetic' property.

I'd recommend to mention examples other than Tamil in your report 
(assuming they exist).

BTW, what's the method you are using in Ruby? If there's a problem in 
Ruby (which I don't think; it's just using Unicode data), then please 
make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I 
should be able to follow up on that.

Regards,   Martin.

From unicode at unicode.org  Mon May 28 23:40:49 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Mon, 28 May 2018 21:40:49 -0700
Subject: Unicode characters unification
In-Reply-To: <20180528200139.744ee706@JRWUBU2>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
 <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
 <20180528160037.6b3689e0@JRWUBU2>
 <ECFA3143-3501-40BD-8CB2-764AD28B1162@telia.com>
 <20180528181852.7ce84e52@JRWUBU2>
 <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com>
 <20180528200139.744ee706@JRWUBU2>
Message-ID: <8e0a34b8-c074-a152-42d0-bc55b9a132ff@ix.netcom.com>

In the discussion leading up to this it has been implied that Unicode 
encodes / should encode concepts or pure shape. And there's been some 
confusion as to were concerns about sorting or legacy encodings fit in. 
Time to step back a bit:

Primarily the Unicode Standard encodes by character identity - something 
that is different from either the pure shape or the "concept denoted by 
the character".

For example, for most alphabetic characters, you could say that they 
stand for a more-or-less well-defined phonetic value. But Unicode does 
not encode such values directly, instead it encodes letters - which in 
turn get re-purposed for different sound values in each writing system.

Likewise, the various uses of period or comma are not separately encoded 
- potentially these marks are given mappings to specific functions for 
each writing system or notation using them.

Clearly these are not encoded to represent a single mapping to an 
external concept, and, as we will see, they are not necessarily encoded 
directly by shape.

Instead, the Unicode Standard encodes character identity; but there are 
a number of principled and some ad-hoc deviations from a purist 
implementation of that approach.

The first one is that of forcing a disunification by script. What 
constitutes a script can be argued over, especially as they all seem to 
have evolved from (or been created based on) predecessor scripts, so 
there are always pairs of scripts that have a lot in common. While an 
"Alpha" and an "A" do have much in common, it is best to recognize that 
their membership in different scripts leads to important differences so 
that it's not a stretch to say that they no longer share the same identity.

The next principled deviation is that of requiring case pairs to be 
unique. Bicameral scripts, (and some of the characters in them), 
acquired their lowercase at different times, so that the relation 
between the upper cases and the lower cases are different across 
scripts, and gives rise to some exceptional cases inside certain scripts.

This is one of the reasons to disunify certain bicameral scripts. But 
even inside scripts, there are case pairs that may share lowercase forms 
or may share uppercase forms, but said forms are disunified to make the 
pairs separate. The two first principles match users expectations in 
that case changes (largely) work as expected in plain text and that 
sorting also (largely) matches user expectation by default.

The third principle is to disunify characters based on line-breaking or 
line-layout properties. Implicit in that is the idea that plain text, 
and not markup, is the place to influence basic algorithms such as 
line-breaking and bidi layout (hence two sets of Arabic-Indic digits). 
Once can argue with that decision, but the fact is, there are too many 
places where text exist without the ability to apply markup to go 
entirely without that support.

The fourth principle is that of differential variability of appearance. 
For letters proper, their identity can be associated with a wide range 
of appearances from sparse to fanciful glyphs. If an entire piece of 
text (or even a given word) is set using a particular font style, 
context will enable the reader to identify the underlying letter, even 
if the shape is almost unrelated to the "archetypical shape" documented 
in the Standard.

When letters or marks get re-used in notational systems, though, the 
permissible range of variability changes dramatically - variations that 
do not change the meaning of a word in styled text, suddenly change the 
meaning of text in a certain notational system. Hence the disunification 
of certain letters or marks (but not all of them) in support of 
mathematical notation.

The fifth principle appears to be to disunify only as far as and only 
when necessary. The biggest downside of this principle is that it leads 
to "late" disunifications; some characters get disunified as the 
committee becomes aware of some issue, leading to the problem of legacy 
data. But it has usefully somewhat limited the further proliferation of 
characters of identical appearance.

The final principle is compatibility. This covers being able to 
round-trip from certain legacy encodings. This principle may force some 
disunifications that otherwise might not have happened, but it also 
isn't a panacea: there are legacy encodings that are mutually 
incompatible, so that one needs to make a choice which one to support. 
TeX being a "glyph based" system looses out here in comparison to legacy 
plain-text character encoding systems such as the 8859 series of ISO/IEC 
standards.

Some unification among punctuation marks in particular seem to have been 
made on a more ad-hoc basis. This issue is exacerbated by the fact that 
many such systems lack either the wide familiarity of standard writing 
systems (with their tolerance for glyph variation) nor the rigor of 
something like mathematical notation. This leads to the pragmatic choice 
of letting users select either "shape" or "concept" rather than 
"identity"; generally, such ad-hoc solutions should be resisted -- they 
are certainly not to be seen as a precedent for "encoding concepts" 
generally.

But such exceptions prove the rule, which leads back to where we 
started: the default position is that Unicode encodes a character 
identity that is not the same as encoding the concept that said 
character is used to represent in writing.

A./


From unicode at unicode.org  Mon May 28 23:44:11 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Mon, 28 May 2018 21:44:11 -0700
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
In-Reply-To: <6fbc337a-32a1-5b92-fc1b-6a9dc52f42db@it.aoyama.ac.jp>
References: <CAOORhHqzt=ZZJwUgYpkX0_44Yq5q+9TbAFxqGM8MM2wJXrjAdA@mail.gmail.com>
 <6fbc337a-32a1-5b92-fc1b-6a9dc52f42db@it.aoyama.ac.jp>
Message-ID: <c0059845-1b8a-fe9f-5819-ef7bf9483762@ix.netcom.com>

One of the general principles is that combining marks inherit the 
property of their base character.

Normally, "inherited" should be the only property value for combining marks.

There have been some deviations from this over the years, for various 
reasons, and there are some properties (such as general category) where 
it is necessary to recognize the character as combining, but the general 
principle still holds.

Therefore, if you are trying to see whether a string is alphabetic, 
combining marks should be "transparent" to such an algorithm.

A./


On 5/28/2018 9:23 PM, Martin J. D?rst via Unicode wrote:
> Hello Sundar,
>
> On 2018/05/28 04:27, SundaraRaman R via Unicode wrote:
>> Hi,
>>
>> In languages like Ruby or Java
>> (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)), 
>>
>> functions to check if a character is alphabetic do that by looking for
>> the 'Alphabetic'? property (defined true if it's in one of the L
>> categories, or Nl, or has 'Other_Alphabetic' property). When parsing
>> Tamil text, this works out well for independent vowels and consonants
>> (which are in Lo), and for most dependent signs (which are in Mc or Mn
>> but have the 'Other_Alphabetic' property), but the very common pulli 
>> (VIRAMA)
>> is neither in Lo nor has 'Other_Alphabetic', and so leads to
>> concluding any string containing it to be non-alphabetic.
>>
>> This doesn't make sense to me since the Virama? ???? as much of an
>> alphabetic character as any of the "Dependent Vowel" characters which
>> have been given the 'Other_Alphabetic' property. Is there a rationale
>> behind this difference, or is it an oversight to be corrected?
>
> I suggest submitting an error report via 
> https://www.unicode.org/reporting.html. I haven't studied the issue in 
> detail (sorry, just no time this week), but it sounds reasonable to 
> give the VIRAMA the 'Other_Alphabetic' property.
>
> I'd recommend to mention examples other than Tamil in your report 
> (assuming they exist).
>
> BTW, what's the method you are using in Ruby? If there's a problem in 
> Ruby (which I don't think; it's just using Unicode data), then please 
> make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I 
> should be able to follow up on that.
>
> Regards,?? Martin.
>


From unicode at unicode.org  Mon May 28 23:45:04 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Mon, 28 May 2018 21:45:04 -0700
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
In-Reply-To: <6fbc337a-32a1-5b92-fc1b-6a9dc52f42db@it.aoyama.ac.jp>
References: <CAOORhHqzt=ZZJwUgYpkX0_44Yq5q+9TbAFxqGM8MM2wJXrjAdA@mail.gmail.com>
 <6fbc337a-32a1-5b92-fc1b-6a9dc52f42db@it.aoyama.ac.jp>
Message-ID: <b40017fa-7f50-9120-f485-6f0ec3481d95@att.net>


On 5/28/2018 9:23 PM, Martin J. D?rst via Unicode wrote:
> Hello Sundar,
>
> On 2018/05/28 04:27, SundaraRaman R via Unicode wrote:
>> Hi,
>>
>> In languages like Ruby or Java
>> (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)), 
>>
>> functions to check if a character is alphabetic do that by looking for
>> the 'Alphabetic'? property (defined true if it's in one of the L
>> categories, or Nl, or has 'Other_Alphabetic' property). When parsing
>> Tamil text, this works out well for independent vowels and consonants
>> (which are in Lo), and for most dependent signs (which are in Mc or Mn
>> but have the 'Other_Alphabetic' property), but the very common pulli 
>> (VIRAMA)
>> is neither in Lo nor has 'Other_Alphabetic', and so leads to
>> concluding any string containing it to be non-alphabetic.
>>
>> This doesn't make sense to me since the Virama? ???? as much of an
>> alphabetic character as any of the "Dependent Vowel" characters which
>> have been given the 'Other_Alphabetic' property. Is there a rationale
>> behind this difference, or is it an oversight to be corrected?
>
> I suggest submitting an error report via 
> https://www.unicode.org/reporting.html. I haven't studied the issue in 
> detail (sorry, just no time this week), but it sounds reasonable to 
> give the VIRAMA the 'Other_Alphabetic' property.

Please don't. This is not an error in the Unicode property assignments, 
which have been stable in scope for Alphabetic for some time now.

The problem is in assuming that the Java or Ruby isAphabetic() API, 
which simply report the Unicode property value Alphabetic for a 
character, suffices for identifying a string as somehow "wordlike". It 
doesn't.

The approximation you are looking for is to add Diacritic to Alphabetic. 
That will automatically pull in all the nuktas and viramas/killers for 
Brahmi-derived scripts. It also will pull in the harakat for Arabic and 
similar abjads, which are also not Alphabetic in the property values. 
And it will pull in tone marks for various writing systems.

For good measure, also add Extender, which will pick up length marks and 
iteration marks.

Please do not assume that the Alphabetic property just automatically 
equates to "what I would write in a word". Or that it should be adjusted 
to somehow make that happen. It would be highly advisable to study *all* 
the UCD properties in more depth, before starting to report bugs in one 
or another simply because using a single property doesn't produce the 
string classification one assumes should be correct in a particular case.

Of course, to get a better approximation of what actually constitutes a 
"word" in a particular writing system, instead of using raw property 
API's, one should be using a WordBreak iterator, preferably one tailored 
for the language in question.

--Ken


>
> I'd recommend to mention examples other than Tamil in your report 
> (assuming they exist).
>
> BTW, what's the method you are using in Ruby? If there's a problem in 
> Ruby (which I don't think; it's just using Unicode data), then please 
> make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I 
> should be able to follow up on that.
>
> Regards,?? Martin.
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180528/7b2ad4e8/attachment.html>

From unicode at unicode.org  Tue May 29 00:02:15 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Mon, 28 May 2018 22:02:15 -0700
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
In-Reply-To: <c0059845-1b8a-fe9f-5819-ef7bf9483762@ix.netcom.com>
References: <CAOORhHqzt=ZZJwUgYpkX0_44Yq5q+9TbAFxqGM8MM2wJXrjAdA@mail.gmail.com>
 <6fbc337a-32a1-5b92-fc1b-6a9dc52f42db@it.aoyama.ac.jp>
 <c0059845-1b8a-fe9f-5819-ef7bf9483762@ix.netcom.com>
Message-ID: <d64edd94-a6b5-a402-6bb9-896ad2dea04e@att.net>


On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote:
> One of the general principles is that combining marks inherit the 
> property of their base character.
>
> Normally, "inherited" should be the only property value for combining 
> marks.
>
> There have been some deviations from this over the years, for various 
> reasons, and there are some properties (such as general category) 
> where it is necessary to recognize the character as combining, but the 
> general principle still holds.
>
> Therefore, if you are trying to see whether a string is alphabetic, 
> combining marks should be "transparent" to such an algorithm.

Generally, good advice. But there are clear exceptions. For example, the 
enclosing combining marks for symbols are intended (basically) to make 
symbols of a sort. And many combining marks have explicit script 
assigments, so they cannot simply willy-nilly inherit the script of a 
base letter if they are misapplied, for example.

This is why I recommend simply adding the Diacritic property into the 
mix for testing a string. That is a closer approximation to the kind of 
naive "Is this string alphabetic?" question that SunaraRaman was asking 
about -- it picks up the correct subset of combining marks to union with 
the set of actual isAlphabetic characters, to produce more expected 
results. (Including, of course, the correct classification of all the 
viramas, stackers, and killers, as well as picking up all the nuktas.).

Folks, please examine the set of character for Diacritic and for 
Extender in:

http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

to see what I'm talking about. The stuff you are looking for is already 
there.

--Ken

P.S. And please don't start an argument about the fact that a "virama" 
isn't really a "diacritic". We know that, too. ;-)


From unicode at unicode.org  Tue May 29 00:30:22 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Mon, 28 May 2018 22:30:22 -0700
Subject: preliminary proposal: New Unicode characters for Arabic music
 half-flat and half-sharp symbols
In-Reply-To: <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
 <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
Message-ID: <07f69661-ef7e-3f9c-6d5d-7eeae429c56e@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180528/9f8384e7/attachment.html>

From unicode at unicode.org  Tue May 29 02:49:52 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 29 May 2018 08:49:52 +0100
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
In-Reply-To: <d64edd94-a6b5-a402-6bb9-896ad2dea04e@att.net>
References: <CAOORhHqzt=ZZJwUgYpkX0_44Yq5q+9TbAFxqGM8MM2wJXrjAdA@mail.gmail.com>
 <6fbc337a-32a1-5b92-fc1b-6a9dc52f42db@it.aoyama.ac.jp>
 <c0059845-1b8a-fe9f-5819-ef7bf9483762@ix.netcom.com>
 <d64edd94-a6b5-a402-6bb9-896ad2dea04e@att.net>
Message-ID: <20180529084952.7856023d@JRWUBU2>

On Mon, 28 May 2018 22:02:15 -0700
Ken Whistler via Unicode <unicode at unicode.org> wrote:

> On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote:
> > One of the general principles is that combining marks inherit the 
> > property of their base character.
> >
> > Normally, "inherited" should be the only property value for
> > combining marks.
> >
> > There have been some deviations from this over the years, for
> > various reasons, and there are some properties (such as general
> > category) where it is necessary to recognize the character as
> > combining, but the general principle still holds.
> >
> > Therefore, if you are trying to see whether a string is alphabetic, 
> > combining marks should be "transparent" to such an algorithm.  
> 
> Generally, good advice. But there are clear exceptions. For example,
> the enclosing combining marks for symbols are intended (basically) to
> make symbols of a sort. And many combining marks have explicit script 
> assigments, so they cannot simply willy-nilly inherit the script of a 
> base letter if they are misapplied, for example.

How would one know that they are misapplied?  And what if the author of
the text has broken your rules? Are such texts never to be transcribed
to pukka Unicode?

> This is why I recommend simply adding the Diacritic property into the 
> mix for testing a string. That is a closer approximation to the kind
> of naive "Is this string alphabetic?" question that SunaraRaman was
> asking about -- it picks up the correct subset of combining marks to
> union with the set of actual isAlphabetic characters, to produce more
> expected results. (Including, of course, the correct classification
> of all the viramas, stackers, and killers, as well as picking up all
> the nuktas.).
> 
> Folks, please examine the set of character for Diacritic and for 
> Extender in:
> 
> http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
> 
> to see what I'm talking about. The stuff you are looking for is
> already there.

Even without knowing exactly what is wanted, it looks to me as though
it isn't.  If he wants to allow <pulli, ZWNJ> as a substring, which
he should, then that fails because there is no overlap between
p{extender} and p{gc=Cf} or between p{diacritic} and p{gc=Cf}.  U+034F
COMBINING GRAPHEME JOINER is also missing, apparently deliberately in
the case of 'diacritic'. If one uses the definition of words in the
word break algorithm, one will end up accepting combinations of letter
plus enclosing circle or keycap.  (A fix to the word break algorithm
for that would be unpleasant.)

One hopes that the requirement doesn't include accepting all single
words.  Every properly spelt word containing U+0E46 THAI CHARACTER
MAIYAMOK will be rejected, as it will contain a space before the
U+0E46.  (I assume there are such words; certainly there are
dictionary entries with no corresponding entries without U+0E46,
such as "???? ?".) At a lesser level, even English has a very few
words with spaces in them, and there is no solution but to list them.

Richard.


From unicode at unicode.org  Tue May 29 03:08:48 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 29 May 2018 09:08:48 +0100
Subject: Unicode characters unification
In-Reply-To: <8e0a34b8-c074-a152-42d0-bc55b9a132ff@ix.netcom.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
 <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
 <20180528160037.6b3689e0@JRWUBU2>
 <ECFA3143-3501-40BD-8CB2-764AD28B1162@telia.com>
 <20180528181852.7ce84e52@JRWUBU2>
 <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com>
 <20180528200139.744ee706@JRWUBU2>
 <8e0a34b8-c074-a152-42d0-bc55b9a132ff@ix.netcom.com>
Message-ID: <20180529090848.5ffae27a@JRWUBU2>

On Mon, 28 May 2018 21:40:49 -0700
Asmus Freytag via Unicode <unicode at unicode.org> wrote:

> But such exceptions prove the rule, which leads back to where we 
> started: the default position is that Unicode encodes a character 
> identity that is not the same as encoding the concept that said 
> character is used to represent in writing.

And the problem remains that of determining the 'identity'.  It is
rather like distinguishing species - biologists have dozens of
different concepts.

Richard.

From unicode at unicode.org  Tue May 29 03:15:42 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Tue, 29 May 2018 10:15:42 +0200
Subject: =?utf-8?Q?Re=3A_Uppercase_=C3=9F?=
In-Reply-To: <07f69661-ef7e-3f9c-6d5d-7eeae429c56e@ix.netcom.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <CAN49p6prnaSA2MQ0paQH8L6unS78CJ0XAVKy-FAqh3RdkcGO3w@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
 <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
 <07f69661-ef7e-3f9c-6d5d-7eeae429c56e@ix.netcom.com>
Message-ID: <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com>


> On 29 May 2018, at 07:30, Asmus Freytag via Unicode <unicode at unicode.org> wrote:
> 
> On 5/28/2018 6:30 AM, Hans ?berg via Unicode wrote:
>>> Unifying these would make a real mess of lower casing!
>>> 
>> German has a special sign ? for "ss", without upper capital version.
>> 
>> 
> You may want to retract the second part of that sentence.
> 
> An uppercase exists and it has formally been ruled as acceptable way to write this letter (mostly an issue for ALL CAPS as ? does not occur in word-initial position). 
> A./

Duden used one in 1957, but stated in 1984 that there is no uppercase version [1]. So it would be interesting with a reference to an official version.

1. https://en.wikipedia.org/wiki/?


From unicode at unicode.org  Tue May 29 03:54:27 2018
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Tue, 29 May 2018 17:54:27 +0900
Subject: =?UTF-8?Q?Re:_Uppercase_=c3=9f?=
In-Reply-To: <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
 <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
 <07f69661-ef7e-3f9c-6d5d-7eeae429c56e@ix.netcom.com>
 <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com>
Message-ID: <cbde30d9-43e3-78d1-5492-780d777465b6@it.aoyama.ac.jp>

On 2018/05/29 17:15, Hans ?berg via Unicode wrote:
> 
>> On 29 May 2018, at 07:30, Asmus Freytag via Unicode <unicode at unicode.org> wrote:

>> An uppercase exists and it has formally been ruled as acceptable way to write this letter (mostly an issue for ALL CAPS as ? does not occur in word-initial position).
>> A./
> 
> Duden used one in 1957, but stated in 1984 that there is no uppercase version [1]. So it would be interesting with a reference to an official version.
> 
> 1. https://en.wikipedia.org/wiki/?

The English wikipedia may not be fully up to date.
See https://de.wikipedia.org/wiki/Gro?es_? (second paragraph):

"Seit dem 29. Juni 2017 ist das ? Bestandteil der amtlichen deutschen 
Rechtschreibung.[2][3]"

Translated to English: "Since June 29, 2017, the ? is part of the 
official German orthography."

(As far as I remember the discussion (on this list?) last year, the ? 
(uppercase ?) is allowed, but not required.)

Regards,   Martin.


From unicode at unicode.org  Tue May 29 04:04:17 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Tue, 29 May 2018 11:04:17 +0200
Subject: =?utf-8?Q?Re=3A_Uppercase_=C3=9F?=
In-Reply-To: <cbde30d9-43e3-78d1-5492-780d777465b6@it.aoyama.ac.jp>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
 <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
 <07f69661-ef7e-3f9c-6d5d-7eeae429c56e@ix.netcom.com>
 <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com>
 <cbde30d9-43e3-78d1-5492-780d777465b6@it.aoyama.ac.jp>
Message-ID: <2BD47BB7-F4B1-4A87-871E-499101C19AE6@telia.com>


> On 29 May 2018, at 10:54, Martin J. D?rst <duerst at it.aoyama.ac.jp> wrote:
> 
> On 2018/05/29 17:15, Hans ?berg via Unicode wrote:
>>> On 29 May 2018, at 07:30, Asmus Freytag via Unicode <unicode at unicode.org> wrote:
> 
>>> An uppercase exists and it has formally been ruled as acceptable way to write this letter (mostly an issue for ALL CAPS as ? does not occur in word-initial position).
>>> A./
>> Duden used one in 1957, but stated in 1984 that there is no uppercase version [1]. So it would be interesting with a reference to an official version.
>> 1. https://en.wikipedia.org/wiki/?
> 
> The English wikipedia may not be fully up to date.
> See https://de.wikipedia.org/wiki/Gro?es_? (second paragraph):
> 
> "Seit dem 29. Juni 2017 ist das ? Bestandteil der amtlichen deutschen Rechtschreibung.[2][3]"
> 
> Translated to English: "Since June 29, 2017, the ? is part of the official German orthography."
> 
> (As far as I remember the discussion (on this list?) last year, the ? (uppercase ?) is allowed, but not required.)

And it is already in Unicode as ? LATIN CAPITAL LETTER SHARP S U+1E9E. When looking for the lowercase ?
LATIN SMALL LETTER SHARP S U+00DF in a MacOS Character Viewer, it does not give the uppercase version, for some reason.

The equivalence with "ss" shows up ICU Regular Expressions that do case insensitive matching where the cases have different length, so it should do that for the new character to, I gather.
  http://userguide.icu-project.org/strings/regexp


From unicode at unicode.org  Tue May 29 04:17:53 2018
From: unicode at unicode.org (Werner LEMBERG via Unicode)
Date: Tue, 29 May 2018 11:17:53 +0200 (CEST)
Subject: Uppercase =?utf-8?B?w58=?=
In-Reply-To: <2BD47BB7-F4B1-4A87-871E-499101C19AE6@telia.com>
References: <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com>
 <cbde30d9-43e3-78d1-5492-780d777465b6@it.aoyama.ac.jp>
 <2BD47BB7-F4B1-4A87-871E-499101C19AE6@telia.com>
Message-ID: <20180529.111753.1271341346199266089.wl@gnu.org>


> When looking for the lowercase ? LATIN SMALL LETTER SHARP S U+00DF
> in a MacOS Character Viewer, it does not give the uppercase version,
> for some reason.

Yes, and it will stay so, AFAIK.  The uppercase variant of `?' is
`SS'.  `?' is to be used mainly for names that contain `?', and which
must be printed uppercase, for example in passports.  Here the
distinction is important, cf.

  Strau? vs. Strauss  ?  STRAU? vs. STRAUSS

Since uppercasing is not common in typesetting German text (in
particular headers), the need to make a distinction between words like
`Masse' (mass) and `Ma?e' (dimensions) if written uppercase is rarely
necessary because it can usually deduced by context.


    Werner


From unicode at unicode.org  Tue May 29 05:39:41 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Tue, 29 May 2018 12:39:41 +0200
Subject: =?utf-8?Q?Re=3A_Uppercase_=C3=9F?=
In-Reply-To: <20180529.111753.1271341346199266089.wl@gnu.org>
References: <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com>
 <cbde30d9-43e3-78d1-5492-780d777465b6@it.aoyama.ac.jp>
 <2BD47BB7-F4B1-4A87-871E-499101C19AE6@telia.com>
 <20180529.111753.1271341346199266089.wl@gnu.org>
Message-ID: <19F85948-56B4-4B7A-B2CF-7E47373B8F95@telia.com>


> On 29 May 2018, at 11:17, Werner LEMBERG <wl at gnu.org> wrote:
> 
>> When looking for the lowercase ? LATIN SMALL LETTER SHARP S U+00DF
>> in a MacOS Character Viewer, it does not give the uppercase version,
>> for some reason.
> 
> Yes, and it will stay so, AFAIK.  The uppercase variant of `?' is
> `SS'.  `?' is to be used mainly for names that contain `?', and which
> must be printed uppercase, for example in passports.  Here the
> distinction is important, cf.
> 
>  Strau? vs. Strauss  ?  STRAU? vs. STRAUSS
> 
> Since uppercasing is not common in typesetting German text (in
> particular headers), the need to make a distinction between words like
> `Masse' (mass) and `Ma?e' (dimensions) if written uppercase is rarely
> necessary because it can usually deduced by context.

If uppercasing is not common, one would think that setting it too ? would pose no problems, no that it is available.


From unicode at unicode.org  Tue May 29 07:20:26 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Tue, 29 May 2018 14:20:26 +0200
Subject: =?utf-8?Q?Re=3A_Uppercase_=C3=9F?=
In-Reply-To: <20180529105516.GA4094094@phare.normalesup.org>
References: <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com>
 <cbde30d9-43e3-78d1-5492-780d777465b6@it.aoyama.ac.jp>
 <2BD47BB7-F4B1-4A87-871E-499101C19AE6@telia.com>
 <20180529.111753.1271341346199266089.wl@gnu.org>
 <19F85948-56B4-4B7A-B2CF-7E47373B8F95@telia.com>
 <20180529105516.GA4094094@phare.normalesup.org>
Message-ID: <32E1D042-96B0-40BA-B67D-29D5FE84B657@telia.com>


> On 29 May 2018, at 12:55, Arthur Reutenauer <arthur at reutenauer.eu> wrote:
> 
>> If uppercasing is not common, one would think that setting it too ? would pose no problems, no that it is available.
> 
>  It would, for reasons of stability.

The main point is what users of ? and ? would think, and Unicode to adjust accordingly.


From unicode at unicode.org  Tue May 29 07:57:57 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Tue, 29 May 2018 14:57:57 +0200
Subject: =?utf-8?Q?Re=3A_Uppercase_=C3=9F?=
In-Reply-To: <20180529124759.GA4141011@phare.normalesup.org>
References: <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com>
 <cbde30d9-43e3-78d1-5492-780d777465b6@it.aoyama.ac.jp>
 <2BD47BB7-F4B1-4A87-871E-499101C19AE6@telia.com>
 <20180529.111753.1271341346199266089.wl@gnu.org>
 <19F85948-56B4-4B7A-B2CF-7E47373B8F95@telia.com>
 <20180529105516.GA4094094@phare.normalesup.org>
 <32E1D042-96B0-40BA-B67D-29D5FE84B657@telia.com>
 <20180529124759.GA4141011@phare.normalesup.org>
Message-ID: <974E56E1-9C39-40AB-8DA4-B0012672B288@telia.com>


> On 29 May 2018, at 14:47, Arthur Reutenauer <arthur at reutenauer.eu> wrote:
> 
>> The main point is what users of ? and ? would think, and Unicode to adjust accordingly.
> 
>  Since users of ? would think that in the vast majority of cases, it
> ought to be uppercased to SS, I think you?re missing the main point.

No, you missed the point.


From unicode at unicode.org  Tue May 29 09:27:21 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Tue, 29 May 2018 07:27:21 -0700
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
In-Reply-To: <20180529084952.7856023d@JRWUBU2>
References: <CAOORhHqzt=ZZJwUgYpkX0_44Yq5q+9TbAFxqGM8MM2wJXrjAdA@mail.gmail.com>
 <6fbc337a-32a1-5b92-fc1b-6a9dc52f42db@it.aoyama.ac.jp>
 <c0059845-1b8a-fe9f-5819-ef7bf9483762@ix.netcom.com>
 <d64edd94-a6b5-a402-6bb9-896ad2dea04e@att.net>
 <20180529084952.7856023d@JRWUBU2>
Message-ID: <6e7d0393-7a99-55c9-7a73-5b3b2fe52e1c@att.net>


On 5/29/2018 12:49 AM, Richard Wordingham via Unicode wrote:
> How would one know that they are misapplied?  And what if the author of
> the text has broken your rules? Are such texts never to be transcribed
> to pukka Unicode?

Applying Tamil -ii (0BC0, Script=Tamil) to the Latin letter a (0061, 
Script=Latin) doesn't automatically make the Tamil vowel "inherit" the 
Latin script property value, nor should it.

That said, if someone decides they want that sequence, and their text as 
"broken my rules", so be it. I'm just not going to assume anything 
particular about that text. Note that in terms of trying to determine 
whether such a string is (naively) alphabetic, such a sequence doesn't 
interfere with the determination. On the other hand, a process concerned 
about text runs, script assignment, validity for domains, or other such 
issues *will* be sensitive to such a boundary -- and should not be 
overruled by some generic determination that combining marks inherit all 
the properties of their base.

>
>
> Even without knowing exactly what is wanted, it looks to me as though
> it isn't.  If he wants to allow <pulli, ZWNJ> as a substring, which
> he should, then that fails because there is no overlap between
> p{extender} and p{gc=Cf} or between p{diacritic} and p{gc=Cf}.

Yes, so if you are working with strings for Indic scripts (or for that 
matter, Arabic), you add Join_Control to the mix:

Alphabetic? ? Diacritic ? Extender ? Join_Control

gets you a decent approximation of what is (naively) expected to fall 
within an "alphabetic" string for most scripts.

For those following along, Alphabetic is roughly meant to cover the ABC, 
?????,... plus ideographic elements of most scripts. Diacritic picks up 
most of the applied combining marks, including nuktas, viramas, and tone 
marks. Extender picks up spacing elements that indicate length, 
reduplication, iteration, etc. And joiners are, well, joiners.

If one wants finer categorization specifically for Indic scripts, then I 
would suggest turning to the Indic_Syllabic_Category property instead of 
a union of PropList.txt properties and/or some twiddling with 
General_Category values.

--Ken


From unicode at unicode.org  Tue May 29 05:55:16 2018
From: unicode at unicode.org (Arthur Reutenauer via Unicode)
Date: Tue, 29 May 2018 12:55:16 +0200
Subject: Uppercase =?iso-8859-1?Q?=DF?=
In-Reply-To: <19F85948-56B4-4B7A-B2CF-7E47373B8F95@telia.com>
References: <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com>
 <cbde30d9-43e3-78d1-5492-780d777465b6@it.aoyama.ac.jp>
 <2BD47BB7-F4B1-4A87-871E-499101C19AE6@telia.com>
 <20180529.111753.1271341346199266089.wl@gnu.org>
 <19F85948-56B4-4B7A-B2CF-7E47373B8F95@telia.com>
Message-ID: <20180529105516.GA4094094@phare.normalesup.org>

> If uppercasing is not common, one would think that setting it too ? would pose no problems, no that it is available.

  It would, for reasons of stability.

	Arthur

From unicode at unicode.org  Tue May 29 07:47:59 2018
From: unicode at unicode.org (Arthur Reutenauer via Unicode)
Date: Tue, 29 May 2018 14:47:59 +0200
Subject: Uppercase =?iso-8859-1?Q?=DF?=
In-Reply-To: <32E1D042-96B0-40BA-B67D-29D5FE84B657@telia.com>
References: <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com>
 <cbde30d9-43e3-78d1-5492-780d777465b6@it.aoyama.ac.jp>
 <2BD47BB7-F4B1-4A87-871E-499101C19AE6@telia.com>
 <20180529.111753.1271341346199266089.wl@gnu.org>
 <19F85948-56B4-4B7A-B2CF-7E47373B8F95@telia.com>
 <20180529105516.GA4094094@phare.normalesup.org>
 <32E1D042-96B0-40BA-B67D-29D5FE84B657@telia.com>
Message-ID: <20180529124759.GA4141011@phare.normalesup.org>

> The main point is what users of ? and ? would think, and Unicode to adjust accordingly.

  Since users of ? would think that in the vast majority of cases, it
ought to be uppercased to SS, I think you?re missing the main point.

	Arthur

From unicode at unicode.org  Tue May 29 12:23:38 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Tue, 29 May 2018 10:23:38 -0700
Subject: =?UTF-8?Q?Re:_Uppercase_=c3=9f?=
In-Reply-To: <974E56E1-9C39-40AB-8DA4-B0012672B288@telia.com>
References: <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com>
 <cbde30d9-43e3-78d1-5492-780d777465b6@it.aoyama.ac.jp>
 <2BD47BB7-F4B1-4A87-871E-499101C19AE6@telia.com>
 <20180529.111753.1271341346199266089.wl@gnu.org>
 <19F85948-56B4-4B7A-B2CF-7E47373B8F95@telia.com>
 <20180529105516.GA4094094@phare.normalesup.org>
 <32E1D042-96B0-40BA-B67D-29D5FE84B657@telia.com>
 <20180529124759.GA4141011@phare.normalesup.org>
 <974E56E1-9C39-40AB-8DA4-B0012672B288@telia.com>
Message-ID: <4dc66fa6-0521-cbad-7447-82e4be9a10d6@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180529/4d1d8818/attachment.html>

From unicode at unicode.org  Tue May 29 12:27:17 2018
From: unicode at unicode.org (Asmus Freytag (c) via Unicode)
Date: Tue, 29 May 2018 10:27:17 -0700
Subject: Unicode characters unification
In-Reply-To: <20180529090848.5ffae27a@JRWUBU2>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
 <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
 <20180528160037.6b3689e0@JRWUBU2>
 <ECFA3143-3501-40BD-8CB2-764AD28B1162@telia.com>
 <20180528181852.7ce84e52@JRWUBU2>
 <476BC09F-4EC6-4A6E-918F-B113E3089631@telia.com>
 <20180528200139.744ee706@JRWUBU2>
 <8e0a34b8-c074-a152-42d0-bc55b9a132ff@ix.netcom.com>
 <20180529090848.5ffae27a@JRWUBU2>
Message-ID: <da277277-bdca-d265-0516-b5825b034ce9@ix.netcom.com>

On 5/29/2018 1:08 AM, Richard Wordingham wrote:
> On Mon, 28 May 2018 21:40:49 -0700
> Asmus Freytag via Unicode <unicode at unicode.org> wrote:
>
>> But such exceptions prove the rule, which leads back to where we
>> started: the default position is that Unicode encodes a character
>> identity that is not the same as encoding the concept that said
>> character is used to represent in writing.
> And the problem remains that of determining the 'identity'.  It is
> rather like distinguishing species - biologists have dozens of
> different concepts.
>
> Richard.
>
Totally. Never said that encoding is a simple algorithmic process. :)

A./

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180529/5e9e018a/attachment.html>

From unicode at unicode.org  Tue May 29 12:33:35 2018
From: unicode at unicode.org (Otto Stolz via Unicode)
Date: Tue, 29 May 2018 19:33:35 +0200
Subject: =?UTF-8?Q?Re:_Uppercase_=c3=9f?=
In-Reply-To: <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com>
References: <CABJUkCeAustDR8uO1OYyb2N=oaM3zNaui7OdwzND=gqwCYhkMA@mail.gmail.com>
 <56f80259-d983-6b41-2d4e-02ebffe79562@att.net>
 <8D7C32F2-052F-4796-A8A8-752DAFEE753E@telia.com>
 <CA+p4_H0sUkVPVWixGeocjJ9Tthdn50CmcWBa+Tt_8Mg8i7W8rA@mail.gmail.com>
 <EE043B2C-B4C7-4B37-BD39-44DB1F3F047F@telia.com>
 <CA+p4_H1bx19k29ybQHVsKZMkwBRu8DVtP6uWgs7H5D_MBLvODg@mail.gmail.com>
 <BEA22C1C-5C9A-456D-A9B6-81F1E69ECC8B@telia.com>
 <CAGa7JC0TwveiYvGbS8okVchKhQWSWF=84_KnY0rFeWV9evXoKQ@mail.gmail.com>
 <CA+p4_H0_vuJd5yawt_DifJAZEzNK9S7RTo2RyGFE4ToAnPLAww@mail.gmail.com>
 <CAGa7JC0oiRu8kFkNE9499NOGPkWgb0hRemVyapmazQdXHnJNOw@mail.gmail.com>
 <242DEB2B-0517-4632-BD47-A2F346062F79@telia.com>
 <CA+p4_H0xpEpC8K1TRnNd5mhUbpZGshDuup_-jzyfXJ=hqNnN5w@mail.gmail.com>
 <27CFC4F4-04DC-41BC-879B-7441218F0EEE@telia.com>
 <20180528141023.24d2231e@JRWUBU2>
 <D75700D2-4E61-43B9-89CA-053A557323C9@telia.com>
 <07f69661-ef7e-3f9c-6d5d-7eeae429c56e@ix.netcom.com>
 <6FAD9BDA-A66A-4444-9F09-2B513C1D63FF@telia.com>
Message-ID: <72ae3932-5299-d55f-3d45-d84c542274d1@uni-konstanz.de>

Hello,
am 2018-05-29 um 10:15 Uhr hat Hans ?berg geschrieben:
> Duden used one in 1957, but stated in 1984 that there is no uppercase version [1].

There used to bee two differnt orthographic dictionaries,
both called ?Duden?:
? The Duden from Leipzig (DDR) had a captal ???, on the cover page
   of its 1957 edition.
? The Duden from Mannheim (FRG) never has featured a captal ???, IIRC.

> So it would be interesting with a reference to an official version.

Neither Duden has been anything like an ?official version? ? never ever.
Until 1996, the only official German orthography was somewhat loosely
defined by a common decision of the Ministers of Education of the FRG,
with an additional remark saying: ?In case of doubt, the spelling of the
latest edition of the Duden (i. e. the Mannheim version) will take
effect.?

Nowadays, the official version of the orthographic rules
can be found in:
<http://www.rechtschreibrat.com/regeln-und-woerterverzeichnis/>;
the Uppercase-? rule, particularily, is discussed in
<http://www.rechtschreibrat.com/DOX/rfdr_Regeln_2016_redigiert_2018.pdf>,
under ?25(E3); the latest version of the rule reads thusly:
> E3: Bei Schreibung mit Gro?buchstaben schreibt man SS.
> Daneben ist auch die Verwendung des Gro?buchstabens ? 
> m?glich. 
which means:
   When writing in all-caps, you write SS.
   Alternatively, the capital ? may be used.

So, the normal upper-case equivalent of German sharp-S
still is the double-S. The recently introduced capital sharp-S
is an optional alternative, but not the normal way of
uppercasing the sharp-S.

Best wishes,
    Otto Stolz


From unicode at unicode.org  Tue May 29 12:42:40 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 29 May 2018 18:42:40 +0100
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
In-Reply-To: <06D5CE553B39491DA2B6E0C05B63D53F@DougEwell>
References: <06D5CE553B39491DA2B6E0C05B63D53F@DougEwell>
Message-ID: <20180529184240.2526ec4e@JRWUBU2>

On Mon, 28 May 2018 16:13:43 -0600
Doug Ewell via Unicode <unicode at unicode.org> wrote:

> Richard Wordingham wrote:
> 
> > The effects of virama that spring to mind are:
> >
> > (a) Causing one or both letters on either side to change or combine
> > to indicate combination;
> >
> > (b) Appearing as a mark only if it does not affect one of the
> > letters on either side;
> >
> > (c) Causing a left matra to appear on the left of the sequence of
> > consonants joined by a sequence of non-visible viramas.  
> 
> Most of these don't apply to Tamil, of course.

They all apply to ????  <U+0B95, U+0BCD, U+0BB7, U+0BC7> TAMIL
SYLLABLE KSSEE.  There are four other named syllables where they all
apply.

Richard


From unicode at unicode.org  Tue May 29 14:15:09 2018
From: unicode at unicode.org (Werner LEMBERG via Unicode)
Date: Tue, 29 May 2018 21:15:09 +0200 (CEST)
Subject: Uppercase =?utf-8?B?w58=?=
In-Reply-To: <4dc66fa6-0521-cbad-7447-82e4be9a10d6@ix.netcom.com>
References: <20180529124759.GA4141011@phare.normalesup.org>
 <974E56E1-9C39-40AB-8DA4-B0012672B288@telia.com>
 <4dc66fa6-0521-cbad-7447-82e4be9a10d6@ix.netcom.com>
Message-ID: <20180529.211509.316853494165711301.wl@gnu.org>


> Overlooked in this discussion is the fact that the revised
> orthography of 1996 introduces for the first time a systematic
> difference in pronunciation for the vowel preceding SS vs. ? (short
> vs. long).  As users of the old orthography age out, I would not be
> surprised if the SS fallback were to become less acceptable over
> time because it would be at odds with how the word is to be
> pronounced. I'm also confidently expecting the use of ALL CAPS to
> become (somewhat) more prevalent under the continued influence of
> English usage.

It's not that simple.

* `?' is never used in Switzerland; it's always `ss' (and `SS').  Even
  ambiguous cases like `Masse' are always written like that.  This
  means that for Swiss users `?' is even more alien than for most
  German and Austrian users.  In particular, there doesn't exist a
  `unity SS' in Swiss German at all!  For example, the word `Ma?e' if
  capitalized to `MASSE' is hyphenated as `MA-SSE' in Germany and
  Austria (since `SS' is treated in this case as a unity).  However,
  the word is hyphenated as `MAS-SE' in Switzerland, since `ss', as a
  replacement for `?', is *not* treated as a unity.

* There are dialectic differences between northern and southern
  Germany (and Austria).  Example: `Gescho?' vs. `Geschoss', which
  means exactly the same ? and both orthographies are allowed.  For
  such cases, `GESCHOSS' is a much better uppercase version since it
  covers both dialectic forms.

I very much dislike the approach that just for the sake of `simplistic
standardization for uppercase' the use if `?' should be enforced in
German.  It's not the job of a language to fit computer usage.  It's
rather the job of computers to fit language usage.


    Werner


From unicode at unicode.org  Tue May 29 15:13:56 2018
From: unicode at unicode.org (Asmus Freytag (c) via Unicode)
Date: Tue, 29 May 2018 13:13:56 -0700
Subject: =?UTF-8?Q?Re:_Uppercase_=c3=9f?=
In-Reply-To: <20180529.211509.316853494165711301.wl@gnu.org>
References: <20180529124759.GA4141011@phare.normalesup.org>
 <974E56E1-9C39-40AB-8DA4-B0012672B288@telia.com>
 <4dc66fa6-0521-cbad-7447-82e4be9a10d6@ix.netcom.com>
 <20180529.211509.316853494165711301.wl@gnu.org>
Message-ID: <fa54e909-4b06-a423-3e16-2c778d651eb1@ix.netcom.com>

On 5/29/2018 12:15 PM, Werner LEMBERG wrote:
>> Overlooked in this discussion is the fact that the revised
>> orthography of 1996 introduces for the first time a systematic
>> difference in pronunciation for the vowel preceding SS vs. ? (short
>> vs. long).  As users of the old orthography age out, I would not be
>> surprised if the SS fallback were to become less acceptable over
>> time because it would be at odds with how the word is to be
>> pronounced. I'm also confidently expecting the use of ALL CAPS to
>> become (somewhat) more prevalent under the continued influence of
>> English usage.
> It's not that simple.
>
> * `?' is never used in Switzerland; it's always `ss' (and `SS').  Even
>    ambiguous cases like `Masse' are always written like that.  This
>    means that for Swiss users `?' is even more alien than for most
>    German and Austrian users.  In particular, there doesn't exist a
>    `unity SS' in Swiss German at all!  For example, the word `Ma?e' if
>    capitalized to `MASSE' is hyphenated as `MA-SSE' in Germany and
>    Austria (since `SS' is treated in this case as a unity).  However,
>    the word is hyphenated as `MAS-SE' in Switzerland, since `ss', as a
>    replacement for `?', is *not* treated as a unity.

So the Swiss don't have that issue. What do they do for names?

>
> * There are dialectic differences between northern and southern
>    Germany (and Austria).  Example: `Gescho?' vs. `Geschoss', which
>    means exactly the same ? and both orthographies are allowed.  For
>    such cases, `GESCHOSS' is a much better uppercase version since it
>    covers both dialectic forms.
I don't see the claimed benefit; if you allow two different spellings in 
lowercase to
track the phonetic difference, then that would rather seem to support my 
argument
that there is now a tension in the orthography (for standard German) 
that may well
resolve itself by greater use of the distinct uppercase form.

Users who will end up "resolving" this would be those who grew up only 
with the
revised orthography. Older users are used to a different principle of 
selecting
between SS and ? and that isn't tied to pronunciation of preceding vowel.

>
> I very much dislike the approach that just for the sake of `simplistic
> standardization for uppercase' the use if `?' should be enforced in
> German.  It's not the job of a language to fit computer usage.  It's
> rather the job of computers to fit language usage.
Hmm, don't see anyone calling for that in this discussion.

A./
>
>
>      Werner


From unicode at unicode.org  Tue May 29 15:43:52 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 29 May 2018 21:43:52 +0100
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
In-Reply-To: <6e7d0393-7a99-55c9-7a73-5b3b2fe52e1c@att.net>
References: <CAOORhHqzt=ZZJwUgYpkX0_44Yq5q+9TbAFxqGM8MM2wJXrjAdA@mail.gmail.com>
 <6fbc337a-32a1-5b92-fc1b-6a9dc52f42db@it.aoyama.ac.jp>
 <c0059845-1b8a-fe9f-5819-ef7bf9483762@ix.netcom.com>
 <d64edd94-a6b5-a402-6bb9-896ad2dea04e@att.net>
 <20180529084952.7856023d@JRWUBU2>
 <6e7d0393-7a99-55c9-7a73-5b3b2fe52e1c@att.net>
Message-ID: <20180529214352.04906154@JRWUBU2>

On Tue, 29 May 2018 07:27:21 -0700
Ken Whistler via Unicode <unicode at unicode.org> wrote:

> On 5/29/2018 12:49 AM, Richard Wordingham via Unicode wrote:
> > How would one know that they are misapplied?  And what if the
> > author of the text has broken your rules? Are such texts never to
> > be transcribed to pukka Unicode?  
> 
> Applying Tamil -ii (0BC0, Script=Tamil) to the Latin letter a (0061, 
> Script=Latin) doesn't automatically make the Tamil vowel "inherit"
> the Latin script property value, nor should it.

It's the sort of process that gave us U+0310 COMBINING CANDRABINDU.
However, I see adding SE Asian dependent vowels to Latin letter x
(U+0078, Script=Latin) as rather tending to make 'x' Script=Common.
Others have disagreed quite vehemently.  I see the view that the base
character is U+00D7 MULTIPLICATION SIGN (InSC=Consonant_Placeholder) has
prevailed.  Serifed U+00D7 is quite common in manually typewritten
material; I remember it from school.  I'm not sure what script the
sequence <U+00D7, U+0EB5 LAO VOWEL SIGN II> belongs to in OpenType
layout. I ought to find out for the benefit of Tai Tham fonts.

> That said, if someone decides they want that sequence, and their text
> as "broken my rules", so be it. I'm just not going to assume anything 
> particular about that text. Note that in terms of trying to determine 
> whether such a string is (naively) alphabetic, such a sequence
> doesn't interfere with the determination. On the other hand, a
> process concerned about text runs, script assignment, validity for
> domains, or other such issues *will* be sensitive to such a boundary
> -- and should not be overruled by some generic determination that
> combining marks inherit all the properties of their base.

When it comes to script runs for rendering, such a rule feels
oppressive; it is widely unenforced.  For example, I have found that
if my font treats U+0E4A THAI CHARACTER MAI TRI as a Tai Tham
character, it will generally render satisfactorily on a Tai Tham
character.  Presumably I can now use a few examples of the same
Northern Thai syllable on the same page in a published language-teaching
book as evidence for adding its clone to the Tai Tham script.  There
should also be some examples of U+0ECA LAO TONE MAI TI on Lao Tai Tham
syllables, but I haven't found any yet.  See the chart at the end of
"Exemple d??criture ignor?e par Unicode : l??criture tham du Laos"
http://www.laosoftware.com/download/articleTALN.pdf for an implicit
claim of existence.

> > Even without knowing exactly what is wanted, it looks to me as
> > though it isn't.  If he wants to allow <pulli, ZWNJ> as a
> > substring, which he should, then that fails because there is no
> > overlap between p{extender} and p{gc=Cf} or between p{diacritic}
> > and p{gc=Cf}.  
> 
> Yes, so if you are working with strings for Indic scripts (or for
> that matter, Arabic), you add Join_Control to the mix:
> 
> Alphabetic? ? Diacritic ? Extender ? Join_Control
> 
> gets you a decent approximation of what is (naively) expected to fall 
> within an "alphabetic" string for most scripts.

but won't work for collatable Welsh 'Llan?gollen'!  (There's a CGJ
between the 'n' and the 'g'.)


One also needs Join_Control for fraktur German and, to my mind,
English 'Ca?esar'.

> For those following along, Alphabetic is roughly meant to cover the
> ABC, ?????,... plus ideographic elements of most scripts.
> Diacritic picks up most of the applied combining marks, including
> nuktas, viramas, and tone marks. Extender picks up spacing elements
> that indicate length, reduplication, iteration, etc. And joiners are,
> well, joiners.

'Diacritic' mostly includes marks with secondary collation weight;
those with primary weights, such as Indic dependent vowels, are mopped
up in Alphabetic.  (Removing diacritics is very much not the same
as removing combining marks.)

> If one wants finer categorization specifically for Indic scripts,
> then I would suggest turning to the Indic_Syllabic_Category property
> instead of a union of PropList.txt properties and/or some twiddling
> with General_Category values.

You'd still need to add gc=L to catch things like U+0971 DEVANAGARI SIGN
HIGH SPACING DOT (which starts syllables) and U+A8F4 DEVANAGARI SIGN
DOUBLE CANDRABINDU VIRAMA.  And you'd still miss U+0303 COMBINING TILDE
and U+0331 COMBINING MACRON BELOW from Thai script Pattani Malay - I
need to make another attempt to get them appropriate Indic syllabic
category values.

Richard.


From unicode at unicode.org  Tue May 29 16:03:25 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Tue, 29 May 2018 14:03:25 -0700
Subject: Why is TAMIL SIGN VIRAMA (pulli) not =?UTF-8?Q?Alphabetic=3F?=
Message-ID: <20180529140325.665a7a7059d7ee80bb4d670165c8327d.e20e66a5e5.wbe@email03.godaddy.com>

Richard Wordingham wrote:

>>> The effects of virama that spring to mind are:
>>>
>>> (a) Causing one or both letters on either side to change or combine
>>> to indicate combination;
>>>
>>> (b) Appearing as a mark only if it does not affect one of the
>>> letters on either side;
>>>
>>> (c) Causing a left matra to appear on the left of the sequence of
>>> consonants joined by a sequence of non-visible viramas.
>>
>> Most of these don't apply to Tamil, of course.
>
> They all apply to ???? <U+0B95, U+0BCD, U+0BB7, U+0BC7> TAMIL
> SYLLABLE KSSEE. There are four other named syllables where they all
> apply.

And several others where they do not. TUS explains that visible
pu??i is the general rule in Tamil, and conjunct ligatures are the
exception.

I should have written "These mostly don't apply to Tamil, of course."

In any case, Ken has answered the real underlying question: a process
that checks whether each character in a sequence is "alphabetic" is
inappropriate for determining whether the sequence constitutes a word.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From unicode at unicode.org  Tue May 29 16:46:21 2018
From: unicode at unicode.org (Werner LEMBERG via Unicode)
Date: Tue, 29 May 2018 23:46:21 +0200 (CEST)
Subject: Uppercase =?utf-8?B?w58=?=
In-Reply-To: <fa54e909-4b06-a423-3e16-2c778d651eb1@ix.netcom.com>
References: <4dc66fa6-0521-cbad-7447-82e4be9a10d6@ix.netcom.com>
 <20180529.211509.316853494165711301.wl@gnu.org>
 <fa54e909-4b06-a423-3e16-2c778d651eb1@ix.netcom.com>
Message-ID: <20180529.234621.2273334688029588499.wl@gnu.org>


>> * `?' is never used in Switzerland; it's always `ss' (and `SS').
>>   [...]
> 
> So the Swiss don't have that issue. What do they do for names?

Foreign names containing `?' are treated as-is, AFAIK.  It's similar
to using, say, accents in some foreign names in English.

>>   For such cases, `GESCHOSS' is a much better uppercase version
>>   since it covers both dialectic forms.

... and Swiss people would use the same uppercase version...

> I don't see the claimed benefit; [...]
>
> Users who will end up "resolving" this would be those who grew up
> only with the revised orthography.

Indeed.

>> I very much dislike the approach that just for the sake of
>> `simplistic standardization for uppercase' the use if `?' should be
>> enforced in German.  [...]
>
> Hmm, don't see anyone calling for that in this discussion.

Well, I hear an implicit ?Great, there is now an `?' character!  Let's
use it as the uppercase version of `?' everywhere so that this nasty
German peculiarity is finally gone.?

Maybe it's only me...


    Werner


From unicode at unicode.org  Tue May 29 16:49:06 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 29 May 2018 22:49:06 +0100
Subject: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
In-Reply-To: <20180529140325.665a7a7059d7ee80bb4d670165c8327d.e20e66a5e5.wbe@email03.godaddy.com>
References: <20180529140325.665a7a7059d7ee80bb4d670165c8327d.e20e66a5e5.wbe@email03.godaddy.com>
Message-ID: <20180529224906.400d0346@JRWUBU2>

On Tue, 29 May 2018 14:03:25 -0700
Doug Ewell via Unicode <unicode at unicode.org> wrote:

> In any case, Ken has answered the real underlying question: a process
> that checks whether each character in a sequence is "alphabetic" is
> inappropriate for determining whether the sequence constitutes a word.

Back in the second post of the thread, I made the point that a
conformant Unicode process cannot always give a yes/no answer to the
question of whether all characters in a string are alphabetic.

What we seem to have established is that Unicode properties are not set
up to facilitate the identification of words.  Given that
spell-checkers work, we have taken a wrong turn.  Perhaps we should
reconsider "b?e?", which consists of two letters each inside its own
enclosing circle.  The spell-checker I'm using considers it a
misspelt word, rather than two symbols side by side.

Richard.


From unicode at unicode.org  Tue May 29 18:32:19 2018
From: unicode at unicode.org (Asmus Freytag (c) via Unicode)
Date: Tue, 29 May 2018 16:32:19 -0700
Subject: =?UTF-8?Q?Re:_Uppercase_=c3=9f?=
In-Reply-To: <20180529.234621.2273334688029588499.wl@gnu.org>
References: <4dc66fa6-0521-cbad-7447-82e4be9a10d6@ix.netcom.com>
 <20180529.211509.316853494165711301.wl@gnu.org>
 <fa54e909-4b06-a423-3e16-2c778d651eb1@ix.netcom.com>
 <20180529.234621.2273334688029588499.wl@gnu.org>
Message-ID: <4f890d1c-34f2-a6c5-930d-f650e4184601@ix.netcom.com>

On 5/29/2018 2:46 PM, Werner LEMBERG wrote:
>>> I very much dislike the approach that just for the sake of
>>> `simplistic standardization for uppercase' the use if `?' should be
>>> enforced in German.  [...]
>> Hmm, don't see anyone calling for that in this discussion.
> Well, I hear an implicit ?Great, there is now an `?' character!  Let's
> use it as the uppercase version of `?' everywhere so that this nasty
> German peculiarity is finally gone.?

The ALL-CAPS "SS" really has little to recommend it, intrinsically. It 
is de-facto a fall-back; one that competed with "SZ" as used in 
telegrams (while they still were a thing). Not being able to know how to 
hyphenate MASSE without knowing the meaning of the word is also not 
something that I consider a "benefit".

Uppercase forms for `?' have been kicking around in fonts for a long 
time as was documented around the time that the character was encoded. 
It is possible mainly because running text in ALL CAPS is? indeed 
uncommon (and in the time of Fraktur was effectively not viable because 
the Fraktur capitals don't lend themselves to it. (If SS had ever 
occurred in Title-Case, I doubt it would have survived as long, other 
than the "Swiss solution" of making it the only form, also in lower case).

Saving an uppercase form for a non-initial letter was a godsend on 
typewriters -- adding to the factors that made the "SS" solution 
acceptable. But sign writers, type designers and typesetters did not 
find it so universally attractive - also documented exhaustively.

With changing environment (starting with influence from Anglo-Saxon use 
of type and not ending with the way the character is treated in relation 
to phonetics) I've been expecting so see usage evolving; and not 
necessarily driven by software engineers.

A./

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180529/3df3ba4f/attachment.html>

From unicode at unicode.org  Wed May 30 00:45:48 2018
From: unicode at unicode.org (Werner LEMBERG via Unicode)
Date: Wed, 30 May 2018 07:45:48 +0200 (CEST)
Subject: Uppercase =?utf-8?B?w58=?=
In-Reply-To: <4f890d1c-34f2-a6c5-930d-f650e4184601@ix.netcom.com>
References: <fa54e909-4b06-a423-3e16-2c778d651eb1@ix.netcom.com>
 <20180529.234621.2273334688029588499.wl@gnu.org>
 <4f890d1c-34f2-a6c5-930d-f650e4184601@ix.netcom.com>
Message-ID: <20180530.074548.66833950861253245.wl@gnu.org>


> The ALL-CAPS "SS" really has little to recommend it, intrinsically.
> It is de-facto a fall-back; one that competed with "SZ" as used in
> telegrams (while they still were a thing).

Well, the status of `?' is indeed complicated, and the radical
solution used in Switzerland has certainly benefits.

> Not being able to know how to hyphenate MASSE without knowing the
> meaning of the word is also not something that I consider a
> "benefit".

I don't see much difference to the English example of `re-cord'
vs. `rec-ord'.  And Swiss people won't start to use `?' just for
getting the right meaning...

> Uppercase forms for `?' have been kicking around in fonts for a long
> time as was documented around the time that the character was
> encoded.

Yes, and it was never successful.  The introduction of `?' into
Unicode a few years ago was mainly driven by experts, not something
that had big popularity before.

> With changing environment (starting with influence from Anglo-Saxon
> use of type and not ending with the way the character is treated in
> relation to phonetics) I've been expecting to see usage evolving;
> and not necessarily driven by software engineers.

Yes, let's see how everything will evolve.  Regardless of that,
software should support the status quo as good as possible.


    Werner


From unicode at unicode.org  Thu May 31 11:59:30 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 31 May 2018 17:59:30 +0100
Subject: Character Boundaries - Who is to choose?
Message-ID: <20180531175930.21b0d862@JRWUBU2>

This has nothing to do with grapheme boundaries.

A few days ago, I remarked that deciding whether two character usages
were of the same character was akin to deciding whether two populations
were of the same species.

It can also be difficult to decide where the boundary between two
species lies.  Is it the job of Unicode to prescribe the boundary
between two characters, or should it prefer to describe the boundary
that users largely follow?  A good example of an unobvious boundary is
U+02BC MODIFIER LETTER APOSTROPHE v. U+2019 RIGHT SINGLE QUOTATION MARK.

I am seeing a boundary issue between U+1A7A TAI THAM SIGN RA HAAM and
U+1A7C TAI THAM SIGN KHUEN-LUE KARAN.  Between them, they have two
different functions, namely as the superscript final consonant form of
RA and as a killer.  My understanding of the difference was that it was
based on the glyph shape.  The function of final consonant would always
be performed by U+1A7A, and U+1A7C would always have the function of
killer.  The 'HAAM' in 'RA HAAM' means 'to prohibit'.  KARAN seems to
be a loanword from Siamese, where it originally seems to just mean
'final letter', which is the only meaning I could find for it in Pali
(as _k?ranta_); nowadays, in Siamese it means 'a letter bearing the
mark U+0E4C THAI CHARACTER THANTHAKHAT', which Siamese mark is known as
_mai wanchakan_ when it just kills the vowel. 

In older Tai Khuen (1930's), both functions are performed by the RA
HAAM glyph.  The glyph used is relatively large.

What I have been seeing a lot of recently is Northern Thai text where
the killer function is encoded U+1A7C.  This does not strike me as
unreasonable; the usage expresses the view that the difference between
U+1A7C, which typically has a small glyph, and the Northern Thai glyph
for the killer function, which also tends to be small, is simply glyph
variation.   (I have no evidence of Northern Thai using superscript
final RA.) 

The idea of encoding the two functions differently was abandoned
because of the principle that combining marks are encoded on the basis
of form; encoding them separately would, on the face of the evidence,
have been like encoding diaeresis and the mark of umlaut separately.

Richard.