From christoph.paeper at crissov.de  Fri Dec  2 06:35:41 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Fri, 2 Dec 2016 13:35:41 +0100
Subject: Emoji mappings in Shift JIS / CP932/943
Message-ID: <B2B71287-F47A-4CB6-8C0B-55F2AB61C01C@crissov.de>

I understand from 

- http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt

that Windows codepage 932 (IBM CP943) is basically (a superset of) Shift-JIS (JIS X 0208 A1). There are at least 3 related mapping files:

- http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
- http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt
- http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT

I don?t know much about Shift-JIS, so this question may sound stupid: 
Could and should custom vendor extensions like the ones documented in

- http://unicode.org/Public/UCD/latest/ucd/EmojiSources.txt

be included in these mappings?

Related English Wikipedia articles:

- https://en.wikipedia.org/wiki/JIS_X_0208
- https://en.wikipedia.org/wiki/Shift_JIS
- https://en.wikipedia.org/wiki/Code_page_932
- https://en.wikipedia.org/wiki/Code_page_943

____

Furthermore, are the files in /Public/MAPPINGS/ supposed to be maintained at all as characters get added to subsequent releases of Unicode? For instance, I think that

- http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/SGML.TXT

(dated 25 July 1997, last modified 8 April 2002) includes several `????` that could be specified nowadays, e.g.:

-     epsiv	ISOgrk3	0x????	# variant epsilon
+     epsiv	ISOgrk3	0x03F5	# GREEK LUNATE EPSILON SYMBOL


From markus.icu at gmail.com  Fri Dec  2 10:46:14 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 2 Dec 2016 08:46:14 -0800
Subject: Emoji mappings in Shift JIS / CP932/943
In-Reply-To: <B2B71287-F47A-4CB6-8C0B-55F2AB61C01C@crissov.de>
References: <B2B71287-F47A-4CB6-8C0B-55F2AB61C01C@crissov.de>
Message-ID: <CAN49p6qxLdDjg_19m+t3v6ri8kNoVjLOYLBHuK4cQink+e5PNA@mail.gmail.com>

On Fri, Dec 2, 2016 at 4:35 AM, Christoph P?per <christoph.paeper at crissov.de
> wrote:

> Could and should custom vendor extensions like the ones documented in
>
> - http://unicode.org/Public/UCD/latest/ucd/EmojiSources.txt
>
> be included in these mappings?
>

They could, but it would be best for vendors to publish their actual
mappings rather than others guessing them.

At this point, the Emoji vendor mappings are not very relevant any more
because Unicode has added many Emoji symbols that are not in the old vendor
charsets.

In general, the biggest value of the Unicode mapping tables was for
cross-reference with existing standards when Unicode was being established.

Furthermore, are the files in /Public/MAPPINGS/ supposed to be maintained
> at all as characters get added to subsequent releases of Unicode?


I am not aware of anyone working on them. If there is one that you think
would be valuable to add or update, you can propose specific data.

Viele Gr??e,
markus

PS: One of my favorite charts:
https://w3techs.com/technologies/history_overview/character_encoding
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161202/1f2c21dc/attachment.html>

From verdy_p at wanadoo.fr  Fri Dec  2 12:15:22 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 2 Dec 2016 19:15:22 +0100
Subject: Emoji mappings in Shift JIS / CP932/943
In-Reply-To: <CAN49p6qxLdDjg_19m+t3v6ri8kNoVjLOYLBHuK4cQink+e5PNA@mail.gmail.com>
References: <B2B71287-F47A-4CB6-8C0B-55F2AB61C01C@crissov.de>
 <CAN49p6qxLdDjg_19m+t3v6ri8kNoVjLOYLBHuK4cQink+e5PNA@mail.gmail.com>
Message-ID: <CAGa7JC1t1yKtKZ4HcgyA7cMtrF35X=VfQ9xP4YE8qfgYOCNC4A@mail.gmail.com>

2016-12-02 17:46 GMT+01:00 Markus Scherer <markus.icu at gmail.com>:

> On Fri, Dec 2, 2016 at 4:35 AM, Christoph P?per <
> christoph.paeper at crissov.de> wrote:
>
>> Could and should custom vendor extensions like the ones documented in
>>
>> - http://unicode.org/Public/UCD/latest/ucd/EmojiSources.txt
>>
>> be included in these mappings?
>>
>
> They could, but it would be best for vendors to publish their actual
> mappings rather than others guessing them.
>

Sometimes these vendors no longer exist, or have be bought by another
comany that has have stopped any earlier developments and supports for
their legacy systems.

Then it just remains a community of users that will have documents or data
to adapt to the newer standard: they'll try to "guess" some best fit
mappings so they can still use these data and encoded documents that remain
in their archives (and not always in an easily printable form such as
PDFs). However the most important documents to save are unlikely to contain
emojis, which are basically used in interactive talks between individual
users that have not archived them (and probably don't want these talks to
be archived for long, if we consider that most of these talks are private).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161202/99337809/attachment.html>

From christoph.paeper at crissov.de  Sat Dec  3 16:37:12 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Sat, 3 Dec 2016 23:37:12 +0100
Subject: Emoji mappings in Shift JIS / CP932/943
In-Reply-To: <CAN49p6qxLdDjg_19m+t3v6ri8kNoVjLOYLBHuK4cQink+e5PNA@mail.gmail.com>
References: <B2B71287-F47A-4CB6-8C0B-55F2AB61C01C@crissov.de>
 <CAN49p6qxLdDjg_19m+t3v6ri8kNoVjLOYLBHuK4cQink+e5PNA@mail.gmail.com>
Message-ID: <E5F0CC30-D6A1-42E4-AEE6-CA9A99B55997@crissov.de>

Markus Scherer <markus.icu at gmail.com>:
> 
> 
> On Fri, Dec 2, 2016 at 4:35 AM, Christoph P?per <christoph.paeper at crissov.de> wrote:
>> Could and should custom vendor extensions like the ones documented in [EmojiSources.txt] be included in these mappings?
> 
> They could, but it would be best for vendors to publish their actual mappings rather than others guessing them.

If an existing character encoding forms the (sole) base of an addition to Unicode, shouldn?t it be part of the UTC?s job to document these sources? This was obviously done in the case of Japanese emoji, hence the existence of EmojiSources.txt, but for some reason that?s been kept separate from related mapping data files. 

I?m not sure the documentation is equally well available for emojis (also) taken from ARIB, W*dings etc. (cf. https://twitter.com/FakeUnicode/status/801740535073361920) and I have never seen an authoritative mapping from ASCII emoticons and line-art or from kaomojis to Unicode emojis. (There are plenty implementations of conversion routines, some open-source or well documented, others not.)

> At this point, the Emoji vendor mappings are not very relevant any more because Unicode has added many Emoji symbols that are not in the old vendor charsets.

Sure, but hardly anybody will ever want to convert Unicode emojis to Shift JIS, just (still rarely) the other way around.

>> Furthermore, are the files in /Public/MAPPINGS/ supposed to be maintained at all as characters get added to subsequent releases of Unicode?
> 
> I am not aware of anyone working on them. If there is one that you think would be valuable to add or update, you can propose specific data.

For __ML at least, there seem to be more up-to-date mappings available at <https://www.w3.org/2003/entities/2007/htmlmathml.ent> or <https://html.spec.whatwg.org/multipage/entities.json>, but not in a CSV format as preferred at Unicode.

I haven?t gone through all of them, but I think most entries claiming a missing equivalent character in Unicode are outdated. Then there are some edge cases, e.g. Apple could easily have claimed that U+1F34E or U+1F34F maps to their company logo in their typefaces/charsets/encodings. (There?s no Window emoji, by the way, just a Door or a Frame with Picture and ?.)

> https://w3techs.com/technologies/history_overview/character_encoding

Sure, the conversion to UTF-8 on the Internet is finally happening, but there?ll always be someone who?s tasked with rescuing or investigating some obscure files from a floppy or mainframe.

From markus.icu at gmail.com  Sat Dec  3 17:21:25 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Sat, 3 Dec 2016 15:21:25 -0800
Subject: Emoji mappings in Shift JIS / CP932/943
In-Reply-To: <E5F0CC30-D6A1-42E4-AEE6-CA9A99B55997@crissov.de>
References: <B2B71287-F47A-4CB6-8C0B-55F2AB61C01C@crissov.de>
 <CAN49p6qxLdDjg_19m+t3v6ri8kNoVjLOYLBHuK4cQink+e5PNA@mail.gmail.com>
 <E5F0CC30-D6A1-42E4-AEE6-CA9A99B55997@crissov.de>
Message-ID: <CAN49p6ohj9r6HiEdR-Gx35f9UJ58+i0O+vGqdxpLRZ9_-mfAbw@mail.gmail.com>

On Sat, Dec 3, 2016 at 2:37 PM, Christoph P?per <christoph.paeper at crissov.de
> wrote:

> If an existing character encoding forms the (sole) base of an addition to
> Unicode, shouldn?t it be part of the UTC?s job to document these sources?
> This was obviously done in the case of Japanese emoji, hence the existence
> of EmojiSources.txt, but for some reason that?s been kept separate from
> related mapping data files.
>

For the Japanese carriers, we had information about their Shift-JIS VDC
assignments (Vendor-Defined Characters) but not about their non-VDC
Shift-JIS usage. We only documented the VDC assignments, in a form that
documented our decisions on unifying symbols across the three main carriers
(which was in turn based on their 2006 cross-mapping agreement) and
encoding of Unicode code points. But I think you are right, there probably
was not really a good reason to put EmojiSources.txt into the UCD rather
than into MAPPINGS.

You could submit a proposal to move EmojiSources.txt to the MAPPINGS.

I?m not sure the documentation is equally well available for emojis (also)
> taken from ARIB, W*dings etc. (cf. https://twitter.com/FakeUnicode/status/
> 801740535073361920)


The W*dings are not charsets but symbol fonts which were used with a
generic Unicode PUA range. (After standardization, they may have gained
mappings for the new assignments.)

I think the ARIB symbols were lists of characters that wanted to be encoded
in Unicode so that PUA and VDCs could be avoided, so there probably was no
charset to map to either.

In any case, there might well be examples of characters from other charsets
whose mappings are documented in the proposal docs rather than in MAPPINGS.
If you find examples of such, you could collect the data and propose their
additions to MAPPINGS.

Remember that the Unicode Consortium is run by volunteers. Yes, many of us
work for member companies, but we often do Unicode work in addition to our
"main jobs". (Some continue to contribute even after retirement!)

and I have never seen an authoritative mapping from ASCII emoticons and
> line-art or from kaomojis to Unicode emojis. (There are plenty
> implementations of conversion routines, some open-source or well
> documented, others not.)
>

I would say that, as far as Unicode is concerned, the canonical "mapping"
for those are the Unicode *sequences* that are used to represent them.

> At this point, the Emoji vendor mappings are not very relevant any more
> because Unicode has added many Emoji symbols that are not in the old vendor
> charsets.
>
> Sure, but hardly anybody will ever want to convert Unicode emojis to Shift
> JIS, just (still rarely) the other way around.
>

Good point. I assume most do something like what we (Google) do: Take a
base Shift-JIS mapping (we use windows-932 I think), remove the VDC-range
mappings that conflict with a vendor's emoji range, and add the vendor's
emoji mappings. You can see examples for this in Android's ICU source tree.

For __ML at least, there seem to be more up-to-date mappings available at <
> https://www.w3.org/2003/entities/2007/htmlmathml.ent> or <
> https://html.spec.whatwg.org/multipage/entities.json>, but not in a CSV
> format as preferred at Unicode.
>
> I haven?t gone through all of them, but I think most entries claiming a
> missing equivalent character in Unicode are outdated.


Maybe the user community is better served via w3.org and/or whatwg.org; if
so, we could add a link from the MAPPINGS files to there.

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161203/a42afacc/attachment.html>

From verdy_p at wanadoo.fr  Sat Dec  3 17:26:47 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 4 Dec 2016 00:26:47 +0100
Subject: Emoji mappings in Shift JIS / CP932/943
In-Reply-To: <E5F0CC30-D6A1-42E4-AEE6-CA9A99B55997@crissov.de>
References: <B2B71287-F47A-4CB6-8C0B-55F2AB61C01C@crissov.de>
 <CAN49p6qxLdDjg_19m+t3v6ri8kNoVjLOYLBHuK4cQink+e5PNA@mail.gmail.com>
 <E5F0CC30-D6A1-42E4-AEE6-CA9A99B55997@crissov.de>
Message-ID: <CAGa7JC04JE0tWej1ywkBMPhwhq3ooNGbxQnY=b67xbvP1pgz9g@mail.gmail.com>

2016-12-03 23:37 GMT+01:00 Christoph P?per <christoph.paeper at crissov.de>:

> Markus Scherer <markus.icu at gmail.com>:
> I haven?t gone through all of them, but I think most entries claiming a
> missing equivalent character in Unicode are outdated. Then there are some
> edge cases, e.g. Apple could easily have claimed that U+1F34E or U+1F34F
> maps to their company logo in their typefaces/charsets/encodings. (There?s
> no Window emoji, by the way, just a Door or a Frame with Picture and ?.)
>

There's also U+229E "plus in a square" ? (from mathematical operators)
which currently best approximates the Windows symbol ; it is used for
example is documents needing to show keytrokes used by an UI.

Notably on several wikis -- where it is also decorated (like all other
function keys or alphanumeric keys) by some CSS generated frames,
background colors with linear gradients and shadows to simulate the form of
a physical key.

For the key on Apple keyboards, the Apple logo is usually replaced by the
U+2318 "fleuron"-like symbol ? (from technical symbols), initially intended
for meaning "point of interest" (used now on Mac keyboards and used since
long in documents originating from Apple itself and in its UI), so that the
Apple logo is not needed for documenting or impelmenting an UI for MacOS.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161204/a6c84d01/attachment.html>

From reini at cpanel.net  Sun Dec  4 05:09:36 2016
From: reini at cpanel.net (Reini Urban)
Date: Sun, 4 Dec 2016 12:09:36 +0100
Subject: Mixed-Script confusables in prog.languages
Message-ID: <2BC2D526-7AAE-4D42-A915-D024F32EEC01@cpanel.net>

I?m working on adding Mixed-Script confusable protection to a programming language, 
cperl a perl5 fork, for security reasons, for its identifiers.

i.e. variable names, package names, function names, literals.

This is a bit different to the typical use cases of libidna, in email or browsers.

Is anybody aware of any other language implementation, which does confusable or mixed-script protection?
I think R has something, because it has this header: 
 https://cran.r-project.org/bin/windows/extsoft/3.4/include/unicode/uspoof.h
but I found nothing else, which is quite annoying.

My approach is as following:

* normalize identifiers (NFC) and only store normalized variants. this should catch bidi spoofs, combining characters and such.
* check each unicode code point for its Script property and besides Latin, Common and Inherited
only allow the first script, but error on any other mixed script. Additional scripts need to be declared.
https://github.com/perl11/cperl/issues/229

in perl like this:
    use utf8 ?Greek?, ?Cyrillic?;

utf8 is a pragma to allow unicode identifiers, not strings, to be added to the symbol table.
Obviously this has risks when reviewing a codebase, which might even bypass test suites.
This is fast enough, and has no measurable costs in the parser.

unicode has a nice security/confusable.txt table which could be used for more fine-grained checks, yes.
But I fear this is too much overhead for the generic parser, and I think that avoiding the 
problem by forbidding/need to declare mixed scripts is much easier, and more declarative.

Of course there exist several languages which require more than one script, like 
Japanese = Hiragana and Katakana and maybe Han,
Korean = Hangul + Han, ?
or african languages as some have other than Latin roots, e.g. Ethiopian from Semitic.
Indian languages also sound problematic, and all the Old_<script>

For these I just add aliases to allow multiple Scripts.

Reini Urban
rurban at cpanel.net


From verdy_p at wanadoo.fr  Sun Dec  4 13:07:22 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 4 Dec 2016 20:07:22 +0100
Subject: Mixed-Script confusables in prog.languages
In-Reply-To: <2BC2D526-7AAE-4D42-A915-D024F32EEC01@cpanel.net>
References: <2BC2D526-7AAE-4D42-A915-D024F32EEC01@cpanel.net>
Message-ID: <CAGa7JC1n7fhY69o1cUK32Eo4QUmYNt9zU65TU_t_sBjK886zfA@mail.gmail.com>

For Japanese, Korean and Chinese there are already assigned som "script"
codes in ISO 15924 you can use for mixed scripts (e.g. "Jpan"="Hani+Hrkt"
and "Hrkt"="Hira"+"Kana")
These are already standardized aliases you can use. For some languages this
can be more complex (e.g. some Berber languages may use
Latin+Tamazigh+Arabic, probably not in identifiers, but possibly in user
names if they are also used as identifiers)
There will stil remain confusables (such as between Latin, Greek and
Cyrillic variants of letter A) which are unavoidable in some names using
mixed scripts (notably in user names or some geographic feature names or
trademarks if they are used as identifiers for page names or similar on a
community website, forum, wiki, or similar).
Various websites and applications will need their own limitations on usable
names (and must know that any limitation may cause some orthographic
problems notably for user names).

In more technical programming languages however, you can usually be much
more restrictive as the identifiers used are generally abbreviated and
simplified: you can kill lettercase differences for example, as well as
bidi controls, and probably some joiner/disjoiner controls and other
invisible format controls (the identifiers will need to be distinguished,
if needed, using some other characters), and forcing a normalization to NFC
is certainly helpful. If you need to embed in these languages some user
names, they'll need to be "escaped" sometimes, or included in string
litterals rather than plain identifiers.


2016-12-04 12:09 GMT+01:00 Reini Urban <reini at cpanel.net>:

> Of course there exist several languages which require more than one
> script, like
> Japanese = Hiragana and Katakana and maybe Han,
> Korean = Hangul + Han, ?
> or african languages as some have other than Latin roots, e.g. Ethiopian
> from Semitic.
> Indian languages also sound problematic, and all the Old_<script>
>
> For these I just add aliases to allow multiple Scripts.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161204/82d12d5f/attachment.html>

From markus.icu at gmail.com  Sun Dec  4 13:16:47 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Sun, 4 Dec 2016 11:16:47 -0800
Subject: Mixed-Script confusables in prog.languages
In-Reply-To: <2BC2D526-7AAE-4D42-A915-D024F32EEC01@cpanel.net>
References: <2BC2D526-7AAE-4D42-A915-D024F32EEC01@cpanel.net>
Message-ID: <CAN49p6o1SGGfNT7WAVuSTV8DPQV0nMswDHx+3WRuPBDsGcE=QA@mail.gmail.com>

On Sun, Dec 4, 2016 at 3:09 AM, Reini Urban <reini at cpanel.net> wrote:

> Is anybody aware of any other language implementation, which does
> confusable or mixed-script protection?
> I think R has something, because it has this header:
>  https://cran.r-project.org/bin/windows/extsoft/3.4/
> include/unicode/uspoof.h
> but I found nothing else, which is quite annoying.
>

That's the ICU spoof detection API.
http://site.icu-project.org/download

Can you call that?

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161204/f7d4ffc8/attachment.html>

From richard.wordingham at ntlworld.com  Sun Dec  4 16:45:58 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 4 Dec 2016 22:45:58 +0000
Subject: Mixed-Script confusables in prog.languages
In-Reply-To: <2BC2D526-7AAE-4D42-A915-D024F32EEC01@cpanel.net>
References: <2BC2D526-7AAE-4D42-A915-D024F32EEC01@cpanel.net>
Message-ID: <20161204224558.579e9201@JRWUBU2>

On Sun, 4 Dec 2016 12:09:36 +0100
Reini Urban <reini at cpanel.net> wrote:

> * normalize identifiers (NFC) and only store normalized variants.
> this should catch bidi spoofs, combining characters and such.

That doesn't catch bidi spoofs.

> * check each unicode code point for its Script property and besides
> Latin, Common and Inherited only allow the first script, but error on
> any other mixed script. Additional scripts need to be declared.
> https://github.com/perl11/cperl/issues/229
> 
> in perl like this:
>     use utf8 ?Greek?, ?Cyrillic?;

Your rule isn't clear.  Would an identifier like ?_S be automatically
allowed?

I presume you're handling the spoofing of the SMALL PHI characters by
other means.

For multilingual support, you would want rules more like

'After script X, allow script Y'.

> Of course there exist several languages which require more than one
> script, 
<snip>
> or african languages as some have other than Latin roots, e.g.
> Ethiopian from Semitic.

I don't see your problem here.  What problem do you see with Amharic?

> Indian languages also sound problematic,

Is this the ZWJ/ZWNJ issue?  That surely is a problem within a script.

> and
> all the Old_<script>

Now I am confused.  What problem do you see that you don't have in the
Latin script?

Richard.


From doug at ewellic.org  Sun Dec  4 18:44:11 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 4 Dec 2016 17:44:11 -0700
Subject: Mapping files (was: Re: Emoji mappings in Shift JIS / CP932/943)
In-Reply-To: <mailman.1.1480874401.31608.unicode@unicode.org>
References: <mailman.1.1480874401.31608.unicode@unicode.org>
Message-ID: <6F9999955BC24FD2A226164E902C238B@DougEwell>

Christoph P?per wrote:

>>> Could and should custom vendor extensions like the ones documented
>>> in [EmojiSources.txt] be included in these mappings?
>>
>> They could, but it would be best for vendors to publish their actual
>> mappings rather than others guessing them.
>
> If an existing character encoding forms the (sole) base of an addition
> to Unicode, shouldn't it be part of the UTC's job to document these
> sources? This was obviously done in the case of Japanese emoji, hence
> the existence of EmojiSources.txt, but for some reason that's been
> kept separate from related mapping data files.

I can confirm that the UTC is not interested in mappings for W*dings 
contributed by someone other than the vendor, even if they were taken 
directly from the final proposal to encode the remaining unencoded 
symbols in those sets.

--
Doug Ewell | Thornton, CO, US | ewellic.org


From reini at cpanel.net  Mon Dec  5 02:31:11 2016
From: reini at cpanel.net (Reini Urban)
Date: Mon, 5 Dec 2016 09:31:11 +0100
Subject: Mixed-Script confusables in prog.languages
In-Reply-To: <20161204224558.579e9201@JRWUBU2>
References: <2BC2D526-7AAE-4D42-A915-D024F32EEC01@cpanel.net>
 <20161204224558.579e9201@JRWUBU2>
Message-ID: <285C6A38-8217-4DB5-B309-FEE59A8999E7@cpanel.net>


> On Dec 4, 2016, at 11:45 PM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
> 
> On Sun, 4 Dec 2016 12:09:36 +0100
> Reini Urban <reini at cpanel.net> wrote:
> 
>> * normalize identifiers (NFC) and only store normalized variants.
>> this should catch bidi spoofs, combining characters and such.
> 
> That doesn't catch bidi spoofs.

Right. Bidi spoofs are already caught by the IDStart, IDContinue rule.

i.e. ?goog?le <U+202E (right-to-left override), g, o, o, g, U+202C (pop directional formatting), l, e>
is already caught as illegal.

Mixing RTL scripts, such as Arabic with Latin is not caught with the mixed-script rule per se.

>> * check each unicode code point for its Script property and besides
>> Latin, Common and Inherited only allow the first script, but error on
>> any other mixed script. Additional scripts need to be declared.
>> https://github.com/perl11/cperl/issues/229
>> 
>> in perl like this:
>>    use utf8 ?Greek?, ?Cyrillic?;
> 
> Your rule isn't clear.  Would an identifier like ?_S be automatically
> allowed?

?_S contains Greek U+03C8, Common and Latin. Since Latin and Common are always allowed, the only
new script is Greek. The first non-default script is automatically and silently allowed, only a mix with another
non-default script, such as Cyrillic would error or need an explicit declaration.

So ?_S alone is fine, if everything else is Greek.
But mixing with the Cyrillic version would lead to an error.

> I presume you're handling the spoofing of the SMALL PHI characters by
> other means.

The spoof attempt would be ?_S with Cyrillic U+0471, Common, Latin.
2 mixed scripts which are illegal, if undeclared.
Same with PHI, which exists as Greek or Cyrillic. Most of Greek characters have confusable 
Cyrillic counterparts, that?s why a declaration of use utf8 ?Greek?, ?Cyrillic?;
i.e. mixing those two sounds highly dangerous. 
With the UCD confusable table this would be an error. In my rule not, since the user
declared those two scripts to be mixed.

> For multilingual support, you would want rules more like
> 
> 'After script X, allow script Y?.

Can you expand on that with an example? I?m no expert on this.

Like after Hangul, allow Han? After Hiragana, allow Katakana?

>> Of course there exist several languages which require more than one
>> script, 
> <snip>
>> or african languages as some have other than Latin roots, e.g.
>> Ethiopian from Semitic.
> 
> I don't see your problem here.  What problem do you see with Amharic?

Amharic is not defined as UCD script property. It?s alphabet is called Ge?ez, which we call
Ethiopic in the UCD. But that?s all I know. I?m not a domain expert. Does Ethiopic uses
other Semitic scripts in its alphabet or is it complete? I learned some CFK languages, 
where you historically allow mixed scripts. But for other scripts I?m clueless.
The examples I got mix it with Runic. Valid or nonsense?

The problem is to decide which scripts are commonly mixed in which languages to allow
them to be valid identifiers.

How about the many Indian scripts? Do they mix?
Being an indian movie expert tells me that indian languages usually don?t mix. 
They make Tamil and Bengali versions of Hindi movies, and usually fall back to english to
get common points across the barrier. But their scripts? No idea.

> 
>> Indian languages also sound problematic,
> 
> Is this the ZWJ/ZWNJ issue?  That surely is a problem within a script.
> 
>> and
>> all the Old_<script>
> 
> Now I am confused.  What problem do you see that you don't have in the
> Latin script?

That I have no idea if those Old_<script> alphabets are still in use to create 
aliases for them.
In the examples in perl which partially came from parrot there?s a wild eclectic mix of various scripts
which do make no sense at all. So I don?t know if I can trust those tests, that they make sense and 
are readable at all. My guess is that the authors just liked code golfing and picked random unicode
characters. It?s from perl after all.

Such as this perl test t/mro/isa_c3_utf8.t

use utf8 qw( Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam Hiragana );

...
package ?o?;
package ur???;
@ur???::ISA = 'k?o??';
package ?;
@ur???::ISA = ('k?o??', '?o?');
package ??ck?;
...

These identifiers are unreadable, because I don?t assume that anybody will be able to understand
Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam and Hiragana at once.
I understand a bit Hangul, Cyrillic and Hiragana, but the mix sounds highly illegal to me.

So my rule makes sense. You need to declare non-default scripts used in your identifiers if mixed.
(not strings. these can be everything, even illegal UTF-8).


From duerst at it.aoyama.ac.jp  Mon Dec  5 05:29:01 2016
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=)
Date: Mon, 5 Dec 2016 20:29:01 +0900
Subject: Mixed-Script confusables in prog.languages
In-Reply-To: <285C6A38-8217-4DB5-B309-FEE59A8999E7@cpanel.net>
References: <2BC2D526-7AAE-4D42-A915-D024F32EEC01@cpanel.net>
 <20161204224558.579e9201@JRWUBU2>
 <285C6A38-8217-4DB5-B309-FEE59A8999E7@cpanel.net>
Message-ID: <8a11951a-f441-9b6a-1cf7-9c1d7657a655@it.aoyama.ac.jp>


On 2016/12/05 17:31, Reini Urban wrote:

> ?_S contains Greek U+03C8, Common and Latin. Since Latin and Common are always allowed, the only
> new script is Greek. The first non-default script is automatically and silently allowed, only a mix with another
> non-default script, such as Cyrillic would error or need an explicit declaration.
>
> So ?_S alone is fine, if everything else is Greek.
> But mixing with the Cyrillic version would lead to an error.

Allowing mixing of Greek and Latin (or Cyrillic and Latin) would be a 
big problem. As an example, it would allow A_? (the second letter is a 
Greek one).

> Amharic is not defined as UCD script property. It?s alphabet is called Ge?ez, which we call
> Ethiopic in the UCD. But that?s all I know. I?m not a domain expert. Does Ethiopic uses
> other Semitic scripts in its alphabet or is it complete?

It's complete. I have never heard that it would need Arabic or Hebrew or 
some such.


> How about the many Indian scripts? Do they mix?
> Being an indian movie expert tells me that indian languages usually don?t mix.
> They make Tamil and Bengali versions of Hindi movies, and usually fall back to english to
> get common points across the barrier. But their scripts? No idea.

I don't think they mix two different scripts in the same word. Would be 
very confusing.


> In the examples in perl which partially came from parrot there?s a wild eclectic mix of various scripts
> which do make no sense at all. So I don?t know if I can trust those tests, that they make sense and
> are readable at all. My guess is that the authors just liked code golfing and picked random unicode
> characters. It?s from perl after all.
>
> Such as this perl test t/mro/isa_c3_utf8.t
>
> use utf8 qw( Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam Hiragana );
>
> ...
> package ?o?;
> package ur???;
> @ur???::ISA = 'k?o??';
> package ?;
> @ur???::ISA = ('k?o??', '?o?');
> package ??ck?;
> ...
>
> These identifiers are unreadable, because I don?t assume that anybody will be able to understand
> Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam and Hiragana at once.
> I understand a bit Hangul, Cyrillic and Hiragana, but the mix sounds highly illegal to me.

The mixes aren't illegal, in that they are not against any law. But they 
are complete intellegible garbage anyway.

Regards,   Martin.

From duerst at it.aoyama.ac.jp  Mon Dec  5 05:37:22 2016
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=)
Date: Mon, 5 Dec 2016 20:37:22 +0900
Subject: Mixed-Script confusables in prog.languages
In-Reply-To: <CAGa7JC1n7fhY69o1cUK32Eo4QUmYNt9zU65TU_t_sBjK886zfA@mail.gmail.com>
References: <2BC2D526-7AAE-4D42-A915-D024F32EEC01@cpanel.net>
 <CAGa7JC1n7fhY69o1cUK32Eo4QUmYNt9zU65TU_t_sBjK886zfA@mail.gmail.com>
Message-ID: <9f801fa7-55a3-ab0c-17df-5dbee2697fc8@it.aoyama.ac.jp>

On 2016/12/05 04:07, Philippe Verdy wrote:

> In more technical programming languages however, you can usually be much
> more restrictive as the identifiers used are generally abbreviated and
> simplified: you can kill lettercase differences for example,

In some languages maybe. But languages such as perl, C, Java, Ruby, 
Python, and so on distinguish case. Ruby starts constants (incl. class 
names) with Upper case, and variable names with lower case, so it needs 
this distinction more than e.g. C, where such distinctions may be used 
as conventions, but are not enforced by the language.

Anyway, my guess is that non-latin variable names will mostly be used in 
education and otherwise locally restricted circumstances (e.g. 
government projects), so I think that makes the chances of spoofing 
(other than self-spoofing) pretty low.

Regards,   Martin.

From c933103 at gmail.com  Mon Dec  5 06:51:56 2016
From: c933103 at gmail.com (gfb hjjhjh)
Date: Mon, 5 Dec 2016 20:51:56 +0800
Subject: Mixed-Script confusables in prog.languages
In-Reply-To: <285C6A38-8217-4DB5-B309-FEE59A8999E7@cpanel.net>
References: <2BC2D526-7AAE-4D42-A915-D024F32EEC01@cpanel.net>
 <20161204224558.579e9201@JRWUBU2>
 <285C6A38-8217-4DB5-B309-FEE59A8999E7@cpanel.net>
Message-ID: <CAGHjPPKFqgA4PapzLRY70Czgnd4MZsCZPhdPQuL7+9UNqfxYcw@mail.gmail.com>

How about package names like ?????21(Note the ?? are Cyrillic), or ?r????,
or ??_??????_?'sic_4?ever? Although they aren't really names that people
would usually use in package/var names, they are meaningful names...

2016?12?5? 16:39 ? "Reini Urban" <reini at cpanel.net> ???

>
> > On Dec 4, 2016, at 11:45 PM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
> >
> > On Sun, 4 Dec 2016 12:09:36 +0100
> > Reini Urban <reini at cpanel.net> wrote:
> >
> >> * normalize identifiers (NFC) and only store normalized variants.
> >> this should catch bidi spoofs, combining characters and such.
> >
> > That doesn't catch bidi spoofs.
>
> Right. Bidi spoofs are already caught by the IDStart, IDContinue rule.
>
> i.e. ?goog?le <U+202E (right-to-left override), g, o, o, g, U+202C (pop
> directional formatting), l, e>
> is already caught as illegal.
>
> Mixing RTL scripts, such as Arabic with Latin is not caught with the
> mixed-script rule per se.
>
> >> * check each unicode code point for its Script property and besides
> >> Latin, Common and Inherited only allow the first script, but error on
> >> any other mixed script. Additional scripts need to be declared.
> >> https://github.com/perl11/cperl/issues/229
> >>
> >> in perl like this:
> >>    use utf8 ?Greek?, ?Cyrillic?;
> >
> > Your rule isn't clear.  Would an identifier like ?_S be automatically
> > allowed?
>
> ?_S contains Greek U+03C8, Common and Latin. Since Latin and Common are
> always allowed, the only
> new script is Greek. The first non-default script is automatically and
> silently allowed, only a mix with another
> non-default script, such as Cyrillic would error or need an explicit
> declaration.
>
> So ?_S alone is fine, if everything else is Greek.
> But mixing with the Cyrillic version would lead to an error.
>
> > I presume you're handling the spoofing of the SMALL PHI characters by
> > other means.
>
> The spoof attempt would be ?_S with Cyrillic U+0471, Common, Latin.
> 2 mixed scripts which are illegal, if undeclared.
> Same with PHI, which exists as Greek or Cyrillic. Most of Greek characters
> have confusable
> Cyrillic counterparts, that?s why a declaration of use utf8 ?Greek?,
> ?Cyrillic?;
> i.e. mixing those two sounds highly dangerous.
> With the UCD confusable table this would be an error. In my rule not,
> since the user
> declared those two scripts to be mixed.
>
> > For multilingual support, you would want rules more like
> >
> > 'After script X, allow script Y?.
>
> Can you expand on that with an example? I?m no expert on this.
>
> Like after Hangul, allow Han? After Hiragana, allow Katakana?
>
> >> Of course there exist several languages which require more than one
> >> script,
> > <snip>
> >> or african languages as some have other than Latin roots, e.g.
> >> Ethiopian from Semitic.
> >
> > I don't see your problem here.  What problem do you see with Amharic?
>
> Amharic is not defined as UCD script property. It?s alphabet is called
> Ge?ez, which we call
> Ethiopic in the UCD. But that?s all I know. I?m not a domain expert. Does
> Ethiopic uses
> other Semitic scripts in its alphabet or is it complete? I learned some
> CFK languages,
> where you historically allow mixed scripts. But for other scripts I?m
> clueless.
> The examples I got mix it with Runic. Valid or nonsense?
>
> The problem is to decide which scripts are commonly mixed in which
> languages to allow
> them to be valid identifiers.
>
> How about the many Indian scripts? Do they mix?
> Being an indian movie expert tells me that indian languages usually don?t
> mix.
> They make Tamil and Bengali versions of Hindi movies, and usually fall
> back to english to
> get common points across the barrier. But their scripts? No idea.
>
> >
> >> Indian languages also sound problematic,
> >
> > Is this the ZWJ/ZWNJ issue?  That surely is a problem within a script.
> >
> >> and
> >> all the Old_<script>
> >
> > Now I am confused.  What problem do you see that you don't have in the
> > Latin script?
>
> That I have no idea if those Old_<script> alphabets are still in use to
> create
> aliases for them.
> In the examples in perl which partially came from parrot there?s a wild
> eclectic mix of various scripts
> which do make no sense at all. So I don?t know if I can trust those tests,
> that they make sense and
> are readable at all. My guess is that the authors just liked code golfing
> and picked random unicode
> characters. It?s from perl after all.
>
> Such as this perl test t/mro/isa_c3_utf8.t
>
> use utf8 qw( Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam
> Hiragana );
>
> ...
> package ?o?;
> package ur???;
> @ur???::ISA = 'k?o??';
> package ?;
> @ur???::ISA = ('k?o??', '?o?');
> package ??ck?;
> ...
>
> These identifiers are unreadable, because I don?t assume that anybody will
> be able to understand
> Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam and Hiragana at
> once.
> I understand a bit Hangul, Cyrillic and Hiragana, but the mix sounds
> highly illegal to me.
>
> So my rule makes sense. You need to declare non-default scripts used in
> your identifiers if mixed.
> (not strings. these can be everything, even illegal UTF-8).
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161205/5c8f7793/attachment.html>

From richard.wordingham at ntlworld.com  Mon Dec  5 08:31:38 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 5 Dec 2016 14:31:38 +0000
Subject: Mixed-Script confusables in prog.languages
In-Reply-To: <285C6A38-8217-4DB5-B309-FEE59A8999E7@cpanel.net>
References: <2BC2D526-7AAE-4D42-A915-D024F32EEC01@cpanel.net>
 <20161204224558.579e9201@JRWUBU2>
 <285C6A38-8217-4DB5-B309-FEE59A8999E7@cpanel.net>
Message-ID: <20161205143138.57b5cd31@JRWUBU2>

On Mon, 5 Dec 2016 09:31:11 +0100
Reini Urban <reini at cpanel.net> wrote:

> > On Dec 4, 2016, at 11:45 PM, Richard Wordingham
> > <richard.wordingham at ntlworld.com> wrote:
> > 
> > On Sun, 4 Dec 2016 12:09:36 +0100
> > Reini Urban <reini at cpanel.net> wrote:
> >   
> >> * normalize identifiers (NFC) and only store normalized variants.
> >> this should catch bidi spoofs, combining characters and such.  
> > 
> > That doesn't catch bidi spoofs.  
> 
> Right. Bidi spoofs are already caught by the IDStart, IDContinue rule.
> 
> i.e. ?goog?le <U+202E (right-to-left override), g, o, o, g, U+202C
> (pop directional formatting), l, e> is already caught as illegal.
> 
> Mixing RTL scripts, such as Arabic with Latin is not caught with the
> mixed-script rule per se.
> 
> >> * check each unicode code point for its Script property and besides
> >> Latin, Common and Inherited only allow the first script, but error
> >> on any other mixed script. Additional scripts need to be declared.
> >> https://github.com/perl11/cperl/issues/229
> >> 
> >> in perl like this:
> >>    use utf8 ?Greek?, ?Cyrillic?;  
> > 
> > Your rule isn't clear.  Would an identifier like ?_S be
> > automatically allowed?  
> 
> ?_S contains Greek U+03C8, Common and Latin. Since Latin and Common
> are always allowed, the only new script is Greek. The first
> non-default script is automatically and silently allowed, only a mix
> with another non-default script, such as Cyrillic would error or need
> an explicit declaration.
> 
> So ?_S alone is fine, if everything else is Greek.
> But mixing with the Cyrillic version would lead to an error.
> 
> > I presume you're handling the spoofing of the SMALL PHI characters
> > by other means.  
> 
> The spoof attempt would be ?_S with Cyrillic U+0471, Common, Latin.
> 2 mixed scripts which are illegal, if undeclared.
> Same with PHI, which exists as Greek or Cyrillic. Most of Greek
> characters have confusable Cyrillic counterparts, that?s why a
> declaration of use utf8 ?Greek?, ?Cyrillic?; i.e. mixing those two
> sounds highly dangerous. With the UCD confusable table this would be
> an error. In my rule not, since the user declared those two scripts
> to be mixed.

The choice with PHI includes:

U+0278 LATIN SMALL LETTER PHI
U+03C6 GREEK SMALL LETTER PHI

a Greek (!) script character with compatibiity decomposition to U+03C6

U+03D5 GREEK PHI SYMBOL

and a whole host of common script characters with compatibility
decomposition to U+03C6:

U+1D6D7 MATHEMATICAL BOLD SMALL PHI
U+1D6DF MATHEMATICAL BOLD PHI SYMBOL
U+1D711 MATHEMATICAL ITALIC SMALL PHI
U+1D719 MATHEMATICAL ITALIC PHI SYMBOL
U+1D74B MATHEMATICAL BOLD ITALIC SMALL PHI
U+1D753 MATHEMATICAL BOLD ITALIC PHI SYMBOL
U+1D785 MATHEMATICAL SANS-SERIF BOLD SMALL PHI
U+1D78D MATHEMATICAL SANS-SERIF BOLD PHI SYMBOL
U+1D7BF MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL PHI
U+1D7C7 MATHEMATICAL SANS-SERIF BOLD ITALIC PHI SYMBOL

They are all ID_Start.

You didn't mention the inherited script.  Is that automatically
allowed, e.g. ??? <U+03C6, U+0308 COMBINING DIAERESIS, U+1D63 LATIN
SUBSCRIPT SMALL LETTER R> (scripts: Greek, inherited, Latin)?  I
encountered that variable name in a radar specification last week.

There might be issues - it's possible that ?? <U+0915 DEVANAGARI LETTER
KA, U+0310 COMBINING CANDRABINDU> might spoof ?? <U+0915, U+0901
DEVANAGARI SIGN CANDRABINDU>.

> > For multilingual support, you would want rules more like
> > 
> > 'After script X, allow script Y?.  
> 
> Can you expand on that with an example? I?m no expert on this.
> 
> Like after Hangul, allow Han? After Hiragana, allow Katakana?

It allows one to mix Japanese and Korean variables without being able
to kana and Hangul.

Some of the Semitic abjads are sometimes used with vowel symbols
normally assoicated with a different Semitic script.  One could use
such a construct to limit the mixing.  However, for such cases a rule
such as 'allow script Y marks on script X bases' would be much better.

> > I don't see your problem here.  What problem do you see with
> > Amharic?  

> Amharic is not defined as UCD script property. It?s alphabet is
> called Ge?ez, which we call Ethiopic in the UCD. But that?s all I
> know. I?m not a domain expert. Does Ethiopic uses other Semitic
> scripts in its alphabet or is it complete? I learned some CFK
> languages, where you historically allow mixed scripts. But for other
> scripts I?m clueless. The examples I got mix it with Runic. Valid or
> nonsense?

I would say nonsense - or graphic design.  The use of Chinese
ideographs alongside sinoform scripts is the primary example.
However, 'symbols' as opposed to letters may leak from one script to
another, and that may be an issue for variable names.  For example,
English can use Arabic numerals, Roman numerals or Roman letters for
numbering in lists, and I've known people to resort to Greek letters.
Accent marks can also move, though these are usually encoded
separately.  I've already used the example of candrabindu being
borrowed from the Devanagari script to the Latin script - it was
borrowed for use in Sanskrit.

> How about the many Indian scripts? Do they mix?

Microsoft mostly won't let long-supported *Indian* scripts mix within
syllables.

I would say they mixed in much the same way as the Latin and Cyrillic
scripts mix.  In many ways they act as font variants of one another, so
features and rare letters may move between them.  This is most
noticeable where large chunks of the Brahmi character set are missing,
such as Tamil and Lao.  For Tamil, the gaps may be filled by 'Grantha'
letters.  For Lao, subscript consonants bear a very strong resemblance
to the Tai Tham subscript forms.  On the other hand, the unencoded
characters added to the Lao script to support Pali have been
well harmonised to the Lao script, and using characters from other
scripts for them would definitely be wrong.  (There's mostly a
consensus as to what the right bogus coding for them within the Lao
block is.  Unfortunately, I don't have good enough evidence for an
encoding proposal.)

> That I have no idea if those Old_<script> alphabets are still in use
> to create aliases for them.

They'll still be in use.  We had a guy at work (computer department)
who kept notes on his whiteboard in runes.  Someone analysing cuneiform
texts might very well want to create variable names that are a mix of
Latin for function (as 'n_' = "number of") and cuneiform for the form
being counted or whatever.

> Such as this perl test t/mro/isa_c3_utf8.t
> 
> use utf8 qw( Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam
> Hiragana );
> 
> ...
> package ?o?;
> package ur???;
> @ur???::ISA = 'k?o??';
> package ?;
> @ur???::ISA = ('k?o??', '?o?');
> package ??ck?;
> ...
> 
> These identifiers are unreadable, because I don?t assume that anybody
> will be able to understand Hangul Cyrillic Ethiopic
> Canadian_Aboriginal Malayalam and Hiragana at once. I understand a
> bit Hangul, Cyrillic and Hiragana, but the mix sounds highly illegal
> to me.

There's no law against it!  More to the point, it was just a test.

However, allowing Cyrillic or Greek immediately makes every apparent
'o' (or 'A') a potential spoof.  Remember, "Letter 'O' Considered
Harmful". 

Richard.


From alastair at alastairs-place.net  Tue Dec  6 05:51:55 2016
From: alastair at alastairs-place.net (Alastair Houghton)
Date: Tue, 6 Dec 2016 11:51:55 +0000
Subject: IdnaTest.txt and RFC 5893
Message-ID: <C5D0A6E1-FC91-4443-A90A-63B7382BA410@alastairs-place.net>

Hi all,

I must be missing something; in IdnaTest.txt, in the BIDI TESTS section, there are examples like (line 74)

  B;	0?.\u05D0;	;	xn--0-sfa.xn--4db	#	0?.?

which the file alleges are valid, but I cannot for the life of me see why.  First, ?0?.?? is clearly a ?Bidi domain name? since it has at least one RTL label, ???.  As such, the Bidi Rule (RFC 5893 section 2) should be applied to its labels, and the label ?0?? fails [B1], since the first character has Bidi property EN, not L, R or AL.

Similarly (line 93)

  B;	??.\u05D0;	;	xn--0ca88g.xn--4db	#	??.?

Again, ???.?? is clearly a ?Bidi domain name?, but ???? fails [B6], because ??? has Bidi property ON, not L, EN or NSM.

Have I misunderstood something fundamental here?  Could someone explain why those examples are valid, in spite of RFC 5893?

Kind regards,

Alastair.

--
http://alastairs-place.net


From fabiang at radgametools.com  Thu Dec  8 20:41:35 2016
From: fabiang at radgametools.com (Fabian Giesen)
Date: Thu, 8 Dec 2016 18:41:35 -0800
Subject: UAX #9 (Bidirectional algorithm) reference implementations
Message-ID: <8f858db3-1f9e-e969-5dad-6aa26f0a577e@radgametools.com>

I'm currently implementing the bidirectional algorithm and, while 
testing my version, ran into some issues with the provided reference 
implementations. (http://www.unicode.org/Public/PROGRAMS/)

1. BidiReferenceJava supports Unicode 6.3.0, but has not been updated 
for later versions.

In particular, the changes from revision 33 of UAX#9 (corresponding to 
Unicode 8.0.0; most notably, limitation of maximum depth for nested 
brackets in the PBA, and the rules for handling NSMs following brackets 
in rule N0) are missing.

Now the README of BidiReferenceJava mention that it implements Unicode 
6.3.0 (and hasn't been updated since), but this should probably be made 
more explicit. Maybe move the current implementation to a "6.3.0" 
subdirectory? (Similar to BidiReferenceC)

---

2. I am reasonably certain I found a bug in BidiReferenceC (version 9.0.0).

Consider these two test cases: (in the same format as 
BidiCharacterTest.txt):

0061 0028 0062 0029 0300 05D0;1;1;2 2 2 2 2 1;5 0 1 2 3 4
0061 0028 0062 0029 001B 0300 05D0;1;1;2 2 2 2 x 2 1;6 0 1 2 3 5

This concerns runs of NSMs following a paired bracket, and how they 
interact with BNs (or, in the right circumstances, other types removed 
by Rule X9).

The first is "a(b)<NSM>A" (A denoting a R-class character) in a RTL 
embedding. This test, when run through BidiReferenceC, produces the 
expected result.

The key steps are as follows:
1. Classification before the weak types phase is
    L ON L ON NSM R
2. Weak types phase produces
    L ON L ON ON R
3. Rule N0 resolves bracket pair (2,4) to L; the original NSM
    following the closing bracket gets set to L (as per the last
    clause of rule N0) as well.
    L L L L L R
4. Level assignment produces the given expected result

The second test simply adds an ASCII escape character (class BN) before 
the NSM. Here, BidiReferenceC produces this result:

   Text:        0061 0028 0062 0029 001B 0300 05D0
   Bidi_Class:     L    L    L    L   BN    R    R
   Levels:         2    2    2    2    x    1    1
   Exp Levels:     2    2    2    2    x    2    1
   Mismatches:                              ^
   Runs:        <R------------------------------R>

   Order:      [6 5 0 1 2 3]
   Exp Order:  [6 0 1 2 3 5]

which I believe to be incorrect. The only difference to the previous run 
is the presence of the BN-type character before the NSM (which should 
not matter, since it's supposed to be removed by Rule X9 before we ever 
enter the weak types phase).

The problem appears to be around brrule.c:4376, in the function 
"br_SetBracketPairBC". The code is written to detect a run of NSMs 
following the brackets, but does not skip over deleted characters (which 
are denoted by having "level == NOLEVEL").

Can anyone confirm whether my interpretation of the rules is correct and 
this is an actual bug in BidiReferenceC?

Thanks,

-Fabian

From eliz at gnu.org  Fri Dec  9 02:32:32 2016
From: eliz at gnu.org (Eli Zaretskii)
Date: Fri, 09 Dec 2016 10:32:32 +0200
Subject: UAX #9 (Bidirectional algorithm) reference implementations
In-Reply-To: <8f858db3-1f9e-e969-5dad-6aa26f0a577e@radgametools.com> (message
 from Fabian Giesen on Thu, 8 Dec 2016 18:41:35 -0800)
References: <8f858db3-1f9e-e969-5dad-6aa26f0a577e@radgametools.com>
Message-ID: <83twadh59b.fsf@gnu.org>

> From: Fabian Giesen <fabiang at radgametools.com>
> Date: Thu, 8 Dec 2016 18:41:35 -0800
> 
> Can anyone confirm whether my interpretation of the rules is correct and 
> this is an actual bug in BidiReferenceC?

Emacs 25 (which should be up to date with Unicode 9.0) produces the
result that you expect, FWIW, except that it doesn't remove the ESC
from display (per design).

From kenwhistler at att.net  Fri Dec  9 09:04:42 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Fri, 9 Dec 2016 07:04:42 -0800
Subject: UAX #9 (Bidirectional algorithm) reference implementations
In-Reply-To: <8f858db3-1f9e-e969-5dad-6aa26f0a577e@radgametools.com>
References: <8f858db3-1f9e-e969-5dad-6aa26f0a577e@radgametools.com>
Message-ID: <2b433250-fba1-7d7f-1785-694ef25f96bf@att.net>


On 12/8/2016 6:41 PM, Fabian Giesen wrote:
> 1. BidiReferenceJava supports Unicode 6.3.0, but has not been updated 
> for later versions.

We have an updated version of BidiReferenceJava about ready to deploy 
into the PROGRAMS directory.

About the bug you note in BidiReferenceC, I'll investigate.

--Ken


From public at khwilliamson.com  Mon Dec 12 08:59:04 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Mon, 12 Dec 2016 07:59:04 -0700
Subject: Should unassigned code points in blocks reserved for combining marks, 
 etc be GCB extended?
Message-ID: <c9a84dc0-7dd9-b5bc-1bcb-dd6db0b48b73@khwilliamson.com>

These are currently GCB Other, but when assigned, don't we know that 
they will be Extended?  So this could be done now.

From kenwhistler at att.net  Mon Dec 12 11:30:31 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Mon, 12 Dec 2016 09:30:31 -0800
Subject: Fwd: Re: Should unassigned code points in blocks reserved for
 combining marks, etc be GCB extended?
In-Reply-To: <5800337b-bea4-26ac-e13a-5908180ec4c0@att.net>
References: <5800337b-bea4-26ac-e13a-5908180ec4c0@att.net>
Message-ID: <f9e5433e-41dd-afc4-25ee-3c2f7484a49c@att.net>


-------- Forwarded Message --------
Subject: 	Re: Should unassigned code points in blocks reserved for 
combining marks, etc be GCB extended?
Date: 	Mon, 12 Dec 2016 08:26:45 -0800
From: 	Ken Whistler <kenwhistler at att.net>
To: 	Karl Williamson <public at khwilliamson.com>


On 12/12/2016 6:59 AM, Karl Williamson wrote:
> These are currently GCB Other, but when assigned, don't we know that
> they will be Extended?  So this could be done now.
>

Short answer: No.

Long answer:

Every proposal to pre-assign some range of unassigned code points a
non-default character property value for that range has a bunch of
hidden costs. This proposal would be particularly costly, because it
would be smack in the middle of some of the properties with the hairiest
dependency chains.

GCB=Extend is dependent on Grapheme_Extend=Yes. That also means that any
particular change for GCB=Extend would also get reflected into WB=Extend
and SB=Extend, which are also dependent on Grapheme_Extend=Yes.

Grapheme_Extend=Yes is itself a mixed bag. It is derived from gc=Mn or
gc=Me, which would seem to be a natural match for the blocks "reserved"
for combining marks, but it is actually also dependent on
Other_Grapheme_Extend, which is a mixed bag of various spacing combining
marks for normalization closure, plus ZWNJ, plus tag characters, plus
spacing halfwidth dakuten.

So that would raise complicated questions about *how* GCB=Extend would
itself be extended to include certain set ranges of unassigned code
points. Would those simply be assigned directly to Grapheme_Extend=Yes
(which would create a complicated default assignment for that derived
property, and complicate both its documentation *and* its derivation)?
Or would they be assigned directly to Other_Grapheme_Extend (which would
create a new animal in the zoo of properties -- a contributory property
which itself has ranges of unassigned code points given non-default
values). And once decided, what would be the implications for all the
documentation and the tooling?

Any proposal like this then also has hidden costs on the committees,
because it sets up implied requirements for what can be encoded where
and what properties it has to have. Every time such defaults are set up,
it makes the documentation of what is already "pre-assigned" more
complicated and fragile. Already, a large proportion of the participants
in the maintenance committees have very murky understandings about what
can and cannot be put where in the future, and why. And that is a recipe
for mistakes in encoding.

Finally, like it or not, there currently is no actually contract
guaranteeing that the remaining open ranges in blocks "reserved" for
combining marks will all end up gc=Mn or gc=Me, anyway. The relevant
ranges are 1ABF..1AFF, 1DF6..1DFA, and 20F1..20FF. There is nothing to
prevent the committees from deciding that one (or more) spacing
combining marks might be appropriate to encode there, or possibly even
spacing non-combining marks of some strange sort, like the spacing
Arabic letter diacritics that ended up at FBB2..FBC1. Trying to keep
those ranges free of characters that would not be Grapheme_Extend=Yes
would require some guy on the committee to be aware of the arcane
dependencies for segmentation properties, and then to police such
decisions in perpetuity -- or at least until the blocks in question
filled up with non-problematical characters.

So the long answer is also: No.

--Ken


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161212/963220f8/attachment.html>

From richard.wordingham at ntlworld.com  Mon Dec 12 14:19:29 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 12 Dec 2016 20:19:29 +0000
Subject: Should unassigned code points in blocks reserved for combining
 marks, etc be GCB extended?
In-Reply-To: <f9e5433e-41dd-afc4-25ee-3c2f7484a49c@att.net>
References: <5800337b-bea4-26ac-e13a-5908180ec4c0@att.net>
 <f9e5433e-41dd-afc4-25ee-3c2f7484a49c@att.net>
Message-ID: <20161212201929.3cb828b2@JRWUBU2>

On Mon, 12 Dec 2016 09:30:31 -0800
Ken Whistler <kenwhistler at att.net> wrote:

> On 12/12/2016 6:59 AM, Karl Williamson wrote:

> > These are currently GCB Other, but when assigned, don't we know that
> > they will be Extended?  So this could be done now.

> Any proposal like this then also has hidden costs on the committees,
> because it sets up implied requirements for what can be encoded where
> and what properties it has to have. Every time such defaults are set
> up, it makes the documentation of what is already "pre-assigned" more
> complicated and fragile. Already, a large proportion of the
> participants in the maintenance committees have very murky
> understandings about what can and cannot be put where in the future,
> and why. And that is a recipe for mistakes in encoding.

How does this differ from U+0816 SAMARITAN MARK IN changing from
bidi_class=R to bidi_class=NSM upon assignment?

The idea is to reduce the damage done by the use of obsolete versions of
the Unicode database.
 
> Finally, like it or not, there currently is no actually contract
> guaranteeing that the remaining open ranges in blocks "reserved" for
> combining marks will all end up gc=Mn or gc=Me, anyway. The relevant
> ranges are 1ABF..1AFF, 1DF6..1DFA, and 20F1..20FF. There is nothing to
> prevent the committees from deciding that one (or more) spacing
> combining marks might be appropriate to encode there, or possibly even
> spacing non-combining marks of some strange sort, like the spacing
> Arabic letter diacritics that ended up at FBB2..FBC1. Trying to keep
> those ranges free of characters that would not be Grapheme_Extend=Yes
> would require some guy on the committee to be aware of the arcane
> dependencies for segmentation properties, and then to police such
> decisions in perpetuity -- or at least until the blocks in question
> filled up with non-problematical characters.

What is the down side of a code point changing from Graphme_Extend=Yes
to Grapheme_Extend=No when it is assigned?

Richard.

From verdy_p at wanadoo.fr  Mon Dec 12 16:25:48 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 12 Dec 2016 23:25:48 +0100
Subject: Should unassigned code points in blocks reserved for
 combining marks, etc be GCB extended?
In-Reply-To: <f9e5433e-41dd-afc4-25ee-3c2f7484a49c@att.net>
References: <5800337b-bea4-26ac-e13a-5908180ec4c0@att.net>
 <f9e5433e-41dd-afc4-25ee-3c2f7484a49c@att.net>
Message-ID: <CAGa7JC1z==N6Q4vbfSqF8EUoW7hULhjc3gmr9fKoMkAJbU5soA@mail.gmail.com>

I agree. The remaining slots could be very well allocated for some
notational "superscript" (spacing) marks, more or less formed on ligatures
without being really "extenders" for graphemes as they could as well be
used isolately (I can think about special marks that could be used for
measurement units, or some currencies, or honorific marks, or some
localized variants of symbols like "trademark" or "registered", or some
localized "ampersand" or similar, or some symbol for meaning "birth/death"
after or before a date, or simply the encoding of superscript digits for
Western Arabic or Eastern Arabic for Persian/Urdu, which won't be
"extenders" for any grapheme but used isolately).

The only useful default property is the assignment of a range for strong
RTL letters/digits/punctuation/symbols, because of the complexity and
stability of BiDi algorithms and the security issues and that are related
to them, and difficulties for the UI. On the opposite each assigned block
can contain smaller subranges (sometimes smaller than a full column) for
combining marks, which are spread at various positions (but without huge
complexiuty for handling them in algorithms like normalizations, even if
they are necessarily stabilized: the default combining class for all
unencoded characters is simply 0, blocking any Bidi reordering that would
break later encoded documents using the newly assigned code points:
normalization will apply only to reoder or recombine them only when these
codes will be assigned to known characters with a known possibly non-zero
combining class, but past versions of normalizers will keep them unchanged,
preserving at least the canonical equivalences).


2016-12-12 18:30 GMT+01:00 Ken Whistler <kenwhistler at att.net>:

>
>
>
> -------- Forwarded Message --------
> Subject: Re: Should unassigned code points in blocks reserved for
> combining marks, etc be GCB extended?
> Date: Mon, 12 Dec 2016 08:26:45 -0800
> From: Ken Whistler <kenwhistler at att.net> <kenwhistler at att.net>
> To: Karl Williamson <public at khwilliamson.com> <public at khwilliamson.com>
>
> On 12/12/2016 6:59 AM, Karl Williamson wrote:
> > These are currently GCB Other, but when assigned, don't we know that
> > they will be Extended?  So this could be done now.
> >
>
> Short answer: No.
>
> Long answer:
>
> Every proposal to pre-assign some range of unassigned code points a
> non-default character property value for that range has a bunch of
> hidden costs. This proposal would be particularly costly, because it
> would be smack in the middle of some of the properties with the hairiest
> dependency chains.
>
> GCB=Extend is dependent on Grapheme_Extend=Yes. That also means that any
> particular change for GCB=Extend would also get reflected into WB=Extend
> and SB=Extend, which are also dependent on Grapheme_Extend=Yes.
>
> Grapheme_Extend=Yes is itself a mixed bag. It is derived from gc=Mn or
> gc=Me, which would seem to be a natural match for the blocks "reserved"
> for combining marks, but it is actually also dependent on
> Other_Grapheme_Extend, which is a mixed bag of various spacing combining
> marks for normalization closure, plus ZWNJ, plus tag characters, plus
> spacing halfwidth dakuten.
>
> So that would raise complicated questions about *how* GCB=Extend would
> itself be extended to include certain set ranges of unassigned code
> points. Would those simply be assigned directly to Grapheme_Extend=Yes
> (which would create a complicated default assignment for that derived
> property, and complicate both its documentation *and* its derivation)?
> Or would they be assigned directly to Other_Grapheme_Extend (which would
> create a new animal in the zoo of properties -- a contributory property
> which itself has ranges of unassigned code points given non-default
> values). And once decided, what would be the implications for all the
> documentation and the tooling?
>
> Any proposal like this then also has hidden costs on the committees,
> because it sets up implied requirements for what can be encoded where
> and what properties it has to have. Every time such defaults are set up,
> it makes the documentation of what is already "pre-assigned" more
> complicated and fragile. Already, a large proportion of the participants
> in the maintenance committees have very murky understandings about what
> can and cannot be put where in the future, and why. And that is a recipe
> for mistakes in encoding.
>
> Finally, like it or not, there currently is no actually contract
> guaranteeing that the remaining open ranges in blocks "reserved" for
> combining marks will all end up gc=Mn or gc=Me, anyway. The relevant
> ranges are 1ABF..1AFF, 1DF6..1DFA, and 20F1..20FF. There is nothing to
> prevent the committees from deciding that one (or more) spacing
> combining marks might be appropriate to encode there, or possibly even
> spacing non-combining marks of some strange sort, like the spacing
> Arabic letter diacritics that ended up at FBB2..FBC1. Trying to keep
> those ranges free of characters that would not be Grapheme_Extend=Yes
> would require some guy on the committee to be aware of the arcane
> dependencies for segmentation properties, and then to police such
> decisions in perpetuity -- or at least until the blocks in question
> filled up with non-problematical characters.
>
> So the long answer is also: No.
>
> --Ken
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161212/12bb871f/attachment.html>

From wjgo_10009 at btinternet.com  Wed Dec 14 08:47:46 2016
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 14 Dec 2016 14:47:46 +0000 (GMT)
Subject: Emoji as Art
Message-ID: <5022612.29417.1481726866273.JavaMail.defaultUser@defaultHost>

https://www.moma.org/calendar/exhibitions/3639

The exhibition, "Inbox: The Original Emoji, by Shigetaka Kurita"  is being held in Floor 1, Lobby at MoMA, the Museum of Modern Art in New York.

William Overington

Wednesday 14 December 2016


From reini at cpanel.net  Wed Dec 14 11:28:23 2016
From: reini at cpanel.net (Reini Urban)
Date: Wed, 14 Dec 2016 18:28:23 +0100
Subject: Mixed-Script confusables in prog.languages
In-Reply-To: <CAGHjPPKFqgA4PapzLRY70Czgnd4MZsCZPhdPQuL7+9UNqfxYcw@mail.gmail.com>
References: <2BC2D526-7AAE-4D42-A915-D024F32EEC01@cpanel.net>
 <20161204224558.579e9201@JRWUBU2>
 <285C6A38-8217-4DB5-B309-FEE59A8999E7@cpanel.net>
 <CAGHjPPKFqgA4PapzLRY70Czgnd4MZsCZPhdPQuL7+9UNqfxYcw@mail.gmail.com>
Message-ID: <4F93A790-5431-4DF9-BDFD-0B67AE29E1F8@cpanel.net>


> On Dec 5, 2016, at 1:51 PM, gfb hjjhjh <c933103 at gmail.com> wrote:
> 
> How about package names like ?????21(Note the ?? are Cyrillic), or ?r????, or ??_??????_?'sic_4?ever? Although they aren't really names that people would usually use in package/var names, they are meaningful names?

My program thinks otherwise.

1st:
$ cperl5.25.2 -Mutf8 -e'package ?????21;?

Invalid script Cyrillic in identifier ??21;
 for U+041C. Have Katakana at -e line 1.

Legalize those mixed scripts with this:
$ cperl5.25.2 -C -Mutf8=Katakana,Cyrillic -e'package ?????21;'

2nd:
$ cperl5.25.2 -C -Mutf8 -e'$??_??????_?::sic_4?;'
Invalid script Cyrillic in identifier ?
??????????_??::sic_4???;
 for U+0445. Have Katakana at -e line 1.

Illegal mixed scripts Katakana + Cyrillic + Greek.
Almost legal with this:

$ cperl5.25.2 -C -Mutf8=Katakana,Cyrillic,Greek -e'$??_??????_?::sic_4?;'
Unrecognized character \x{20e3}; marked by <-- HERE after ?_?::sic_4<-- HERE near column 20 at -e line 1.

U+20e3 is no ID_Continue.

3rd:
$ cperl5.25.2 -C -Mutf8 -e'$?r????;'
Unrecognized character \x{b2}; marked by <-- HERE after $?r<-- HERE near column 4 at -e line 1.

? is no ID_Continue

Legal with:
$ cperl5.25.2 -C -Mutf8=Greek,Hiragana,Han -e'$?r???;'

> 2016?12?5? 16:39 ? "Reini Urban" <reini at cpanel.net> ???
> 
> > On Dec 4, 2016, at 11:45 PM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
> >
> > On Sun, 4 Dec 2016 12:09:36 +0100
> > Reini Urban <reini at cpanel.net> wrote:
> >
> >> * normalize identifiers (NFC) and only store normalized variants.
> >> this should catch bidi spoofs, combining characters and such.
> >
> > That doesn't catch bidi spoofs.
> 
> Right. Bidi spoofs are already caught by the IDStart, IDContinue rule.
> 
> i.e. ?goog?le <U+202E (right-to-left override), g, o, o, g, U+202C (pop directional formatting), l, e>
> is already caught as illegal.
> 
> Mixing RTL scripts, such as Arabic with Latin is not caught with the mixed-script rule per se.
> 
> >> * check each unicode code point for its Script property and besides
> >> Latin, Common and Inherited only allow the first script, but error on
> >> any other mixed script. Additional scripts need to be declared.
> >> https://github.com/perl11/cperl/issues/229
> >>
> >> in perl like this:
> >>    use utf8 ?Greek?, ?Cyrillic?;
> >
> > Your rule isn't clear.  Would an identifier like ?_S be automatically
> > allowed?
> 
> ?_S contains Greek U+03C8, Common and Latin. Since Latin and Common are always allowed, the only
> new script is Greek. The first non-default script is automatically and silently allowed, only a mix with another
> non-default script, such as Cyrillic would error or need an explicit declaration.
> 
> So ?_S alone is fine, if everything else is Greek.
> But mixing with the Cyrillic version would lead to an error.
> 
> > I presume you're handling the spoofing of the SMALL PHI characters by
> > other means.
> 
> The spoof attempt would be ?_S with Cyrillic U+0471, Common, Latin.
> 2 mixed scripts which are illegal, if undeclared.
> Same with PHI, which exists as Greek or Cyrillic. Most of Greek characters have confusable
> Cyrillic counterparts, that?s why a declaration of use utf8 ?Greek?, ?Cyrillic?;
> i.e. mixing those two sounds highly dangerous.
> With the UCD confusable table this would be an error. In my rule not, since the user
> declared those two scripts to be mixed.
> 
> > For multilingual support, you would want rules more like
> >
> > 'After script X, allow script Y?.
> 
> Can you expand on that with an example? I?m no expert on this.
> 
> Like after Hangul, allow Han? After Hiragana, allow Katakana?
> 
> >> Of course there exist several languages which require more than one
> >> script,
> > <snip>
> >> or african languages as some have other than Latin roots, e.g.
> >> Ethiopian from Semitic.
> >
> > I don't see your problem here.  What problem do you see with Amharic?
> 
> Amharic is not defined as UCD script property. It?s alphabet is called Ge?ez, which we call
> Ethiopic in the UCD. But that?s all I know. I?m not a domain expert. Does Ethiopic uses
> other Semitic scripts in its alphabet or is it complete? I learned some CFK languages,
> where you historically allow mixed scripts. But for other scripts I?m clueless.
> The examples I got mix it with Runic. Valid or nonsense?
> 
> The problem is to decide which scripts are commonly mixed in which languages to allow
> them to be valid identifiers.
> 
> How about the many Indian scripts? Do they mix?
> Being an indian movie expert tells me that indian languages usually don?t mix.
> They make Tamil and Bengali versions of Hindi movies, and usually fall back to english to
> get common points across the barrier. But their scripts? No idea.
> 
> >
> >> Indian languages also sound problematic,
> >
> > Is this the ZWJ/ZWNJ issue?  That surely is a problem within a script.
> >
> >> and
> >> all the Old_<script>
> >
> > Now I am confused.  What problem do you see that you don't have in the
> > Latin script?
> 
> That I have no idea if those Old_<script> alphabets are still in use to create
> aliases for them.
> In the examples in perl which partially came from parrot there?s a wild eclectic mix of various scripts
> which do make no sense at all. So I don?t know if I can trust those tests, that they make sense and
> are readable at all. My guess is that the authors just liked code golfing and picked random unicode
> characters. It?s from perl after all.
> 
> Such as this perl test t/mro/isa_c3_utf8.t
> 
> use utf8 qw( Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam Hiragana );
> 
> ...
> package ?o?;
> package ur???;
> @ur???::ISA = 'k?o??';
> package ?;
> @ur???::ISA = ('k?o??', '?o?');
> package ??ck?;
> ...
> 
> These identifiers are unreadable, because I don?t assume that anybody will be able to understand
> Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam and Hiragana at once.
> I understand a bit Hangul, Cyrillic and Hiragana, but the mix sounds highly illegal to me.
> 
> So my rule makes sense. You need to declare non-default scripts used in your identifiers if mixed.
> (not strings. these can be everything, even illegal UTF-8).
> 
> 


From reini at cpanel.net  Wed Dec 14 11:44:39 2016
From: reini at cpanel.net (Reini Urban)
Date: Wed, 14 Dec 2016 18:44:39 +0100
Subject: Mixed-Script confusables in prog.languages
In-Reply-To: <20161205143138.57b5cd31@JRWUBU2>
References: <2BC2D526-7AAE-4D42-A915-D024F32EEC01@cpanel.net>
 <20161204224558.579e9201@JRWUBU2>
 <285C6A38-8217-4DB5-B309-FEE59A8999E7@cpanel.net>
 <20161205143138.57b5cd31@JRWUBU2>
Message-ID: <DB2F6734-4F5A-4889-95D3-0CB5FEA5B42B@cpanel.net>


> On Dec 5, 2016, at 3:31 PM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
> 
> On Mon, 5 Dec 2016 09:31:11 +0100
> Reini Urban <reini at cpanel.net> wrote:
> 
>>> On Dec 4, 2016, at 11:45 PM, Richard Wordingham
>>> <richard.wordingham at ntlworld.com> wrote:
>>> 
>>> On Sun, 4 Dec 2016 12:09:36 +0100
>>> Reini Urban <reini at cpanel.net> wrote:
>>> 
>>>> * normalize identifiers (NFC) and only store normalized variants.
>>>> this should catch bidi spoofs, combining characters and such.  
>>> 
>>> That doesn't catch bidi spoofs.  
>> 
>> Right. Bidi spoofs are already caught by the IDStart, IDContinue rule.
>> 
>> i.e. ?goog?le <U+202E (right-to-left override), g, o, o, g, U+202C
>> (pop directional formatting), l, e> is already caught as illegal.
>> 
>> Mixing RTL scripts, such as Arabic with Latin is not caught with the
>> mixed-script rule per se.
>> 
>>>> * check each unicode code point for its Script property and besides
>>>> Latin, Common and Inherited only allow the first script, but error
>>>> on any other mixed script. Additional scripts need to be declared.
>>>> https://github.com/perl11/cperl/issues/229
>>>> 
>>>> in perl like this:
>>>>   use utf8 ?Greek?, ?Cyrillic?;  
>>> 
>>> Your rule isn't clear.  Would an identifier like ?_S be
>>> automatically allowed?  
>> 
>> ?_S contains Greek U+03C8, Common and Latin. Since Latin and Common
>> are always allowed, the only new script is Greek. The first
>> non-default script is automatically and silently allowed, only a mix
>> with another non-default script, such as Cyrillic would error or need
>> an explicit declaration.
>> 
>> So ?_S alone is fine, if everything else is Greek.
>> But mixing with the Cyrillic version would lead to an error.
>> 
>>> I presume you're handling the spoofing of the SMALL PHI characters
>>> by other means.  
>> 
>> The spoof attempt would be ?_S with Cyrillic U+0471, Common, Latin.
>> 2 mixed scripts which are illegal, if undeclared.
>> Same with PHI, which exists as Greek or Cyrillic. Most of Greek
>> characters have confusable Cyrillic counterparts, that?s why a
>> declaration of use utf8 ?Greek?, ?Cyrillic?; i.e. mixing those two
>> sounds highly dangerous. With the UCD confusable table this would be
>> an error. In my rule not, since the user declared those two scripts
>> to be mixed.
> 
> The choice with PHI includes:
> 
> U+0278 LATIN SMALL LETTER PHI
> U+03C6 GREEK SMALL LETTER PHI
> 
> a Greek (!) script character with compatibiity decomposition to U+03C6
> 
> U+03D5 GREEK PHI SYMBOL
> 
> and a whole host of common script characters with compatibility
> decomposition to U+03C6:
> 
> U+1D6D7 MATHEMATICAL BOLD SMALL PHI
> U+1D6DF MATHEMATICAL BOLD PHI SYMBOL
> U+1D711 MATHEMATICAL ITALIC SMALL PHI
> U+1D719 MATHEMATICAL ITALIC PHI SYMBOL
> U+1D74B MATHEMATICAL BOLD ITALIC SMALL PHI
> U+1D753 MATHEMATICAL BOLD ITALIC PHI SYMBOL
> U+1D785 MATHEMATICAL SANS-SERIF BOLD SMALL PHI
> U+1D78D MATHEMATICAL SANS-SERIF BOLD PHI SYMBOL
> U+1D7BF MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL PHI
> U+1D7C7 MATHEMATICAL SANS-SERIF BOLD ITALIC PHI SYMBOL
> 
> They are all ID_Start.

Oh my. Dragons beware. So I need to add some trie tables to add warnings with those rules also.
I don?t want to error on some obscure confusables rule only yet.
perl doesn?t even ship the security tables, so people are not aware of it.

> You didn't mention the inherited script.  Is that automatically
> allowed, e.g. ??? <U+03C6, U+0308 COMBINING DIAERESIS, U+1D63 LATIN
> SUBSCRIPT SMALL LETTER R> (scripts: Greek, inherited, Latin)?  I
> encountered that variable name in a radar specification last week.

Inherited is allowed with ID_Continue, yes. Not in ID_Start position.
Combiners are normalized to NFC.

> There might be issues - it's possible that ?? <U+0915 DEVANAGARI LETTER
> KA, U+0310 COMBINING CANDRABINDU> might spoof ?? <U+0915, U+0901
> DEVANAGARI SIGN CANDRABINDU>.

Good test case:

\x{915}\x{310} is legal Devanagari normalized to one char.
\x{915}\x{901} are two legal Devanagari characters.
but they are confusables. This would need special confusable rules.

> 
>>> For multilingual support, you would want rules more like
>>> 
>>> 'After script X, allow script Y?.  
>> 
>> Can you expand on that with an example? I?m no expert on this.
>> 
>> Like after Hangul, allow Han? After Hiragana, allow Katakana?
> 
> It allows one to mix Japanese and Korean variables without being able
> to kana and Hangul.
> 
> Some of the Semitic abjads are sometimes used with vowel symbols
> normally assoicated with a different Semitic script.  One could use
> such a construct to limit the mixing.  However, for such cases a rule
> such as 'allow script Y marks on script X bases' would be much better.
> 
>>> I don't see your problem here.  What problem do you see with
>>> Amharic?  
> 
>> Amharic is not defined as UCD script property. It?s alphabet is
>> called Ge?ez, which we call Ethiopic in the UCD. But that?s all I
>> know. I?m not a domain expert. Does Ethiopic uses other Semitic
>> scripts in its alphabet or is it complete? I learned some CFK
>> languages, where you historically allow mixed scripts. But for other
>> scripts I?m clueless. The examples I got mix it with Runic. Valid or
>> nonsense?
> 
> I would say nonsense - or graphic design.  The use of Chinese
> ideographs alongside sinoform scripts is the primary example.
> However, 'symbols' as opposed to letters may leak from one script to
> another, and that may be an issue for variable names.  For example,
> English can use Arabic numerals, Roman numerals or Roman letters for
> numbering in lists, and I've known people to resort to Greek letters.
> Accent marks can also move, though these are usually encoded
> separately.  I've already used the example of candrabindu being
> borrowed from the Devanagari script to the Latin script - it was
> borrowed for use in Sanskrit.
> 
>> How about the many Indian scripts? Do they mix?
> 
> Microsoft mostly won't let long-supported *Indian* scripts mix within
> syllables.
> 
> I would say they mixed in much the same way as the Latin and Cyrillic
> scripts mix.  In many ways they act as font variants of one another, so
> features and rare letters may move between them.  This is most
> noticeable where large chunks of the Brahmi character set are missing,
> such as Tamil and Lao.  For Tamil, the gaps may be filled by 'Grantha'
> letters.  For Lao, subscript consonants bear a very strong resemblance
> to the Tai Tham subscript forms.  On the other hand, the unencoded
> characters added to the Lao script to support Pali have been
> well harmonised to the Lao script, and using characters from other
> scripts for them would definitely be wrong.  (There's mostly a
> consensus as to what the right bogus coding for them within the Lao
> block is.  Unfortunately, I don't have good enough evidence for an
> encoding proposal.)

I see. This would be a fine case for needed declaration of those mixed scripts.
Similar to East-Asian.


>> That I have no idea if those Old_<script> alphabets are still in use
>> to create aliases for them.
> 
> They'll still be in use.  We had a guy at work (computer department)
> who kept notes on his whiteboard in runes.  Someone analysing cuneiform
> texts might very well want to create variable names that are a mix of
> Latin for function (as 'n_' = "number of") and cuneiform for the form
> being counted or whatever.
> 
>> Such as this perl test t/mro/isa_c3_utf8.t
>> 
>> use utf8 qw( Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam
>> Hiragana );
>> 
>> ...
>> package ?o?;
>> package ur???;
>> @ur???::ISA = 'k?o??';
>> package ?;
>> @ur???::ISA = ('k?o??', '?o?');
>> package ??ck?;
>> ...
>> 
>> These identifiers are unreadable, because I don?t assume that anybody
>> will be able to understand Hangul Cyrillic Ethiopic
>> Canadian_Aboriginal Malayalam and Hiragana at once. I understand a
>> bit Hangul, Cyrillic and Hiragana, but the mix sounds highly illegal
>> to me.
> 
> There's no law against it!  More to the point, it was just a test.

I declared it as no undeclared mixed-script law in my language :)

> However, allowing Cyrillic or Greek immediately makes every apparent
> 'o' (or 'A') a potential spoof.  Remember, "Letter 'O' Considered
> Harmful?. 

Yes, this should be warned about.


From joe at unicode.org  Wed Dec 14 12:58:50 2016
From: joe at unicode.org (Joe Becker)
Date: Wed, 14 Dec 2016 10:58:50 -0800
Subject: Emoji as Art - Emoji as Marginalia
In-Reply-To: <5022612.29417.1481726866273.JavaMail.defaultUser@defaultHost>
References: <5022612.29417.1481726866273.JavaMail.defaultUser@defaultHost>
Message-ID: <3a429378-43c7-e637-86f6-d5497d1e8b33@unicode.org>

Emoji - pictorial characters - really descended from the first writing 
systems ... which *were* pictorial characters ... visibly so in ancient 
Egyptian, as in ancient hanzi/kanji.  Egyptian and Asian artists often 
mixed pictorial text with pictorial artwork.

A famous European example of marginal signs in print was the "dangerous 
curve" sign used in the mathematical writings of the Nicolas Bourbaki group.

Living in Japan in the 1970's, before any personal computers there, we 
saw many books with marginal cartoon characters and other such 
meta-text, which I imagine were the immediate antecedent of emoji in the 
Japanese computer context.

Joe


From richard.wordingham at ntlworld.com  Thu Dec 15 14:29:26 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 15 Dec 2016 20:29:26 +0000
Subject: Mixed-Script confusables in prog.languages
In-Reply-To: <DB2F6734-4F5A-4889-95D3-0CB5FEA5B42B@cpanel.net>
References: <2BC2D526-7AAE-4D42-A915-D024F32EEC01@cpanel.net>
 <20161204224558.579e9201@JRWUBU2>
 <285C6A38-8217-4DB5-B309-FEE59A8999E7@cpanel.net>
 <20161205143138.57b5cd31@JRWUBU2>
 <DB2F6734-4F5A-4889-95D3-0CB5FEA5B42B@cpanel.net>
Message-ID: <20161215202926.105d6f69@JRWUBU2>

On Wed, 14 Dec 2016 18:44:39 +0100
Reini Urban <reini at cpanel.net> wrote:

> On Dec 5, 2016, at 3:31 PM, Richard Wordingham
> <richard.wordingham at ntlworld.com> wrote:

> > The choice with PHI includes:
> > 
> > U+0278 LATIN SMALL LETTER PHI
> > U+03C6 GREEK SMALL LETTER PHI
> > 
> > a Greek (!) script character with compatibiity decomposition to
> > U+03C6
> > 
> > U+03D5 GREEK PHI SYMBOL
> > 
> > and a whole host of common script characters with compatibility
> > decomposition to U+03C6:
> > 
> > U+1D6D7 MATHEMATICAL BOLD SMALL PHI
> > U+1D6DF MATHEMATICAL BOLD PHI SYMBOL
> > U+1D711 MATHEMATICAL ITALIC SMALL PHI
> > U+1D719 MATHEMATICAL ITALIC PHI SYMBOL
> > U+1D74B MATHEMATICAL BOLD ITALIC SMALL PHI
> > U+1D753 MATHEMATICAL BOLD ITALIC PHI SYMBOL
> > U+1D785 MATHEMATICAL SANS-SERIF BOLD SMALL PHI
> > U+1D78D MATHEMATICAL SANS-SERIF BOLD PHI SYMBOL
> > U+1D7BF MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL PHI
> > U+1D7C7 MATHEMATICAL SANS-SERIF BOLD ITALIC PHI SYMBOL
> > 
> > They are all ID_Start.  
> 
> Oh my. Dragons beware. So I need to add some trie tables to add
> warnings with those rules also. I don?t want to error on some obscure
> confusables rule only yet. perl doesn?t even ship the security
> tables, so people are not aware of it.

Another solution would be to treat two identifiers as the same if they
have the same NFKC normalisation.

> > You didn't mention the inherited script.  Is that automatically
> > allowed, e.g. ??? <U+03C6, U+0308 COMBINING DIAERESIS, U+1D63 LATIN
> > SUBSCRIPT SMALL LETTER R> (scripts: Greek, inherited, Latin)?  I
> > encountered that variable name in a radar specification last week.  
> 
> Inherited is allowed with ID_Continue, yes. Not in ID_Start position.
> Combiners are normalized to NFC.

<U+03C6, U+0308, U+1D63> is unchanged under normalisation to NFC, NFD,
NFKC and NFKD. 

> > There might be issues - it's possible that ?? <U+0915 DEVANAGARI
> > LETTER KA, U+0310 COMBINING CANDRABINDU> might spoof ?? <U+0915,
> > U+0901 DEVANAGARI SIGN CANDRABINDU>.  

> \x{915}\x{310} is legal Devanagari normalized to one char.

I don't know know what you mean by this statement. <U+0915, U+0310> is
also unchanged under the 4 normalisations.
 
> \x{915}\x{901} are two legal Devanagari characters.
> but they are confusables. This would need special confusable rules.

Additionally, U+0310 can be confused quite readily with the sequence
<U+0306 COMBINING BREVE, U+0307 COMBINING DOT ABOVE>.

Richard.


From wjgo_10009 at btinternet.com  Fri Dec 16 12:50:35 2016
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 16 Dec 2016 18:50:35 +0000 (GMT)
Subject: Emoji as Art
In-Reply-To: <5022612.29417.1481726866273.JavaMail.defaultUser@defaultHost>
References: <5022612.29417.1481726866273.JavaMail.defaultUser@defaultHost>
Message-ID: <23843016.51094.1481914235485.JavaMail.defaultUser@defaultHost>

Here is a link to a web page that has some pictures of the emoji installation at the Museum of Modern Art, MoMA, in New York, the pictures shown at one quarter of the size of the original pictures that were kindly supplied by MoMA. Thank you to MoMA for the pictures. 

http://www.users.globalnet.co.uk/~ngo/emoji_installation_at_MoMA.htm

William Overington

Friday 16 December 2016


From public at khwilliamson.com  Mon Dec 19 17:04:06 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Mon, 19 Dec 2016 16:04:06 -0700
Subject: Best practices for replacing UTF-8 overlongs
Message-ID: <20083c6b-c861-b197-5fdb-d091daaeb517@khwilliamson.com>

It seems counterintuitive to me that the two byte sequence C0 80 should 
be replaced by 2 replacement characters under best practices, or that E0 
80 80 should also be replaced by 2.  Each sequence was legal in early 
Unicode versions, and it seems that it would be best to treat them as 
each a single sequence, replacing by a single replacement character.

What are the advantages to replacing them by multiple characters

From doug at ewellic.org  Mon Dec 19 17:52:36 2016
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 19 Dec 2016 16:52:36 -0700
Subject: Best practices for replacing UTF-8 overlongs
Message-ID: <20161219165236.665a7a7059d7ee80bb4d670165c8327d.246a1e84fa.wbe@email03.godaddy.com>

Karl Williamson wrote:

> It seems counterintuitive to me that the two byte sequence C0 80
> should be replaced by 2 replacement characters under best practices,
> or that E0 80 80 should also be replaced by 2. Each sequence was legal
> in early Unicode versions,

This is overstated at best. Decoders weren't required to detect overlong
sequences until 2000, but it was never legal to generate them. This was
stated explicitly in RFC 2279 and in Unicode 1.1, Appendix F. Correct
use of the instructions and table in RFC 2044 also precluded the
creation of overlong sequences. 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From markus.icu at gmail.com  Mon Dec 19 18:17:29 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 19 Dec 2016 16:17:29 -0800
Subject: Best practices for replacing UTF-8 overlongs
In-Reply-To: <20083c6b-c861-b197-5fdb-d091daaeb517@khwilliamson.com>
References: <20083c6b-c861-b197-5fdb-d091daaeb517@khwilliamson.com>
Message-ID: <CAN49p6p2giNGD0b=EUqGahXSAwNnycF-hmcJoGWPsbcCDsjfqQ@mail.gmail.com>

On Mon, Dec 19, 2016 at 3:04 PM, Karl Williamson <public at khwilliamson.com>
wrote:

> It seems counterintuitive to me that the two byte sequence C0 80 should be
> replaced by 2 replacement characters under best practices, or that E0 80 80
> should also be replaced by 2.  Each sequence was legal in early Unicode
> versions, and it seems that it would be best to treat them as each a single
> sequence, replacing by a single replacement character.
>

Looks like the ICU converters and string-iteration macros do what you
expect (if I understand your expectations).

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161219/93f7a480/attachment.html>

From Shawn.Steele at microsoft.com  Mon Dec 19 18:25:47 2016
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Tue, 20 Dec 2016 00:25:47 +0000
Subject: Best practices for replacing UTF-8 overlongs
In-Reply-To: <20161219165236.665a7a7059d7ee80bb4d670165c8327d.246a1e84fa.wbe@email03.godaddy.com>
References: <20161219165236.665a7a7059d7ee80bb4d670165c8327d.246a1e84fa.wbe@email03.godaddy.com>
Message-ID: <MWHPR03MB28132D0DFC7E1B05CDF47D5282900@MWHPR03MB2813.namprd03.prod.outlook.com>

IMO, the first byte of the 2 byte sequence is illegal.  So replace it with a single replacement character (hey, I ran into something unintelligible), and move on.  Then you encounter a trail byte without a lead byte, so again, it's an illegal byte, so replace it with a single replacement character.

So you end up with two.

Replacing the sequence with a single byte implies some perceived understanding of an intended structure that actually doesn't exist.

I'm curious though what the practical difference would be?  If I encountered junk like that in the middle of a string, my string is going to be disrupted by an unexpected replacement character.  At that point it's already mangled, so does it really matter if there're two instead of one?

-Shawn

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell
Sent: Monday, December 19, 2016 3:53 PM
To: Unicode Mailing List <unicode at unicode.org>
Cc: Karl Williamson <public at khwilliamson.com>
Subject: Re: Best practices for replacing UTF-8 overlongs

Karl Williamson wrote:

> It seems counterintuitive to me that the two byte sequence C0 80 
> should be replaced by 2 replacement characters under best practices, 
> or that E0 80 80 should also be replaced by 2. Each sequence was legal 
> in early Unicode versions,

This is overstated at best. Decoders weren't required to detect overlong sequences until 2000, but it was never legal to generate them. This was stated explicitly in RFC 2279 and in Unicode 1.1, Appendix F. Correct use of the instructions and table in RFC 2044 also precluded the creation of overlong sequences. 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From richard.wordingham at ntlworld.com  Mon Dec 19 18:43:15 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 20 Dec 2016 00:43:15 +0000
Subject: Best practices for replacing UTF-8 overlongs
In-Reply-To: <20083c6b-c861-b197-5fdb-d091daaeb517@khwilliamson.com>
References: <20083c6b-c861-b197-5fdb-d091daaeb517@khwilliamson.com>
Message-ID: <20161220004315.145a5aad@JRWUBU2>

On Mon, 19 Dec 2016 16:04:06 -0700
Karl Williamson <public at khwilliamson.com> wrote:

> What are the advantages to replacing them by multiple characters

Presumably it just provides more pain for those who code using UTF-8 as
opposed to UTF-16, just like the *former* requirements to be able to be
able to search for lone surrogates (Unicode Regular Expressions RL1.7)
or give lone surrogates a specific position in DUCET collation (UCA
Conformance test - automatic test failure if working in UTF-8!).  Moving
one 'character' backwards through a purported UTF-8 string gets so much
more interesting when one backs into E0 80 BF.

It also makes it harder to bend UTF-8 to allow U+0000 in C strings.
One trick for making essentially UTF-8 programs non-compliant is to have
test strings with embedded nulls.  One solution that has been used is to
allow C0 80 to represent U+0000 in a null-terminated string.

Of course, this problem goes away if C0 is used to introduce
replacements for the formerly useful non-characters. :-)

Of course, there is the issue of what to do with F8 80 81 82 83.
Replace by one character as once legal, or by two as no character can
be represented by more than four bytes?

Richard.

From textexin at xencraft.com  Mon Dec 19 19:23:25 2016
From: textexin at xencraft.com (Tex Texin)
Date: Mon, 19 Dec 2016 17:23:25 -0800
Subject: Best practices for replacing UTF-8 overlongs
In-Reply-To: <MWHPR03MB28132D0DFC7E1B05CDF47D5282900@MWHPR03MB2813.namprd03.prod.outlook.com>
References: <20161219165236.665a7a7059d7ee80bb4d670165c8327d.246a1e84fa.wbe@email03.godaddy.com>
 <MWHPR03MB28132D0DFC7E1B05CDF47D5282900@MWHPR03MB2813.namprd03.prod.outlook.com>
Message-ID: <000f01d25a5f$a9bacbb0$fd306310$@xencraft.com>

If there is a short sequence of invalid bytes presumed to be one character, then one vs several replacement characters may not matter. But if it were a longer sequence that might have been several invalidly coded characters, then multiple replacement characters would give a more correct representation of the amount of information that was removed or miscoded.

There isn't much to be gained by collapsing the bad bytes to a single replacement character. However, doing so does remove the information about how many bytes were invalid and that may have value to a user in assessing how much of the document is suspect.

tex


-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Shawn Steele
Sent: Monday, December 19, 2016 4:26 PM
To: Doug Ewell; Unicode Mailing List
Cc: Karl Williamson
Subject: RE: Best practices for replacing UTF-8 overlongs

IMO, the first byte of the 2 byte sequence is illegal.  So replace it with a single replacement character (hey, I ran into something unintelligible), and move on.  Then you encounter a trail byte without a lead byte, so again, it's an illegal byte, so replace it with a single replacement character.

So you end up with two.

Replacing the sequence with a single byte implies some perceived understanding of an intended structure that actually doesn't exist.

I'm curious though what the practical difference would be?  If I encountered junk like that in the middle of a string, my string is going to be disrupted by an unexpected replacement character.  At that point it's already mangled, so does it really matter if there're two instead of one?

-Shawn

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell
Sent: Monday, December 19, 2016 3:53 PM
To: Unicode Mailing List <unicode at unicode.org>
Cc: Karl Williamson <public at khwilliamson.com>
Subject: Re: Best practices for replacing UTF-8 overlongs

Karl Williamson wrote:

> It seems counterintuitive to me that the two byte sequence C0 80 
> should be replaced by 2 replacement characters under best practices, 
> or that E0 80 80 should also be replaced by 2. Each sequence was legal 
> in early Unicode versions,

This is overstated at best. Decoders weren't required to detect overlong sequences until 2000, but it was never legal to generate them. This was stated explicitly in RFC 2279 and in Unicode 1.1, Appendix F. Correct use of the instructions and table in RFC 2044 also precluded the creation of overlong sequences. 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From Shawn.Steele at microsoft.com  Mon Dec 19 19:40:55 2016
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Tue, 20 Dec 2016 01:40:55 +0000
Subject: Best practices for replacing UTF-8 overlongs
In-Reply-To: <000f01d25a5f$a9bacbb0$fd306310$@xencraft.com>
References: <20161219165236.665a7a7059d7ee80bb4d670165c8327d.246a1e84fa.wbe@email03.godaddy.com>
 <MWHPR03MB28132D0DFC7E1B05CDF47D5282900@MWHPR03MB2813.namprd03.prod.outlook.com>
 <000f01d25a5f$a9bacbb0$fd306310$@xencraft.com>
Message-ID: <MWHPR03MB28136E00EC4061A5AB2E558282900@MWHPR03MB2813.namprd03.prod.outlook.com>

IMO, bad bytes == corruption.  At that point all bets are off because the machine has no clue "how" it was corrupted.  It could just be a single flipped bit lost in transmission.  It could have been an attack hack using overlong byte sequences.  It could be an entire lost packet/block/sector.

-Shawn

-----Original Message-----
From: Tex Texin [mailto:textexin at xencraft.com] 
Sent: Monday, December 19, 2016 5:23 PM
To: Shawn Steele <Shawn.Steele at microsoft.com>; 'Doug Ewell' <doug at ewellic.org>; 'Unicode Mailing List' <unicode at unicode.org>
Cc: 'Karl Williamson' <public at khwilliamson.com>
Subject: RE: Best practices for replacing UTF-8 overlongs

If there is a short sequence of invalid bytes presumed to be one character, then one vs several replacement characters may not matter. But if it were a longer sequence that might have been several invalidly coded characters, then multiple replacement characters would give a more correct representation of the amount of information that was removed or miscoded.

There isn't much to be gained by collapsing the bad bytes to a single replacement character. However, doing so does remove the information about how many bytes were invalid and that may have value to a user in assessing how much of the document is suspect.

tex


-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Shawn Steele
Sent: Monday, December 19, 2016 4:26 PM
To: Doug Ewell; Unicode Mailing List
Cc: Karl Williamson
Subject: RE: Best practices for replacing UTF-8 overlongs

IMO, the first byte of the 2 byte sequence is illegal.  So replace it with a single replacement character (hey, I ran into something unintelligible), and move on.  Then you encounter a trail byte without a lead byte, so again, it's an illegal byte, so replace it with a single replacement character.

So you end up with two.

Replacing the sequence with a single byte implies some perceived understanding of an intended structure that actually doesn't exist.

I'm curious though what the practical difference would be?  If I encountered junk like that in the middle of a string, my string is going to be disrupted by an unexpected replacement character.  At that point it's already mangled, so does it really matter if there're two instead of one?

-Shawn

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell
Sent: Monday, December 19, 2016 3:53 PM
To: Unicode Mailing List <unicode at unicode.org>
Cc: Karl Williamson <public at khwilliamson.com>
Subject: Re: Best practices for replacing UTF-8 overlongs

Karl Williamson wrote:

> It seems counterintuitive to me that the two byte sequence C0 80 
> should be replaced by 2 replacement characters under best practices, 
> or that E0 80 80 should also be replaced by 2. Each sequence was legal 
> in early Unicode versions,

This is overstated at best. Decoders weren't required to detect overlong sequences until 2000, but it was never legal to generate them. This was stated explicitly in RFC 2279 and in Unicode 1.1, Appendix F. Correct use of the instructions and table in RFC 2044 also precluded the creation of overlong sequences. 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From doug at ewellic.org  Mon Dec 19 20:08:39 2016
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 19 Dec 2016 19:08:39 -0700
Subject: Best practices for replacing UTF-8 overlongs
Message-ID: <xew2wpa1jusnnn7xnmnv9g39.1482199719895@email.android.com>

I thought there was a corrigendum or other, comparatively recent addition to the Standard that spelled out how replacement characters are supposed to be substituted for invalid code unit sequences -- something about detecting maximally long sequences. I'll look when I have a chance.
--Doug Ewell | Thornton, CO, US | ewellic.org
-------- Original message --------From: Tex Texin <textexin at xencraft.com> Date: 12/19/16  18:23  (GMT-07:00) To: 'Shawn Steele' <Shawn.Steele at microsoft.com>, 'Doug Ewell' <doug at ewellic.org>, 'Unicode Mailing List' <unicode at unicode.org> Cc: 'Karl Williamson' <public at khwilliamson.com> Subject: RE: Best practices for replacing UTF-8 overlongs 
If there is a short sequence of invalid bytes presumed to be one character, then one vs several replacement characters may not matter. But if it were a longer sequence that might have been several invalidly coded characters, then multiple replacement characters would give a more correct representation of the amount of information that was removed or miscoded.

There isn't much to be gained by collapsing the bad bytes to a single replacement character. However, doing so does remove the information about how many bytes were invalid and that may have value to a user in assessing how much of the document is suspect.

tex


-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Shawn Steele
Sent: Monday, December 19, 2016 4:26 PM
To: Doug Ewell; Unicode Mailing List
Cc: Karl Williamson
Subject: RE: Best practices for replacing UTF-8 overlongs

IMO, the first byte of the 2 byte sequence is illegal.? So replace it with a single replacement character (hey, I ran into something unintelligible), and move on.? Then you encounter a trail byte without a lead byte, so again, it's an illegal byte, so replace it with a single replacement character.

So you end up with two.

Replacing the sequence with a single byte implies some perceived understanding of an intended structure that actually doesn't exist.

I'm curious though what the practical difference would be?? If I encountered junk like that in the middle of a string, my string is going to be disrupted by an unexpected replacement character.? At that point it's already mangled, so does it really matter if there're two instead of one?

-Shawn

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell
Sent: Monday, December 19, 2016 3:53 PM
To: Unicode Mailing List <unicode at unicode.org>
Cc: Karl Williamson <public at khwilliamson.com>
Subject: Re: Best practices for replacing UTF-8 overlongs

Karl Williamson wrote:

> It seems counterintuitive to me that the two byte sequence C0 80 
> should be replaced by 2 replacement characters under best practices, 
> or that E0 80 80 should also be replaced by 2. Each sequence was legal 
> in early Unicode versions,

This is overstated at best. Decoders weren't required to detect overlong sequences until 2000, but it was never legal to generate them. This was stated explicitly in RFC 2279 and in Unicode 1.1, Appendix F. Correct use of the instructions and table in RFC 2044 also precluded the creation of overlong sequences. 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161219/40e8e0f4/attachment.html>

From textexin at xencraft.com  Mon Dec 19 20:35:30 2016
From: textexin at xencraft.com (Tex Texin)
Date: Mon, 19 Dec 2016 18:35:30 -0800
Subject: Best practices for replacing UTF-8 overlongs
In-Reply-To: <MWHPR03MB28136E00EC4061A5AB2E558282900@MWHPR03MB2813.namprd03.prod.outlook.com>
References: <20161219165236.665a7a7059d7ee80bb4d670165c8327d.246a1e84fa.wbe@email03.godaddy.com>
 <MWHPR03MB28132D0DFC7E1B05CDF47D5282900@MWHPR03MB2813.namprd03.prod.outlook.com>
 <000f01d25a5f$a9bacbb0$fd306310$@xencraft.com>
 <MWHPR03MB28136E00EC4061A5AB2E558282900@MWHPR03MB2813.namprd03.prod.outlook.com>
Message-ID: <000601d25a69$bbaf00c0$330d0240$@xencraft.com>

Shawn,

Ok, but that begs the questions of what to do...
"All bets are off" is not instructive.

How software behaves in the face of invalid bytes, what it does with them, what it does about them, and how it continues (or not) still needs to be determined.

tex

-----Original Message-----
From: Shawn Steele [mailto:Shawn.Steele at microsoft.com] 
Sent: Monday, December 19, 2016 5:41 PM
To: Tex Texin; 'Doug Ewell'; 'Unicode Mailing List'
Cc: 'Karl Williamson'
Subject: RE: Best practices for replacing UTF-8 overlongs

IMO, bad bytes == corruption.  At that point all bets are off because the machine has no clue "how" it was corrupted.  It could just be a single flipped bit lost in transmission.  It could have been an attack hack using overlong byte sequences.  It could be an entire lost packet/block/sector.

-Shawn

-----Original Message-----
From: Tex Texin [mailto:textexin at xencraft.com]
Sent: Monday, December 19, 2016 5:23 PM
To: Shawn Steele <Shawn.Steele at microsoft.com>; 'Doug Ewell' <doug at ewellic.org>; 'Unicode Mailing List' <unicode at unicode.org>
Cc: 'Karl Williamson' <public at khwilliamson.com>
Subject: RE: Best practices for replacing UTF-8 overlongs

If there is a short sequence of invalid bytes presumed to be one character, then one vs several replacement characters may not matter. But if it were a longer sequence that might have been several invalidly coded characters, then multiple replacement characters would give a more correct representation of the amount of information that was removed or miscoded.

There isn't much to be gained by collapsing the bad bytes to a single replacement character. However, doing so does remove the information about how many bytes were invalid and that may have value to a user in assessing how much of the document is suspect.

tex


-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Shawn Steele
Sent: Monday, December 19, 2016 4:26 PM
To: Doug Ewell; Unicode Mailing List
Cc: Karl Williamson
Subject: RE: Best practices for replacing UTF-8 overlongs

IMO, the first byte of the 2 byte sequence is illegal.  So replace it with a single replacement character (hey, I ran into something unintelligible), and move on.  Then you encounter a trail byte without a lead byte, so again, it's an illegal byte, so replace it with a single replacement character.

So you end up with two.

Replacing the sequence with a single byte implies some perceived understanding of an intended structure that actually doesn't exist.

I'm curious though what the practical difference would be?  If I encountered junk like that in the middle of a string, my string is going to be disrupted by an unexpected replacement character.  At that point it's already mangled, so does it really matter if there're two instead of one?

-Shawn

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell
Sent: Monday, December 19, 2016 3:53 PM
To: Unicode Mailing List <unicode at unicode.org>
Cc: Karl Williamson <public at khwilliamson.com>
Subject: Re: Best practices for replacing UTF-8 overlongs

Karl Williamson wrote:

> It seems counterintuitive to me that the two byte sequence C0 80 
> should be replaced by 2 replacement characters under best practices, 
> or that E0 80 80 should also be replaced by 2. Each sequence was legal 
> in early Unicode versions,

This is overstated at best. Decoders weren't required to detect overlong sequences until 2000, but it was never legal to generate them. This was stated explicitly in RFC 2279 and in Unicode 1.1, Appendix F. Correct use of the instructions and table in RFC 2044 also precluded the creation of overlong sequences. 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From d3ck0r at gmail.com  Mon Dec 19 20:56:58 2016
From: d3ck0r at gmail.com (J Decker)
Date: Mon, 19 Dec 2016 18:56:58 -0800
Subject: Best practices for replacing UTF-8 overlongs
In-Reply-To: <20083c6b-c861-b197-5fdb-d091daaeb517@khwilliamson.com>
References: <20083c6b-c861-b197-5fdb-d091daaeb517@khwilliamson.com>
Message-ID: <CAA2GJqU5rwRrDDKymkWhfHoDnssDp+JH-i_Z3oanY5Q9UWpjBg@mail.gmail.com>

On Mon, Dec 19, 2016 at 3:04 PM, Karl Williamson <public at khwilliamson.com>
wrote:

> It seems counterintuitive to me that the two byte sequence C0 80 should be
> replaced by 2 replacement characters under best practices, or that E0 80 80
> should also be replaced by 2.  Each sequence was legal in early Unicode
> versions, and it seems that it would be best to treat them as each a single
> sequence, replacing by a single replacement character.
>
> What are the advantages to replacing them by multiple characters
>

C0 80 is about the only exception; due to the prevalent use of '\0' as end
of string.
I tend not to generate that unless coming from wchar_t to utf8, and the
length exceeds the characters

Most things will die badly when fed 'overlong' characters, because
everything should be represented with least possible bits... (0-0x7f is
just 1 char, but c0 80 is not nessecariy 0)

and really is otherwise illegal to most places that implement codepoint
conversions...

there were many 'legal' definitions that just will never be used because
there is really a finite number of characters under 20 bits.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161219/865ca01c/attachment.html>

From duerst at it.aoyama.ac.jp  Mon Dec 19 21:19:43 2016
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=)
Date: Tue, 20 Dec 2016 12:19:43 +0900
Subject: Best practices for replacing UTF-8 overlongs
In-Reply-To: <000601d25a69$bbaf00c0$330d0240$@xencraft.com>
References: <20161219165236.665a7a7059d7ee80bb4d670165c8327d.246a1e84fa.wbe@email03.godaddy.com>
 <MWHPR03MB28132D0DFC7E1B05CDF47D5282900@MWHPR03MB2813.namprd03.prod.outlook.com>
 <000f01d25a5f$a9bacbb0$fd306310$@xencraft.com>
 <MWHPR03MB28136E00EC4061A5AB2E558282900@MWHPR03MB2813.namprd03.prod.outlook.com>
 <000601d25a69$bbaf00c0$330d0240$@xencraft.com>
Message-ID: <26caf57b-3338-43e1-b16a-72c354cbb601@it.aoyama.ac.jp>

On 2016/12/20 11:35, Tex Texin wrote:
> Shawn,
>
> Ok, but that begs the questions of what to do...
> "All bets are off" is not instructive.

Well, it may be instructive in that its difficult to get software to 
decide what happened. A human may be in a better position to analyze the 
error and the cause(s) of the error, and to fix these.

> How software behaves in the face of invalid bytes, what it does with them, what it does about them, and how it continues (or not) still needs to be determined.

Yes, but that will depend on circumstances. In a safety-critical 
application, you'll want to do something different than if you are 
sending the text to a printer just to have a look at it.

Regards,   Martin.

From doug at ewellic.org  Mon Dec 19 21:54:31 2016
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 19 Dec 2016 20:54:31 -0700
Subject: Best practices for replacing UTF-8 overlongs
Message-ID: <gkd7kaqhw4k79jw6t8xu1vv0.1482206001507@email.android.com>

The relevant section is titled "Constraints on Conversion Processes" and is in Section 3 of TUS 9.0, page 126, right after definition D93.
--Doug Ewell | Thornton, CO, US | ewellic.org
-------- Original message --------From: Doug Ewell <doug at ewellic.org> Date: 12/19/16  19:08  (GMT-07:00) To: Tex Texin <textexin at xencraft.com>, 'Shawn Steele' <Shawn.Steele at microsoft.com>, 'Unicode Mailing List' <unicode at unicode.org> Cc: 'Karl Williamson' <public at khwilliamson.com> Subject: RE: Best practices for replacing UTF-8 overlongs 
I thought there was a corrigendum or other, comparatively recent addition to the Standard that spelled out how replacement characters are supposed to be substituted for invalid code unit sequences -- something about detecting maximally long sequences. I'll look when I have a chance.
--Doug Ewell | Thornton, CO, US | ewellic.org
-------- Original message --------From: Tex Texin <textexin at xencraft.com> Date: 12/19/16  18:23  (GMT-07:00) To: 'Shawn Steele' <Shawn.Steele at microsoft.com>, 'Doug Ewell' <doug at ewellic.org>, 'Unicode Mailing List' <unicode at unicode.org> Cc: 'Karl Williamson' <public at khwilliamson.com> Subject: RE: Best practices for replacing UTF-8 overlongs 
If there is a short sequence of invalid bytes presumed to be one character, then one vs several replacement characters may not matter. But if it were a longer sequence that might have been several invalidly coded characters, then multiple replacement characters would give a more correct representation of the amount of information that was removed or miscoded.

There isn't much to be gained by collapsing the bad bytes to a single replacement character. However, doing so does remove the information about how many bytes were invalid and that may have value to a user in assessing how much of the document is suspect.

tex


-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Shawn Steele
Sent: Monday, December 19, 2016 4:26 PM
To: Doug Ewell; Unicode Mailing List
Cc: Karl Williamson
Subject: RE: Best practices for replacing UTF-8 overlongs

IMO, the first byte of the 2 byte sequence is illegal.? So replace it with a single replacement character (hey, I ran into something unintelligible), and move on.? Then you encounter a trail byte without a lead byte, so again, it's an illegal byte, so replace it with a single replacement character.

So you end up with two.

Replacing the sequence with a single byte implies some perceived understanding of an intended structure that actually doesn't exist.

I'm curious though what the practical difference would be?? If I encountered junk like that in the middle of a string, my string is going to be disrupted by an unexpected replacement character.? At that point it's already mangled, so does it really matter if there're two instead of one?

-Shawn

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell
Sent: Monday, December 19, 2016 3:53 PM
To: Unicode Mailing List <unicode at unicode.org>
Cc: Karl Williamson <public at khwilliamson.com>
Subject: Re: Best practices for replacing UTF-8 overlongs

Karl Williamson wrote:

> It seems counterintuitive to me that the two byte sequence C0 80 
> should be replaced by 2 replacement characters under best practices, 
> or that E0 80 80 should also be replaced by 2. Each sequence was legal 
> in early Unicode versions,

This is overstated at best. Decoders weren't required to detect overlong sequences until 2000, but it was never legal to generate them. This was stated explicitly in RFC 2279 and in Unicode 1.1, Appendix F. Correct use of the instructions and table in RFC 2044 also precluded the creation of overlong sequences. 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161219/51e99a8d/attachment.html>

From Shawn.Steele at microsoft.com  Mon Dec 19 22:35:51 2016
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Tue, 20 Dec 2016 04:35:51 +0000
Subject: Best practices for replacing UTF-8 overlongs
In-Reply-To: <000601d25a69$bbaf00c0$330d0240$@xencraft.com>
References: <20161219165236.665a7a7059d7ee80bb4d670165c8327d.246a1e84fa.wbe@email03.godaddy.com>
 <MWHPR03MB28132D0DFC7E1B05CDF47D5282900@MWHPR03MB2813.namprd03.prod.outlook.com>
 <000f01d25a5f$a9bacbb0$fd306310$@xencraft.com>
 <MWHPR03MB28136E00EC4061A5AB2E558282900@MWHPR03MB2813.namprd03.prod.outlook.com>
 <000601d25a69$bbaf00c0$330d0240$@xencraft.com>
Message-ID: <MWHPR03MB28132342EAE808B4B57BFBA282900@MWHPR03MB2813.namprd03.prod.outlook.com>

So... an input data stream with corrupt UTF-8 basically has (under any scheme being discussed) some number of replacement characters.

Each of those replacement characters indicates at least one garbled byte, but without additional information, they aren't a great indicator of missing bytes.

I'm uncertain of what software would want to do with them at that point.  Making assumptions about over-long byte sequences being intelligible seems like it would require deep knowledge of how UTF-8 works, in which case why bother calling a generic UTF-8 decoding API?  

"All bets are off" may not be very instructive, however I don't think the "one replacement character per bad byte" or "one replacement character for many bytes" improves that situation at all.  

A too-long lead/trail byte could mean that someone used a bad encoder.
It could also mean that someone was trying something malicious
Another possibility would be flipped bits corruption during transmission/storage.
Alternatively, it could be part of a missing sequence.  
Confused software could have also done a mixed up copy-paste between applications.
Or bad buffering
Or....

Without additional information, I'm not sure what you expect the software should "know" besides "this data stream is definitely not 100% perfect" (alternatively you wouldn't necessarily know that a valid data stream had not been corrupted into an invalid stream).  

The options at that point would seem to be:

* Just keep on going, maybe some user'll fix it later.
* Warn the user somehow (though we're not going to be able to tell them much beyond "corrupted file")
* Reject the data as corrupted and refuse to load it.
* Attempt some sort of repair - that seems unlikely for most applications unless they have unique knowledge of possible corruption modes and have some sort of redundancy built in.

-Shawn

PS: The one thing I *do* know is that (from a program I had to debug once), it is unwise to decode binary executable as UTF-8, (so it contains a large number of replacement characters), and then take the hash of that resulting stream, and then test that to ensure that the binary had not been tampered with.  That program could have been rewritten to do most anything without touching the hash!

-----Original Message-----
From: Tex Texin [mailto:textexin at xencraft.com] 
Sent: Monday, December 19, 2016 6:36 PM
To: Shawn Steele <Shawn.Steele at microsoft.com>; 'Doug Ewell' <doug at ewellic.org>; 'Unicode Mailing List' <unicode at unicode.org>
Cc: 'Karl Williamson' <public at khwilliamson.com>
Subject: RE: Best practices for replacing UTF-8 overlongs

Shawn,

Ok, but that begs the questions of what to do...
"All bets are off" is not instructive.

How software behaves in the face of invalid bytes, what it does with them, what it does about them, and how it continues (or not) still needs to be determined.

tex

-----Original Message-----
From: Shawn Steele [mailto:Shawn.Steele at microsoft.com]
Sent: Monday, December 19, 2016 5:41 PM
To: Tex Texin; 'Doug Ewell'; 'Unicode Mailing List'
Cc: 'Karl Williamson'
Subject: RE: Best practices for replacing UTF-8 overlongs

IMO, bad bytes == corruption.  At that point all bets are off because the machine has no clue "how" it was corrupted.  It could just be a single flipped bit lost in transmission.  It could have been an attack hack using overlong byte sequences.  It could be an entire lost packet/block/sector.

-Shawn

-----Original Message-----
From: Tex Texin [mailto:textexin at xencraft.com]
Sent: Monday, December 19, 2016 5:23 PM
To: Shawn Steele <Shawn.Steele at microsoft.com>; 'Doug Ewell' <doug at ewellic.org>; 'Unicode Mailing List' <unicode at unicode.org>
Cc: 'Karl Williamson' <public at khwilliamson.com>
Subject: RE: Best practices for replacing UTF-8 overlongs

If there is a short sequence of invalid bytes presumed to be one character, then one vs several replacement characters may not matter. But if it were a longer sequence that might have been several invalidly coded characters, then multiple replacement characters would give a more correct representation of the amount of information that was removed or miscoded.

There isn't much to be gained by collapsing the bad bytes to a single replacement character. However, doing so does remove the information about how many bytes were invalid and that may have value to a user in assessing how much of the document is suspect.

tex


-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Shawn Steele
Sent: Monday, December 19, 2016 4:26 PM
To: Doug Ewell; Unicode Mailing List
Cc: Karl Williamson
Subject: RE: Best practices for replacing UTF-8 overlongs

IMO, the first byte of the 2 byte sequence is illegal.  So replace it with a single replacement character (hey, I ran into something unintelligible), and move on.  Then you encounter a trail byte without a lead byte, so again, it's an illegal byte, so replace it with a single replacement character.

So you end up with two.

Replacing the sequence with a single byte implies some perceived understanding of an intended structure that actually doesn't exist.

I'm curious though what the practical difference would be?  If I encountered junk like that in the middle of a string, my string is going to be disrupted by an unexpected replacement character.  At that point it's already mangled, so does it really matter if there're two instead of one?

-Shawn

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell
Sent: Monday, December 19, 2016 3:53 PM
To: Unicode Mailing List <unicode at unicode.org>
Cc: Karl Williamson <public at khwilliamson.com>
Subject: Re: Best practices for replacing UTF-8 overlongs

Karl Williamson wrote:

> It seems counterintuitive to me that the two byte sequence C0 80 
> should be replaced by 2 replacement characters under best practices, 
> or that E0 80 80 should also be replaced by 2. Each sequence was legal 
> in early Unicode versions,

This is overstated at best. Decoders weren't required to detect overlong sequences until 2000, but it was never legal to generate them. This was stated explicitly in RFC 2279 and in Unicode 1.1, Appendix F. Correct use of the instructions and table in RFC 2044 also precluded the creation of overlong sequences. 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From Shawn.Steele at microsoft.com  Mon Dec 19 23:31:43 2016
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Tue, 20 Dec 2016 05:31:43 +0000
Subject: Best practices for replacing UTF-8 overlongs
In-Reply-To: <26caf57b-3338-43e1-b16a-72c354cbb601@it.aoyama.ac.jp>
References: <20161219165236.665a7a7059d7ee80bb4d670165c8327d.246a1e84fa.wbe@email03.godaddy.com>
 <MWHPR03MB28132D0DFC7E1B05CDF47D5282900@MWHPR03MB2813.namprd03.prod.outlook.com>
 <000f01d25a5f$a9bacbb0$fd306310$@xencraft.com>
 <MWHPR03MB28136E00EC4061A5AB2E558282900@MWHPR03MB2813.namprd03.prod.outlook.com>
 <000601d25a69$bbaf00c0$330d0240$@xencraft.com>
 <26caf57b-3338-43e1-b16a-72c354cbb601@it.aoyama.ac.jp>
Message-ID: <MWHPR03MB281320A902D1D3AA0AAF7D2582900@MWHPR03MB2813.namprd03.prod.outlook.com>

Yes, I just don't see how the # of emitted replacement characters changes the flowchart on what to do when it's bad :)

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Martin J. D?rst
Sent: Monday, December 19, 2016 7:20 PM
To: 'Unicode Mailing List' <unicode at unicode.org>
Subject: Re: Best practices for replacing UTF-8 overlongs

On 2016/12/20 11:35, Tex Texin wrote:
> Shawn,
>
> Ok, but that begs the questions of what to do...
> "All bets are off" is not instructive.

Well, it may be instructive in that its difficult to get software to decide what happened. A human may be in a better position to analyze the error and the cause(s) of the error, and to fix these.

> How software behaves in the face of invalid bytes, what it does with them, what it does about them, and how it continues (or not) still needs to be determined.

Yes, but that will depend on circumstances. In a safety-critical application, you'll want to do something different than if you are sending the text to a printer just to have a look at it.

Regards,   Martin.


From richard.wordingham at ntlworld.com  Tue Dec 20 00:56:53 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 20 Dec 2016 06:56:53 +0000
Subject: Best practices for replacing UTF-8 overlongs
In-Reply-To: <gkd7kaqhw4k79jw6t8xu1vv0.1482206001507@email.android.com>
References: <gkd7kaqhw4k79jw6t8xu1vv0.1482206001507@email.android.com>
Message-ID: <20161220065653.48356ef2@JRWUBU2>

On Mon, 19 Dec 2016 20:54:31 -0700
Doug Ewell <doug at ewellic.org> wrote:

> There isn't much to be gained by collapsing the bad bytes to a single
> replacement character. However, doing so does remove the information
> about how many bytes were invalid and that may have value to a user
> in assessing how much of the document is suspect.

How many bytes are invalid in the sequence F0 30 A0 B0?  There might
just be one bit error in the data stream.

The chief advantage of collapsing comes in the simplicity of the
decoding logic.  The natural logic is to read the requisite number of
continuation bytes, converting the whole to a codepoint value, and then
check that the codepoint value is allowed in UTF-8. Obviously one also
has to check that the requisite continuation bytes are present.

Arguments then come down to the use or otherwise of library functions
and the number of error-reporting mechanisms to be used.

Richard.

From kenwhistler at att.net  Tue Dec 20 10:59:11 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Tue, 20 Dec 2016 08:59:11 -0800
Subject: Best practices for replacing UTF-8 overlongs
In-Reply-To: <xew2wpa1jusnnn7xnmnv9g39.1482199719895@email.android.com>
References: <xew2wpa1jusnnn7xnmnv9g39.1482199719895@email.android.com>
Message-ID: <119dcc01-193d-6e92-63d7-c72199387c62@att.net>

Doug,


On 12/19/2016 6:08 PM, Doug Ewell wrote:
> I thought there was a corrigendum or other, comparatively recent 
> addition to the Standard that spelled out how replacement characters 
> are supposed to be substituted for invalid code unit sequences -- 
> something about detecting maximally long sequences. I'll look when I 
> have a chance.
>
You found the resulting text in TUS 9.0, p. 126 - 129. The origin of the 
text there about best practices for using U+FFFD  was the discussion and 
resolution of PRI #121 in August, 2008:

http://www.unicode.org/review/pr-121.html

That was discussed at UTC #116. See the minutes:

http://www.unicode.org/L2/L2008/08253.htm

There was feedback at the time advocating the 3rd option, rather than 
the 2nd one that was eventually chosen by the UTC. See:

http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt

The actual text that resulted was first published in Unicode 5.2, p. 95:

http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf

Contrast that with the text in Unicode 5.0, which had no extended 
discussion about handling conversion errors there. The Unicode 5.2 text 
was later expanded with more definitions and explanation, to what you 
see now in Unicode 9.0.

--Ken

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161220/1a184836/attachment.html>

From markus.icu at gmail.com  Tue Dec 20 12:33:50 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 20 Dec 2016 10:33:50 -0800
Subject: Best practices for replacing UTF-8 overlongs
In-Reply-To: <119dcc01-193d-6e92-63d7-c72199387c62@att.net>
References: <xew2wpa1jusnnn7xnmnv9g39.1482199719895@email.android.com>
 <119dcc01-193d-6e92-63d7-c72199387c62@att.net>
Message-ID: <CAN49p6p1bmT-B4C3pD8CbJf7XsCV3k8Nh5=9wHFeFV=qrtKvvQ@mail.gmail.com>

On Tue, Dec 20, 2016 at 8:59 AM, Ken Whistler <kenwhistler at att.net> wrote:

> You found the resulting text in TUS 9.0, p. 126 - 129. The origin of the
> text there about best practices for using U+FFFD  was the discussion and
> resolution of PRI #121 in August, 2008:
>
> http://www.unicode.org/review/pr-121.html
>

Yes. However, some of the discussion in this thread is due to details that
were not spelled out in the PRI. There is basically a 2a and a 2b, while
the examples in PRI #121 work the same in both variants.

2a. As Richard said, "The natural logic is to read the requisite number of
continuation bytes, converting the whole to a codepoint value, and then
check that the codepoint value is allowed in UTF-8. Obviously one also has
to check that the requisite continuation bytes are present."

This naturally treats overlong sequences, surrogate-code-point sequences,
and 5/6-byte sequences (and prefixes thereof) as single errors.
(I suppose that lead byte above F4 could be somewhat debatable.)

(This is what ICU does for UTF-8.)

2b. The text in the standard represents the workings of a state machine
that walks strictly valid sequences. Overlong/surrogate/etc. sequences
become multiple errors.

(This is what ICU converters do for multi-byte charsets like Shift-JIS and
GB 18030.)

In my opinion, 2a. "feels right" for UTF-8, because of the history and
mechanics of the encoding, and 2b. is a good fit for MBCS where concepts
like overlong sequences don't exist. (And for GB 18030 you do have to walk
a validity state machine, you can't just look at the lead byte.)

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161220/c3599968/attachment.html>

From kenwhistler at att.net  Tue Dec 20 12:47:10 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Tue, 20 Dec 2016 10:47:10 -0800
Subject: Best practices for replacing UTF-8 overlongs
In-Reply-To: <CAN49p6p1bmT-B4C3pD8CbJf7XsCV3k8Nh5=9wHFeFV=qrtKvvQ@mail.gmail.com>
References: <xew2wpa1jusnnn7xnmnv9g39.1482199719895@email.android.com>
 <119dcc01-193d-6e92-63d7-c72199387c62@att.net>
 <CAN49p6p1bmT-B4C3pD8CbJf7XsCV3k8Nh5=9wHFeFV=qrtKvvQ@mail.gmail.com>
Message-ID: <36c72f47-f4a9-171f-a9da-7487f6427608@att.net>


On 12/20/2016 10:33 AM, Markus Scherer wrote:
> Yes. However, some of the discussion in this thread is due to details 
> that were not spelled out in the PRI. There is basically a 2a and a 
> 2b, while the examples in PRI #121 work the same in both variants.
>

I wasn't intending to argue the case one way or the other, but rather 
just to point people to the historical context for the text as it 
currently is in the standard, since that came up in discussion.

--Ken


From cjm2265 at gmail.com  Tue Dec 20 15:53:42 2016
From: cjm2265 at gmail.com (Chris Monteleone)
Date: Tue, 20 Dec 2016 16:53:42 -0500
Subject: Emoji Packs
Message-ID: <CAPnwY4Q7LgeJf9MMQcqjWz8GuL=+a0wPQS0N4HnbetEFCw99dQ@mail.gmail.com>

Hello

I subscribed just now to ask this question, so please bear with me.

I would like to download emoji from every available vendor into a well
organized folder with the code points as file names. Obviously I can
download them individually from the Unicode.org charts, but holy tedious.
Where can I find something like this?

Also, what are my usage rights (ie how does licensing work)?

Thanks!
Chris
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161220/709b6fec/attachment.html>

From christoph.paeper at crissov.de  Tue Dec 20 19:09:29 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Wed, 21 Dec 2016 02:09:29 +0100
Subject: Emoji Packs
In-Reply-To: <CAPnwY4Q7LgeJf9MMQcqjWz8GuL=+a0wPQS0N4HnbetEFCw99dQ@mail.gmail.com>
References: <CAPnwY4Q7LgeJf9MMQcqjWz8GuL=+a0wPQS0N4HnbetEFCw99dQ@mail.gmail.com>
Message-ID: <D2147B9C-6275-4A56-BCED-6CB9B0C3D6F4@crissov.de>

Chris Monteleone <cjm2265 at gmail.com>:
> 
> I would like to download emoji from every available vendor into a well organized folder with the code points as file names.

I assume you?re looking for <https://github.com/iamcal/emoji-data/>.

From martinmueller at northwestern.edu  Tue Dec 20 20:29:59 2016
From: martinmueller at northwestern.edu (Martin Mueller)
Date: Wed, 21 Dec 2016 02:29:59 +0000
Subject: a character for an unknown character
Message-ID: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>

I?m new to this list. Please excuse my technical incompetence.
Is there a Unicode character that says ?I represent an alphanumerical character, but I don?t know which?.  This is a very common problem in the transcription of historical texts where you have lacunas. Often, the extent of the lacuna is known, and the alphabet is known as well. The EEBO TCP transcriptions of English texts before 1700 are good examples.  They are SGML transcriptions, where missing stuff is represented by <gap/> elements with attributes about this or that. This is efficient when it comes to pages, very inefficient when it comes to individual characters.
There is a Web character?a diamond with a question mark inside it?which means ?I may know what this character represents, but I can?t display it?. Which is a very different message. On the other hand, if you extened the use of that character, it probably wouldn?t? create much ambiguity.
In the TCP project, various code points from the Geometrical were used to represent lacunae. The black circle (\u25cf) has been used as the character for a missing character.This is OK and unambiguous in its context.   But would be nice to have a special character for just that purpose, and given the number of emoji, this doesn?t seem to be a particularly frivolous request.  Which alphabet, you might ask. But that doesn?t really matter. There is a very high probability that the missing character comes from the character set of the surrounding words. And if that isn?t the case, the transcriber wouldn?t know it. S/he sees that there is something, perhaps even that there is just one of it, but doesn?t know which

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161221/ded9a98f/attachment.html>

From cjm2265 at gmail.com  Tue Dec 20 21:10:13 2016
From: cjm2265 at gmail.com (Chris Monteleone)
Date: Tue, 20 Dec 2016 22:10:13 -0500
Subject: Emoji Packs
In-Reply-To: <D2147B9C-6275-4A56-BCED-6CB9B0C3D6F4@crissov.de>
References: <CAPnwY4Q7LgeJf9MMQcqjWz8GuL=+a0wPQS0N4HnbetEFCw99dQ@mail.gmail.com>
 <D2147B9C-6275-4A56-BCED-6CB9B0C3D6F4@crissov.de>
Message-ID: <CAPnwY4S9O1yCR-WTBZBZpceqxfb07WywX8ZPectfZtZXUZes6g@mail.gmail.com>

Sir, you are a scholar and a gentleman. Your swift actions of charity are
much appreciated.

Now if you happen to know where I can find the same thing for Samsung,
Facebook, and Windows that would be everything I need.

Thanks!
Chris

PS
I have spent a fair amount of time looking for these, I'm not just
delegating my tedious work to the internets.

On Tue, Dec 20, 2016 at 8:09 PM, Christoph P?per <
christoph.paeper at crissov.de> wrote:

> Chris Monteleone <cjm2265 at gmail.com>:
> >
> > I would like to download emoji from every available vendor into a well
> organized folder with the code points as file names.
>
> I assume you?re looking for <https://github.com/iamcal/emoji-data/>.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161220/c380000e/attachment.html>

From verdy_p at wanadoo.fr  Wed Dec 21 01:57:56 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 21 Dec 2016 08:57:56 +0100
Subject: a character for an unknown character
In-Reply-To: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
Message-ID: <CAGa7JC2vnH-Ze4HXxAtsvpHn8BzB55CWrRvo9FAV=9yNLE0u6g@mail.gmail.com>

there's a "replacement" control, whose rendering is undefined. It may
represent any missing part covering more than one character, such as parts
that have been burned, or overstrikken. This Unicode character can act as a
substitute but its rendering is purposely undefined. An application may
show some greyed box there, but it should not be the tofu box used for
characters not mapped in the specified fonts.
Older encoduing used the ASCII control "SUB" for representing this
function. Some terminals displayed it as a filled box Other documents have
used the ASCII DEL control for the same purpose. However for Unicode
encodings ASCII controls should be avoided.

This is not an Emoji, as Emojis have a clear visual representation and
semantics (and often specific colors). But you're right, it should be a
symbol in Unicode (like Emojis, but unlike ASCII controls)

2016-12-21 3:29 GMT+01:00 Martin Mueller <martinmueller at northwestern.edu>:

> I?m new to this list. Please excuse my technical incompetence.
>
> Is there a Unicode character that says ?I represent an alphanumerical
> character, but I don?t know which?.  This is a very common problem in the
> transcription of historical texts where you have lacunas. Often, the extent
> of the lacuna is known, and the alphabet is known as well. The EEBO TCP
> transcriptions of English texts before 1700 are good examples.  They are
> SGML transcriptions, where missing stuff is represented by <gap/> elements
> with attributes about this or that. This is efficient when it comes to
> pages, very inefficient when it comes to individual characters.
>
> There is a Web character?a diamond with a question mark inside it?which
> means ?I may know what this character represents, but I can?t display it?.
> Which is a very different message. On the other hand, if you extened the
> use of that character, it probably wouldn?t? create much ambiguity.
>
> In the TCP project, various code points from the Geometrical were used to
> represent lacunae. The black circle (\u25cf) has been used as the character
> for a missing character.This is OK and unambiguous in its context.   But
> would be nice to have a special character for just that purpose, and given
> the number of emoji, this doesn?t seem to be a particularly frivolous
> request.  Which alphabet, you might ask. But that doesn?t really matter.
> There is a very high probability that the missing character comes from the
> character set of the surrounding words. And if that isn?t the case, the
> transcriber wouldn?t know it. S/he sees that there is something, perhaps
> even that there is just one of it, but doesn?t know which
>
>
>
> Martin Mueller
>
> Professor emeritus of English and Classics
>
> Northwestern University
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161221/fc78c9d1/attachment.html>

From 637275 at gmail.com  Wed Dec 21 02:11:16 2016
From: 637275 at gmail.com (Rebecca T)
Date: Wed, 21 Dec 2016 08:11:16 +0000
Subject: a character for an unknown character
In-Reply-To: <CAGa7JC2vnH-Ze4HXxAtsvpHn8BzB55CWrRvo9FAV=9yNLE0u6g@mail.gmail.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <CAGa7JC2vnH-Ze4HXxAtsvpHn8BzB55CWrRvo9FAV=9yNLE0u6g@mail.gmail.com>
Message-ID: <CANDtJjikzAVckfm8rhnzQkSg-axsgawOdrdeiGrJYG44Z5Q6HQ@mail.gmail.com>

U+FFFD REPLACEMENT CHARACTER ?

On Wed, Dec 21, 2016 at 3:05 AM Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> there's a "replacement" control, whose rendering is undefined. It may
> represent any missing part covering more than one character, such as parts
> that have been burned, or overstrikken. This Unicode character can act as a
> substitute but its rendering is purposely undefined. An application may
> show some greyed box there, but it should not be the tofu box used for
> characters not mapped in the specified fonts.
> Older encoduing used the ASCII control "SUB" for representing this
> function. Some terminals displayed it as a filled box Other documents have
> used the ASCII DEL control for the same purpose. However for Unicode
> encodings ASCII controls should be avoided.
>
> This is not an Emoji, as Emojis have a clear visual representation and
> semantics (and often specific colors). But you're right, it should be a
> symbol in Unicode (like Emojis, but unlike ASCII controls)
>
> 2016-12-21 3:29 GMT+01:00 Martin Mueller <martinmueller at northwestern.edu>:
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> I?m new to this list. Please excuse my technical incompetence.
>
>
> Is there a Unicode character that says ?I represent an alphanumerical
> character, but I don?t know which?.  This is a very common problem in the
> transcription of historical texts where you have lacunas. Often,
>
> the extent of the lacuna is known, and the alphabet is known as well. The
> EEBO TCP transcriptions of English texts before 1700 are good examples.
> They are SGML transcriptions, where missing stuff is represented by <gap/>
> elements with attributes about this
>
> or that. This is efficient when it comes to pages, very inefficient when
> it comes to individual characters.
>
>
> There is a Web character?a diamond with a question mark inside it?which
> means ?I may know what this character represents, but I can?t display it?.
> Which is a very different message. On the other hand, if you
>
> extened the use of that character, it probably wouldn?t? create much
> ambiguity.
>
>
>
> In the TCP project, various code points from the Geometrical were used to
> represent lacunae. The black circle (\u25cf) has been used as the character
> for a missing character.This is OK and unambiguous in its
>
> context.   But would be nice to have a special character for just that
> purpose, and given the number of emoji, this doesn?t seem to be a
> particularly frivolous request.  Which alphabet, you might ask. But that
> doesn?t really matter. There is a very high probability
>
> that the missing character comes from the character set of the surrounding
> words. And if that isn?t the case, the transcriber wouldn?t know it. S/he
> sees that there is something, perhaps even that there is just one of it,
> but doesn?t know which
>
>
>
>
>
>
> Martin Mueller
>
>
> Professor emeritus of English and Classics
>
>
> Northwestern University
>
>
>
>
>
>
>
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161221/4b302b8f/attachment.html>

From corbett.dav at husky.neu.edu  Wed Dec 21 09:45:08 2016
From: corbett.dav at husky.neu.edu (David Corbett)
Date: Wed, 21 Dec 2016 10:45:08 -0500
Subject: a character for an unknown character
Message-ID: <CAKQz=Z4VvM8ZMHMxN8j1XERsMJpnBNDMbJzHJO0Ysf3H6Kjwdg@mail.gmail.com>

One Unicode character specifically for this purpose is U+3013 GETA MARK. It
is a Japanese symbol used to replace characters that cannot be read during
transcription of manuscripts (source: Japanese Wikipedia). It looks like a
bold equals sign: ?.

Other people have suggested U+FFFD REPLACEMENT CHARACTER, but that is the
diamond with a question mark inside it that you already mentioned.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161221/f748f89d/attachment.html>

From public at khwilliamson.com  Wed Dec 21 10:20:38 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Wed, 21 Dec 2016 09:20:38 -0700
Subject: a character for an unknown character
In-Reply-To: <CAKQz=Z4VvM8ZMHMxN8j1XERsMJpnBNDMbJzHJO0Ysf3H6Kjwdg@mail.gmail.com>
References: <CAKQz=Z4VvM8ZMHMxN8j1XERsMJpnBNDMbJzHJO0Ysf3H6Kjwdg@mail.gmail.com>
Message-ID: <318384cf-ea76-a533-a8da-4067e5e3954c@khwilliamson.com>

On 12/21/2016 08:45 AM, David Corbett wrote:
> One Unicode character specifically for this purpose is U+3013 GETA MARK.
> It is a Japanese symbol used to replace characters that cannot be read
> during transcription of manuscripts (source: Japanese Wikipedia). It
> looks like a bold equals sign: ?.
>
> Other people have suggested U+FFFD REPLACEMENT CHARACTER, but that is
> the diamond with a question mark inside it that you already mentioned.

And the RREPLACEMENT CHARACTER is what is used if there is a malformed 
character, such as could occur in transmission.  Thus, its use is 
ambiguous to you.  The reader would not know if the original manuscript 
was illegible here, or if somewhere along the way something happened to 
foul up the text.


From everson at evertype.com  Wed Dec 21 10:25:30 2016
From: everson at evertype.com (Michael Everson)
Date: Wed, 21 Dec 2016 16:25:30 +0000
Subject: a character for an unknown character
In-Reply-To: <318384cf-ea76-a533-a8da-4067e5e3954c@khwilliamson.com>
References: <CAKQz=Z4VvM8ZMHMxN8j1XERsMJpnBNDMbJzHJO0Ysf3H6Kjwdg@mail.gmail.com>
 <318384cf-ea76-a533-a8da-4067e5e3954c@khwilliamson.com>
Message-ID: <56F89D32-17FE-426A-958B-E3B239D778AF@evertype.com>

I still believe that we need INVISIBLE LETTER http://unicode.org/review/pr-41-invisible.pdf

I think that for the display of combining characters without a base character that the recommended NBSP makes no sense. NBSP is supposed to glue the characters on either side of it to itself. It makes sense that the following character, say COMBINING ACUTE ACCENT, would be glued to it. But why should the two of those be glued to whatever precedes?

Michael Everson

From jsbien at mimuw.edu.pl  Wed Dec 21 10:44:43 2016
From: jsbien at mimuw.edu.pl (Janusz S. Bien)
Date: Wed, 21 Dec 2016 17:44:43 +0100
Subject: a character for an unknown character
In-Reply-To: <56F89D32-17FE-426A-958B-E3B239D778AF@evertype.com>
References: <CAKQz=Z4VvM8ZMHMxN8j1XERsMJpnBNDMbJzHJO0Ysf3H6Kjwdg@mail.gmail.com>
 <318384cf-ea76-a533-a8da-4067e5e3954c@khwilliamson.com>
 <56F89D32-17FE-426A-958B-E3B239D778AF@evertype.com>
Message-ID: <20161221174443.15894ua4705vwk8r@mail.mimuw.edu.pl>

Quote/Cytat - Michael Everson <everson at evertype.com> (Wed 21 Dec 2016  
05:25:30 PM CET):

> I still believe that we need INVISIBLE LETTER  
> http://unicode.org/review/pr-41-invisible.pdf
>
> I think that for the display of combining characters without a base  
> character that the recommended NBSP makes no sense. NBSP is supposed  
> to glue the characters on either side of it to itself. It makes  
> sense that the following character, say COMBINING ACUTE ACCENT,  
> would be glued to it. But why should the two of those be glued to  
> whatever precedes?

I strongly support this. In our historical corpus of Polish

http://korpusy.klf.uw.edu.pl/en/IMPACT_GT_2/

we have in particular words ending with 'COMBINING LATIN SMALL LETTER  
O' (U+0366).

We had to precede the character with NBSP as the vase, but to preserve  
the correct segmentation into words we had to treat NBSP as a letter.

Best regards

Janusz

-- 
Prof. dr hab. Janusz S. Bie? -  Uniwersytet Warszawski (Katedra  
Lingwistyki Formalnej)
Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/


From corbett.dav at husky.neu.edu  Wed Dec 21 10:56:27 2016
From: corbett.dav at husky.neu.edu (David Corbett)
Date: Wed, 21 Dec 2016 11:56:27 -0500
Subject: Invisible letter (was Re: a character for an unknown character)
Message-ID: <CAKQz=Z5GzTGgfwU9b3_FiSBxM9zNJSHv-7j8v__cU-eUpMX+jw@mail.gmail.com>

Couldn?t you use U+1D52 MODIFIER LETTER SMALL O?

(I changed the subject line because the invisible letter proposal is not
relevant to the question about a lacuna character.)

> I strongly support this. In our historical corpus of Polish
>
> http://korpusy.klf.uw.edu.pl/en/IMPACT_GT_2/
>
> we have in particular words ending with 'COMBINING LATIN SMALL LETTER
> O' (U+0366).
>
> We had to precede the character with NBSP as the vase, but to preserve
> the correct segmentation into words we had to treat NBSP as a letter.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161221/33d30158/attachment.html>

From 637275 at gmail.com  Wed Dec 21 11:07:03 2016
From: 637275 at gmail.com (Rebecca T)
Date: Wed, 21 Dec 2016 12:07:03 -0500
Subject: Emoji Packs
In-Reply-To: <CAPnwY4Qh+-MvesJRXS31iPvgS6i5ub5eNiL39mQffzuBuOUenw@mail.gmail.com>
References: <CAPnwY4Q7LgeJf9MMQcqjWz8GuL=+a0wPQS0N4HnbetEFCw99dQ@mail.gmail.com>
 <D2147B9C-6275-4A56-BCED-6CB9B0C3D6F4@crissov.de>
 <CAPnwY4S9O1yCR-WTBZBZpceqxfb07WywX8ZPectfZtZXUZes6g@mail.gmail.com>
 <CANDtJjhf6ZKXZD=fZCKy_zFD+JjKaiuYKv+rG0nQXcDw_OTH5Q@mail.gmail.com>
 <CAPnwY4Qh+-MvesJRXS31iPvgS6i5ub5eNiL39mQffzuBuOUenw@mail.gmail.com>
Message-ID: <CANDtJjhhZ78syTyo-oedqBKvNZR9qqSWx3C1DnsysE=iYZW_Hw@mail.gmail.com>

Okay, I threw something together.

github.com/9999years/emoji has all 18,615 images from the charts, and the
generating script is there as process.py
<https://github.com/9999years/emoji/blob/master/process.py> as well!

All the images are thrown together in one directory, but if there?s a
better way to organize them, please let me know!

On Wed, Dec 21, 2016 at 10:28 AM, Chris Monteleone <cjm2265 at gmail.com>
wrote:

> "Unfamiliar" would be an understatement. If you feel like putting that
> together it would be appreciated, but no pressure.
>
> Thank you!
>
> On Tue, Dec 20, 2016 at 11:01 PM, Rebecca T <637275 at gmail.com> wrote:
>
>> The charts include the images as inline base64, right?Parsing them out
>> with Python might not be a bad idea, depending on how well-organized the
>> HTML is. If you?re unfamiliar with it, I might be able to throw together a
>> quick script later.
>>
>> On Tue, Dec 20, 2016 at 10:59 PM Chris Monteleone <cjm2265 at gmail.com>
>> wrote:
>>
>>> Sir, you are a scholar and a gentleman. Your swift actions of charity
>>> are much appreciated.
>>>
>>> Now if you happen to know where I can find the same thing for Samsung,
>>> Facebook, and Windows that would be everything I need.
>>>
>>> Thanks!
>>> Chris
>>>
>>> PS
>>> I have spent a fair amount of time looking for these, I'm not just
>>> delegating my tedious work to the internets.
>>>
>>> On Tue, Dec 20, 2016 at 8:09 PM, Christoph P?per <
>>> christoph.paeper at crissov.de> wrote:
>>>
>>> Chris Monteleone <cjm2265 at gmail.com>:
>>>
>>>
>>> >
>>>
>>>
>>> > I would like to download emoji from every available vendor into a well
>>> organized folder with the code points as file names.
>>>
>>>
>>>
>>>
>>>
>>> I assume you?re looking for <https://github.com/iamcal/emoji-data/>.
>>>
>>>
>>>
>>>
>>>
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161221/6de65a07/attachment.html>

From jsbien at mimuw.edu.pl  Wed Dec 21 11:15:19 2016
From: jsbien at mimuw.edu.pl (Janusz S. Bien)
Date: Wed, 21 Dec 2016 18:15:19 +0100
Subject: Invisible letter (was Re: a character for an unknown
 character)
In-Reply-To: <CAKQz=Z5GzTGgfwU9b3_FiSBxM9zNJSHv-7j8v__cU-eUpMX+jw@mail.gmail.com>
References: <CAKQz=Z5GzTGgfwU9b3_FiSBxM9zNJSHv-7j8v__cU-eUpMX+jw@mail.gmail.com>
Message-ID: <20161221181519.168145hfwn4nqlyf@mail.mimuw.edu.pl>

Quote/Cytat - David Corbett <corbett.dav at husky.neu.edu> (Wed 21 Dec  
2016 05:56:27 PM CET):

> Couldn?t you use U+1D52 MODIFIER LETTER SMALL O?

In our corpus COMBINING LATIN SMALL LETTER O sometimes occurs in its  
combining function, it seemed more elegant to use a uniform encoding.  
But you are right, in the example quoted MODIFIER LETTER SMALL O could  
be also used.

Regards

Janusz

> (I changed the subject line because the invisible letter proposal is not
> relevant to the question about a lacuna character.)
>
>> I strongly support this. In our historical corpus of Polish
>>
>> http://korpusy.klf.uw.edu.pl/en/IMPACT_GT_2/
>>
>> we have in particular words ending with 'COMBINING LATIN SMALL LETTER
>> O' (U+0366).
>>
>> We had to precede the character with NBSP as the vase, but to preserve
>> the correct segmentation into words we had to treat NBSP as a letter.
>


-- 
Prof. dr hab. Janusz S. Bie? -  Uniwersytet Warszawski (Katedra  
Lingwistyki Formalnej)
Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/


From mark at macchiato.com  Wed Dec 21 15:20:05 2016
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Wed, 21 Dec 2016 22:20:05 +0100
Subject: Emoji Packs
In-Reply-To: <CANDtJjhhZ78syTyo-oedqBKvNZR9qqSWx3C1DnsysE=iYZW_Hw@mail.gmail.com>
References: <CAPnwY4Q7LgeJf9MMQcqjWz8GuL=+a0wPQS0N4HnbetEFCw99dQ@mail.gmail.com>
 <D2147B9C-6275-4A56-BCED-6CB9B0C3D6F4@crissov.de>
 <CAPnwY4S9O1yCR-WTBZBZpceqxfb07WywX8ZPectfZtZXUZes6g@mail.gmail.com>
 <CANDtJjhf6ZKXZD=fZCKy_zFD+JjKaiuYKv+rG0nQXcDw_OTH5Q@mail.gmail.com>
 <CAPnwY4Qh+-MvesJRXS31iPvgS6i5ub5eNiL39mQffzuBuOUenw@mail.gmail.com>
 <CANDtJjhhZ78syTyo-oedqBKvNZR9qqSWx3C1DnsysE=iYZW_Hw@mail.gmail.com>
Message-ID: <CAJ2xs_HZfpuo1MVfCRYFdJYXWxtU=EsEZrh7oo43HGXq8TxYCQ@mail.gmail.com>

Please consult the line on the charts: "For information about the images
used in these charts, see Emoji Images and Rights
<http://unicode.org/emoji/images.html>."

Mark

On Wed, Dec 21, 2016 at 6:07 PM, Rebecca T <637275 at gmail.com> wrote:

> Okay, I threw something together.
>
> github.com/9999years/emoji has all 18,615 images from the charts, and the
> generating script is there as process.py
> <https://github.com/9999years/emoji/blob/master/process.py> as well!
>
> All the images are thrown together in one directory, but if there?s a
> better way to organize them, please let me know!
>
> On Wed, Dec 21, 2016 at 10:28 AM, Chris Monteleone <cjm2265 at gmail.com>
> wrote:
>
>> "Unfamiliar" would be an understatement. If you feel like putting that
>> together it would be appreciated, but no pressure.
>>
>> Thank you!
>>
>> On Tue, Dec 20, 2016 at 11:01 PM, Rebecca T <637275 at gmail.com> wrote:
>>
>>> The charts include the images as inline base64, right?Parsing them out
>>> with Python might not be a bad idea, depending on how well-organized the
>>> HTML is. If you?re unfamiliar with it, I might be able to throw together a
>>> quick script later.
>>>
>>> On Tue, Dec 20, 2016 at 10:59 PM Chris Monteleone <cjm2265 at gmail.com>
>>> wrote:
>>>
>>>> Sir, you are a scholar and a gentleman. Your swift actions of charity
>>>> are much appreciated.
>>>>
>>>> Now if you happen to know where I can find the same thing for Samsung,
>>>> Facebook, and Windows that would be everything I need.
>>>>
>>>> Thanks!
>>>> Chris
>>>>
>>>> PS
>>>> I have spent a fair amount of time looking for these, I'm not just
>>>> delegating my tedious work to the internets.
>>>>
>>>> On Tue, Dec 20, 2016 at 8:09 PM, Christoph P?per <
>>>> christoph.paeper at crissov.de> wrote:
>>>>
>>>> Chris Monteleone <cjm2265 at gmail.com>:
>>>>
>>>>
>>>> >
>>>>
>>>>
>>>> > I would like to download emoji from every available vendor into a
>>>> well organized folder with the code points as file names.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> I assume you?re looking for <https://github.com/iamcal/emoji-data/>.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161221/8ff37ed6/attachment.html>

From richard.wordingham at ntlworld.com  Wed Dec 21 17:49:09 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 21 Dec 2016 23:49:09 +0000
Subject: a character for an unknown character
In-Reply-To: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
Message-ID: <20161221234909.09394f15@JRWUBU2>

On Wed, 21 Dec 2016 02:29:59 +0000
Martin Mueller <martinmueller at northwestern.edu> wrote:

> I?m new to this list. Please excuse my technical incompetence.
> Is there a Unicode character that says ?I represent an alphanumerical
> character, but I don?t know which?.  This is a very common problem in
> the transcription of historical texts where you have lacunas. Often,
> the extent of the lacuna is known, and the alphabet is known as well.

U+FFFD REPLACEMENT CHARACTER says that one can't represent (or
interpret) the specified character as Unicode.

U+3013 GETA MARK says the character isn't encoded, and I suspect
implies not being available as a usable PUA character.

U+0359 COMBINING ASTERISK BELOW can mean that we have to take someone's
word for what the character is - he claimed he knew, but we can't see
the evidence.  (That is the meaning given when the character was
added, but it can have other meanings - I've seen a Thai dictionary
use it as a nukta.) 

The concept here is 'no-one in communication knows for sure what the
character is'.  The usual notation for this is diagonal shading, for
which CSS mark-up repeating-linear-gradient is now available.
Graphically, the best character, which may not be considered completely
appropriate, is

U+26C6 RAIN

Having a general class of symbol_other, just like U+3013, it should
have the appropriate Unicode properties.  I'm just not sure that one
can justify it as 'something washed the character out' -:)  Script
should only matter if there is a known combining character, in which
case we are heading for the territory of partial damage marks, which
generally feel like mark-up.

If we add a bespoke character, it might belong in a punctuation block,
just as u+3013 does.  It represents a gap, like SPACE, but this time,
generally a hole in the medium of the text.

Richard.


From gwalla at gmail.com  Wed Dec 21 18:00:05 2016
From: gwalla at gmail.com (Garth Wallace)
Date: Wed, 21 Dec 2016 16:00:05 -0800
Subject: a character for an unknown character
In-Reply-To: <20161221234909.09394f15@JRWUBU2>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <20161221234909.09394f15@JRWUBU2>
Message-ID: <CA+p4_H2Fj7_25v=ShRa76pUbGmOf+Gtk0gd5VPp7VHFsz5nYkg@mail.gmail.com>

I think CYFI has characters in the PUA for "lost sign" and "damaged sign".
Both are shaded squares using different patterns.

On Wed, Dec 21, 2016 at 3:49 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> On Wed, 21 Dec 2016 02:29:59 +0000
> Martin Mueller <martinmueller at northwestern.edu> wrote:
>
> > I?m new to this list. Please excuse my technical incompetence.
> > Is there a Unicode character that says ?I represent an alphanumerical
> > character, but I don?t know which?.  This is a very common problem in
> > the transcription of historical texts where you have lacunas. Often,
> > the extent of the lacuna is known, and the alphabet is known as well.
>
> U+FFFD REPLACEMENT CHARACTER says that one can't represent (or
> interpret) the specified character as Unicode.
>
> U+3013 GETA MARK says the character isn't encoded, and I suspect
> implies not being available as a usable PUA character.
>
> U+0359 COMBINING ASTERISK BELOW can mean that we have to take someone's
> word for what the character is - he claimed he knew, but we can't see
> the evidence.  (That is the meaning given when the character was
> added, but it can have other meanings - I've seen a Thai dictionary
> use it as a nukta.)
>
> The concept here is 'no-one in communication knows for sure what the
> character is'.  The usual notation for this is diagonal shading, for
> which CSS mark-up repeating-linear-gradient is now available.
> Graphically, the best character, which may not be considered completely
> appropriate, is
>
> U+26C6 RAIN
>
> Having a general class of symbol_other, just like U+3013, it should
> have the appropriate Unicode properties.  I'm just not sure that one
> can justify it as 'something washed the character out' -:)  Script
> should only matter if there is a known combining character, in which
> case we are heading for the territory of partial damage marks, which
> generally feel like mark-up.
>
> If we add a bespoke character, it might belong in a punctuation block,
> just as u+3013 does.  It represents a gap, like SPACE, but this time,
> generally a hole in the medium of the text.
>
> Richard.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161221/be432a9d/attachment.html>

From manish at mozilla.com  Wed Dec 21 17:24:21 2016
From: manish at mozilla.com (Manish Goregaokar)
Date: Wed, 21 Dec 2016 15:24:21 -0800
Subject: UAX #29: Ambiguities in WB4, and contributing back testcases
Message-ID: <CAFOnWk=he2eNjuq+ugx9BZ_xtoPdnqFbi6kZXRwn=JbVdQDVEA@mail.gmail.com>

Hi,

We've been implementing[1] the Unicode 9 version of UAX #29[2] in
Rust, and came across some ambiguities and issues.

One issue is that the tests[3] are a bit lacking. They don't handle
cases with multiple flag emoji, for example (the handling of which
changed since Unicode 8). We have a couple testcases[4][5] for these
things (and may create more), is there any way to contribute these
back?


Aside from that, WB4's[6] greediness is underspecified. In previous
versions, the rule was

> X (Extend | Format)* ? X

which means that you can "collapse" proceeding extend/format
characters into a character itself, without changing the state you're
in. This would just work because Extend/Format characters only appear
in this rule.

However, now the rule is

> X (Extend | Format | ZWJ)* ? X

The problem here is that ZWJ appears in the previous rule as well, WB3c[7]:

> ZWJ ? (Glue_After_Zwj | EBG)

which says that we should not break between a ZWJ and a GAZ ("Glue
After ZWJ") character.

WB3c has precedence over WB4, which means that a sequence like
`Emoji_Base ZWJ EBG` becomes `Emoji_Base ZWJ?EBG` *first*, before the
ZWJ is collapsed into the Emoji_Base. This is fine.

However, more complicated sequences depend on the greediness of the
Kleene star in WB4. For example, take the sequence `Emoji_Base Extend
ZWJ Extend EBG`. WB3c does not apply here. However, WB4 can apply
since we have a Extend/ZWJ sequence.

WB4 can apply in multiple ways. If it is applied greedily, we get
`Emoji_Base(..) EBG` (where ellipses are used to denote WB4-collapsed
characters). This does break since you don't break between Emoji_Base
and EBG.

However, we can apply it conservatively instead. We can get
`Emoji_Base(..) ZWJ(..) EBG`, which does satisfy WB3c, and doesn't
collapse.

These aren't really sequences that will occur in practice (I think?),
but I think it's important that implementations don't differ in their
behavior and segment things differently. If we don't actually care
about this, I think this ambiguity should at least be called out
explicitly in the spec.

WB4 makes the word break steps loop in on themselves. Previously you
just had to pattern match each interval between code points with the
rules in order, which can be done in any order and produce the same
result. Now that there's a replacement rule which changes the
structure of the string, the algorithm is suddenly dependent on the
order and fashion in which WB4 is applied.


Could this be clarified?

Thanks,

-Manish Goregaokar

 [1]: https://github.com/unicode-rs/unicode-segmentation/pull/10
 [2]: http://www.unicode.org/reports/tr29/ (permalink:
http://www.unicode.org/reports/tr29/tr29-29.html)
 [3]: http://www.unicode.org/Public/UNIDATA/auxiliary/WordBreakTest.txt
 [4]: https://github.com/unicode-rs/unicode-segmentation/blob/8bac7c72ddd70426acfe1e58545cdd1694c61d88/src/test.rs#L94
 [5]: https://github.com/unicode-rs/unicode-segmentation/blob/8bac7c72ddd70426acfe1e58545cdd1694c61d88/src/test.rs#L19
 [6]: http://www.unicode.org/reports/tr29/#WB4
 [7]: http://www.unicode.org/reports/tr29/#WB3c

 -Manish


From cjm2265 at gmail.com  Wed Dec 21 21:20:47 2016
From: cjm2265 at gmail.com (Chris Monteleone)
Date: Wed, 21 Dec 2016 22:20:47 -0500
Subject: Emoji Packs
In-Reply-To: <CANDtJjhhZ78syTyo-oedqBKvNZR9qqSWx3C1DnsysE=iYZW_Hw@mail.gmail.com>
References: <CAPnwY4Q7LgeJf9MMQcqjWz8GuL=+a0wPQS0N4HnbetEFCw99dQ@mail.gmail.com>
 <D2147B9C-6275-4A56-BCED-6CB9B0C3D6F4@crissov.de>
 <CAPnwY4S9O1yCR-WTBZBZpceqxfb07WywX8ZPectfZtZXUZes6g@mail.gmail.com>
 <CANDtJjhf6ZKXZD=fZCKy_zFD+JjKaiuYKv+rG0nQXcDw_OTH5Q@mail.gmail.com>
 <CAPnwY4Qh+-MvesJRXS31iPvgS6i5ub5eNiL39mQffzuBuOUenw@mail.gmail.com>
 <CANDtJjhhZ78syTyo-oedqBKvNZR9qqSWx3C1DnsysE=iYZW_Hw@mail.gmail.com>
Message-ID: <CAPnwY4QB77dKO7WUidoWKTA5mejebwaRSB8yVbCF6o65GbtXTQ@mail.gmail.com>

Thank you so much Rebecca, this is really above and beyond.

I'm messing around with the script a bit to control how it's all
organized/named. So here's the dumb question: How do I run the script to
get it to pull the images from the website?

First of all, when I downloaded everything from github I didn't get the
data.html, I got '.gitignore'. Is the index page you mentioned found at
http://unicode.org/emoji/charts/index.html? or is it one of those pages
that lists all of the objects on a website?

Once I have that, do I just run process.py?

I'm so sorry for being dumb, but thanks again!

Chris

On Wed, Dec 21, 2016 at 12:07 PM, Rebecca T <637275 at gmail.com> wrote:

> Okay, I threw something together.
>
> github.com/9999years/emoji has all 18,615 images from the charts, and the
> generating script is there as process.py
> <https://github.com/9999years/emoji/blob/master/process.py> as well!
>
> All the images are thrown together in one directory, but if there?s a
> better way to organize them, please let me know!
>
> On Wed, Dec 21, 2016 at 10:28 AM, Chris Monteleone <cjm2265 at gmail.com>
> wrote:
>
>> "Unfamiliar" would be an understatement. If you feel like putting that
>> together it would be appreciated, but no pressure.
>>
>> Thank you!
>>
>> On Tue, Dec 20, 2016 at 11:01 PM, Rebecca T <637275 at gmail.com> wrote:
>>
>>> The charts include the images as inline base64, right?Parsing them out
>>> with Python might not be a bad idea, depending on how well-organized the
>>> HTML is. If you?re unfamiliar with it, I might be able to throw together a
>>> quick script later.
>>>
>>> On Tue, Dec 20, 2016 at 10:59 PM Chris Monteleone <cjm2265 at gmail.com>
>>> wrote:
>>>
>>>> Sir, you are a scholar and a gentleman. Your swift actions of charity
>>>> are much appreciated.
>>>>
>>>> Now if you happen to know where I can find the same thing for Samsung,
>>>> Facebook, and Windows that would be everything I need.
>>>>
>>>> Thanks!
>>>> Chris
>>>>
>>>> PS
>>>> I have spent a fair amount of time looking for these, I'm not just
>>>> delegating my tedious work to the internets.
>>>>
>>>> On Tue, Dec 20, 2016 at 8:09 PM, Christoph P?per <
>>>> christoph.paeper at crissov.de> wrote:
>>>>
>>>> Chris Monteleone <cjm2265 at gmail.com>:
>>>>
>>>>
>>>> >
>>>>
>>>>
>>>> > I would like to download emoji from every available vendor into a
>>>> well organized folder with the code points as file names.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> I assume you?re looking for <https://github.com/iamcal/emoji-data/>.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161221/6cc22677/attachment.html>

From cjm2265 at gmail.com  Wed Dec 21 21:21:07 2016
From: cjm2265 at gmail.com (Chris Monteleone)
Date: Wed, 21 Dec 2016 22:21:07 -0500
Subject: Emoji Packs
In-Reply-To: <CAJ2xs_HZfpuo1MVfCRYFdJYXWxtU=EsEZrh7oo43HGXq8TxYCQ@mail.gmail.com>
References: <CAPnwY4Q7LgeJf9MMQcqjWz8GuL=+a0wPQS0N4HnbetEFCw99dQ@mail.gmail.com>
 <D2147B9C-6275-4A56-BCED-6CB9B0C3D6F4@crissov.de>
 <CAPnwY4S9O1yCR-WTBZBZpceqxfb07WywX8ZPectfZtZXUZes6g@mail.gmail.com>
 <CANDtJjhf6ZKXZD=fZCKy_zFD+JjKaiuYKv+rG0nQXcDw_OTH5Q@mail.gmail.com>
 <CAPnwY4Qh+-MvesJRXS31iPvgS6i5ub5eNiL39mQffzuBuOUenw@mail.gmail.com>
 <CANDtJjhhZ78syTyo-oedqBKvNZR9qqSWx3C1DnsysE=iYZW_Hw@mail.gmail.com>
 <CAJ2xs_HZfpuo1MVfCRYFdJYXWxtU=EsEZrh7oo43HGXq8TxYCQ@mail.gmail.com>
Message-ID: <CAPnwY4QoNZto1F7Jw5WSdVDG_zMqZrzO-5jzWNXxfGR7G7M0WA@mail.gmail.com>

Thank you, Mark!

On Wed, Dec 21, 2016 at 4:20 PM, Mark Davis ?? <mark at macchiato.com> wrote:

> Please consult the line on the charts: "For information about the images
> used in these charts, see Emoji Images and Rights
> <http://unicode.org/emoji/images.html>."
>
> Mark
>
> On Wed, Dec 21, 2016 at 6:07 PM, Rebecca T <637275 at gmail.com> wrote:
>
>> Okay, I threw something together.
>>
>> github.com/9999years/emoji has all 18,615 images from the charts, and
>> the generating script is there as process.py
>> <https://github.com/9999years/emoji/blob/master/process.py> as well!
>>
>> All the images are thrown together in one directory, but if there?s a
>> better way to organize them, please let me know!
>>
>> On Wed, Dec 21, 2016 at 10:28 AM, Chris Monteleone <cjm2265 at gmail.com>
>> wrote:
>>
>>> "Unfamiliar" would be an understatement. If you feel like putting that
>>> together it would be appreciated, but no pressure.
>>>
>>> Thank you!
>>>
>>> On Tue, Dec 20, 2016 at 11:01 PM, Rebecca T <637275 at gmail.com> wrote:
>>>
>>>> The charts include the images as inline base64, right?Parsing them out
>>>> with Python might not be a bad idea, depending on how well-organized the
>>>> HTML is. If you?re unfamiliar with it, I might be able to throw together a
>>>> quick script later.
>>>>
>>>> On Tue, Dec 20, 2016 at 10:59 PM Chris Monteleone <cjm2265 at gmail.com>
>>>> wrote:
>>>>
>>>>> Sir, you are a scholar and a gentleman. Your swift actions of charity
>>>>> are much appreciated.
>>>>>
>>>>> Now if you happen to know where I can find the same thing for Samsung,
>>>>> Facebook, and Windows that would be everything I need.
>>>>>
>>>>> Thanks!
>>>>> Chris
>>>>>
>>>>> PS
>>>>> I have spent a fair amount of time looking for these, I'm not just
>>>>> delegating my tedious work to the internets.
>>>>>
>>>>> On Tue, Dec 20, 2016 at 8:09 PM, Christoph P?per <
>>>>> christoph.paeper at crissov.de> wrote:
>>>>>
>>>>> Chris Monteleone <cjm2265 at gmail.com>:
>>>>>
>>>>>
>>>>> >
>>>>>
>>>>>
>>>>> > I would like to download emoji from every available vendor into a
>>>>> well organized folder with the code points as file names.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I assume you?re looking for <https://github.com/iamcal/emoji-data/>.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161221/4cba7f35/attachment.html>

From 637275 at gmail.com  Thu Dec 22 01:43:37 2016
From: 637275 at gmail.com (Rebecca T)
Date: Thu, 22 Dec 2016 02:43:37 -0500
Subject: Emoji Packs
In-Reply-To: <CAPnwY4QB77dKO7WUidoWKTA5mejebwaRSB8yVbCF6o65GbtXTQ@mail.gmail.com>
References: <CAPnwY4Q7LgeJf9MMQcqjWz8GuL=+a0wPQS0N4HnbetEFCw99dQ@mail.gmail.com>
 <D2147B9C-6275-4A56-BCED-6CB9B0C3D6F4@crissov.de>
 <CAPnwY4S9O1yCR-WTBZBZpceqxfb07WywX8ZPectfZtZXUZes6g@mail.gmail.com>
 <CANDtJjhf6ZKXZD=fZCKy_zFD+JjKaiuYKv+rG0nQXcDw_OTH5Q@mail.gmail.com>
 <CAPnwY4Qh+-MvesJRXS31iPvgS6i5ub5eNiL39mQffzuBuOUenw@mail.gmail.com>
 <CANDtJjhhZ78syTyo-oedqBKvNZR9qqSWx3C1DnsysE=iYZW_Hw@mail.gmail.com>
 <CAPnwY4QB77dKO7WUidoWKTA5mejebwaRSB8yVbCF6o65GbtXTQ@mail.gmail.com>
Message-ID: <CANDtJjhwiRv4N7VC9ZBSJZ42mn=FUMcGjCz+nr66F-4UNHN+7w@mail.gmail.com>

Yes, by running process.py in a directory containing
http://unicode.org/emoji/charts/full-emoji-list.html (I re-commited and
renamed data.html to full-emoji-list.html for clarity), the script will
generate images in an /img/ sub-directory. (Be careful that /img/ already
exists! Strange things might go wrong if it doesn?t.)

Also note that I?m running Python 3.5 on Windows ? I?m fairly confident the
differences between platforms is fairly minor, but a certain degree of
zaniness may

So yes: the directory should look like (at a minimum):

.../repository-directory     (DIRECTORY)
    ??? full-emoji-list.html (FILE)
    ??? process.py           (FILE)
    ??? /img                 (DIRECTORY)
         ?
         ??? ...             (OUTPUT FILES)

I hope that?s clear enough! Tell me if any of that doesn?t make sense.


On Wed, Dec 21, 2016 at 10:20 PM, Chris Monteleone <cjm2265 at gmail.com>
wrote:

> Thank you so much Rebecca, this is really above and beyond.
>
> I'm messing around with the script a bit to control how it's all
> organized/named. So here's the dumb question: How do I run the script to
> get it to pull the images from the website?
>
> First of all, when I downloaded everything from github I didn't get the
> data.html, I got '.gitignore'. Is the index page you mentioned found at
> http://unicode.org/emoji/charts/index.html? or is it one of those pages
> that lists all of the objects on a website?
>
> Once I have that, do I just run process.py?
>
> I'm so sorry for being dumb, but thanks again!
>
> Chris
>
> On Wed, Dec 21, 2016 at 12:07 PM, Rebecca T <637275 at gmail.com> wrote:
>
>> Okay, I threw something together.
>>
>> github.com/9999years/emoji has all 18,615 images from the charts, and
>> the generating script is there as process.py
>> <https://github.com/9999years/emoji/blob/master/process.py> as well!
>>
>> All the images are thrown together in one directory, but if there?s a
>> better way to organize them, please let me know!
>>
>> On Wed, Dec 21, 2016 at 10:28 AM, Chris Monteleone <cjm2265 at gmail.com>
>> wrote:
>>
>>> "Unfamiliar" would be an understatement. If you feel like putting that
>>> together it would be appreciated, but no pressure.
>>>
>>> Thank you!
>>>
>>> On Tue, Dec 20, 2016 at 11:01 PM, Rebecca T <637275 at gmail.com> wrote:
>>>
>>>> The charts include the images as inline base64, right?Parsing them out
>>>> with Python might not be a bad idea, depending on how well-organized the
>>>> HTML is. If you?re unfamiliar with it, I might be able to throw together a
>>>> quick script later.
>>>>
>>>> On Tue, Dec 20, 2016 at 10:59 PM Chris Monteleone <cjm2265 at gmail.com>
>>>> wrote:
>>>>
>>>>> Sir, you are a scholar and a gentleman. Your swift actions of charity
>>>>> are much appreciated.
>>>>>
>>>>> Now if you happen to know where I can find the same thing for Samsung,
>>>>> Facebook, and Windows that would be everything I need.
>>>>>
>>>>> Thanks!
>>>>> Chris
>>>>>
>>>>> PS
>>>>> I have spent a fair amount of time looking for these, I'm not just
>>>>> delegating my tedious work to the internets.
>>>>>
>>>>> On Tue, Dec 20, 2016 at 8:09 PM, Christoph P?per <
>>>>> christoph.paeper at crissov.de> wrote:
>>>>>
>>>>> Chris Monteleone <cjm2265 at gmail.com>:
>>>>>
>>>>>
>>>>> >
>>>>>
>>>>>
>>>>> > I would like to download emoji from every available vendor into a
>>>>> well organized folder with the code points as file names.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I assume you?re looking for <https://github.com/iamcal/emoji-data/>.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161222/a52d012d/attachment.html>

From wjgo_10009 at btinternet.com  Thu Dec 22 06:03:51 2016
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 22 Dec 2016 12:03:51 +0000 (GMT)
Subject: a character for an unknown character
In-Reply-To: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
Message-ID: <16243692.18677.1482408231135.JavaMail.defaultUser@defaultHost>

Martin Mueller wrote:

> Is there a Unicode character that says ?I represent an alphanumerical character, but I don?t know which?.  This is a very common problem in the transcription of historical texts where you have lacunas. 

I have been reading this thread with interest.

I have produced nine designs for glyphs.

If you so choose, you can assign specific meanings to one, some, or all of them. If you need more than nine designs please say.

Please find attached nine .png files, one glyph design in each file.

The size of each of the images and the names of the files follow the following specification.

http://www.unicode.org/emoji/selection.html#images

However the images are not congruently in accordance with those rules as there is a one pixel width transparent surround as the designs were made using filled rectangles upon a theoretical seven row by seven column arrangement of blocks, each block ten pixels by ten pixels. I used the Serif PagePlus X7 desktop publishing program.

The characters are not intended as emoji, I just applied the above specification as it is convenient to make the designs compatible with that specification as far as possible.

I have assigned Private Use Area code points of U+EA60 through to U+EA68 to the glyphs. The specific code point for each glyph is indicated in the file name of the image of that glyph.

I have chosen those code points as the Alt codes for U+EA60 through to U+EA68 are Alt 60000 through to Alt 60008 respectively. My thinking being that if the designs are implemented in fonts that those easy to remember Alt codes might be helpful to someone using the Microsoft WordPad program.

I checked that those code points are not being used in the Medieval Unicode Font Initiative.

http://skaldic.abdn.ac.uk/db.php?cp=EA&if=mufi&table=mufi_char

Readers who so choose are welcome to implement these glyphs in fonts.

The http://www.unicode.org/emoji/selection.html#images specification mentions licensing. For the avoidance of doubt these designs are free to share and use.

A Private Use Area solution is not ideal, yet may be helpful in getting things started and could be helpful in establishing usage, which could help in getting the characters implemented into regular Unicode.

I am attaching the images to this email. The nature of the email system is that the order of the images might not be in the order of the code points, yet each image has an indication of the code point within its name so that information should help to resolve any such problem in the transmission of the email attachments.

William Overington

Thursday 22 December 2016

-------------- next part --------------
A non-text attachment was scrubbed...
Name: transcribe_ea60.png
Type: image/png
Size: 3078 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20161222/28f2ed90/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: transcribe_ea61.png
Type: image/png
Size: 3081 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20161222/28f2ed90/attachment-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: transcribe_ea62.png
Type: image/png
Size: 3099 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20161222/28f2ed90/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: transcribe_ea63.png
Type: image/png
Size: 3053 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20161222/28f2ed90/attachment-0003.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: transcribe_ea64.png
Type: image/png
Size: 3088 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20161222/28f2ed90/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: transcribe_ea65.png
Type: image/png
Size: 3094 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20161222/28f2ed90/attachment-0005.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: transcribe_ea66.png
Type: image/png
Size: 3089 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20161222/28f2ed90/attachment-0006.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: transcribe_ea67.png
Type: image/png
Size: 3091 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20161222/28f2ed90/attachment-0007.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: transcribe_ea68.png
Type: image/png
Size: 3058 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20161222/28f2ed90/attachment-0008.png>

From manish at mozilla.com  Thu Dec 22 12:35:55 2016
From: manish at mozilla.com (Manish Goregaokar)
Date: Thu, 22 Dec 2016 10:35:55 -0800
Subject: Another UAX #29 bug: property tables need updating
Message-ID: <CAFOnWkkta+dprjhOtkxA35p=cxOSA2W4U+nO1r77hBWJp_UyxQ@mail.gmail.com>

The spec lists GraphemeBreakProperty.txt[1] and
WordBreakProperty.txt[2] as the normative source for grapheme and word
categorization respectively.

However, the spec also gives non-normative definitions of these
properties. In particular, it defines Glue_After_Zwj[3] as

> Emoji characters that do not break from a previous ZWJ in a defined emoji zwj sequence, and are not listed as Emoji_Modifier_Base=Yes in emoji-data.txt. See [UTR51].

Going through emoji-zwj-sequences.txt[4], there are a lot of emoji
characters that satisfy this property. The kiss/heart emojis are like
this, as well as every object emoji in the "Gendered Role, with
object" section. However, we only count the kiss, heart, and speech
bubble emoji as GAZ in the property table.

The property table should include all role and gender modifiers as GAZ.

Could this be updated?

 [1]: http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt
 [2]: http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakProperty.txt
 [3]:http://www.unicode.org/reports/tr29/proposed.html#Glue_After_Zwj
 [4]: http://unicode.org/Public/emoji/4.0/emoji-zwj-sequences.txt

Thanks,
-Manish

From cjm2265 at gmail.com  Thu Dec 22 12:39:31 2016
From: cjm2265 at gmail.com (Chris Monteleone)
Date: Thu, 22 Dec 2016 13:39:31 -0500
Subject: Emoji Packs
In-Reply-To: <CANDtJjhwiRv4N7VC9ZBSJZ42mn=FUMcGjCz+nr66F-4UNHN+7w@mail.gmail.com>
References: <CAPnwY4Q7LgeJf9MMQcqjWz8GuL=+a0wPQS0N4HnbetEFCw99dQ@mail.gmail.com>
 <D2147B9C-6275-4A56-BCED-6CB9B0C3D6F4@crissov.de>
 <CAPnwY4S9O1yCR-WTBZBZpceqxfb07WywX8ZPectfZtZXUZes6g@mail.gmail.com>
 <CANDtJjhf6ZKXZD=fZCKy_zFD+JjKaiuYKv+rG0nQXcDw_OTH5Q@mail.gmail.com>
 <CAPnwY4Qh+-MvesJRXS31iPvgS6i5ub5eNiL39mQffzuBuOUenw@mail.gmail.com>
 <CANDtJjhhZ78syTyo-oedqBKvNZR9qqSWx3C1DnsysE=iYZW_Hw@mail.gmail.com>
 <CAPnwY4QB77dKO7WUidoWKTA5mejebwaRSB8yVbCF6o65GbtXTQ@mail.gmail.com>
 <CANDtJjhwiRv4N7VC9ZBSJZ42mn=FUMcGjCz+nr66F-4UNHN+7w@mail.gmail.com>
Message-ID: <CAPnwY4R8xLJhaSy8O-tSE+TTXN2rFuX=9-nxMn-C2hNZ0OmpZg@mail.gmail.com>

I got it working, thank you so much! Happy holidays!

On Thu, Dec 22, 2016 at 2:43 AM, Rebecca T <637275 at gmail.com> wrote:

> Yes, by running process.py in a directory containing http://unicode.org/
> emoji/charts/full-emoji-list.html (I re-commited and renamed data.html to
> full-emoji-list.html for clarity), the script will generate images in an
> /img/ sub-directory. (Be careful that /img/ already exists! Strange things
> might go wrong if it doesn?t.)
>
> Also note that I?m running Python 3.5 on Windows ? I?m fairly confident
> the differences between platforms is fairly minor, but a certain degree of
> zaniness may
>
> So yes: the directory should look like (at a minimum):
>
> .../repository-directory     (DIRECTORY)
>     ??? full-emoji-list.html (FILE)
>     ??? process.py           (FILE)
>     ??? /img                 (DIRECTORY)
>          ?
>          ??? ...             (OUTPUT FILES)
>
> I hope that?s clear enough! Tell me if any of that doesn?t make sense.
>
>
>
> On Wed, Dec 21, 2016 at 10:20 PM, Chris Monteleone <cjm2265 at gmail.com>
> wrote:
>
>> Thank you so much Rebecca, this is really above and beyond.
>>
>> I'm messing around with the script a bit to control how it's all
>> organized/named. So here's the dumb question: How do I run the script to
>> get it to pull the images from the website?
>>
>> First of all, when I downloaded everything from github I didn't get the
>> data.html, I got '.gitignore'. Is the index page you mentioned found at
>> http://unicode.org/emoji/charts/index.html? or is it one of those pages
>> that lists all of the objects on a website?
>>
>> Once I have that, do I just run process.py?
>>
>> I'm so sorry for being dumb, but thanks again!
>>
>> Chris
>>
>> On Wed, Dec 21, 2016 at 12:07 PM, Rebecca T <637275 at gmail.com> wrote:
>>
>>> Okay, I threw something together.
>>>
>>> github.com/9999years/emoji has all 18,615 images from the charts, and
>>> the generating script is there as process.py
>>> <https://github.com/9999years/emoji/blob/master/process.py> as well!
>>>
>>> All the images are thrown together in one directory, but if there?s a
>>> better way to organize them, please let me know!
>>>
>>> On Wed, Dec 21, 2016 at 10:28 AM, Chris Monteleone <cjm2265 at gmail.com>
>>> wrote:
>>>
>>>> "Unfamiliar" would be an understatement. If you feel like putting that
>>>> together it would be appreciated, but no pressure.
>>>>
>>>> Thank you!
>>>>
>>>> On Tue, Dec 20, 2016 at 11:01 PM, Rebecca T <637275 at gmail.com> wrote:
>>>>
>>>>> The charts include the images as inline base64, right?Parsing them out
>>>>> with Python might not be a bad idea, depending on how well-organized the
>>>>> HTML is. If you?re unfamiliar with it, I might be able to throw together a
>>>>> quick script later.
>>>>>
>>>>> On Tue, Dec 20, 2016 at 10:59 PM Chris Monteleone <cjm2265 at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Sir, you are a scholar and a gentleman. Your swift actions of charity
>>>>>> are much appreciated.
>>>>>>
>>>>>> Now if you happen to know where I can find the same thing for
>>>>>> Samsung, Facebook, and Windows that would be everything I need.
>>>>>>
>>>>>> Thanks!
>>>>>> Chris
>>>>>>
>>>>>> PS
>>>>>> I have spent a fair amount of time looking for these, I'm not just
>>>>>> delegating my tedious work to the internets.
>>>>>>
>>>>>> On Tue, Dec 20, 2016 at 8:09 PM, Christoph P?per <
>>>>>> christoph.paeper at crissov.de> wrote:
>>>>>>
>>>>>> Chris Monteleone <cjm2265 at gmail.com>:
>>>>>>
>>>>>>
>>>>>> >
>>>>>>
>>>>>>
>>>>>> > I would like to download emoji from every available vendor into a
>>>>>> well organized folder with the code points as file names.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> I assume you?re looking for <https://github.com/iamcal/emoji-data/>.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161222/2e7addd6/attachment.html>

From kenwhistler at att.net  Thu Dec 22 13:16:03 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Thu, 22 Dec 2016 11:16:03 -0800
Subject: Another UAX #29 bug: property tables need updating
In-Reply-To: <CAFOnWkkta+dprjhOtkxA35p=cxOSA2W4U+nO1r77hBWJp_UyxQ@mail.gmail.com>
References: <CAFOnWkkta+dprjhOtkxA35p=cxOSA2W4U+nO1r77hBWJp_UyxQ@mail.gmail.com>
Message-ID: <88485348-f439-aacb-9f30-f4cf9bdf171c@att.net>

Manish,


On 12/22/2016 10:35 AM, Manish Goregaokar wrote:
> The property table should include all role and gender modifiers as GAZ.
>
> Could this be updated?
>

Property values cannot be updated for *published* versions of the 
standard. What you should do is submit your feedback as part of the 
public review for UAX #29 for version 10.0 of the standard. See:

http://www.unicode.org/review/pri341/

If you submit your feedback to UAX #29 (and its associated data files) 
according to the directions there, that will ensure that it gets 
properly considered during the review of UAX #29 at the next UTC meeting 
scheduled at the end of January.

--Ken

P.S. In general, any feedback on property values in the UCD need to be 
handled that way, to make sure they get appropriate consideration by the 
UTC.


From richard.wordingham at ntlworld.com  Thu Dec 22 15:08:22 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 22 Dec 2016 21:08:22 +0000
Subject: UAX #29: Ambiguities in WB4, and contributing back testcases
In-Reply-To: <CAFOnWk=he2eNjuq+ugx9BZ_xtoPdnqFbi6kZXRwn=JbVdQDVEA@mail.gmail.com>
References: <CAFOnWk=he2eNjuq+ugx9BZ_xtoPdnqFbi6kZXRwn=JbVdQDVEA@mail.gmail.com>
Message-ID: <20161222210822.6b07ff41@JRWUBU2>

On Wed, 21 Dec 2016 15:24:21 -0800
Manish Goregaokar <manish at mozilla.com> wrote:


> Aside from that, WB4's[6] greediness is underspecified. In previous
> versions, the rule was
<snip>

> However, now the rule is
> 
> > X (Extend | Format | ZWJ)* ? X  
> 
> The problem here is that ZWJ appears in the previous rule as well,
> WB3c[7]:
> 
> > ZWJ ? (Glue_After_Zwj | EBG)  
> 
> which says that we should not break between a ZWJ and a GAZ ("Glue
> After ZWJ") character.
> 
> WB3c has precedence over WB4, which means that a sequence like
> `Emoji_Base ZWJ EBG` becomes `Emoji_Base ZWJ?EBG` *first*, before the
> ZWJ is collapsed into the Emoji_Base. This is fine.
> 
> However, more complicated sequences depend on the greediness of the
> Kleene star in WB4. For example, take the sequence `Emoji_Base Extend
> ZWJ Extend EBG`. WB3c does not apply here. However, WB4 can apply
> since we have a Extend/ZWJ sequence.
> 
> WB4 can apply in multiple ways. If it is applied greedily, we get
> `Emoji_Base(..) EBG` (where ellipses are used to denote WB4-collapsed
> characters). This does break since you don't break between Emoji_Base
> and EBG.
> 
> However, we can apply it conservatively instead. We can get
> `Emoji_Base(..) ZWJ(..) EBG`, which does satisfy WB3c, and doesn't
> collapse.

>From your terminology, I think you have an error in your transformation
to a 'regular' expression.  Why don't you have the same problem when
you determine word breaks in

CR Extend LF?

I'm guessing that you have some mechanism that makes WB3 (CR ? LF)
redundant.  Rule WB3c does *not* transform to

ZWJ(...) ? (Glue_After_Zwj | EBG) 


Naively, I would say that WB4 can be reapplied to `Emoji_Base(..)
ZWJ(..) EBG`, yielding `Emoji_Base EBG` and thus a word break.

Richard.


From manish at mozilla.com  Thu Dec 22 16:05:18 2016
From: manish at mozilla.com (Manish Goregaokar)
Date: Thu, 22 Dec 2016 14:05:18 -0800
Subject: UAX #29: Ambiguities in WB4, and contributing back testcases
In-Reply-To: <20161222210822.6b07ff41@JRWUBU2>
References: <CAFOnWk=he2eNjuq+ugx9BZ_xtoPdnqFbi6kZXRwn=JbVdQDVEA@mail.gmail.com>
 <20161222210822.6b07ff41@JRWUBU2>
Message-ID: <CAFOnWk=fhjW+yASJoLtOxRHUxj8wgnyTjx=YH8aZiM_6zVF09g@mail.gmail.com>

> Why don't you have the same problem when you determine word breaks in CR Extend LF?

By rule WB4, we don't break between CR and Extend, and treat the
CRxExtend aggregate as CR, and that in turn doesn't break with LF by
WB3.

The rule states that we "treat whatever is on the left side (X
(Format|Extend|ZWJ)*) as if it were whatever is on the right side
(X)".

I guess the confusion is, with ? rules, do we apply them globally, or
only apply them when considering subsequent rules?

I suspect the answer here is that you only apply them in order. The
list of rules is not a list of precedences, but rather a list with the
order in which the rules are applied. So a ? rule means "Treat the
left side as if it were the right side in the context of all
subsequent rules"

Thanks,
-Manish


On Thu, Dec 22, 2016 at 1:08 PM, Richard Wordingham
<richard.wordingham at ntlworld.com> wrote:
> On Wed, 21 Dec 2016 15:24:21 -0800
> Manish Goregaokar <manish at mozilla.com> wrote:
>
>
>> Aside from that, WB4's[6] greediness is underspecified. In previous
>> versions, the rule was
> <snip>
>
>> However, now the rule is
>>
>> > X (Extend | Format | ZWJ)* ? X
>>
>> The problem here is that ZWJ appears in the previous rule as well,
>> WB3c[7]:
>>
>> > ZWJ ? (Glue_After_Zwj | EBG)
>>
>> which says that we should not break between a ZWJ and a GAZ ("Glue
>> After ZWJ") character.
>>
>> WB3c has precedence over WB4, which means that a sequence like
>> `Emoji_Base ZWJ EBG` becomes `Emoji_Base ZWJ?EBG` *first*, before the
>> ZWJ is collapsed into the Emoji_Base. This is fine.
>>
>> However, more complicated sequences depend on the greediness of the
>> Kleene star in WB4. For example, take the sequence `Emoji_Base Extend
>> ZWJ Extend EBG`. WB3c does not apply here. However, WB4 can apply
>> since we have a Extend/ZWJ sequence.
>>
>> WB4 can apply in multiple ways. If it is applied greedily, we get
>> `Emoji_Base(..) EBG` (where ellipses are used to denote WB4-collapsed
>> characters). This does break since you don't break between Emoji_Base
>> and EBG.
>>
>> However, we can apply it conservatively instead. We can get
>> `Emoji_Base(..) ZWJ(..) EBG`, which does satisfy WB3c, and doesn't
>> collapse.
>
> From your terminology, I think you have an error in your transformation
> to a 'regular' expression.  Why don't you have the same problem when
> you determine word breaks in
>
> CR Extend LF?
>
> I'm guessing that you have some mechanism that makes WB3 (CR ? LF)
> redundant.  Rule WB3c does *not* transform to
>
> ZWJ(...) ? (Glue_After_Zwj | EBG)
>
>
> Naively, I would say that WB4 can be reapplied to `Emoji_Base(..)
> ZWJ(..) EBG`, yielding `Emoji_Base EBG` and thus a word break.
>
> Richard.
>


From manish at mozilla.com  Thu Dec 22 16:20:04 2016
From: manish at mozilla.com (Manish Goregaokar)
Date: Thu, 22 Dec 2016 14:20:04 -0800
Subject: Another UAX #29 bug: property tables need updating
In-Reply-To: <88485348-f439-aacb-9f30-f4cf9bdf171c@att.net>
References: <CAFOnWkkta+dprjhOtkxA35p=cxOSA2W4U+nO1r77hBWJp_UyxQ@mail.gmail.com>
 <88485348-f439-aacb-9f30-f4cf9bdf171c@att.net>
Message-ID: <CAFOnWkmwzRgKmv4Zt5-aPD6hS05YuDqbYywPQfYu_qWF+Ax3_g@mail.gmail.com>

Will do, thanks!
-Manish


On Thu, Dec 22, 2016 at 11:16 AM, Ken Whistler <kenwhistler at att.net> wrote:
> Manish,
>
>
> On 12/22/2016 10:35 AM, Manish Goregaokar wrote:
>>
>> The property table should include all role and gender modifiers as GAZ.
>>
>> Could this be updated?
>>
>
> Property values cannot be updated for *published* versions of the standard.
> What you should do is submit your feedback as part of the public review for
> UAX #29 for version 10.0 of the standard. See:
>
> http://www.unicode.org/review/pri341/
>
> If you submit your feedback to UAX #29 (and its associated data files)
> according to the directions there, that will ensure that it gets properly
> considered during the review of UAX #29 at the next UTC meeting scheduled at
> the end of January.
>
> --Ken
>
> P.S. In general, any feedback on property values in the UCD need to be
> handled that way, to make sure they get appropriate consideration by the
> UTC.
>

From richard.wordingham at ntlworld.com  Thu Dec 22 16:58:10 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 22 Dec 2016 22:58:10 +0000
Subject: UAX #29: Ambiguities in WB4, and contributing back testcases
In-Reply-To: <CAFOnWk=fhjW+yASJoLtOxRHUxj8wgnyTjx=YH8aZiM_6zVF09g@mail.gmail.com>
References: <CAFOnWk=he2eNjuq+ugx9BZ_xtoPdnqFbi6kZXRwn=JbVdQDVEA@mail.gmail.com>
 <20161222210822.6b07ff41@JRWUBU2>
 <CAFOnWk=fhjW+yASJoLtOxRHUxj8wgnyTjx=YH8aZiM_6zVF09g@mail.gmail.com>
Message-ID: <20161222225810.73000b59@JRWUBU2>

On Thu, 22 Dec 2016 14:05:18 -0800
Manish Goregaokar <manish at mozilla.com> wrote:

> I guess the confusion is, with ? rules, do we apply them globally, or
> only apply them when considering subsequent rules?

I would say the latter.  The logic is that you apply the whole set of
rules on either side of each character.

> I suspect the answer here is that you only apply them in order. The
> list of rules is not a list of precedences, but rather a list with the
> order in which the rules are applied. So a ? rule means "Treat the
> left side as if it were the right side in the context of all
> subsequent rules"

I would indeed say that you apply them in order.  The relevant example
in the test suite (file auxiliary/WordBreakTest.txt in the UCD) is:

? 000D ? 0308 ? 000A ?

Now, I am not sure if it is possible to automatically turn the rules
into an automatic break iterator based on regular expressions.  The last
time I looked, ICU was doing this by manual conversion.  I would
therefore deduce that such a conversion is impossible, difficult, or
produces highly inefficient code.  ICU has the added complication that
it also needs to invoke real Southeast Asian break iterators.  When I
looked, their interface was not returning appropriate
word-break properties for the characters, but was itself a break
iterator. 

Richard.


From martinmueller at northwestern.edu  Thu Dec 22 17:35:30 2016
From: martinmueller at northwestern.edu (Martin Mueller)
Date: Thu, 22 Dec 2016 23:35:30 +0000
Subject: a character for an unknown character
In-Reply-To: <16243692.18677.1482408231135.JavaMail.defaultUser@defaultHost>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <16243692.18677.1482408231135.JavaMail.defaultUser@defaultHost>
Message-ID: <4AFB4506-093F-4045-9553-AF26D3A79E66@northwestern.edu>

These are very handsome and interesting. But for the purposes of my project, which involves folks here, there, and everywhere working on editorial problems relating to digital transcriptions of Early Modern texts, the cardinal requirement is that the character can be found on and deployed from any Windows, Linux, or OS 10 machin. We have used the black dot (\u25cf) as a kludge. Since it does not occur in the source data, there is no ambiguity. It is relatively easy to produce on a keyboard. From a visual perspective it is preferable to the diamond with a question mark?although that is semantically more obvious. But it is visually very disruptive, and it is much harder to find on a standard character map than the black dot, which is predictably located in geometrical shapes. 

It?s a kludge, but it works, and it looks to me superior to any of the alternatives. But I can be persuaded otherwise. 

With thanks for the help of all of you

MM

On 12/22/16, 6:03 AM, "William_J_G Overington" <wjgo_10009 at btinternet.com> wrote:

    Martin Mueller wrote:
    
    > Is there a Unicode character that says ?I represent an alphanumerical character, but I don?t know which?.  This is a very common problem in the transcription of historical texts where you have lacunas. 
    
    I have been reading this thread with interest.
    
    I have produced nine designs for glyphs.
    
    If you so choose, you can assign specific meanings to one, some, or all of them. If you need more than nine designs please say.
    
    Please find attached nine .png files, one glyph design in each file.
    
    The size of each of the images and the names of the files follow the following specification.
    
    https://urldefense.proofpoint.com/v2/url?u=http-3A__www.unicode.org_emoji_selection.html-23images&d=CwIFaQ&c=yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=rG8zxOdssqSzDRz4x1GLlmLOW60xyVXydxwnJZpkxbk&m=rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=VMzwU8ONTcLHvFcK5hcR9yj5TT3SzYSs-YYB8IGRq_A&e= 
    
    However the images are not congruently in accordance with those rules as there is a one pixel width transparent surround as the designs were made using filled rectangles upon a theoretical seven row by seven column arrangement of blocks, each block ten pixels by ten pixels. I used the Serif PagePlus X7 desktop publishing program.
    
    The characters are not intended as emoji, I just applied the above specification as it is convenient to make the designs compatible with that specification as far as possible.
    
    I have assigned Private Use Area code points of U+EA60 through to U+EA68 to the glyphs. The specific code point for each glyph is indicated in the file name of the image of that glyph.
    
    I have chosen those code points as the Alt codes for U+EA60 through to U+EA68 are Alt 60000 through to Alt 60008 respectively. My thinking being that if the designs are implemented in fonts that those easy to remember Alt codes might be helpful to someone using the Microsoft WordPad program.
    
    I checked that those code points are not being used in the Medieval Unicode Font Initiative.
    
    https://urldefense.proofpoint.com/v2/url?u=http-3A__skaldic.abdn.ac.uk_db.php-3Fcp-3DEA-26if-3Dmufi-26table-3Dmufi-5Fchar&d=CwIFaQ&c=yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=rG8zxOdssqSzDRz4x1GLlmLOW60xyVXydxwnJZpkxbk&m=rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=z5-Sl6Aw2Dr0dYsoZ9xgzqCpXjzoot1TnwUrJKqNHpo&e= 
    
    Readers who so choose are welcome to implement these glyphs in fonts.
    
    The https://urldefense.proofpoint.com/v2/url?u=http-3A__www.unicode.org_emoji_selection.html-23images&d=CwIFaQ&c=yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=rG8zxOdssqSzDRz4x1GLlmLOW60xyVXydxwnJZpkxbk&m=rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=VMzwU8ONTcLHvFcK5hcR9yj5TT3SzYSs-YYB8IGRq_A&e=  specification mentions licensing. For the avoidance of doubt these designs are free to share and use.
    
    A Private Use Area solution is not ideal, yet may be helpful in getting things started and could be helpful in establishing usage, which could help in getting the characters implemented into regular Unicode.
    
    I am attaching the images to this email. The nature of the email system is that the order of the images might not be in the order of the code points, yet each image has an indication of the code point within its name so that information should help to resolve any such problem in the transmission of the email attachments.
    
    William Overington
    
    Thursday 22 December 2016
    
    
From leobro at gmail.com  Thu Dec 22 18:31:47 2016
From: leobro at gmail.com (Leo Broukhis)
Date: Thu, 22 Dec 2016 16:31:47 -0800
Subject: a character for an unknown character
In-Reply-To: <4AFB4506-093F-4045-9553-AF26D3A79E66@northwestern.edu>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <16243692.18677.1482408231135.JavaMail.defaultUser@defaultHost>
 <4AFB4506-093F-4045-9553-AF26D3A79E66@northwestern.edu>
Message-ID: <CAFmvRscmX3q-ZEDGii7E+x0MRuLkj=M_WteZcbN-2jQkJ8X-+A@mail.gmail.com>

You may want to consider U+2370 APL FUNCTIONAL SYMBOL QUAD QUESTION.

Leo

On Dec 22, 2016 15:35, "Martin Mueller" <martinmueller at northwestern.edu>
wrote:

These are very handsome and interesting. But for the purposes of my
project, which involves folks here, there, and everywhere working on
editorial problems relating to digital transcriptions of Early Modern
texts, the cardinal requirement is that the character can be found on and
deployed from any Windows, Linux, or OS 10 machin. We have used the black
dot (\u25cf) as a kludge. Since it does not occur in the source data, there
is no ambiguity. It is relatively easy to produce on a keyboard. From a
visual perspective it is preferable to the diamond with a question
mark?although that is semantically more obvious. But it is visually very
disruptive, and it is much harder to find on a standard character map than
the black dot, which is predictably located in geometrical shapes.

It?s a kludge, but it works, and it looks to me superior to any of the
alternatives. But I can be persuaded otherwise.

With thanks for the help of all of you

MM

On 12/22/16, 6:03 AM, "William_J_G Overington" <wjgo_10009 at btinternet.com>
wrote:

    Martin Mueller wrote:

    > Is there a Unicode character that says ?I represent an alphanumerical
character, but I don?t know which?.  This is a very common problem in the
transcription of historical texts where you have lacunas.

    I have been reading this thread with interest.

    I have produced nine designs for glyphs.

    If you so choose, you can assign specific meanings to one, some, or all
of them. If you need more than nine designs please say.

    Please find attached nine .png files, one glyph design in each file.

    The size of each of the images and the names of the files follow the
following specification.

    https://urldefense.proofpoint.com/v2/url?u=http-3A__www.
unicode.org_emoji_selection.html-23images&d=CwIFaQ&c=
yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=rG8zxOdssqSzDRz4x1GLlmLOW60xyV
XydxwnJZpkxbk&m=rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=
VMzwU8ONTcLHvFcK5hcR9yj5TT3SzYSs-YYB8IGRq_A&e=

    However the images are not congruently in accordance with those rules
as there is a one pixel width transparent surround as the designs were made
using filled rectangles upon a theoretical seven row by seven column
arrangement of blocks, each block ten pixels by ten pixels. I used the
Serif PagePlus X7 desktop publishing program.

    The characters are not intended as emoji, I just applied the above
specification as it is convenient to make the designs compatible with that
specification as far as possible.

    I have assigned Private Use Area code points of U+EA60 through to
U+EA68 to the glyphs. The specific code point for each glyph is indicated
in the file name of the image of that glyph.

    I have chosen those code points as the Alt codes for U+EA60 through to
U+EA68 are Alt 60000 through to Alt 60008 respectively. My thinking being
that if the designs are implemented in fonts that those easy to remember
Alt codes might be helpful to someone using the Microsoft WordPad program.

    I checked that those code points are not being used in the Medieval
Unicode Font Initiative.

    https://urldefense.proofpoint.com/v2/url?u=http-3A__skaldic.
abdn.ac.uk_db.php-3Fcp-3DEA-26if-3Dmufi-26table-3Dmufi-5Fchar&d=CwIFaQ&c=
yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=rG8zxOdssqSzDRz4x1GLlmLOW60xyV
XydxwnJZpkxbk&m=rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=z5-
Sl6Aw2Dr0dYsoZ9xgzqCpXjzoot1TnwUrJKqNHpo&e=

    Readers who so choose are welcome to implement these glyphs in fonts.

    The https://urldefense.proofpoint.com/v2/url?u=http-3A__www.
unicode.org_emoji_selection.html-23images&d=CwIFaQ&c=
yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=rG8zxOdssqSzDRz4x1GLlmLOW60xyV
XydxwnJZpkxbk&m=rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=
VMzwU8ONTcLHvFcK5hcR9yj5TT3SzYSs-YYB8IGRq_A&e=  specification mentions
licensing. For the avoidance of doubt these designs are free to share and
use.

    A Private Use Area solution is not ideal, yet may be helpful in getting
things started and could be helpful in establishing usage, which could help
in getting the characters implemented into regular Unicode.

    I am attaching the images to this email. The nature of the email system
is that the order of the images might not be in the order of the code
points, yet each image has an indication of the code point within its name
so that information should help to resolve any such problem in the
transmission of the email attachments.

    William Overington

    Thursday 22 December 2016
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161222/15f9cf58/attachment.html>

From martinmueller at northwestern.edu  Thu Dec 22 23:40:42 2016
From: martinmueller at northwestern.edu (Martin Mueller)
Date: Fri, 23 Dec 2016 05:40:42 +0000
Subject: a character for an unknown character
In-Reply-To: <CAFmvRscmX3q-ZEDGii7E+x0MRuLkj=M_WteZcbN-2jQkJ8X-+A@mail.gmail.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <16243692.18677.1482408231135.JavaMail.defaultUser@defaultHost>
 <4AFB4506-093F-4045-9553-AF26D3A79E66@northwestern.edu>
 <CAFmvRscmX3q-ZEDGii7E+x0MRuLkj=M_WteZcbN-2jQkJ8X-+A@mail.gmail.com>
Message-ID: <ED477576-7F4E-4A33-8279-9BEB5716147F@northwestern.edu>

Many thanks for this suggestion, which may do the trick. I assume that APL refers to the APL language and that this symbol comes from the math world. But to judge from the terse description, it would not be an abuse of this character to use it when you want to say ?I am quite sure that there is just one character missing here, and it is almost certainly from the same character set as the character surrounding it, but I have no idea which of them it represents.?  It is visually clear and unobtrusive, and you can cut and paste it from a standard keyboard map (I?ve tested it on a Mac.


From: Leo Broukhis <leobro at gmail.com>
Reply-To: "leob at mailcom.com" <leob at mailcom.com>
Date: Thursday, December 22, 2016 at 6:31 PM
To: Martin Mueller <martinmueller at northwestern.edu>
Cc: unicode Unicode Discussion <unicode at unicode.org>
Subject: Re: a character for an unknown character

You may want to consider U+2370 APL FUNCTIONAL SYMBOL QUAD QUESTION.

Leo

On Dec 22, 2016 15:35, "Martin Mueller" <martinmueller at northwestern.edu<mailto:martinmueller at northwestern.edu>> wrote:
These are very handsome and interesting. But for the purposes of my project, which involves folks here, there, and everywhere working on editorial problems relating to digital transcriptions of Early Modern texts, the cardinal requirement is that the character can be found on and deployed from any Windows, Linux, or OS 10 machin. We have used the black dot (\u25cf) as a kludge. Since it does not occur in the source data, there is no ambiguity. It is relatively easy to produce on a keyboard. From a visual perspective it is preferable to the diamond with a question mark?although that is semantically more obvious. But it is visually very disruptive, and it is much harder to find on a standard character map than the black dot, which is predictably located in geometrical shapes.

It?s a kludge, but it works, and it looks to me superior to any of the alternatives. But I can be persuaded otherwise.

With thanks for the help of all of you

MM

On 12/22/16, 6:03 AM, "William_J_G Overington" <wjgo_10009 at btinternet.com<mailto:wjgo_10009 at btinternet.com>> wrote:

    Martin Mueller wrote:

    > Is there a Unicode character that says ?I represent an alphanumerical character, but I don?t know which?.  This is a very common problem in the transcription of historical texts where you have lacunas.

    I have been reading this thread with interest.

    I have produced nine designs for glyphs.

    If you so choose, you can assign specific meanings to one, some, or all of them. If you need more than nine designs please say.

    Please find attached nine .png files, one glyph design in each file.

    The size of each of the images and the names of the files follow the following specification.
    https://urldefense.proofpoint.com/v2/url?u=http-3A__www.unicode.org_emoji_selection.html-23images&d=CwIFaQ&c=yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=rG8zxOdssqSzDRz4x1GLlmLOW60xyVXydxwnJZpkxbk&m=rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=VMzwU8ONTcLHvFcK5hcR9yj5TT3SzYSs-YYB8IGRq_A&e=

    However the images are not congruently in accordance with those rules as there is a one pixel width transparent surround as the designs were made using filled rectangles upon a theoretical seven row by seven column arrangement of blocks, each block ten pixels by ten pixels. I used the Serif PagePlus X7 desktop publishing program.

    The characters are not intended as emoji, I just applied the above specification as it is convenient to make the designs compatible with that specification as far as possible.

    I have assigned Private Use Area code points of U+EA60 through to U+EA68 to the glyphs. The specific code point for each glyph is indicated in the file name of the image of that glyph.

    I have chosen those code points as the Alt codes for U+EA60 through to U+EA68 are Alt 60000 through to Alt 60008 respectively. My thinking being that if the designs are implemented in fonts that those easy to remember Alt codes might be helpful to someone using the Microsoft WordPad program.

    I checked that those code points are not being used in the Medieval Unicode Font Initiative.
    https://urldefense.proofpoint.com/v2/url?u=http-3A__skaldic.abdn.ac.uk_db.php-3Fcp-3DEA-26if-3Dmufi-26table-3Dmufi-5Fchar&d=CwIFaQ&c=yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=rG8zxOdssqSzDRz4x1GLlmLOW60xyVXydxwnJZpkxbk&m=rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=z5-Sl6Aw2Dr0dYsoZ9xgzqCpXjzoot1TnwUrJKqNHpo&e=

    Readers who so choose are welcome to implement these glyphs in fonts.
    The https://urldefense.proofpoint.com/v2/url?u=http-3A__www.unicode.org_emoji_selection.html-23images&d=CwIFaQ&c=yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=rG8zxOdssqSzDRz4x1GLlmLOW60xyVXydxwnJZpkxbk&m=rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=VMzwU8ONTcLHvFcK5hcR9yj5TT3SzYSs-YYB8IGRq_A&e=  specification mentions licensing. For the avoidance of doubt these designs are free to share and use.

    A Private Use Area solution is not ideal, yet may be helpful in getting things started and could be helpful in establishing usage, which could help in getting the characters implemented into regular Unicode.

    I am attaching the images to this email. The nature of the email system is that the order of the images might not be in the order of the code points, yet each image has an indication of the code point within its name so that information should help to resolve any such problem in the transmission of the email attachments.

    William Overington

    Thursday 22 December 2016


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161223/97353135/attachment.html>

From daniel.buenzli at erratique.ch  Fri Dec 23 01:40:58 2016
From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=)
Date: Fri, 23 Dec 2016 08:40:58 +0100
Subject: UAX #29: Ambiguities in WB4, and contributing back
 testcases
In-Reply-To: <CAFOnWk=fhjW+yASJoLtOxRHUxj8wgnyTjx=YH8aZiM_6zVF09g@mail.gmail.com>
References: <CAFOnWk=he2eNjuq+ugx9BZ_xtoPdnqFbi6kZXRwn=JbVdQDVEA@mail.gmail.com>
 <20161222210822.6b07ff41@JRWUBU2>
 <CAFOnWk=fhjW+yASJoLtOxRHUxj8wgnyTjx=YH8aZiM_6zVF09g@mail.gmail.com>
Message-ID: <8248670B665748B9A51CA0586F3FBB34@erratique.ch>

On Thursday 22 December 2016 at 23:05, Manish Goregaokar wrote:
> I guess the confusion is, with ? rules, do we apply them globally, or
> only apply them when considering subsequent rules?

This was discussed recently. See [1].

Best,  

Daniel

[1] http://www.unicode.org/mail-arch/unicode-ml/y2016-m11/0151.html


From mark at macchiato.com  Fri Dec 23 03:15:45 2016
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Fri, 23 Dec 2016 10:15:45 +0100
Subject: Another UAX #29 bug: property tables need updating
In-Reply-To: <CAFOnWkmwzRgKmv4Zt5-aPD6hS05YuDqbYywPQfYu_qWF+Ax3_g@mail.gmail.com>
References: <CAFOnWkkta+dprjhOtkxA35p=cxOSA2W4U+nO1r77hBWJp_UyxQ@mail.gmail.com>
 <88485348-f439-aacb-9f30-f4cf9bdf171c@att.net>
 <CAFOnWkmwzRgKmv4Zt5-aPD6hS05YuDqbYywPQfYu_qWF+Ax3_g@mail.gmail.com>
Message-ID: <CAJ2xs_FFA1fEHYeO_MerC1Mp=QyX1QsiKcpzuu4m37qtTYvj-Q@mail.gmail.com>

Also, under http://unicode.org/reports/tr29/#Conformance see the following.
The wording could be stronger: the CLDR customizations are strongly
recommended.

   - Some changes to rules and data are needed for best segmentation
   behavior of additional emoji zwj sequences [UTR51
   <http://unicode.org/reports/tr41/tr41-19.html#UTR51>], prior to the
   eventual publication of Unicode 10.0. Such changes are planned for
   inclusion in CLDR Version 30 [CLDR
   <http://unicode.org/reports/tr41/tr41-19.html#CLDR>].


Mark

On Thu, Dec 22, 2016 at 11:20 PM, Manish Goregaokar <manish at mozilla.com>
wrote:

> Will do, thanks!
> -Manish
>
>
> On Thu, Dec 22, 2016 at 11:16 AM, Ken Whistler <kenwhistler at att.net>
> wrote:
> > Manish,
> >
> >
> > On 12/22/2016 10:35 AM, Manish Goregaokar wrote:
> >>
> >> The property table should include all role and gender modifiers as GAZ.
> >>
> >> Could this be updated?
> >>
> >
> > Property values cannot be updated for *published* versions of the
> standard.
> > What you should do is submit your feedback as part of the public review
> for
> > UAX #29 for version 10.0 of the standard. See:
> >
> > http://www.unicode.org/review/pri341/
> >
> > If you submit your feedback to UAX #29 (and its associated data files)
> > according to the directions there, that will ensure that it gets properly
> > considered during the review of UAX #29 at the next UTC meeting
> scheduled at
> > the end of January.
> >
> > --Ken
> >
> > P.S. In general, any feedback on property values in the UCD need to be
> > handled that way, to make sure they get appropriate consideration by the
> > UTC.
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161223/37483c74/attachment.html>

From verdy_p at wanadoo.fr  Fri Dec 23 13:35:07 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 23 Dec 2016 20:35:07 +0100
Subject: a character for an unknown character
In-Reply-To: <4AFB4506-093F-4045-9553-AF26D3A79E66@northwestern.edu>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <16243692.18677.1482408231135.JavaMail.defaultUser@defaultHost>
 <4AFB4506-093F-4045-9553-AF26D3A79E66@northwestern.edu>
Message-ID: <CAGa7JC2OK3Qzj5jzso1dbUh7Fco6krMa3Qk40g5G_jU_zMyEKA@mail.gmail.com>

if you want something that is very unlikely to be present in original
texts, it would be preferable to avoid the black dot or any other bullets
which may be used as punctuation marks.

Consider using some geometric shape, notably those inherited from DOS code
pages, such as the filled square U+2588 (?). It is mapped in many common
fonts, only because it is part of legacy code page 437 (at position
0xDB=219 decimal) and most other codepages for MSDOS. It may be used in
legacy encoded texts for MSDOS but only for presentation purpose (using
monospaced fonts for text-only terminals) where it should not match any use
for missing/damaged parts of an original document printed/handwritten
document on paper (those DOS texts should have no original version on
paper, they are originately only in encoded files on computers).

It is easily entered on keyboards using Alt+219 (**not** Alt+0219) on
Windows (it works using the current OEM 8-bit codepage, which may be CP437,
CP850 or similar).

There's also the half-filled square U+2584 (?), at position 0xDC=218
decimal in CP437/CP850 (i.e. Alt+218 on Windows keyboards)  if you want to
avoid filling the full lineheight and being able to discriminate multiple
rows of text.

Or the filled squared with dark grey pattern U+2593 (?), at position
0xB2=178 (i.e. Alt+178 on Windows keyboards) if you want to still see it
with text selection. Its gray pattern is also intuitively meaning "missing
part".

All these geometric shapes are symbols, not punctuations, and very unlikely
to be used as bullet punctuations in documents and not confusable with any
other characters for actual text. They are also ignored in plain text
searches, i.e. not considered as variants of a significant dot, and there's
also a word break before and after them (so they won't collapse into
surrounding words written before or after them). They are also typically
used to replace words that have been voluntarily deleted/hidden from an
original document (becaue there's a need for keeping this info private).

But note that input fields for entering password or secret codes in
application forms/dialogs are typically using black bullets U+2022 (?) or
simply ASCII asterisks U+002A (*) to replace the entered characters: they
cannot be read, but the user knows what he is entering on his keyboard.


2016-12-23 0:35 GMT+01:00 Martin Mueller <martinmueller at northwestern.edu>:

> These are very handsome and interesting. But for the purposes of my
> project, which involves folks here, there, and everywhere working on
> editorial problems relating to digital transcriptions of Early Modern
> texts, the cardinal requirement is that the character can be found on and
> deployed from any Windows, Linux, or OS 10 machin. We have used the black
> dot (\u25cf) as a kludge. Since it does not occur in the source data, there
> is no ambiguity. It is relatively easy to produce on a keyboard. From a
> visual perspective it is preferable to the diamond with a question
> mark?although that is semantically more obvious. But it is visually very
> disruptive, and it is much harder to find on a standard character map than
> the black dot, which is predictably located in geometrical shapes.
>
> It?s a kludge, but it works, and it looks to me superior to any of the
> alternatives. But I can be persuaded otherwise.
>
> With thanks for the help of all of you
>
> MM
>
> On 12/22/16, 6:03 AM, "William_J_G Overington" <wjgo_10009 at btinternet.com>
> wrote:
>
>     Martin Mueller wrote:
>
>     > Is there a Unicode character that says ?I represent an
> alphanumerical character, but I don?t know which?.  This is a very common
> problem in the transcription of historical texts where you have lacunas.
>
>     I have been reading this thread with interest.
>
>     I have produced nine designs for glyphs.
>
>     If you so choose, you can assign specific meanings to one, some, or
> all of them. If you need more than nine designs please say.
>
>     Please find attached nine .png files, one glyph design in each file.
>
>     The size of each of the images and the names of the files follow the
> following specification.
>
>     https://urldefense.proofpoint.com/v2/url?u=http-3A__www.
> unicode.org_emoji_selection.html-23images&d=CwIFaQ&c=
> yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=
> rG8zxOdssqSzDRz4x1GLlmLOW60xyVXydxwnJZpkxbk&m=
> rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=
> VMzwU8ONTcLHvFcK5hcR9yj5TT3SzYSs-YYB8IGRq_A&e=
>
>     However the images are not congruently in accordance with those rules
> as there is a one pixel width transparent surround as the designs were made
> using filled rectangles upon a theoretical seven row by seven column
> arrangement of blocks, each block ten pixels by ten pixels. I used the
> Serif PagePlus X7 desktop publishing program.
>
>     The characters are not intended as emoji, I just applied the above
> specification as it is convenient to make the designs compatible with that
> specification as far as possible.
>
>     I have assigned Private Use Area code points of U+EA60 through to
> U+EA68 to the glyphs. The specific code point for each glyph is indicated
> in the file name of the image of that glyph.
>
>     I have chosen those code points as the Alt codes for U+EA60 through to
> U+EA68 are Alt 60000 through to Alt 60008 respectively. My thinking being
> that if the designs are implemented in fonts that those easy to remember
> Alt codes might be helpful to someone using the Microsoft WordPad program.
>
>     I checked that those code points are not being used in the Medieval
> Unicode Font Initiative.
>
>     https://urldefense.proofpoint.com/v2/url?u=http-3A__skaldic.
> abdn.ac.uk_db.php-3Fcp-3DEA-26if-3Dmufi-26table-3Dmufi-5Fchar&d=CwIFaQ&c=
> yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=
> rG8zxOdssqSzDRz4x1GLlmLOW60xyVXydxwnJZpkxbk&m=
> rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=z5-
> Sl6Aw2Dr0dYsoZ9xgzqCpXjzoot1TnwUrJKqNHpo&e=
>
>     Readers who so choose are welcome to implement these glyphs in fonts.
>
>     The https://urldefense.proofpoint.com/v2/url?u=http-3A__www.
> unicode.org_emoji_selection.html-23images&d=CwIFaQ&c=
> yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=
> rG8zxOdssqSzDRz4x1GLlmLOW60xyVXydxwnJZpkxbk&m=
> rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=
> VMzwU8ONTcLHvFcK5hcR9yj5TT3SzYSs-YYB8IGRq_A&e=  specification mentions
> licensing. For the avoidance of doubt these designs are free to share and
> use.
>
>     A Private Use Area solution is not ideal, yet may be helpful in
> getting things started and could be helpful in establishing usage, which
> could help in getting the characters implemented into regular Unicode.
>
>     I am attaching the images to this email. The nature of the email
> system is that the order of the images might not be in the order of the
> code points, yet each image has an indication of the code point within its
> name so that information should help to resolve any such problem in the
> transmission of the email attachments.
>
>     William Overington
>
>     Thursday 22 December 2016
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161223/3e213a93/attachment.html>

From martinmueller at northwestern.edu  Fri Dec 23 15:19:55 2016
From: martinmueller at northwestern.edu (Martin Mueller)
Date: Fri, 23 Dec 2016 21:19:55 +0000
Subject: a character for an unknown character
In-Reply-To: <CAGa7JC2OK3Qzj5jzso1dbUh7Fco6krMa3Qk40g5G_jU_zMyEKA@mail.gmail.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <16243692.18677.1482408231135.JavaMail.defaultUser@defaultHost>
 <4AFB4506-093F-4045-9553-AF26D3A79E66@northwestern.edu>
 <CAGa7JC2OK3Qzj5jzso1dbUh7Fco6krMa3Qk40g5G_jU_zMyEKA@mail.gmail.com>
Message-ID: <7A1BC089-81C9-44AB-9A4B-3B18EC93EF39@northwestern.edu>

That?s excellent advice. In our project somebody confused the bullet with the black circle. It didn?t matter, because 17th century texts don?t have bullet symbols?at least not the ones we?re dealing with. But following your advice would significantly reduce ambiguity

From: <verdyp at gmail.com> on behalf of Philippe Verdy <verdy_p at wanadoo.fr>
Reply-To: Philippe Verdy <verdy_p at wanadoo.fr>
Date: Friday, December 23, 2016 at 1:35 PM
To: Martin Mueller <martinmueller at northwestern.edu>
Cc: William_J_G Overington <wjgo_10009 at btinternet.com>, "unicode at unicode.org" <unicode at unicode.org>
Subject: Re: a character for an unknown character

if you want something that is very unlikely to be present in original texts, it would be preferable to avoid the black dot or any other bullets which may be used as punctuation marks.

Consider using some geometric shape, notably those inherited from DOS code pages, such as the filled square U+2588 (?). It is mapped in many common fonts, only because it is part of legacy code page 437 (at position 0xDB=219 decimal) and most other codepages for MSDOS. It may be used in legacy encoded texts for MSDOS but only for presentation purpose (using monospaced fonts for text-only terminals) where it should not match any use for missing/damaged parts of an original document printed/handwritten document on paper (those DOS texts should have no original version on paper, they are originately only in encoded files on computers).

It is easily entered on keyboards using Alt+219 (**not** Alt+0219) on Windows (it works using the current OEM 8-bit codepage, which may be CP437, CP850 or similar).

There's also the half-filled square U+2584 (?), at position 0xDC=218 decimal in CP437/CP850 (i.e. Alt+218 on Windows keyboards)  if you want to avoid filling the full lineheight and being able to discriminate multiple rows of text.

Or the filled squared with dark grey pattern U+2593 (?), at position 0xB2=178 (i.e. Alt+178 on Windows keyboards) if you want to still see it with text selection. Its gray pattern is also intuitively meaning "missing part".

All these geometric shapes are symbols, not punctuations, and very unlikely to be used as bullet punctuations in documents and not confusable with any other characters for actual text. They are also ignored in plain text searches, i.e. not considered as variants of a significant dot, and there's also a word break before and after them (so they won't collapse into surrounding words written before or after them). They are also typically used to replace words that have been voluntarily deleted/hidden from an original document (becaue there's a need for keeping this info private).

But note that input fields for entering password or secret codes in application forms/dialogs are typically using black bullets U+2022 (?) or simply ASCII asterisks U+002A (*) to replace the entered characters: they cannot be read, but the user knows what he is entering on his keyboard.


2016-12-23 0:35 GMT+01:00 Martin Mueller <martinmueller at northwestern.edu<mailto:martinmueller at northwestern.edu>>:
These are very handsome and interesting. But for the purposes of my project, which involves folks here, there, and everywhere working on editorial problems relating to digital transcriptions of Early Modern texts, the cardinal requirement is that the character can be found on and deployed from any Windows, Linux, or OS 10 machin. We have used the black dot (\u25cf) as a kludge. Since it does not occur in the source data, there is no ambiguity. It is relatively easy to produce on a keyboard. From a visual perspective it is preferable to the diamond with a question mark?although that is semantically more obvious. But it is visually very disruptive, and it is much harder to find on a standard character map than the black dot, which is predictably located in geometrical shapes.

It?s a kludge, but it works, and it looks to me superior to any of the alternatives. But I can be persuaded otherwise.

With thanks for the help of all of you

MM

On 12/22/16, 6:03 AM, "William_J_G Overington" <wjgo_10009 at btinternet.com<mailto:wjgo_10009 at btinternet.com>> wrote:

    Martin Mueller wrote:

    > Is there a Unicode character that says ?I represent an alphanumerical character, but I don?t know which?.  This is a very common problem in the transcription of historical texts where you have lacunas.

    I have been reading this thread with interest.

    I have produced nine designs for glyphs.

    If you so choose, you can assign specific meanings to one, some, or all of them. If you need more than nine designs please say.

    Please find attached nine .png files, one glyph design in each file.

    The size of each of the images and the names of the files follow the following specification.

    https://urldefense.proofpoint.com/v2/url?u=http-3A__www.unicode.org_emoji_selection.html-23images&d=CwIFaQ&c=yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=rG8zxOdssqSzDRz4x1GLlmLOW60xyVXydxwnJZpkxbk&m=rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=VMzwU8ONTcLHvFcK5hcR9yj5TT3SzYSs-YYB8IGRq_A&e=

    However the images are not congruently in accordance with those rules as there is a one pixel width transparent surround as the designs were made using filled rectangles upon a theoretical seven row by seven column arrangement of blocks, each block ten pixels by ten pixels. I used the Serif PagePlus X7 desktop publishing program.

    The characters are not intended as emoji, I just applied the above specification as it is convenient to make the designs compatible with that specification as far as possible.

    I have assigned Private Use Area code points of U+EA60 through to U+EA68 to the glyphs. The specific code point for each glyph is indicated in the file name of the image of that glyph.

    I have chosen those code points as the Alt codes for U+EA60 through to U+EA68 are Alt 60000 through to Alt 60008 respectively. My thinking being that if the designs are implemented in fonts that those easy to remember Alt codes might be helpful to someone using the Microsoft WordPad program.

    I checked that those code points are not being used in the Medieval Unicode Font Initiative.

    https://urldefense.proofpoint.com/v2/url?u=http-3A__skaldic.abdn.ac.uk_db.php-3Fcp-3DEA-26if-3Dmufi-26table-3Dmufi-5Fchar&d=CwIFaQ&c=yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=rG8zxOdssqSzDRz4x1GLlmLOW60xyVXydxwnJZpkxbk&m=rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=z5-Sl6Aw2Dr0dYsoZ9xgzqCpXjzoot1TnwUrJKqNHpo&e=

    Readers who so choose are welcome to implement these glyphs in fonts.

    The https://urldefense.proofpoint.com/v2/url?u=http-3A__www.unicode.org_emoji_selection.html-23images&d=CwIFaQ&c=yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=rG8zxOdssqSzDRz4x1GLlmLOW60xyVXydxwnJZpkxbk&m=rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=VMzwU8ONTcLHvFcK5hcR9yj5TT3SzYSs-YYB8IGRq_A&e=  specification mentions licensing. For the avoidance of doubt these designs are free to share and use.

    A Private Use Area solution is not ideal, yet may be helpful in getting things started and could be helpful in establishing usage, which could help in getting the characters implemented into regular Unicode.

    I am attaching the images to this email. The nature of the email system is that the order of the images might not be in the order of the code points, yet each image has an indication of the code point within its name so that information should help to resolve any such problem in the transmission of the email attachments.

    William Overington

    Thursday 22 December 2016


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161223/0aa82562/attachment.html>

From richard.wordingham at ntlworld.com  Fri Dec 23 15:34:46 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 23 Dec 2016 21:34:46 +0000
Subject: a character for an unknown character
In-Reply-To: <CAGa7JC2OK3Qzj5jzso1dbUh7Fco6krMa3Qk40g5G_jU_zMyEKA@mail.gmail.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <16243692.18677.1482408231135.JavaMail.defaultUser@defaultHost>
 <4AFB4506-093F-4045-9553-AF26D3A79E66@northwestern.edu>
 <CAGa7JC2OK3Qzj5jzso1dbUh7Fco6krMa3Qk40g5G_jU_zMyEKA@mail.gmail.com>
Message-ID: <20161223213446.478feda6@JRWUBU2>

On Fri, 23 Dec 2016 20:35:07 +0100
Philippe Verdy <verdy_p at wanadoo.fr> wrote:


> But note that input fields for entering password or secret codes in
> application forms/dialogs are typically using black bullets U+2022
> (?) or simply ASCII asterisks U+002A (*) to replace the entered
> characters: they cannot be read, but the user knows what he is
> entering on his keyboard.

How?  All I know is which keys I've pressed!  This is a real problem
when unlocking the screen with the current erratic behaviour of IBus on
Ubuntu 16.04 (Xenial).

Richard.


From verdy_p at wanadoo.fr  Fri Dec 23 17:44:00 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 24 Dec 2016 00:44:00 +0100
Subject: a character for an unknown character
In-Reply-To: <20161223213446.478feda6@JRWUBU2>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <16243692.18677.1482408231135.JavaMail.defaultUser@defaultHost>
 <4AFB4506-093F-4045-9553-AF26D3A79E66@northwestern.edu>
 <CAGa7JC2OK3Qzj5jzso1dbUh7Fco6krMa3Qk40g5G_jU_zMyEKA@mail.gmail.com>
 <20161223213446.478feda6@JRWUBU2>
Message-ID: <CAGa7JC2-EazchSpoNvMTa8D6kFd0bvU+t6nHPPtkBS=qnYebxg@mail.gmail.com>

This is stil lthe standard and expected behavior for password input field
in HTML and many devices, to not display the actual characters but to
**display** them as bullets or asterisks.

If IBus cannot hide the input (i.e. generate the entered characters to the
application without displaying it, and let the output hints
(bullets/asterisks) be generated only by the application, then iBus has a
problem and may not be usable for compliant password inputs.


2016-12-23 22:34 GMT+01:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> On Fri, 23 Dec 2016 20:35:07 +0100
> Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>
>
> > But note that input fields for entering password or secret codes in
> > application forms/dialogs are typically using black bullets U+2022
> > (?) or simply ASCII asterisks U+002A (*) to replace the entered
> > characters: they cannot be read, but the user knows what he is
> > entering on his keyboard.
>
> How?  All I know is which keys I've pressed!  This is a real problem
> when unlocking the screen with the current erratic behaviour of IBus on
> Ubuntu 16.04 (Xenial).
>
> Richard.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161224/ad3691cd/attachment.html>

From verdy_p at wanadoo.fr  Fri Dec 23 17:53:35 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 24 Dec 2016 00:53:35 +0100
Subject: a character for an unknown character
In-Reply-To: <7A1BC089-81C9-44AB-9A4B-3B18EC93EF39@northwestern.edu>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <16243692.18677.1482408231135.JavaMail.defaultUser@defaultHost>
 <4AFB4506-093F-4045-9553-AF26D3A79E66@northwestern.edu>
 <CAGa7JC2OK3Qzj5jzso1dbUh7Fco6krMa3Qk40g5G_jU_zMyEKA@mail.gmail.com>
 <7A1BC089-81C9-44AB-9A4B-3B18EC93EF39@northwestern.edu>
Message-ID: <CAGa7JC14S6xsTdgdkuif8iu5eTJ0rnEvnfwFxw==LUL7b8zD9Q@mail.gmail.com>

I would not bet at all that 17th century texts do not have any bullets,
even if they were still not encoded in Unicode. In fact there are MANY
bullet like symbols in old manuscripts, and OCR could also "find" many of
them, by not being able to distinguish various dots. In fact I've seen
bullet like symbols used for meaning zero, or as placeholders meaning "N/A".

17th century manuscripts and books are full of various decorative glyphs.
And the bullets are so easy to confuse... Take an old Occitan manuscripts
you'll find them in various places. Consider Asian scripts, you could
confuse them with full stops, or other diacritics. Consider Arabic or
Hebrew and you'll probably confuse them with vowel points.

So if you really want to include a plaeholder for damaged/missing parts in
old documents, the large geometric shapes are probably best to use and will
be obviously unconfusable with other dots and readers will obviously know
that this is a placeholder for some missing/destroyed content.

An alternative commonly used is also to use "[...]", a convention often
used in citations when some parts of the sentence are voluntarily omitted
by the redactor.

2016-12-23 22:19 GMT+01:00 Martin Mueller <martinmueller at northwestern.edu>:

> That?s excellent advice. In our project somebody confused the bullet with
> the black circle. It didn?t matter, because 17th century texts don?t have
> bullet symbols?at least not the ones we?re dealing with. But following your
> advice would significantly reduce ambiguity
>
>
>
> *From: *<verdyp at gmail.com> on behalf of Philippe Verdy <verdy_p at wanadoo.fr
> >
> *Reply-To: *Philippe Verdy <verdy_p at wanadoo.fr>
> *Date: *Friday, December 23, 2016 at 1:35 PM
> *To: *Martin Mueller <martinmueller at northwestern.edu>
> *Cc: *William_J_G Overington <wjgo_10009 at btinternet.com>, "
> unicode at unicode.org" <unicode at unicode.org>
> *Subject: *Re: a character for an unknown character
>
>
>
> if you want something that is very unlikely to be present in original
> texts, it would be preferable to avoid the black dot or any other bullets
> which may be used as punctuation marks.
>
>
>
> Consider using some geometric shape, notably those inherited from DOS code
> pages, such as the filled square U+2588 (?). It is mapped in many common
> fonts, only because it is part of legacy code page 437 (at position
> 0xDB=219 decimal) and most other codepages for MSDOS. It may be used in
> legacy encoded texts for MSDOS but only for presentation purpose (using
> monospaced fonts for text-only terminals) where it should not match any use
> for missing/damaged parts of an original document printed/handwritten
> document on paper (those DOS texts should have no original version on
> paper, they are originately only in encoded files on computers).
>
>
>
> It is easily entered on keyboards using Alt+219 (**not** Alt+0219) on
> Windows (it works using the current OEM 8-bit codepage, which may be CP437,
> CP850 or similar).
>
>
>
> There's also the half-filled square U+2584 (?), at position 0xDC=218
> decimal in CP437/CP850 (i.e. Alt+218 on Windows keyboards)  if you want to
> avoid filling the full lineheight and being able to discriminate multiple
> rows of text.
>
>
>
> Or the filled squared with dark grey pattern U+2593 (?), at position
> 0xB2=178 (i.e. Alt+178 on Windows keyboards) if you want to still see it
> with text selection. Its gray pattern is also intuitively meaning "missing
> part".
>
>
>
> All these geometric shapes are symbols, not punctuations, and very
> unlikely to be used as bullet punctuations in documents and not confusable
> with any other characters for actual text. They are also ignored in plain
> text searches, i.e. not considered as variants of a significant dot, and
> there's also a word break before and after them (so they won't collapse
> into surrounding words written before or after them). They are also
> typically used to replace words that have been voluntarily deleted/hidden
> from an original document (becaue there's a need for keeping this info
> private).
>
>
>
> But note that input fields for entering password or secret codes in
> application forms/dialogs are typically using black bullets U+2022 (?) or
> simply ASCII asterisks U+002A (*) to replace the entered characters: they
> cannot be read, but the user knows what he is entering on his keyboard.
>
>
>
>
>
>
>
>
>
>
>
> 2016-12-23 0:35 GMT+01:00 Martin Mueller <martinmueller at northwestern.edu>:
>
> These are very handsome and interesting. But for the purposes of my
> project, which involves folks here, there, and everywhere working on
> editorial problems relating to digital transcriptions of Early Modern
> texts, the cardinal requirement is that the character can be found on and
> deployed from any Windows, Linux, or OS 10 machin. We have used the black
> dot (\u25cf) as a kludge. Since it does not occur in the source data, there
> is no ambiguity. It is relatively easy to produce on a keyboard. From a
> visual perspective it is preferable to the diamond with a question
> mark?although that is semantically more obvious. But it is visually very
> disruptive, and it is much harder to find on a standard character map than
> the black dot, which is predictably located in geometrical shapes.
>
> It?s a kludge, but it works, and it looks to me superior to any of the
> alternatives. But I can be persuaded otherwise.
>
> With thanks for the help of all of you
>
> MM
>
> On 12/22/16, 6:03 AM, "William_J_G Overington" <wjgo_10009 at btinternet.com>
> wrote:
>
>     Martin Mueller wrote:
>
>     > Is there a Unicode character that says ?I represent an
> alphanumerical character, but I don?t know which?.  This is a very common
> problem in the transcription of historical texts where you have lacunas.
>
>     I have been reading this thread with interest.
>
>     I have produced nine designs for glyphs.
>
>     If you so choose, you can assign specific meanings to one, some, or
> all of them. If you need more than nine designs please say.
>
>     Please find attached nine .png files, one glyph design in each file.
>
>     The size of each of the images and the names of the files follow the
> following specification.
>
>     https://urldefense.proofpoint.com/v2/url?u=http-3A__www.
> unicode.org_emoji_selection.html-23images&d=CwIFaQ&c=
> yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=
> rG8zxOdssqSzDRz4x1GLlmLOW60xyVXydxwnJZpkxbk&m=
> rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=
> VMzwU8ONTcLHvFcK5hcR9yj5TT3SzYSs-YYB8IGRq_A&e=
>
>     However the images are not congruently in accordance with those rules
> as there is a one pixel width transparent surround as the designs were made
> using filled rectangles upon a theoretical seven row by seven column
> arrangement of blocks, each block ten pixels by ten pixels. I used the
> Serif PagePlus X7 desktop publishing program.
>
>     The characters are not intended as emoji, I just applied the above
> specification as it is convenient to make the designs compatible with that
> specification as far as possible.
>
>     I have assigned Private Use Area code points of U+EA60 through to
> U+EA68 to the glyphs. The specific code point for each glyph is indicated
> in the file name of the image of that glyph.
>
>     I have chosen those code points as the Alt codes for U+EA60 through to
> U+EA68 are Alt 60000 through to Alt 60008 respectively. My thinking being
> that if the designs are implemented in fonts that those easy to remember
> Alt codes might be helpful to someone using the Microsoft WordPad program.
>
>     I checked that those code points are not being used in the Medieval
> Unicode Font Initiative.
>
>     https://urldefense.proofpoint.com/v2/url?u=http-3A__skaldic.
> abdn.ac.uk_db.php-3Fcp-3DEA-26if-3Dmufi-26table-3Dmufi-5Fchar&d=CwIFaQ&c=
> yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=
> rG8zxOdssqSzDRz4x1GLlmLOW60xyVXydxwnJZpkxbk&m=
> rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=z5-
> Sl6Aw2Dr0dYsoZ9xgzqCpXjzoot1TnwUrJKqNHpo&e=
>
>     Readers who so choose are welcome to implement these glyphs in fonts.
>
>     The https://urldefense.proofpoint.com/v2/url?u=http-3A__www.
> unicode.org_emoji_selection.html-23images&d=CwIFaQ&c=
> yHlS04HhBraes5BQ9ueu5zKhE7rtNXt_d012z2PA6ws&r=
> rG8zxOdssqSzDRz4x1GLlmLOW60xyVXydxwnJZpkxbk&m=
> rtTUf0iueQJPUWv8oFWfDyJBHafFPYQJ5mZelPYN_mE&s=
> VMzwU8ONTcLHvFcK5hcR9yj5TT3SzYSs-YYB8IGRq_A&e=  specification mentions
> licensing. For the avoidance of doubt these designs are free to share and
> use.
>
>
>     A Private Use Area solution is not ideal, yet may be helpful in
> getting things started and could be helpful in establishing usage, which
> could help in getting the characters implemented into regular Unicode.
>
>     I am attaching the images to this email. The nature of the email
> system is that the order of the images might not be in the order of the
> code points, yet each image has an indication of the code point within its
> name so that information should help to resolve any such problem in the
> transmission of the email attachments.
>
>     William Overington
>
>     Thursday 22 December 2016
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161224/d6f17f22/attachment.html>

From richard.wordingham at ntlworld.com  Fri Dec 23 19:33:52 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 24 Dec 2016 01:33:52 +0000
Subject: a character for an unknown character
In-Reply-To: <CAGa7JC2-EazchSpoNvMTa8D6kFd0bvU+t6nHPPtkBS=qnYebxg@mail.gmail.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <16243692.18677.1482408231135.JavaMail.defaultUser@defaultHost>
 <4AFB4506-093F-4045-9553-AF26D3A79E66@northwestern.edu>
 <CAGa7JC2OK3Qzj5jzso1dbUh7Fco6krMa3Qk40g5G_jU_zMyEKA@mail.gmail.com>
 <20161223213446.478feda6@JRWUBU2>
 <CAGa7JC2-EazchSpoNvMTa8D6kFd0bvU+t6nHPPtkBS=qnYebxg@mail.gmail.com>
Message-ID: <20161224013352.64376a6d@JRWUBU2>

On Sat, 24 Dec 2016 00:44:00 +0100
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> If IBus cannot hide the input (i.e. generate the entered characters
> to the application without displaying it, and let the output hints
> (bullets/asterisks) be generated only by the application, then iBus
> has a problem and may not be usable for compliant password inputs.

The problem is that I can't tell what characters are being entered!
I'm not even sure the problem is related to IBus - IBus-dependnent
keyboards are not offered when entering passwords.

Richard. 

From verdy_p at wanadoo.fr  Fri Dec 23 20:24:17 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 24 Dec 2016 03:24:17 +0100
Subject: a character for an unknown character
In-Reply-To: <20161224013352.64376a6d@JRWUBU2>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <16243692.18677.1482408231135.JavaMail.defaultUser@defaultHost>
 <4AFB4506-093F-4045-9553-AF26D3A79E66@northwestern.edu>
 <CAGa7JC2OK3Qzj5jzso1dbUh7Fco6krMa3Qk40g5G_jU_zMyEKA@mail.gmail.com>
 <20161223213446.478feda6@JRWUBU2>
 <CAGa7JC2-EazchSpoNvMTa8D6kFd0bvU+t6nHPPtkBS=qnYebxg@mail.gmail.com>
 <20161224013352.64376a6d@JRWUBU2>
Message-ID: <CAGa7JC3Grmvuxe8-Y9D8DO0OunWu0E8rDZnsKPfhfE8z8hHktg@mail.gmail.com>

you only get a hint that some character has been hit because you see an
additional bullet or asterisk. This is per design. Such input field should
be limited to short input that is easy for you to type from your keyboard.
But many input forms will also include a clickable icon/button that can be
used to view temporarily you input, and that will hide immediately if you
release that icon/button Once you're trained to enter the password, you
know how to type it, and you just need a visual hint that you have
effectively pressed a key and a character was added to the input.
On all devices I know for entering a PIN code (payment terminals, cash
dispensers... these input are hidden and there's not even any way to view
the actual code (because the code is short and the keyboard is simple,
there's little risk of error on , unless you don't know that code at all
and there's an unlimited number of trials authorized). On most OSes, the
logon screen also hides the password by default. It is a critical need for
accessing some secured/private environments/applications. It is of course
not needed for every other kind of input.
Note: this discussion is no longer is scope. We were just talking about
plaeholders usable to replace some text. Not about if and when placeholders
or actual text should be rendered in applications

2016-12-24 2:33 GMT+01:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> On Sat, 24 Dec 2016 00:44:00 +0100
> Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>
> > If IBus cannot hide the input (i.e. generate the entered characters
> > to the application without displaying it, and let the output hints
> > (bullets/asterisks) be generated only by the application, then iBus
> > has a problem and may not be usable for compliant password inputs.
>
> The problem is that I can't tell what characters are being entered!
> I'm not even sure the problem is related to IBus - IBus-dependnent
> keyboards are not offered when entering passwords.
>
> Richard.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161224/08ce6428/attachment.html>

From jkorpela at cs.tut.fi  Sun Dec 25 11:31:28 2016
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Sun, 25 Dec 2016 19:31:28 +0200
Subject: a character for an unknown character
In-Reply-To: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
Message-ID: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>

21.12.2016, 4:29, Martin Mueller wrote:

> Is there a Unicode character that says ?I represent an alphanumerical
> character, but I don?t know which?.

I think including such a ?character? in Unicode would not fit into the 
the idea of Unicode as a system for encoding plain text characters. You 
seem to be asking for a symbol that is not a graphic or control 
character but information about uncertainty regarding a character a data 
stream. So I think this does not fall into the category of plain text, 
and the information should be expressed at a higher protocol level, e.g. 
in markup or as out-of-band information.

When it is not certain what character there is in some text to be 
encoded, there is a wide range of possible situations. For example, it 
might be a thing like ?there is letter U or letter V, probably the 
latter? or ?there is some Latin letter but no hint of what it might be? 
or even ?there is an alphanumerical character? (though I find it 
difficult to imagine such a situation). Such things can hardly be 
described using new characters; rather, they need to be expressed using 
verbal descriptions (which are about the encoded text, not part of it) 
or some formal notations or both.

> This is a very common problem in
> the transcription of historical texts where you have lacunas. Often, the
> extent of the lacuna is known, and the alphabet is known as well. The
> EEBO TCP transcriptions of English texts before 1700 are good examples.
> They are SGML transcriptions, where missing stuff is represented by
> <gap/> elements with attributes about this or that. This is efficient
> when it comes to pages, very inefficient when it comes to individual
> characters.

Efficient in what sense? Saving bytes can hardly be an issue here. And 
if various attributes are needed to describe the case, then it would 
become awkward to try to do the same with encoded characters (or 
?characters?, Unicode code points).

> In the TCP project, various code points from the Geometrical were used
> to represent lacunae. The black circle (\u25cf) has been used as the
> character for a missing character.This is OK and unambiguous in its
> context.

If some graphic symbol is by convention used to represent a lacuna, then 
the issue, as regards to Unicode, is simply whether that symbol exists 
as an encoded character or whether there is need to add that graphic 
symbol to Unicode. But it would be a matter of encoding graphic 
characters (irrespectively of their meaning in some content), not about 
encoding abstract ideas like ?an unrecognized character?.

> But would be nice to have a special character for just that
> purpose

Various symbols are used in different contexts to indicate situations 
like ?there is a written symbol that cannot be recognized as a specific 
character?. Perhaps there should be a universal convention about this, 
but it is unrealistic to expect that to happen. The Unicode Standard can 
hardly standardize such things. And if there were such a universal 
symbol, it would surely have been encoded in Unicode?not because of its 
meaning, but because of its consistent use as a character in plain text.

So I think the conclusion is that you should use established 
conventions, if they exist, about using some symbol for such situations, 
or define a convention as needed. You should not expect the character to 
be recognized in this special meaning without such a higher-level 
convention.

There?s a theoretical (?) problem with this. Let us assume that you 
decide to use a particular character to represent ?unknown character? in 
your data, when working with some type of written texts. What happens 
when you encounter, in the study of those text, a graphic symbol that is 
best identified as the character you decided to use in that special 
meaning? Well, I think you can decide to solve that problem if it ever 
appears.

Yucca


From 747.neutron at gmail.com  Sun Dec 25 10:33:31 2016
From: 747.neutron at gmail.com (=?UTF-8?B?WWlmw6FuIFfDoW5n?=)
Date: Mon, 26 Dec 2016 01:33:31 +0900
Subject: About standardized variants of characters in Dingbat block
Message-ID: <CAF5KyEzN9heKuVpR0RkWozdT5+COf5Ffm3X2y0bh-q-njA6Fhw@mail.gmail.com>

Hi,

I'm curious about the reason why U+270C VICTORY HAND ? has
standardized text and emoji styles defined but not with U+270A RAISED
FIST ? and U+270B RAISED HAND ?.
http://www.unicode.org/Public/9.0.0/ucd/StandardizedVariants.txt

I personally can't think of their usage disparity, so what is the
rationale behind it?

Thank you.


From 747.neutron at gmail.com  Sun Dec 25 10:58:39 2016
From: 747.neutron at gmail.com (=?UTF-8?B?WWlmw6FuIFfDoW5n?=)
Date: Mon, 26 Dec 2016 01:58:39 +0900
Subject: On the upcoming LATIN LETTER SMALL CAPITAL Q
Message-ID: <CAF5KyEwN-AC3_gdhBBsG+fihNJOYr+hXFbMGSdBiJ0UsnUnH7w@mail.gmail.com>

Please excuse my serial posting.

I recently noticed the subhead given to the LATIN LETTER SMALL CAPITAL
Q in the following document (at A7AF) is "Letter for representation of
morpheme in Japanese".
http://www.unicode.org/L2/L2016/16381-n4778r-pdam1-2-charts.pdf

However, to my knowledge, the letter is required for describing a
"phoneme" of Japanese that isn't tied to specific "morphemes" (~
"words"). I have contacted the original writer of the proposal:
http://www.unicode.org/L2/L2015/15241-small-cap-q.pdf
and he agrees with me in this regard.

Thus I suppose "Letter for Japanese phonology" would be more desired a
heading for this character, though subheads are not normative. What
are your thoughts?

From markus.icu at gmail.com  Sun Dec 25 13:29:25 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Sun, 25 Dec 2016 11:29:25 -0800
Subject: About standardized variants of characters in Dingbat block
In-Reply-To: <CAF5KyEzN9heKuVpR0RkWozdT5+COf5Ffm3X2y0bh-q-njA6Fhw@mail.gmail.com>
References: <CAF5KyEzN9heKuVpR0RkWozdT5+COf5Ffm3X2y0bh-q-njA6Fhw@mail.gmail.com>
Message-ID: <CAN49p6pJU9B7+B1FsoOQLcuU6jtWSjRRiKQ9v6a6632=94Ki+w@mail.gmail.com>

On Sun, Dec 25, 2016 at 8:33 AM, Yif?n W?ng <747.neutron at gmail.com> wrote:

> I'm curious about the reason why U+270C VICTORY HAND ? has
> standardized text and emoji styles defined but not with U+270A RAISED
> FIST ? and U+270B RAISED HAND ?.
> http://www.unicode.org/Public/9.0.0/ucd/StandardizedVariants.txt
>
> I personally can't think of their usage disparity, so what is the
> rationale behind it?
>

As far as I remember, the victory hand is an original Dingbat symbol and
got also unified with an emoji from the Japanese carrier sets. The
variation selectors let you pick Dingbat style vs. emoji style.

The other two were not Dingbats but only came from the Japanese carrier
sets, for playing rock-paper-scissors.

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5Cu2700-%5Cu270f%5D&g=age&i=

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161225/397dd00d/attachment.html>

From leoboiko at gmail.com  Sun Dec 25 14:03:48 2016
From: leoboiko at gmail.com (Leonardo Boiko)
Date: Sun, 25 Dec 2016 18:03:48 -0200
Subject: On the upcoming LATIN LETTER SMALL CAPITAL Q
In-Reply-To: <CAF5KyEwN-AC3_gdhBBsG+fihNJOYr+hXFbMGSdBiJ0UsnUnH7w@mail.gmail.com>
References: <CAF5KyEwN-AC3_gdhBBsG+fihNJOYr+hXFbMGSdBiJ0UsnUnH7w@mail.gmail.com>
Message-ID: <CAJ6uix5xcc=SuKpJ+o5hUcyx5Hg-08Y+aBWUy4QXiMyU9=zAMA@mail.gmail.com>

Agreed with Yif?n W?ng... But I wonder about the need for the character in
the first place. Are we going to add a full small-caps set, too, given its
use in morphological glosses? Isn't it enough to use a regular 'Q' in
plain-text, and style to small caps in rich text?

I can see the rationale for mathematical bold, given that a regular-weight
plain-text character would stand for a different thing in mathematical
formul?. But there's no way a capital Q would ever be confused as anything
other than the phoneme, in a Japanese phonological transcription.

2016/12/25 17:56 "Yif?n W?ng" <747.neutron at gmail.com>:

Please excuse my serial posting.

I recently noticed the subhead given to the LATIN LETTER SMALL CAPITAL
Q in the following document (at A7AF) is "Letter for representation of
morpheme in Japanese".
http://www.unicode.org/L2/L2016/16381-n4778r-pdam1-2-charts.pdf

However, to my knowledge, the letter is required for describing a
"phoneme" of Japanese that isn't tied to specific "morphemes" (~
"words"). I have contacted the original writer of the proposal:
http://www.unicode.org/L2/L2015/15241-small-cap-q.pdf
and he agrees with me in this regard.

Thus I suppose "Letter for Japanese phonology" would be more desired a
heading for this character, though subheads are not normative. What
are your thoughts?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161225/53cf4104/attachment.html>

From gwalla at gmail.com  Sun Dec 25 18:18:24 2016
From: gwalla at gmail.com (Garth Wallace)
Date: Sun, 25 Dec 2016 16:18:24 -0800
Subject: On the upcoming LATIN LETTER SMALL CAPITAL Q
In-Reply-To: <CAF5KyEwN-AC3_gdhBBsG+fihNJOYr+hXFbMGSdBiJ0UsnUnH7w@mail.gmail.com>
References: <CAF5KyEwN-AC3_gdhBBsG+fihNJOYr+hXFbMGSdBiJ0UsnUnH7w@mail.gmail.com>
Message-ID: <CA+p4_H2TmSetipZB9_wEnfmF5niuWiPkgX7ueZKLWTwhiHysjA@mail.gmail.com>

On Sun, Dec 25, 2016 at 8:58 AM, Yif?n W?ng <747.neutron at gmail.com> wrote:

> Please excuse my serial posting.
>
> I recently noticed the subhead given to the LATIN LETTER SMALL CAPITAL
> Q in the following document (at A7AF) is "Letter for representation of
> morpheme in Japanese".
> http://www.unicode.org/L2/L2016/16381-n4778r-pdam1-2-charts.pdf
>
> However, to my knowledge, the letter is required for describing a
> "phoneme" of Japanese that isn't tied to specific "morphemes" (~
> "words"). I have contacted the original writer of the proposal:
> http://www.unicode.org/L2/L2015/15241-small-cap-q.pdf
> and he agrees with me in this regard.
>
> Thus I suppose "Letter for Japanese phonology" would be more desired a
> heading for this character, though subheads are not normative. What
> are your thoughts?
>

AIUI it's not really a phoneme either; it represents gemination of a
following consonant. A chroneme?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161225/6e3ca300/attachment.html>

From 747.neutron at gmail.com  Sun Dec 25 23:27:42 2016
From: 747.neutron at gmail.com (=?UTF-8?B?WWlmw6FuIFfDoW5n?=)
Date: Mon, 26 Dec 2016 14:27:42 +0900
Subject: On the upcoming LATIN LETTER SMALL CAPITAL Q
In-Reply-To: <CAJ6uix5xcc=SuKpJ+o5hUcyx5Hg-08Y+aBWUy4QXiMyU9=zAMA@mail.gmail.com>
References: <CAF5KyEwN-AC3_gdhBBsG+fihNJOYr+hXFbMGSdBiJ0UsnUnH7w@mail.gmail.com>
 <CAJ6uix5xcc=SuKpJ+o5hUcyx5Hg-08Y+aBWUy4QXiMyU9=zAMA@mail.gmail.com>
Message-ID: <CAF5KyEzrD6sJuDEZaPQei4LiwiuP3a=5VGdKMfBoKV5QFOyM+g@mail.gmail.com>

> Agreed with Yif?n W?ng... But I wonder about the need for the character in
> the first place. Are we going to add a full small-caps set, too, given its
> use in morphological glosses? Isn't it enough to use a regular 'Q' in
> plain-text, and style to small caps in rich text?

No, it's not in "morphological glosses" but phonological notations
such as /yuQkuri/. In morphological discussions, phonological details
are usually ignored and they just write down the surface forms.

> I can see the rationale for mathematical bold, given that a regular-weight
> plain-text character would stand for a different thing in mathematical
> formul?. But there's no way a capital Q would ever be confused as anything
> other than the phoneme, in a Japanese phonological transcription.

I don't think Q is, but it should be in unison with its fellows /?/,
/?/, /?/ etc. Some books make all of them capitals, but others all
small capitals.
Making into small capitals avoids possible confusions with variables
like /C/ or /V/.

2016-12-26 5:03 GMT+09:00 Leonardo Boiko <leoboiko at gmail.com>:
> Agreed with Yif?n W?ng... But I wonder about the need for the character in
> the first place. Are we going to add a full small-caps set, too, given its
> use in morphological glosses? Isn't it enough to use a regular 'Q' in
> plain-text, and style to small caps in rich text?
>
> I can see the rationale for mathematical bold, given that a regular-weight
> plain-text character would stand for a different thing in mathematical
> formul?. But there's no way a capital Q would ever be confused as anything
> other than the phoneme, in a Japanese phonological transcription.
>
> 2016/12/25 17:56 "Yif?n W?ng" <747.neutron at gmail.com>:
>
> Please excuse my serial posting.
>
> I recently noticed the subhead given to the LATIN LETTER SMALL CAPITAL
> Q in the following document (at A7AF) is "Letter for representation of
> morpheme in Japanese".
> http://www.unicode.org/L2/L2016/16381-n4778r-pdam1-2-charts.pdf
>
> However, to my knowledge, the letter is required for describing a
> "phoneme" of Japanese that isn't tied to specific "morphemes" (~
> "words"). I have contacted the original writer of the proposal:
> http://www.unicode.org/L2/L2015/15241-small-cap-q.pdf
> and he agrees with me in this regard.
>
> Thus I suppose "Letter for Japanese phonology" would be more desired a
> heading for this character, though subheads are not normative. What
> are your thoughts?
>
>


From 747.neutron at gmail.com  Sun Dec 25 23:40:32 2016
From: 747.neutron at gmail.com (=?UTF-8?B?WWlmw6FuIFfDoW5n?=)
Date: Mon, 26 Dec 2016 14:40:32 +0900
Subject: About standardized variants of characters in Dingbat block
In-Reply-To: <CAN49p6pJU9B7+B1FsoOQLcuU6jtWSjRRiKQ9v6a6632=94Ki+w@mail.gmail.com>
References: <CAF5KyEzN9heKuVpR0RkWozdT5+COf5Ffm3X2y0bh-q-njA6Fhw@mail.gmail.com>
 <CAN49p6pJU9B7+B1FsoOQLcuU6jtWSjRRiKQ9v6a6632=94Ki+w@mail.gmail.com>
Message-ID: <CAF5KyEz0LqipyF5kumUvUO9G1YgZKXic+aKz9e2Pgo76QkbJFw@mail.gmail.com>

Wow, thank you.

I was amazed to know how those symbols slipped in the gap so perfectly as
if they were there from the beginning. I looked back at Unicode 2.0 chart
and there was only reserved blank after Zapf Dingbats got unified with
Wingdings symbols.

2016-12-26 4:29 GMT+09:00 Markus Scherer <markus.icu at gmail.com>:

> On Sun, Dec 25, 2016 at 8:33 AM, Yif?n W?ng <747.neutron at gmail.com> wrote:
>
>> I'm curious about the reason why U+270C VICTORY HAND ? has
>> standardized text and emoji styles defined but not with U+270A RAISED
>> FIST ? and U+270B RAISED HAND ?.
>> http://www.unicode.org/Public/9.0.0/ucd/StandardizedVariants.txt
>>
>> I personally can't think of their usage disparity, so what is the
>> rationale behind it?
>>
>
> As far as I remember, the victory hand is an original Dingbat symbol and
> got also unified with an emoji from the Japanese carrier sets. The
> variation selectors let you pick Dingbat style vs. emoji style.
>
> The other two were not Dingbats but only came from the Japanese carrier
> sets, for playing rock-paper-scissors.
>
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=
> %5B%5Cu2700-%5Cu270f%5D&g=age&i=
>
> markus
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161226/3043ebc7/attachment.html>

From 747.neutron at gmail.com  Sun Dec 25 23:59:02 2016
From: 747.neutron at gmail.com (=?UTF-8?B?WWlmw6FuIFfDoW5n?=)
Date: Mon, 26 Dec 2016 14:59:02 +0900
Subject: On the upcoming LATIN LETTER SMALL CAPITAL Q
In-Reply-To: <CA+p4_H2TmSetipZB9_wEnfmF5niuWiPkgX7ueZKLWTwhiHysjA@mail.gmail.com>
References: <CAF5KyEwN-AC3_gdhBBsG+fihNJOYr+hXFbMGSdBiJ0UsnUnH7w@mail.gmail.com>
 <CA+p4_H2TmSetipZB9_wEnfmF5niuWiPkgX7ueZKLWTwhiHysjA@mail.gmail.com>
Message-ID: <CAF5KyExGhN29R4GU3nqYgEekSg4x1BLgPCmkoWnoikAR35i8hQ@mail.gmail.com>

> AIUI it's not really a phoneme either; it represents gemination of a
> following consonant. A chroneme?

Perhaps you can argue that it is a "moreme", but they usually analyze
it as a segmental, mostly architypal entity or "moraic phoneme".
It isn't a simple modifier but occupies a mora for its own. /a|sa|ri/
and /a|Q|sa|ri/ differ in length in Japanese prosody.

2016-12-26 9:18 GMT+09:00 Garth Wallace <gwalla at gmail.com>:
> On Sun, Dec 25, 2016 at 8:58 AM, Yif?n W?ng <747.neutron at gmail.com> wrote:
>>
>> Please excuse my serial posting.
>>
>> I recently noticed the subhead given to the LATIN LETTER SMALL CAPITAL
>> Q in the following document (at A7AF) is "Letter for representation of
>> morpheme in Japanese".
>> http://www.unicode.org/L2/L2016/16381-n4778r-pdam1-2-charts.pdf
>>
>> However, to my knowledge, the letter is required for describing a
>> "phoneme" of Japanese that isn't tied to specific "morphemes" (~
>> "words"). I have contacted the original writer of the proposal:
>> http://www.unicode.org/L2/L2015/15241-small-cap-q.pdf
>> and he agrees with me in this regard.
>>
>> Thus I suppose "Letter for Japanese phonology" would be more desired a
>> heading for this character, though subheads are not normative. What
>> are your thoughts?
>
>
> AIUI it's not really a phoneme either; it represents gemination of a
> following consonant. A chroneme?


From leoboiko at gmail.com  Mon Dec 26 03:38:05 2016
From: leoboiko at gmail.com (Leonardo Boiko)
Date: Mon, 26 Dec 2016 07:38:05 -0200
Subject: On the upcoming LATIN LETTER SMALL CAPITAL Q
In-Reply-To: <CAJ6uix6ScNN86RuWk0gfvRLoUZWMr1hSZRDc17eBUsv0N+=XDQ@mail.gmail.com>
References: <CAF5KyEwN-AC3_gdhBBsG+fihNJOYr+hXFbMGSdBiJ0UsnUnH7w@mail.gmail.com>
 <CAJ6uix5xcc=SuKpJ+o5hUcyx5Hg-08Y+aBWUy4QXiMyU9=zAMA@mail.gmail.com>
 <CAF5KyEzrD6sJuDEZaPQei4LiwiuP3a=5VGdKMfBoKV5QFOyM+g@mail.gmail.com>
 <CAJ6uix6ScNN86RuWk0gfvRLoUZWMr1hSZRDc17eBUsv0N+=XDQ@mail.gmail.com>
Message-ID: <CAJ6uix7aj-uSOSJgzAaD1wyOkv3hXYNLYpf=cPMCsZCK-YG8KQ@mail.gmail.com>

I meant that morphological glosses (such as the Leipzig standard) style
tags in small-caps. Like this:

    yukkuri-ni yom-i-mas-i-ta
    carefully-ADV read-CON-POL-CON-PRF

These are traditionally set in small-caps, not capitals. If the
phonologists are getting small-caps into plain text, why not the
morphologists? If the only argument for Q is that there is an  /?/, why not
the full set, and then you can write any morphological tag? The chance of
confusing "CON" with a word is greater than that of /Q/ or [Q], if anything.

2016/12/26 3:28 "Yif?n W?ng" <747.neutron at gmail.com>:

> Agreed with Yif?n W?ng... But I wonder about the need for the character in
> the first place. Are we going to add a full small-caps set, too, given its
> use in morphological glosses? Isn't it enough to use a regular 'Q' in
> plain-text, and style to small caps in rich text?

No, it's not in "morphological glosses" but phonological notations
such as /yuQkuri/. In morphological discussions, phonological details
are usually ignored and they just write down the surface forms.

> I can see the rationale for mathematical bold, given that a regular-weight
> plain-text character would stand for a different thing in mathematical
> formul?. But there's no way a capital Q would ever be confused as anything
> other than the phoneme, in a Japanese phonological transcription.

I don't think Q is, but it should be in unison with its fellows /?/,
/?/, /?/ etc. Some books make all of them capitals, but others all
small capitals.
Making into small capitals avoids possible confusions with variables
like /C/ or /V/.

2016-12-26 5:03 GMT+09:00 Leonardo Boiko <leoboiko at gmail.com>:
> Agreed with Yif?n W?ng... But I wonder about the need for the character in
> the first place. Are we going to add a full small-caps set, too, given its
> use in morphological glosses? Isn't it enough to use a regular 'Q' in
> plain-text, and style to small caps in rich text?
>
> I can see the rationale for mathematical bold, given that a regular-weight
> plain-text character would stand for a different thing in mathematical
> formul?. But there's no way a capital Q would ever be confused as anything
> other than the phoneme, in a Japanese phonological transcription.
>
> 2016/12/25 17:56 "Yif?n W?ng" <747.neutron at gmail.com>:
>
> Please excuse my serial posting.
>
> I recently noticed the subhead given to the LATIN LETTER SMALL CAPITAL
> Q in the following document (at A7AF) is "Letter for representation of
> morpheme in Japanese".
> http://www.unicode.org/L2/L2016/16381-n4778r-pdam1-2-charts.pdf
>
> However, to my knowledge, the letter is required for describing a
> "phoneme" of Japanese that isn't tied to specific "morphemes" (~
> "words"). I have contacted the original writer of the proposal:
> http://www.unicode.org/L2/L2015/15241-small-cap-q.pdf
> and he agrees with me in this regard.
>
> Thus I suppose "Letter for Japanese phonology" would be more desired a
> heading for this character, though subheads are not normative. What
> are your thoughts?
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161226/0a9f1a7e/attachment.html>

From 747.neutron at gmail.com  Mon Dec 26 09:45:07 2016
From: 747.neutron at gmail.com (=?UTF-8?B?WWlmw6FuIFfDoW5n?=)
Date: Tue, 27 Dec 2016 00:45:07 +0900
Subject: On the upcoming LATIN LETTER SMALL CAPITAL Q
In-Reply-To: <CAJ6uix7aj-uSOSJgzAaD1wyOkv3hXYNLYpf=cPMCsZCK-YG8KQ@mail.gmail.com>
References: <CAF5KyEwN-AC3_gdhBBsG+fihNJOYr+hXFbMGSdBiJ0UsnUnH7w@mail.gmail.com>
 <CAJ6uix5xcc=SuKpJ+o5hUcyx5Hg-08Y+aBWUy4QXiMyU9=zAMA@mail.gmail.com>
 <CAF5KyEzrD6sJuDEZaPQei4LiwiuP3a=5VGdKMfBoKV5QFOyM+g@mail.gmail.com>
 <CAJ6uix6ScNN86RuWk0gfvRLoUZWMr1hSZRDc17eBUsv0N+=XDQ@mail.gmail.com>
 <CAJ6uix7aj-uSOSJgzAaD1wyOkv3hXYNLYpf=cPMCsZCK-YG8KQ@mail.gmail.com>
Message-ID: <CAF5KyEyS=7KeV3vqZaQ0UZHL_r7HXswYswaKZw8yKv0h6uPwgg@mail.gmail.com>

> These are traditionally set in small-caps, not capitals. If the phonologists
> are getting small-caps into plain text, why not the morphologists? If the
> only argument for Q is that there is an  /?/, why not the full set, and then
> you can write any morphological tag? The chance of confusing "CON" with a
> word is greater than that of /Q/ or [Q], if anything.

Let me tidy it up a bit.

You may be under impression that the letter has something to do with
morphology, but my argument is that the original "Letter for
representation of morpheme in Japanese" is a misnomer and this letter
is totally unrelated to morphological context.

This kind of letters are used to describe phonological representation
of words like /de?ko?seQka/ in the same way we describe an English
word /b?t?l??p/. The letters are represented in small capital because
they are different from the sound what we usually associate with N, H
or Q. (Actually, their phonetic values vary wildly according to
adjacent phonemes.) You can substitute ordinary upper cases for them,
but they are merely substitution in the same way you type /?/ instead
of /?/, or upside-down G instead of /?/, because of the lack of
typographical assets.

In morphological literature, few authors bother to use these notations
since they don't matter in this level of discussion. The word form I
mentioned above would be just transcribed as "denk?sekka" and glossed
in the next line.

Finally, small capitals in the Leipzig rules, I believe, are just
stylistic alteration. For example, when you write ??? (all small
capital), the letters still stand for ordinary A, D and V, for this is
obviously the abbreviation of "adverb". It's more like the whole
sequence ADV made shrunken in "small caps" mode or style, which is a
parallel operation to italicization or boldification. Since the
semantic difference is not inherent to the character itself, I don't
think Unicode people would treat them as another set of letters in
this case.


2016-12-26 18:38 GMT+09:00 Leonardo Boiko <leoboiko at gmail.com>:
> I meant that morphological glosses (such as the Leipzig standard) style tags
> in small-caps. Like this:
>
>     yukkuri-ni yom-i-mas-i-ta
>     carefully-ADV read-CON-POL-CON-PRF
>
> These are traditionally set in small-caps, not capitals. If the phonologists
> are getting small-caps into plain text, why not the morphologists? If the
> only argument for Q is that there is an  /?/, why not the full set, and then
> you can write any morphological tag? The chance of confusing "CON" with a
> word is greater than that of /Q/ or [Q], if anything.
>
> 2016/12/26 3:28 "Yif?n W?ng" <747.neutron at gmail.com>:
>
>> Agreed with Yif?n W?ng... But I wonder about the need for the character in
>> the first place. Are we going to add a full small-caps set, too, given its
>> use in morphological glosses? Isn't it enough to use a regular 'Q' in
>> plain-text, and style to small caps in rich text?
>
> No, it's not in "morphological glosses" but phonological notations
> such as /yuQkuri/. In morphological discussions, phonological details
> are usually ignored and they just write down the surface forms.
>
>> I can see the rationale for mathematical bold, given that a regular-weight
>> plain-text character would stand for a different thing in mathematical
>> formul?. But there's no way a capital Q would ever be confused as anything
>> other than the phoneme, in a Japanese phonological transcription.
>
> I don't think Q is, but it should be in unison with its fellows /?/,
> /?/, /?/ etc. Some books make all of them capitals, but others all
> small capitals.
> Making into small capitals avoids possible confusions with variables
> like /C/ or /V/.
>
> 2016-12-26 5:03 GMT+09:00 Leonardo Boiko <leoboiko at gmail.com>:
>> Agreed with Yif?n W?ng... But I wonder about the need for the character in
>> the first place. Are we going to add a full small-caps set, too, given its
>> use in morphological glosses? Isn't it enough to use a regular 'Q' in
>> plain-text, and style to small caps in rich text?
>>
>> I can see the rationale for mathematical bold, given that a regular-weight
>> plain-text character would stand for a different thing in mathematical
>> formul?. But there's no way a capital Q would ever be confused as anything
>> other than the phoneme, in a Japanese phonological transcription.
>>
>> 2016/12/25 17:56 "Yif?n W?ng" <747.neutron at gmail.com>:
>>
>> Please excuse my serial posting.
>>
>> I recently noticed the subhead given to the LATIN LETTER SMALL CAPITAL
>> Q in the following document (at A7AF) is "Letter for representation of
>> morpheme in Japanese".
>> http://www.unicode.org/L2/L2016/16381-n4778r-pdam1-2-charts.pdf
>>
>> However, to my knowledge, the letter is required for describing a
>> "phoneme" of Japanese that isn't tied to specific "morphemes" (~
>> "words"). I have contacted the original writer of the proposal:
>> http://www.unicode.org/L2/L2015/15241-small-cap-q.pdf
>> and he agrees with me in this regard.
>>
>> Thus I suppose "Letter for Japanese phonology" would be more desired a
>> heading for this character, though subheads are not normative. What
>> are your thoughts?
>>
>>
>
>


From leoboiko at gmail.com  Mon Dec 26 09:58:56 2016
From: leoboiko at gmail.com (Leonardo Boiko)
Date: Mon, 26 Dec 2016 13:58:56 -0200
Subject: On the upcoming LATIN LETTER SMALL CAPITAL Q
In-Reply-To: <CAF5KyEyS=7KeV3vqZaQ0UZHL_r7HXswYswaKZw8yKv0h6uPwgg@mail.gmail.com>
References: <CAF5KyEwN-AC3_gdhBBsG+fihNJOYr+hXFbMGSdBiJ0UsnUnH7w@mail.gmail.com>
 <CAJ6uix5xcc=SuKpJ+o5hUcyx5Hg-08Y+aBWUy4QXiMyU9=zAMA@mail.gmail.com>
 <CAF5KyEzrD6sJuDEZaPQei4LiwiuP3a=5VGdKMfBoKV5QFOyM+g@mail.gmail.com>
 <CAJ6uix6ScNN86RuWk0gfvRLoUZWMr1hSZRDc17eBUsv0N+=XDQ@mail.gmail.com>
 <CAJ6uix7aj-uSOSJgzAaD1wyOkv3hXYNLYpf=cPMCsZCK-YG8KQ@mail.gmail.com>
 <CAF5KyEyS=7KeV3vqZaQ0UZHL_r7HXswYswaKZw8yKv0h6uPwgg@mail.gmail.com>
Message-ID: <CAJ6uix6Fe4yJ=rzs=vYAkotpOLbO6ngNU3YU8OWqFw-esBQ5Ag@mail.gmail.com>

2016-12-26 13:45 GMT-02:00 Yif?n W?ng <747.neutron at gmail.com>:

> You may be under impression that the letter has something to do with
> morphology, but my argument is that the original "Letter for
> representation of morpheme in Japanese" is a misnomer and this letter
> is totally unrelated to morphological context.
>

I agree, and already said I agreed in my first email.  I know how Japanese
is represented in IPA and how /Q/ and /N/ are used.  My point is that I
don't think phonologists' small-caps Q has *more *justification to be in
Unicode than morphologists? small-caps everything.


> For example, when you write ??? (all small
> capital), the letters still stand for ordinary A, D and V, for this is
> obviously the abbreviation of "adverb".  It's more like the whole
> sequence ADV made shrunken in "small caps" mode or style, which is a
> parallel operation to italicization or boldification.


Which is parallel to how bold and italics are used in mathematics, which
was the argument to get them into Unicode, as I've also pointed earlier.?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161226/c50eb820/attachment.html>

From wjgo_10009 at btinternet.com  Mon Dec 26 05:31:54 2016
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 26 Dec 2016 11:31:54 +0000 (GMT)
Subject: a character for an unknown character
In-Reply-To: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
Message-ID: <13169249.6611.1482751914111.JavaMail.defaultUser@defaultHost>

Jukka K. Korpela wrote:

> So I think this does not fall into the category of plain text, and the information should be expressed at a higher protocol level, e.g. in markup or as out-of-band information.

I opine that requiring the use of a higher level protocol needlessly makes encoding a document more complicated than it need be. Using plain text would allow the transcript to be encoded into a Portable Document Format (PDF) document when publishing the transcript.

> Such things can hardly be described using new characters; ....

I opine that they can and that it would be straightforward to do that.

Certainly the situation that it can be done does not necessarily mean that it will be done. It depends upon what people choose to do.

For example, I designed some glyphs that could be used for such characters.

http://www.unicode.org/mail-arch/unicode-ml/y2016-m12/0071.html

Other glyphs could be designed if more are needed. Not necessarily designed by me.

The example that is quoted, namely ?there is letter U or letter V, probably the latter? may need to place the U and the V between two of the new characters, yet that is not a great problem.

For example, maybe use two of the designs attached to the http://www.unicode.org/mail-arch/unicode-ml/y2016-m12/0071.html post, for example, a character based on the glyph in transcribe_ea65.png before the U and a character based on the glyph in transcribe_ea66.png after the V. Thus a four character sequence.

If desired, a character based on the glyph in transcribe_ea67.png could be placed before a note made by a transcriber and a character based on the glyph in transcribe_ea68.png could be placed after a note made by a transcriber. That could be helpful while transcribing a document. Maybe later the note could be moved to be a footnote in a book, yet the characters could be useful when actually transcribing.

> If some graphic symbol is by convention used to represent a lacuna, then the issue, as regards to Unicode, is simply whether that symbol exists as an encoded character or whether there is need to add that graphic symbol to Unicode. But it would be a matter of encoding graphic characters (irrespectively of their meaning in some content), not about encoding abstract ideas like ?an unrecognized character?.

Well, I opine that the situation needs to be assessed based upon the needs of today, not based on rules made long ago before today's needs arose.

> Perhaps there should be a universal convention about this, but it is unrealistic to expect that to happen.

I opine that it is realistic for it to happen. For example, I have published in this list nine new glyphs. In this post I have suggested meanings for four of them. I have suggested some Private Use Area code points. There are open source fonts around which people can, if they so choose, copy one of them, give it a new font name and a new file name and add in nine new characters at U+EA60 through to U+EA68 based on my designs as in the above-mentioned post. If that is what people want to do it could happen very quickly, maybe in a few days. The new open source font could then be made available on a web site and a person interested in using the font could gather it from the web and install the font on his or her own computer. The fact that the Private Use Area is part of Unicode and that fonts are organized on many computers so as not to be dependent upon with which software program they are being used means that a Private Use Area encoding can be used very effectively.

I accept that that may not be what people want to do and that it may well not happen, yet it is realistic that it could happen if people want to do it.

If it does happen then after that it would be a process to decide if the characters were used sufficiently for them to become encoded into regular Unicode.

> The Unicode Standard can hardly standardize such things.

I opine that The Unicode Standard could standardize such things if the Unicode Technical Committee decides that it wants to do that.

> And if there were such a universal symbol, it would surely have been encoded in Unicode?not because of its meaning, but because of its consistent use as a character in plain text.

Well, there are new characters being encoded every year, some of which have not existed in plain text before. Progress happens when it happens, new ideas can arise and be applied today.

> You should not expect the character to be recognized in this special meaning without such a higher-level convention.

Well, I opine that if the character is a new character designed and defined as to meaning for the purpose then such an expectation would be reasonable.

> There?s a theoretical (?) problem with this. Let us assume that you decide to use a particular character to represent ?unknown character? in 
your data, when working with some type of written texts. What happens when you encounter, in the study of those text, a graphic symbol that is 
best identified as the character you decided to use in that special meaning? Well, I think you can decide to solve that problem if it ever appears.

An advantage of having new characters designed and defined as to meaning specifically for the purpose is that that should avoid such a problem arising - though one can never be absolutely sure about that.

William Overington

Monday 26 December 2016


----Original message----
>From : jkorpela at cs.tut.fi
Date : 25/12/2016 - 17:31 (GMTST)
To : unicode at unicode.org
Subject : Re: a character for an unknown character

21.12.2016, 4:29, Martin Mueller wrote:

> Is there a Unicode character that says ?I represent an alphanumerical
> character, but I don?t know which?.

I think including such a ?character? in Unicode would not fit into the 
the idea of Unicode as a system for encoding plain text characters. You 
seem to be asking for a symbol that is not a graphic or control 
character but information about uncertainty regarding a character a data 
stream. So I think this does not fall into the category of plain text, 
and the information should be expressed at a higher protocol level, e.g. 
in markup or as out-of-band information.

When it is not certain what character there is in some text to be 
encoded, there is a wide range of possible situations. For example, it 
might be a thing like ?there is letter U or letter V, probably the 
latter? or ?there is some Latin letter but no hint of what it might be? 
or even ?there is an alphanumerical character? (though I find it 
difficult to imagine such a situation). Such things can hardly be 
described using new characters; rather, they need to be expressed using 
verbal descriptions (which are about the encoded text, not part of it) 
or some formal notations or both.

> This is a very common problem in
> the transcription of historical texts where you have lacunas. Often, the
> extent of the lacuna is known, and the alphabet is known as well. The
> EEBO TCP transcriptions of English texts before 1700 are good examples.
> They are SGML transcriptions, where missing stuff is represented by
> <gap/> elements with attributes about this or that. This is efficient
> when it comes to pages, very inefficient when it comes to individual
> characters.

Efficient in what sense? Saving bytes can hardly be an issue here. And 
if various attributes are needed to describe the case, then it would 
become awkward to try to do the same with encoded characters (or 
?characters?, Unicode code points).

> In the TCP project, various code points from the Geometrical were used
> to represent lacunae. The black circle (\u25cf) has been used as the
> character for a missing character.This is OK and unambiguous in its
> context.

If some graphic symbol is by convention used to represent a lacuna, then 
the issue, as regards to Unicode, is simply whether that symbol exists 
as an encoded character or whether there is need to add that graphic 
symbol to Unicode. But it would be a matter of encoding graphic 
characters (irrespectively of their meaning in some content), not about 
encoding abstract ideas like ?an unrecognized character?.

> But would be nice to have a special character for just that
> purpose

Various symbols are used in different contexts to indicate situations 
like ?there is a written symbol that cannot be recognized as a specific 
character?. Perhaps there should be a universal convention about this, 
but it is unrealistic to expect that to happen. The Unicode Standard can 
hardly standardize such things. And if there were such a universal 
symbol, it would surely have been encoded in Unicode?not because of its 
meaning, but because of its consistent use as a character in plain text.

So I think the conclusion is that you should use established 
conventions, if they exist, about using some symbol for such situations, 
or define a convention as needed. You should not expect the character to 
be recognized in this special meaning without such a higher-level 
convention.

There?s a theoretical (?) problem with this. Let us assume that you 
decide to use a particular character to represent ?unknown character? in 
your data, when working with some type of written texts. What happens 
when you encounter, in the study of those text, a graphic symbol that is 
best identified as the character you decided to use in that special 
meaning? Well, I think you can decide to solve that problem if it ever 
appears.

Yucca


From verdy_p at wanadoo.fr  Mon Dec 26 14:52:55 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 26 Dec 2016 21:52:55 +0100
Subject: On the upcoming LATIN LETTER SMALL CAPITAL Q
In-Reply-To: <CAJ6uix7aj-uSOSJgzAaD1wyOkv3hXYNLYpf=cPMCsZCK-YG8KQ@mail.gmail.com>
References: <CAF5KyEwN-AC3_gdhBBsG+fihNJOYr+hXFbMGSdBiJ0UsnUnH7w@mail.gmail.com>
 <CAJ6uix5xcc=SuKpJ+o5hUcyx5Hg-08Y+aBWUy4QXiMyU9=zAMA@mail.gmail.com>
 <CAF5KyEzrD6sJuDEZaPQei4LiwiuP3a=5VGdKMfBoKV5QFOyM+g@mail.gmail.com>
 <CAJ6uix6ScNN86RuWk0gfvRLoUZWMr1hSZRDc17eBUsv0N+=XDQ@mail.gmail.com>
 <CAJ6uix7aj-uSOSJgzAaD1wyOkv3hXYNLYpf=cPMCsZCK-YG8KQ@mail.gmail.com>
Message-ID: <CAGa7JC2tZ_i_c6HELi45g-athN0gJgX+VeB8DtAr4qfdeQgC5A@mail.gmail.com>

This still does not mean that these smallcaps are different from the normal
Latin characters they represent in this case : "ADV" for example is an
abbreviated form of the standard English word "ADVERB", just using
smallcaps as a stylistic variation not really borrowed by individual
characters "A", "D" or "V" themselves, but for the whole abbreviation.
Compare this to the stylistic convention of using superscripts for final
letters "st" of ordinal numeric values such as "1st". or the extra use of
bold, italic, underlining to makup some intersegmental semantics. Isolately
on each character, these style carry no semantic at all, they are only
meaningful in contexts spanning larger segments of text (syllable or
words/morphemes, independantly of letters composing them, which may be in
any script and may use all other possible combining diacritics.

To encode the character, you need to demonstrate that the distinct style is
meaningful for the isolated character itself and sets its own semantic.

Otherwise we would need to reencode all existing base letters (also
precomposed letters using diacritics) in many variants: superscript,
subscript, bold, italic, underlined, or a combination of these, which would
mean hundreds of thousands new characters. Segmental notations carried by
style variants applied to ranges of characters are out of scope of the
standard because this is in fact a national convention (and in fact it is
not a standard and there are multiple choices to represent these
distinctions, without affecting the meaning of indicidual letters in these
spans).

So just use some style markup in an external protocol and let's keep the
characters unified with their normal style version. Exceptions were made
and accepted in Unicode only because of
* roundtrip compability with legacy encoding standards, e.g. with
superscript "a" and "o" (masculine and feminine ordinal terminators), or
the precombined abbreviation No. for "numero" ,
* or for maths symbols, which are not really letters and that need to be
distinguished from normal human languages without even being affected by
their grammar or orthography or altered by tools such as spelling
correctors.
The cost for desunifying many letters is worse than using a document-local,
or language-specific convention for using the proper markup styles.

There will unavoidably exist documents that will reuse the few encoded
variants, but in my opinion these documnets are just hacking the standard
when they should better use style or semantic markup.


2016-12-26 10:38 GMT+01:00 Leonardo Boiko <leoboiko at gmail.com>:

> I meant that morphological glosses (such as the Leipzig standard) style
> tags in small-caps. Like this:
>
>     yukkuri-ni yom-i-mas-i-ta
>     carefully-ADV read-CON-POL-CON-PRF
>
> These are traditionally set in small-caps, not capitals. If the
> phonologists are getting small-caps into plain text, why not the
> morphologists? If the only argument for Q is that there is an  /?/, why
> not the full set, and then you can write any morphological tag? The chance
> of confusing "CON" with a word is greater than that of /Q/ or [Q], if
> anything.
>
> 2016/12/26 3:28 "Yif?n W?ng" <747.neutron at gmail.com>:
>
> > Agreed with Yif?n W?ng... But I wonder about the need for the character
> in
> > the first place. Are we going to add a full small-caps set, too, given
> its
> > use in morphological glosses? Isn't it enough to use a regular 'Q' in
> > plain-text, and style to small caps in rich text?
>
> No, it's not in "morphological glosses" but phonological notations
> such as /yuQkuri/. In morphological discussions, phonological details
> are usually ignored and they just write down the surface forms.
>
> > I can see the rationale for mathematical bold, given that a
> regular-weight
> > plain-text character would stand for a different thing in mathematical
> > formul?. But there's no way a capital Q would ever be confused as
> anything
> > other than the phoneme, in a Japanese phonological transcription.
>
> I don't think Q is, but it should be in unison with its fellows /?/,
> /?/, /?/ etc. Some books make all of them capitals, but others all
> small capitals.
> Making into small capitals avoids possible confusions with variables
> like /C/ or /V/.
>
> 2016-12-26 5:03 GMT+09:00 Leonardo Boiko <leoboiko at gmail.com>:
> > Agreed with Yif?n W?ng... But I wonder about the need for the character
> in
> > the first place. Are we going to add a full small-caps set, too, given
> its
> > use in morphological glosses? Isn't it enough to use a regular 'Q' in
> > plain-text, and style to small caps in rich text?
> >
> > I can see the rationale for mathematical bold, given that a
> regular-weight
> > plain-text character would stand for a different thing in mathematical
> > formul?. But there's no way a capital Q would ever be confused as
> anything
> > other than the phoneme, in a Japanese phonological transcription.
> >
> > 2016/12/25 17:56 "Yif?n W?ng" <747.neutron at gmail.com>:
> >
> > Please excuse my serial posting.
> >
> > I recently noticed the subhead given to the LATIN LETTER SMALL CAPITAL
> > Q in the following document (at A7AF) is "Letter for representation of
> > morpheme in Japanese".
> > http://www.unicode.org/L2/L2016/16381-n4778r-pdam1-2-charts.pdf
> >
> > However, to my knowledge, the letter is required for describing a
> > "phoneme" of Japanese that isn't tied to specific "morphemes" (~
> > "words"). I have contacted the original writer of the proposal:
> > http://www.unicode.org/L2/L2015/15241-small-cap-q.pdf
> > and he agrees with me in this regard.
> >
> > Thus I suppose "Letter for Japanese phonology" would be more desired a
> > heading for this character, though subheads are not normative. What
> > are your thoughts?
> >
> >
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161226/369aa6f0/attachment.html>

From richard.wordingham at ntlworld.com  Mon Dec 26 18:05:22 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 27 Dec 2016 00:05:22 +0000
Subject: a character for an unknown character
In-Reply-To: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
Message-ID: <20161227000522.7bb95f3e@JRWUBU2>

On Sun, 25 Dec 2016 19:31:28 +0200
"Jukka K. Korpela" <jkorpela at cs.tut.fi> wrote:

> When it is not certain what character there is in some text to be 
> encoded, there is a wide range of possible situations. For example,
> it might be a thing like ?there is letter U or letter V, probably the 
> latter? or ?there is some Latin letter but no hint of what it might
> be? or even ?there is an alphanumerical character? (though I find it 
> difficult to imagine such a situation). Such things can hardly be 
> described using new characters; rather, they need to be expressed
> using verbal descriptions (which are about the encoded text, not part
> of it) or some formal notations or both.

This does not appear to be the situation we are being asked about.  I
suspect the context is rather that of a document damaged by fire or
mould.

> If some graphic symbol is by convention used to represent a lacuna,
> then the issue, as regards to Unicode, is simply whether that symbol
> exists as an encoded character or whether there is need to add that
> graphic symbol to Unicode. But it would be a matter of encoding
> graphic characters (irrespectively of their meaning in some content),
> not about encoding abstract ideas like ?an unrecognized character?.

Unicode encodes pictograms, directives and abstract characters, not
glyphs.  There are few, if any characters, that have no semantics,
though several characters can be ambiguous and context-sensitive as to
what semantics they occur.  If it was just a matter of appearance,
then U+26C6 RAIN would be the character to use.  It has the graphic
used for characters in damaged inscriptions.

Of course, there is one character that is already widely used in this
r?le - U+003F QUESTION MARK.  Some of its Unicode properties are not
suitable, and its informal 'unknown character' semantic conflicts with
its r?le as a punctuation mark.

If I understand correctly, these issues are already addressed by the
Leiden Conventions.  Why do they not suffice?

Richard.


From moyogo at gmail.com  Tue Dec 27 03:54:35 2016
From: moyogo at gmail.com (Denis Jacquerye)
Date: Tue, 27 Dec 2016 09:54:35 +0000
Subject: On the upcoming LATIN LETTER SMALL CAPITAL Q
In-Reply-To: <CAGa7JC2tZ_i_c6HELi45g-athN0gJgX+VeB8DtAr4qfdeQgC5A@mail.gmail.com>
References: <CAF5KyEwN-AC3_gdhBBsG+fihNJOYr+hXFbMGSdBiJ0UsnUnH7w@mail.gmail.com>
 <CAJ6uix5xcc=SuKpJ+o5hUcyx5Hg-08Y+aBWUy4QXiMyU9=zAMA@mail.gmail.com>
 <CAF5KyEzrD6sJuDEZaPQei4LiwiuP3a=5VGdKMfBoKV5QFOyM+g@mail.gmail.com>
 <CAJ6uix6ScNN86RuWk0gfvRLoUZWMr1hSZRDc17eBUsv0N+=XDQ@mail.gmail.com>
 <CAJ6uix7aj-uSOSJgzAaD1wyOkv3hXYNLYpf=cPMCsZCK-YG8KQ@mail.gmail.com>
 <CAGa7JC2tZ_i_c6HELi45g-athN0gJgX+VeB8DtAr4qfdeQgC5A@mail.gmail.com>
Message-ID: <CAJKta0xaP4+2kDjFZB9EiYQh0XBKe7C2xyzup0a4SwQJQ1EtXA@mail.gmail.com>

For what it?s worth, the small capital q was used as an IPA symbol for a
while. It was used for the Arabic ?ayn as a ?consonne roule?e gutturale? in
the 1898 IPA chart (previously noted 3 in the 1894 IPA charts and ? in some
1895 IPA charts and later charts) then as a ?consonne fricative bronchiale
sonore? in the 1905 and 1908 IPA charts, and in the notes after the IPA
chart in 1912. It was eventually replaced with the reversed glottal stop ?,
for example in the 1932 IPA chart or later charts.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161227/e4826048/attachment.html>

From christoph.paeper at crissov.de  Tue Dec 27 07:51:18 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Tue, 27 Dec 2016 14:51:18 +0100
Subject: About standardized variants of characters in Dingbat block
In-Reply-To: <CAN49p6pJU9B7+B1FsoOQLcuU6jtWSjRRiKQ9v6a6632=94Ki+w@mail.gmail.com>
References: <CAF5KyEzN9heKuVpR0RkWozdT5+COf5Ffm3X2y0bh-q-njA6Fhw@mail.gmail.com>
 <CAN49p6pJU9B7+B1FsoOQLcuU6jtWSjRRiKQ9v6a6632=94Ki+w@mail.gmail.com>
Message-ID: <27B4508E-38A8-4872-BB56-43304AF6E20E@crissov.de>

Markus Scherer <markus.icu at gmail.com>:
> 
> The other two were not Dingbats but only came from the Japanese carrier sets, for playing rock-paper-scissors.

Since U+1F596 ?? has been added, people are still missing the lizard. ;)

There are, of course, many more Roshambo variants and extensions. The reference to one,  <http://umop.com/rps25.htm>, has been mangled in my 7 Nov feedback archived at <http://www.unicode.org/L2/L2016/16327-pubrev.html#Encoding_Feedback>.

<https://en.wikipedia.org/wiki/Rock?paper?scissors>
<https://en.wikipedia.org/wiki/Sansukumi-ken>

From mark at macchiato.com  Tue Dec 27 09:16:24 2016
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Tue, 27 Dec 2016 16:16:24 +0100
Subject: About standardized variants of characters in Dingbat block
In-Reply-To: <27B4508E-38A8-4872-BB56-43304AF6E20E@crissov.de>
References: <CAF5KyEzN9heKuVpR0RkWozdT5+COf5Ffm3X2y0bh-q-njA6Fhw@mail.gmail.com>
 <CAN49p6pJU9B7+B1FsoOQLcuU6jtWSjRRiKQ9v6a6632=94Ki+w@mail.gmail.com>
 <27B4508E-38A8-4872-BB56-43304AF6E20E@crissov.de>
Message-ID: <CAJ2xs_E4ojRgxTP2Mt5+LfpLgj-8Gxy1BsxCf-J5eRg049icxg@mail.gmail.com>

> people are still missing the lizard. ;)

http://unicode.org/emoji/charts-beta/emoji-list.html#1f98e

Mark

On Tue, Dec 27, 2016 at 2:51 PM, Christoph P?per <
christoph.paeper at crissov.de> wrote:

> Markus Scherer <markus.icu at gmail.com>:
> >
> > The other two were not Dingbats but only came from the Japanese carrier
> sets, for playing rock-paper-scissors.
>
> Since U+1F596 ?? has been added, people are still missing the lizard. ;)
>
> There are, of course, many more Roshambo variants and extensions. The
> reference to one,  <http://umop.com/rps25.htm>, has been mangled in my 7
> Nov feedback archived at <http://www.unicode.org/L2/
> L2016/16327-pubrev.html#Encoding_Feedback>.
>
> <https://en.wikipedia.org/wiki/Rock?paper?scissors>
> <https://en.wikipedia.org/wiki/Sansukumi-ken>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161227/a6789e36/attachment.html>

From jsbien at mimuw.edu.pl  Tue Dec 27 09:21:53 2016
From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=)
Date: Tue, 27 Dec 2016 16:21:53 +0100
Subject: a character for an unknown character
In-Reply-To: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> (Jukka
 K. Korpela's message of "Sun, 25 Dec 2016 19:31:28 +0200")
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
Message-ID: <867f6lwg6m.fsf@mimuw.edu.pl>

On Sun, Dec 25 2016 at 18:31 CET, jkorpela at cs.tut.fi writes:
> 21.12.2016, 4:29, Martin Mueller wrote:
>
>> Is there a Unicode character that says ?I represent an alphanumerical
>> character, but I don?t know which?.
>
> I think including such a ?character? in Unicode would not fit into the
> the idea of Unicode as a system for encoding plain text
> characters. You seem to be asking for a symbol that is not a graphic
> or control character but information about uncertainty regarding a
> character a data stream. So I think this does not fall into the
> category of plain text, and the information should be expressed at a
> higher protocol level, e.g. in markup or as out-of-band information.

The situation you describe is not the situation we are talking about, at
least as far as I am concerned.

A historical corpus uses of course a markup, in our case it is the
TEI-based XML Corpus Encoding Standard. We are dealing not with data
streams, but with XML plain Unicode texts. Words/tokens are just Unicode
strings indexed by the search engine. An unreadable letter in a
word/token has to be encoded somehow without breaking the segmentation
and searching. The best way seems to use a special character to
represent it.

>
> When it is not certain what character there is in some text to be
> encoded, there is a wide range of possible situations. For example, it
> might be a thing like ?there is letter U or letter V, probably the
> latter? or ?there is some Latin letter but no hint of what it might
> be? or even ?there is an alphanumerical character? (though I find it
> difficult to imagine such a situation). Such things can hardly be
> described using new characters;

You are of course right, but has anybody proposed such an idea?

> rather, they need to be expressed
> using verbal descriptions (which are about the encoded text, not part
> of it) or some formal notations or both.

Again you are right, but it does not seem relevant to the problem.

>
>> This is a very common problem in
>> the transcription of historical texts where you have lacunas. Often, the
>> extent of the lacuna is known, and the alphabet is known as well. The
>> EEBO TCP transcriptions of English texts before 1700 are good examples.
>> They are SGML transcriptions, where missing stuff is represented by
>> <gap/> elements with attributes about this or that. This is efficient
>> when it comes to pages, very inefficient when it comes to individual
>> characters.
>
> Efficient in what sense?

It's not clear for me too. 

> Saving bytes can hardly be an issue here. And
> if various attributes are needed to describe the case, then it would
> become awkward to try to do the same with encoded characters (or
> ?characters?, Unicode code points).
>
>> In the TCP project, various code points from the Geometrical were used
>> to represent lacunae. The black circle (\u25cf) has been used as the
>> character for a missing character.This is OK and unambiguous in its
>> context.
>
> If some graphic symbol is by convention used to represent a lacuna,
> then the issue, as regards to Unicode, is simply whether that symbol
> exists as an encoded character or whether there is need to add that
> graphic symbol to Unicode. But it would be a matter of encoding
> graphic characters (irrespectively of their meaning in some content),
> not about encoding abstract ideas like ?an unrecognized character?.

I don't think it is so simple. Besides the character meaning in some
content, we have a Unicode specific meaning in the form of properties,
e.g. being a letter.

>
>> But would be nice to have a special character for just that
>> purpose
>
> Various symbols are used in different contexts to indicate situations
> like ?there is a written symbol that cannot be recognized as a
> specific character?.

Can you provide some examples of these various symbols?

> Perhaps there should be a universal convention about this, but it is
> unrealistic to expect that to happen.  The Unicode Standard can hardly
> standardize such things. And if there were such a universal symbol, it
> would surely have been encoded in Unicode?not because of its meaning,
> but because of its consistent use as a character in plain text.

You are right there is no such universal symbol at the very moment, but
your other claims IMHO are controversial.

>
> So I think the conclusion is that you should use established
> conventions, if they exist, about using some symbol for such
> situations, or define a convention as needed. You should not expect
> the character to be recognized in this special meaning without such a
> higher-level convention.

We had already defined a convention and can live with it :-) But why not
improving it?

> There?s a theoretical (?) problem with this. Let us assume that you
> decide to use a particular character to represent ?unknown character?
> in your data, when working with some type of written texts. What
> happens when you encounter, in the study of those text, a graphic
> symbol that is best identified as the character you decided to use in
> that special meaning? Well, I think you can decide to solve that
> problem if it ever appears.

What about character and glyph distinction? :-)

Best regards

Janusz


-- 
                           ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/


From charupdate at orange.fr  Tue Dec 27 10:03:44 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Tue, 27 Dec 2016 17:03:44 +0100 (CET)
Subject: a character for an unknown character
In-Reply-To: <20161227000522.7bb95f3e@JRWUBU2>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
Message-ID: <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>

On 27/12/16 01:11, Richard Wordingham wrote:
> 
> On Sun, 25 Dec 2016 19:31:28 +0200
> "Jukka K. Korpela"  wrote:
[?]
> > If some graphic symbol is by convention used to represent a lacuna,
> > then the issue, as regards to Unicode, is simply whether that symbol
> > exists as an encoded character or whether there is need to add that
> > graphic symbol to Unicode. But it would be a matter of encoding
> > graphic characters (irrespectively of their meaning in some content),
> > not about encoding abstract ideas like ?an unrecognized character?.
> 
> Unicode encodes pictograms, directives and abstract characters, not
> glyphs. There are few, if any characters, that have no semantics,
> though several characters can be ambiguous and context-sensitive as to
> what semantics they occur. If it was just a matter of appearance,
> then U+26C6 RAIN would be the character to use. It has the graphic
> used for characters in damaged inscriptions.

As far as my today?s understanding of Unicode goes, I believe that the 
?not encode glyphs but abstract characters? principle has a counterpart 
that makes Unicode characters polysemic by design, as results from 
TUS 3.3, D2. This compromise led to abandon the initially considered 
extensive disunification policy in favor of reasonable unifications that 
provided a correct benefit-cost ratio, Mark Davis explained on this List:

http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0145.html

TUS 3.2, C4 and C5 (Conformance Requirements: Interpretation) seems to me 
to be specifying that the meanings of a given character are free and may be 
defined by any human convention, provided that they don?t conflict with 
the Unicode character properties of that character.

> 
> Of course, there is one character that is already widely used in this
> r?le - U+003F QUESTION MARK. Some of its Unicode properties are not
> suitable, and its informal 'unknown character' semantic conflicts with
> its r?le as a punctuation mark.

Effectively this use of QUESTION MARK is a plague that messes up almost 
every Unicode string dropped into an ANSI-encoded document. 
The only reason I can see for its use is that amidst the ASCII characters, 
this is the one that comes closest to the intended meaning. 

RAIN seems to me best fit for the discussed usage, and I can?t see any 
problem in using it with this semantics. If I?m wrong, how about this:
U+25A8 SQUARE WITH UPPER RIGHT TO LOWER LEFT FILL

> 
> If I understand correctly, these issues are already addressed by the
> Leiden Conventions. Why do they not suffice?

I believe that they work well in historic texts that don?t use the specified 
meta language characters. The Leiden Conventions could be settled because 
brackets and parentheses aren?t found in old sources. Perhaps modern ones 
that do use these characters are never damaged and to be restored this way.

On the other hand, editors might wish to avoid mixing ASCII characters into 
original scripts. So the RAIN pictograph may be neutral enough. 
If so, the Leiden Conventions could eventually be extended to include it.

Marcel


From christoph.paeper at crissov.de  Tue Dec 27 12:15:35 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Tue, 27 Dec 2016 19:15:35 +0100
Subject: About standardized variants of characters in Dingbat block
In-Reply-To: <CAJ2xs_E4ojRgxTP2Mt5+LfpLgj-8Gxy1BsxCf-J5eRg049icxg@mail.gmail.com>
References: <CAF5KyEzN9heKuVpR0RkWozdT5+COf5Ffm3X2y0bh-q-njA6Fhw@mail.gmail.com>
 <CAN49p6pJU9B7+B1FsoOQLcuU6jtWSjRRiKQ9v6a6632=94Ki+w@mail.gmail.com>
 <27B4508E-38A8-4872-BB56-43304AF6E20E@crissov.de>
 <CAJ2xs_E4ojRgxTP2Mt5+LfpLgj-8Gxy1BsxCf-J5eRg049icxg@mail.gmail.com>
Message-ID: <E6A692A4-770C-4496-8FAB-45C062F312BA@crissov.de>

Mark Davis ?? <mark at macchiato.com>:
> 
> > people are still missing the lizard. ;)
> 
> http://unicode.org/emoji/charts-beta/emoji-list.html#1f98e

Um, ????????? 

The Lizard is a hand sign where the tips of thumb and index finger touch and point sidewards. The other 3 fingers can either also touch the thumb tip (~ Bird) or are hold in parallel to the index finger (~ Snake, where the thumb may touch a lower phalanx). If at least the middle finger was raised a bit and bend to form an eye, it would become a Duck, Dog or Fox (with raised pinkie for an ear). We have then reached the gray area between Roshambo and shadow play.

Sadly, I can?t draw well enough to illustrate a respective emoji proposal myself (and I don?t care enough either to already have written up one without graphics).

From mark at macchiato.com  Tue Dec 27 14:05:50 2016
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Tue, 27 Dec 2016 21:05:50 +0100
Subject: About standardized variants of characters in Dingbat block
In-Reply-To: <E6A692A4-770C-4496-8FAB-45C062F312BA@crissov.de>
References: <CAF5KyEzN9heKuVpR0RkWozdT5+COf5Ffm3X2y0bh-q-njA6Fhw@mail.gmail.com>
 <CAN49p6pJU9B7+B1FsoOQLcuU6jtWSjRRiKQ9v6a6632=94Ki+w@mail.gmail.com>
 <27B4508E-38A8-4872-BB56-43304AF6E20E@crissov.de>
 <CAJ2xs_E4ojRgxTP2Mt5+LfpLgj-8Gxy1BsxCf-J5eRg049icxg@mail.gmail.com>
 <E6A692A4-770C-4496-8FAB-45C062F312BA@crissov.de>
Message-ID: <CAJ2xs_GGO6LWNuKaJBaq=PYjNs3CK+-iu-CKUgYvY6NUGnsHsA@mail.gmail.com>

On Tue, Dec 27, 2016 at 7:15 PM, Christoph P?per <
christoph.paeper at crissov.de> wrote:

> ?????????
>

?I'd use:?
[image: ?] <http://unicode.org/emoji/charts-beta/full-emoji-list.html#26f0>[image:
??][image: ?][image: ??]
<http://unicode.org/emoji/charts-beta/full-emoji-list.html#1f98e>[image: ??]

Mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161227/08994b20/attachment.html>

From everson at evertype.com  Tue Dec 27 14:50:00 2016
From: everson at evertype.com (Michael Everson)
Date: Tue, 27 Dec 2016 20:50:00 +0000
Subject: About standardized variants of characters in Dingbat block
In-Reply-To: <CAJ2xs_GGO6LWNuKaJBaq=PYjNs3CK+-iu-CKUgYvY6NUGnsHsA@mail.gmail.com>
References: <CAF5KyEzN9heKuVpR0RkWozdT5+COf5Ffm3X2y0bh-q-njA6Fhw@mail.gmail.com>
 <CAN49p6pJU9B7+B1FsoOQLcuU6jtWSjRRiKQ9v6a6632=94Ki+w@mail.gmail.com>
 <27B4508E-38A8-4872-BB56-43304AF6E20E@crissov.de>
 <CAJ2xs_E4ojRgxTP2Mt5+LfpLgj-8Gxy1BsxCf-J5eRg049icxg@mail.gmail.com>
 <E6A692A4-770C-4496-8FAB-45C062F312BA@crissov.de>
 <CAJ2xs_GGO6LWNuKaJBaq=PYjNs3CK+-iu-CKUgYvY6NUGnsHsA@mail.gmail.com>
Message-ID: <85AC2F2C-F8A4-4E4F-884C-F27FB82FF1DB@evertype.com>


> On 27 Dec 2016, at 20:05, Mark Davis ?? <mark at macchiato.com> wrote:
> 
> 
> On Tue, Dec 27, 2016 at 7:15 PM, Christoph P?per <christoph.paeper at crissov.de> wrote:
> ?????????
> 
> ?I'd use:?

Yes, but most people expect hand-gestures (which is why three were in the Japanese telco set) so we?re missing the lizard-hand.

Michael

From 747.neutron at gmail.com  Tue Dec 27 22:44:18 2016
From: 747.neutron at gmail.com (=?UTF-8?B?WWlmw6FuIFfDoW5n?=)
Date: Wed, 28 Dec 2016 13:44:18 +0900
Subject: On the upcoming LATIN LETTER SMALL CAPITAL Q
In-Reply-To: <CAJKta0xaP4+2kDjFZB9EiYQh0XBKe7C2xyzup0a4SwQJQ1EtXA@mail.gmail.com>
References: <CAF5KyEwN-AC3_gdhBBsG+fihNJOYr+hXFbMGSdBiJ0UsnUnH7w@mail.gmail.com>
 <CAJ6uix5xcc=SuKpJ+o5hUcyx5Hg-08Y+aBWUy4QXiMyU9=zAMA@mail.gmail.com>
 <CAF5KyEzrD6sJuDEZaPQei4LiwiuP3a=5VGdKMfBoKV5QFOyM+g@mail.gmail.com>
 <CAJ6uix6ScNN86RuWk0gfvRLoUZWMr1hSZRDc17eBUsv0N+=XDQ@mail.gmail.com>
 <CAJ6uix7aj-uSOSJgzAaD1wyOkv3hXYNLYpf=cPMCsZCK-YG8KQ@mail.gmail.com>
 <CAGa7JC2tZ_i_c6HELi45g-athN0gJgX+VeB8DtAr4qfdeQgC5A@mail.gmail.com>
 <CAJKta0xaP4+2kDjFZB9EiYQh0XBKe7C2xyzup0a4SwQJQ1EtXA@mail.gmail.com>
Message-ID: <CAF5KyEyWvjzE2G6B_sForoVHnxKJO49trH5V7KMh0cuAz2xB8w@mail.gmail.com>

Now I start to wonder if the description would be "Letter for
phonetics and Japanese phonology" or "Letter for scholarly
transcription" etc.

2016-12-27 18:54 GMT+09:00 Denis Jacquerye <moyogo at gmail.com>:
> For what it?s worth, the small capital q was used as an IPA symbol for a
> while. It was used for the Arabic ?ayn as a ?consonne roule?e gutturale? in
> the 1898 IPA chart (previously noted 3 in the 1894 IPA charts and ? in some
> 1895 IPA charts and later charts) then as a ?consonne fricative bronchiale
> sonore? in the 1905 and 1908 IPA charts, and in the notes after the IPA
> chart in 1912. It was eventually replaced with the reversed glottal stop ?,
> for example in the 1932 IPA chart or later charts.


From asmusf at ix.netcom.com  Tue Dec 27 23:33:32 2016
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 27 Dec 2016 21:33:32 -0800
Subject: a character for an unknown character
In-Reply-To: <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
Message-ID: <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>

On 12/27/2016 8:03 AM, Marcel Schneider wrote:
> On 27/12/16 01:11, Richard Wordingham wrote:
>> On Sun, 25 Dec 2016 19:31:28 +0200
>> "Jukka K. Korpela"  wrote:
> [?]
>>> If some graphic symbol is by convention used to represent a lacuna,
>>> then the issue, as regards to Unicode, is simply whether that symbol
>>> exists as an encoded character or whether there is need to add that
>>> graphic symbol to Unicode. But it would be a matter of encoding
>>> graphic characters (irrespectively of their meaning in some content),
>>> not about encoding abstract ideas like ?an unrecognized character?.
>> Unicode encodes pictograms, directives and abstract characters, not
>> glyphs. There are few, if any characters, that have no semantics,
>> though several characters can be ambiguous and context-sensitive as to
>> what semantics they occur. If it was just a matter of appearance,
>> then U+26C6 RAIN would be the character to use. It has the graphic
>> used for characters in damaged inscriptions.
> As far as my today?s understanding of Unicode goes, I believe that the
> ?not encode glyphs but abstract characters? principle has a counterpart
> that makes Unicode characters polysemic by design, as results from
> TUS 3.3, D2. This compromise led to abandon the initially considered
> extensive disunification policy in favor of reasonable unifications that
> provided a correct benefit-cost ratio, Mark Davis explained on this List:
>
> http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0145.html
>
> TUS 3.2, C4 and C5 (Conformance Requirements: Interpretation) seems to me
> to be specifying that the meanings of a given character are free and may be
> defined by any human convention, provided that they don?t conflict with
> the Unicode character properties of that character.

(Most) character properties can be adjusted, so the statement above 
would need to
be drawn much more narrowly.

The generic issue that Unicode runs into is that there are things like 
"letters" that have
well-defined identities (the letter A), but, perhaps because of that, 
have a very wide
ranging set of real images - some of the fanciful ones may bear scant 
relation to
the archetypal shape. However, because they are members of bounded, and 
extremely
well-known sets (alphabets) users are tolerant of artistic license. In 
addition, they are
generally used in longer contexts (words) where their identity is 
reaffirmed, independent
of their shape, by occurring in the expected juxtapositions (and mostly 
not occurring in
other, unexpected ones).

However, the conventions where and when to use one of these letters are 
not fixed,
not even their phonetic equivalents.

Contrast that with many marks. The really common ones, like the period, 
are well-
known enough that fonts can substitute small squares or other shapes without
impeding their use in normal text. However, outside standard sentence 
punctuation,
they can be re-used for many other purposes. Some such uses, like the 
Swedish use
of ":" in the middle of an abbreviation, may be unusual enough to not 
readily be
catered to by all text-processing software (e.g. in word-segmentation).

Nevertheless, the same thing applies as with letters: where and when to 
use one of
these marks is not fixed as part of their encoding, not even their 
functions.

Many other "simple" marks: lines, circles, triangles, hooks, and 
squares, or groups
of them, are likewise subject to frequent reuse. Some of them may have been
incorrectly encoded more than once. Like the standard punctuation marks, 
both
their precise shapes and precise functions are subject to stylistic or 
other conventions.

When it comes to marks (or symbols) of less generic or more complex 
shapes, the
presumption that the mark only has "one" shape may be more common, and 
examples of the mark
being repurposed may be less common.  Not being as common, fewer readers 
will
recognize all stylistic variations as being "the same thing". A variant 
form will be more
likely to be understood as a related, but not identical symbol. That in 
turn fuels the
misperception that Unicode somehow encodes symbols based on a single
conventional usage.

A./


From christoph.paeper at crissov.de  Wed Dec 28 06:57:47 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Wed, 28 Dec 2016 13:57:47 +0100
Subject: About standardized variants of characters in Dingbat block
In-Reply-To: <85AC2F2C-F8A4-4E4F-884C-F27FB82FF1DB@evertype.com>
References: <CAF5KyEzN9heKuVpR0RkWozdT5+COf5Ffm3X2y0bh-q-njA6Fhw@mail.gmail.com>
 <CAN49p6pJU9B7+B1FsoOQLcuU6jtWSjRRiKQ9v6a6632=94Ki+w@mail.gmail.com>
 <27B4508E-38A8-4872-BB56-43304AF6E20E@crissov.de>
 <CAJ2xs_E4ojRgxTP2Mt5+LfpLgj-8Gxy1BsxCf-J5eRg049icxg@mail.gmail.com>
 <E6A692A4-770C-4496-8FAB-45C062F312BA@crissov.de>
 <CAJ2xs_GGO6LWNuKaJBaq=PYjNs3CK+-iu-CKUgYvY6NUGnsHsA@mail.gmail.com>
 <85AC2F2C-F8A4-4E4F-884C-F27FB82FF1DB@evertype.com>
Message-ID: <9B091764-CD28-4FA5-A241-ABAB975F4E80@crissov.de>

Michael Everson <everson at evertype.com>:
> On 27 Dec 2016, at 20:05, Mark Davis ?? <mark at macchiato.com> wrote:
>> On Tue, Dec 27, 2016 at 7:15 PM, Christoph P?per <christoph.paeper at crissov.de> wrote:
>>> ?????????
>> ?I'd use:?

I guess I?d prefer ?? in combination with ??????? ? or ?? U+1F9DD Elf perhaps. OK Sign may be an acceptable substitute, pending future additions: ????????, because it differs enough from the other gestures.

<http://www.samkass.com/theories/RPSSL.html>

> Yes, but most people expect hand-gestures (which is why three were in the Japanese telco set) so we?re missing the lizard-hand.

The question ? like always with new emojis ? is where to start and stop. Using just the info from the (improvable) English Wikipedia article, you?d need at least these for common variants and extensions:

 * Malaysia/Singapore: rock-water-bird = ????? = ??? ? ????
 * France/Germany: rock-paper-scissors-well-bull = ???????? = ?????? ? ????????
 * China/Japan (mushi-ken): slug-frog-snake = ??????/?????? = ?????/????? = pinkie-thumb-index

Lizard, Bird and Well differ substantially from each other, but can all be approximated by the same emoji ??, because they aren?t used together in a game, i.e. they don?t form minimal pairs in practice. The Bull also looks slightly different from the Devil and Fox signs (used elsewhere), but can be represented well enough by the ?? emoji for the same reasons.

The only gesture that cannot be approximated by existing emojis is the raised pinkie finger for a slug or caterpillar.

There are, of course, many other gestures that could be (and probably somewhere are being) used to play hand games, see <http://www.umop.com/rps.htm> for instance. Any of them (and others) may also acquire meaning outside games to visually augment or substitute acoustic forms of communication. In some, mostly informal kinds of computer-mediated textual communication (i.e. messaging, texting, chatting, tweeting, advertising ?), users employ traditional characters (?letters?, ?words?) for transcribing the oral-aural part, but emojis of manual gestures and facial expressions to transmit their otherwise lost ?body language?, as well as pictorial objects and symbols for deixis, abbreviation, tagging and feelings. Roughly speaking. 

I believe there should be separate hand emojis for all conventionalized paralinguistic gestures that are or were being used with spoken language and could be used with written language as well. Similar looking (e.g. turned/flipped/rotated/reversed) ones and chiral variants may be unified *unless* they are likely to appear in the same context with different meaning, cf. vertical ?? Call Me vs. horizontal Shaka/Hang Loose or vertical L/Loser vs. horizontal Finger Gun.

From charupdate at orange.fr  Wed Dec 28 09:25:48 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 28 Dec 2016 16:25:48 +0100 (CET)
Subject: French Superscript Abbreviations Fit Plain Text Requirements (was:
 Re: a character for an unknown character)
In-Reply-To: <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
Message-ID: <1877734343.9096.1482938748638.JavaMail.www@wwinf1p14>

?
I?m gladly surprised that this thread has unexpectedly come to a point 
where I?d be able to spin off a topic written up in 1st draft 18 days ago.
?
On Tue, 27 Dec 2016 21:33:32 -0800, Asmus Freytag wrote:
>
[quoted mails]
> 
> (Most) character properties can be adjusted, so the statement above would 
> need to be drawn much more narrowly. 
>
> The generic issue that Unicode runs into is that there are things like 
> "letters" that have well-defined identities (the letter A), but, perhaps 
> because of that, have a very wide ranging set of real images - some of the 
> fanciful ones may bear scant relation to the archetypal shape. However, 
> because they are members of bounded, an extremely well-known sets 
> (alphabets) users are tolerant of artistic license. In addition, they are 
> generally used in longer contexts (words) where their identity is 
> reaffirmed, independent of their shape, by occurring in the expected 
> juxtapositions (and mostly not occurring in other, unexpected ones). 
> 
> However, the conventions where and when to use one of these letters are not 
> fixed, not even their phonetic equivalents.
?
As far as I understand the issue until here, Unicode does not fix usage 
conventions, but merely gives some hints to the Code Charts and TUS reader 
as of the original encoding rationale, and sometimes some later added contexts 
where the character is found. For instance, U+202F NARROW NO-BREAK SPACE has 
been encoded for Mongolian, but is also used in French, where it is preferred 
with certain punctuations. TUS 9.0, ?6.2, Space Characters, Narrow No-Break Space, 
p. 269, says it ?can be used to represent the narrow space occurring around 
punctuation characters in French typography, which is called an ?espace fine 
ins?cable.?? 
?
> 
> Contrast that with many marks. The really common ones, like the period, are 
> well- known enough that fonts can substitute small squares or other shapes 
> without impeding their use in normal text. However, outside standard 
> sentence punctuation, they can be re-used for many other purposes. Some 
> such uses, like the Swedish use of ":" in the middle of an abbreviation, 
> may be unusual enough to not readily be catered to by all text-processing 
> software (e.g. in word-segmentation).
?
>From this I conclude that a given character can be used following any convention, 
regardless of the percentage of software that isn?t yet up-to-date to handle it 
correctly in every circumstance, including but not limited to equivalence classes. 
?
This is relevant to the representation of abbreviations in French, that doesn?t 
use a colon or an (in-word) period when it comes to abbreviate for example the 
word for 'numbers', or ordinals like '2nd', '3rd', '4th'. Before the recent threads, 
this has been discussed a decade ago, so I?ll pick out some highlights below.
?
> 
> Nevertheless, the same thing applies as with letters: where and when to use 
> one of these marks is not fixed as part of their encoding, not even their 
> functions.
?
So definitely the use of superscript Latin letters can scarcely be limited to 
IPA, though most of them were initially intended for (i.e. encoded for) phonetic 
transcriptions. But cross-checking the relevant parts of the Standard [1][2] 
leads to conclude that their use must be necessary for an unambigous representation 
in plain text, following the Unicode definition of plain text: ?/Plain text must 
contain enough information to permit the text to be rendered legibly, and 
nothing more./ \r\n The Unicode Standard encodes plain text.? (TUS 9.0, p. 19.)
?
Applied to the French abbreviation of ?num?ros? (numbers), that means that the 
abbreviation?s final letters 'os' *must not* be formatted as superscript: Since 
?the extra information in rich text can always be stripped away to reveal the 
?pure? text underneath? (TUS, ibid.), 'n^{os}' would end up as 'nos' (?our?, 
with a plural noun). Consequently, best practice is to represent it using the 
Unicode superscript ?modifier letters?: 'n??'.
?
> 
> Many other "simple" marks: lines, circles, triangles, hooks, and squares, 
> or groups of them, are likewise subject to frequent reuse. Some of them may 
> have been incorrectly encoded more than once. Like the standard punctuation 
> marks, both their precise shapes and precise functions are subject to 
> stylistic or other conventions.
?
>From this, it seems doubtful whether encoding the superscript small letter e 
more than once would be accepted, since the possible rationale is mere fine-
tuning of the vertical alignment (modifier letters being typically less raised 
than formatted superscripts).
?
> 
> When it comes to marks (or symbols) of less generic or more complex shapes, 
> the presumption that the mark only has "one" shape may be more common, and 
> examples of the mark being repurposed may be less common. Not being as 
> common, fewer readers will recognize all stylistic variations as being "the 
> same thing". A variant form will be more likely to be understood as a 
> related, but not identical symbol. That in turn fuels the misperception 
> that Unicode somehow encodes symbols based on a single conventional usage. 
?
I?m likely to believe that this settles all objections to the use of modifier 
letters as superscripts wherever appropriate, as being ?non-standard?, ?a hack? 
and the like. Such a narrow reading of the Unicode documentation is thus due 
to a misperception that is fueled by a current user experience, additionally to 
the TUS disclaimer that these characters are not intended to replace generic 
formatting. The very reason why this guideline is applied to French abbreviations, 
seems to be that relegating their correct representation into the realm of 
higher-level protocols has been the way (why not calling it the ?hack?), along 
with the use of other available means (mainly the degree sign), to represent 
them unambiguously even before Unicode provided the superscript letters. 
?
The unambiguous and coherent representation of abbreviations with superscript 
letters from plain text on upwards, gives eventually the French language the 
status of an exception, admitting that in English, superscripting in this 
context is a mere styling issue. But there seems not to be so much of an 
exception, since ordinal indicators have been encoded for a small set of 
languages in earlier standards. Adding one more exception will have very few 
consequences on Unicode?s side. Presumably they won?t exceed the encoding of 
MODIFIER LETTER SMALL Q, that was and is already the subject or a part of 
past and eventual proposals, most of them not implying French. Please note again 
that abbreviations like '2^{?me}' are officially deprecated in favor of '2^{e}' 
style forms, so there is very scarce need of diacritics in French abbreviations.
Examples include 'S^{t?}' (?Soci?t??, Corporation). So updating fonts to support 
combining diacritics here would be handy.

Above all, adding a corresponding statement in the Core Specifications, like for 
the ??espace fine ins?cable??, would be nice, to make everybody at ease.
?

?
Additionally I?d like to cite the related _2006 thread_, quoting some snippets 
that seem particularly interesting to me, some ? but not all ? of which are 
referring to French abbreviations (a topic that already spun off another one):
?
http://www.unicode.org/mail-arch/unicode-ml/y2006-m03/0270.html
On Fri Mar 24 2006 - 17:15:22 CST, Kent Karlsson wrote:
> Antoine Leca wrote: 
[?]
> > [?] but it probably has to be done outside of the 
> > codepoints 
> That would be too frail, and not reliable. 
?
http://www.unicode.org/mail-arch/unicode-ml/y2006-m03/0290.html
On Mon Mar 27 2006 - 09:03:48 CST, Antoine Leca wrote:
> Kent Karlsson wrote: 
> > Antoine Leca wrote: 
[?]
> > > Are you intending to say that if I wrote "Mme" (Mrs in 
> > > French), I should differentiate, in a not yet standardised 
> > > way, the fact that I write it with superscript characters 
> > > or not? Saying it is a "spelling" difference? 
> > Definitely. In this particular case one may debate whether to use 
> > markup or to (ab)use U+1D50 MODIFIER LETTER SMALL M and 
> > U+1D49 MODIFIER LETTER SMALL E. 
> Put it in clear: to write the French equivalent of Mrs, I can: 
> - either write the slightly incorrect Mme 
> - or write the more "correct" M[][] (where [] represent the empty box that 
> everybody except four cats will effectively see). 
> Somewhere I am thinking this is *not* a working solution. 
?
http://www.unicode.org/mail-arch/unicode-ml/y2006-m03/0291.html
On Mon Mar 27 2006 - 09:36:31 CST, Doug Ewell replied to Antoine Leca:
[?]
> This is what I consider the Great Unicode Conundrum. We want to use the 
> rich character repertoire that Unicode provides, but we also want to 
> avoid displaying mojibake on the user's screen, causing him to mumble 
> and curse about "that stupid Unicode" or giving him security concerns. 
> The problem persists because popular fonts are not always updated 
> quickly and inexpensively to support new and rare Unicode characters. 
> So we avoid using rare and -- more importantly -- newly added 
> characters, preferring ASCII fallbacks of the sort Unicode was intended 
> to replace. 
?
http://www.unicode.org/mail-arch/unicode-ml/y2006-m03/0293.html
On Tue Mar 28 2006 - 02:19:49 CST, Antoine Leca replied to Doug Ewell:
[?]
> While I agree with your pertinent remark on a general way, in THIS case I 
> believe this is not adequate. Those two characters (U+1D50 and U+1D49, ??) 
> do not seem to me to be intended for French abbreviations (or any written 
> language typographics effects), but rather for phonetics. As a result, it 
> seems difficult to me to ask French people to have phonetics-specialized 
> fonts, in order to read something as common as the abbreviation for Mrs, 
> just because it caught the attention of someone that those characters almost 
> fit that particular needs. 
> I can be wrong though. 
?
This may have been right by the time, while today most current fonts support them.
The mail is continued; interested readers are welcome to read more in the archive.
And a last one:
?
http://www.unicode.org/mail-arch/unicode-ml/y2006-m03/0294.html
On Tue Mar 28 2006 - 04:12:10 CST, Keutgen, Walter replied:
[?]
> In real every day usage, in French 'mechanical typewriting', 'PC typewriting' and 
> *hand writing* one did/does not superscript the endings of the abbreviations 
> 'Mme, Mmes, Mlle, Mlles, Dr, Drs, Ir, Irs'. 
> In hand writing one always uses superscripts for ordinal numbers, which is not 
> possible in flat text PC writing and required some fumbling whith the mechanical 
> typewriter. I.e. '1er', '2?me' or '2e' etc require superscripts, likewise the forms 
> derived from the Latin wording '1o, 2o' etc for which one uses the '?'. One also 
> often sees Me (ma?tre = master in law) with a superscripted e. Now as to know 
> whether '?' is a superscripted 'o' or the degree sign, my keyboard does not tell 
> me. I would bed however that the '?' often is smaller than the superscripted e. 
?

Users can thus feel free to follow an already existing practice, by upgrading 
whether the keyboard layout, or the autocorrect or AutoHotKey or whatever. 
Cf. the Bing search results for "?":
https://www.bing.com/search?q=%E1%B5%89&PC=U316&FORM=CHROMN
?
Best regards,
?
Marcel
?
[1] TUS 9.0, ?7.8, p. 327:
http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#G24762
?
[2] TUS 9.0, ?22.4, p. 786:
http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf#G42931


From wjgo_10009 at btinternet.com  Wed Dec 28 08:23:03 2016
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 28 Dec 2016 14:23:03 +0000 (GMT)
Subject: Emoji as Art
In-Reply-To: <23843016.51094.1481914235485.JavaMail.defaultUser@defaultHost>
References: <5022612.29417.1481726866273.JavaMail.defaultUser@defaultHost>
 <23843016.51094.1481914235485.JavaMail.defaultUser@defaultHost>
Message-ID: <12613307.13913.1482934983042.JavaMail.defaultUser@defaultHost>

I have been looking again at the images in the http://www.users.globalnet.co.uk/~ngo/emoji_installation_at_MoMA.htm web page with a view to trying to write more about the installation.

I noticed that, although the images of the emoji are in colour, none of the glyphs has more than one colour used within it.

So I am wondering whether there could be a second installation about emoji with a name such as the following.

Inbox 2: Emoji with more than one colour in each glyph

Today, emoji often, even usually, each have more than one colour in the glyph.

So I am wondering when did emoji with more than one colour in them first come into use.

I have tried to find an answer but have not yet found one.

Does anyone know please?

William Overington

Wednesday 28 December 2016


----Original message----
>From : wjgo_10009 at btinternet.com
Date : 16/12/2016 - 18:50 (GMTST)
To : unicode at unicode.org
Subject : Re: Emoji as Art

Here is a link to a web page that has some pictures of the emoji installation at the Museum of Modern Art, MoMA, in New York, the pictures shown at one quarter of the size of the original pictures that were kindly supplied by MoMA. Thank you to MoMA for the pictures. 

http://www.users.globalnet.co.uk/~ngo/emoji_installation_at_MoMA.htm

William Overington

Friday 16 December 2016


From asmusf at ix.netcom.com  Wed Dec 28 15:47:00 2016
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 28 Dec 2016 13:47:00 -0800
Subject: French Superscript Abbreviations Fit Plain Text Requirements
In-Reply-To: <1877734343.9096.1482938748638.JavaMail.www@wwinf1p14>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <1877734343.9096.1482938748638.JavaMail.www@wwinf1p14>
Message-ID: <72c27940-ca8e-f616-48d5-5ee58835b354@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161228/9a9c6302/attachment.html>

From asmusf at ix.netcom.com  Wed Dec 28 15:55:26 2016
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 28 Dec 2016 13:55:26 -0800
Subject: Emoji as Art
In-Reply-To: <12613307.13913.1482934983042.JavaMail.defaultUser@defaultHost>
References: <5022612.29417.1481726866273.JavaMail.defaultUser@defaultHost>
 <23843016.51094.1481914235485.JavaMail.defaultUser@defaultHost>
 <12613307.13913.1482934983042.JavaMail.defaultUser@defaultHost>
Message-ID: <0dc27d11-b46c-a718-e21b-9f40a95cff12@ix.netcom.com>

On 12/28/2016 6:23 AM, William_J_G Overington wrote:
> I have been looking again at the images in the http://www.users.globalnet.co.uk/~ngo/emoji_installation_at_MoMA.htm web page with a view to trying to write more about the installation.
>
> I noticed that, although the images of the emoji are in colour, none of the glyphs has more than one colour used within it.
>
> So I am wondering whether there could be a second installation about emoji with a name such as the following.
>
> Inbox 2: Emoji with more than one colour in each glyph
>
> Today, emoji often, even usually, each have more than one colour in the glyph.
>
> So I am wondering when did emoji with more than one colour in them first come into use.
>
> I have tried to find an answer but have not yet found one.
>
> Does anyone know please?
I think you might be missing the point.

There were severe limitations in early years about what could be 
presented on small devices; the selections shows compromises typical to 
that era.

A./
>
> William Overington
>
> Wednesday 28 December 2016
>
>
> ----Original message----
> >From : wjgo_10009 at btinternet.com
> Date : 16/12/2016 - 18:50 (GMTST)
> To : unicode at unicode.org
> Subject : Re: Emoji as Art
>
> Here is a link to a web page that has some pictures of the emoji installation at the Museum of Modern Art, MoMA, in New York, the pictures shown at one quarter of the size of the original pictures that were kindly supplied by MoMA. Thank you to MoMA for the pictures.
>
> http://www.users.globalnet.co.uk/~ngo/emoji_installation_at_MoMA.htm
>
> William Overington
>
> Friday 16 December 2016
>
>


From richard.wordingham at ntlworld.com  Wed Dec 28 19:47:59 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 29 Dec 2016 01:47:59 +0000
Subject: a character for an unknown character
In-Reply-To: <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
Message-ID: <20161229014759.5a51c747@JRWUBU2>

On Tue, 27 Dec 2016 21:33:32 -0800
Asmus Freytag <asmusf at ix.netcom.com> wrote:

> When it comes to marks (or symbols) of less generic or more complex 
> shapes, the
> presumption that the mark only has "one" shape may be more common,
> and examples of the mark
> being repurposed may be less common.  Not being as common, fewer
> readers will
> recognize all stylistic variations as being "the same thing". A
> variant form will be more
> likely to be understood as a related, but not identical symbol. That
> in turn fuels the
> misperception that Unicode somehow encodes symbols based on a single
> conventional usage.

The idea of a single conventional usage is also fuelled by a number of
practices and policies:

1) A letter belongs to a single script (not to be confused with
writing system)

2) Distinction of punctuation and modifier letters, e.g. the highly
confusing distinction between U+2019 RIGHT SINGLE QUOTATION MARK and
U+02BC MODIFIER LETTER APOSTROPHE

3) The resolution of U+002D HYPHEN-MINUS into U+2010 HYPHEN, U+2212
MINUS SIGN and a few minor punctuation marks

4) Distinction between decimal digits and letters

5) The nightmare of spacing single and double dots.

Ideal solutions can also be defeated by limited keyboard layouts.  As a
result, I have no idea whether the singular of "fithp" (one of Larry
Niven's alien species) should be spelt with U+02BC or U+2019, though in
ASCII I can just write "fi'".

Richard.


From asmusf at ix.netcom.com  Wed Dec 28 21:05:17 2016
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 28 Dec 2016 19:05:17 -0800
Subject: a character for an unknown character
In-Reply-To: <20161229014759.5a51c747@JRWUBU2>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
Message-ID: <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>

On 12/28/2016 5:47 PM, Richard Wordingham wrote:
> On Tue, 27 Dec 2016 21:33:32 -0800
> Asmus Freytag <asmusf at ix.netcom.com> wrote:
>
>> When it comes to marks (or symbols) of less generic or more complex
>> shapes, the
>> presumption that the mark only has "one" shape may be more common,
>> and examples of the mark
>> being repurposed may be less common.  Not being as common, fewer
>> readers will
>> recognize all stylistic variations as being "the same thing". A
>> variant form will be more
>> likely to be understood as a related, but not identical symbol. That
>> in turn fuels the
>> misperception that Unicode somehow encodes symbols based on a single
>> conventional usage.
> The idea of a single conventional usage is also fuelled by a number of
> practices and policies:
>
> 1) A letter belongs to a single script (not to be confused with
> writing system)
Making or not making that distinction makes some stuff easier and other 
stuff
harder to support in software. Overall, I think Unicode got this one right.
>
> 2) Distinction of punctuation and modifier letters, e.g. the highly
> confusing distinction between U+2019 RIGHT SINGLE QUOTATION MARK and
> U+02BC MODIFIER LETTER APOSTROPHE

I'm beginning to thing that 02BC is closer to a mistake than a correct 
solution;
there are places where it has to be treated on the same footing as 2019 
even
though the idea was to give it different properties.

>
> 3) The resolution of U+002D HYPHEN-MINUS into U+2010 HYPHEN, U+2212
> MINUS SIGN and a few minor punctuation marks
HYPHEN-MINUS is a bad example, because it's a conflation of several
quite distinct elements of type a single key for purposes of type writers.


>
> 4) Distinction between decimal digits and letters
>
> 5) The nightmare of spacing single and double dots.
?  spacing vs. combining? Not sure what you mean.
>
> Ideal solutions can also be defeated by limited keyboard layouts.  As a
> result, I have no idea whether the singular of "fithp" (one of Larry
> Niven's alien species) should be spelt with U+02BC or U+2019, though in
> ASCII I can just write "fi'".

The only place where "uni" doesn't apply in Unicode is that there's 
never just
a single principle that applies, but always multiple ones that are in 
tension --- and in
the edge cases, the tension can be felt keenly.

A./
>
> Richard.
>
>
>
>
>
>


From verdy_p at wanadoo.fr  Thu Dec 29 02:35:54 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 29 Dec 2016 09:35:54 +0100
Subject: French Superscript Abbreviations Fit Plain Text Requirements
In-Reply-To: <72c27940-ca8e-f616-48d5-5ee58835b354@ix.netcom.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <1877734343.9096.1482938748638.JavaMail.www@wwinf1p14>
 <72c27940-ca8e-f616-48d5-5ee58835b354@ix.netcom.com>
Message-ID: <CAGa7JC0TiB7Nt9ecHybhOfb+_gnuqtz6mYH0yG0z5PVueu9UhA@mail.gmail.com>

I agree. Even for the abbreviation "N<sup>os</sup>" or "n<sup>os</sup>",
there's no ambiguity due to the grammar (in a sentence the abbreviation
would be preceded by an article ("les nos 2 et 3") or a noun ("les articles
nos 2 et 3) and followed by numerals and this cannot be analyzed like the
possessive "nos" which cannot appear after an article or noun.

If you want to represent only plaintext the typographic superscripts could
still be replaced by inserting an abbreviation dot ("les n.os 2 et 3) or by
not abbreviating it at all ("les num?ros 2 et 3"). These superscript are
presentational only. The same applies to other abbreviations such as "Mgr"
("Monseigneur", which can be typeset as "M<sup>gr</sup>", "Bd"
("Boulevard", typeset as "B<sup>d</sup>"), "Mlle" ("Mademoiselle", typeset
as "M<sup>lle</sup>") and many, many abbreviations suffixing the last
letters. of a word that are preferably typeset using superscripts, but that
are still normal Latin letters, including letters with accents (notably "?"
which is frequent at end of French participles or nouns and which has no
encoded superscript variant).

Adding superscript variants (or other typographic variants) in Unicode for
that use would mean reencoding thousands letters in many scripts and in a
dozen of stylistic variants. This is not the way to go.

Plain text documents have their constraints, if clarity is needed they are
necessarily modified with additional text, but converting a rich text to
plain text and dropping all styles is destructive and may cause ambiguity
in some rare cases. But language semantics and grammar most often resolve
them to give sense to that text and abbreviations in plain text will still
be readable in most cases.

2016-12-28 22:47 GMT+01:00 Asmus Freytag <asmusf at ix.netcom.com>:

> On 12/28/2016 7:25 AM, Marcel Schneider wrote:
>
> Applied to the French abbreviation of ?num?ros? (numbers), that means that the
> abbreviation?s final letters 'os' **must not** be formatted as superscript: Since
> ?the extra information in rich text can always be stripped away to reveal the
> ?pure? text underneath? (TUS, ibid.), 'n^{os}' would end up as 'nos' (?our?,
> with a plural noun). Consequently, best practice is to represent it using the
> Unicode superscript ?modifier letters?: 'n??'.
>
> This is seriously overstating the plain text principle.
>
> There are many places where formatting affects the reading (and not just
> the presentation) of the text. In some cases, it is appropriate to encode
> characters for that, in other places the conclusion is simply that plain
> text is not sufficient.
>
> In English, superscript is used for ordinal numbers. The fallback without
> superscript tends to be functional, because of the alternation between
> digits and letters, but there's nothing "pure" about it.
>
> Some sentences in English can be parsed ambiguously; the convention in
> print has been to italicize the word intended to take the stress. Here, the
> plain-text fallback is less functional, as it re-introduces the ambiguity.
>
> There is no rule that says that *all* content information *must* be
> expressible on the plain text level. Some edge cases exist, where other
> layers, by necessity, participate.
>
> Mathematical notation is a good example of such a mixed case: while
> ordinary variables can be expressed in plain text with the help of
> mathematical alphabets, the proper display of formulas requires markup.
> Even Murray Sargent's plain text math is markup, albeit a very clever one
> that re-uses conventions used for the inline presentation of mathematical
> expression. (Where that is insufficient, it introduces additional
> conventions, clearly extraneous to the content, and hence markup).
>
> The encoding conventions (principles) chosen by Unicode stipulate that for
> ordinary text (not notations) any part of the content that requires
> alternate presentation (italics, superscript, etc) is to supplied via
> styles, not coded characters. In contrast, for scholarly or technical
> notation, that requirement is relaxed.
>
> As long as French is ordinary text, the abbreviations require styled
> (rich) text.
>
> A./
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161229/4dc045a9/attachment.html>

From wjgo_10009 at btinternet.com  Thu Dec 29 09:20:11 2016
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 29 Dec 2016 15:20:11 +0000 (GMT)
Subject: a character for an unknown character
In-Reply-To: <4AFB4506-093F-4045-9553-AF26D3A79E66@northwestern.edu>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <16243692.18677.1482408231135.JavaMail.defaultUser@defaultHost>
 <4AFB4506-093F-4045-9553-AF26D3A79E66@northwestern.edu>
Message-ID: <24703196.21991.1483024811178.JavaMail.defaultUser@defaultHost>

Martin Mueller wrote: 

> But for the purposes of my project, which involves folks here, there, and everywhere working on editorial problems relating to digital transcriptions of Early Modern texts, ....

Whilst recognising that I am going somewhat off the specific topic of this thread, yet still on the topic of digital transcriptions of Early Modern texts and hoping that such a discussion may also be of interest to some readers of this mailing list, could you possibly say some more about your project please? For example, are the digital transcriptions of handwritten texts or of printed texts or both please?

A particular matter about digital transcription of printed texts from the 15th to 18th Centuries that interests me is as to how one transcribes each of a ligature; a swash letter; the probable use of a logotype such as Que used, for example, in the word Queen where the tail of the Q goes as far as beneath the first e of the word. I have seen such a use of Que in a 17th Century printed book in an exhibition, in the body text, using a typeface at about a 12 point or 14 point size (here I use point size of a typeface in the traditional sense, as the vertical size of the whole piece of metal type). The typeface was roman, not italic.

In Unicode a ligature can be specified using the U+200D ZERO WIDTH JOINER character yet that is a format character: what should be used to indicate to a reader of a transcript that two or more characters are ligated?

How should the fact that a swash italic character has been used be indicated?

How should one indicate the probable use of a logotype so that the information of that use becomes conserved in the transcript?

William Overington

Thursday 29 December 2016


From doug at ewellic.org  Thu Dec 29 15:05:42 2016
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 29 Dec 2016 14:05:42 -0700
Subject: Hmong orthographies
Message-ID: <20161229140542.665a7a7059d7ee80bb4d670165c8327d.350e45b982.wbe@email03.godaddy.com>

Martin J. D?rst wrote:

> If in general, people (e.g. for searches) care mostly about whether
> something is written in Latin or Pahawh Hmong, and the variant is only
> icing on the cake, then using tags like mww-Hmng-pahawh2 is advisable,
> because such documents will be found with searches for mww-Hmng. If on
> the other hand, the variants are what counts (e.g. if there are people
> who can read one of the variants but would be baffled by another) then
> the script can be easily left out.

If the Prefix for 'pahawh2' in the Registry is "mww", users can write
either "mww-pahawh2" or "mww-Hmng-pahawh2" with equal conformity to BCP
47.

If the Prefix is "mww-Hmng", then using "mww" alone before the variant
(i.e. "mww-pahawh2") would be regarded as valid, but less appropriate or
suitable.

Prefix values should be as restrictive as necessary to encourage good
usage, _and no more so_.


--
Doug Ewell | Thornton, CO, US | ewellic.org


From charupdate at orange.fr  Thu Dec 29 15:20:24 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 29 Dec 2016 22:20:24 +0100 (CET)
Subject: French Superscript Abbreviations Fit Plain Text Requirements
In-Reply-To: <72c27940-ca8e-f616-48d5-5ee58835b354@ix.netcom.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <1877734343.9096.1482938748638.JavaMail.www@wwinf1p14>
 <72c27940-ca8e-f616-48d5-5ee58835b354@ix.netcom.com>
Message-ID: <497101423.8896.1483046424026.JavaMail.www@wwinf1h16>

Thank you for your answers and advice. Some points however remain 
still unclear to me.
?
On Wed, 28 Dec 2016 13:47:00 -0800, Asmus Freytag wrote:
> 
> On 12/28/2016 7:25 AM, Marcel Schneider wrote:
> >
> > Applied to the French abbreviation of ?num?ros? (numbers), that means that the 
> > abbreviation?s final letters 'os' *must not* be formatted as superscript: Since 
> > ?the extra information in rich text can always be stripped away to reveal the 
> > ?pure? text underneath? (TUS, ibid.), 'n^{os}' would end up as 'nos' (?our?, 
> > with a plural noun). Consequently, best practice is to represent it using the 
> > Unicode superscript ?modifier letters?: 'n??'.
> 
> This is seriously overstating the plain text principle.
> 
> There are many places where formatting affects the reading (and not just 
> the presentation) of the text. In some cases, it is appropriate to encode 
> characters for that, in other places the conclusion is simply that plain 
> text is not sufficient.
> 
> In English, superscript is used for ordinal numbers. The fallback without 
> superscript tends to be functional, because of the alternation between 
> digits and letters, but there's nothing "pure" about it.
> 
> Some sentences in English can be parsed ambiguously; the convention in 
> print has been to italicize the word intended to take the stress. Here, the 
> plain-text fallback is less functional, as it re-introduces the ambiguity.
> 
> There is no rule that says that *all* content information *must* be 
> expressible on the plain text level. Some edge cases exist, where other 
> layers, by necessity, participate.
> 
> Mathematical notation is a good example of such a mixed case: while 
> ordinary variables can be expressed in plain text with the help of 
> mathematical alphabets, the proper display of formulas requires markup. 
> Even Murray Sargent's plain text math is markup, albeit a very clever one 
> that re-uses conventions used for the inline presentation of mathematical 
> expression. (Where that is insufficient, it introduces additional 
> conventions, clearly extraneous to the content, and hence markup).
> 
> The encoding conventions (principles) chosen by Unicode stipulate that for 
> ordinary text (not notations) any part of the content that requires 
> alternate presentation (italics, superscript, etc) is to supplied via 
> styles, not coded characters. In contrast, for scholarly or technical 
> notation, that requirement is relaxed.
> 
> As long as French is ordinary text, the abbreviations require styled (rich) 
> text.
?
I see that this makes for a much more streamlined implementation, because of 
the thousands of decorative fonts that don?t supply the modifier letters. So 
my ?*must not*? was too harsh. On the other hand, I see an issue about whether 
to stick with legacy practice, or to allow the user to choose an alternate way. 
?
According to TUS (9.0, ?22.4, p. 786), vertical alignment in '1^{st}' and in 
'DC00_{16}' is to be handled with markup. In the latter case, this Unicode 
recommendation leads to content corruption when the related markup is stripped 
off. That may occur sooner than expected, e.g. in Word (2010) when a character 
style is applied. In the former case, if there's nothing "pure" about '1st' and 
the other English ordinal plain ASCII fallbacks, the actual Unicode recommendation 
can hardly be the last word here neither. Distinguishing mathematical notation 
(that the base of the numeral system seems to to be considered to belong to) and 
technical notation may also add to the problem. Writing '1??' and 'DC00??' could 
be a way to solve it.
?
Another?admittedly much more straightforward?way to solve the problem is to stick 
with baseline letters and punctuation. German and French may denote stress with 
titlecase (?Nur Eine m?gliche L?sung? [?Only one possible solution?], sometimes 
considered obsolete; ?? la Une? [?On the cover page?], current French), while 
Dutch uses the (combining) acute accent (subject of a recent thread). 
As of italics, they can be avoided in English and French if the sentence is worded 
differently (as in ?Superscript /can/ be used in abbreviations, but in some 
languages this is not mandatory? becomes ?Albeit superscript can be used [?].?
Effectively in Spanish there seems to be a move from superscript to baseline 
letters in abbreviations, so that ?Se?or? and ?Se?ora? shorten to ?Sr.? and ?Sra.?, 
preferredly to the (obsolete) ?S.^r? and ?S.^a? (the latter sometimes written using 
the feminine ordinal indicator: ?N? S?? [?Our Lady?]). 
?
On Thu, 29 Dec 2016 09:35:54 +0100, Philippe Verdy wrote:
> 
> I agree. Even for the abbreviation "N<sup>os</sup>" or "n<sup>os</sup>",
> there's no ambiguity due to the grammar (in a sentence the abbreviation
> would be preceded by an article ("les nos 2 et 3") or a noun ("les articles
> nos 2 et 3) and followed by numerals and this cannot be analyzed like the
> possessive "nos" which cannot appear after an article or noun.
?
You are plain right, my demonstration was too abstract. ?Nos nos 2 et 3? 
[?Our nos. 2 and 3?] would be a bit confusing, though.
?
> 
> If you want to represent only plaintext the typographic superscripts could
> still be replaced by inserting an abbreviation dot ("les n.os 2 et 3) or by
> not abbreviating it at all ("les num?ros 2 et 3"). These superscript are
> presentational only. The same applies to other abbreviations such as "Mgr"
> ("Monseigneur", which can be typeset as "M<sup>gr</sup>", "Bd"
> ("Boulevard", typeset as "B<sup>d</sup>"), "Mlle" ("Mademoiselle", typeset
> as "M<sup>lle</sup>") and many, many abbreviations suffixing the last
> letters. of a word that are preferably typeset using superscripts, but that
> are still normal Latin letters, including letters with accents (notably "?"
> which is frequent at end of French participles or nouns and which has no
> encoded superscript variant).
?
The idea is mainly that if one is bound to plain text and nevertheless wants to 
follow high-end presentation rules as a mark of respect, then it would be a pity 
not to use the means that are already available in Unicode and current fonts.
?
Another advantage of the use of modifier letters is stability. A source text that 
has these superscripts hard-coded, can either be used as-is, or it can be parsed 
for these modifier letters to be replaced with styled baseline letters. 
Precomposed superscripts are not needed (and will never be encoded), given that 
future practice may direct font design to support combining diacritics herein.
?
> 
> Adding superscript variants (or other typographic variants) in Unicode for
> that use would mean reencoding thousands letters in many scripts and in a
> dozen of stylistic variants. This is not the way to go.
?
Sure. I don?t see however how many languages and scripts do effectively use 
superscripts the way a few Latin script using languages do. My feeling is that 
there are none, but I?m at risk of being wrong. 
?
As of other stylistic variants, the mathematical alphabets are effectively used 
outside mathematics. Google Search is already able to handle them as if they were 
plain ASCII.
?
> 
> Plain text documents have their constraints, if clarity is needed they are
> necessarily modified with additional text, but converting a rich text to
> plain text and dropping all styles is destructive and may cause ambiguity
> in some rare cases. But language semantics and grammar most often resolve
> them to give sense to that text and abbreviations in plain text will still
> be readable in most cases. 
?
That is comforting.
?
Marcel


From charupdate at orange.fr  Thu Dec 29 18:23:55 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 30 Dec 2016 01:23:55 +0100 (CET)
Subject: a character for an unknown character
In-Reply-To: <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
Message-ID: <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>

On Wed, 28 Dec 2016 19:05:17 -0800, Asmus Freytag wrote:
> On 12/28/2016 5:47 PM, Richard Wordingham wrote: 
> > On Tue, 27 Dec 2016 21:33:32 -0800 
> > Asmus Freytag  wrote: 
> > > 
> > > When it comes to marks (or symbols) of less generic or more complex 
> > > shapes, the 
> > > presumption that the mark only has "one" shape may be more common, 
> > > and examples of the mark 
> > > being repurposed may be less common. Not being as common, fewer 
> > > readers will 
> > > recognize all stylistic variations as being "the same thing". A 
> > > variant form will be more 
> > > likely to be understood as a related, but not identical symbol. That 
> > > in turn fuels the 
> > > misperception that Unicode somehow encodes symbols based on a single 
> > > conventional usage. 
> > The idea of a single conventional usage is also fuelled by a number of 
> > practices and policies: 
> > 
> > 1) A letter belongs to a single script (not to be confused with 
> > writing system) 
> Making or not making that distinction makes some stuff easier and other stuff 
> harder to support in software. Overall, I think Unicode got this one right. 
> > 
> > 2) Distinction of punctuation and modifier letters, e.g. the highly 
> > confusing distinction between U+2019 RIGHT SINGLE QUOTATION MARK and 
> > U+02BC MODIFIER LETTER APOSTROPHE 
> I'm beginning to thing that 02BC is closer to a mistake than a correct solution; 
> there are places where it has to be treated on the same footing as 2019 even 
> though the idea was to give it different properties. 

U+02BC being shifted from a letter to a punctuation must have been anticipated at 
encoding, since the original recommendation was to use it as apostrophe throughout.
Unifying the letter apostrophe and the punctuation apostrophe made IMO more 
sense?despite of the conflicting properties?than the unification of the apostrophe 
with a quotation mark, because of the downside in text processing (cf. past year?s 
thread). The most proper solution was IMO to encode all three separately, the same 
way as the COMMA has not been unified with the SINGLE LOW-9 QUOTATION MARK, 
despite of the latter being often informally referred to as a ?comma.?

> > 
> > 3) The resolution of U+002D HYPHEN-MINUS into U+2010 HYPHEN, U+2212 
> > MINUS SIGN and a few minor punctuation marks 
> HYPHEN-MINUS is a bad example, because it's a conflation of several 
> quite distinct elements of type a single key for purposes of type writers. 

Confusingly, that typewriter legacy is hanging far into the computer era, while 
all other parts of computer science and computer practice are constantly updated.

> > 
> > 4) Distinction between decimal digits and letters 

Perhaps the letters for hexadecimal digits should have been encoded separately?

> > 
> > 5) The nightmare of spacing single and double dots. 
> ? spacing vs. combining? Not sure what you mean.

I think Richard refers to U+2024 ONE DOT LEADER and U+2025 TWO DOT LEADER, along 
with U+002E FULL STOP.

> > 
> > Ideal solutions can also be defeated by limited keyboard layouts.

Yeah that?s the point. Programming, deploying and customizing keyboard layouts 
should become quite trivial. Yet it seems that this is not a part of the curricula.

> > As a result, I have no idea whether the singular of "fithp" (one of 
> > Larry Niven's alien species) should be spelt with U+02BC or U+2019, 
> > though in ASCII I can just write "fi'". 

Normally on an English or French keyboard layout, all three are accessed on 
live keys.

> The only place where "uni" doesn't apply in Unicode is that there's never just 
> a single principle that applies, but always multiple ones that are in tension --- 
> and in the edge cases, the tension can be felt keenly. 

Sorry I cannot follow. Perhaps an example would make the issue clearer.
As of the apostrophe issue, this is IMO an exception, due to the lack of 
a character (the punctuation apostrophe), a lack that in turn seems to have 
been triggered by an atrophy of the analytical memory in favor of the visual 
memory, that is bugged when faced with three ?squiggles.? But repeatedly, 
there are two comma-like characters, and four that generate period-like 
appearances. Not sure whether the lack of an apostrophe reduces that nightmare 
by half.

Marcel


From asmusf at ix.netcom.com  Thu Dec 29 18:29:08 2016
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 29 Dec 2016 16:29:08 -0800
Subject: French Superscript Abbreviations Fit Plain Text Requirements
In-Reply-To: <497101423.8896.1483046424026.JavaMail.www@wwinf1h16>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <1877734343.9096.1482938748638.JavaMail.www@wwinf1p14>
 <72c27940-ca8e-f616-48d5-5ee58835b354@ix.netcom.com>
 <497101423.8896.1483046424026.JavaMail.www@wwinf1h16>
Message-ID: <3f152e04-2b8f-032c-713f-201106d55aff@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161229/99a8ffd9/attachment.html>

From charupdate at orange.fr  Fri Dec 30 05:18:32 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 30 Dec 2016 12:18:32 +0100 (CET)
Subject: French Superscript Abbreviations Fit Plain Text Requirements
In-Reply-To: <3f152e04-2b8f-032c-713f-201106d55aff@ix.netcom.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <1877734343.9096.1482938748638.JavaMail.www@wwinf1p14>
 <72c27940-ca8e-f616-48d5-5ee58835b354@ix.netcom.com>
 <497101423.8896.1483046424026.JavaMail.www@wwinf1h16>
 <3f152e04-2b8f-032c-713f-201106d55aff@ix.netcom.com>
Message-ID: <1615169409.4622.1483096712616.JavaMail.www@wwinf1p12>

On Thu, 29 Dec 2016 16:29:08 -0800, Asmus Freytag wrote:
> 
> On 12/29/2016 1:20 PM, Marcel Schneider wrote:
> > 
> > this Unicode 
> > recommendation leads to content corruption when the related markup is stripped 
> > off. That may occur sooner than expected, e.g. in Word (2010) when a character 
> > style is applied. 
> 
> The solution is to find better ways to apply character styles.
> 
> For example, if a style definition is intended to ensure that something is in 
> italics (for emphasis), then changing all formatting for the text is overkill. 
> There may be very valid reasons to not remove super/subscript, font bindings 
> (other than switch from regular to italic version of a font) or even text color. 
> Finally, in changing text to an italic style, any words already in italics might be 
> toggled rather than blindly switched - that's what human typesetters tend to do.
> 
> Looking at it this way, your beef is not with character encoding, but with 
> limitations (needlessly) imposed by software.

Right, that /has/ been my ?beef? at a time? Hence my reminiscence :) 
Now I?d better not have cited this one. It?s not where I?d like to come. 
I?m pretty sure this is the way things do work in Publisher. I see that 
superscripting is stable in LibreOffice 4.2.4.2 (2014), and I guess it is 
in actual vesions of Word. 

The official concern here in France (not solely my personal one) is that while 
it would be handy to have the most current ordinal indicator in French (^e) 
right on the (on-coming standard) keyboard (layout), this character is considered 
not to have been encoded yet. Hence a reserved allocation, in expectation of 
future encoding. 

According to the relevant page of the French Academy [1], four letters are used 
in French ordinal abbreviations: d, e, r, s. This is detailed on a private 
website dedicated to French abbreviations [2]. The out-of-the-box solution is 
to use the already existing Unicode modifier letters ?, ?, ?, ?, but this has 
a downside on display level: The only fonts I know where it has the habitual look 
are Consolas (modifier letters resembling to legacy ordinal indicators without the 
underline) and Lucida Console (modfier letters resembling to formatted superscipt). 
Already in Lucida Sans Unicode, modifier letters are less raised than ordinal 
indicators and formatted superscript.

My personal opinion is that these differences in display are not worth considering. 
If there is a demand for plain text ordinal indicators in French, the Unicode 
modifier letters should be used in this context. They are ready to use, and much 
more important issues are out there. Anyway, the current practice (using rich text) 
remains recommended by Unicode. (But for the?out of topic?representation of 
numeral system bases, Unicode subscript digits still seem to me a better way.)

These issues seem _relatively_ important to me, as they affect the French keyboard 
standard that is now being engineered, for public enquiry and publication in 2017.

Marcel

[1] Abr?viations des adjectifs num?raux [?Abbreviations of ordinals?]:
http://www.academie-francaise.fr/abreviations-des-adjectifs-numerauxhttp://www.academie-francaise.fr/abreviations-des-adjectifs-numeraux

[2] Abr?viations des adjectifs num?raux ordinaux:
http://www.les-abreviations.com/adjectifs.html


From charupdate at orange.fr  Fri Dec 30 05:32:25 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 30 Dec 2016 12:32:25 +0100 (CET)
Subject: French Superscript Abbreviations Fit Plain Text Requirements
In-Reply-To: <3f152e04-2b8f-032c-713f-201106d55aff@ix.netcom.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <1877734343.9096.1482938748638.JavaMail.www@wwinf1p14>
 <72c27940-ca8e-f616-48d5-5ee58835b354@ix.netcom.com>
 <497101423.8896.1483046424026.JavaMail.www@wwinf1h16>
 <3f152e04-2b8f-032c-713f-201106d55aff@ix.netcom.com>
Message-ID: <2020463919.4813.1483097545047.JavaMail.www@wwinf1p12>

I?m sorry for the broken link, messed by redundant pasting.
Please disregard my previous e-mail.

On Thu, 29 Dec 2016 16:29:08 -0800, Asmus Freytag wrote:
> 
> On 12/29/2016 1:20 PM, Marcel Schneider wrote:
> > 
> > this Unicode 
> > recommendation leads to content corruption when the related markup is stripped 
> > off. That may occur sooner than expected, e.g. in Word (2010) when a character 
> > style is applied. 
> 
> The solution is to find better ways to apply character styles.
> 
> For example, if a style definition is intended to ensure that something is in 
> italics (for emphasis), then changing all formatting for the text is overkill. 
> There may be very valid reasons to not remove super/subscript, font bindings 
> (other than switch from regular to italic version of a font) or even text color. 
> Finally, in changing text to an italic style, any words already in italics might be 
> toggled rather than blindly switched - that's what human typesetters tend to do.
> 
> Looking at it this way, your beef is not with character encoding, but with 
> limitations (needlessly) imposed by software.

Right, that /has/ been my ?beef? at a time? Hence my reminiscence :) 
Now I?d better not have cited this one. It?s not where I?d like to come. 
I?m pretty sure this is the way things do work in Publisher. I see that 
superscripting is stable in LibreOffice 4.2.4.2 (2014), and I guess it is 
in actual vesions of Word. 

The official concern here in France (not solely my personal one) is that while 
it would be handy to have the most current ordinal indicator in French (^e) 
right on the (on-coming standard) keyboard (layout), this character is considered 
not to have been encoded yet. Hence a reserved allocation, in expectation of 
future encoding. 

According to the relevant page of the French Academy [1], four letters are used 
in French ordinal abbreviations: d, e, r, s. This is detailed on a private 
website dedicated to French abbreviations [2]. The out-of-the-box solution is 
to use the already existing Unicode modifier letters ?, ?, ?, ?, but this has 
a downside on display level: The only fonts I know where it has the habitual look 
are Consolas (modifier letters resembling to legacy ordinal indicators without the 
underline) and Lucida Console (modfier letters resembling to formatted superscipt). 
Already in Lucida Sans Unicode, modifier letters are less raised than ordinal 
indicators and formatted superscript.

My personal opinion is that these differences in display are not worth considering. 
If there is a demand for plain text ordinal indicators in French, the Unicode 
modifier letters should be used in this context. They are ready to use, and much 
more important issues are out there. Anyway, the current practice (using rich text) 
remains recommended by Unicode. (But for the?out of topic?representation of 
numeral system bases, Unicode subscript digits still seem to me a better way.)

These issues seem _relatively_ important to me, as they affect the French keyboard 
standard that is now being engineered, for public enquiry and publication in 2017.

Marcel

[1] Abr?viations des adjectifs num?raux [?Abbreviations of ordinals?]:
http://www.academie-francaise.fr/abreviations-des-adjectifs-numeraux

[2] Abr?viations des adjectifs num?raux ordinaux:
http://www.les-abreviations.com/adjectifs.html


From richard.wordingham at ntlworld.com  Fri Dec 30 06:37:27 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 30 Dec 2016 12:37:27 +0000
Subject: a character for an unknown character
In-Reply-To: <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
Message-ID: <20161230123727.52a00633@JRWUBU2>

On Fri, 30 Dec 2016 01:23:55 +0100 (CET)
Marcel Schneider <charupdate at orange.fr> wrote:

> On Wed, 28 Dec 2016 19:05:17 -0800, Asmus Freytag wrote:
> > On 12/28/2016 5:47 PM, Richard Wordingham wrote:   

> U+02BC being shifted from a letter to a punctuation must have been
> anticipated at encoding, since the original recommendation was to use
> it as apostrophe throughout. Unifying the letter apostrophe and the
> punctuation apostrophe made IMO more sense?despite of the conflicting
> properties

What conflicts?  Both prototypically mark absences.

The rationale seems to be that English uses both the punctuation
apostrophe and the U+2019 RIGHT SINGLE QUOTATION MARK.  If users aren't
being trained to use U+2212 MINUS SIGN, and habitually disable grammar
and spell-checking, most won't make the right choice between U+02BC and
U+2019.

> Perhaps the letters for hexadecimal digits should have been encoded
> separately?

The idea has been rejected several times.

> > > 5) The nightmare of spacing single and double dots.   
> > ? spacing vs. combining? Not sure what you mean.  

> I think Richard refers to U+2024 ONE DOT LEADER and U+2025 TWO DOT
> LEADER, along with U+002E FULL STOP.

That's not the half of it.  For starters, just look at the confusables
for U+00B7 MIDDLE DOT:

U+2022 BULLET
U+2027 HYPHENATION POINT
U+2219 BULLET OPERATOR
U+22C5 DOT OPERATOR
U+2E31 WORD SEPARATOR MIDDLE DOT
U+30FB KATAKANA MIDDLE DOT

There's an argument that the unification of U+00B7 and U+0387 ANO
TELEIA is a unification too far.  A font for Greek may need to work out
which it is to position it correctly.

For double dots, there're the confusables for U+003A COLON:
U+05C3 HEBREW PUNCTUATION SOF PASUQ
U+2236 RATIO

There's a whole raft of visargas, some of which match and some of
which don't. What happened to the principle that diacritics are unified
by form?  I suspect the answer is that encoding was established while
principles were still developing.

> > > As a result, I have no idea whether the singular of "fithp" (one
> > > of Larry Niven's alien species) should be spelt with U+02BC or
> > > U+2019, though in ASCII I can just write "fi'".   
> 
> Normally on an English or French keyboard layout, all three are
> accessed on live keys.

That accessibility is news to me - normally I just have to fight a word
processor if I want U+0027.  However, I still don't know whether to
spell the word ?fi?? or ?fi??.  I've only seen it in print.

Richard.


From doug at ewellic.org  Fri Dec 30 11:33:45 2016
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 30 Dec 2016 10:33:45 -0700
Subject: Sorry
Message-ID: <20161230103345.665a7a7059d7ee80bb4d670165c8327d.bd1bfb40c3.wbe@email03.godaddy.com>

I apologize for sending a message to this list yesterday that was
intended for another list.

--
Doug Ewell | Thornton, CO, US | ewellic.org


From charupdate at orange.fr  Fri Dec 30 13:13:41 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 30 Dec 2016 20:13:41 +0100 (CET)
Subject: a character for an unknown character
In-Reply-To: <20161230123727.52a00633@JRWUBU2>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
Message-ID: <2127317905.12907.1483125221693.JavaMail.www@wwinf1p10>

On Fri, 30 Dec 2016 12:37:27 +0000, Richard Wordingham wrote:
> On Fri, 30 Dec 2016 01:23:55 +0100 (CET) Marcel Schneider wrote:
> > On Wed, 28 Dec 2016 19:05:17 -0800, Asmus Freytag wrote: 
> > > On 12/28/2016 5:47 PM, Richard Wordingham wrote: 
> > U+02BC being shifted from a letter to a punctuation must have been 
> > anticipated at encoding, since the original recommendation was to use 
> > it as apostrophe throughout. Unifying the letter apostrophe and the 
> > punctuation apostrophe made IMO more sense?despite of the conflicting 
> > properties 
> What conflicts? Both prototypically mark absences. 

I meant the gc=Lm of U+02BC vs U+2019 that has gc=Pf.

> The rationale seems to be that English uses both the punctuation 
> apostrophe and the U+2019 RIGHT SINGLE QUOTATION MARK. If users aren't 
> being trained to use U+2212 MINUS SIGN, and habitually disable grammar 
> and spell-checking, most won't make the right choice between U+02BC and 
> U+2019. 

I don?t see well why so. The MINUS SIGN should be on the MINUS key at the 
same level as the PLUS SIGN. (That brings the necessity to at least shift 
the underscore at 0x10/Option/AltGr level, where the PLUS key might have 
the PLUS-MINUS SIGN.) The unability to determine whether a punctuation is 
an apostrphe or a quotation mark, is most found in computers, not humans.

> > Perhaps the letters for hexadecimal digits should have been encoded 
> > separately? 
> The idea has been rejected several times. 

Indeed that would have been useless. Where confusable, hex digits are 
prefixed or suffixed.

> > > > 5) The nightmare of spacing single and double dots. 
> > > ? spacing vs. combining? Not sure what you mean. 
> > I think Richard refers to U+2024 ONE DOT LEADER and U+2025 TWO DOT 
> > LEADER, along with U+002E FULL STOP. 
> That's not the half of it. For starters, just look at the confusables 
> for U+00B7 MIDDLE DOT: 
> U+2022 BULLET 

This is included out of an abundance of caution. Visually, ? vs ? are distinct.
I?d be sad if there were no bullet; I use it also manually.

> U+2027 HYPHENATION POINT 

?his is distinct in that, it is centred on the lowercase letters, while the 
middle dot is centred on the uppercase letters.

> U+2219 BULLET OPERATOR 
> U+22C5 DOT OPERATOR 

According to Wikipedia, "Interpunct", these are often silently replaced by U+00B7.
But these have gc=Sm. And U+2219 seems to be centred on digits, U+22C5 on lowercase.

> U+2E31 WORD SEPARATOR MIDDLE DOT 
> U+30FB KATAKANA MIDDLE DOT 

These seem to me identical to U+00B7 and U+2022 respectively. Perhaps we?re here 
faced with two examples of what Asmus referred to as ?incorrectly encoded more 
than once? (talking of ?Many other "simple" marks: lines, circles, triangles, 
hooks, and squares, or groups of them?). 
http://www.unicode.org/mail-arch/unicode-ml/y2016-m12/0115.html
I believe however that it was correct to make them dedicated characters for 
precise scripts, somehow like ANO TELEIA.

> There's an argument that the unification of U+00B7 and U+0387 ANO 
> TELEIA is a unification too far. A font for Greek may need to work out 
> which it is to position it correctly. 
> For double dots, there're the confusables for U+003A COLON: 
> U+05C3 HEBREW PUNCTUATION SOF PASUQ 
> U+2236 RATIO 
> There's a whole raft of visargas, some of which match and some of 
> which don't. What happened to the principle that diacritics are unified 
> by form? I suspect the answer is that encoding was established while 
> principles were still developing. 
> > > > As a result, I have no idea whether the singular of "fithp" (one 
> > > > of Larry Niven's alien species) should be spelt with U+02BC or 
> > > > U+2019, though in ASCII I can just write "fi'". 
> > 
> > Normally on an English or French keyboard layout, all three are 
> > accessed on live keys. 
> That accessibility is news to me -

It is really new for me. The sort of keyboard layout update. ?Normally? means here 
the way that it *should* be normal on English and French keyboards. On French ones 
because U+02BC is preferred in Breton language; on English ones because U+02BC 
is preferred by a current of spelling English.

> normally I just have to fight a word processor if I want U+0027.

To improve the user experience here, one needs to move this from the autocorrect 
to the keyboard layout. In Word, one may disable the bundle but add a custom 
autocorrect that transforms U+0027 always immediately to U+2019 or to U+02BC. 
Then, hitting Backspace brings first U+0027 back. Quotation marks require then 
adding other autocorrect items, using digraphs.

> However, I still don't know whether to spell the word ?fi?? or ?fi??.
> I've only seen it in print. 

That depends on the spelling convention. If the apostrophe and the single comma 
quote are disunified, then U+02BC is used to spell the word ?fi?? (your first 
option). You might also wish to ask the publisher, but I?m unsure whether he 
will appreciate to have to join publicly one or the other spelling current.
(As of me, I normally use U+02BC in English, and U+2019 in French, given the 
diverging quotation mark usages and apostrophe semantics. In French mode, I have 
the latter in the Base shift state, and the former in the Shift shift state on 
the same key, but I?m developing another model where the letter apostrophe is 
in 0x10/Option/AltGr on a letter key in both French and Languages mode.)

Marcel


From asmusf at ix.netcom.com  Fri Dec 30 15:10:31 2016
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 30 Dec 2016 13:10:31 -0800
Subject: a character for an unknown character
In-Reply-To: <20161230123727.52a00633@JRWUBU2>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
Message-ID: <b7f0db69-44d3-c72f-93fd-f7a69b83d6ad@ix.netcom.com>

On 12/30/2016 4:37 AM, Richard Wordingham wrote:
> On Fri, 30 Dec 2016 01:23:55 +0100 (CET)
> Marcel Schneider <charupdate at orange.fr> wrote:
>
>> On Wed, 28 Dec 2016 19:05:17 -0800, Asmus Freytag wrote:
>>> On 12/28/2016 5:47 PM, Richard Wordingham wrote:
>> U+02BC being shifted from a letter to a punctuation must have been
>> anticipated at encoding, since the original recommendation was to use
>> it as apostrophe throughout. Unifying the letter apostrophe and the
>> punctuation apostrophe made IMO more sense?despite of the conflicting
>> properties
> What conflicts?  Both prototypically mark absences.
>
> The rationale seems to be that English uses both the punctuation
> apostrophe and the U+2019 RIGHT SINGLE QUOTATION MARK.  If users aren't
> being trained to use U+2212 MINUS SIGN, and habitually disable grammar
> and spell-checking, most won't make the right choice between U+02BC and
> U+2019.

Evidence seems to indicate that users in languages that were supposed to 
use U+02BC
tend to freely substitute U+0027 and to some degree U+2019.

To the point that U+02BC is being ruled out altogether in the case of 
more selective
policies for domain names, e.g. for the DNS root zone or the reference 
tables for
the second level.

Despite having formally been given the letter property, in practice, the 
fact that it
is visually indistinguishable does not allow it to be treated as a 
letter in all contexts.
>
>> Perhaps the letters for hexadecimal digits should have been encoded
>> separately?
> The idea has been rejected several times.
>
>>>> 5) The nightmare of spacing single and double dots.
>>> ? spacing vs. combining? Not sure what you mean.
>> I think Richard refers to U+2024 ONE DOT LEADER and U+2025 TWO DOT
>> LEADER, along with U+002E FULL STOP.
> That's not the half of it.  For starters, just look at the confusables
> for U+00B7 MIDDLE DOT:
>
> U+2022 BULLET
> U+2027 HYPHENATION POINT
> U+2219 BULLET OPERATOR
> U+22C5 DOT OPERATOR
> U+2E31 WORD SEPARATOR MIDDLE DOT
> U+30FB KATAKANA MIDDLE DOT
>
> There's an argument that the unification of U+00B7 and U+0387 ANO
> TELEIA is a unification too far.  A font for Greek may need to work out
> which it is to position it correctly.
>
> For double dots, there're the confusables for U+003A COLON:
> U+05C3 HEBREW PUNCTUATION SOF PASUQ
> U+2236 RATIO
>
> There's a whole raft of visargas, some of which match and some of
> which don't. What happened to the principle that diacritics are unified
> by form?  I suspect the answer is that encoding was established while
> principles were still developing.
>
>>>> As a result, I have no idea whether the singular of "fithp" (one
>>>> of Larry Niven's alien species) should be spelt with U+02BC or
>>>> U+2019, though in ASCII I can just write "fi'".
>> Normally on an English or French keyboard layout, all three are
>> accessed on live keys.
> That accessibility is news to me - normally I just have to fight a word
> processor if I want U+0027.  However, I still don't know whether to
> spell the word ?fi?? or ?fi??.  I've only seen it in print.
>
> Richard.
>
>


From richard.wordingham at ntlworld.com  Fri Dec 30 16:17:12 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 30 Dec 2016 22:17:12 +0000
Subject: a character for an unknown character
In-Reply-To: <2127317905.12907.1483125221693.JavaMail.www@wwinf1p10>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <2127317905.12907.1483125221693.JavaMail.www@wwinf1p10>
Message-ID: <20161230221712.77fb5a3a@JRWUBU2>

On Fri, 30 Dec 2016 20:13:41 +0100 (CET)
Marcel Schneider <charupdate at orange.fr> wrote:

> > U+2E31 WORD SEPARATOR MIDDLE DOT 
> > U+30FB KATAKANA MIDDLE DOT   

> These seem to me identical to U+00B7 and U+2022 respectively. Perhaps
> we?re here faced with two examples of what Asmus referred to as
> ?incorrectly encoded more than once? (talking of ?Many other "simple"
> marks: lines, circles, triangles, hooks, and squares, or groups of
> them?).

I was talking about what "fuels the misperception that Unicode somehow
encodes symbols based on a single conventional usage".

> > However, I still don't know whether to spell the word ?fi?? or
> > ?fi??. I've only seen it in print.   

> That depends on the spelling convention.

That's the problem.  I'm not aware of any literature in the language of
the Chtaptisk Fithp in any Terran script - there's more Punic in
Plautus's Poenulus.

> If the apostrophe and the
> single comma quote are disunified, then U+02BC is used to spell the
> word ?fi?? (your first option). You might also wish to ask the
> publisher, but I?m unsure whether he will appreciate to have to join
> publicly one or the other spelling current.

You obviously haven't read the story's discussion of whether the fithp
would honour a peace treaty!

I think the general understanding of the difference is very limited.
For instance, the English wikipedia article about Klingon says, "The
apostrophe, denoting the glottal stop, is considered a letter, not a
punctuation mark", and then goes on to encode it as U+2019!  The French
wikipedia also uses U+2019.

Richard.


From charupdate at orange.fr  Fri Dec 30 19:09:12 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 31 Dec 2016 02:09:12 +0100 (CET)
Subject: a character for an unknown character
In-Reply-To: <20161230221712.77fb5a3a@JRWUBU2>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <2127317905.12907.1483125221693.JavaMail.www@wwinf1p10>
 <20161230221712.77fb5a3a@JRWUBU2>
Message-ID: <694657162.15012.1483146552587.JavaMail.www@wwinf1p10>

On Fri, 30 Dec 2016 22:17:12 +0000, Richard Wordingham wrote: 
> 
> On Fri, 30 Dec 2016 20:13:41 +0100 (CET)
> Marcel Schneider  wrote:
[?]
> > If the apostrophe and the
> > single comma quote are disunified, then U+02BC is used to spell the
> > word ?fi?? (your first option). You might also wish to ask the
> > publisher, but I?m unsure whether he will appreciate to have to join
> > publicly one or the other spelling current.
> 
> You obviously haven't read the story's discussion of whether the fithp
> would honour a peace treaty!

I haven?t read nor watched Star Trek (nor Star Wars).

> 
> I think the general understanding of the difference is very limited.
> For instance, the English wikipedia article about Klingon says, "The
> apostrophe, denoting the glottal stop, is considered a letter, not a
> punctuation mark", and then goes on to encode it as U+2019! 

I?m unable to find the quoted sentence in the cited article.

> The French wikipedia also uses U+2019.

In the "Klingon" article, mostly U+0027 is used, like to some extent in 
the whole French Wikip?dia. U+2019 occurs indeed (rarely) in that article. 
French cannot use U+02BC as apostrophe, because it needs a punctuation, 
not a letter, for correct word boundaries. Fortunately, French can actually 
afford to use U+2019 as apostrophe, because it scarcely uses it as quotation 
mark. The single one are chevrons, and the comma-quotes are used as 
scare quotes (or abusively as nested quotes), thus almost always double. 
BTW, in German, U+2019 is no quotation mark at all, only apostrophe.

Marcel


From richard.wordingham at ntlworld.com  Sat Dec 31 03:20:30 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 31 Dec 2016 09:20:30 +0000
Subject: a character for an unknown character
In-Reply-To: <694657162.15012.1483146552587.JavaMail.www@wwinf1p10>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <2127317905.12907.1483125221693.JavaMail.www@wwinf1p10>
 <20161230221712.77fb5a3a@JRWUBU2>
 <694657162.15012.1483146552587.JavaMail.www@wwinf1p10>
Message-ID: <20161231092030.44373279@JRWUBU2>

On Sat, 31 Dec 2016 02:09:12 +0100 (CET) Marcel Schneider wrote:
> On Fri, 30 Dec 2016 22:17:12 +0000, Richard Wordingham wrote: 

> > You obviously haven't read the story's discussion of whether the
> > fithp would honour a peace treaty!  

> I haven?t read nor watched Star Trek (nor Star Wars).

It's in a different universe, restricted to one book, namely Footfall.

I brought up Klingon because it seemed a good place to use U+02BC in
a fictitious alien or foreign language, and there is a lot of Klingon
around. For a lot of these fictitious languages, the realisation used by
English speakers is as a neutral vowel rather than as a glottal stop,
so the case for using U+02BC feels less compelling.

> > I think the general understanding of the difference is very limited.
> > For instance, the English wikipedia article about Klingon says, "The
> > apostrophe, denoting the glottal stop, is considered a letter, not a
> > punctuation mark", and then goes on to encode it as U+2019!   

> I?m unable to find the quoted sentence in the cited article.

Did you look in the article about Klingon, namely
https://en.wikipedia.org/wiki/Klingon_language , or
in the article about Klingons, namely
https://en.wikipedia.org/wiki/Klingon ? The quote is from the former.

The English wikipedia 'house style' is given at
https://en.wikipedia.org/wiki/Template:Tt-Klingon ; this specifies the
use of U+2019. I should have spotted that yesterday.

Richard.


From christoph.paeper at crissov.de  Sat Dec 31 04:01:16 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Sat, 31 Dec 2016 11:01:16 +0100
Subject: a character for an unknown character
In-Reply-To: <20161230123727.52a00633@JRWUBU2>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
Message-ID: <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>

Richard Wordingham <richard.wordingham at ntlworld.com>:
> 
>> Perhaps the letters for hexadecimal digits should have been encoded
>> separately?
> 
> The idea has been rejected several times.

It has indeed. That?s why two different technologies have to be used to get typographically harmonic hexadecimal numbers, e.g. in CSS:

  .hex {font-variant-numeric: oldstyle-nums; text-transform: lowercase;}
  .hex {font-variant-numeric: lining-nums;   text-transform: uppercase;}

This works well enough for ?01ef? or ?01EF?, but will fail for conventions like ?0x01ef? and ?01EFh?. Hence:

  .hex::before {content: "0x"; text-transform: none;}
  .hex::after  {content: "h";  text-transform: none;}
  .hex::after  {content: "?";}
  .hex::after  {content: "16"; vertical-align: sub; font-size: smaller; line-height: normal;}
  .hex::after  {content: "16"; font-variant-position: sub;}
  .hex::after  {content: "??";}


From charupdate at orange.fr  Sat Dec 31 04:45:02 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 31 Dec 2016 11:45:02 +0100 (CET)
Subject: Re-use of Modifier Letters for Superscript Abbreviations (was: Re:
 a character for an unknown character)
In-Reply-To: <20161231092030.44373279@JRWUBU2>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <2127317905.12907.1483125221693.JavaMail.www@wwinf1p10>
 <20161230221712.77fb5a3a@JRWUBU2>
 <694657162.15012.1483146552587.JavaMail.www@wwinf1p10>
 <20161231092030.44373279@JRWUBU2>
Message-ID: <1963362297.3763.1483181102220.JavaMail.www@wwinf1p10>

On Sat, 31 Dec 2016 09:20:30 +0000, Richard Wordingham wrote:
[?]
> It's in a different universe, restricted to one book, namely Footfall. 

Thank you for the reference.

[?]
> Did you look in the article about Klingon, namely 
> https://en.wikipedia.org/wiki/Klingon_language , or 
> in the article about Klingons, namely 
> https://en.wikipedia.org/wiki/Klingon ? The quote is from the former. 

I?ve looked up the wrong one, didn?t think of the language article. 
Thanks for the link.

I?m now looking back at another quotation of yours, to spin off a new thread again 
about the topic that I urgently need to gather more information about:

On Fri, 30 Dec 2016 22:17:12 +0000, Richard Wordingham wrote:
> 
> On Fri, 30 Dec 2016 20:13:41 +0100 (CET) Marcel Schneider wrote: 
> >
> > > U+2E31 WORD SEPARATOR MIDDLE DOT 
> > > U+30FB KATAKANA MIDDLE DOT 
> >
> > These seem to me identical to U+00B7 and U+2022 respectively. Perhaps 
> > we?re here faced with two examples of what Asmus referred to as 
> > ?incorrectly encoded more than once? (talking of ?Many other "simple" 
> > marks: lines, circles, triangles, hooks, and squares, or groups of 
> > them?). 
> 
> I was talking about what "fuels the misperception that Unicode somehow 
> encodes symbols based on a single conventional usage". 

I persist believing that particular scripts like Avestan and Samaritan Aramaic 
can require special characters like the WORD SEPARATOR MIDDLE DOT. Not fueling 
a misperception of Unicode character encoding could?t drive the UTC to reject this 
(for version 5.2). The KATAKANA MIDDLE DOT in turn is a part of the standard since 
the beginning, like the BULLET. I imagine that a generic bullet may not be suitable 
for Katakana.

To get an idea of how character encoding works, people won?t look at scripts they 
don?t know. Given that there is a misperception, one way to not fuel it could be 
to encourage character re-use. Actually this is rather discouraged, as in the 
example of Latin modifier letters that are (basically) preformatted superscripts. 
TUS states that there is no functional difference between those that have the word 
SUPERSCRIPT in their name, and those that don?t:

TUS 9.0, ?7.8, p. 327:
| The superscript forms of the i and n letters can be found in the
| Superscripts and Subscripts block (U+2070..U+209F). The fact that the latter 
| two letters contain the word ?superscript? in their names instead of ?modifier 
| letter? is an historical artifact of original sources for the characters, and 
| is not intended to convey a functional distinction in the use of these 
| characters in the Unicode Standard.
http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#G24762

Probably that is intended to discourage their use as superscripts. 
Superscript digits too are confined to phonetics, and the use of superscript two 
and three in measurement units is merely tolerated, not encouraged:

TUS 9.0, ?22.4, p. 786:
| In addition, superscript digits are used to indicate tone in transliteration 
| of many languages. The use of superscript two and superscript three is common 
| legacy practice when referring to units of area and volume in general texts.
http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf#G42931

Cnnsequently, the notation of the acceleration unit 'ms??' doesn?t seem to be
sustained by Unicode. Though this may be considered a technical notation, so 
that there would be a reason to allow it.

These examples are intended to demonstrate the ambiguity of the recommendation 
to use markup and rich text format whenever vertical alignment matters, except 
in phonetics. I suspect that political correctness with respect to non-Latin 
scripts could eventually have biased Unicode?s policy, whereas Western Arabic 
digits and Latin letters are probably the only characters to be used extensively 
in super- and subscript position.

As a result, the misperception of Unicode as a one-codepoint-per-usage standard 
is even more fueled, and I can now better understand why our NB intended to have 
French ordinal indicator(s) encoded in Unicode aside the already existing 
superscript Latin small letter(s). 

But admitting that encoding new French ordinal indicators is a really good idea, 
I?m curious of the response of the UTC. However, given that the regular process 
will take two years, would Unicode agree that in the meantime, the modifier 
letters be put in their place on the on-coming keyboard layout?


Marcel


From richard.wordingham at ntlworld.com  Sat Dec 31 08:18:15 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 31 Dec 2016 14:18:15 +0000
Subject: On the upcoming LATIN LETTER SMALL CAPITAL Q
In-Reply-To: <CAF5KyEyWvjzE2G6B_sForoVHnxKJO49trH5V7KMh0cuAz2xB8w@mail.gmail.com>
References: <CAF5KyEwN-AC3_gdhBBsG+fihNJOYr+hXFbMGSdBiJ0UsnUnH7w@mail.gmail.com>
 <CAJ6uix5xcc=SuKpJ+o5hUcyx5Hg-08Y+aBWUy4QXiMyU9=zAMA@mail.gmail.com>
 <CAF5KyEzrD6sJuDEZaPQei4LiwiuP3a=5VGdKMfBoKV5QFOyM+g@mail.gmail.com>
 <CAJ6uix6ScNN86RuWk0gfvRLoUZWMr1hSZRDc17eBUsv0N+=XDQ@mail.gmail.com>
 <CAJ6uix7aj-uSOSJgzAaD1wyOkv3hXYNLYpf=cPMCsZCK-YG8KQ@mail.gmail.com>
 <CAGa7JC2tZ_i_c6HELi45g-athN0gJgX+VeB8DtAr4qfdeQgC5A@mail.gmail.com>
 <CAJKta0xaP4+2kDjFZB9EiYQh0XBKe7C2xyzup0a4SwQJQ1EtXA@mail.gmail.com>
 <CAF5KyEyWvjzE2G6B_sForoVHnxKJO49trH5V7KMh0cuAz2xB8w@mail.gmail.com>
Message-ID: <20161231141815.0f3beaa6@JRWUBU2>

On Wed, 28 Dec 2016 13:44:18 +0900
Yif?n W?ng <747.neutron at gmail.com> wrote:

> Now I start to wonder if the description would be "Letter for
> phonetics and Japanese phonology" or "Letter for scholarly
> transcription" etc.

"Letter for phonetics or phonology (esp. Japanese)" appeals.  The
new letter wouldn't be out of place in transcribing oral coda stops in
Pali.

Richard.


From charupdate at orange.fr  Sat Dec 31 15:04:02 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 31 Dec 2016 22:04:02 +0100 (CET)
Subject: a character for an unknown character
In-Reply-To: <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
Message-ID: <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>

On Sat, 31 Dec 2016 11:01:16 +0100, Christoph P?per wrote:
> 
> Richard Wordingham :
> > 
> >> Perhaps the letters for hexadecimal digits should have been encoded
> >> separately?
> > 
> > The idea has been rejected several times.
> 
> It has indeed. That?s why two different technologies have to be used to get 
> typographically harmonic hexadecimal numbers, e.g. in CSS:
> 
> .hex {font-variant-numeric: oldstyle-nums; text-transform: lowercase;}
> .hex {font-variant-numeric: lining-nums; text-transform: uppercase;}
> 
> This works well enough for ?01ef? or ?01EF?, but will fail for conventions like 
> ?0x01ef? and ?01EFh?. Hence:
> 
> .hex::before {content: "0x"; text-transform: none;}
> .hex::after {content: "h"; text-transform: none;}
> .hex::after {content: "?";}
> .hex::after {content: "16"; vertical-align: sub; font-size: smaller; line-height: normal;}
> .hex::after {content: "16"; font-variant-position: sub;}
> .hex::after {content: "??";}

Thank you for the code. I didn?t know this, so I?ve tried and found that 
the automatic prefixes/suffixes cannot be copied from the web page. 
That seems to me a disadvantage.

Among the possibilities, you include Unicode subscripts. Is this current 
practice? That seems to me very interesting to follow up, as it documents 
that the stable representation scheme is already adopted. I?m curious to 
what extent it is so. 

The font-variant-numeric: oldstyle-nums seems not to work with any font. 
To get oldstyle numbers and lining numbers both displayed depending on 
the active option, I had to add font-family: Constantia, that has oldstyle 
digits.

I note that the "U+" prefix is missing in the list, obviously because it 
denotes more than just a hexadecimal number, and is to be hard-coded. 
This is easy when the keyboard includes an emulated numpad with hex letters 
and the "U+" as a sequence on a live key. I know that developers appreciate 
being able to type hex numbers on a numerical keypad, and consider typing 
them on the alphanumerical block, or on both (letters left-hand, digits 
right-hand) a suboptimal workaround. Further it is handy to have superscript 
and subscript digits accessed on the numpad too.

Marcel