From unicode at unicode.org  Thu Nov  2 11:11:07 2017
From: unicode at unicode.org (Rostislav via Unicode)
Date: Thu, 02 Nov 2017 19:11:07 +0300
Subject: A criteria for Emoji property assignment?
In-Reply-To: <mailman.0.1509632840.19858.unicode@unicode.org>
References: <mailman.0.1509632840.19858.unicode@unicode.org>
Message-ID: <24621509639067@web11g.yandex.ru>

I wonder what reasoning lies behind the Unicode Consortium's declaring some decorative characters emojis while leaving others as regular characters.

For example:

1. Four arrows (←↑→↓, 2190–2193) are not emojis, while the four diagonal arrows in the same Unicode block (↖↗↘↙, 2196–2199) are emojis.

2. 23F9 (⏹) and 23FA (⏺) are emojis, but the next two characters 23FB (⏻) and 23FC (⏼) aren't.

3. In the Geometric Shapes block, only two characters (25AA ▪ and 25AB ▫) are considered emojis, while the other 94 aren't. Why did just these two little squares deserve the honor of bearing the Emoji property, in contrast to all other geometric shapes?

4. In the Miscellaneous Symbols block, one suspects that the characters were appointed emojis randomly. Two snowmen (2603 ☃ and 26C4 ⛄) are emojis, but the third one (26C7 ⛇) is not; the up-pointing finger (261D ☝) is an emoji, the down-pointing one (261F ☟) is not; a cloud without rain (2601 ☁) and with rain (26C8 ⛈) are emojis, but rain without a cloud (26C6 ⛆) isn't. Of the characters originating from a single source (namely ARIB, L2/07-391), some became emojis and some did not, without any apparent logic.

5. Stranger still, on the first page of Miscellaneous Symbols and Pictographs (1F300–1F3FF) almost all characters are emojis, except for 10 that are left out inexplicably (e.g. 1F395 🎕 and 1F3F2 🏲). The situation is similar in the Supplemental Symbols and Pictographs block, where a rifle (1F946 🥆) is excluded from emojis, though almost all other characters have the Emoji property.

On the whole, almost every Unicode emoji raises the question of why some or many other similar characters aren't emojis too, and lots of non-emojis likewise raise the question of why they aren't. The assignment of the Emoji property to characters seems to be inconsistent, arbitrary and unexplainable.

Or is there a unified explanation of the criteria for Emoji property assignment?
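
Cases like the ones listed above can be checked mechanically against a property file in the UCD-style layout (`code point or range ; property # comment`). The following is only a minimal sketch: the sample lines are hand-written for illustration from the examples in this message, not copied from the real emoji-data.txt, and `parse_ranges` and `has_property` are hypothetical helper names.

```python
import re

# Illustrative lines in the emoji-data.txt layout (hand-written, NOT the real file).
SAMPLE = """\
# code point(s) ; property # comment
2196..2199    ; Emoji    # diagonal arrows
23F9..23FA    ; Emoji    # stop and record buttons
25AA..25AB    ; Emoji    # small squares
261D          ; Emoji    # index pointing up
"""

# One code point, optionally followed by "..<end>", then "; <property name>".
LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_ranges(text, prop="Emoji"):
    """Return a list of inclusive (start, end) code point ranges carrying `prop`."""
    ranges = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        match = LINE.match(line)
        if match and match.group(3) == prop:
            start = int(match.group(1), 16)
            end = int(match.group(2) or match.group(1), 16)
            ranges.append((start, end))
    return ranges


def has_property(code_point, ranges):
    """True if `code_point` falls inside any of the parsed ranges."""
    return any(start <= code_point <= end for start, end in ranges)


ranges = parse_ranges(SAMPLE)
for cp in (0x2190, 0x2196, 0x23FA, 0x23FB, 0x261D, 0x261F):
    print(f"U+{cp:04X} Emoji={has_property(cp, ranges)}")
```

Run against the real data file instead of `SAMPLE`, the same loop would list exactly which neighbours of a given emoji lack the property.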

--

From unicode at unicode.org  Thu Nov  2 11:39:38 2017
From: unicode at unicode.org (Rick McGowan via Unicode)
Date: Thu, 02 Nov 2017 09:39:38 -0700
Subject: Emoji candidate chart update
Message-ID: <59FB4A4A.9020406@unicode.org>

Hi Everyone,

Just FYI... The new Unicode emoji candidate charts, with updates from 
the UTC #153 meeting are now posted at: 
http://www.unicode.org/emoji/future/emoji-candidates.html

R


From unicode at unicode.org  Thu Nov  2 11:52:01 2017
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Thu, 2 Nov 2017 09:52:01 -0700
Subject: A criteria for Emoji property assignment?
In-Reply-To: <24621509639067@web11g.yandex.ru>
References: <mailman.0.1509632840.19858.unicode@unicode.org>
 <24621509639067@web11g.yandex.ru>
Message-ID: <21c5bccf-680b-92e1-0970-aa8b1886571c@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20171102/54de50b6/attachment.html>

From unicode at unicode.org  Fri Nov  3 04:13:50 2017
From: unicode at unicode.org (Andre Schappo via Unicode)
Date: Fri, 3 Nov 2017 09:13:50 +0000
Subject: ASCII v Unicode
Message-ID: <3E1A532E-9E80-46F4-92FB-15B838BC1D84@lboro.ac.uk>


You may find https://twitter.com/andreschappo/status/926163719331176450 amusing

André Schappo


From unicode at unicode.org  Fri Nov  3 04:36:43 2017
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Fri, 3 Nov 2017 02:36:43 -0700
Subject: ASCII v Unicode
In-Reply-To: <3E1A532E-9E80-46F4-92FB-15B838BC1D84@lboro.ac.uk>
References: <3E1A532E-9E80-46F4-92FB-15B838BC1D84@lboro.ac.uk>
Message-ID: <62ebc768-7cfa-3bd9-53bc-601a52d76f60@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20171103/47514fb6/attachment.html>

From unicode at unicode.org  Fri Nov  3 06:29:31 2017
From: unicode at unicode.org (Andre Schappo via Unicode)
Date: Fri, 3 Nov 2017 11:29:31 +0000
Subject: ASCII v Unicode
In-Reply-To: <62ebc768-7cfa-3bd9-53bc-601a52d76f60@ix.netcom.com>
References: <3E1A532E-9E80-46F4-92FB-15B838BC1D84@lboro.ac.uk>
 <62ebc768-7cfa-3bd9-53bc-601a52d76f60@ix.netcom.com>
Message-ID: <EEF466D1-3200-4D2E-A0AA-9F1C85F5B738@lboro.ac.uk>


On 3 Nov 2017, at 09:36, Asmus Freytag via Unicode <unicode at unicode.org> wrote:

On 11/3/2017 2:13 AM, Andre Schappo via Unicode wrote:

You may find https://twitter.com/andreschappo/status/926163719331176450 amusing

André Schappo


You're wildly off in your page count.

The "book" part of Unicode (Core Specification) alone is 1,500 pages. I haven't looked at the single file code charts in a while, but I believe you get at least that number again. Then add the dozen or so "Annexes" for a few hundred additional pages and be happy that nobody prints the Unicode Character Database (or the Unihan Database for that matter).

A./

Yes, I agree, my page count is much lower than it should be for Unicode, if I was being literal. I was being figurative rather than literal. I was just making a point to the ASCII developers/programmers and ASCII Academics.

Prior to tweeting I did consider other numbers. My considerations included 1000, 5000 and 10000. But in my mind "Unicode is a 500 page book" seemed to flow better. I don't know why.

Actually, it is probably for the best that I wrote "500 page" because otherwise ASCII developers/programmers and ASCII Academics would not even start reading the Unicode book if they thought it was (say) 5000 pages long.

Let's now look at it literally and here is a template "Unicode is a X page book".

My guess would be "Unicode is a 10000+ page book"

Anyone care to estimate X?

André Schappo



From unicode at unicode.org  Fri Nov  3 07:50:10 2017
From: unicode at unicode.org (Phake Nick via Unicode)
Date: Fri, 3 Nov 2017 20:50:10 +0800
Subject: ASCII v Unicode
In-Reply-To: <EEF466D1-3200-4D2E-A0AA-9F1C85F5B738@lboro.ac.uk>
References: <3E1A532E-9E80-46F4-92FB-15B838BC1D84@lboro.ac.uk>
 <62ebc768-7cfa-3bd9-53bc-601a52d76f60@ix.netcom.com>
 <EEF466D1-3200-4D2E-A0AA-9F1C85F5B738@lboro.ac.uk>
Message-ID: <CAGHjPPJiHAFmiFcaDgZvq9xE=SoF_7Xz+_LQ4NZi4BMrTmXbzg@mail.gmail.com>

The entire Unicode can also be printed on a single page if you use a very
large sheet of paper and a small enough font size! I think a sheet the size
of a football field could possibly do the job?


From unicode at unicode.org  Fri Nov  3 08:44:46 2017
From: unicode at unicode.org (Andre Schappo via Unicode)
Date: Fri, 3 Nov 2017 13:44:46 +0000
Subject: ASCII v Unicode
In-Reply-To: <CAGHjPPJiHAFmiFcaDgZvq9xE=SoF_7Xz+_LQ4NZi4BMrTmXbzg@mail.gmail.com>
References: <3E1A532E-9E80-46F4-92FB-15B838BC1D84@lboro.ac.uk>
 <62ebc768-7cfa-3bd9-53bc-601a52d76f60@ix.netcom.com>
 <EEF466D1-3200-4D2E-A0AA-9F1C85F5B738@lboro.ac.uk>
 <CAGHjPPJiHAFmiFcaDgZvq9xE=SoF_7Xz+_LQ4NZi4BMrTmXbzg@mail.gmail.com>
Message-ID: <165C5859-D374-45F6-B882-E887515885FD@lboro.ac.uk>


hmmmm.... I think the only way we can resolve this "X page Unicode book" issue is to recruit an infinite number of monkeys

André Schappo


André Schappo
https://schappo.blogspot.co.uk
https://twitter.com/andreschappo
https://weibo.com/andreschappo
https://groups.google.com/forum/#!forum/computer-science-curriculum-internationalization



From unicode at unicode.org  Fri Nov  3 11:23:04 2017
From: unicode at unicode.org (Asmus Freytag (c) via Unicode)
Date: Fri, 3 Nov 2017 09:23:04 -0700
Subject: ASCII v Unicode
In-Reply-To: <31535187.44475.1509725560242.JavaMail.defaultUser@defaultHost>
References: <10738465.40177.1509722360750.JavaMail.root@webmail11.bt.ext.cpcloud.co.uk>
 <31535187.44475.1509725560242.JavaMail.defaultUser@defaultHost>
Message-ID: <2fac7b41-4fa0-2f3a-c403-8b83586c87bf@ix.netcom.com>

On 11/3/2017 9:12 AM, William_J_G Overington wrote:
> GS1-128 barcode technology is being introduced into National Health Service hospitals in the United Kingdom.

This is so off-topic and unrelated to the discussion.

A./
>
> http://www.scan4safety.nhs.uk/
>
> As barcode scanners will be in use, a not unrealistic scenario is that localizable sentences encoded in GS1-128 barcodes could be used for some everyday communication through the language barrier.
>
> For example, a whole sentence, such as, here localized into English,
>
> Would you like a drink of water?
>
> could be encoded as
>
> ::781:;
>
> within Application Identifier 97 of a GS1-128 barcode.
>
> Suppose that this system were being implemented.
>
> For localization into English, the sentence.dat text file could contain the following line of text for localizing that particular localizable sentence.
>
> ::781:;|Would you like a drink of water?
>
> If the sentence.dat file and the software to handle it were implemented in 7-bit ASCII the system would work fine for localization into English.
>
> If many sentence.dat files, one for each language, and the software to handle them were implemented in 8-bit ASCII the system would work fine for localization into English and for localization into many of the languages of Western Europe and Scandinavia.
>
> If many sentence.dat files, one for each language, and the software to handle them were implemented in Unicode using the UTF-16 text file format for each sentence.dat file, the system would work fine for localization into many languages of the world.
>
> This seems to me to be a very good example of why Unicode is so much better than ASCII.
>
> William Overington
>
> Friday 3 November 2017
>
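
For readers unfamiliar with the scheme quoted above, the lookup it describes can be sketched as follows. This is only an illustration under stated assumptions: the one-entry-per-line `code|text` layout and the UTF-16 file encoding are taken from the quoted message, while `parse_sentence_lines` is a hypothetical helper name and the in-memory bytes stand in for a real sentence.dat file.

```python
def parse_sentence_lines(lines):
    """Map each sentence code (e.g. '::781:;') to its localized text."""
    table = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if "|" not in line:
            continue  # skip blank or malformed lines
        code, text = line.split("|", 1)
        table[code] = text
    return table


# Stand-in for the contents of an English sentence.dat stored as UTF-16.
raw_bytes = "::781:;|Would you like a drink of water?\n".encode("utf-16")

# Decoding with "utf-16" consumes the byte order mark written by encode().
english = parse_sentence_lines(raw_bytes.decode("utf-16").splitlines())
print(english["::781:;"])
```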


From unicode at unicode.org  Sat Nov  4 07:04:03 2017
From: unicode at unicode.org (Andre Schappo via Unicode)
Date: Sat, 4 Nov 2017 12:04:03 +0000
Subject: ASCII v Unicode
In-Reply-To: <bbbca2fa-76bb-4855-4eca-d4ca8ce8607a@ix.netcom.com>
References: <3E1A532E-9E80-46F4-92FB-15B838BC1D84@lboro.ac.uk>
 <62ebc768-7cfa-3bd9-53bc-601a52d76f60@ix.netcom.com>
 <59FC8B49.8040004@unicode.org>
 <bbbca2fa-76bb-4855-4eca-d4ca8ce8607a@ix.netcom.com>
Message-ID: <CD434BAE-2D97-4C38-B4F3-AAFA5D439D4C@lboro.ac.uk>

We now have a literal number for ASCII which is 31 pages https://twitter.com/srl295/status/926530928171671552

André Schappo

On 3 Nov 2017, at 15:45, Asmus Freytag (c) <asmusf at ix.netcom.com> wrote:

On 11/3/2017 8:29 AM, Rick McGowan wrote:
The 10.0 chart PDF is 2570 pages.

On 11/3/2017 2:36 AM, Asmus Freytag via Unicode wrote:
single file code charts in a while, but I believe you get at least that number again.


PS:  @Andre: update to my last message: 1,500 Core, 2570+ Charts, and, say 430, for the UAXs would make 4,500 pages. Off by a factor of 9 from your initial value, but not quite "zillions". :)


From unicode at unicode.org  Sun Nov  5 23:55:53 2017
From: unicode at unicode.org (Mark Davis ☕️ via Unicode)
Date: Sun, 5 Nov 2017 21:55:53 -0800
Subject: ASCII v Unicode
In-Reply-To: <62ebc768-7cfa-3bd9-53bc-601a52d76f60@ix.netcom.com>
References: <3E1A532E-9E80-46F4-92FB-15B838BC1D84@lboro.ac.uk>
 <62ebc768-7cfa-3bd9-53bc-601a52d76f60@ix.netcom.com>
Message-ID: <CAJ2xs_ERRzsMR+FMjAEcZWF6g4Uyx4PjvwFZWNLVk29Y6Fw82A@mail.gmail.com>

I had some time on the plane this weekend, and generated some more
comprehensive figures that take the following into account:

   1. There are two senses of "Unicode". In the narrow sense, it is only
   the Unicode Standard (ie, Unicode Characters). But it has grown to have a
   more comprehensive sense, including the other two main projects of the
   Unicode Consortium: Unicode CLDR and ICU.
   2. The ca. 3,300 pages that Asmus cited include specification *text*
   alone, but *data/code* (eg, UCD property data, or source code for ICU)
   is a vital part of the projects.


I thus generated a rough comparison where I (a) included CLDR and ICU, and
(b) included data. That gave the following results (where "encoding"
includes both the Unicode Standard *and* UTS's that are aligned with it in
version, including emoji, since that is to be aligned with it).

[image: Inline image 1]


*Caveats*

   - *This is a rough approximation (my flight wasn't all that long...).*
   In particular, don't count on the 3 decimals of precision; that is just
   the spreadsheet charting.
   - For the data files and code files, I filtered by removing # comments,
   collapsing sequences of whitespace into a single space character, trimming
   whitespace, and tossing empty lines. I then counted a page as a total of 3K
   code points. So the page count for data and code is far smaller than simply
   a line count. (Didn't bother dropping // and /*...*/ comments in code.) I
   also excluded .txt files that had the word "test" (case-insensitive) in
   their names.
   - For html pages I took a few samples of PDFs for UTS's and ICU docs,
   and got a count of HTML code points per page for each generated type of
   page, then divided out to get an approximate page count.
   - There were some other filters: for example, for ICU sources I included
   only files of type {"cpp", "c", "h", "ucm", "java"}, since files of type
   "txt" were likely generated from CLDR data. For CLDR I excluded charts and
   Survey Tool pages, since that would have bulked up the CLDR pie-slice
   dramatically.
   - (And by the way, the pie-slice for emoji is not visible in this graph:
   just 0.1%.)
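
The data-file filter described in the caveats above can be sketched roughly as follows. This is not the script actually used, only a minimal illustration; it assumes "3K" means 3,000 code points, that line terminators are not counted, and that `#` always starts a comment. `filtered_length` and `estimated_pages` are hypothetical names.

```python
import re

CODE_POINTS_PER_PAGE = 3000  # assumption: "3K code points" read as 3,000


def filtered_length(text):
    """Count code points left after the filtering steps described above."""
    total = 0
    for raw in text.splitlines():
        line = raw.split("#", 1)[0]                # remove "#" comments
        line = re.sub(r"\s+", " ", line).strip()   # collapse, then trim whitespace
        if line:                                   # toss empty lines
            total += len(line)                     # len() of a str counts code points
    return total


def estimated_pages(text):
    """Approximate page count at CODE_POINTS_PER_PAGE code points per page."""
    return filtered_length(text) / CODE_POINTS_PER_PAGE


sample = "# header comment\n0041 ;  Lu   # LATIN CAPITAL LETTER A\n\n   \n0042;Lu\n"
print(filtered_length(sample), estimated_pages(sample))
```

Because comments and padding are dropped before counting, the estimate comes out well below a plain line count, which matches the remark above.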


Mark <https://twitter.com/mark_e_davis>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2017-11-05 at 23.41.47.png
Type: image/png
Size: 57301 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20171105/26b185e1/attachment.png>

From unicode at unicode.org  Tue Nov  7 01:33:07 2017
From: unicode at unicode.org (Sudhanwa Jogalekar via Unicode)
Date: Tue, 7 Nov 2017 13:03:07 +0530
Subject: ASCII v Unicode
In-Reply-To: <CAGHjPPJiHAFmiFcaDgZvq9xE=SoF_7Xz+_LQ4NZi4BMrTmXbzg@mail.gmail.com>
References: <3E1A532E-9E80-46F4-92FB-15B838BC1D84@lboro.ac.uk>
 <62ebc768-7cfa-3bd9-53bc-601a52d76f60@ix.netcom.com>
 <EEF466D1-3200-4D2E-A0AA-9F1C85F5B738@lboro.ac.uk>
 <CAGHjPPJiHAFmiFcaDgZvq9xE=SoF_7Xz+_LQ4NZi4BMrTmXbzg@mail.gmail.com>
Message-ID: <CA+Ho68xKJezBg09a+su_NHQJj=NS=NhKDhK4718bLwPsFwVuCA@mail.gmail.com>

Let's create another Annex for standardising "Printing of the Unicode
Standard": usage of fonts, paper size, etc. ;-)

LOL !!



-- 

~!~!~!~!~!~!~!~!~!~!~!~!~!~!~!~!~!~!~!~!~!~!~!~!~!
web: www.sudhanwa.com  blog: www.sudhanwa.in
Twitter: sudhanwa Check on FB, Linkedin for more.

From unicode at unicode.org  Thu Nov  9 02:47:28 2017
From: unicode at unicode.org (Elias Mårtenson via Unicode)
Date: Thu, 9 Nov 2017 16:47:28 +0800
Subject: Aw: Re: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <20170703.184946.1082299263384367210.wl@gnu.org>
References: <trinity-13d1f0eb-d819-423d-9d1e-89d49faccb0b-1499011152445@3capp-webde-bs63>
 <6063F267-BF9A-42E3-B679-E48BACE47541@alastairs-place.net>
 <trinity-6fee03e9-39c2-40e1-9bd5-6fe4f193b67a-1499097928517@3capp-webde-bs05>
 <20170703.184946.1082299263384367210.wl@gnu.org>
Message-ID: <CADtN0WLyNPcHaqRvys60ksJQZco4CiO-6mSxmUTtrk0xKZaYVA@mail.gmail.com>

On 4 July 2017 at 00:49, Werner LEMBERG via Unicode <unicode at unicode.org>
wrote:

>
> > No, the hyphenation oddity involving the addition of letters with
> > hyphenation (or, to be more precise, to suppress letters in
> > unhyphenated words) never affected the letter s.
>
> I'm not sure that this is really true.  As far as I know, `sss' in
> Swiss German was handled similar to other triplet consonants before
> the 1996 spelling reform.  In other words, you would have written
>
>   Abschlussatz (`closing sentence')
>
> instead of
>
>   Abschlusssatz  ,
>
> and which would have been hyphenated as
>
>   Abschluss-satz
>

This is still the case for Swedish though. I studied German before 1996,
and I was under the impression that the rules in this case were identical
for Swedish and German. What do the rules say now?

Regards,
Elias

From unicode at unicode.org  Thu Nov  9 04:12:25 2017
From: unicode at unicode.org (Walter Tross via Unicode)
Date: Thu, 9 Nov 2017 11:12:25 +0100
Subject: Aw: Re: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <CADtN0WLyNPcHaqRvys60ksJQZco4CiO-6mSxmUTtrk0xKZaYVA@mail.gmail.com>
References: <trinity-13d1f0eb-d819-423d-9d1e-89d49faccb0b-1499011152445@3capp-webde-bs63>
 <6063F267-BF9A-42E3-B679-E48BACE47541@alastairs-place.net>
 <trinity-6fee03e9-39c2-40e1-9bd5-6fe4f193b67a-1499097928517@3capp-webde-bs05>
 <20170703.184946.1082299263384367210.wl@gnu.org>
 <CADtN0WLyNPcHaqRvys60ksJQZco4CiO-6mSxmUTtrk0xKZaYVA@mail.gmail.com>
Message-ID: <CABtA2eF9M02uX-ntMOW2ijqzQP-n6WVAt4frgnyggk+eptT0gg@mail.gmail.com>

Long story short: it's Abschlusssatz now (and Rollladen, etc.). One of the
criteria of the reform was to normalise hyphenation. This has gone so far
as to hyphenate Bä-cker, with the additional criterion of keeping the c
inside its group.


From unicode at unicode.org  Thu Nov  9 20:40:19 2017
From: unicode at unicode.org (Elias Mårtenson via Unicode)
Date: Fri, 10 Nov 2017 10:40:19 +0800
Subject: Aw: Re: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <CABtA2eF9M02uX-ntMOW2ijqzQP-n6WVAt4frgnyggk+eptT0gg@mail.gmail.com>
References: <trinity-13d1f0eb-d819-423d-9d1e-89d49faccb0b-1499011152445@3capp-webde-bs63>
 <6063F267-BF9A-42E3-B679-E48BACE47541@alastairs-place.net>
 <trinity-6fee03e9-39c2-40e1-9bd5-6fe4f193b67a-1499097928517@3capp-webde-bs05>
 <20170703.184946.1082299263384367210.wl@gnu.org>
 <CADtN0WLyNPcHaqRvys60ksJQZco4CiO-6mSxmUTtrk0xKZaYVA@mail.gmail.com>
 <CABtA2eF9M02uX-ntMOW2ijqzQP-n6WVAt4frgnyggk+eptT0gg@mail.gmail.com>
Message-ID: <CADtN0WJnFbm-4L7e3aoEJpD9NsOpFyofFu4A1_ti-Txf=4e3=Q@mail.gmail.com>

On 9 November 2017 at 18:12, Walter Tross <waltertross at gmail.com> wrote:

> Long story short: it's Abschlusssatz now (and Rollladen, etc.) One of the
> criteria of the reform was to normalise hyphenation. This has gone so far
> as to hyphenate Bä-cker, with the additional criterion of keeping the c
> inside its group.
>

Wow. That looks incredibly strange to me. Thanks for informing me of this
change, I would probably have thought it to be a typo if I saw that
written. As for Bäcker, I presume the previous hyphenation was Bäck-er? (at
least that's how it would be written in Swedish). Is this still allowed?
I.e. are the hyphenation points Bä-ck-er?

Regards,
Elias

From unicode at unicode.org  Thu Nov  9 21:11:17 2017
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Thu, 9 Nov 2017 19:11:17 -0800
Subject: Aw: Re: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <CADtN0WJnFbm-4L7e3aoEJpD9NsOpFyofFu4A1_ti-Txf=4e3=Q@mail.gmail.com>
References: <trinity-13d1f0eb-d819-423d-9d1e-89d49faccb0b-1499011152445@3capp-webde-bs63>
 <6063F267-BF9A-42E3-B679-E48BACE47541@alastairs-place.net>
 <trinity-6fee03e9-39c2-40e1-9bd5-6fe4f193b67a-1499097928517@3capp-webde-bs05>
 <20170703.184946.1082299263384367210.wl@gnu.org>
 <CADtN0WLyNPcHaqRvys60ksJQZco4CiO-6mSxmUTtrk0xKZaYVA@mail.gmail.com>
 <CABtA2eF9M02uX-ntMOW2ijqzQP-n6WVAt4frgnyggk+eptT0gg@mail.gmail.com>
 <CADtN0WJnFbm-4L7e3aoEJpD9NsOpFyofFu4A1_ti-Txf=4e3=Q@mail.gmail.com>
Message-ID: <a6975c43-499a-d28a-73b1-1d76ff0217d7@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20171109/c24a044d/attachment.html>

From unicode at unicode.org  Thu Nov  9 21:25:47 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 10 Nov 2017 04:25:47 +0100
Subject: Aw: Re: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <CADtN0WJnFbm-4L7e3aoEJpD9NsOpFyofFu4A1_ti-Txf=4e3=Q@mail.gmail.com>
References: <trinity-13d1f0eb-d819-423d-9d1e-89d49faccb0b-1499011152445@3capp-webde-bs63>
 <6063F267-BF9A-42E3-B679-E48BACE47541@alastairs-place.net>
 <trinity-6fee03e9-39c2-40e1-9bd5-6fe4f193b67a-1499097928517@3capp-webde-bs05>
 <20170703.184946.1082299263384367210.wl@gnu.org>
 <CADtN0WLyNPcHaqRvys60ksJQZco4CiO-6mSxmUTtrk0xKZaYVA@mail.gmail.com>
 <CABtA2eF9M02uX-ntMOW2ijqzQP-n6WVAt4frgnyggk+eptT0gg@mail.gmail.com>
 <CADtN0WJnFbm-4L7e3aoEJpD9NsOpFyofFu4A1_ti-Txf=4e3=Q@mail.gmail.com>
Message-ID: <CAGa7JC3DJhKEtHUgKwWV8+orn_3-CSVKdQeNdGBwWwKfizJ2jw@mail.gmail.com>

2017-11-10 3:40 GMT+01:00 Elias Mårtenson via Unicode <unicode at unicode.org>:

> On 9 November 2017 at 18:12, Walter Tross <waltertross at gmail.com> wrote:
>
>> Long story short: it's Abschlusssatz now (and Rollladen, etc.) One of the
>> criteria of the reform was to normalise hyphenation. This has gone so far
>> as to hyphenate Bä-cker, with the additional criterion of keeping the c
>> inside its group.
>>
>
> Wow. That looks incredibly strange to me. Thanks for informing me of this
> change, I would probably have thought it to be a typo if I saw that
> written. As for Bäcker, I presume the previous hyphenation was Bäck-er? (at
> least that's how it would be written in Swedish). Is this still allowed?
> I.e. are the hyphenation points Bä-ck-er?
>

The strange thing about the "triple s" is that it occurs when hyphenated as
  "ss<shy>s"
but if hyphenation does not occur, the "triple s" becomes only two (as if
"ss<shy>" were contextually creating a ligature as a single "s"). We have no
way to create custom hyphenation sequences such as:
  "s<softhyphensequence>s-<br/></softhyphensequence>s"
which is what was really intended (with no hyphen the word is compacted
using only two "s").

Also I presume that to force the grouping of "ck" and prevent the soft hyphen
from breaking it, a SHY could be used just after it as
  "Bäck<SHY>er",
but I think what was meant was really this:
  "Bäck<hyphenate>-<br/>k</hyphenate>er"
where the k is repeated AFTER the linebreak while keeping the "ck" group
before.

It is possible to do that with some markup language, but not in Unicode
plain text without requesting the addition of two new controls!

And things could be even worse: here we specify what happens when a
linebreak occurs and specify nothing if it does not (the whole inner
sequence is deleted). So if the "hyphenated triple s" is compacted to a
single sharp s when there's no linebreak, we would need something like this:
  "<nohyphenate>ß</nohyphenate><hyphenate>ss-<br/>s</hyphenate>"
And for that we would need at least 3 controls in plain text if we don't
want markup!
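
The break/no-break alternatives discussed above can be modelled as a three-part record, much like the pre-break, post-break and no-break texts of TeX's \discretionary. The following is only a sketch of that idea, not a proposal for plain-text controls; `Discretionary` and `render` are hypothetical names, and the two sample words use the pre-1996 spellings mentioned in this thread.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Discretionary:
    pre_break: str   # text ending the first line when the break is taken
    post_break: str  # text starting the next line when the break is taken
    no_break: str    # text used when no break is taken here


Piece = Union[str, Discretionary]


def render(pieces: List[Piece], break_at: int = -1) -> str:
    """Join pieces; `break_at` is the index of the discretionary that breaks (-1: none)."""
    out = []
    for index, piece in enumerate(pieces):
        if isinstance(piece, str):
            out.append(piece)
        elif index == break_at:
            out.append(piece.pre_break + "\n" + piece.post_break)
        else:
            out.append(piece.no_break)
    return "".join(out)


# Pre-1996 conventions as described in the thread: "ck" breaks as "k-" / "k",
# and two written "s" become "ss-" / "s" when the word is broken.
baecker = ["Bä", Discretionary("k-", "k", "ck"), "er"]
abschlussatz = ["Abschlu", Discretionary("ss-", "s", "ss"), "atz"]

print(render(baecker))              # unbroken form
print(render(baecker, 1))           # broken form, on two lines
print(render(abschlussatz))         # unbroken form
print(render(abschlussatz, 1))      # broken form, on two lines
```

The point of the three-field record is that the unbroken spelling is stored explicitly instead of being derived by deleting the inner sequence, which is exactly the case the last example in the message needs.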

From unicode at unicode.org  Thu Nov  9 21:27:59 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 10 Nov 2017 04:27:59 +0100
Subject: Aw: Re: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <a6975c43-499a-d28a-73b1-1d76ff0217d7@ix.netcom.com>
References: <trinity-13d1f0eb-d819-423d-9d1e-89d49faccb0b-1499011152445@3capp-webde-bs63>
 <6063F267-BF9A-42E3-B679-E48BACE47541@alastairs-place.net>
 <trinity-6fee03e9-39c2-40e1-9bd5-6fe4f193b67a-1499097928517@3capp-webde-bs05>
 <20170703.184946.1082299263384367210.wl@gnu.org>
 <CADtN0WLyNPcHaqRvys60ksJQZco4CiO-6mSxmUTtrk0xKZaYVA@mail.gmail.com>
 <CABtA2eF9M02uX-ntMOW2ijqzQP-n6WVAt4frgnyggk+eptT0gg@mail.gmail.com>
 <CADtN0WJnFbm-4L7e3aoEJpD9NsOpFyofFu4A1_ti-Txf=4e3=Q@mail.gmail.com>
 <a6975c43-499a-d28a-73b1-1d76ff0217d7@ix.netcom.com>
Message-ID: <CAGa7JC0i1+pgEqbrnJHktXJBYs-X2pXr2LeGumA1nL9f8Mb6ow@mail.gmail.com>

So this is effectively (custom HTML-like markup):
  "Bä<nohyphenate>c</nohyphenate><hyphenate>k-<br/></hyphenate>ker"


2017-11-10 4:11 GMT+01:00 Asmus Freytag via Unicode <unicode at unicode.org>:

> On 11/9/2017 6:40 PM, Elias Mårtenson via Unicode wrote:
>
> On 9 November 2017 at 18:12, Walter Tross <waltertross at gmail.com> wrote:
>
>> Long story short: it's Abschlusssatz now (and Rollladen, etc.) One of the
>> criteria of the reform was to normalise hyphenation. This has gone so far
>> as to hyphenate Bä-cker, with the additional criterion of keeping the c
>> inside its group.
>>
>
> Wow. That looks incredibly strange to me. Thanks for informing me of this
> change, I would probably have thought it to be a typo if I saw that
> written. As for Bäcker, I presume the previous hyphenation was Bäck-er?
>
>
> no, Bäk-ker ...
>
> (at least that's how it would be written in Swedish). Is this still
> allowed? I.e. are the hyphenation points Bä-ck-er?
>
> Regards,
> Elias
>
>
>

From unicode at unicode.org  Fri Nov 10 07:44:10 2017
From: unicode at unicode.org (Walter Tross via Unicode)
Date: Fri, 10 Nov 2017 14:44:10 +0100
Subject: Aw: Re: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <CAGa7JC0i1+pgEqbrnJHktXJBYs-X2pXr2LeGumA1nL9f8Mb6ow@mail.gmail.com>
References: <trinity-13d1f0eb-d819-423d-9d1e-89d49faccb0b-1499011152445@3capp-webde-bs63>
 <6063F267-BF9A-42E3-B679-E48BACE47541@alastairs-place.net>
 <trinity-6fee03e9-39c2-40e1-9bd5-6fe4f193b67a-1499097928517@3capp-webde-bs05>
 <20170703.184946.1082299263384367210.wl@gnu.org>
 <CADtN0WLyNPcHaqRvys60ksJQZco4CiO-6mSxmUTtrk0xKZaYVA@mail.gmail.com>
 <CABtA2eF9M02uX-ntMOW2ijqzQP-n6WVAt4frgnyggk+eptT0gg@mail.gmail.com>
 <CADtN0WJnFbm-4L7e3aoEJpD9NsOpFyofFu4A1_ti-Txf=4e3=Q@mail.gmail.com>
 <a6975c43-499a-d28a-73b1-1d76ff0217d7@ix.netcom.com>
 <CAGa7JC0i1+pgEqbrnJHktXJBYs-X2pXr2LeGumA1nL9f8Mb6ow@mail.gmail.com>
Message-ID: <CABtA2eHc6QFcqVULtp5nxSMmxOmSEoBU_6vYTAAGWWNVFnxprA@mail.gmail.com>

Correct.
Just a note: the current hyphenation is Bä-cker (as I wrote in a previous
email) ( https://www.duden.de/rechtschreibung/Baecker )

2017-11-10 4:27 GMT+01:00 Philippe Verdy via Unicode <unicode at unicode.org>:

> So this is effectively (custom HTML-like markup)
>   "Bä<nohyphenate>c</nohyphenate><hyphenate>k-<br/></hyphenate>ker"
>
>
> 2017-11-10 4:11 GMT+01:00 Asmus Freytag via Unicode <unicode at unicode.org>:
>
>> On 11/9/2017 6:40 PM, Elias Mårtenson via Unicode wrote:
>>
>> On 9 November 2017 at 18:12, Walter Tross <waltertross at gmail.com> wrote:
>>
>>> Long story short: it's Abschlusssatz now (and Rollladen, etc.) One of
>>> the criteria of the reform was to normalise hyphenation. This has gone so
>>> far as to hyphenate Bä-cker, with the additional criterion of keeping the c
>>> inside its group.
>>>
>>
>> Wow. That looks incredibly strange to me. Thanks for informing me of this
>> change, I would probably have thought it to be a typo if I saw that
>> written. As for Bäcker, I presume the previous hyphenation was Bäck-er?
>>
>>
>> no, Bäk-ker ...
>>
>> (at least that's how it would be written in Swedish). Is this still
>> allowed? I.e. are the hyphenation points Bä-ck-er?
>>
>> Regards,
>> Elias
>>
>>
>>
>

From unicode at unicode.org  Sun Nov 12 16:19:52 2017
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 12 Nov 2017 22:19:52 +0000
Subject: ASCII v Unicode
In-Reply-To: <62ebc768-7cfa-3bd9-53bc-601a52d76f60@ix.netcom.com>
References: <3E1A532E-9E80-46F4-92FB-15B838BC1D84@lboro.ac.uk>
 <62ebc768-7cfa-3bd9-53bc-601a52d76f60@ix.netcom.com>
Message-ID: <20171112221952.21dcdc99@JRWUBU2>

On Fri, 3 Nov 2017 02:36:43 -0700
Asmus Freytag via Unicode <unicode at unicode.org> wrote:

> On 11/3/2017 2:13 AM, Andre Schappo via Unicode wrote:
> 
> You may find
> https://twitter.com/andreschappo/status/926163719331176450 amusing
>
> André Schappo
> 
> You're wildly off in your page count.
> 
> The "book" part of Unicode (Core Specification) alone is 1,500 pages.
> I haven't looked at the single file code charts in a while, but I
> believe you get at least that number again. Then add the dozen or so
> "Annexes" for a few hundred additional pages and be happy that nobody
> prints the Unicode Character Database (or the Unihan Database for
> that matter).

A reasonable comparison would be ASCII v. ISO 10646 v. Unicode.  For
example, casing and text boundaries are not normally considered as part
of the scope for ASCII.

Richard.


From unicode at unicode.org  Mon Nov 13 12:20:18 2017
From: unicode at unicode.org (Peter Constable via Unicode)
Date: Mon, 13 Nov 2017 18:20:18 +0000
Subject: Plane-2-only string
Message-ID: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>

I'm wondering if anyone could come up with a string of 15 to 40 characters _using only plane 2 characters_ that wouldn't be gibberish?

We are considering adding sample-text strings in some of our fonts. (In OpenType, the "name" table can take sample-text strings using name ID 19.) One particular issue we have is the Simsun-ExtB and MingLiU-ExtB fonts, which have CJK characters from plane 2 only.

Background:
The Simsun-ExtB and MingLiU-ExtB fonts are meant to complement the Simsun and MingLiU fonts: the combined glyph count exceeds the number of glyphs that can be added in a single OpenType font, and so the "ExtB" fonts are used to contain all of the Plane 2 characters that are supported. For example, the Simsun font supports 28738 BMP characters, and no plane 2 characters, while Simsun-ExtB supports the Basic Latin block from the BMP plus 47,293 plane 2 characters. The combined glyph count exceeds 64K, so can't go into a single font.
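
The arithmetic behind that limit can be checked directly. This is a rough sketch in Python: it treats the character counts quoted above as a lower bound on the glyph counts, and it assumes the ceiling comes from 16-bit glyph indices (at most 65,535 glyphs in one font). The variable names are only illustrative.

```python
# Assumed ceiling: glyph indices are 16-bit, so one font holds at most 65,535 glyphs.
MAX_GLYPHS_PER_FONT = 0xFFFF

# Character counts quoted in the message; real glyph counts are at least this large.
simsun_bmp_characters = 28738
simsun_extb_plane2_characters = 47293

combined = simsun_bmp_characters + simsun_extb_plane2_characters
over_limit_by = combined - MAX_GLYPHS_PER_FONT

print(combined, combined > MAX_GLYPHS_PER_FONT, over_limit_by)
```

Even before counting Basic Latin or any extra glyphs, the two sets together overshoot a single font by more than ten thousand entries.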


Peter

From unicode at unicode.org  Mon Nov 13 13:38:45 2017
From: unicode at unicode.org (James Kass via Unicode)
Date: Mon, 13 Nov 2017 11:38:45 -0800
Subject: Plane-2-only string
In-Reply-To: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
Message-ID: <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>

A font's sample text can be used in place of the default "The quick
brown fox..." text which is used to illustrate the typeface in
applications which support that feature.

One approach would be to find a non-gibberish text string using some
Plane 2 characters and add the BMP glyphs to the font mapped to the
BMP PUA.  Because if only a handful of BMP CJK glyphs were added to
the font mapped to their standard code points, the font might need to
claim to support BMP CJK (when in fact it does not) in order to
display the sample text.  Or, (if standard code points are used) the
font might be auto-detected as supporting BMP CJK by some
applications, when it doesn't really support that range.

On Mon, Nov 13, 2017 at 10:20 AM, Peter Constable via Unicode
<unicode at unicode.org> wrote:
> I'm wondering if anyone could come up with a string of 15 to 40 characters _using only plane 2 characters_ that wouldn't be gibberish?
>
> We are considering adding sample-text strings in some of our fonts. (In OpenType, the "name" table can take sample-text strings using name ID 19.) One particular issue we have is the Simsun-ExtB and MingLiU-ExtB fonts, which have CJK characters from plane 2 only.
>
> Background:
> The Simsun-ExtB and MingLiU-ExtB fonts are meant to complement the Simsun and MingLiU fonts: the combined glyph count exceeds the number of glyphs that can be added in a single OpenType font, and so the "ExtB" fonts are used to contain all of the Plane 2 characters that are supported. For example, the Simsun font supports 28738 BMP characters, and no plane 2 characters, while Simsun-ExtB supports the Basic Latin block from the BMP plus 47,293 plane 2 characters. The combined glyph count exceeds 64K, so can't go into a single font.
>
>
>
> Peter


From unicode at unicode.org  Mon Nov 13 13:51:18 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 13 Nov 2017 20:51:18 +0100
Subject: Plane-2-only string
In-Reply-To: <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
Message-ID: <CAGa7JC2GUrbAqqy5bUwLWoY-BpOE8OjvY_6SqYe2msFpb-R1fQ@mail.gmail.com>

Maybe this test page?
http://www.i18nguy.com/unicode/supplementary-test.html


2017-11-13 20:38 GMT+01:00 James Kass via Unicode <unicode at unicode.org>:

> A font's sample text can be used in place of the default "The quick
> brown fox..." text which is used to illustrate the typeface in
> applications which support that feature.
>
> One approach would be to find a non-gibberish text string using some
> Plane 2 characters and add the BMP glyphs to the font mapped to the
> BMP PUA.  Because if only a handful of BMP CJK glyphs were added to
> the font mapped to their standard code points, the font might need to
> claim to support BMP CJK (when in fact it does not) in order to
> display the sample text.  Or, (if standard code points are used) the
> font might be auto-detected as supporting BMP CJK by some
> applications, when it doesn't really support that range.
>
> On Mon, Nov 13, 2017 at 10:20 AM, Peter Constable via Unicode
> <unicode at unicode.org> wrote:
> > I'm wondering if anyone could come up with a string of 15 to 40
> characters _using only plane 2 characters_ that wouldn't be gibberish?
> >
> > We are considering adding sample-text strings in some of our fonts. (In
> OpenType, the "name" table can take sample-text strings using name ID 19.)
> One particular issue we have is the Simsun-ExtB and MingLiU-ExtB fonts,
> which have CJK characters from plane 2 only.
> >
> > Background:
> > The Simsun-ExtB and MingLiU-ExtB fonts are meant to complement the
> Simsun and MingLiU fonts: the combined glyph count exceeds the number of
> glyphs that can be added in a single OpenType font, and so the "ExtB" fonts
> are used to contain all of the Plane 2 characters that are supported. For
> example, the Simsun font supports 28738 BMP characters, and no plane 2
> characters, while Simsun-ExtB supports the Basic Latin block from the BMP
> plus 47,293 plane 2 characters. The combined glyph count exceeds 64K, so
> can't go into a single font.
> >
> >
> >
> > Peter
>
>

From unicode at unicode.org  Mon Nov 13 14:05:24 2017
From: unicode at unicode.org (Charlie Ruland via Unicode)
Date: Mon, 13 Nov 2017 21:05:24 +0100
Subject: Plane-2-only string
In-Reply-To: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
Message-ID: <d1cee7db-b59e-46fc-ae02-09d0f51a6b4c@luckymail.com>

Many of the characters in the CJK Compatibility Ideographs Supplement block 
are quite common Chinese characters, or variants thereof. You could try 
and build Chinese sentences with these characters.


On Mon, 13 Nov 2017 at 20:20 GMT+01:00 Peter Constable via Unicode wrote:
> I'm wondering if anyone could come up with a string of 15 to 40 characters _using only plane 2 characters_ that wouldn't be gibberish?
>
> We are considering adding sample-text strings in some of our fonts. (In OpenType, the "name" table can take sample-text strings using name ID 19.) One particular issue we have is the Simsun-ExtB and MingLiU-ExtB fonts, which have CJK characters from plane 2 only.
>
> Background:
> The Simsun-ExtB and MingLiU-ExtB fonts are meant to complement the Simsun and MingLiU fonts: the combined glyph count exceeds the number of glyphs that can be added in a single OpenType font, and so the "ExtB" fonts are used to contain all of the Plane 2 characters that are supported. For example, the Simsun font supports 28738 BMP characters, and no plane 2 characters, while Simsun-ExtB supports the Basic Latin block from the BMP plus 47,293 plane 2 characters. The combined glyph count exceeds 64K, so can't go into a single font.
>
>
>
> Peter


From unicode at unicode.org  Mon Nov 13 14:25:24 2017
From: unicode at unicode.org (Peter Constable via Unicode)
Date: Mon, 13 Nov 2017 20:25:24 +0000
Subject: Plane-2-only string
In-Reply-To: <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
Message-ID: <CY4PR21MB0822FAFE8002F2F78D3D196AD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>

We don't want to add BMP characters to the ExtB fonts.


Peter

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of James Kass via Unicode
Sent: Monday, November 13, 2017 11:39 AM
To: Unicode list <unicode at unicode.org>
Subject: Re: Plane-2-only string

A font's sample text can be used in place of the default "The quick brown fox..." text which is used to illustrate the typeface in applications which support that feature.

One approach would be to find a non-gibberish text string using some Plane 2 characters and add the BMP glyphs to the font mapped to the BMP PUA.  Because if only a handful of BMP CJK glyphs were added to the font mapped to their standard code points, the font might need to claim to support BMP CJK (when in fact it does not) in order to display the sample text.  Or, (if standard code points are used) the font might be auto-detected as supporting BMP CJK by some applications, when it doesn't really support that range.

On Mon, Nov 13, 2017 at 10:20 AM, Peter Constable via Unicode <unicode at unicode.org> wrote:
> I'm wondering if anyone could come up with a string of 15 to 40 characters _using only plane 2 characters_ that wouldn't be gibberish?
>
> We are considering adding sample-text strings in some of our fonts. (In OpenType, the "name" table can take sample-text strings using name ID 19.) One particular issue we have is the Simsun-ExtB and MingLiU-ExtB fonts, which have CJK characters from plane 2 only.
>
> Background:
> The Simsun-ExtB and MingLiU-ExtB fonts are meant to complement the Simsun and MingLiU fonts: the combined glyph count exceeds the number of glyphs that can be added in a single OpenType font, and so the "ExtB" fonts are used to contain all of the Plane 2 characters that are supported. For example, the Simsun font supports 28738 BMP characters, and no plane 2 characters, while Simsun-ExtB supports the Basic Latin block from the BMP plus 47,293 plane 2 characters. The combined glyph count exceeds 64K, so can't go into a single font.
>
>
>
> Peter


From unicode at unicode.org  Mon Nov 13 14:29:01 2017
From: unicode at unicode.org (Peter Constable via Unicode)
Date: Mon, 13 Nov 2017 20:29:01 +0000
Subject: Plane-2-only string
In-Reply-To: <CAGa7JC2GUrbAqqy5bUwLWoY-BpOE8OjvY_6SqYe2msFpb-R1fQ@mail.gmail.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
 <CAGa7JC2GUrbAqqy5bUwLWoY-BpOE8OjvY_6SqYe2msFpb-R1fQ@mail.gmail.com>
Message-ID: <CY4PR21MB0822CEB22C570308F92E0F08D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>

Thanks. I'd need to know _at least something_ about what the characters signify, though, to have a sense of whether there's anything potentially offensive.


Peter

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy via Unicode
Sent: Monday, November 13, 2017 11:51 AM
To: James Kass <jameskasskrv at gmail.com>
Cc: Unicode list <unicode at unicode.org>
Subject: Re: Plane-2-only string

Maybe this test page?
http://www.i18nguy.com/unicode/supplementary-test.html


2017-11-13 20:38 GMT+01:00 James Kass via Unicode <unicode at unicode.org>:
A font's sample text can be used in place of the default "The quick
brown fox..." text which is used to illustrate the typeface in
applications which support that feature.

One approach would be to find a non-gibberish text string using some
Plane 2 characters and add the BMP glyphs to the font mapped to the
BMP PUA.  Because if only a handful of BMP CJK glyphs were added to
the font mapped to their standard code points, the font might need to
claim to support BMP CJK (when in fact it does not) in order to
display the sample text.  Or, (if standard code points are used) the
font might be auto-detected as supporting BMP CJK by some
applications, when it doesn't really support that range.

On Mon, Nov 13, 2017 at 10:20 AM, Peter Constable via Unicode
<unicode at unicode.org> wrote:
> I'm wondering if anyone could come up with a string of 15 to 40 characters _using only plane 2 characters_ that wouldn't be gibberish?
>
> We are considering adding sample-text strings in some of our fonts. (In OpenType, the "name" table can take sample-text strings using name ID 19.) One particular issue we have is the Simsun-ExtB and MingLiU-ExtB fonts, which have CJK characters from plane 2 only.
>
> Background:
> The Simsun-ExtB and MingLiU-ExtB fonts are meant to complement the Simsun and MingLiU fonts: the combined glyph count exceeds the number of glyphs that can be added in a single OpenType font, and so the "ExtB" fonts are used to contain all of the Plane 2 characters that are supported. For example, the Simsun font supports 28738 BMP characters, and no plane 2 characters, while Simsun-ExtB supports the Basic Latin block from the BMP plus 47,293 plane 2 characters. The combined glyph count exceeds 64K, so can't go into a single font.
>
>
>
> Peter


From unicode at unicode.org  Mon Nov 13 14:31:30 2017
From: unicode at unicode.org (Peter Constable via Unicode)
Date: Mon, 13 Nov 2017 20:31:30 +0000
Subject: Plane-2-only string
In-Reply-To: <d1cee7db-b59e-46fc-ae02-09d0f51a6b4c@luckymail.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <d1cee7db-b59e-46fc-ae02-09d0f51a6b4c@luckymail.com>
Message-ID: <CY4PR21MB0822CBE9AA6D971F0BFB1678D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>

Thanks for the suggestion. Alas, the fonts don't support that block.


Peter

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Charlie Ruland via Unicode
Sent: Monday, November 13, 2017 12:05 PM
To: unicode at unicode.org
Subject: Re: Plane-2-only string

Many of the characters in the CJK Compatibility Ideographs Supplement block are quite common Chinese characters, or variants thereof. You could try and build Chinese sentences with these characters.


On Mon, 13 Nov 2017 at 20:20 GMT+01:00 Peter Constable via Unicode wrote:
> I'm wondering if anyone could come up with a string of 15 to 40 characters _using only plane 2 characters_ that wouldn't be gibberish?
>
> We are considering adding sample-text strings in some of our fonts. (In OpenType, the "name" table can take sample-text strings using name ID 19.) One particular issue we have is the Simsun-ExtB and MingLiU-ExtB fonts, which have CJK characters from plane 2 only.
>
> Background:
> The Simsun-ExtB and MingLiU-ExtB fonts are meant to complement the Simsun and MingLiU fonts: the combined glyph count exceeds the number of glyphs that can be added in a single OpenType font, and so the "ExtB" fonts are used to contain all of the Plane 2 characters that are supported. For example, the Simsun font supports 28738 BMP characters, and no plane 2 characters, while Simsun-ExtB supports the Basic Latin block from the BMP plus 47,293 plane 2 characters. The combined glyph count exceeds 64K, so can't go into a single font.
>
>
>
> Peter


From unicode at unicode.org  Mon Nov 13 14:45:09 2017
From: unicode at unicode.org (James Kass via Unicode)
Date: Mon, 13 Nov 2017 12:45:09 -0800
Subject: Plane-2-only string
In-Reply-To: <CY4PR21MB0822FAFE8002F2F78D3D196AD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
 <CY4PR21MB0822FAFE8002F2F78D3D196AD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
Message-ID: <CABPY6Z2BPZYg8775RLqhmD5ObA3TzzSa=2q1+zuLRYj-1h1W+w@mail.gmail.com>

Peter Constable wrote,


On Mon, Nov 13, 2017 at 12:25 PM, Peter Constable
<petercon at microsoft.com> wrote:
> We don't want to add BMP characters to the ExtB fonts.
>
>
> Peter
>
> -----Original Message-----
> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of James Kass via Unicode
> Sent: Monday, November 13, 2017 11:39 AM
> To: Unicode list <unicode at unicode.org>
> Subject: Re: Plane-2-only string
>
> A font's sample text can be used in place of the default "The quick brown fox..." text which is used to illustrate the typeface in applications which support that feature.
>
> One approach would be to find a non-gibberish text string using some Plane 2 characters and add the BMP glyphs to the font mapped to the BMP PUA.  Because if only a handful of BMP CJK glyphs were added to the font mapped to their standard code points, the font might need to claim to support BMP CJK (when in fact it does not) in order to display the sample text.  Or, (if standard code points are used) the font might be auto-detected as supporting BMP CJK by some applications, when it doesn't really support that range.
>
> On Mon, Nov 13, 2017 at 10:20 AM, Peter Constable via Unicode <unicode at unicode.org> wrote:
>> I'm wondering if anyone could come up with a string of 15 to 40 characters _using only plane 2 characters_ that wouldn't be gibberish?
>>
>> We are considering adding sample-text strings in some of our fonts. (In OpenType, the "name" table can take sample-text strings using name ID 19.) One particular issue we have is the Simsun-ExtB and MingLiU-ExtB fonts, which have CJK characters from plane 2 only.
>>
>> Background:
>> The Simsun-ExtB and MingLiU-ExtB fonts are meant to complement the Simsun and MingLiU fonts: the combined glyph count exceeds the number of glyphs that can be added in a single OpenType font, and so the "ExtB" fonts are used to contain all of the Plane 2 characters that are supported. For example, the Simsun font supports 28738 BMP characters, and no plane 2 characters, while Simsun-ExtB supports the Basic Latin block from the BMP plus 47,293 plane 2 characters. The combined glyph count exceeds 64K, so can't go into a single font.
>>
>>
>>
>> Peter
>


From unicode at unicode.org  Mon Nov 13 14:46:18 2017
From: unicode at unicode.org (James Kass via Unicode)
Date: Mon, 13 Nov 2017 12:46:18 -0800
Subject: Plane-2-only string
In-Reply-To: <CY4PR21MB0822FAFE8002F2F78D3D196AD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
 <CY4PR21MB0822FAFE8002F2F78D3D196AD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
Message-ID: <CABPY6Z3LZ5O9-fKqO6_Fk+pow+7XCAoiiXsgyn+ncCHE9j3-FA@mail.gmail.com>

Peter Constable wrote,

> We don't want to add BMP characters to the ExtB fonts.

How about Plane 15 or 16, then?
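
For reference, the plane of a code point and whether it falls in a private use range can be computed as below. This is a small Python sketch; the function names plane and is_private_use are made up for illustration.

```python
def plane(code_point):
    """Return the Unicode plane number (0 to 16) of a code point."""
    return code_point >> 16


def is_private_use(code_point):
    """True for the BMP Private Use Area and the two supplementary PUA planes."""
    return (
        0xE000 <= code_point <= 0xF8FF          # BMP PUA (plane 0)
        or 0xF0000 <= code_point <= 0xFFFFD     # Supplementary PUA-A (plane 15)
        or 0x100000 <= code_point <= 0x10FFFD   # Supplementary PUA-B (plane 16)
    )


# A few code points and their (plane, private use) classification.
examples = {cp: (plane(cp), is_private_use(cp))
            for cp in (0xE000, 0x201A9, 0xF0000, 0x100000)}
```

Glyphs mapped into plane 15 or 16 would stay outside the BMP entirely.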

From unicode at unicode.org  Mon Nov 13 14:46:18 2017
From: unicode at unicode.org (John H. Jenkins via Unicode)
Date: Mon, 13 Nov 2017 13:46:18 -0700
Subject: Plane-2-only string
In-Reply-To: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
Message-ID: <55072E26-7741-4F95-9B98-BB5809C80D9A@apple.com>

?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? 

That is an example of forty Cantonese-specific characters which are not obscene (that I'm aware of) from Extension B. For the curious, I've appended at the bottom the full list of 280 for all of Plane 2 which I was able to pull out of the Unihan database. I'm sure some enterprising poet can make something out of them.
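
A candidate string can be checked mechanically against the constraints in the original request. The Python sketch below builds a string from the first fifteen code points of the list appended at the bottom and tests it; the function names are hypothetical, the choice of fifteen is arbitrary, and the result is not claimed to be a meaningful sentence. Spaces are ignored when counting, on the assumption that a separator from Basic Latin is acceptable.

```python
def from_labels(labels, sep=" "):
    """Turn labels such as 'U+201A9' into a string of the characters themselves."""
    return sep.join(chr(int(label[2:], 16)) for label in labels)


def is_valid_sample(text, shortest=15, longest=40):
    """True if the non-space characters are all in plane 2 and 15 to 40 in number."""
    chars = [ch for ch in text if not ch.isspace()]
    in_plane_2 = all(0x20000 <= ord(ch) <= 0x2FFFF for ch in chars)
    return in_plane_2 and shortest <= len(chars) <= longest


# First fifteen code points from the list below, used only as test data.
labels = [
    "U+201A9", "U+20325", "U+20341", "U+204FC", "U+20544",
    "U+2076D", "U+20779", "U+20BA8", "U+20BA9", "U+20BCB",
    "U+20BE6", "U+20BFD", "U+20C0B", "U+20C32", "U+20C41",
]
sample = from_labels(labels)
sample_ok = is_valid_sample(sample)
```

A string that mixes in even one BMP ideograph fails the check, which matches the restriction on the ExtB fonts.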

> On Nov 13, 2017, at 11:20 AM, Peter Constable via Unicode <unicode at unicode.org> wrote:
> 
> I'm wondering if anyone could come up with a string of 15 to 40 characters _using only plane 2 characters_ that wouldn't be gibberish?
> 
> We are considering adding sample-text strings in some of our fonts. (In OpenType, the "name" table can take sample-text strings using name ID 19.) One particular issue we have is the Simsun-ExtB and MingLiU-ExtB fonts, which have CJK characters from plane 2 only.
> 
> Background:
> The Simsun-ExtB and MingLiU-ExtB fonts are meant to complement the Simsun and MingLiU fonts: the combined glyph count exceeds the number of glyphs that can be added in a single OpenType font, and so the "ExtB" fonts are used to contain all of the Plane 2 characters that are supported. For example, the Simsun font supports 28738 BMP characters, and no plane 2 characters, while Simsun-ExtB supports the Basic Latin block from the BMP plus 47,293 plane 2 characters. The combined glyph count exceeds 64K, so can't go into a single font.
> 
> 
> 
> Peter

U+201A9		faan2	(Cant.) to play
U+20325		wu1 wu3	(Cant.) to bow, stoop
U+20341		man3	(Cant.) an undesirable situation
U+204FC		sip3	(Cant.) a wedge; to thrust in
U+20544		nap1	(Cant.) ???, a dimple
U+2076D		peng2	(Cant.) to fell, cut; to sweep away
U+20779		gaai3	(Cant.) to cut with a knife or scissors
U+20BA8		naai3	(Cant.) to tie, tow; bring along
U+20BA9		aa1 liu1	(Cant.) an interjection; rare, specialized
U+20BCB		jai4 jai5	(Cant.) naughty, inferior
U+20BE6		cai3	(Cant.) to eat, take a meal
U+20BFD		zi1	(Cant.) a final particle indicating affirmation
U+20C0B		jaau1	(Cant.) left-handed
U+20C32		eot1	(Cant.) to belch
U+20C41		tam3	(Cant.) to fool, trick, cheat
U+20C42		dat1	(Cant.) to put something or sit wherever one wishes; to rebuke, reproach
U+20C43		nip1	(Cant.) thin, flat; poor
U+20C53		ngai1	(Cant.) to importune, beg
U+20C58		ngaak6	(Cant.) contrary, opposing, against; disobedient
U+20C65		fik1 jit6 we5	(Cant.) wrangling, a noise; fitful; a soft fabric with no body
U+20C77		ming1	(Cant.) small
U+20C78		san2 seon2	(Cant.) phonetic
U+20C9C		zaang1	(Cant.) to owe
U+20CCF		ce2 ce6	(Cant.) interjection
U+20CD5		caau3	(Cant.) to search
U+20CD6		dap6	(Cant.) to strike, pound
U+20D15		miu2	(Cant.) to purse the lips; to wriggle
U+20D30		gau6	(Cant.) classifier for a piece or lump of something
U+20D47		keu4	(Cant.) peculiar, strange
U+20D48		mui2	(Cant.) to suck or chew without using the teeth
U+20D49		hong4	(Cant.) hope
U+20D69		go2	(Cant.) that
U+20D6F		gwit1 gwit3	(Cant.) onomatopoetic
U+20D7C		mang1 mang4	(Cant.) scars on the eyelid; phonetic
U+20D7E		waak1	(Cant.) eloquent, sharp-tongued
U+20D7F		pe1 pe5	(Cant.) a pair (from the Engl.); to stagger
U+20D9C		zai3	(Cant.) to do, work; to be willing
U+20DA7		dim6	(Cant.) straight, vertical; OK; to pick up with the fingers; verbal aspect marker of successful completion
U+20DB2		gap6 kap6	(Cant.) to stare at; to take a big bite
U+20E09		kak1	(Cant.) to block, obstruct
U+20E0A		tap1	(Cant.) an intensifying particle
U+20E0E		naa1	(Cant.) and, with
U+20E0F		ge2	(Cant.) final particle
U+20E10		kam1	(Cant.) to endure, last
U+20E11		soek3	(Cant.) soft, sodden
U+20E12		bou2	(Cant.) ????, a stranger
U+20E3A		ngaak6	(Cant.) contrary, opposing
U+20E6D		ko1	(Cant.) to call (Engl. loan-word)
U+20E73		git6	(Cant.) thick, viscous, dense
U+20E77		ngo4	(Cant.) to speak tirelessly
U+20E78		kam2	(Cant.) to cover, close up
U+20E7A		maai4	(Cant.) verbal aspect marker for completion or movement towards
U+20E7B		zam6	(Cant.) classifier for smells
U+20E8C		gwe1	(Cant.) timid
U+20E98		long1 long2	(Cant.) hard to get along with; to rinse, spread thin
U+20E9D		gaak3	(Cant.) final particle
U+20EA2		gaa1 gaa2	(Cant.) final particle
U+20EAA		he3 hi1	(Cant.) in a rush; slovenly
U+20EAB		leu1	(Cant.) strange, peculiar
U+20EAC		he2	(Cant.) final particle
U+20ED7		le4	(Cant.) imperative final particle
U+20ED8		zeot6	(Cant.) sound of eating (onomatopoetic)
U+20EF4		long2	(Cant.) to rinse
U+20EFA		aa6	(Cant.) final particle
U+20EFB		bai3	(Cant.) noise, clamor
U+20F15		paai2	(Cant.) a suffix indicating time
U+20F2D		but1	(Cant.) sound of a car-horn (onomatopoetic)
U+20F2E		ngai1 ngi1	(Cant.) to urge, importune; a lie, fib
U+20F31		loe1 loe2	(Cant.) to spit out; to pester, nag
U+20F4C		syut3	(Cant.) sound of something rushing by
U+20F52		neng2	(Cant.) classifier for hats
U+20F64		kik1	(Cant.) to block, obstruct; head; phonetic
U+20F8D		he3	(Cant.) to flick something off in a disorderly way
U+20F8F		ce1	(Cant.) interjection
U+20FAD		we5	(Cant.) soft fabric with no body
U+20FB4		baang4 baang6	(Cant.) phonetic
U+20FB5		zaa1	(Cant.) final particle
U+20FBC		cyut1 cyut6	(Cant.) phonetic
U+20FEA		gaa2	(Cant.) final particle
U+20FEB		saau4	(Cant.) shabby
U+20FEC		soe4	(Cant.) ignorant
U+20FED		wet1	(Cant.) to go somewhere to have a good time
U+2101D		nam6	(Cant.) sound asleep
U+2101E		zip1	(Cant.) a Jeep; to wave, beckon
U+21020		bei6	or; emphatic particle; (Cant.) particle implying doubt
U+21029		lok1	(Cant.) onomatopoetic
U+2104F		am1 ngam1	(Cant.) soft rice or food for a baby
U+2105C		wo5	(Cant.) particle to close a quote
U+2106F		dyut1	(Cant.) to pout
U+21075		gan2	(Cant.) aspect marker for continuous action
U+21076		zit1	(Cant.) to scratch an itch
U+21077		doeng1	(Cant.) a sharp point; to peck
U+21078		kwaat1	(Cant.) a circle, ring
U+2107B		ziu1	(Cant.) to beat someone up
U+21088		buk6	(Cant.) to lie prone; to bend over
U+21096		lai2	(Cant.) unrestrained
U+2109D		zuk6	(Cant.) to choke and cough
U+210C0		e4 nge4	(Cant.) a musical instrument
U+210C1		leng1	(Cant.) member of a triad; young
U+210C7		bai6	(Cant.) exclamation
U+210C8		kwaak1 kwaak3	(Cant.) a lasso; a circle, frame
U+210C9		gaa3	(Cant.) final particle
U+210CF		doe6	(Cant.) to droop, hang down
U+210D3		bo3	(Cant.) final particle for emphasis
U+210E4		laai6	(Cant.) to leave behind, omit
U+210F4		ceoi4	(Cant.) smell, odor
U+210F5		ngung1 ngung2	(Cant.) to cover, bury; push from behind
U+210F6		sek3	(Cant.) to like, love; to kiss
U+2111F		haa1	(Cant.) onomatopoetic, the sound of panting
U+2112F		jik1	(Cant.) hiccough
U+21135		ji1	(Cant.) to grin, laugh
U+2113D		soe4	(Cant.) to slide down
U+21148		laa3	(Cant.) a particle implying completion, certainty, or urgency
U+2114F		lai2	(Cant.) to accuse, slander; to turn, sprain
U+21180		gwang2	(Cant.) special relationship
U+21187		wok1	(Cant.) a watt (Engl. loan-word)
U+211D9		doe4	(Cant.) round and full
U+21681		bai6	used-up, malpractices; (Cant.) bad, vile, corrupt
U+21731		gei2	to envy, to be angry with; (Cant.) pregnant
U+2197C		me1	(Cant.) to carry on the back
U+21C2A		duk1	(Cant.) end, bottom, rump
U+21CAC		gwat6	(Cant.) blunt
U+220C7		lei5	(Cant.) a sail
U+22208		nap1	(Cant.) dimple
U+22605		maau4	(Cant.) flurried, flustered; arbitrarily
U+22696		ti4	(Cant.) intensifier
U+226F4		mang2	(Cant.) annoyed, impatient, restless
U+226F5		zang2	(Cant.) annoyed, irritated
U+22775		fit1	(Cant.) ???, to be fashionable
U+227B5		fit1	(Cant.) to brush, whisk
U+22803		geng6	(Cant.) to guard against; to take precautions
U+22939		goe4	(Cant.) satisfied, comfortable
U+22982		laan2	(Cant.) to brag, praise oneself
U+22A66		zit1	(Cant.) to squeeze out (as from a tube); to tickle
U+22ACF		kam2	(Cant.) to cover
U+22AD5		wing1 wing6	(Cant.) to throw away
U+22AE8		ngung2	(Cant.) to push from behind
U+22AEB		lat1	(Cant.) to rub
U+22B3F		kaai2 kaai5	(Cant.) sections or wedges (as of fruit); to take in the hand; to use
U+22B43		dau3 dau6	(Cant.) to touch; to bump into; to take, get, receive; to lightly support something with the hand
U+22B91		luk1	(Cant.) classifier for lengths of cylindrically shaped objects
U+22BCA		dik1	(Cant.) determination, resolution
U+22BCE		ngaau1	(Cant.) to scratch
U+22C38		wo5	(Cant.) rotten, bad, spoiled
U+22C51		waa2	(Cant.) to scratch
U+22C55		dap6	(Cant.) to beat, pound; to get drenched
U+22C62		saak3	to select; (Cant.) a wedge of a fruit such as an orange
U+22CA1		laa2 naa1	(Cant.) to grab with the hands; and, with
U+22CA9		kap1	(Cant.) to affix a chop or seal to a document
U+22CB5		cou5	(Cant.) to save up (money), to save up bit-by-bit
U+22CB7		ngaau1	to search; (Cant.) to scratch
U+22CB8		lou1	(Cant.) to shake violently, stir; to strip
U+22CC2		bat1 pat1	(Cant.) to scoop up, ladle out
U+22CC6		ngou4 ngou6	(Cant.) to shake, rattle
U+22D08		daat3	(Cant.) to throw down, fall down
U+22D12		paang1	(Cant.) to chase, drive away
U+22D44		cou5	(Cant.) to save up (money)
U+22D4C		deoi2	(Cant.) to goad, incite
U+22D53		paang1	(Cant.) to rush; chase someone out, drive out
U+22D67		gaan3	(Cant.) to draw lines
U+22D8D		saap3	(Cant.) garbage
U+22D9C		ngung2	(Cant.) to push; pull open
U+22DA0		saau4	(Cant.) to take without asking
U+22DA4		loe2	(Cant.) to pester, nag; to wallow; to roll around on the floor
U+22DAF		maan1	(Cant.) to pull, turn
U+22DEE		deoi2	(Cant.) to poke, nudge; stretch out
U+22E51		zaang6	(Cant.) to widen with force
U+22E8B		naan3	(Cant.) to stitch together, quilt
U+22F74		duk1	(Cant.) to poke, jab
U+233F4		jan4	(Cant.) a kind of fruit
U+233FE		dak6	(J) non-standard variant of 材 U+6750, material, stuff; timber; talent; (Cant.) a peg, row of pegs
U+23528		kang3	(Cant.) to be entangled, twisted; (of alcohol and tobacco) to be strong
U+23595		peng1	(Cant.) the back of a chair for one to lean against
U+2361A		seot1	(Cant.) a bar; to bolt, lock
U+23695		jaap3	(Cant.) to wave, beckon with the hand
U+236BA		hong2 hong6	(Cant.) a young chicken
U+239C2		laai5	(Cant.) untidy
U+23CB7		nap6	(Cant.) sticky; not smooth; slow
U+23CFC		doe4	(Cant.) salivating
U+241A3		saap6	(Cant.) to cook in boiling water
U+24292		luk6	(Cant.) to scald with boiling water
U+2430D		hok3	(Cant.) to fry
U+245C8		sip3	(Cant.) to squeeze in, to stuff in
U+24674		caau1	(Cant.) gore
U+2472F		kap6	(Cant.) to bite
U+24DB8		naa1	(Cant.) a scar
U+24DC7		wak6	(Cant.) severe pain
U+24DEA		mang2	(Cant.) impatient, restless
U+24DEB		cek3 cik1	(Cant.) a prickling pain, ache
U+24E3B		naa1	(Cant.) a scar, scab; and, with
U+24E50		lit3	(Cant.) a knot
U+24EA7		zang2	(Cant.) annoyed
U+24FC2		saai4	(Cant.) unattractive, pale
U+24FEA		zaap3	(Cant.) wrinkled, crumpled
U+2502C		jim2	(Cant.) a scar
U+25052		ngaau4	(Cant.) warped
U+2510E		cik1	(Cant.) to pull, lift up
U+2512B		gap6	(Cant.) to stare, peep at
U+25148		laap3	(Cant.) to look, scan
U+25160		hau1	(Cant.) to fix one's eyes on, gaze at
U+2517E		zong1	(Cant.) to peek or peep at
U+251E3		gwat6	(Cant.) to glance
U+25232		kip1	(Cant.) to keep a close eye on, to control
U+25236		nam6	(Cant.) sound asleep
U+2528C		caau4	(Cant.) wrinkled, folded, creased, crumpled
U+25299		zong1	(Cant.) to peep at, look at secretly
U+252C7		caang3	(Cant.) to open the eyes wide
U+252D8		saau4	(Cant.) to sweep the eyes over something
U+2531B		lai6	(Cant.) to gaze greedily at
U+25531		sin3	(Cant.) to slip
U+2553F		ham2	(Cant.) classifier for cannons, large guns, etc.
U+25945		lung1	(Cant.) a hole, hollow; cavity
U+259F9		tam5	(Cant.) puddle
U+25E49		nap6	(Cant.) sticky
U+26097		sok3	(Cant.) to tighten
U+260A5		dam3	(Cant.) to drop down
U+26258		caang1	(Cant.) a cooking pot, cooker
U+265BF		dap1	(Cant.) to hang down; to lower one's head
U+26629		paa4	(Cant.) chin
U+26696		zaap3	(Cant.) to wink
U+2688A		pok1	(Cant.) blister
U+26893		mak6	(Cant.) mole on skin
U+26926		hot3	(Cant.) a smell, scent
U+269F2		loe1 loe2	(Cant.) to dribble, spit; to pester, nag
U+269FA		laai2 laai5	(Cant.) to lick, lap up
U+26A88		ngou3	(Cant.) to kneel
U+26ED0		zaau3	(Cant.) to fry in oil
U+27285		gwaai2	(Cant.) frog, toad
U+272B6		doe3	(Cant.) insect sting
U+272CA		saa1	(Cant.) a large butterfly
U+272E6		mei1	(Cant.) a dragonfly; a small boat without a sail
U+27307		bang1	(Cant.) a large butterfly
U+27574		naan3	(Cant.) a pimple, an insect bite
U+27639		taai1	(Cant.) a necktie
U+27685		long6	(Cant.) crotch
U+27694		tung2	(Cant.) a kind of skirt
U+2775E		gei1	(Cant.) ???, khaki
U+2789D		lai6	(Cant.) to stare angrily
U+278C8		caau1	(Cant.) to gore
U+2797A		kwan1	(Cant.) to fool, deceive, hoodwink
U+279A0		ngaak1	(Cant.) to deceive
U+279DD		ngaa6	(Cant.) ????, to bar the way, obstruct
U+27A0A		zaa6	(Cant.) ????, to bar the way, obstruct
U+27A3E		tam3	(Cant.) to fool, trick, cheat
U+27D2F		me1	(Cant.) to carry on the back
U+27D84		zaang1	(Cant.) to owe
U+27ED9		mut6	(Cant.) ?????, not straightforward
U+27FD2		dam6	(Cant.) to stamp (one's foot)
U+27FEB		tau2	(Cant.) to have a rest
U+28023		kei2	(Cant.) a home, house
U+28024		leoi1	(Cant.) to suddenly fall or drop down
U+28048		gaang3	(Cant.) to ford, wade
U+28090		leoi1	(Cant.) to suddenly fall or drop down
U+280BD		dam6	(Cant.) to stamp the foot
U+280BE		naam3	(Cant.) to step across
U+280E9		sin3	(Cant.) to slip, slide
U+2814F		laam3	(Cant.) to step over, step across
U+2815D		jaang3	(Cant.) to press down or out with the foot; to kick; to tread on
U+281AA		jaang3	(Cant.) to press down or push out with the foot
U+281AF		buk6	(Cant.) to lie prone, bend over
U+28207		laam3	(Cant.) to step over, step across
U+28256		nei1 ni1	(Cant.) to hide oneself
U+2827C		wu3	(Cant.) to stoop, bow
U+2829B		laak3	(Cant.) nude, naked
U+282CD		wan1 wen1	(Cant.) a van
U+282E2		lip1	(Cant.) an elevator (from the British 'lift')
U+28B4C		baang1 paang1	(Cant.) bang; pan (Eng. loanwords)
U+294E5		ngok6	(Cant.) to raise the head
U+295F4		bung6	(Cant.) classifier for odors
U+29720		mam1 ngam1	(Cant.) soft rice for a small child
U+2994B		au6 ngau6	to gallop wildly; (Cant.) stupid
U+29A4D		peng1	(Cant.) ribs, rib-cage
U+29B0E		jam1 jam4	(Cant.) bangs (hair)
U+2A400		naa1	(Cant.) relationship; together
U+2A4AC		nung1	(Cant.) burned
U+2A601		kap6	(Cant.) to bite
U+2A632		ji1	(Cant.) to grin, smile
U+2A65B		nak1	(Cant.) decayed teeth; tongue-tied
U+2A6A9		gwi1	(Cant.) sound of shouting
U+2F907		baan6	(Cant.) mud, mire
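
Each entry in the list above has three parts: a code point, one or more romanized readings, and an English gloss. The following is a minimal sketch of reading such a line back into structured data. The `Entry` type, the function names, and the reading pattern (lowercase letters followed by a tone digit 1-6) are assumptions inferred from the visible entries; they are not part of the Unihan database tooling.

```python
import re
from typing import List, NamedTuple

# Assumed shape of a reading, based on the entries shown (e.g. "loe1", "ngai1").
READING = re.compile(r"[a-z]+[1-6]")


class Entry(NamedTuple):
    code_point: int       # e.g. 0x20F31
    readings: List[str]   # e.g. ["loe1", "loe2"]
    gloss: str            # English definition text


def parse_entry(line: str) -> Entry:
    """Parse one 'U+XXXXX  readings  gloss' line, split on any whitespace."""
    tokens = line.split()
    code_point = int(tokens[0][2:], 16)  # drop the leading "U+"
    readings = []
    i = 1
    # Leading tokens that look like readings are collected; the rest is gloss.
    while i < len(tokens) and READING.fullmatch(tokens[i]):
        readings.append(tokens[i])
        i += 1
    return Entry(code_point, readings, " ".join(tokens[i:]))


def in_plane_2(code_point: int) -> bool:
    """True for the Supplementary Ideographic Plane, U+20000..U+2FFFF."""
    return code_point >> 16 == 2


entry = parse_entry("U+20F31\t\tloe1 loe2\t(Cant.) to spit out; to pester, nag")
print(hex(entry.code_point), entry.readings, in_plane_2(entry.code_point))
```

Splitting on any whitespace rather than on tabs keeps the sketch working whether the archive preserved the original tabs or converted them to spaces.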


From unicode at unicode.org  Mon Nov 13 14:48:39 2017
From: unicode at unicode.org (James Kass via Unicode)
Date: Mon, 13 Nov 2017 12:48:39 -0800
Subject: Plane-2-only string
In-Reply-To: <CY4PR21MB0822CEB22C570308F92E0F08D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
 <CAGa7JC2GUrbAqqy5bUwLWoY-BpOE8OjvY_6SqYe2msFpb-R1fQ@mail.gmail.com>
 <CY4PR21MB0822CEB22C570308F92E0F08D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
Message-ID: <CABPY6Z2govAwaZgBm-aD-6QdhP+qwXWdyggsiYAjdHydQT90Ww@mail.gmail.com>

Peter Constable wrote,

>> Maybe this test page?
>>
>> http://www.i18nguy.com/unicode/supplementary-test.html
>
> Thanks. I'd need to know _at least something_ about what the characters
> signify, though, to have a sense of whether there's anything potentially
> offensive.

The Plane 2 characters on that page appear to be random.


From unicode at unicode.org  Mon Nov 13 14:57:50 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 13 Nov 2017 21:57:50 +0100
Subject: Plane-2-only string
In-Reply-To: <CABPY6Z2govAwaZgBm-aD-6QdhP+qwXWdyggsiYAjdHydQT90Ww@mail.gmail.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
 <CAGa7JC2GUrbAqqy5bUwLWoY-BpOE8OjvY_6SqYe2msFpb-R1fQ@mail.gmail.com>
 <CY4PR21MB0822CEB22C570308F92E0F08D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z2govAwaZgBm-aD-6QdhP+qwXWdyggsiYAjdHydQT90Ww@mail.gmail.com>
Message-ID: <CAGa7JC1UgvPn78phfjoAO+Swq3QbYY4CzuZJxUE50BuNqvLNqg@mail.gmail.com>

2017-11-13 21:48 GMT+01:00 James Kass <jameskasskrv at gmail.com>:

> Peter Constable wrote,
>
> >> Maybe this test page?
> >>
> >> http://www.i18nguy.com/unicode/supplementary-test.html
> >
> > Thanks. I'd need to know _at least something_ about what the characters
> > signify, though, to have a sense of whether there's anything potentially
> > offensive.
>
> The Plane 2 characters on that page appear to be random.
>

That's probable, but the authors claim these are common characters. It's
possible they collected statistics from some corpus to find some of the
most widely used characters in Plane 2, without needing to understand what
they would mean when put side by side. (I had already noted that there was
no punctuation at all, that the collection shown is too long for a typical
Chinese text, and that I would in fact expect some CJK punctuation to be
present.)
Maybe we could compile a list of Chinese toponyms using these, select
those that use more than one Plane 2 character, then separate these names
with CJK commas and a final CJK full stop.

A Wikidata or OSM data search could be used to compile such a list. I
think these toponyms are more likely to be found in Cantonese or
Taiwanese-related sources using the zh-Hant variant. Note, however, that
Wikidata does not distinguish zh-Hans and zh-Hant, since Wikimedia wikis
use a transliterator; I doubt this transliterator transforms Plane 2
characters, most of which should remain unchanged for both traditional and
simplified use.
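
The selection step proposed above (keep names with more than one Plane 2 character, join them with CJK commas, end with a CJK full stop) could be sketched roughly as follows. The function names are hypothetical, and the input names are placeholder code-point sequences, not real toponyms; a real list would have to come from a source such as Wikidata or OSM.

```python
def plane2_count(text: str) -> int:
    """Number of characters in text that fall in Plane 2 (U+20000..U+2FFFF)."""
    return sum(1 for ch in text if 0x20000 <= ord(ch) <= 0x2FFFF)


def build_sample(names, min_plane2=2):
    """Join qualifying names with U+3001 and terminate with U+3002."""
    chosen = [n for n in names if plane2_count(n) >= min_plane2]
    if not chosen:
        return ""
    return "\u3001".join(chosen) + "\u3002"


# Placeholder "names" built from arbitrary code points, for illustration only.
names = [
    "\U00020BB7\U00020BB8",        # two Plane 2 characters: kept
    "\u4e2d\U00020BB7",            # only one Plane 2 character: dropped
    "\U00021234\u5c71\U00021235",  # two Plane 2 characters: kept
]
sample = build_sample(names)
print([f"U+{ord(ch):04X}" for ch in sample])
```

Note that the separators U+3001 and U+3002 are themselves BMP characters, so a string built this way is not Plane-2-only.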

From unicode at unicode.org  Mon Nov 13 15:19:04 2017
From: unicode at unicode.org (Peter Constable via Unicode)
Date: Mon, 13 Nov 2017 21:19:04 +0000
Subject: Plane-2-only string
In-Reply-To: <55072E26-7741-4F95-9B98-BB5809C80D9A@apple.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <55072E26-7741-4F95-9B98-BB5809C80D9A@apple.com>
Message-ID: <CY4PR21MB082269769406A35490789FDBD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>

Would a typical Chinese speaker be likely to recognize these as used in Cantonese? (I wouldn't want to have a font's sample-text string give the impression that it's a Cantonese font, unless it were specifically intended for Cantonese.)

-----Original Message-----
From: jenkins at apple.com [mailto:jenkins at apple.com] 
Sent: Monday, November 13, 2017 12:46 PM
To: Peter Constable <petercon at microsoft.com>
Cc: Unicode list <unicode at unicode.org>
Subject: Re: Plane-2-only string

?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? 

That is an example of forty Cantonese-specific characters which are not obscene (that I'm aware of) from Extension B. For the curious, I've appended at the bottom the full list of 280 for all of Plane 2 which I was able to pull out of the Unihan database. I'm sure some enterprising poet can make something out of them.

> On Nov 13, 2017, at 11:20 AM, Peter Constable via Unicode <unicode at unicode.org> wrote:
> 
> I'm wondering if anyone could come up with a string of 15 to 40 characters _using only plane 2 characters_ that wouldn't be gibberish?
> 
> We are considering adding sample-text strings in some of our fonts. (In OpenType, the 'name' table can take sample-text strings using name ID 19.) One particular issue we have is the Simsun-ExtB and MingLiU-ExtB fonts, which have CJK characters from plane 2 only.
> 
> Background:
> The Simsun-ExtB and MingLiU-ExtB fonts are meant to complement the Simsun and MingLiU fonts: the combined glyph count exceeds the number of glyphs that can be added in a single OpenType font, and so the 'ExtB' fonts are used to contain all of the Plane 2 characters that are supported. For example, the Simsun font supports 28,738 BMP characters, and no plane 2 characters, while Simsun-ExtB supports the Basic Latin block from the BMP plus 47,293 plane 2 characters. The combined glyph count exceeds 64K, so can't go into a single font.
> 
> 
> 
> Peter
> <winmail.dat>
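
The arithmetic behind the quoted background can be checked directly. This is a small sketch that takes the two character counts from the message and assumes at least one glyph per supported character, so the sum is a lower bound on the real glyph count; the variable names are illustrative.

```python
# OpenType stores the glyph count in a 16-bit field, so one font holds
# at most 65,535 glyphs.
MAX_GLYPHS = 0xFFFF

simsun_bmp_chars = 28_738      # BMP characters in Simsun (from the message)
extb_plane2_chars = 47_293     # Plane 2 characters in Simsun-ExtB (from the message)

# Assumption: at least one glyph per character, so this is a lower bound.
combined = simsun_bmp_chars + extb_plane2_chars
print(combined, combined > MAX_GLYPHS)  # 76031 True
```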

U+201A9		faan2	(Cant.) to play
U+20325		wu1 wu3	(Cant.) to bow, stoop
U+20341		man3	(Cant.) an undesirable situation
U+204FC		sip3	(Cant.) a wedge; to thrust in
U+20544		nap1	(Cant.) ???, a dimple
U+2076D		peng2	(Cant.) to fell, cut; to sweep away
U+20779		gaai3	(Cant.) to cut with a knife or scissors
U+20BA8		naai3	(Cant.) to tie, tow; bring along
U+20BA9		aa1 liu1	(Cant.) an interjection; rare, specialized
U+20BCB		jai4 jai5	(Cant.) naughty, inferior
U+20BE6		cai3	(Cant.) to eat, take a meal
U+20BFD		zi1	(Cant.) a final particle indicating affirmation
U+20C0B		jaau1	(Cant.) left-handed
U+20C32		eot1	(Cant.) to belch
U+20C41		tam3	(Cant.) to fool, trick, cheat
U+20C42		dat1	(Cant.) to put something or sit wherever one wishes; to rebuke, reproach
U+20C43		nip1	(Cant.) thin, flat; poor
U+20C53		ngai1	(Cant.) to importune, beg
U+20C58		ngaak6	(Cant.) contrary, opposing, against; disobedient
U+20C65		fik1 jit6 we5	(Cant.) wrangling, a noise; fitful; a soft fabric with no body
U+20C77		ming1	(Cant.) small
U+20C78		san2 seon2	(Cant.) phonetic
U+20C9C		zaang1	(Cant.) to owe
U+20CCF		ce2 ce6	(Cant.) interjection
U+20CD5		caau3	(Cant.) to search
U+20CD6		dap6	(Cant.) to strike, pound
U+20D15		miu2	(Cant.) to purse the lips; to wriggle
U+20D30		gau6	(Cant.) classifier for a piece or lump of something
U+20D47		keu4	(Cant.) peculiar, strange
U+20D48		mui2	(Cant.) to suck or chew without using the teeth
U+20D49		hong4	(Cant.) hope
U+20D69		go2	(Cant.) that
U+20D6F		gwit1 gwit3	(Cant.) onomatopoetic
U+20D7C		mang1 mang4	(Cant.) scars on the eyelid; phonetic
U+20D7E		waak1	(Cant.) eloquent, sharp-tongued
U+20D7F		pe1 pe5	(Cant.) a pair (from the Engl.); to stagger
U+20D9C		zai3	(Cant.) to do, work; to be willing
U+20DA7		dim6	(Cant.) straight, vertical; OK; to pick up with the fingers; verbal aspect marker of successful completion
U+20DB2		gap6 kap6	(Cant.) to stare at; to take a big bite
U+20E09		kak1	(Cant.) to block, obstruct
U+20E0A		tap1	(Cant.) an intensifying particle
U+20E0E		naa1	(Cant.) and, with
U+20E0F		ge2	(Cant.) final particle
U+20E10		kam1	(Cant.) to endure, last
U+20E11		soek3	(Cant.) soft, sodden
U+20E12		bou2	(Cant.) ????, a stranger
U+20E3A		ngaak6	(Cant.) contrary, opposing
U+20E6D		ko1	(Cant.) to call (Engl. loan-word)
U+20E73		git6	(Cant.) thick, viscous, dense
U+20E77		ngo4	(Cant.) to speak tirelessly
U+20E78		kam2	(Cant.) to cover, close up
U+20E7A		maai4	(Cant.) verbal aspect marker for completion or movement towards
U+20E7B		zam6	(Cant.) classifier for smells
U+20E8C		gwe1	(Cant.) timid
U+20E98		long1 long2	(Cant.) hard to get along with; to rinse, spread thin
U+20E9D		gaak3	(Cant.) final particle
U+20EA2		gaa1 gaa2	(Cant.) final particle
U+20EAA		he3 hi1	(Cant.) in a rush; slovenly
U+20EAB		leu1	(Cant.) strange, peculiar
U+20EAC		he2	(Cant.) final particle
U+20ED7		le4	(Cant.) imperative final particle
U+20ED8		zeot6	(Cant.) sound of eating (onomatopoetic)
U+20EF4		long2	(Cant.) to rinse
U+20EFA		aa6	(Cant.) final particle
U+20EFB		bai3	(Cant.) noise, clamor
U+20F15		paai2	(Cant.) a suffix indicating time
U+20F2D		but1	(Cant.) sound of a car-horn (onomatopoetic)
U+20F2E		ngai1 ngi1	(Cant.) to urge, importune; a lie, fib
U+20F31		loe1 loe2	(Cant.) to spit out; to pester, nag
U+20F4C		syut3	(Cant.) sound of something rushing by
U+20F52		neng2	(Cant.) classifier for hats
U+20F64		kik1	(Cant.) to block, obstruct; head; phonetic
U+20F8D		he3	(Cant.) to flick something off in a disorderly way
U+20F8F		ce1	(Cant.) interjection
U+20FAD		we5	(Cant.) soft fabric with no body
U+20FB4		baang4 baang6	(Cant.) phonetic
U+20FB5		zaa1	(Cant.) final particle
U+20FBC		cyut1 cyut6	(Cant.) phonetic
U+20FEA		gaa2	(Cant.) final particle
U+20FEB		saau4	(Cant.) shabby
U+20FEC		soe4	(Cant.) ignorant
U+20FED		wet1	(Cant.) to go somewhere to have a good time
U+2101D		nam6	(Cant.) sound asleep
U+2101E		zip1	(Cant.) a Jeep; to wave, beckon
U+21020		bei6	or; emphatic particle; (Cant.) particle implying doubt
U+21029		lok1	(Cant.) onomatopoetic
U+2104F		am1 ngam1	(Cant.) soft rice or food for a baby
U+2105C		wo5	(Cant.) particle to close a quote
U+2106F		dyut1	(Cant.) to pout
U+21075		gan2	(Cant.) aspect marker for continuous action
U+21076		zit1	(Cant.) to scratch an itch
U+21077		doeng1	(Cant.) a sharp point; to peck
U+21078		kwaat1	(Cant.) a circle, ring
U+2107B		ziu1	(Cant.) to beat someone up
U+21088		buk6	(Cant.) to lie prone; to bend over
U+21096		lai2	(Cant.) unrestrained
U+2109D		zuk6	(Cant.) to choke and cough
U+210C0		e4 nge4	(Cant.) a musical instrument
U+210C1		leng1	(Cant.) member of a triad; young
U+210C7		bai6	(Cant.) exclamation
U+210C8		kwaak1 kwaak3	(Cant.) a lasso; a circle, frame
U+210C9		gaa3	(Cant.) final particle
U+210CF		doe6	(Cant.) to droop, hang down
U+210D3		bo3	(Cant.) final particle for emphasis
U+210E4		laai6	(Cant.) to leave behind, omit
U+210F4		ceoi4	(Cant.) smell, odor
U+210F5		ngung1 ngung2	(Cant.) to cover, bury; push from behind
U+210F6		sek3	(Cant.) to like, love; to kiss
U+2111F		haa1	(Cant.) onomatopoetic, the sound of panting
U+2112F		jik1	(Cant.) hiccough
U+21135		ji1	(Cant.) to grin, laugh
U+2113D		soe4	(Cant.) to slide down
U+21148		laa3	(Cant.) a particle implying completion, certainty, or urgency
U+2114F		lai2	(Cant.) to accuse, slander; to turn, sprain
U+21180		gwang2	(Cant.) special relationship
U+21187		wok1	(Cant.) a watt (Engl. loan-word)
U+211D9		doe4	(Cant.) round and full
U+21681		bai6	used-up, malpractices; (Cant.) bad, vile, corrupt
U+21731		gei2	to envy, to be angry with; (Cant.) pregnant
U+2197C		me1	(Cant.) to carry on the back
U+21C2A		duk1	(Cant.) end, bottom, rump
U+21CAC		gwat6	(Cant.) blunt
U+220C7		lei5	(Cant.) a sail
U+22208		nap1	(Cant.) dimple
U+22605		maau4	(Cant.) flurried, flustered; arbitrarily
U+22696		ti4	(Cant.) intensifier
U+226F4		mang2	(Cant.) annoyed, impatient, restless
U+226F5		zang2	(Cant.) annoyed, irritated
U+22775		fit1	(Cant.) ???, to be fashionable
U+227B5		fit1	(Cant.) to brush, whisk
U+22803		geng6	(Cant.) to guard against; to take precautions
U+22939		goe4	(Cant.) satisfied, comfortable
U+22982		laan2	(Cant.) to brag, praise oneself
U+22A66		zit1	(Cant.) to squeeze out (as from a tube); to tickle
U+22ACF		kam2	(Cant.) to cover
U+22AD5		wing1 wing6	(Cant.) to throw away
U+22AE8		ngung2	(Cant.) to push from behind
U+22AEB		lat1	(Cant.) to rub
U+22B3F		kaai2 kaai5	(Cant.) sections or wedges (as of fruit); to take in the hand; to use
U+22B43		dau3 dau6	(Cant.) to touch; to bump into; to take, get, receive; to lightly support something with the hand
U+22B91		luk1	(Cant.) classifier for lengths of cylindrically shaped objects
U+22BCA		dik1	(Cant.) determination, resolution
U+22BCE		ngaau1	(Cant.) to scratch
U+22C38		wo5	(Cant.) rotten, bad, spoiled
U+22C51		waa2	(Cant.) to scratch
U+22C55		dap6	(Cant.) to beat, pound; to get drenched
U+22C62		saak3	to select; (Cant.) a wedge of a fruit such as an orange
U+22CA1		laa2 naa1	(Cant.) to grab with the hands; and, with
U+22CA9		kap1	(Cant.) to affix a chop or seal to a document
U+22CB5		cou5	(Cant.) to save up (money), to save up bit-by-bit
U+22CB7		ngaau1	to search; (Cant.) to scratch
U+22CB8		lou1	(Cant.) to shake violently, stir; to strip
U+22CC2		bat1 pat1	(Cant.) to scoop up, ladle out
U+22CC6		ngou4 ngou6	(Cant.) to shake, rattle
U+22D08		daat3	(Cant.) to throw down, fall down
U+22D12		paang1	(Cant.) to chase, drive away
U+22D44		cou5	(Cant.) to save up (money)
U+22D4C		deoi2	(Cant.) to goad, incite
U+22D53		paang1	(Cant.) to rush; chase someone out, drive out
U+22D67		gaan3	(Cant.) to draw lines
U+22D8D		saap3	(Cant.) garbage
U+22D9C		ngung2	(Cant.) to push; pull open
U+22DA0		saau4	(Cant.) to take without asking
U+22DA4		loe2	(Cant.) to pester, nag; to wallow; to roll around on the floor
U+22DAF		maan1	(Cant.) to pull, turn
U+22DEE		deoi2	(Cant.) to poke, nudge; stretch out
U+22E51		zaang6	(Cant.) to widen with force
U+22E8B		naan3	(Cant.) to stitch together, quilt
U+22F74		duk1	(Cant.) to poke, jab
U+233F4		jan4	(Cant.) a kind of fruit
U+233FE		dak6	(J) non-standard variant of 材 U+6750, material, stuff; timber; talent; (Cant.) a peg, row of pegs
U+23528		kang3	(Cant.) to be entangled, twisted; (of alcohol and tobacco) to be strong
U+23595		peng1	(Cant.) the back of a chair for one to lean against
U+2361A		seot1	(Cant.) a bar; to bolt, lock
U+23695		jaap3	(Cant.) to wave, beckon with the hand
U+236BA		hong2 hong6	(Cant.) a young chicken
U+239C2		laai5	(Cant.) untidy
U+23CB7		nap6	(Cant.) sticky; not smooth; slow
U+23CFC		doe4	(Cant.) salivating
U+241A3		saap6	(Cant.) to cook in boiling water
U+24292		luk6	(Cant.) to scald with boiling water
U+2430D		hok3	(Cant.) to fry
U+245C8		sip3	(Cant.) to squeeze in, to stuff in
U+24674		caau1	(Cant.) gore
U+2472F		kap6	(Cant.) to bite
U+24DB8		naa1	(Cant.) a scar
U+24DC7		wak6	(Cant.) severe pain
U+24DEA		mang2	(Cant.) impatient, restless
U+24DEB		cek3 cik1	(Cant.) a prickling pain, ache
U+24E3B		naa1	(Cant.) a scar, scab; and, with
U+24E50		lit3	(Cant.) a knot
U+24EA7		zang2	(Cant.) annoyed
U+24FC2		saai4	(Cant.) unattractive, pale
U+24FEA		zaap3	(Cant.) wrinkled, crumpled
U+2502C		jim2	(Cant.) a scar
U+25052		ngaau4	(Cant.) warped
U+2510E		cik1	(Cant.) to pull, lift up
U+2512B		gap6	(Cant.) to stare, peep at
U+25148		laap3	(Cant.) to look, scan
U+25160		hau1	(Cant.) to fix one's eyes on, gaze at
U+2517E		zong1	(Cant.) to peek or peep at
U+251E3		gwat6	(Cant.) to glance
U+25232		kip1	(Cant.) to keep a close eye on, to control
U+25236		nam6	(Cant.) sound asleep
U+2528C		caau4	(Cant.) wrinkled, folded, creased, crumpled
U+25299		zong1	(Cant.) to peep at, look at secretly
U+252C7		caang3	(Cant.) to open the eyes wide
U+252D8		saau4	(Cant.) to sweep the eyes over something
U+2531B		lai6	(Cant.) to gaze greedily at
U+25531		sin3	(Cant.) to slip
U+2553F		ham2	(Cant.) classifier for cannons, large guns, etc.
U+25945		lung1	(Cant.) a hole, hollow; cavity
U+259F9		tam5	(Cant.) puddle
U+25E49		nap6	(Cant.) sticky
U+26097		sok3	(Cant.) to tighten
U+260A5		dam3	(Cant.) to drop down
U+26258		caang1	(Cant.) a cooking pot, cooker
U+265BF		dap1	(Cant.) to hang down; to lower one's head
U+26629		paa4	(Cant.) chin
U+26696		zaap3	(Cant.) to wink
U+2688A		pok1	(Cant.) blister
U+26893		mak6	(Cant.) mole on skin
U+26926		hot3	(Cant.) a smell, scent
U+269F2		loe1 loe2	(Cant.) to dribble, spit; to pester, nag
U+269FA		laai2 laai5	(Cant.) to lick, lap up
U+26A88		ngou3	(Cant.) to kneel
U+26ED0		zaau3	(Cant.) to fry in oil
U+27285		gwaai2	(Cant.) frog, toad
U+272B6		doe3	(Cant.) insect sting
U+272CA		saa1	(Cant.) a large butterfly
U+272E6		mei1	(Cant.) a dragonfly; a small boat without a sail
U+27307		bang1	(Cant.) a large butterfly
U+27574		naan3	(Cant.) a pimple, an insect bite
U+27639		taai1	(Cant.) a necktie
U+27685		long6	(Cant.) crotch
U+27694		tung2	(Cant.) a kind of skirt
U+2775E		gei1	(Cant.) ???, khaki
U+2789D		lai6	(Cant.) to stare angrily
U+278C8		caau1	(Cant.) to gore
U+2797A		kwan1	(Cant.) to fool, deceive, hoodwink
U+279A0		ngaak1	(Cant.) to deceive
U+279DD		ngaa6	(Cant.) ????, to bar the way, obstruct
U+27A0A		zaa6	(Cant.) ????, to bar the way, obstruct
U+27A3E		tam3	(Cant.) to fool, trick, cheat
U+27D2F		me1	(Cant.) to carry on the back
U+27D84		zaang1	(Cant.) to owe
U+27ED9		mut6	(Cant.) ?????, not straightforward
U+27FD2		dam6	(Cant.) to stamp (one's foot)
U+27FEB		tau2	(Cant.) to have a rest
U+28023		kei2	(Cant.) a home, house
U+28024		leoi1	(Cant.) to suddenly fall or drop down
U+28048		gaang3	(Cant.) to ford, wade
U+28090		leoi1	(Cant.) to suddenly fall or drop down
U+280BD		dam6	(Cant.) to stamp the foot
U+280BE		naam3	(Cant.) to step across
U+280E9		sin3	(Cant.) to slip, slide
U+2814F		laam3	(Cant.) to step over, step across
U+2815D		jaang3	(Cant.) to press down or out with the foot; to kick; to tread on
U+281AA		jaang3	(Cant.) to press down or push out with the foot
U+281AF		buk6	(Cant.) to lie prone, bend over
U+28207		laam3	(Cant.) to step over, step across
U+28256		nei1 ni1	(Cant.) to hide oneself
U+2827C		wu3	(Cant.) to stoop, bow
U+2829B		laak3	(Cant.) nude, naked
U+282CD		wan1 wen1	(Cant.) a van
U+282E2		lip1	(Cant.) an elevator (from the British 'lift')
U+28B4C		baang1 paang1	(Cant.) bang; pan (Eng. loanwords)
U+294E5		ngok6	(Cant.) to raise the head
U+295F4		bung6	(Cant.) classifier for odors
U+29720		mam1 ngam1	(Cant.) soft rice for a small child
U+2994B		au6 ngau6	to gallop wildly; (Cant.) stupid
U+29A4D		peng1	(Cant.) ribs, rib-cage
U+29B0E		jam1 jam4	(Cant.) bangs (hair)
U+2A400		naa1	(Cant.) relationship; together
U+2A4AC		nung1	(Cant.) burned
U+2A601		kap6	(Cant.) to bite
U+2A632		ji1	(Cant.) to grin, smile
U+2A65B		nak1	(Cant.) decayed teeth; tongue-tied
U+2A6A9		gwi1	(Cant.) sound of shouting
U+2F907		baan6	(Cant.) mud, mire


From unicode at unicode.org  Mon Nov 13 16:28:51 2017
From: unicode at unicode.org (James Kass via Unicode)
Date: Mon, 13 Nov 2017 14:28:51 -0800
Subject: Plane-2-only string
In-Reply-To: <CY4PR21MB0822FAFE8002F2F78D3D196AD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
 <CY4PR21MB0822FAFE8002F2F78D3D196AD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
Message-ID: <CABPY6Z03SmQ+K+HykgTQyGYhXFBSC_2r5EA6-XU+VdYWFpqoig@mail.gmail.com>

Peter Constable wrote,

> We don't want to add BMP characters to the ExtB fonts.

So the sample text would lack punctuation.  Given that the
Supplementary Ideographic Plane is composed of rare and historical
characters from multiple sources, I suspect that the short answer to
Peter's original question is:  "No".

From unicode at unicode.org  Mon Nov 13 16:38:40 2017
From: unicode at unicode.org (Peter Constable via Unicode)
Date: Mon, 13 Nov 2017 22:38:40 +0000
Subject: Plane-2-only string
In-Reply-To: <CABPY6Z03SmQ+K+HykgTQyGYhXFBSC_2r5EA6-XU+VdYWFpqoig@mail.gmail.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
 <CY4PR21MB0822FAFE8002F2F78D3D196AD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z03SmQ+K+HykgTQyGYhXFBSC_2r5EA6-XU+VdYWFpqoig@mail.gmail.com>
Message-ID: <CY4PR21MB08220FC77F2693C614903E84D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>

I discussed this with one of my Chinese co-workers, and we came up with the following:

???????????
??????????
??????????
???????????

Factors in the choice of characters were:
- different radicals
- for a given radical, have a sequence of consecutive characters so people get the idea it's not a sentence but just a sequence of characters with related meanings
- radical groups increase in complexity


It's not a sentence that can be read, but there's an obvious pattern, so it's also not completely gibberish.


Peter
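
The pattern described above, groups of consecutive characters, can be made visible by splitting a string into runs of adjacent code points. The sketch below rests on the assumption that consecutive code points in Extension B usually share a radical, since the block is ordered by radical and stroke count. The sample characters did not survive in this archive, so the input here is built from placeholder code points, not the string Peter posted.

```python
def consecutive_runs(text: str):
    """Split text into lists of code points where each run increases by 1."""
    runs = []
    for ch in text:
        cp = ord(ch)
        if runs and cp == runs[-1][-1] + 1:
            runs[-1].append(cp)   # extends the current run
        else:
            runs.append([cp])     # starts a new run
    return runs


# Placeholder input: three consecutive code points, then two more elsewhere.
placeholder = "".join(chr(cp) for cp in (0x20000, 0x20001, 0x20002, 0x20100, 0x20101))
for run in consecutive_runs(placeholder):
    print(f"U+{run[0]:X}..U+{run[-1]:X} ({len(run)} characters)")
```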

-----Original Message-----
From: James Kass [mailto:jameskasskrv at gmail.com] 
Sent: Monday, November 13, 2017 2:29 PM
To: Peter Constable <petercon at microsoft.com>
Cc: Unicode list <unicode at unicode.org>
Subject: Re: Plane-2-only string

Peter Constable wrote,

> We don't want to add BMP characters to the ExtB fonts.

So the sample text would lack punctuation.  Given that the Supplementary Ideographic Plane is composed of rare and historical characters from multiple sources, I suspect that the short answer to Peter's original question is:  "No".


From unicode at unicode.org  Mon Nov 13 16:54:03 2017
From: unicode at unicode.org (James Kass via Unicode)
Date: Mon, 13 Nov 2017 14:54:03 -0800
Subject: Plane-2-only string
In-Reply-To: <CY4PR21MB08220FC77F2693C614903E84D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
 <CY4PR21MB0822FAFE8002F2F78D3D196AD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z03SmQ+K+HykgTQyGYhXFBSC_2r5EA6-XU+VdYWFpqoig@mail.gmail.com>
 <CY4PR21MB08220FC77F2693C614903E84D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
Message-ID: <CABPY6Z30kJSPOAP6CnCt85U6m6SCJH-VPm2GHQM1N=CqAi6uDA@mail.gmail.com>

Peter Constable wrote,

> ???????????
> ??????????
> ??????????
> ???????????
>

??????????? ?????????? ?????????? ???????????

It looks good in blocks on four separate lines, but would a typical
font viewing or comparison tool be expected to break it down into four
lines?  The pattern is still apparent if displayed on just one line,
but separating the blocks with spaces or any punctuation would require
BMP characters in the ExtB font.

??????????????????????????????????????????


From unicode at unicode.org  Mon Nov 13 17:26:25 2017
From: unicode at unicode.org (Peter Constable via Unicode)
Date: Mon, 13 Nov 2017 23:26:25 +0000
Subject: Plane-2-only string
In-Reply-To: <CABPY6Z30kJSPOAP6CnCt85U6m6SCJH-VPm2GHQM1N=CqAi6uDA@mail.gmail.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
 <CY4PR21MB0822FAFE8002F2F78D3D196AD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z03SmQ+K+HykgTQyGYhXFBSC_2r5EA6-XU+VdYWFpqoig@mail.gmail.com>
 <CY4PR21MB08220FC77F2693C614903E84D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z30kJSPOAP6CnCt85U6m6SCJH-VPm2GHQM1N=CqAi6uDA@mail.gmail.com>
Message-ID: <CY4PR21MB0822ED34DF707E3A0E6A01E0D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>

As mentioned in my initial mail, the fonts support the Basic Latin block from the BMP.

Peter
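
Given that the fonts cover Basic Latin plus Plane 2, a candidate sample string can be checked against that repertoire before use. This is a rough sketch with hypothetical function names; it tests code-point ranges only and does not consult an actual font's cmap. The 15-to-40 length limit comes from the original request.

```python
def is_plane2(ch: str) -> bool:
    return 0x20000 <= ord(ch) <= 0x2FFFF


def is_basic_latin(ch: str) -> bool:
    # The Basic Latin block is U+0000..U+007F and includes the ASCII space.
    return ord(ch) <= 0x7F


def check_sample(text: str, min_len: int = 15, max_len: int = 40) -> bool:
    """True if every character is Basic Latin or Plane 2 and the number of
    Plane 2 characters is within the requested length range."""
    covered = all(is_plane2(ch) or is_basic_latin(ch) for ch in text)
    ideographs = sum(1 for ch in text if is_plane2(ch))
    return covered and min_len <= ideographs <= max_len


# Placeholder string: four groups of five Plane 2 code points, space-separated.
groups = ["".join(chr(0x20000 + 0x100 * g + i) for i in range(5)) for g in range(4)]
print(check_sample(" ".join(groups)))        # ASCII space separators
print(check_sample("\u3000".join(groups)))   # ideographic space is a BMP character
```

Under this check the ASCII space is acceptable as a separator, while U+3000 IDEOGRAPHIC SPACE is not, because it lies outside both ranges.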

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of James Kass via Unicode
Sent: Monday, November 13, 2017 2:54 PM
To: Unicode list <unicode at unicode.org>
Subject: Re: Plane-2-only string

Peter Constable wrote,

> ???????????
> ??????????
> ??????????
> ???????????
>

??????????? ?????????? ?????????? ???????????

It looks good in blocks on four separate lines, but would a typical font viewing or comparison tool be expected to break it down into four lines?  The pattern is still apparent if displayed on just one line, but separating the blocks with spaces or any punctuation would require BMP characters in the ExtB font.

??????????????????????????????????????????


From unicode at unicode.org  Mon Nov 13 17:52:40 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Tue, 14 Nov 2017 00:52:40 +0100
Subject: Plane-2-only string
In-Reply-To: <CABPY6Z30kJSPOAP6CnCt85U6m6SCJH-VPm2GHQM1N=CqAi6uDA@mail.gmail.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
 <CY4PR21MB0822FAFE8002F2F78D3D196AD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z03SmQ+K+HykgTQyGYhXFBSC_2r5EA6-XU+VdYWFpqoig@mail.gmail.com>
 <CY4PR21MB08220FC77F2693C614903E84D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z30kJSPOAP6CnCt85U6m6SCJH-VPm2GHQM1N=CqAi6uDA@mail.gmail.com>
Message-ID: <CAGa7JC3KD2an1u+ODp+18tmJj-TwShY-8yisa+Q1RyUFkFM-jw@mail.gmail.com>

Any font would likely map the space (and any CJK font would probably map
the ideographic space as well). The newline does not need any font either;
it is synthesized by renderers. This could be used to compose some
Japanese-like haiku with some meaning...

2017-11-13 23:54 GMT+01:00 James Kass via Unicode <unicode at unicode.org>:

> Peter Constable wrote,
>
> > ???????????
> > ??????????
> > ??????????
> > ???????????
> >
>
> ??????????? ?????????? ?????????? ???????????
>
> It looks good in blocks on four separate lines, but would a typical
> font viewing or comparison tool be expected to break it down into four
> lines?  The pattern is still apparent if displayed on just one line,
> but separating the blocks with spaces or any punctuation would require
> BMP characters in the ExtB font.
>
> ??????????????????????????????????????????
>
>

From unicode at unicode.org  Mon Nov 13 18:35:42 2017
From: unicode at unicode.org (James Kass via Unicode)
Date: Mon, 13 Nov 2017 16:35:42 -0800
Subject: Plane-2-only string
In-Reply-To: <CAGa7JC3KD2an1u+ODp+18tmJj-TwShY-8yisa+Q1RyUFkFM-jw@mail.gmail.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
 <CY4PR21MB0822FAFE8002F2F78D3D196AD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z03SmQ+K+HykgTQyGYhXFBSC_2r5EA6-XU+VdYWFpqoig@mail.gmail.com>
 <CY4PR21MB08220FC77F2693C614903E84D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z30kJSPOAP6CnCt85U6m6SCJH-VPm2GHQM1N=CqAi6uDA@mail.gmail.com>
 <CAGa7JC3KD2an1u+ODp+18tmJj-TwShY-8yisa+Q1RyUFkFM-jw@mail.gmail.com>
Message-ID: <CABPY6Z3ojJ-Z9z4Bb3=xDBKS1Ueeq0oPRt0p3mG781upAO1LQA@mail.gmail.com>

Philippe Verdy wrote,

> ... As well the newline don't need any font, it is synthetized by renderers.

It's true that fonts don't need to have glyphs mapped for control
characters, but I'd hesitate to use any control character in a font's
sample text field because of the field's intended use.  But the point
is moot, since Peter has reminded us that the fonts in question
already have some BMP characters mapped, including certain punctuation
characters.

An ExtB font with BMP basic Latin could display the English language
default sample text "The quick brown fox..." with no problem, but a
non-English locale might substitute a default text string which the
font could not support.  So it's probably best to have *something* in
that field representing characters the font covers.
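
As a rough illustration of the coverage question above, here is a minimal
Python sketch that reports which characters of a candidate sample string
fall outside an assumed coverage. The coverage rule (all of Plane 2, plus
printable Basic Latin U+0020..U+007E) and the names plane_of and uncovered
are assumptions made up for this example; a real check would read the
font's own character map rather than rely on a rule like this.

```python
def plane_of(ch):
    """Return the Unicode plane number of one character (0 = BMP, 2 = SIP)."""
    return ord(ch) >> 16


def uncovered(sample, allow_basic_latin=True):
    """List the code points in `sample` that fall outside the assumed coverage.

    Assumed coverage (hypothetical, for illustration only): all of Plane 2,
    plus printable Basic Latin U+0020..U+007E when allow_basic_latin is True.
    """
    missing = []
    for ch in sample:
        cp = ord(ch)
        if plane_of(ch) == 2:
            continue  # Supplementary Ideographic Plane: assumed covered
        if allow_basic_latin and 0x20 <= cp <= 0x7E:
            continue  # printable ASCII: assumed covered
        missing.append("U+%04X" % cp)
    return missing


# Placeholder code points, not a proposed sample text.
print(uncovered("\U00020000 \U00020001"))        # separated by an ASCII space
print(uncovered("\U00020000\u3001\U00020001"))   # separated by an ideographic comma
print(uncovered("\U00020000\n\U00020001"))       # separated by a newline
```

An empty list means every character in the string is inside the assumed
coverage. Control characters such as the newline are deliberately reported,
in line with the hesitation about using them in a sample text field.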

From unicode at unicode.org  Mon Nov 13 21:39:54 2017
From: unicode at unicode.org (Phake Nick via Unicode)
Date: Tue, 14 Nov 2017 11:39:54 +0800
Subject: Plane-2-only string
In-Reply-To: <CABPY6Z3ojJ-Z9z4Bb3=xDBKS1Ueeq0oPRt0p3mG781upAO1LQA@mail.gmail.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
 <CY4PR21MB0822FAFE8002F2F78D3D196AD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z03SmQ+K+HykgTQyGYhXFBSC_2r5EA6-XU+VdYWFpqoig@mail.gmail.com>
 <CY4PR21MB08220FC77F2693C614903E84D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z30kJSPOAP6CnCt85U6m6SCJH-VPm2GHQM1N=CqAi6uDA@mail.gmail.com>
 <CAGa7JC3KD2an1u+ODp+18tmJj-TwShY-8yisa+Q1RyUFkFM-jw@mail.gmail.com>
 <CABPY6Z3ojJ-Z9z4Bb3=xDBKS1Ueeq0oPRt0p3mG781upAO1LQA@mail.gmail.com>
Message-ID: <CAGHjPPKbRz5PeduK=ZKwk1AZrWqTthaVPFYDAEe4Wf591DS-Ww@mail.gmail.com>

Perhaps Martian language (http://en.wikipedia.org/wiki/Martian_language)
should be considered as a way to construct an example Chinese sentence from
characters that are found only in Plane 2? It could probably be understood
by more people than something in Cantonese, too.

From unicode at unicode.org  Mon Nov 13 21:23:40 2017
From: unicode at unicode.org (via Unicode)
Date: Tue, 14 Nov 2017 11:23:40 +0800
Subject: Plane-2-only string
In-Reply-To: <CABPY6Z3ojJ-Z9z4Bb3=xDBKS1Ueeq0oPRt0p3mG781upAO1LQA@mail.gmail.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
 <CY4PR21MB0822FAFE8002F2F78D3D196AD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z03SmQ+K+HykgTQyGYhXFBSC_2r5EA6-XU+VdYWFpqoig@mail.gmail.com>
 <CY4PR21MB08220FC77F2693C614903E84D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z30kJSPOAP6CnCt85U6m6SCJH-VPm2GHQM1N=CqAi6uDA@mail.gmail.com>
 <CAGa7JC3KD2an1u+ODp+18tmJj-TwShY-8yisa+Q1RyUFkFM-jw@mail.gmail.com>
 <CABPY6Z3ojJ-Z9z4Bb3=xDBKS1Ueeq0oPRt0p3mG781upAO1LQA@mail.gmail.com>
Message-ID: <7281e80980d8d0a9b8c07371798530ca@koremail.com>


With over a thousand Zhuang characters, Zhuang would work, though of 
course it would not have punctuation.


Off the top of my head, something like:-

????????????
????????????
????????????

In romanised Zhuang:-

Gou bae ranz gyoengqde
gou youq ranz ndaw gwn haeux
aen ranz baihlaeng miz naz

In English:-

I went to their house
I ate a meal in the house
behind the house were paddy fields


A native speaker would of course do much better.


Regards
John Knightley

From unicode at unicode.org  Mon Nov 13 23:45:13 2017
From: unicode at unicode.org (via Unicode)
Date: Tue, 14 Nov 2017 13:45:13 +0800
Subject: Plane-2-only string
In-Reply-To: <CY4PR21MB08220FC77F2693C614903E84D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
 <CY4PR21MB0822FAFE8002F2F78D3D196AD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z03SmQ+K+HykgTQyGYhXFBSC_2r5EA6-XU+VdYWFpqoig@mail.gmail.com>
 <CY4PR21MB08220FC77F2693C614903E84D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
Message-ID: <eb839ba70a6e6dc5c85ba5a0b4318a5b@koremail.com>

Dear Peter,

Since the Chinese characters below are meaningless in Chinese, using 
them should not be a first choice: they are gibberish, just not 
complete gibberish.

Plane 2 has a fair number of older Chinese characters, so someone with 
a knowledge of ancient Chinese might well be able to make something 
meaningful. Running a competition in China would be one way to get 
suggestions; spotting a good suggestion is easier than making one.

Plane 2 has Cantonese, Vietnamese and Zhuang characters. The number of 
Cantonese characters is small, so making phrases using only them would 
be difficult. Both Vietnamese and Zhuang have a much larger number of 
characters, so it is much easier to make something meaningful.

The following is a Zhuang proverb, or saying:

????????????????????

"Plant sweet potatoes in the field, and raise pigs in the sty." [lit.: 
house, as the bottom floor of a traditional house is used for livestock 
and people live on the floor above.]

However, the third and eighth characters are not the most commonly used ones.

Regards
John


On 14.11.2017 06:38, Peter Constable via Unicode wrote:
> I discussed this with one of my Chinese co-workers, and we came up
> with the following:
>
> ???????????
> ??????????
> ??????????
> ???????????
>
> Factors in the choice of characters were:
> - different radicals
> - for a given radical, have a sequence of consecutive characters so
> people get the idea it's not a sentence but just a sequence of
> characters with related meanings
> - radical groups increase in complexity
>
>
> It's not a sentence that can be read, but there's an obvious pattern,
> so it's also not completely gibberish.
>
>
> Peter
>
> -----Original Message-----
> From: James Kass [mailto:jameskasskrv at gmail.com]
> Sent: Monday, November 13, 2017 2:29 PM
> To: Peter Constable <petercon at microsoft.com>
> Cc: Unicode list <unicode at unicode.org>
> Subject: Re: Plane-2-only string
>
> Peter Constable wrote,
>
>> We don't want to add BMP characters to the ExtB fonts.
>
> So the sample text would lack punctuation.  Given that the
> Supplementary Ideographic Plane is composed of rare and historical
> characters from multiple sources, I suspect that the short answer to
> Peter's original question is:  "No".


From unicode at unicode.org  Mon Nov 13 23:45:53 2017
From: unicode at unicode.org (Tex via Unicode)
Date: Mon, 13 Nov 2017 21:45:53 -0800
Subject: FW: Plane-2-only string i18nguy supplementary-test page 
Message-ID: <000a01d35d0b$d5fda670$81f8f350$@xencraft.com>

 
I am the author of the supplementary-test page on i18nguy.com.

 
The method for choosing the characters is described on the page, so it isn't a mystery. See below.

I do not believe any of the characters are offensive, although context matters greatly and languages evolve, so it is possible that a character can gain an offensive meaning or usage at any time.

Consider the humble eggplant?

 
The page was created to offer values for testing supplementary characters with values that would justify fixing any problems they uncover.

The values are probably not the best choice for demonstrating and marketing fonts, the usage Peter is looking for.

 
Here is an excerpt from the page:

 
In 2005, the IRG (Ideographic Rapporteur Group) <http://www.cse.cuhk.edu.hk/~irg/index.htm> identified a set of ideographs, called the Ideographic International Core (IICore) <http://appsrv.cse.cuhk.edu.hk/~irg/irg/IICore/IICore.htm>. The 10,000 ideographs in the IICore are the most frequently used characters that would cover the vast majority of modern texts in all locales where ideographs are used. This collection is intended for use in devices with limited resources, such as mobile phones.

Test Characters

To have characters that are good for testing software support for the Supplementary Plane, I extracted the 62 characters from the IICore that are in the Supplementary Plane. These characters have the properties that:

- Being in IICore, they are used frequently enough to be a minimum requirement for software supporting ideographs

- They are in the Supplementary Plane and will test support for code points above U+FFFF

- They are not "oddball" values. If using them uncovers a problem, fixing the problem is inherently justified.
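
The selection step described above (keep only the characters above U+FFFF) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names supplementary_only and utf16_units are made up for this example, and the code points in the candidate list are placeholders, not actual members of the IICore set. The second function shows why such characters make good test values: each one needs two UTF-16 code units (a surrogate pair).

```python
def supplementary_only(chars):
    """Keep only the characters whose code point is above U+FFFF."""
    return [ch for ch in chars if ord(ch) > 0xFFFF]


def utf16_units(ch):
    """Return the UTF-16 code units of one character as a list of integers."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# Placeholder candidates (hypothetical, not taken from IICore):
# one BMP ideograph and two Plane 2 ideographs.
candidates = ["\u4E00", "\U00020000", "\U0002A6D6"]

for ch in supplementary_only(candidates):
    units = ", ".join("0x%04X" % u for u in utf16_units(ch))
    print("U+%04X -> UTF-16 code units: %s" % (ord(ch), units))
```

Software that assumes one 16-bit unit per character will typically mishandle exactly these values, which is the kind of problem the test characters are meant to expose.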

 
Tex

 
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy via Unicode
Sent: Monday, November 13, 2017 12:58 PM
To: James Kass
Cc: Peter Constable; Unicode list
Subject: Re: Plane-2-only string

 
2017-11-13 21:48 GMT+01:00 James Kass <jameskasskrv at gmail.com>:

Peter Constable wrote,

>> May be this test page ?
>>
>> http://www.i18nguy.com/unicode/supplementary-test.html
>
> Thanks. I'd need to know _at least something_ about what the characters
> signify, though, to have a sense of whether there's anything potentially
> offensive.

The Plane 2 characters on that page appear to be random.

 
That's probable, but the authors claim these are common characters. It's possible they collected statistics from some corpus to find some of the most widely used characters in Plane 2, without needing to understand what they would mean if put side by side. (I had already noted that there was no punctuation at all, that the collection shown is too long for a typical Chinese text, and that I would in fact expect some CJK punctuation to be present.)

Maybe we could compile a list of Chinese toponyms using these, select those that use more than one Plane 2 character, and then separate these names using CJK commas and a final CJK full stop.

 
A Wikidata or OSM data search could be used to compile such a list. I think these toponyms are more likely to be found in Cantonese or Taiwanese-related sources using the zh-Hant variant. Note, however, that Wikidata does not distinguish zh-Hans and zh-Hant, as Wikimedia wikis use a transliterator; I doubt this transliterator transforms Plane 2 characters, which should remain unchanged, with most of them kept for both traditional and simplified use.


From unicode at unicode.org  Tue Nov 14 02:04:09 2017
From: unicode at unicode.org (Bobby Tung via Unicode)
Date: Tue, 14 Nov 2017 16:04:09 +0800
Subject: Plane-2-only string
In-Reply-To: <eb839ba70a6e6dc5c85ba5a0b4318a5b@koremail.com>
References: <CY4PR21MB0822FA055A6FE4E6BF20701DD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z1WaWzKGCU8ZTobgdBqTzruawZjAV+4sGNfq+iGxoFjGQ@mail.gmail.com>
 <CY4PR21MB0822FAFE8002F2F78D3D196AD52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <CABPY6Z03SmQ+K+HykgTQyGYhXFBSC_2r5EA6-XU+VdYWFpqoig@mail.gmail.com>
 <CY4PR21MB08220FC77F2693C614903E84D52B0@CY4PR21MB0822.namprd21.prod.outlook.com>
 <eb839ba70a6e6dc5c85ba5a0b4318a5b@koremail.com>
Message-ID: <662D1539-BA8A-4237-BA66-7857FF4C8A8E@wanderer.tw>

Hello,

Here's a list of frequently used Han characters for Hakka and Minnan, Chinese dialects.

It contains several EXT-B characters that you can test: 

http://bobbytung.github.io/TaigiHakkaIdeograph/
https://docs.google.com/spreadsheets/d/18CUbZ7tsvZ4QbUj3xcfYi9EGqsft4T37WtUMX9v2STQ/pubhtml


Bobby Tung
W3C invited expert
Editor of CLREQ


> On 14 Nov 2017, at 1:45 PM, via Unicode <unicode at unicode.org> wrote:
> 
> Dear Peter,
> 
> Since the Chinese characters below are meaningless in Chinese, using them should not be a first choice: they are gibberish, just not complete gibberish.
> 
> Plane 2 has a fair number of older Chinese characters, so someone with a knowledge of ancient Chinese might well be able to make something meaningful. Running a competition in China would be one way to get suggestions; spotting a good suggestion is easier than making one.
> 
> Plane 2 has Cantonese, Vietnamese and Zhuang characters. The number of Cantonese characters is small, so making phrases using only them would be difficult. Both Vietnamese and Zhuang have a much larger number of characters, so it is much easier to make something meaningful.
> 
> The following is a Zhuang proverb, or saying:
> 
> ????????????????????
> 
> "Plant sweet potatoes in the field, and raise pigs in the sty." [lit.: house, as the bottom floor of a traditional house is used for livestock and people live on the floor above.]
> 
> However, the third and eighth characters are not the most commonly used ones.
> 
> Regards
> John
> 
> 
> On 14.11.2017 06:38, Peter Constable via Unicode wrote:
>> I discussed this with one of my Chinese co-workers, and we came up
>> with the following:
>> 
>> ???????????
>> ??????????
>> ??????????
>> ???????????
>> 
>> Factors in the choice of characters were:
>> - different radicals
>> - for a given radical, have a sequence of consecutive characters so
>> people get the idea it's not a sentence but just a sequence of
>> characters with related meanings
>> - radical groups increase in complexity
>> 
>> 
>> It's not a sentence that can be read, but there's an obvious pattern,
>> so it's also not completely gibberish.
>> 
>> 
>> Peter
>> 
>> -----Original Message-----
>> From: James Kass [mailto:jameskasskrv at gmail.com]
>> Sent: Monday, November 13, 2017 2:29 PM
>> To: Peter Constable <petercon at microsoft.com>
>> Cc: Unicode list <unicode at unicode.org>
>> Subject: Re: Plane-2-only string
>> 
>> Peter Constable wrote,
>> 
>>> We don't want to add BMP characters to the ExtB fonts.
>> 
>> So the sample text would lack punctuation.  Given that the
>> Supplementary Ideographic Plane is composed of rare and historical
>> characters from multiple sources, I suspect that the short answer to
>> Peter's original question is:  "No".
> 


From unicode at unicode.org  Thu Nov 30 13:58:02 2017
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Thu, 30 Nov 2017 19:58:02 +0000 (GMT)
Subject: International Digital Preservation Day
Message-ID: <23335823.63353.1512071882309.JavaMail.defaultUser@defaultHost>

I have learned this evening (I am in England where it is nearly 8pm as I write this note) that today, Thursday 30 November 2017, is the first International Digital Preservation Day.

I have searched on the web and found lots of links about International Digital Preservation Day.

William Overington

Thursday 30 November 2017