From samjnaa at gmail.com Sat Feb 1 22:20:32 2014
From: samjnaa at gmail.com (Shriramana Sharma)
Date: Sun, 2 Feb 2014 09:50:32 +0530
Subject: Astrological symbol for Pluto?
Message-ID:

Currently Unicode encodes a distinct astrological symbol for Uranus, 2645 ♅, vs an astronomical symbol, 26E2 ⛢. However, the only symbol encoded for Pluto is the astronomical one: 2647 ♇. Just now I learnt from https://en.wikipedia.org/wiki/Pluto#Name that there is a distinct astrological symbol:

[image: Inline image 1]

Has there been any proposal to encode this? (I'm guessing Michael might be interested...)

--
Shriramana Sharma

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 200px-Pluto's_astrological_symbol.svg.png
Type: image/png
Size: 3056 bytes
Desc: not available

From ishida at w3.org Sun Feb 2 11:46:14 2014
From: ishida at w3.org (Richard Ishida)
Date: Sun, 02 Feb 2014 17:46:14 +0000
Subject: UniView is back
Message-ID: <52EE8466.2050307@w3.org>

If you were a user of my UniView tool, you'll find a new version at http://rishida.net/uniview/

I am rebuilding UniView without PHP. Most of the essential features work in the new version, but there are one or two that I have yet to rebuild, and you may find that the odd thing just won't work. The principal things still outstanding are:

- Searching the Unicode database (searching the information local to the page works, if you select 'local')
- Listing of characters with a given property (although filtering the information currently on the page still works, if you turn 'local' on)
- My notes on individual characters no longer appear at the bottom of the right panel
- You can't show an annotated list of all characters in a block

A graphic X indicates some of the things that don't work, or only partially work. The images will be removed as the features are reinstated.
You may also find that UniView is initially slower in a couple of ways. I haven't yet reinstated the AJAX calls that pull in character data at the point of need: instead the app downloads 2.5 MB of data before running. Also, the initial draw of a block on the left is somewhat slower than before. Hopefully, over time, I will address these issues.

If there's a feature you especially need that is not available, let me know and I may be able to prioritise work on it.

From jknappen at web.de Mon Feb 3 01:57:19 2014
From: jknappen at web.de (Jörg Knappen)
Date: Mon, 3 Feb 2014 08:57:19 +0100 (CET)
Subject: Aw: Astrological symbol for Pluto?
In-Reply-To:
References:
Message-ID:

An HTML attachment was scrubbed...

From frederic.grosshans at gmail.com Mon Feb 3 04:45:42 2014
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Mon, 03 Feb 2014 11:45:42 +0100
Subject: Aw: Astrological symbol for Pluto?
In-Reply-To:
References:
Message-ID: <52EF7356.2010207@gmail.com>

On 03/02/2014 08:57, "Jörg Knappen" wrote:
> Unfortunately, this astrological symbol is given in the Wikipedia article, but not sourced. So I think further evidence for its usage is needed.

Actually, it is sourced (with the other symbols) to http://www.uranian-institute.org/bfglyphs.htm, which lists no fewer than 4 symbols for Pluto...

Fred

From samjnaa at gmail.com Mon Feb 3 07:14:39 2014
From: samjnaa at gmail.com (Shriramana Sharma)
Date: Mon, 3 Feb 2014 18:44:39 +0530
Subject: Aw: Astrological symbol for Pluto?
In-Reply-To: <52EF7356.2010207@gmail.com>
References: <52EF7356.2010207@gmail.com>
Message-ID:

On Mon, Feb 3, 2014 at 4:15 PM, Frédéric Grosshans <frederic.grosshans at gmail.com> wrote:
> Actually, it is sourced (with the other symbols) to http://www.uranian-institute.org/bfglyphs.htm, which lists no fewer than 4 symbols for Pluto...

In any case, it seems its astronomical symbol was encoded quite early (DerivedAge = 1.1), which was before the 2006 IAU decision to demote Pluto to dwarf-planet status. Of course, even if it were encoded today, I'm sure it would be the only dwarf planet to have a symbol encoded, since no other dwarf planet has captured the common man's imagination (and basic knowledge) like Pluto, and I have not heard of any of the other dwarf planets (Ceres, Haumea, Makemake and Eris) having any symbols...

--
Shriramana Sharma

From andrewcwest at gmail.com Mon Feb 3 07:29:22 2014
From: andrewcwest at gmail.com (Andrew West)
Date: Mon, 3 Feb 2014 13:29:22 +0000
Subject: Aw: Astrological symbol for Pluto?
In-Reply-To:
References: <52EF7356.2010207@gmail.com>
Message-ID:

On 3 February 2014 13:14, Shriramana Sharma wrote:
> In any case, it seems its astronomical symbol was encoded quite early (DerivedAge = 1.1), which was before the 2006 IAU decision to demote Pluto to dwarf-planet status. Of course, even if it were encoded today, I'm sure it would be the only dwarf planet to have a symbol encoded, since no other dwarf planet has captured the common man's imagination (and basic knowledge) like Pluto, and I have not heard of any of the other dwarf planets (Ceres, Haumea, Makemake and Eris) having any symbols...
Well, there are no fewer than four unencoded astrological symbols for Eris according to this Wikipedia article:

Andrew

From jknappen at web.de Mon Feb 3 07:43:59 2014
From: jknappen at web.de (Jörg Knappen)
Date: Mon, 3 Feb 2014 14:43:59 +0100 (CET)
Subject: Aw: Re: Astrological symbol for Pluto?
In-Reply-To:
References: <52EF7356.2010207@gmail.com>
Message-ID:

An HTML attachment was scrubbed...

From everson at evertype.com Mon Feb 3 10:34:41 2014
From: everson at evertype.com (Michael Everson)
Date: Mon, 3 Feb 2014 08:34:41 -0800
Subject: Astrological symbol for Pluto?
In-Reply-To:
References:
Message-ID:

On 1 Feb 2014, at 20:20, Shriramana Sharma wrote:
> Currently Unicode encodes a distinct astrological symbol for Uranus, 2645 ♅, vs an astronomical symbol, 26E2 ⛢.
>
> Has there been any proposal to encode this? (I'm guessing Michael might be interested...)

I'd be happy to, unless there was going to be pushback from the UTC.

Michael Everson * http://www.evertype.com/

From martinho.fernandes at gmail.com Tue Feb 4 08:43:37 2014
From: martinho.fernandes at gmail.com (Martinho Fernandes)
Date: Tue, 04 Feb 2014 15:43:37 +0100
Subject: Arabic percent sign and percent signs in RTL scripts
Message-ID:

Is the Arabic percent sign (U+066A) just a typographical variation of the "normal" percent sign (U+0025), or is it somehow more distinct than that? What about its placement? Is it placed to the left or to the right of the digits it applies to?

Best regards,
Martinho

From James_Lin at symantec.com Tue Feb 4 11:05:53 2014
From: James_Lin at symantec.com (James Lin)
Date: Tue, 4 Feb 2014 09:05:53 -0800
Subject: Arabic percent sign and percent signs in RTL scripts
In-Reply-To:
References:
Message-ID:

For Arabic, the percentage sign is fixed on the left side of the digits: %10. For Hebrew, the percentage sign is on the right side of the digits: 10%.
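[Editorial aside: the bidirectional character classes behind this placement behaviour can be inspected with Python's standard unicodedata module. This is an illustration added for reference, not part of the original thread.]

```python
import unicodedata

# Both percent signs are bidi class ET ("European Terminator"); where a
# sign ends up visually depends on the digits and the surrounding text,
# as resolved by the Unicode Bidirectional Algorithm (UAX #9).
for cp, label in [
    ("\u0025", "PERCENT SIGN"),
    ("\u066A", "ARABIC PERCENT SIGN"),
    ("\u0030", "DIGIT ZERO"),
    ("\u0660", "ARABIC-INDIC DIGIT ZERO"),
]:
    print(f"U+{ord(cp):04X} {label}: {unicodedata.bidirectional(cp)}")
# U+0025 PERCENT SIGN: ET
# U+066A ARABIC PERCENT SIGN: ET
# U+0030 DIGIT ZERO: EN
# U+0660 ARABIC-INDIC DIGIT ZERO: AN
```

Because both signs are ET, neither forces a side on its own; the rendered position falls out of the bidi algorithm applied to the whole run.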
-James

From: Martinho Fernandes
Date: Tuesday, February 4, 2014 6:43 AM
To: Unicode List
Subject: Arabic percent sign and percent signs in RTL scripts

> Is the Arabic percent sign (U+066A) just a typographical variation of the "normal" percent sign (U+0025), or is it somehow more distinct than that? What about its placement? Is it placed to the left or to the right of the digits it applies to?
>
> Best regards,
> Martinho

From jkorpela at cs.tut.fi Tue Feb 4 13:37:06 2014
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Tue, 04 Feb 2014 21:37:06 +0200
Subject: Arabic percent sign and percent signs in RTL scripts
In-Reply-To:
References:
Message-ID: <52F14162.8040800@cs.tut.fi>

2014-02-04 19:05, James Lin wrote:
> For Arabic, the percentage sign is fixed on the left side of the digits: %10

There seem to be different opinions and practices on this. In the CLDR database, the formats have "%" (the ASCII percent sign) on the right of the number, as far as I can see; Arabic inherits the root settings for percentages.

Yucca

From richard.wordingham at ntlworld.com Tue Feb 4 18:51:09 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 5 Feb 2014 00:51:09 +0000
Subject: Arabic percent sign and percent signs in RTL scripts
In-Reply-To: <52F14162.8040800@cs.tut.fi>
References: <52F14162.8040800@cs.tut.fi>
Message-ID: <20140205005109.6db5accd@JRWUBU2>

On Tue, 04 Feb 2014 21:37:06 +0200 "Jukka K. Korpela" wrote:
> 2014-02-04 19:05, James Lin wrote:
> > For Arabic, the percentage sign is fixed on the left side of the digits: %10
> There seem to be different opinions and practices on this. In the CLDR database, the formats have "%" (the ASCII percent sign) on the right of the number, as far as I can see; Arabic inherits the root settings for percentages.
As far as I can *see* (perhaps there are hidden format characters in the CLDR data), the '%' follows the digits, and so will occur on the left if the number plus percentage sign is flanked by Arabic text. Character sequence yields, recording right-to-left, glyph sequence. Both percentage signs are of bidi class ET ('European Terminator'), and the preceding Arabic text converts the digits to class AN ('Arabic Number'), whichever of the three sets of Arabic digits (DIGIT ZERO onwards, ARABIC-INDIC DIGIT ZERO onwards, or EXTENDED ARABIC-INDIC DIGIT ZERO onwards) is used.

Richard.

From khaledhosny at eglug.org Wed Feb 5 02:06:03 2014
From: khaledhosny at eglug.org (Khaled Hosny)
Date: Wed, 5 Feb 2014 10:06:03 +0200
Subject: Arabic percent sign and percent signs in RTL scripts
In-Reply-To:
References:
Message-ID: <20140205080602.GA15328@khaled-laptop>

On Tue, Feb 04, 2014 at 03:43:37PM +0100, Martinho Fernandes wrote:
> Is the Arabic percent sign (U+066A) just a typographical variation of the "normal" percent sign (U+0025), or is it somehow more distinct than that?

The former. It is mainly used when Arabic-Indic or Extended Arabic-Indic digits are used.

> What about its placement? Is it placed to the left or to the right of the digits it applies to?

It should follow the digits in the input stream, and its proper visual placement should be handled by the Unicode bidirectional algorithm.

Regards,
Khaled

From rhavin at shadowtec.de Tue Feb 4 16:25:06 2014
From: rhavin at shadowtec.de (Rhavin Grobert)
Date: Tue, 04 Feb 2014 23:25:06 +0100
Subject: proposal for new character 'soft/preferred line break'
Message-ID: <52F168C2.7090401@shadowtec.de>

Parallel to the soft hyphen (a hyphen that is only inserted if the word is broken there), it would be practical to have some way to tell the browser: if you need to break the line, try here first. This would be really useful for poems, music lines, addresses, ...
And it would be really easy to implement: there is no visual representation needed, and if the right code point is chosen, it would be downward-compatible with all systems not knowing of the new character. Also, the implementation in browsers would be very easy to accomplish.

Please support this proposal,
Rhavin Grobert

--
Rhavin Grobert · ShadowTec media, Bödikersteig 11, 13629 Berlin
http://rhavin.de/ · Sound engineering, consulting, maintenance and planning
MCITP & Event Professional, Windows & Linux administrator
C++ · Perl · Java · JavaScript · Ruby · XHTML+CSS · XML · PowerShell

From markus.icu at gmail.com Wed Feb 5 10:22:12 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 5 Feb 2014 08:22:12 -0800
Subject: proposal for new character 'soft/preferred line break'
In-Reply-To: <52F168C2.7090401@shadowtec.de>
References: <52F168C2.7090401@shadowtec.de>
Message-ID:

On Tue, Feb 4, 2014 at 2:25 PM, Rhavin Grobert wrote:
> Parallel to the soft hyphen (a hyphen that is only inserted if the word is broken there), it would be practical to have some way to tell the browser: if you need to break the line, try here first. This would be really useful for poems, music lines, addresses, ...

That would be HTML <wbr> or U+200B ZERO WIDTH SPACE.

> And it would be really easy to implement: there is no visual representation needed, and if the right code point is chosen, it would be downward-compatible with all systems not knowing of the new character.

Unlikely. There are some unassigned code points that are predefined with Default_Ignorable_Code_Point, but that is not supported everywhere.

> Also, the implementation in browsers would be very easy to accomplish.

Maybe. You could research how widely <wbr> and U+200B are supported. (I don't have that data.)

Best regards,
markus

From jkorpela at cs.tut.fi Wed Feb 5 12:35:59 2014
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Wed, 05 Feb 2014 20:35:59 +0200
Subject: proposal for new character 'soft/preferred line break'
In-Reply-To:
References: <52F168C2.7090401@shadowtec.de>
Message-ID: <52F2848F.6050103@cs.tut.fi>

2014-02-05 18:22, Markus Scherer wrote:
> On Tue, Feb 4, 2014 at 2:25 PM, Rhavin Grobert wrote:
> > Parallel to the soft hyphen (a hyphen that is only inserted if the word is broken there), it would be practical to have some way to tell the browser: if you need to break the line, try here first. This would be really useful for poems, music lines, addresses, ...
>
> That would be HTML <wbr> or U+200B ZERO WIDTH SPACE.

As a suggested direct line break point, they both work fine, with a few caveats though, making it a bit difficult to decide which one is better; see my treatise http://www.cs.tut.fi/~jkorpela/html/nobr.html#suggest

In plain text, of course, U+200B is the way. The main problem with it is that some software, including some old browsers like IE 6, does not recognize it but tries to render it as a graphic character, possibly using a font that has no glyph for it. Adding a new character would not help here at all, of course.

> > And it would be really easy to implement: there is no visual representation needed, and if the right code point is chosen, it would be downward-compatible with all systems not knowing of the new character.
>
> Unlikely.

Indeed, there is no reason to expect old software to silently ignore characters that it does not recognize. Whatever the Unicode Standard might say, old software just does what it has been programmed to do, and this may well be "here's a character for which I have no special rule, so I'll use whatever is available in the font(s) I'm using", typically resulting in a small rectangle that represents a character for which no glyph is available.

But I'm not quite sure of the idea of the suggestion. If the idea is to provide an optional break point, in a position where none would normally be present, then U+200B is the way.
Not 100% reliable, but better than anything else (in plain text).

But if the idea is to suggest that, among permissible line break points, this one is preferable, then it's a different issue. Theoretically interesting, but in practical terms, things don't work that way. In practice, there are permissible line break points (either by implicit rules that e.g. normally allow a break after a space, or by explicit indication with U+200B). Programs will take it from there, and if they do some optimization, like good publishing software does, they typically optimize the division of an entire paragraph into lines, applying several criteria.

Yucca

From richard.wordingham at ntlworld.com Wed Feb 5 14:20:23 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 5 Feb 2014 20:20:23 +0000
Subject: proposal for new character 'soft/preferred line break'
In-Reply-To:
References: <52F168C2.7090401@shadowtec.de>
Message-ID: <20140205202023.02bf8b48@JRWUBU2>

On Wed, 5 Feb 2014 08:22:12 -0800 Markus Scherer wrote:
> On Tue, Feb 4, 2014 at 2:25 PM, Rhavin Grobert wrote:
> > Parallel to the soft hyphen (a hyphen that is only inserted if the word is broken there), it would be practical to have some way to tell the browser: if you need to break the line, try here first. This would be really useful for poems, music lines, addresses, ...
>
> That would be HTML <wbr> or U+200B ZERO WIDTH SPACE.

I don't think these are the same. They give permission for the line to be broken at those points, with a strong tendency for the opportunity nearest the end to be taken. What Rhavin wants to do is to override this tendency. I presume the idea is that if a line of poetry will not fit on a physical line, the line should instead be broken at its principal caesura. While such logic makes sense if a line only needs to be broken once, what if it needs to be broken three or four times?
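[Editorial aside: the behaviour under discussion can be sketched in a few lines of Python. This is a hypothetical illustration of the proposal, not an existing Unicode mechanism; '|' stands in for the proposed preferred-break character, and breaks are taken only at those marks.]

```python
def wrap_preferred(text, width, marker="|"):
    """Greedy wrap that breaks only at preferred-break markers.

    The marker stands in for the proposed (non-existent) invisible
    character; it is removed from the output whether or not a break
    is taken at it.
    """
    segments = text.split(marker)
    if len("".join(segments)) <= width:
        return ["".join(segments)]  # everything fits on one line
    lines, current = [], segments[0]
    for seg in segments[1:]:
        if len(current) + len(seg) > width:
            lines.append(current)
            current = seg.lstrip()
        else:
            current += seg
    lines.append(current)
    return lines

print(wrap_preferred(
    "The princely palace of the sun| stood gorgeous to behold", 40))
# ['The princely palace of the sun', 'stood gorgeous to behold']
```

Note the failure mode raised above: when a single segment is itself wider than the available width, this sketch simply emits an overlong line, and real layout would then have to fall back to ordinary space breaks.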
I feel this logic belongs in the realm of complex mark-up rather than the very simple mark-up afforded by characters. I'll give an example. As I don't trust my formatting to survive, I've marked the end of physical lines with a raised dot (·). For example, consider:

The princely palace of the sun stood gorgeous to behold·
On stately pillars builded high of yellow burnished gold·

If we break it at the principal caesuras, then

The princely palace of the sun· stood gorgeous to behold·
On stately pillars builded high· of yellow burnished gold·

looks fine. (Am I cheating by believing one would choose to have continuations of lines indented?) However, if the available width is reduced further:

The princely palace of the· sun· stood gorgeous to· behold·
On stately pillars builded· high· of yellow burnished· gold·

the result is a mess.

Richard.

From rhavin at shadowtec.de Wed Feb 5 15:44:18 2014
From: rhavin at shadowtec.de (Rhavin Grobert)
Date: Wed, 05 Feb 2014 22:44:18 +0100
Subject: proposal for new character 'soft/preferred line break'
In-Reply-To:
References: <52F168C2.7090401@shadowtec.de>
Message-ID: <52F2B0B2.9090209@shadowtec.de>

My last mail was sent from the wrong address; sorry if you get it twice, answer to this one ;)

On 05.02.2014 17:22, Markus Scherer wrote:
> On Tue, Feb 4, 2014 at 2:25 PM, Rhavin Grobert wrote:
> > Parallel to the soft hyphen (a hyphen that is only inserted if the word is broken there), it would be practical to have some way to tell the browser: if you need to break the line, try here first. This would be really useful for poems, music lines, addresses, ...
>
> That would be HTML <wbr> or U+200B ZERO WIDTH SPACE.

No, you did not understand. <wbr> is like &shy;: it's below the whitespace level. If the line is too long, it breaks a word:

"This is a long line with a verylongawesomeword in its middle."

<wbr> gives the opportunity to break at long|awesome. But what I mean is (a non-existing "sbr", parallel to shy, assumed):

"Do you think me gentle,<sbr>do you think me cold?
do you wanna risk a<sbr>look into my thoughts?"

If the line is long enough:

"Do you think me gentle, do you think me cold?
do you wanna risk a look into my thoughts?"

If the line is not long enough:

"Do you think me gentle,
do you think me cold?
do you wanna risk a
look into my thoughts?"

Poems need some whitespace element that is *above* usual whitespaces when it comes to line breaks, and <wbr> and &shy; are *below* all whitespaces.

--
Rhavin Grobert · ShadowTec media, Bödikersteig 11, 13629 Berlin
http://rhavin.de/ · Sound engineering, consulting, maintenance and planning
MCITP & Event Professional, Windows & Linux administrator
C++ · Perl · Java · JavaScript · Ruby · XHTML+CSS · XML · PowerShell

From buck at yelp.com Wed Feb 5 16:15:44 2014
From: buck at yelp.com (Buck Golemon)
Date: Wed, 5 Feb 2014 14:15:44 -0800
Subject: proposal for new character 'soft/preferred line break'
In-Reply-To:
References: <52F168C2.7090401@shadowtec.de>
Message-ID:

On Wed, Feb 5, 2014 at 8:22 AM, Markus Scherer wrote:
> On Tue, Feb 4, 2014 at 2:25 PM, Rhavin Grobert wrote:
>> Parallel to the soft hyphen (a hyphen that is only inserted if the word is broken there), it would be practical to have some way to tell the browser: if you need to break the line, try here first. This would be really useful for poems, music lines, addresses, ...
>
> That would be HTML <wbr> or U+200B ZERO WIDTH SPACE.
>
>> And it would be really easy to implement: there is no visual representation needed, and if the right code point is chosen, it would be downward-compatible with all systems not knowing of the new character.
>
> Unlikely. There are some unassigned code points that are predefined with Default_Ignorable_Code_Point, but that is not supported everywhere.
>
>> Also, the implementation in browsers would be very easy to accomplish.
>
> Maybe. You could research how widely <wbr> and U+200B are supported. (I don't have that data.)
Here's the wbr support story: http://www.quirksmode.org/oddsandends/wbr.html

From jkorpela at cs.tut.fi Wed Feb 5 16:27:10 2014
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Thu, 06 Feb 2014 00:27:10 +0200
Subject: proposal for new character 'soft/preferred line break'
In-Reply-To: <52F2B0B2.9090209@shadowtec.de>
References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de>
Message-ID: <52F2BABE.4070406@cs.tut.fi>

2014-02-05 23:44, Rhavin Grobert wrote:
> Wbr gives the opportunity to break at long|awesome. But what I mean is (a non-existing "sbr", parallel to shy, assumed):

Just giving a hypothetical character or tag an identifier does not specify its intended meaning.

> "Do you think me gentle,do you think me cold?
> do you wanna risk alook into my thoughts?"
>
> if the line is long enough:
>
> "Do you think me gentle, do you think me cold?
> do you wanna risk a look into my thoughts?"
>
> if the line is not long enough:
>
> "Do you think me gentle,
> do you think me cold?
> do you wanna risk a
> look into my thoughts?"

This seems to be, more or less, what Richard Wordingham guessed you meant.

> Poems need some whitespace element that is *above* usual whitespaces when it comes to line breaks, and <wbr> and &shy; are *below* all whitespaces.

Anything "above" the character level is generally up to higher-level protocols rather than what the Unicode Standard deals with.

It seems to me that what you actually want is to make some line break points the only allowed break points. So you would rather want to prohibit breaks elsewhere than introduce a "soft/preferred line break".

At the character level, you could use no-break spaces for the purpose. Using the entity reference &nbsp; (for U+00A0) for clarity here, you could write

Do&nbsp;you&nbsp;think&nbsp;me&nbsp;gentle, do&nbsp;you&nbsp;think&nbsp;me&nbsp;cold? do&nbsp;you&nbsp;wanna&nbsp;risk&nbsp;a look&nbsp;into&nbsp;my&nbsp;thoughts?
If the text contains hyphens or other characters that might allow a line break by default, you may need something extra.

If this is actually about HTML authoring, you can successfully use

<nobr>Do you think me gentle,</nobr> <nobr>do you think me cold?</nobr> <nobr>do you wanna risk a</nobr> <nobr>look into my thoughts?</nobr>

If you need/want to "conform to HTML standards", you can, with some marginal loss in functionality, use ... instead of nobr elements.

Anyway, there appear to be existing solutions to the problem. They might be a bit clumsy, but adding an "exclusive line break opportunity" to Unicode would introduce quite some complexity and burden on implementations.

Yucca

From asmusf at ix.netcom.com Wed Feb 5 17:55:46 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 05 Feb 2014 15:55:46 -0800
Subject: proposal for new character 'soft/preferred line break'
In-Reply-To: <52F2BABE.4070406@cs.tut.fi>
References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de> <52F2BABE.4070406@cs.tut.fi>
Message-ID: <52F2CF82.2020607@ix.netcom.com>

I agree, the use of markup is more appropriate to the problem. This is not a plain text issue, and it even fails the "smell test" for "issue that is more elegantly solved by format characters than markup".

A./

On 2/5/2014 2:27 PM, Jukka K. Korpela wrote:
> 2014-02-05 23:44, Rhavin Grobert wrote:
>
>> Wbr gives the opportunity to break at long|awesome. But what I mean is (a non-existing "sbr", parallel to shy, assumed):
>
> Just giving a hypothetical character or tag an identifier does not specify its intended meaning.
>
>> "Do you think me gentle,do you think me cold?
>> do you wanna risk alook into my thoughts?"
>>
>> if the line is long enough:
>>
>> "Do you think me gentle, do you think me cold?
>> do you wanna risk a look into my thoughts?"
>>
>> if the line is not long enough:
>>
>> "Do you think me gentle,
>> do you think me cold?
>> do you wanna risk a
>> look into my thoughts?"
>
> This seems to be, more or less, what Richard Wordingham guessed you meant.
>
>> Poems need some whitespace element that is *above* usual whitespaces when it comes to line breaks, and <wbr> and &shy; are *below* all whitespaces.
>
> Anything "above" the character level is generally up to higher-level protocols rather than what the Unicode Standard deals with.
>
> It seems to me that what you actually want is to make some line break points the only allowed break points. So you would rather want to prohibit breaks elsewhere than introduce a "soft/preferred line break".
>
> At the character level, you could use no-break spaces for the purpose. Using the entity reference &nbsp; (for U+00A0) for clarity here, you could write
>
> Do&nbsp;you&nbsp;think&nbsp;me&nbsp;gentle, do&nbsp;you&nbsp;think&nbsp;me&nbsp;cold? do&nbsp;you&nbsp;wanna&nbsp;risk&nbsp;a look&nbsp;into&nbsp;my&nbsp;thoughts?
>
> If the text contains hyphens or other characters that might allow a line break by default, you may need something extra.
>
> If this is actually about HTML authoring, you can successfully use
>
> <nobr>Do you think me gentle,</nobr> <nobr>do you think me cold?</nobr> <nobr>do you wanna risk a</nobr> <nobr>look into my thoughts?</nobr>
>
> If you need/want to "conform to HTML standards", you can, with some marginal loss in functionality, use ... instead of nobr elements.
>
> Anyway, there appear to be existing solutions to the problem. They might be a bit clumsy, but adding an "exclusive line break opportunity" to Unicode would introduce quite some complexity and burden on implementations.
>
> Yucca
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode

From fantasai.lists at inkedblade.net Fri Feb 7 02:02:01 2014
From: fantasai.lists at inkedblade.net (fantasai)
Date: Fri, 07 Feb 2014 00:02:01 -0800
Subject: [CSSWG][css-writing-modes] Last Call for Comments on CSS3 Writing Modes
In-Reply-To: <5C1870EC-0ED7-400A-A469-FB6635D4FEB1@gluesoft.co.jp>
References: <5C1870EC-0ED7-400A-A469-FB6635D4FEB1@gluesoft.co.jp>
Message-ID: <52F492F9.3020705@inkedblade.net>

On 01/27/2014 05:34 PM, Koji Ishii wrote:
> On Dec 21, 2013, at 20:39, CE Whitehead wrote:
>
>> 4.3
>> "alphabetic
>> The alphabetic baseline is assumed to be at the under margin edge.
>> "central
>> The central baseline is assumed to be halfway between the under and over margin edges of the box."
>> =>
>> "alphabetic
>> The alphabetic baseline is assumed to be at the under-margin edge.
>> "central
>> The central baseline is assumed to be halfway between the under- and over-margin edges of the box."
>>
>> {COMMENT: normally when you use two words to modify a single word, as when "under margin", "over margin" modify the word "edge" or "edges", then it is customary to join the two modifying words with a hyphen.}
>
> Fixed.

Actually, this is an incorrect edit. I've reverted it. "Under" and "over" are in this case used as adjectives, and are not part of the word "margin". This follows the pattern of "left margin" as opposed to "left-margin".

>> 6.2, second paragraph (after the list of four "flow-relative directions" -- block-end, block-start, etc.)
>> "Where unambiguous (or dual-meaning), the terms start and end are used in place of block-start/inline-start and block-end/inline-end, respectively."
>>
>> {COMMENT: "unambiguous" is the opposite of "dual-meaning" -- "dual meaning" means "ambiguous"; do you mean the following? (if so it's o.k. to eliminate the stuff in parentheses altogether):}
>
> Fixed.
Similarly, this is an incorrect edit. The intent is the opposite of "ambiguous" in the sense of "lacking clearness or definiteness". If the intent is clear from context OR if the intent encompasses both meanings, then the ambiguous terms start/end are allowed to be used. I have removed the parentheses to make this clear.

~fantasai

From kojiishi at gluesoft.co.jp Fri Feb 7 02:22:10 2014
From: kojiishi at gluesoft.co.jp (Koji Ishii)
Date: Fri, 7 Feb 2014 08:22:10 +0000
Subject: [CSSWG][css-writing-modes] Last Call for Comments on CSS3 Writing Modes
In-Reply-To: <52F492F9.3020705@inkedblade.net>
References: <5C1870EC-0ED7-400A-A469-FB6635D4FEB1@gluesoft.co.jp> <52F492F9.3020705@inkedblade.net>
Message-ID:

On Feb 7, 2014, at 0:02, fantasai wrote:
>>> 6.2, second paragraph (after the list of four "flow-relative directions" -- block-end, block-start, etc.)
>>> "Where unambiguous (or dual-meaning), the terms start and end are used in place of block-start/inline-start and block-end/inline-end, respectively."
>>>
>>> {COMMENT: "unambiguous" is the opposite of "dual-meaning" -- "dual meaning" means "ambiguous"; do you mean the following? (if so it's o.k. to eliminate the stuff in parentheses altogether):}
>>
>> Fixed.
>
> Similarly, this is an incorrect edit. The intent is the opposite of "ambiguous" in the sense of "lacking clearness or definiteness". If the intent is clear from context OR if the intent encompasses both meanings, then the ambiguous terms start/end are allowed to be used. I have removed the parentheses to make this clear.

After a bit more discussion with fantasai: the intent of "dual-meaning" in this context is "both directions", but I thought it meant "either direction". Maybe it's better to use different wording that indicates "both directions" better?
/koji

From fantasai.lists at inkedblade.net Fri Feb 7 01:06:16 2014
From: fantasai.lists at inkedblade.net (fantasai)
Date: Thu, 06 Feb 2014 23:06:16 -0800
Subject: [CSSWG][css-writing-modes] Last Call for Comments on CSS3 Writing Modes
In-Reply-To:
References: <52957E08.1060000@inkedblade.net>
Message-ID: <52F485E8.4010306@inkedblade.net>

On 12/26/2013 05:58 AM, Aharon (Vladimir) Lanin wrote:
> Hixie filed https://www.w3.org/Bugs/Public/show_bug.cgi?id=24006 on Writing Modes in the beginning of December, and I added some comments there. It does not seem to have been addressed yet.

Thanks for punting that to the ML.

Wrt the paragraph beginning "In general...", it has been revised:

# In CSS, the paragraph embedding level must be set (following rule HL1)
# according to the direction property of the paragraph's containing
# block rather than by the heuristic given in steps P2 and P3 of the
# Unicode algorithm. There is, however, one exception: when the
# computed unicode-bidi of the paragraph's containing block is
# 'plaintext', the Unicode heuristics in P2 and P3 are used as
# described in [UAX9], without the HL1 override.

Wrt referring to the HL* rules: the bidi spec does not appear to require such references, only that modifications to the algorithm conform to those rules. However, I have added the references as you request to help clarify the intent.

Wrt using "must" everywhere: whether one agrees or disagrees with the style, it is not a habit of the CSS specs to do so, and statements without the modifier are nonetheless normative per http://www.w3.org/TR/css3-writing-modes/#conventions

>> is "the bidi control codes assigned to the end" defined anywhere?
>
> Yes, the control codes are defined under the various unicode-bidi values [..]

But I agree that some sort of reference is needed. Since this sentence is only a few paragraphs below the section that defines them, I haven't added a link.
But all of them are now talking about rule HL3, so this will help create that correspondence. > I now realize, however, that the spec does not make it 100% clear for > isolate-override whether it "combines" the isolate on the outside of > the override or vice-versa. This is now specified explicitly. Comment #2 is handled separately, see thread at http://lists.w3.org/Archives/Public/www-style/2014Feb/0267.htm Updated ED: http://dev.w3.org/csswg/css-writing-modes/ Please let me know if this sufficiently addresses the comment. ~fantasai From kojiishi at gluesoft.co.jp Fri Feb 7 13:01:07 2014 From: kojiishi at gluesoft.co.jp (Koji Ishii) Date: Fri, 7 Feb 2014 19:01:07 +0000 Subject: [CSSWG][css-writing-modes] Last Call for Comments on CSS3 Writing Modes In-Reply-To: References: <5C1870EC-0ED7-400A-A469-FB6635D4FEB1@gluesoft.co.jp> <52F492F9.3020705@inkedblade.net> Message-ID: > After a bit more discussion with fantasai, the intent of "dual-meaning" in this context > is "both directions", but I thought it means "either direction". > Maybe it's better to use different wording that indicates "both directions" better? And we've fixed this. /koji From fantasai.lists at inkedblade.net Fri Feb 7 14:48:15 2014 From: fantasai.lists at inkedblade.net (fantasai) Date: Fri, 07 Feb 2014 12:48:15 -0800 Subject: [CSSWG][css-writing-modes] Last Call for Comments on CSS3 Writing Modes In-Reply-To: References: <52957E08.1060000@inkedblade.net> <52F485E8.4010306@inkedblade.net> Message-ID: <52F5468F.7070304@inkedblade.net> On 02/07/2014 09:57 AM, Aharon (Vladimir) Lanin wrote: > Thanks, looks great! > > Just one nit: HL1 etc. are not rules. UAX9 refers to the HLs as > "clauses". So, the references to them should be something > like "clause HLx of [UAX9]". Fixed!
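The revised rule quoted above (embedding level from the 'direction' property per clause HL1, except the first-strong P2/P3 heuristic under 'plaintext') can be sketched in Python. This is an illustrative model, not any browser's code; `paragraph_level` is a hypothetical helper, and the P2 scan below ignores isolating runs and embeddings for simplicity:

```python
import unicodedata

def paragraph_level(text, direction="ltr", unicode_bidi="normal"):
    """Paragraph embedding level per the CSS rule quoted above:
    taken from the 'direction' property (clause HL1 of UAX #9),
    except that unicode-bidi: plaintext falls back to the
    first-strong heuristic of rules P2/P3."""
    if unicode_bidi == "plaintext":
        # Simplified P2: find the first strong character (L, R, or AL).
        for ch in text:
            bc = unicodedata.bidirectional(ch)
            if bc == "L":
                return 0
            if bc in ("R", "AL"):
                return 1
        return 0  # P3: no strong character => embedding level 0
    return 1 if direction == "rtl" else 0
```

For example, a paragraph of digits inside an RTL block still gets level 1 under HL1, but level 0 under 'plaintext', since digits are not strong characters.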
~fantasai From verdy_p at wanadoo.fr Mon Feb 10 01:13:05 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 10 Feb 2014 08:13:05 +0100 Subject: proposal for new character 'soft/preferred line break' In-Reply-To: <52F2B0B2.9090209@shadowtec.de> References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de> Message-ID: 2014-02-05 22:44 GMT+01:00 Rhavin Grobert : > last mail was sent from wrong address, sorry, if u get it twice, answer to > this one ;) > > > On 05.02.2014 17:22, Markus Scherer wrote: > > On Tue, Feb 4, 2014 at 2:25 PM, Rhavin Grobert > > wrote: > > > > Parallel to soft hyphen, a hyphen that is just inserted if the word > > was broken, it would be practical to have some way to tell browser: > > if you need to break the line, try here first. This would be really > > useful for poems, music lines, addresses,... > > > > > > That would be HTML <wbr> or > > U+200B ZERO WIDTH SPACE > > . > > > No, you did not understand. <wbr> is like ­, it's below the whitespace > level: if the line is too long, it breaks a word: > > "This is a long line with a verylongawesomeword in its middle." > > Wbr gives the opportunity to break at long|awesome. But what i mean is: > - non existing "sbr" in parallel to shy assumed - > > "Do you think me gentle,do you think me cold? > do you wanna risk alook into my thoughts?" > > if line is long enough: > > "Do you think me gentle, do you think me cold? > do you wanna risk a look into my thoughts?" > The <wbr> is enough for this purpose. A browser could even use them to give higher priority to break lines than on other breaking opportunities (on whitespace or with some punctuation).
However I'm not convinced this increased priority is a good thing if this cannot be controlled (the wbr element should then have an attribute to control this priority, relatively to standard priorities of break opportunities found in the plain text; it should also have attributes to control how the break will be realized, such as inserting hyphens or not, or another character, as it is not necessarily easy to deduce only from the language tagging). What you want is just to hint the line breaker in the renderer on where the line breaks are most beneficial. This is really something that does not belong to plain text, but to the presentation layer, and HTML for example is rich enough in such a presentation layer (that does not modify the underlying plain text, so if you get the "innerText" property of an element containing these tags, they are invisible and you'll only see the plain text itself). In my opinion the encoded SHY character is there only for legacy reasons (compatibility with older encodings when renderers had no good option to break words). But in HTML SHY is not needed and <wbr> will work better. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jkorpela at cs.tut.fi Mon Feb 10 01:53:37 2014 From: jkorpela at cs.tut.fi (Jukka K. Korpela) Date: Mon, 10 Feb 2014 09:53:37 +0200 Subject: proposal for new character 'soft/preferred line break' In-Reply-To: References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de> Message-ID: <52F88581.5080501@cs.tut.fi> 2014-02-10 9:13, Philippe Verdy wrote: > The <wbr> is enough for this purpose, No, since the purpose was clearly to specify a line break point that is preferred over other possible line break points, or even the only allowed line break point within a string. The <wbr> tag (an old nonstandard tag, now being standardized in HTML5) would not have been needed if browsers had supported U+200B.
It is nowadays debatable which one should be used (U+200B has the disadvantage of not being supported by IE 6, a still somewhat significant point). But in any case, they are for allowing direct line break points, nothing more. > A browser could even use them to give higher priority to break lines, That would be rather arbitrary and won't happen; there is no good reason for that. > What you want is just to hint the line breaker in the renderer on where > the line breaks are most beneficial. This is really something that > does not belong to plain text, but to the presentation layer, and HTML > for example is rich enough in such a presentation layer In rendering software, the choice between line break opportunities is usually either a very simple one (put as many characters on a line as possible) or a complicated layout decision that tries to optimize the spacing between words at a paragraph level. I don't think there is much room for any layout instructions at any layer, beyond interactive fine tuning where a human user instructs the program to split at a specific point and sees what happens, or prevents a specific break. Theoretically, it is an interesting idea to consider control characters or markup for line break opportunities with different preferability, but in practice, it would be too complicated as compared with the possible gain. > In my opinion the encoded SHY character is there only for legacy reasons > (compatibility with older encodings when renderers had no good option to > break words). But in HTML SHY is not needed and <wbr> will work better. They are completely different things. You might be confusing <wbr> with ­ (which is just a named reference for SHY, useful when you want it to be visible in source code).
Yucca From richard.wordingham at ntlworld.com Mon Feb 10 13:49:03 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 10 Feb 2014 19:49:03 +0000 Subject: proposal for new character 'soft/preferred line break' In-Reply-To: <52F88581.5080501@cs.tut.fi> References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de> <52F88581.5080501@cs.tut.fi> Message-ID: <20140210194903.07ad1df0@JRWUBU2> On Mon, 10 Feb 2014 09:53:37 +0200 "Jukka K. Korpela" wrote: > The <wbr> tag (an old nonstandard tag, now being standardized in > HTML5) would not have been needed if browsers had supported U+200B. > It is nowadays debatable which one should be used (U+200B has the > disadvantage of not being supported by IE 6, a still somewhat > significant point). U+200B has the distinct advantage of being a character, and therefore readily travelling with the words it separates. It's quite a useful character when dealing with inadequate or non-existent dictionaries for languages that don't have visible separators between words or, depending on line-breaking practice, syllables. Richard. From verdy_p at wanadoo.fr Mon Feb 10 14:30:41 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 10 Feb 2014 21:30:41 +0100 Subject: proposal for new character 'soft/preferred line break' In-Reply-To: <52F88581.5080501@cs.tut.fi> References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de> <52F88581.5080501@cs.tut.fi> Message-ID: 2014-02-10 8:53 GMT+01:00 Jukka K. Korpela : > They are completely different things. You might be confusing <wbr> with > ­ (which is just a named reference for SHY, useful when you want it to > be visible in source code). > No, I make no confusion: <wbr> is a formatting HTML element, SHY (or ­ in HTML syntax for the defined entity) is a character.
Both play equivalent roles in HTML, except that ­ has a defined behavior to insert a hyphen at end of broken lines, where <wbr> would adopt a language-dependent behavior (not all languages use hyphens at end of lines to mark words that have been split by breaking lines). I really know that ­ and SHY are synonyms in this context but <wbr> is a bit different and is not part of plain-text (notably it will be filtered out from $(element).innerText, but not ­). Note that some browsers are resolving the "innerText" property of HTML DOM elements by parsing the CSS properties, so this property does not really reflect only the plain-text elements of the document: Chrome for example does this to remove spans of texts that are hidden, either by display:none, or display:hidden, or color:transparent, and it transforms <br> elements into newlines, and detects the boundary of block-elements (e.g. with "display:block" or "display:table-cell") to generate newline characters, or sometimes tabs. Chrome also injects text added by CSS ":before" and ":after" selectors. The effect of all this is that a browser uses the HTML DOM to still infer some plain text to return for the innerText element property, and <wbr> may become a SHY format control (should it?) -------------- next part -------------- An HTML attachment was scrubbed... URL: From jkorpela at cs.tut.fi Mon Feb 10 14:41:37 2014 From: jkorpela at cs.tut.fi (Jukka K. Korpela) Date: Mon, 10 Feb 2014 22:41:37 +0200 Subject: proposal for new character 'soft/preferred line break' In-Reply-To: <20140210194903.07ad1df0@JRWUBU2> References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de> <52F88581.5080501@cs.tut.fi> <20140210194903.07ad1df0@JRWUBU2> Message-ID: <52F93981.2000308@cs.tut.fi> 2014-02-10 21:49, Richard Wordingham wrote: > U+200B has the distinct advantage of being a character, and therefore > readily travelling with the words it separates. Granted, but it's still a character that the rendering software needs to know and support in order to have the desired effect. As I mentioned, some legacy software try to render it as a graphic character, with poor results. In contrast, in HTML, the <wbr> tag is safe in the sense that when it does not work (some modern browsers have oddities in this respect), it gets ignored. > It's quite a useful > character when dealing with inadequate or non-existent dictionaries for > languages that don't have visible separators between words or, > depending on line-breaking practice, syllables. That is correct. Yet, it needs to be supported by the relevant software. Yucca From jkorpela at cs.tut.fi Tue Feb 11 01:25:35 2014 From: jkorpela at cs.tut.fi (Jukka K.
Korpela) Date: Tue, 11 Feb 2014 09:25:35 +0200 Subject: proposal for new character 'soft/preferred line break' In-Reply-To: References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de> <52F88581.5080501@cs.tut.fi> Message-ID: <52F9D06F.4060604@cs.tut.fi> 2014-02-10 22:30, Philippe Verdy wrote: > No, I make no confusion: <wbr> is a formatting HTML element, SHY (or > ­ in HTML syntax for the defined entity) is a character. Both play > equivalent roles in HTML, Not at all. > except that ­ has a defined behavior to > insert a hyphen at end of broken lines, where <wbr> would adopt a > language-dependent behavior (not all languages use hyphens at end of > lines to mark words that have been split by breaking lines). Quite the opposite. The effect of SOFT HYPHEN is expected to be language-dependent (though it hardly is in web browsers): http://www.unicode.org/reports/tr14/#SoftHyphen Normally, it causes hyphenation, with a hyphen inserted when adequate. This is quite different from a direct break opportunity, which is what <wbr> means in browser practice, being standardized in HTML5: http://www.w3.org/TR/html5/text-level-semantics.html#the-wbr-element There is nothing language-dependent about <wbr>, in theory or in practice. It is never expected to result in the addition of a hyphen, and it never does that. When Netscape invented <wbr> long ago, they chose a cryptic name, which, when expanded (to "word break"), has seriously misled many people into thinking that it is for suggesting breaks inside human-language words. Its primary use is for breaks inside strings that are *not* words. (Exceptionally, it sometimes has use inside words: you might wish to write e.g. tax-free, but there the point is that a simple string break is OK, since the "-" is part of the word and no hyphen needs to be added when the word is divided.)
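The behavioural difference debated in this thread — U+00AD produces a hyphen when a break is actually taken there, while U+200B (like <wbr>) allows a bare break — can be sketched with a toy greedy line breaker. This is illustrative only (real renderers follow UAX #14, and the toy model ignores the hyphen's own width when closing a line):

```python
SHY, ZWSP = "\u00ad", "\u200b"

def wrap(text, width):
    """Toy greedy wrapper breaking at spaces, SHY and ZWSP.

    A break taken at SHY appends a hyphen to the closed line (like
    soft hyphen); a break at ZWSP or a space appends nothing (like
    <wbr>/U+200B).  Unbroken SHY and ZWSP render as nothing."""
    tokens, bounds, buf = [], [], ""
    for ch in text:
        if ch in (" ", SHY, ZWSP):
            tokens.append(buf)
            buf = ""
            # (joiner when not breaking here, suffix when breaking here)
            bounds.append((" ", "") if ch == " " else
                          ("", "-") if ch == SHY else ("", ""))
        else:
            buf += ch
    tokens.append(buf)

    lines, line = [], tokens[0]
    for (joiner, suffix), tok in zip(bounds, tokens[1:]):
        if len(line) + len(joiner) + len(tok) <= width:
            line += joiner + tok     # no break: keep space, hide SHY/ZWSP
        else:
            lines.append(line + suffix)  # break: SHY leaves a hyphen behind
            line = tok
    lines.append(line)
    return lines
```

With width 6, `"very\u00adlong word"` wraps to a hyphenated first line, while the same text with U+200B wraps without one.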
Yucca From emuller at adobe.com Wed Feb 12 10:46:58 2014 From: emuller at adobe.com (Eric Muller) Date: Wed, 12 Feb 2014 08:46:58 -0800 Subject: Transforming BidiTest.txt to the format of BidiCharacterTest.txt Message-ID: <52FBA582.508@adobe.com> Does anybody have a program that transforms the UCD file BidiTest.txt to the format of BidiCharacterTest.txt, and that they are willing to share? Thanks, Eric. From ken.whistler at sap.com Wed Feb 12 13:09:33 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Wed, 12 Feb 2014 19:09:33 +0000 Subject: Transforming BidiTest.txt to the format of BidiCharacterTest.txt In-Reply-To: <52FBA582.508@adobe.com> References: <52FBA582.508@adobe.com> Message-ID: Eric, The C version of the bidiref code does that, in part. See the function br_ParseFileFormatB in brinput.c. http://www.unicode.org/Public/PROGRAMS/BidiReferenceC/6.3.0/ It doesn't actually *transform* the BidiTest.txt file to output the other format, but it parses the input and then constructs calls into the bidi testing API in the same format used when it parses BidiCharacterTest.txt. So you could adapt that code, if you want, to writing out lines in the format of BidiCharacterTest.txt. The main addition you would have to make would be to add a table of characters exemplifying each of the bidi classes, so you could map the bidi class values from BidiTest.txt back to actual code points to store in BidiCharacterTest.txt format. --Ken > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Eric > Muller > Sent: Wednesday, February 12, 2014 8:47 AM > To: unicode at unicode.org > Subject: Transforming BidiTest.txt to the format of BidiCharacterTest.txt > > Does anybody have a program that transforms the UCD file BidiTest.txt to > the format of BidiCharacterTest.txt, and that they are willing to share? > > Thanks, > Eric. 
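The transformation Ken describes — mapping each Bidi_Class value in BidiTest.txt back to an actual code point for the BidiCharacterTest.txt format — hinges on a table of exemplar characters. A minimal sketch in Python; the exemplar choices below are my own (any character of the right class would do) and are not the table bidiref or ICU actually uses:

```python
# One exemplar code point per Bidi_Class value used in BidiTest.txt.
EXEMPLARS = {
    "L": 0x0061,    # 'a'
    "R": 0x05D0,    # HEBREW LETTER ALEF
    "AL": 0x0627,   # ARABIC LETTER ALEF
    "EN": 0x0030,   # '0'
    "AN": 0x0660,   # ARABIC-INDIC DIGIT ZERO
    "ES": 0x002B,   # '+'
    "ET": 0x0024,   # '$'
    "CS": 0x002C,   # ','
    "NSM": 0x0300,  # COMBINING GRAVE ACCENT
    "BN": 0x00AD,   # SOFT HYPHEN
    "B": 0x2029,    # PARAGRAPH SEPARATOR
    "S": 0x0009,    # TAB
    "WS": 0x0020,   # SPACE
    "ON": 0x0021,   # '!'
    "LRE": 0x202A, "RLE": 0x202B, "PDF": 0x202C,
    "LRO": 0x202D, "RLO": 0x202E,
    "LRI": 0x2066, "RLI": 0x2067, "FSI": 0x2068, "PDI": 0x2069,
}

def classes_to_hex(class_seq):
    """Turn a BidiTest.txt class sequence like 'L R EN' into the
    hex-code-point field used by BidiCharacterTest.txt lines."""
    return " ".join(f"{EXEMPLARS[c]:04X}" for c in class_seq.split())
```

For example, `classes_to_hex("L R EN")` yields `"0061 05D0 0030"`. A full converter would still have to carry over the paragraph-direction bitset and recompute the resolved levels and visual order fields.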
From markus.icu at gmail.com Wed Feb 12 13:46:03 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Wed, 12 Feb 2014 11:46:03 -0800 Subject: Transforming BidiTest.txt to the format of BidiCharacterTest.txt In-Reply-To: References: <52FBA582.508@adobe.com> Message-ID: On Wed, Feb 12, 2014 at 11:09 AM, Whistler, Ken wrote: > Eric, > > The C version of the bidiref code does that, in part. > > See the function br_ParseFileFormatB in brinput.c. > > http://www.unicode.org/Public/PROGRAMS/BidiReferenceC/6.3.0/ > > It doesn't actually *transform* the BidiTest.txt file to output the other > format, but it > parses the input and then constructs calls into the bidi testing API in > the same format > used when it parses BidiCharacterTest.txt. So you could adapt that code, > if you > want, to writing out lines in the format of BidiCharacterTest.txt. The > main addition you would have to make would be to add a table of > characters exemplifying each of the bidi classes, so you could map > the bidi class values from BidiTest.txt back to actual code points to > store in BidiCharacterTest.txt format. > ICU also has test code that parses both files, but it does not transform either one into the format of the other. We have both C++ and Java, and I can send you URLs if you are interested. There are also sample characters per Bidi_Class. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.fynn at gmail.com Thu Feb 13 06:30:44 2014 From: chris.fynn at gmail.com (Christopher Fynn) Date: Thu, 13 Feb 2014 18:30:44 +0600 Subject: proposal for new character 'soft/preferred line break' In-Reply-To: <52F2B0B2.9090209@shadowtec.de> References: <52F168C2.7090401@shadowtec.de> <52F2B0B2.9090209@shadowtec.de> Message-ID: On 06/02/2014, Rhavin Grobert wrote: > No, you did not understand. is like ­ its below the whitespace > level: if the line is to long, it breaks a word: Not really alike. 
<wbr> is an HTML tag while ­ is a named reference for a character. Unicode has nothing to do with <wbr> as it is higher-level markup. - C From everson at evertype.com Fri Feb 14 07:57:12 2014 From: everson at evertype.com (Michael Everson) Date: Fri, 14 Feb 2014 07:57:12 -0600 Subject: Sorting notation Message-ID: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> So if A <<< a < B <<< b < C <<< c means that A and a sort together before B and b and that before C and c, what is the notation for where A and ? and a and ? sort together before B and ? and b and ? and then C and ? and c and ?? Michael Everson * http://www.evertype.com/ From verdy_p at wanadoo.fr Fri Feb 14 08:38:10 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 14 Feb 2014 15:38:10 +0100 Subject: Sorting notation In-Reply-To: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> Message-ID: A <<< a << ? <<< ? < B <<< b << ? <<< ? < C <<< c << ? <<< ? 2014-02-14 14:57 GMT+01:00 Michael Everson : > So if > > A <<< a < B <<< b < C <<< c > > means that A and a sort together before B and b and that before C and c, > what is the notation for where A and ? and a and ? sort together before B > and ? and b and ? and then C and ? and c and ?? > > Michael Everson * http://www.evertype.com/ > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri Feb 14 08:41:12 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 14 Feb 2014 15:41:12 +0100 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> Message-ID: or if you don't want to include case distinctions at third level, but only sorting the groups for the same base letter on the secondary level: A << a << ? << ? < B << b << ? << ? < C << c << ?
<< ? A << ? << a << ? < B << ? << b << ? < C << ? << c << ? 2014-02-14 15:38 GMT+01:00 Philippe Verdy : > A <<< a << ? <<< ? < B <<< b << ? <<< ? < C <<< c << ? <<< ? > > > 2014-02-14 14:57 GMT+01:00 Michael Everson : > > So if >> >> A <<< a < B <<< b < C <<< c >> >> means that A and a sort together before B and b and that before C and c, >> what is the notation for where A and ? and a and ? sort together before B >> and ? and b and ? and then C and ? and c and ?? >> >> Michael Everson * http://www.evertype.com/ >> >> >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Fri Feb 14 10:26:07 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Fri, 14 Feb 2014 08:26:07 -0800 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> Message-ID: You need a reset point to say where in the UCA/CLDR universe this rule chain goes. http://www.unicode.org/reports/tr35/tr35-collation.html#Orderings The default collation puts lowercase first. Normally you reset to a lowercase character and tailor variations to that, otherwise the few characters you tailor are inconsistent with the rest of Unicode. Implementations like ICU provide parametric settings (no need for rules) to specify uppercase first. http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options You should only reorder characters that the default order does not already have where you need them. For example, reset at each base letter, unless you want to reorder them relative to each other's default order. http://www.unicode.org/charts/collation/ See also http://cldr.unicode.org/index/cldr-spec/collation-guidelines especially about "Minimal Rules". 
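The strength notation being discussed (`<` primary, `<<` secondary, `<<<` tertiary) can be illustrated with a toy three-level sort key in Python. This is not UCA or ICU — the weights are made up — and it puts uppercase first to match Michael's example, whereas, as noted above, the CLDR default is lowercase first:

```python
import unicodedata

def sort_key(word):
    """Toy three-level collation key:
    primary   = base letter        (A < B < C)
    secondary = plain before accented  (a << ä)
    tertiary  = uppercase before lowercase  (A <<< a)"""
    key = []
    for ch in word:
        decomp = unicodedata.normalize("NFD", ch)
        base, accents = decomp[0], decomp[1:]
        key.append((base.lower(),               # primary
                    1 if accents else 0,        # secondary
                    0 if ch.isupper() else 1))  # tertiary
    return key

print(sorted(["b", "ä", "Ä", "a", "B", "A"], key=sort_key))
# prints ['A', 'a', 'Ä', 'ä', 'B', 'b']
```

That output is exactly the ordering written as A <<< a << Ä <<< ä < B <<< b in the rule notation.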
You can try out collation rules and settings at http://demo.icu-project.org/icu-bin/locexp?_=root&d_=en&x=col Best regards, markus -- Google Internationalization Engineering -------------- next part -------------- An HTML attachment was scrubbed... URL: From Perka at muchomail.com Fri Feb 14 04:37:19 2014 From: Perka at muchomail.com (=?utf-8?B?0JrRgNGD0YjQtdCy0ZnQsNC90LjQvQ==?=) Date: Fri, 14 Feb 2014 02:37:19 -0800 Subject: Unicode organization is still anti-Serbian and anti-Macedonian Message-ID: <20140214023719.A856914@m0048141.ppops.net> There is still a problem with letters б, г, д, п, т in italic, and б in regular mode. OpenType support is still very weak (Firefox, LibreOffice on Linux, Adobe's software and that's it, practically). It's also disappointing that Microsoft is still incapable of implementing and forcing this support on system level. Also, there are Serbian/Macedonian cyrillic vowels with accents (total: 7 types × 6 possible letters = 42 combinations) where the majority of them don't exist precomposed, and it is impossible to enter them. A lot of nowadays' fonts (even commercial) still have issues with accents. In Unicode, Latin scripts are always favored, which is simply not fair to the rest of the world. They have space to put glyphs for dominoes, a lot of dead languages etc. but they don't have space for real-world issues. I want Unicode organization to change their politics and pay attention to small countries like Serbia and Macedonia. We have real-world problems. Thank you. If you think these are biases of me, I say: a real-world problem for us. If you think changes would invalidate existing texts, I say: no, because *real* Serbian/Macedonian support still doesn't exist! And we can develop converters in the future, so I don't see any "huge cost" problems... -- ??????????? ???? _____________________________________________________________ The Free Email with so much more!
=====> http://www.MuchoMail.com <===== From mark at macchiato.com Fri Feb 14 15:00:45 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJU=?=) Date: Fri, 14 Feb 2014 13:00:45 -0800 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140214023719.A856914@m0048141.ppops.net> References: <20140214023719.A856914@m0048141.ppops.net> Message-ID: Unicode is not anti-Serbian or Macedonian. The exact level of Unicode support will depend on your operating system and font choice. For example, on the Mac there are reasonable results with arbitrary accents. Here are examples with and q? Q? Here is an image, in case your emailer or OS doesn't handle these well. [image: Inline image 1] See also http://www.unicode.org/standard/where/ As to the italic, that also depends on the font support on your system. Mark *? Il meglio ? l?inimico del bene ?* On Fri, Feb 14, 2014 at 2:37 AM, ??????????? wrote: > There is still problem with letters ????? in italic, and ? in regular mode. > > OpenType support is still very weak (Firefox, LibreOffice on Linux, > Adobe's software and that's it, practically). It's also disappointing that > Microsoft is still incapable to implement and force this support on system > level. > > Also, there are Serbian/Macedonian cyrillic vowels with accents (total: 7 > types ? 6 possible letters = 42 combinations) where majority of them don't > exist precomposed, and is impossible to enter them. A lot of nowadays' > fonts (even commercial) still have issues with accents. > > In Unicode, Latin scripts are always favored, which is simply not fair to > the rest of the world. They have space to put glyphs for dominoes, a lot of > dead languages etc. but they don't have space for real-world issues. > > I want Unicode organization to change their politics and pay attention to > small countries like Serbia and Macedonia. We have real-world problems. > Thank you. > > If you think these are biases of me, I say ? 
real-world problem for us. > If you think changes would invalidate existing texts, I say ? no, because > *real* Serbian/Macedonian support still doesn't exist! And we can develop > converters in the future, so I don't see any "huge cost" problems... > > -- > ??????????? ???? > > _____________________________________________________________ > The Free Email with so much more! > =====> http://www.MuchoMail.com <===== > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Screen Shot 2014-02-14 at 12.56.52.png Type: image/png Size: 7794 bytes Desc: not available URL: From ishida at w3.org Sat Feb 15 05:45:24 2014 From: ishida at w3.org (Richard Ishida) Date: Sat, 15 Feb 2014 11:45:24 +0000 Subject: [counter-styles] i18n-ISSUE-339: Should Japanese spec styles match implementations or vice versa? In-Reply-To: References: <52FE8762.3060504@w3.org> Message-ID: <52FF5354.7010903@w3.org> [cc public-i18n-cjk and unicode at unicode.org to get some more eyes on this] I don't think you revised the algorithm either. I think this discrepancy has been around for a long time. As Xidorn points out, we're talking here about characters that, yes, exist in the kana set, but that are not often used or not often used in this context. That said, this whole spec is about being able to customise these lists however you want. So in a sense the list of characters described in the spec is a kind of default. So I'm wondering whether, in that case, it's best to just document the exisiting implementations, and allow people to modify the list if they want. Unless you have a list of over 44 items you won't meet the problem anyway. Just thinking out loud, really. RI On 14/02/2014 23:18, Tab Atkins Jr. 
wrote: > On Fri, Feb 14, 2014 at 1:15 PM, Richard Ishida wrote: >> 6.2 Alphabetic: lower-alpha, lower-latin, upper-alpha, upper-latin, >> lower-greek, hiragana, hiragana-iroha, katakana, katakana-iroha >> http://dev.w3.org/csswg/css-counter-styles/#simple-alphabetic >> >> The hiragana, katakana, hiragana-iroha, and katakana-iroha seem to be >> implemented in the same way in Firefox, Chrome, Safari, and now Opera. The >> implementation differs from the spec only by the addition of one or two >> characters to the basic set. >> >> Should we change the spec to align with the implementations? >> >> For more information see the test results at >> http://www.w3.org/International/tests/repository/css3-counter-styles/predefined-styles/results-cstyles#simplealpha > > It's weird that the spec differs from implementations. I don't > *think* I revised those algorithms at all. > > I'd prefer to go ahead and match implementations unless they're totally off. > > ~TJ > From otto.stolz at uni-konstanz.de Sat Feb 15 11:01:06 2014 From: otto.stolz at uni-konstanz.de (Otto Stolz) Date: Sat, 15 Feb 2014 18:01:06 +0100 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140214023719.A856914@m0048141.ppops.net> References: <20140214023719.A856914@m0048141.ppops.net> Message-ID: <52FF9D52.2000403@uni-konstanz.de> Hello, Am 14.02.2014 11:37, schrieb ???????????: > There is still problem with letters ????? in italic, and ? in regular mode. As has been said, already, in this thread, this is a mere font issue: you have to use a particular font in order to display those italic letters, in the Serbian and Macedonian style. One example: The ?Gentium Plus? font from SIL International, available from can be configured to display the Serbian/Macedonian style italics rather than the glyphs used elsewhere. If this configuration is too cumbersome for you, feel free to ask me privately, for a copy of the font, configured for Serbian/Macedonian. 
I can send you that copy, without any obligation to maintain it or to adapt forth- coming versions. Best wishes, Otto Stolz From richard.wordingham at ntlworld.com Sat Feb 15 12:25:51 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 15 Feb 2014 18:25:51 +0000 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140214023719.A856914@m0048141.ppops.net> References: <20140214023719.A856914@m0048141.ppops.net> Message-ID: <20140215182551.7535808d@JRWUBU2> On Fri, 14 Feb 2014 02:37:19 -0800 ??????????? wrote: > There is still problem with letters ????? in italic, and ? in regular > mode. > OpenType support is still very weak (Firefox, LibreOffice on Linux, > Adobe's software and that's it, practically). It's also disappointing > that Microsoft is still incapable to implement and force this support > on system level. I'll be interested to know what stops Gentium Plus, suggested by Otto Stolz, from working on, say, Windows 7. I'm very sure the support is there at a system level - the problem (if any) is more likely to be at an application level. Does LibreOffice on Windows not support Serbian italics? > Also, there are Serbian/Macedonian cyrillic vowels with accents > (total: 7 types ? 6 possible letters = 42 combinations) where > majority of them don't exist precomposed, and is impossible to enter > them. A lot of nowadays' fonts (even commercial) still have issues > with accents. Should these combinations be well known? They're not listed in the CLDR exemplar characters for Serbian. As for input, I would suggest that the solution for the simpler keyboarding techniques is to enter them as base character and then dead key. Dead keys could be available for more advanced input systems, e.g. ibus on Linux and 'text services' on Windows (Vista and above, I believe). > In Unicode, Latin scripts are always favored, which is simply not > fair to the rest of the world. 
They have space to put glyphs for > dominoes, a lot of dead languages etc. but they don't have space for > real-world issues. Precomposed characters are an unpleasant feature in Unicode. I am curious as to how the Serbian combinations escaped notice. When are they actually used? Each precomposed character adds a small processing overhead to an extremely large number of computers, not just to the computers that actually use it. By contrast, dominoes can be ignored when no-one using the computer is using the characters for them. Richard. From jsbien at mimuw.edu.pl Sat Feb 15 12:39:59 2014 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Sat, 15 Feb 2014 19:39:59 +0100 Subject: precomposed characters (was: Unicode organization is still anti-Serbian and anti-Macedonian) In-Reply-To: <20140215182551.7535808d@JRWUBU2> References: <20140214023719.A856914@m0048141.ppops.net> <20140215182551.7535808d@JRWUBU2> Message-ID: <20140215193959.11167l0awd3dxlkv@mail.mimuw.edu.pl> Quote/Cytat - Richard Wordingham (Sat 15 Feb 2014 07:25:51 PM CET): > Each precomposed character adds a small processing > overhead to an extremely large number of computers, not just to the > computers that actually use it. This is a very strong claim. Would be so kind to elaborate? Best regards Janus -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From gansmann at uni-bonn.de Sat Feb 15 14:15:51 2014 From: gansmann at uni-bonn.de (Gerrit Ansmann) Date: Sat, 15 Feb 2014 21:15:51 +0100 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140214023719.A856914@m0048141.ppops.net> References: <20140214023719.A856914@m0048141.ppops.net> Message-ID: On Fri, 14 Feb 2014 11:37:19 +0100, ??????????? wrote: > There is still problem with letters ????? in italic, and ? 
in regular mode. > > OpenType support is still very weak (Firefox, LibreOffice on Linux, Adobe's software and that's it, practically). It's also disappointing that Microsoft is still incapable of implementing and forcing this support at system level. > > I want the Unicode organization to change their politics and pay attention to small countries like Serbia and Macedonia. We have real-world problems. Thank you. Just to make sure I am not arguing from a wrong premise: from what I gathered in quick research, the problem is that the upright letter б and the italic letters б, г, д, п and т have a different shape in Serbian and Macedonian Cyrillic than in other flavours of Cyrillic. First of all, the lack of support of such features by font creators, and the lack of support of font standards by a certain software company (which ironically happens to have created that standard), are hardly Unicode's fault. It's like complaining to your government that your favorite merchant still won't sell bananas, though bananas were legalised twenty years ago. But most importantly: encoding these characters won't do your goal any good, for many reasons: • Even if Unicode did include these characters today, it would take a long time for creators of fonts and other software to catch up – just consider how slowly support of OpenType (or other intelligent font standards) is growing, despite the fact that it concerns a lot of languages and not just two. • You cannot control every old text to be converted. However, for many such texts you can control with which font or font technologies they are rendered. The support of working solutions for the latter is likely to grow even slower if your request were granted. • People will be confused as to which characters they should use. • In the current situation, if a font does support Cyrillic but not the Serbian and Macedonian specialties, there is a decent if not identical fallback in many cases.
If the new characters were used, however, fonts that support Cyrillic but not the new characters (which especially includes every font that exists today) would not even render the upright versions of the new Serbian/Macedonian г, д, п and т correctly, even though they do contain these glyphs. • If you consider this only a temporary makeshift solution to the problem, it works against other temporary solutions (see below). Actually, the only advantage I see in encoding these letters separately is that it makes type designers aware of these specialties of Serbian and Macedonian – but neither is this the purpose of Unicode, nor is it the best way to do so, and moreover it does not compensate for the aforementioned disadvantages. Some suggestions on how to better invest your resources and energy on this issue: • Make type designers aware of this. • Enforce support of OpenType (or other intelligent font standards) or work on it yourself. (In general, it would be good if people stopped working on makeshift solutions for problems specific to their language, or complaining about their lack of support, and started working on the support of global standards that will not only solve their problem but benefit many other languages too.) • Improve open-source fonts by adding the special glyphs yourself. • As a temporary solution: request and advertise versions of important fonts that default to the Serbian/Macedonian versions of said characters instead of the others. Or, for open-source fonts: make those versions yourself. (See also Otto Stolz's answer.) > In Unicode, Latin scripts are always favored, which is simply not fair to the rest of the world. They have space to put glyphs for dominoes, a lot of dead languages etc. but they don't have space for real-world issues. It's somewhat amazing how you complain about Unicode's focus on Latin script and its encoding of things that are not Latin in one line. > Also, there are Serbian/Macedonian cyrillic vowels with accents (total: 7 types ×
6 possible letters = 42 combinations) where the majority of them don't exist precomposed, and it is impossible to enter them. A lot of today's fonts (even commercial) still have issues with accents. At least for the 6 accents and 5 vowels I found, using combining diacritical marks should work very well even without OpenType, given that the font supports these characters (and you can bet that a font which does not even support this would not support your requested precomposed glyphs). From richard.wordingham at ntlworld.com Sat Feb 15 17:33:09 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 15 Feb 2014 23:33:09 +0000 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140215182551.7535808d@JRWUBU2> References: <20140214023719.A856914@m0048141.ppops.net> <20140215182551.7535808d@JRWUBU2> Message-ID: <20140215233309.282203b0@JRWUBU2> On Sat, 15 Feb 2014 18:25:51 +0000 Richard Wordingham wrote: > On Fri, 14 Feb 2014 02:37:19 -0800 > Крушевљанин wrote: > > > There is still a problem with the letters б, г, д, п, т in italic, and б in > > regular mode. > > > OpenType support is still very weak (Firefox, LibreOffice on Linux, > > Adobe's software and that's it, practically). It's also > > disappointing that Microsoft is still incapable of implementing and > > forcing this support at system level. > > I'll be interested to know what stops Gentium Plus, suggested by Otto > Stolz, from working on, say, Windows 7. I do seem to have found a problem, though I find it hard to believe. When I looked for the OpenType language tag for Serbian at http://www.microsoft.com/typography/otspec/languagetags.htm , it wasn't there! Now I'm puzzled as to how any flavour of OpenType is supposed to automatically switch between Russian and Serbian italics as such. Gentium Plus (italic) has the Serbian italic glyphs, but via the aalt feature, which I don't think is what one would want for most uses.
> I'm very sure the support is there at a system level. It seems I was wrong! Richard. From richard.wordingham at ntlworld.com Sat Feb 15 19:55:58 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 16 Feb 2014 01:55:58 +0000 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: References: Message-ID: <20140216015558.53e9deac@JRWUBU2> On Sat, 15 Feb 2014 17:43:05 -0800 "Steven R. Loomis" wrote: > Richard, SRB and MKD respectively are both in the page you linked to. Good. I made the mistake of thinking the list was sorted by English language name, rather than by tag. Richard. > >I do seem to have found a problem, though I find it hard to believe. > >When I looked for the OpenType language tag for Serbian at > >http://www.microsoft.com/typography/otspec/languagetags.htm , it > >wasn't there! From richard.wordingham at ntlworld.com Sun Feb 16 07:13:29 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 16 Feb 2014 13:13:29 +0000 Subject: precomposed characters (was: Unicode organization is still anti-Serbian and anti-Macedonian) In-Reply-To: <20140215193959.11167l0awd3dxlkv@mail.mimuw.edu.pl> References: <20140214023719.A856914@m0048141.ppops.net> <20140215182551.7535808d@JRWUBU2> <20140215193959.11167l0awd3dxlkv@mail.mimuw.edu.pl> Message-ID: <20140216131329.4a9daa87@JRWUBU2> On Sat, 15 Feb 2014 19:39:59 +0100 "Janusz S. Bien" wrote: > Quote/Cytat - Richard Wordingham > (Sat 15 Feb 2014 07:25:51 PM CET): > > Each precomposed character adds a small processing > > overhead to an extremely large number of computers, not just to the > > computers that actually use it. > This is a very strong claim. Would you be so kind as to elaborate?
The following need to be stored simply because the character has been assigned: name (typically for character pick-lists); script (typically for breaking text runs by script); casing (upper/lower/titlecase); collation properties (not strictly necessary). There are many other properties, but many of them will often be covered by default rules and may not need to be stored explicitly. The only likely subsetting options I can think of would be to not support the supplementary planes or to not support CJK characters. This data will be moved when an operating system is installed, and the files are liable to be moved or replaced at other times. I will concede that it is possible that this information may not need to be moved from disk to memory - the data is likely to be ordered by codepoint, and if nearby codepoints are never used either, it will not need to be loaded. Some data files are mapped to memory, but unfortunately I can't comment on the processing overhead of increasing their size if the additional data is not accessed. The operation that will be most significantly affected is composition. I am assuming that composition information will be used even in the presence of a composition exclusion, e.g. to select the best glyph from a font. (One could optimise this away by potentially rendering the canonical decomposition of a precomposed character differently to the precomposed character.) The composition data, consisting of the pairs of characters to which precomposed characters decompose, will be stored in codepoint order of the decomposition. The net effect of this is that the existence of unused composition data will increase the number of cache misses, and thus increase the amount of processing required. If there is not a separate store of compositions not subject to composition exclusion, then the same effect will occur whenever a composition happens as part of the transform of a character string to NFC or NFKC, e.g.
in the processing of a non-ASCII internet domain name. If data access is not carefully optimised, there will be many more occasions when unused decompositions will nevertheless add to the processing load. Richard. From Perka at muchomail.com Sun Feb 16 05:33:56 2014 From: Perka at muchomail.com (Крушевљанин) Date: Sun, 16 Feb 2014 03:33:56 -0800 Subject: Unicode organization is still anti-Serbian and anti-Macedonian Message-ID: <20140216033356.31E84EC7@m0005299.ppops.net> O-kay, I got several on-list and off-list messages, so I'll compile some replies here. I receive this mailing list in daily digest, so please excuse my style of replying/commenting. Please read this compilation minutely and don't take everything as insult. People, I am perfectly aware of the existence of, and capable of using, fonts like: - from Microsoft (Windows Vista and above): Calibri, Cambria, Candara, Consolas (please make the upper part (macron) of italic 'г' longer, it looks stupid now), Constantia, Corbel, Sitka (Gabriola has the potential) - from Adobe: Arno Pro, Baskerville Cyrillic, Excelsior LT, Garamond Premier Pro, Helvetica Inserat, Minion Pro, Myriad (currently misses Serbian 'б'), Times Ten, Warnock Pro (Sava Pro also fits for this purpose) - DejaVu family (Sans, Serif, Mono) - GNU FreeFont family (Sans, Serif, Mono) - Ubuntu family (Ubuntu, Mono) - other useful fonts: Gentium Plus (SIL Graphite technology), EB Garamond. The Linux Libertine/Liberation/Biolinum families currently have severe issues and/or missing glyphs. And, font developers: please forgive me if I missed some good font for Serbian/Macedonian purposes! I would like Microsoft to alter and provide Serbian/Macedonian support to the following old (but unfortunately still used as default in many modern programs) fonts: Arial, Comic Sans (please provide Serbian 'б'
and fix italic 'г'), Courier New (please provide Serbian 'б'), Georgia, Impact (please provide Serbian 'б'), Tahoma (please provide Serbian 'б'), Times New Roman, Verdana (please provide Serbian 'б'). Adobe, Microsoft and others: please also note that, to cover both languages, in OpenType fonts you need to place both locale tags, language SRB and language MKD. (SRB for Serbian, MKD for Macedonian.) Macedonian cyrillic incorporates б, г, д, п, т from Serbian cyrillic, plus they have a separate character '?', and the italic glyph for that character rarely looks correct (GNU FreeSerif and EB Garamond have it best). What is interesting, I know next to nothing about Apple. (Probably because Macintosh computers are expensive as hell.) I have read something about AAT technology, but what about their fonts? Are there Serbian/Macedonian glyphs? I saw one old screenshot of some Serbian Wikipedia page viewed from MacOS (and Safari?, I don't know the exact details) but I didn't see proper glyphs. * * * The Unicode problems that small countries (like Serbia and Macedonia) have are SEVERE; they can not be called "a mere font issue". Please do not insult my intelligence. This is because the Serbian/Macedonian language and our cyrillic script are not used in the south Balkans only. People from all around the world communicate, and we all have different operating systems, software, fonts... When folks from America, Germany, Russia, China, Japan... exchange mail, documents, textual information on Wikipedia (even on Wikipedia, information is not always and everywhere tagged) with folks in Serbia and Macedonia, they all encounter problems – they get Russian cyrillic instead of Serbian/Macedonian. People, do you realize that proper glyphs are needed everywhere and every time, CONSTANTLY, even when an American ordinary user chats with a German ordinary user about the Serbian language, on different OS-es, textual e-mail/chat clients, GUI (Graphical User Interface) forms...
We must NOT rely on OpenType and similar technologies for this! Serbia and Macedonia became "second-class citizens", systematically discriminated against in the computer world! That's why I want Unicode to finally fulfill this requirement. To make Serbia and Macedonia "first-class citizens"! And you can not use "Private Use Areas", that's not reliable. Please read the further discussion below with an employee from Microsoft. And note that Serbian/Macedonian cyrillic is not just "preferable"; that is not the appropriate term. The correct glyphs are REQUIRED – we can not accept Russian glyphs. Especially when in Russian the small italic 'п' and 'т' look *exactly* like latin 'n' and 'm'! That's nonsense for Serbian/Macedonian users (because we also use latin). Furthermore, Serbian small 'б' is visually better than the Russian counterpart. Sure, this is my personal opinion, and I say it because the Russian version looks like the digit 6 and the Serbian doesn't (or, at least, not at very low sizes)! So, Serbian small 'б' can enter Unicode as an authentic Serbian letter. It resembles Greek gamma, but it's not exactly the same – the pronunciation is different and the upper part of the glyph design must be slightly altered, and the result would be fine. (And all Serbian glyphs are visually better than Russian. Yes, I claim it. The Russian "curvature" italic 'г', for example, is *extremely ugly* to me. The Serbian "i-macron" style is better. And the longer part of cursive/handwritten 'д' always goes below, like latin 'g' in some designs, not above.) * * * Technologies like OpenType, SIL Graphite and AAT are good. People want stylistic alternate shapes, ornaments etc. But these technologies can not replace Unicode. Unicode comes first, and it obviously shows that this organization must do internal, system-level support for Serbian/Macedonian issues. From the disappointing and incapable company called Microsoft, heh heh, one employee asked me to further clarify implementation and system-level OpenType support.
Well, I'm not a C/C++ programmer (man, I wish I were!), but for non-compliant software can't you somehow intercept all textual communication and replace Russian glyphs with Serbian? It is crucially important to apply this behaviour on all Windows GUI forms (native API, .NET framework etc.), system-wide. And why only in Internet Explorer 11 (currently via CSS 3 – can't you force this in settings?), and Office 2010 and above (Word only? Why not Excel, Access... man, we need it EVERYWHERE). Please continue reading the following. Mozilla Firefox has great support for resolving Serbian/Macedonian issues. The OpenType locale is supported, rendering is correct when you have an HTML attribute like lang="sr" and, for example, you can entirely disable the page author's choice of fonts, for any writing script Firefox supports. To compensate for bad or incomplete support, I use that powerful feature all the time, and I wish other manufacturers like Google, Opera etc. would do the same in their products. (Just implement the same as Firefox did – but then again, it's not an almighty feature.) LibreOffice also does a nice job, but currently under GNU/Linux only. (I talked to one developer, from Red Hat Software I believe, and the problem is the shaping/rendering engine they currently use for MS Windows – they should change it and adopt a better one like Pango, HarfBuzz...) It must be said that GNU/Linux in general stands much better than MS Windows in this regard. (If I recall correctly, in Ubuntu you can have OpenType locale support from the very beginning/installation.) So, Microsoft, start modelling your OT support on the one from GNU/Linux, make a good programming library and abandon the old useless stuff. Can DirectWrite help in this regard? So, I would like EVERY piece of software to have great OpenType/Graphite/AAT support like Firefox and GNU/Linux, but unfortunately, we are still far away from that "nirvana". (Conclusion: we are far away because of the Unicode organization and Microsoft, in the first place.)
* * * About the further support with accents. I was asked to provide "a reference" for the 42 combinations I mentioned. (The biggest reference is that I'm a Serb, heh heh, and I have modern local scientific books for proper spelling.) In Serbian (and Macedonian can not be much different in this regard) there are 5 vowels (а, е, и, о, у), and in some linguistic cases 'р' can be considered a vowel (all of these characters are cyrillic, not latin). So, that's 6 of them. In Serbian there are usually 4 accents, but for *full professional* linguistic purposes, 7 of them (grave, double grave, acute, breve, inverted breve, circumflex and macron). I inform you that I used MS Keyboard Layout Creator v1.4 and created an excellent keyboard layout for myself, but *most* fonts nowadays, even from Microsoft and Adobe, show their ugly behaviour in this regard. I think the DejaVu family stands on firm ground here, Gentium Plus too, and I also heard the SIL Doulos font has been created with professional linguistics in mind... So, the mathematics is: 6 × 7 = 42 combinations of *cyrillic* accented letters. Hmmm, that's for lowercase – do we need capital versions as well? Yikes, that makes 84 glyphs! Still, the best option is to have them precomposed, don't you agree, my friends? Font developers, please make *perfect* support with combining diacritics, and, just to be sure, draw these 84 characters precomposed now, mark them eventually as (Serbian) accented cyrillic, make excellent kerning, and I would buy such a precious font (with Serbian б, г, д, п, т, of course). Who knows, you then might be of interest to scientific institutions, government... and not just Serbian ones. * * * You know what? I'm not that young and incompetent a computer user. I've been struggling with these notorious issues for more than 15 years. It just happened that I expressed my rage now. Before posting this, I surely took some time and read previous related conversations on this mailing list, and a lot of related things besides.
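The accent arithmetic above can be checked against the Unicode character database itself. A minimal Python sketch, assuming the letter set а е и о у р and the seven accents listed (grave, double grave, acute, breve, inverted breve, circumflex, macron):

```python
import unicodedata

# The six base letters described above: five vowels plus syllabic р
letters = "аеиоур"
# Combining marks: grave, double grave, acute, breve, inverted breve,
# circumflex, macron
accents = "\u0300\u030F\u0301\u0306\u0311\u0302\u0304"

# NFC composes a base + mark pair into one codepoint only where a
# precomposed character is encoded
precomposed = [unicodedata.normalize("NFC", l + a)
               for l in letters for a in accents
               if len(unicodedata.normalize("NFC", l + a)) == 1]

print(len(letters) * len(accents))  # 42 combinations in total
print(precomposed)                  # the few that exist precomposed (e.g. й, ў)
```

Running this shows that only a small handful of the 42 combinations (such as й = и + breve and ў = у + breve) have precomposed forms; the rest exist in Unicode only as combining sequences, which is exactly why the combining-marks-plus-font-support route discussed in this thread is the practical one.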
I know perfectly well what (you say that) Unicode is and is not. It is easy for you Latin-oriented nations (USA, Germany...) to ignore the rest of the world, especially third-world countries. You are powerful, others are weak. You have big software companies like Microsoft and Adobe, others don't. Your Latin scripts are perfected, others have to battle with their own. You have fancy OpenType effects, others don't even deserve the basic support. It is easy for you to make only Russian-compatible fonts, and you do it practically always, because the market is considerably bigger than the market of the south Balkans. Who cares about their real-world problems... But all of this is simply NOT FAIR. My final conclusion: until Serbian and Macedonian people get the required/proper glyphs and the required accented letters, all of this SYSTEMATICALLY packaged at the Unicode and operating-system level ("first-class citizens"), not just in "popular software", Unicode will still be an anti-Serbian and anti-Macedonian organization. The whole Unicode standard will be faulty and the Unicode organization *politically aggressive* to small, "incompetent", "ugly" countries like Serbia and Macedonia. -- Best regards from Крушевљанин ???? (that's one resentful and provocative computer user from Krusevac town in Serbia) _____________________________________________________________ The Free Email with so much more! =====> http://www.MuchoMail.com <===== From verdy_p at wanadoo.fr Sun Feb 16 11:44:55 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 16 Feb 2014 18:44:55 +0100 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140215182551.7535808d@JRWUBU2> References: <20140214023719.A856914@m0048141.ppops.net> <20140215182551.7535808d@JRWUBU2> Message-ID: 2014-02-15 19:25 GMT+01:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Fri, 14 Feb 2014 02:37:19 -0800 > Крушевљанин wrote: > Should these combinations be well known?
They're not listed in the CLDR exemplar characters for Serbian. > > As for input, I would suggest that the solution for the simpler keyboarding techniques is to enter them as base character and then dead key. "Dead keys" don't work this way. Their name really indicates that these keys have no action (they seem dead) until another key is pressed AFTER them. So you press the dead key for the diacritic, then the key for the base letter, to produce EITHER: - a single precomposed character (where it exists); OR - a canonically equivalent decomposed combining sequence representing the letter with its diacritic(s) (preferably in NFC form). Dead keys may be combined in advanced keyboard drivers supporting complex input states for handling multiple diacritics typed before a base letter; but simple keyboard drivers (such as those generated by the MS Keyboard Layout Creator) do not handle these complex states. But nothing prohibits building such a keyboard driver. There's another input method where you can press a key for the diacritic after a base letter: this key is treated in isolation and immediately generates the combining diacritic, independently of the characters pressed before. But such an input method will not guarantee the NFC form, and can produce broken sequences (in some cases the diacritic may be invisible in the generated text). For simple alphabetic scripts (like Latin, Greek, Cyrillic), the dead key input method is generally preferred. The other one is used to enter isolated combining diacritics which are almost never used in association with other letters (and notably not in combining sequences equivalent to an existing precomposed letter). If you think about the combining diaeresis, as it is already used very frequently in association with Latin and Cyrillic letters using a dead key method, it should also be used as a dead key even for less frequent base letters such as the Cyrillic letter Q.
All that is needed is to use an updated driver adding the mapping for diacritic dead key + letter, in which it will output the NFC combining sequence if there's no precomposed NFC equivalent. ---- Unfortunately, the drivers generated by the MS Keyboard Layout Creator (MSKLC), when it does not find any explicitly predefined mapping for diacritic dead key + base letter, will generate the mapping for , followed by the base letter, meaning that you won't get the text , but ! The second limitation of MSKLC is that it cannot chain dead keys: each input state must be mapped to a single state represented by a single character, which is the spacing modifier letter that would be output if you pressed the SPACE bar after the diacritic. It incorrectly assumes that combinations which are not mapped explicitly will always be followed by a space-bar keystroke to produce a spacing modifier letter, as if all unmapped sequences were not possible and did not exist in the real world. The other limitation is that this input state can only be represented by a single character in the BMP (though it may be represented by a PUA character of the BMP, even if MSKLC warns that this character may not be supported by fonts on the native OS, or in the Console using the local legacy OEM or "ANSI" codepage (an 8-bit code page which may be either SBCS or DBCS)). Drivers built by MSKLC do not allow mapping a dead key outside the root state table (so after pressing a dead key, possibly in combination with state modifier keys like Shift, Ctrl, Alt, and with the current state of CapsLock/ShiftLock, you can only press a single base character, also possibly in combination with state modifier keys).
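The dead-key flow described in this thread – diacritic keystroke first, base letter second, with the driver emitting NFC where a precomposed character exists – can be sketched as a tiny state machine. This is an illustrative sketch only, not any real driver; the two dead-key mappings below are hypothetical:

```python
import unicodedata

# Hypothetical dead-key table: keystroke -> combining mark it holds
DEAD_KEYS = {"`": "\u0300",   # dead grave
             "^": "\u0302"}   # dead circumflex

def type_keys(keys):
    """Consume keystrokes; a dead key produces no output until the
    next (base) key arrives, then the pair is emitted in NFC."""
    out, pending = [], ""
    for key in keys:
        if key in DEAD_KEYS:
            pending += DEAD_KEYS[key]  # "dead": remember mark, emit nothing
        else:
            # base letter + stored mark(s), composed where Unicode allows
            out.append(unicodedata.normalize("NFC", key + pending))
            pending = ""
    return "".join(out)

print(type_keys("`a"))  # -> single precomposed "à"
print(type_keys("`р"))  # no precomposed form: р + combining grave
```

Note how the same two keystrokes yield one codepoint for à but a two-codepoint combining sequence for р with grave, which is exactly the behaviour an NFC-emitting driver should have for the Serbian accented letters that lack precomposed forms.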
Due to these limitations of MSKLC, trying to generate advanced keymaps to support extended sets of combining sequences requires using complex key combinations with state modifiers (for the dead key and for the base letter), which are very awkward to input, when the sequences would be simpler and faster to enter if chains of dead keys were supported. Dead keys are not very complex; in fact they are quite friendly and have the advantage of normalizing the input to NFC directly, without needing any additional support from the external text editor (modifying the text buffer on the fly). They are natural to users even if the input order of keystrokes is reversed compared to the Unicode encoding of the generated text (something that most users will never see, as they have no idea how the text will be finally encoded and used in their applications). -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Sun Feb 16 15:57:38 2014 From: prosfilaes at gmail.com (David Starner) Date: Sun, 16 Feb 2014 13:57:38 -0800 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140216033356.31E84EC7@m0005299.ppops.net> References: <20140216033356.31E84EC7@m0005299.ppops.net> Message-ID: Every time you attack the only character set that supports various third-world African languages, various tiny North American languages, various small Indian languages and various Philippine scripts, with claims that it's "easy for you latin-oriented nations (USA, Germany...) to ignore the rest of the world, especially third-world countries", people stop listening to you. Unicode is the system designed to make it possible to write the scripts of all languages. Microsoft happens to have been one of the largest drivers behind it, having spent a lot of money on Unicode and OpenType to make this stuff possible.
> People, do you realize that proper glyphs are needed everywhere and every time, CONSTANTLY, even when an American ordinary user chats with a German ordinary user about the Serbian language They'd use Latin, because that's what their keyboards are going to support. Virtually every recent protocol runs over some sort of XML, so language tagging comes free, and if they don't, they need to provide some sort of language tagging. And if we picked your option and they did use Cyrillic? I'm betting the American ordinary user and the German ordinary user would load up their Russian keyboards and type away using Russian letters for Serbian. It is an incredibly well-known problem that if you have two similar-looking characters, users will use the more common one even when the less common one is the correct one. There won't be new precomposed characters, and there shouldn't be a need for them. There won't be new Serbian characters invalidating every text stored in systems in Serbian today. Maybe 15 years ago a change could in theory have been made, but not today. Deal with what you have, because those decisions have been made and written in stone. -- Kie ekzistas vivo, ekzistas espero. From richard.wordingham at ntlworld.com Sun Feb 16 16:12:23 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 16 Feb 2014 22:12:23 +0000 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: References: <20140214023719.A856914@m0048141.ppops.net> <20140215182551.7535808d@JRWUBU2> Message-ID: <20140216221223.7369356f@JRWUBU2> On Sun, 16 Feb 2014 18:44:55 +0100 Philippe Verdy wrote: > 2014-02-15 19:25 GMT+01:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > On Fri, 14 Feb 2014 02:37:19 -0800 > > Крушевљанин wrote: > Should these combinations be well known? They're not listed in the CLDR exemplar characters for Serbian.
> > As for input, I would suggest that the solution for the simpler keyboarding techniques is to enter them as base character and then dead key. > There's another input method where you can press a key for the diacritic after a base letter: this key is treated in isolation and immediately generates the combining diacritic, independently of the characters pressed before. Sorry, this is what I meant. I should have written 'diacritic', not 'dead key'. > But such an input method will not guarantee the NFC form, Which is an argument for text editors to have normalisation functions, like the emacs ucs-normalize-NFC-region command. > and can produce broken sequences (in some cases the diacritic may be invisible in the generated text). Something many users of the Thai script currently have to live with. Richard. From richard.wordingham at ntlworld.com Sun Feb 16 17:50:45 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 16 Feb 2014 23:50:45 +0000 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: References: <20140216033356.31E84EC7@m0005299.ppops.net> Message-ID: <20140216235045.7f916c0d@JRWUBU2> On Sun, 16 Feb 2014 13:57:38 -0800 David Starner wrote: > > People, do you realize that proper glyphs are needed everywhere and > > every time, CONSTANTLY, even when an American ordinary user chats with > > a German ordinary user about the Serbian language > And if we picked your option and they did use Cyrillic? I'm betting > the American ordinary user and the German ordinary user would load up their > Russian keyboards and type away using Russian letters for Serbian. The American *ordinary* user and the German *ordinary* user would not be typing Serbian. One issue here that I don't know the solution for is how the right glyphs should be chosen for displaying plain text communication.
I don't know any general mechanism for, say, specifying that by default Cyrillic text should use Serbian glyphs, CJK characters should use Japanese glyphs and Cuneiform should use Neo-Assyrian glyphs. > There won't be new Serbian characters invalidating > every text stored in systems in Serbian today. I don't like the idea, but one possibility would be to define Serbian glyph styles by adding variation selectors. Variation selectors are already 'defined' for the decimal digits U+0030 to U+0039. It would, however, mess up string comparison operations that weren't smart enough to ignore variation selectors. Richard. From tom at bluesky.org Sun Feb 16 18:23:04 2014 From: tom at bluesky.org (Tom Gewecke) Date: Sun, 16 Feb 2014 17:23:04 -0700 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140216235045.7f916c0d@JRWUBU2> References: <20140216033356.31E84EC7@m0005299.ppops.net> <20140216235045.7f916c0d@JRWUBU2> Message-ID: <44391A18-ED55-4CE5-AF3D-1D363E9F89F7@bluesky.org> On Feb 16, 2014, at 4:50 PM, Richard Wordingham wrote: > > One issue here that I don't know the solution for is how the right > glyphs should be chosen for displaying plain text communication. I > don't know any general mechanism for, say, specifying that by > default Cyrillic text should use Serbian glyphs, CJK characters > should use Japanese glyphs and that Cuneiform should use Neo-Assyrian > glyphs. In Mac OS X and iOS, this is currently being done for the CJK case by switching fonts according to the order of languages in the system-level language preferences. If Japanese is higher than Chinese on the list, then by default a Japanese font is used for CJK plain text. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From tom at bluesky.org Sun Feb 16 18:51:55 2014 From: tom at bluesky.org (Tom Gewecke) Date: Sun, 16 Feb 2014 17:51:55 -0700 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140216033356.31E84EC7@m0005299.ppops.net> References: <20140216033356.31E84EC7@m0005299.ppops.net> Message-ID: On Feb 16, 2014, at 4:33 AM, ??????????? wrote: > > What is interesting, I know next to nothing about Apple. (Probably because Macintosh computers are expensive as hell.) I have read something about AAT technology, but what about their fonts? Are there Serbian/Macedonian glyphs? I had a look, and I think the answer is "no". (Except for two, one of which is Chinese, which seem to have the Serbian '?' by mistake). -------------- next part -------------- An HTML attachment was scrubbed... URL: From gansmann at uni-bonn.de Mon Feb 17 03:33:05 2014 From: gansmann at uni-bonn.de (Gerrit Ansmann) Date: Mon, 17 Feb 2014 10:33:05 +0100 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140216235045.7f916c0d@JRWUBU2> References: <20140216033356.31E84EC7@m0005299.ppops.net> <20140216235045.7f916c0d@JRWUBU2> Message-ID: On Mon, 17 Feb 2014 00:50:45 +0100, Richard Wordingham wrote: > I don't like the idea, but one possibility would be to define Serbian glyph styles by adding variation selectors. Variation selectors are already 'defined' for the decimal digits U+0030 to U+0039. It would, however, mess up string comparison operations that weren't smart enough to ignore variation selectors. Also, for the variation selectors to work for the end user, it requires the same technologies whose lack of support is why we are discussing this in the first place, doesn't it? So, defining the corresponding variation selectors would not make the end user see the correct glyphs earlier.
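Richard's caveat, that variation selectors would "mess up string comparison operations that weren't smart enough to ignore variation selectors", is easy to demonstrate. A minimal Python sketch: the VS1-tagged Cyrillic letter below is a made-up illustration of the proposed scheme, not a registered variation sequence.

```python
import unicodedata

# VS1..VS16 (U+FE00..U+FE0F) and VS17..VS256 (U+E0100..U+E01EF).
VARIATION_SELECTORS = set(range(0xFE00, 0xFE10)) | set(range(0xE0100, 0xE01F0))

def strip_variation_selectors(s):
    """Drop variation selectors so comparisons ignore glyph-variant requests."""
    return "".join(ch for ch in s if ord(ch) not in VARIATION_SELECTORS)

plain = "\u0431"           # CYRILLIC SMALL LETTER BE
tagged = "\u0431\ufe00"    # same letter plus a hypothetical VS1 "Serbian glyph" request

# A naive comparison sees two different strings...
assert plain != tagged
# ...while a selector-aware comparison treats them as the same text.
assert strip_variation_selectors(plain) == strip_variation_selectors(tagged)
```

Normalization does not help here: variation selectors have no decompositions, so NFC/NFD leave them in place, and only comparison code that explicitly ignores them behaves as Richard describes.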
From otto.stolz at uni-konstanz.de Mon Feb 17 07:57:56 2014 From: otto.stolz at uni-konstanz.de (Otto Stolz) Date: Mon, 17 Feb 2014 14:57:56 +0100 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: <20140216235045.7f916c0d@JRWUBU2> References: <20140216033356.31E84EC7@m0005299.ppops.net> <20140216235045.7f916c0d@JRWUBU2> Message-ID: <53021564.4040102@uni-konstanz.de> Hello, ??????????? ???? had written: > People, do you realize that proper glyphs are needed everywhere and > every time, CONSTANTLY, even when American ordinary user chats with > German ordinary user about Serbian language On 2014-02-17 at 00:50 CET, Richard Wordingham wrote: > One issue here that I don't know the solution for is how the right > glyphs should be chosen for displaying plain text communication. I > don't know any general mechanism for, say, specifying that by > default Cyrillic text should use Serbian glyphs, CJK characters > should use Japanese glyphs and that Cuneiform should use Neo-Assyrian > glyphs. This boils down to the fact that, in plain-text communication, the receiver can, and should, choose the appropriate font. This holds, in particular, for classical e-mail. Thence my recent claim that the problem posed by ???? is a mere font issue. In HTML, this is a bit different: The author has control over the fonts (thence over the glyphic style) used for the display, but the reader can normally override the author's choice. Hence, WWW authors should specify suitable fonts for their respective articles (or even parts thereof). On paper, or in PDF and other facsimile formats, the author is entirely responsible for the glyphic style and appearance, and he should always choose suitable fonts. This is the realm of the solution involving that 'Gentium Plus srp' font I had mentioned recently. May I humbly remind ????
(and all other readers of this thread) that the problem manifests itself (mainly or only) with italic style letters; hence there remains virtually no problem with normal (non-italic) style. Best wishes, Otto Stolz From kent.karlsson14 at telia.com Mon Feb 17 08:23:00 2014 From: kent.karlsson14 at telia.com (Kent Karlsson) Date: Mon, 17 Feb 2014 15:23:00 +0100 Subject: Unicode organization is still anti-Serbian and anti-Macedonian In-Reply-To: Message-ID: On 2014-02-17 10:33, "Gerrit Ansmann" wrote: >> I don't like the idea, but one possibility would be to define Serbian glyph >> styles by adding variation selectors. Variation selectors are already >> 'defined' for the decimal digits U+0030 to U+0039. It would, however, >> mess up string comparison operations that weren't smart enough to ignore >> variation selectors. > Also, for the variation selectors to work for the end user, it requires > the same technologies whose lack of support is why we are discussing this > in the first place, doesn't it? So, defining the corresponding variation > selectors would not make the end user see the correct glyphs earlier. Still, variation selectors would be, in the text, a very localized indication, independent of the (displaying) user's preference settings or language declaration (from the author, in e.g. XML/HTML formats) for the text, and variation selectors are indeed more likely to survive operations like cut-and-paste. There would be a problem of inserting variation selectors at all places where appropriate, though. Spell checking functionality could, in principle at least, help with the latter. /Kent K From mathias at qiwi.be Thu Feb 20 04:42:01 2014 From: mathias at qiwi.be (Mathias Bynens) Date: Thu, 20 Feb 2014 11:42:01 +0100 Subject: Difference between 'combining characters' and 'grapheme extenders'? Message-ID: <330A96AB-BEA0-4726-B2CE-8B2E49A6752C@qiwi.be> What is the difference between 'combining characters'
(http://www.unicode.org/faq/char_combmark.html) and 'grapheme extenders' (http://www.unicode.org/reports/tr44/#Grapheme_Extend) in Unicode? They seem to do the same thing, as far as I can tell, although the set of grapheme extenders is larger than the set of combining characters. I'm clearly missing something here. Why the distinction? I've also posted this question on Stack Overflow: http://stackoverflow.com/q/21722729/96656 From verdy_p at wanadoo.fr Thu Feb 20 05:10:09 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 20 Feb 2014 12:10:09 +0100 Subject: Difference between 'combining characters' and 'grapheme extenders'? In-Reply-To: <330A96AB-BEA0-4726-B2CE-8B2E49A6752C@qiwi.be> References: <330A96AB-BEA0-4726-B2CE-8B2E49A6752C@qiwi.be> Message-ID: Many grapheme extenders are not "combining characters". Combining characters are classified this way for legacy reasons (the very weak "general category" property) and this property is normatively stabilized. Also, most combining characters have a non-zero combining class, and they are stabilized for the purpose of normalization. Grapheme extenders include characters that are NOT combining characters but controls (e.g. joiners). Some grapheme clusters are also more complex in some scripts: there are extenders encoded BEFORE the base character, and these cannot be classified as combining characters, because combining characters are always encoded AFTER a base character. For legacy reasons (and roundtrip compatibility with older standards) not all scripts are encoded using the UCS character model with combining characters. (E.g. the Thai script does not follow the "logical" encoding order, but the model used in TIS-620 and other standards based on it, including on Windows and *nix systems.)
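Philippe's Thai example can be checked against the Unicode Character Database with Python's standard unicodedata module (property values depend on the Unicode version your Python ships):

```python
import unicodedata

# U+0E40 THAI CHARACTER SARA E is typed and stored *before* the consonant
# it modifies, yet it is an ordinary letter (Lo), not a combining mark:
assert unicodedata.category("\u0e40") == "Lo"

# A Thai tone mark such as U+0E48 MAI EK, by contrast, follows its base
# letter and is a nonspacing mark with a non-zero combining class:
assert unicodedata.category("\u0e48") == "Mn"
assert unicodedata.combining("\u0e48") == 107
```

The prefixed vowel is stored in visual order (vowel first), which is why it cannot be a combining character in the Unicode sense even though it attaches to the following consonant for display.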
2014-02-20 11:42 GMT+01:00 Mathias Bynens : > What is the difference between 'combining characters' ( > http://www.unicode.org/faq/char_combmark.html) and 'grapheme extenders' ( > http://www.unicode.org/reports/tr44/#Grapheme_Extend) in Unicode? > > They seem to do the same thing, as far as I can tell - although the set of > grapheme extenders is larger than the set of combining characters. I'm > clearly missing something here. Why the distinction? > > I've also posted this question on Stack Overflow: > http://stackoverflow.com/q/21722729/96656 > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tyler at tylercipriani.com Thu Feb 20 09:57:26 2014 From: tyler at tylercipriani.com (Tyler Cipriani) Date: Thu, 20 Feb 2014 08:57:26 -0700 Subject: Banjo glyph proposal--open discussion Message-ID: I'm proposing adding a single UCS character to further the goal of set completeness for the set of glyphs represented on the SMP block: Miscellaneous Symbols and Pictographs: 'BANJO' (proposed glyph U+1F3DB) My current proposal is available at: https://github.com/thcipriani/unicode-banjo/blob/master/Proposal/Banjo_Unicode_Proposal.markdown Thank you in advance for any feedback or comments. Tyler Cipriani -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Thu Feb 20 14:00:15 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 20 Feb 2014 20:00:15 +0000 Subject: Difference between 'combining characters' and 'grapheme extenders'? In-Reply-To: <330A96AB-BEA0-4726-B2CE-8B2E49A6752C@qiwi.be> References: <330A96AB-BEA0-4726-B2CE-8B2E49A6752C@qiwi.be> Message-ID: <20140220200015.6ba76901@JRWUBU2> On Thu, 20 Feb 2014 11:42:01 +0100 Mathias Bynens wrote: > What is the difference between 'combining > characters' (http://www.unicode.org/faq/char_combmark.html) and > 'grapheme > extenders' (http://www.unicode.org/reports/tr44/#Grapheme_Extend) in > Unicode? > > They seem to do the same thing, as far as I can tell, although the > set of grapheme extenders is larger than the set of combining > characters. I'm clearly missing something here. Why the distinction? Spacing combining marks (category Mc) are in general not grapheme extenders. The ones that are included are mostly included so that the boundaries between 'legacy grapheme clusters' http://www.unicode.org/reports/tr29/tr29-23.html are invariant under canonical equivalence. There are six grapheme extenders that are not nonspacing (Mn) or enclosing (Me) and are not needed by this rule:
ZWNJ, ZWJ,
U+302E HANGUL SINGLE DOT TONE MARK
U+302F HANGUL DOUBLE DOT TONE MARK
U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK
U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
I can see that it will sometimes be helpful to keep ZWNJ and ZWJ along with the previous base character. The fullwidth sound marks U+3099 and U+309A are included for reasons of canonical equivalence, so it makes sense to include their halfwidth versions. I don't actually see the logic for including U+302E and U+302F.
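The category claims above can be checked with Python's unicodedata module. Note that the values reflect whatever Unicode version the interpreter ships; U+302E/U+302F in particular were reclassified from Mn to Mc in Unicode 6.1, so these assertions assume a reasonably recent build:

```python
import unicodedata

# None of the six grapheme extenders listed above is a nonspacing (Mn)
# or enclosing (Me) mark:
assert unicodedata.category("\u200c") == "Cf"  # ZWNJ - a format control
assert unicodedata.category("\u200d") == "Cf"  # ZWJ  - a format control
assert unicodedata.category("\u302e") == "Mc"  # HANGUL SINGLE DOT TONE MARK
assert unicodedata.category("\u302f") == "Mc"  # HANGUL DOUBLE DOT TONE MARK
assert unicodedata.category("\uff9e") == "Lm"  # HALFWIDTH KATAKANA VOICED SOUND MARK
assert unicodedata.category("\uff9f") == "Lm"  # HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

# The fullwidth sound mark they correspond to is an ordinary combining
# character with a non-zero canonical combining class:
assert unicodedata.category("\u3099") == "Mn"
assert unicodedata.combining("\u3099") == 8
```

This makes the distinction concrete: Grapheme_Extend is a property tuned for cluster segmentation, not a synonym for "combining mark" in the General_Category sense.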
If you're going to encourage forcing someone who's typed the wrong base character before a sequence of 3 non-spacing marks to retype the lot, you may as well do the same with Hangul tone marks. Richard. From rwhlk142 at gmail.com Sat Feb 22 13:46:03 2014 From: rwhlk142 at gmail.com (Robert Wheelock) Date: Sat, 22 Feb 2014 14:46:03 -0500 Subject: Hebrew Extended Block(s) Message-ID: Hello! There's an empty subblock (U+00860–U+0089F) with 64 empty codepoints where we could put needed additional Hebrew characters in... The (U+00860) column could house such things as:
• ḤAṬAF-ḤIRIQ (ḥiriq + shəwa)
• ḤAṬAF-QIBBUṢ (qibbuṣ + shəwa)
• TRUE SHURUQ POINT FOR WAW (point inside WAW, but slightly higher up)
• WAW WITH SHURUQ (WAW having the TRUE SHURUQ POINT inside)
• DOUBLY-POINTED SHIN (a SHIN with both ŚIN and SHIN points atop)
• DOUBLY-POINTED SHIN WITH DAGHESH (the preceding letter, only with an added DAGHESH inside)
• WAW WITH DAGHESH AND ḤOLAM
• WAW WITH DAGHESH AND SHURUQ
• The 4 DIACRITICAL POINTS ABOVE (single, double horizontal, triple up triangle, and quad squared) used to extend the Hebrew alphabet to new sounds for other Jewish languages (Judeo-Arabic, ...)
• VARIQA HAFUKH (for the same purpose)
• GALGAL HAFUKH (to write the Yiddish palatals, instead of using a double yudh ligature; also extended to ʿAYIN for writing an /e/ vowel in Yiddish)
The remaining 48 codepoints (U+00870–U+0089F) could house additional letters that are used in other Jewish languages (Hebrew letters with points above that mimic those on the corresponding Arabic letters). Research needs to be done to determine the most widely used of those (keeping in mind that those based on KAF, MEM, NUN, PE, and ṢADHEH require 2 codepoints each: the 1st for its final form, followed by a 2nd for its regular form) to assign to these 48 codepoints. The remainder of those marked letters, along with variant cantillation signs, will need another codepoint subblock to reside at...
we got (at least) 3 variant cantillation systems, each with their own vowel points and reading signs; besides those, we also have occasionally- and rarely-used supplemental marked Hebrew letters. We should also reserve (at least) a codepoint for the YHWH TETRAGRAMMATON. Shalom! Thank You! -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat Feb 22 14:06:37 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 22 Feb 2014 21:06:37 +0100 Subject: Hebrew Extended Block(s) In-Reply-To: References: Message-ID: 2014-02-22 20:46 GMT+01:00 Robert Wheelock : > Hello! > > There's an empty subblock (U+00860–U+0089F) with 64 empty codepoints > where we could put needed additional Hebrew characters in... > > The (U+00860) column could house such things as:
> • ḤAṬAF-ḤIRIQ (ḥiriq + shəwa)
> • ḤAṬAF-QIBBUṢ (qibbuṣ + shəwa)
> • TRUE SHURUQ POINT FOR WAW (point inside WAW, but slightly higher up)
> • WAW WITH SHURUQ (WAW having the TRUE SHURUQ POINT inside)
> • DOUBLY-POINTED SHIN (a SHIN with both ŚIN and SHIN points atop)
> • DOUBLY-POINTED SHIN WITH DAGHESH (the preceding letter, only with an added DAGHESH inside)
> • WAW WITH DAGHESH AND ḤOLAM
> • WAW WITH DAGHESH AND SHURUQ
> • The 4 DIACRITICAL POINTS ABOVE (single, double horizontal, triple up triangle, and quad squared) used to extend the Hebrew alphabet to new sounds for other Jewish languages (Judeo-Arabic, ...)
> • VARIQA HAFUKH (for the same purpose)
> • GALGAL HAFUKH (to write the Yiddish palatals, instead of using a double yudh ligature; also extended to ʿAYIN for writing an /e/ vowel in Yiddish)
Most of these are already encoded using multiple codepoints (e.g. doubly-pointed shin, with or without dagesh).
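Philippe's point, that combinations like the doubly-pointed shin are already representable as combining sequences, can be illustrated with an existing presentation form; a Python sketch:

```python
import unicodedata

# U+FB2C HEBREW LETTER SHIN WITH DAGESH AND SHIN DOT is a presentation
# form whose canonical decomposition is a base letter plus combining
# points (shin + dagesh + shin dot):
assert unicodedata.normalize("NFD", "\ufb2c") == "\u05e9\u05bc\u05c1"

# The presentation form is a composition exclusion, so NFC keeps the
# multi-codepoint spelling -- the combining sequence *is* the
# normalized representation:
seq = "\u05e9\u05bc\u05c1"
assert unicodedata.normalize("NFC", seq) == seq
```

So a "doubly-pointed shin with dagesh" needs no new precomposed codepoint; the normalized text is the base letter followed by its combining points, and fonts are expected to render that sequence.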
The Hebrew script is already challenging enough; we don't need to add complexity by adding even more ways to encode the same thing (and then to update the already complex collation rules, for which there have been tons of comments and solutions already implemented, including borrowing some Arabic combining marks for use within the Hebrew script). Some of the characters you propose are just typographic variants. I suggest you read the PDF document about the development of the SIL SBL font; it is really informative about many of these issues and how that font was designed (according to many discussions that occurred years ago on this list). I'm convinced that this document should also be better referenced as an informative technical report for the script (because it is in fact not specific to the SBL font itself). -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Sat Feb 22 14:55:33 2014 From: everson at evertype.com (Michael Everson) Date: Sat, 22 Feb 2014 12:55:33 -0800 Subject: Hebrew Extended Block(s) In-Reply-To: References: Message-ID: <6ABEE65D-7D4B-4A53-ACFD-4325DC93EE92@evertype.com> On 22 Feb 2014, at 11:46, Robert Wheelock wrote: > There's an empty subblock (U+00860–U+0089F) with 64 empty codepoints where we could put needed additional Hebrew characters in... We're not going to add a rake of pre-composed Hebrew characters though. Michael Everson * http://www.evertype.com/ From jonathan.rosenne at gmail.com Sat Feb 22 15:10:16 2014 From: jonathan.rosenne at gmail.com (Jonathan Rosenne) Date: Sat, 22 Feb 2014 23:10:16 +0200 Subject: Hebrew Extended Block(s) In-Reply-To: <6ABEE65D-7D4B-4A53-ACFD-4325DC93EE92@evertype.com> References: <6ABEE65D-7D4B-4A53-ACFD-4325DC93EE92@evertype.com> Message-ID: <006d01cf3012$7d78b140$786a13c0$@gmail.com> May I suggest a correction: We're not going to add a rake of pre-composed characters though.
Best Regards, Jonathan Rosenne -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Michael Everson Sent: Saturday, February 22, 2014 10:56 PM To: unicode Unicode Discussion Subject: Re: Hebrew Extended Block(s) On 22 Feb 2014, at 11:46, Robert Wheelock wrote: > There's an empty subblock (U+00860 - U+0089F) with 64 empty codepoints where we could put needed additional Hebrew characters in. We're not going to add a rake of pre-composed Hebrew characters though. Michael Everson * http://www.evertype.com/ _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From verdy_p at wanadoo.fr Sun Feb 23 13:49:24 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 23 Feb 2014 20:49:24 +0100 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> Message-ID: OK, I ignored these resets only for simplicity; the question was not about a full set of rules to build a collation, but about a small subset of rules that could be used. It seems surprising that Michael Everson asks the question, when he already knows so much about Unicode algorithms (though perhaps less about the notations used in CLDR data). The CLDR also has several competing notations for specifying collations, so that may be the purpose of his question. I don't think that all notations need an explicit reset at start (it can be implicit for the first element in a chain of relations). 2014-02-14 17:26 GMT+01:00 Markus Scherer : > You need a reset point to say where in the UCA/CLDR universe this rule > chain goes. > http://www.unicode.org/reports/tr35/tr35-collation.html#Orderings > > The default collation puts lowercase first. Normally you reset to a > lowercase character and tailor variations to that, otherwise the few > characters you tailor are inconsistent with the rest of Unicode.
> Implementations like ICU provide parametric settings (no need for rules) to > specify uppercase first. > http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options > > You should only reorder characters that the default order does not already > have where you need them. For example, reset at each base letter, unless > you want to reorder them relative to each other's default order. > http://www.unicode.org/charts/collation/ > > See also http://cldr.unicode.org/index/cldr-spec/collation-guidelines > especially about "Minimal Rules". > > You can try out collation rules and settings at > http://demo.icu-project.org/icu-bin/locexp?_=root&d_=en&x=col > > Best regards, > markus > -- > Google Internationalization Engineering > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Feb 23 15:32:45 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 23 Feb 2014 21:32:45 +0000 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> Message-ID: <20140223213245.26f99657@JRWUBU2> On Sun, 23 Feb 2014 20:49:24 +0100 Philippe Verdy wrote: > It seems surprisng that Michael Everson asks the question, when he > already knows so much about Unicode algorithms (but may be less about > notations used in CLDR data) > > The CLDR also has several competing notations for specifying > collations so that may be the purpose of his question. I have no confidence that his question has been understood. Collation is a monster, and it is unsafe to assume that one understands it. The ICU notation and implementation for an abstract definition of collation turned out to be full of traps, and won't catch up with CLDR definitions until Markus Scherer's raft of collation amendments goes in. (Or have I missed the announcement?) Rigorous definitions have had to address collation elements (i.e. 
sets of weights, one at each level with 0 a special value), which is not as abstract as the ICU notation was meant to be. As an example of the treachery of collation definitions, one might naïvely think that adding &a< References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> Message-ID: 2014-02-23 22:32 GMT+01:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Sun, 23 Feb 2014 20:49:24 +0100 > Philippe Verdy wrote: *At least, referring to Version > 24 of the LDML specification, I assume > Part 5 Section 3.5, which defines "&..<<", also applies to Section 3.9, > which purports to define the meaning of "&[before 2]..<<". It's > conceivable that I am wrong, and the meaning of "&[before 2]? << ?" is > undefined. > This looks like a cryptic notation anyway. If we assume that there's an implicit reset at start of a collation rule, and that collation does not define any relative order for the empty string, you could simply write this reset at level 2 as: << ? << ? instead of the mysterious notation (and in fact verbose and probably inconsistent in the way the same level 2 is further used with "<<"): &[before 2]? << ? I don't think the "&" is necessary except as a separator between separate rules (where all rules must implicitly start with a reset at some level). The "monster" you describe belongs to the ICU implementation (which is not part of any standard but is now integrated in various products that have abandoned the idea of implementing (unstable and complex) collations themselves). My opinion is that this part of ICU should be detached from it into a completely separate project, to help simplify it, because all the rest of ICU has viable competing implementations (that are also more easily ported to other languages without having to create possibly unsafe binary bindings to native C/C++ code or Java).
It is notable that after so many years, collation is still not implemented in JavaScript, and still does not have a standardized API in the minimum JavaScript/ECMAScript string support (there is an implementation though in Lua, based on internal bindings to the native C/C++ code in its library; there are some attempts to emulate it also in Python; in C#/J# the implementation is performed by binding the native C/C++ code; but it still causes deployment problems for distributed applications that need to deliver code on the client side of web services: only Java works for now, not JavaScript except by using server-side helpers with really _slow_ remote APIs). When performance of applications on the client side is a problem (for client-side applications needing to perform dynamic collations), full collators are not implemented at all, and these applications use a much simpler model (even if they don't work very well with lots of languages). And the existing CLDR data about collation is simply not portable at all outside contexts where ICU can be used. Instead, each application supports its own (more or less limited) model implementing some unspecified part of the CLDR collation data (which is then insufficiently reused and corrected for handling real cases, even for the most frequently needed ones). -------------- next part -------------- An HTML attachment was scrubbed...
URL: From markus.icu at gmail.com Sun Feb 23 17:04:34 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Sun, 23 Feb 2014 15:04:34 -0800 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> Message-ID: On Sun, Feb 23, 2014 at 2:13 PM, Philippe Verdy wrote: > 2014-02-23 22:32 GMT+01:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > >> On Sun, 23 Feb 2014 20:49:24 +0100 >> Philippe Verdy wrote: *At least, referring to >> Version 24 of the LDML specification, I assume >> Part 5 Section 3.5, which defines "&..<<", also applies to Section 3.9, >> which purports to define the meaning of "&[before 2]..<<". It's >> conceivable that I am wrong, and the meaning of "&[before 2]? << ?" is >> undefined. >> > No, it's well-defined, and I believe that part of the spec is fairly complete since CLDR 24. > This looks like a cryptic notation anyway. If we assume that there's an > implicit reset at start of a collation rule, and that collation does not > define any relative order for the empty string, you could simply write this > reset at level 2 as: > << ? << ? > It might have made sense 15 years ago to permit relations without an initial reset, because at the time the rules were applied on a blank slate. Ever since ICU/CLDR collation rules were redefined to apply on top of DUCET (and later on top of the CLDR root collation), you really need to reset to something for the result to make sense. CLDR 24 forbids rules without initial reset, and ICU 53 will follow suit. > instead of the mysterious notation (and in fact verbose and probably > inconsistent in the way the same level 2 is further used with "<<"): > &[before 2]? << ? > It is true that the "2" and the strength of the operator are redundant, but the notation is now well-defined.
I don't know your criteria for "mysterious" :-) It does help to know the root collation mappings, or at least how they are generally constructed; for example, that ? maps to two collation elements. > I don't think the "&" is necessary except as a separator between separate > rules (where all rules must implicitly start with a reset at some level). > See above. > The "monster" you describe belongs to the ICU implementation (which is not part > of any standard but is now integrated in various products that have abandoned > the idea of implementing (unstable and complex) collations themselves. > I think Richard refers to the "monster" because it is very, very tricky to get one's head around the interaction of all of the pieces of the UCA, Unicode normalization, and the CLDR additions. At least when it comes to the heads of Richard, Mark, Ken, and my own... Also, the implementation of UCA is easy if you don't care about data size or speed of string comparisons. Once you care about size and speed and want additional functionality (like in ICU), it's a major chunk of code. In the case of ICU, that code had accreted functionality and changed with changing specs and had gotten buggy and hard to maintain, so I am in the process of reimplementing it, with hopes of getting it into ICU 53 in March. The code and data actually got smaller, but it's still large. > My opinion is that this part of ICU should be detached from it into a > completely separate project, to help simplify it, > It's complex for reasons stated above, and it benefits from many lower-level parts of ICU (Unicode properties, normalization, data loading, data structures, ...). > It is notable that after so many years, collation is still not > implemented in JavaScript, and still does not have a standardized API in > the minimum JavaScript/ECMAScript string support > Collation was added to the ECMAScript standard in 2012, with several browsers implementing it. PyICU makes it available in Python.
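For readers who want to experiment without installing PyICU: the Python standard library exposes the platform's collation through locale.strxfrm, which follows the same precompute-a-sort-key pattern Markus describes, though it is not UCA/CLDR collation. A sketch pinned to the portable (if uninteresting) "C" locale:

```python
import locale

# strxfrm transforms a string into a sort key for the current LC_COLLATE
# locale; comparing the transformed strings is equivalent to collating
# the originals. The "C" locale is guaranteed to exist everywhere and
# simply collates by code point, so uppercase sorts before lowercase.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["banana", "Apple", "cherry"]
assert sorted(words, key=locale.strxfrm) == ["Apple", "banana", "cherry"]
```

Switching LC_COLLATE to a real locale (where the platform provides one) gives linguistically sensible ordering from the same two lines of sorting code; for full UCA/CLDR tailoring, a binding such as PyICU is still needed.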
If someone wanted to port code to JavaScript or Python, and wanted it to be fast, the new (upcoming) ICU Java code might be a reasonable start. > When performance of applications on the client side is a problem (for > client-side applications needing to perform dynamic collations), full > collators are not implemented at all, and these applications use a much > simpler model (even if they don't work very well with lots of languages). > Right. If the client code need not collate newly typed strings, then one good technique is to have the server send the corresponding sort keys. By the way, ICU makes a strong effort to write very short sort keys. Best regards, markus -- Google Internationalization Engineering -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sun Feb 23 17:26:01 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 24 Feb 2014 00:26:01 +0100 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> Message-ID: 2014-02-24 0:04 GMT+01:00 Markus Scherer : > Right. If the client code need not collate newly typed strings, then one > good technique is to have the server send the corresponding sort keys. By > the way, ICU makes a strong effort to write very short sort keys. > The size of sort keys does not really matter here. They are of the same order of magnitude as the texts to sort. The problem in dynamic applications is that the client may need to send lots of text to the server to get back lots of keys. The client may want to cache them, but then the application will be bound to the performance of the network and the server load, for round-trip response times.
In most application web sites, this is simply not an option: users will complain about the slow response time of the application for something that seems obvious to them, such as sorting columns in a long data report mixed with user data input (without having to download it again for the same data presented differently, and without losing current user input). In some cases, it is also not an option to send this data to the server, because it is private to the user, and the user wants that data to be stored elsewhere securely (including on a server that performs nothing but storage and cannot offer the collator service). Other applications needing performant collators are text editors (for client-side search-and-replace while editing; possibly even with support for regexps and collator-based text transforms...). -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Sun Feb 23 23:29:47 2014 From: everson at evertype.com (Michael Everson) Date: Sun, 23 Feb 2014 21:29:47 -0800 Subject: Sorting notation In-Reply-To: <20140223213245.26f99657@JRWUBU2> References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> Message-ID: On 23 Feb 2014, at 13:32, Richard Wordingham wrote: > On Sun, 23 Feb 2014 20:49:24 +0100 > Philippe Verdy wrote: > >> It seems surprising that Michael Everson asks the question, when he >> already knows so much about Unicode algorithms (but may be less about >> notations used in CLDR data) Do me a favour, Mr Verdy. Don't think about me. Thanks.
Michael Everson * http://www.evertype.com/ From verdy_p at wanadoo.fr Mon Feb 24 02:36:33 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 24 Feb 2014 09:36:33 +0100 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> Message-ID: 2014-02-24 6:29 GMT+01:00 Michael Everson : > > On Sun, 23 Feb 2014 20:49:24 +0100 Philippe Verdy > wrote: > > > >> It seems surprising that Michael Everson asks the question, when he > >> already knows so much about Unicode algorithms (but may be less about > >> notations used in CLDR data) > > Do me a favour, Mr Verdy. Don't think about me. Thanks. Why? Didn't *you* ask the question to the list? If you don't like the replies, that's possibly because you did not ask the right question, or did not say what you need confirmed. Or maybe because you want to get opinions from others on something that is highly subject to variation and not really a widely adopted standard (the UCA algorithm is standard, not the notations for tailorings; even the CLDR data has changed several times). -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Mon Feb 24 05:00:47 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 24 Feb 2014 11:00:47 +0000 (GMT) Subject: Websites in Hindi Message-ID: <1393239647.87888.YahooMailNeo@web87805.mail.ir2.yahoo.com> An interesting thread about Websites in Hindi is on the Serif forum. https://community.serif.com/forum/webplus/8615/websites-in-hindi I know that there can be issues over the correct rendering of some Indian languages, though I do not know if that applies to Hindi specifically. It is possible that browsers and Adobe Reader resolve those issues, but I do not know. Could someone here say something about this please?
William Overington

24 February 2014

From richard.wordingham at ntlworld.com  Mon Feb 24 13:38:21 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 24 Feb 2014 19:38:21 +0000
Subject: Sorting notation
In-Reply-To: 
References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2>
Message-ID: <20140224193821.23fa0cee@JRWUBU2>

On Sun, 23 Feb 2014 23:13:53 +0100
Philippe Verdy wrote:
> 2014-02-23 22:32 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> > On Sun, 23 Feb 2014 20:49:24 +0100
> > Philippe Verdy wrote: *At least, referring to
> > Version 24 of the LDML specification, I assume
> > Part 5 Section 3.5, which defines "&..<<", also applies to Section
> > 3.9, which purports to define the meaning of "&[before 2]..<<".
> > It's conceivable that I am wrong, and the meaning of "&[before 2]?
> > << ?" is undefined.
> This looks like a cryptic notation anyway. If we assume that there's
> an implicit reset at start of a collation rule, and that collation
> does not define any relative order for the empty string, you could
> simply write this reset at level 2 as:
> << ? << ?
> instead of the mysterious notation (and in fact verbose and probably
> inconsistent in the way the same level 2 is further used with "<<"):
> &[before 2]? << ?

My understanding of the meaning of the notation is that:

1) ? is to have the same number and type of collation elements as ?
currently has;
2) The last collation element of ? that has a positive weight at level
2 is to be immediately before the corresponding collation element of
? at the secondary level;
3) No collation element is to be ordered between these two collation
elements; and
4) Their other collation elements are to be the same.

Thus, before the operation we have a? << ? << ? << ?. After it, we have
a? << ? << ? << ?. Is this really what your notation "<< ? << ?" is
intended to mean? If we are looking for a brief notation, I think "&? >> ?"
would be better.

Richard.

From verdy_p at wanadoo.fr  Tue Feb 25 14:02:47 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 25 Feb 2014 21:02:47 +0100
Subject: Sorting notation
In-Reply-To: <20140224193821.23fa0cee@JRWUBU2>
References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> <20140224193821.23fa0cee@JRWUBU2>
Message-ID: 

2014-02-24 20:38 GMT+01:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> My understanding of the meaning of the notation is that:
>
> 1) ? is to have the same number and type of collation elements as ?
> currently has;
> 2) The last collation element of ? that has a positive weight at level
> 2 is to be immediately before the corresponding collation element of
> ? at the secondary level;
> 3) No collation element is to be ordered between these two collation
> elements; and
> 4) Their other collation elements are to be the same.

I disagree with your point (1).

* The number of levels does not matter; the notation just indicates that the
relation does not specify any starting weight for levels lower than the one
indicated by the reset.
* And the effective number of collation elements does not matter: we should
assume that if one of the items has not enough collation elements, there's a
zero weight for each missing level. In practice this only affects the first
element, except in case of contractions.

I disagree as well on point (2). The starting element (at the reset) may have
a null weight at that level, so that we can still order other elements with
the same null weight at that level, notably if they have non-null weights for
higher levels.

I agree on your point (3) EXCEPT when the first item of a pair is a "reset"
(i.e. an empty string).

The point (4) is completely wrong. The other collation elements in the first
pair may be arbitrary (also possibly with distinct weights, but at higher
levels) !!!
In fact the two items listed after the reset do not matter at all. All that
is important is the item for the reset itself, and the first non-empty item
ordered after it.

That's why I think that "&[before2] xxx" makes sense (even alone) and is in
fact the same as "& << xxx" or even just "<< xxx" if you consider that every
rule starts with an implicit reset in order to create a valid pair (in the
first pair, the 1st item of the pair is the reset itself, i.e. an empty
string; the second item is the first non-empty string indicated after it; and
the pair itself has a numeric property specifying its level, here 2).

The form "&a < b < c < d ..." is a compressed form of these rules: "

From markus.icu at gmail.com  Tue Feb 25 15:29:53 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 25 Feb 2014 13:29:53 -0800
Subject: Sorting notation
In-Reply-To: 
References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> <20140224193821.23fa0cee@JRWUBU2>
Message-ID: 

On Tue, Feb 25, 2014 at 12:02 PM, Philippe Verdy wrote:

> 2014-02-24 20:38 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
>
>> My understanding of the meaning of the notation is that:
>>
>> 1) ? is to have the same number and type of collation elements as ?
>> currently has;
>> 2) The last collation element of ? that has a positive weight at level
>> 2 is to be immediately before the corresponding collation element of
>> ? at the secondary level;
>> 3) No collation element is to be ordered between these two collation
>> elements; and
>> 4) Their other collation elements are to be the same.
>
> I disagree with your point (1).

Philippe, Richard is correct about what the specific example of

&[before 2]? << ?

should yield according to
http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Tailorings

Your opinions are not based on the LDML collation tailoring spec, but you
make it sound like they are.
I suggest the two of you agree on which spec to discuss, or you clarify that what you are doing is comparing the LDML spec with some other spec (I don't know which one that is). markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Feb 25 15:36:24 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 25 Feb 2014 22:36:24 +0100 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> <20140224193821.23fa0cee@JRWUBU2> Message-ID: I did not cite LDML, because it is far from being a stable standard for the question of collation (I endorse the term "monster" used by someone else), being adopted (and modified) mostly to document what ICU does (or does not know how to do better). As such this spec is still in a very alpha stage, and subject to various experimentations. 2014-02-25 22:29 GMT+01:00 Markus Scherer : > On Tue, Feb 25, 2014 at 12:02 PM, Philippe Verdy wrote: > >> 2014-02-24 20:38 GMT+01:00 Richard Wordingham < >> richard.wordingham at ntlworld.com>: >> >> My understanding of the meaning of the notation is that: >>> >>> 1) ? is to have the same number and type of collation elements as ? >>> currently has; >>> 2) The last collation element of ? that has a positive weight at level >>> 2 is to be immediately before the corresponding collation element of >>> ? at the secondary level; >>> 3) No collation element is to be ordered between these two collation >>> elements; and >>> 4) Their other collation elements are to be the same. >>> >> >> I disagree with point your point (1). >> > > Philippe, Richard is correct with what the specific example of > &[before 2]? << ? > > should yield according to > > http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Tailorings > > Your opinions are not based on the LDML collation tailoring spec, but you > make it sound like they are. 
> > I suggest the two of you agree on which spec to discuss, or you clarify > that what you are doing is comparing the LDML spec with some other spec (I > don't know which one that is). > > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Tue Feb 25 18:08:27 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 26 Feb 2014 00:08:27 +0000 Subject: Sorting notation In-Reply-To: References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> <20140224193821.23fa0cee@JRWUBU2> Message-ID: <20140226000827.6e189530@JRWUBU2> On Tue, 25 Feb 2014 21:02:47 +0100 Philippe Verdy wrote: > 2014-02-24 20:38 GMT+01:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: The immediately following text of mine is entirely concerned with the interpretation of the LDML specification "&[before 2]? << ?". > > My understanding of the meaning of the notation is that: > > > > 1) ? is to have the same number and type of collation elements as ? > > currently has; > > 2) The last collation element of ? that has a positive weight at > > level 2 is to be immediately before the corresponding collation > > element of ? at the secondary level; > > 3) No collation element is to be ordered between these two collation > > elements; and > > 4) Their other collation elements are to be the same. The terms collation element and weight as I use them are intended to be used as in the Unicode Collation Algorithm. It is conceivable that I have missed some subtlety in the difference between the extended weights of DUCET and the fractional weights preferred for the expression of the CLDR default collation. > I disagree with point your point (1). > * The number of levels does not matter, the notation just indicates > that the relation does not specify any starting weight for levels > lower than the one indicated by the reset. 
It does seem that what happens below the level of the reset is irrelevant. I
couldn't construct a counter-example to show that it can matter. I'd still
recommend copying at the lower levels just in case there is a subtle effect.

> * And the effective number of collation elements does not matter: we
> should assume that if one of the items has not enough collation
> elements, there's a zero weight for each missing
> level. In practice this only affects the first element, except in
> case of contractions.

This makes no sense to me. The collation elements for ? before the
application of the rule do not matter. The requirements I gave on the
collation elements of ? are for its collation elements *immediately after*
the rule has been applied. This incomprehension also applies to your comments
on points (2) to (4).

> I disagree as well on point (2). The starting element (at the reset)
> may have a null weight at that level, so that we can still order
> other elements with the same null weight at that level, notably if
> they have non-null weights for higher levels.
> I agree on your point (3) EXCEPT when the first item of a pair is a
> "reset" (i.e. an empty string).
>
> The point (4) is completely wrong. The other collation elements in
> the first pair may be arbitrary (also possibly with distinct weights,
> but at higher levels) !!!

The specification "&[before 2]? << ?" has to be invalid if ? has no non-zero
secondary weights. The LDML specification doesn't mention this input error.
This has nothing to do with the LDML notation. As far as I can tell, you are interpreting "<< xxx" to assign xxx a collating element with zero primary weight. > The form "&a < b < c < d ..." is a compressed form of these rules: > " level). The "reset" is then automatically the first item of each pair. > So my own syntax never needs any explicit reset, it just order > collection elements with simple rules, in which I can also add > optional statistics (used only for the generation of collation keys, > but not needed at all for comparing two strings). No. This is similar to the fallacy that a collation is defined by the relative ordering (and degree of difference) of the collating elements. Are you relying on deferred binding? And please try not to use 'collation element' (a sequence of weights, one per level) when you mean 'collating element' (either a string of characters or the ordered pair of a string and its corresponding sequence of collation elements). > And I still don't handle some of the preprocessing needed > for some Indic scripts (includng Thai),... Are you aware that Thai can be handled by contractions? Compared with how it might have been, Thai collation is extremely computer friendly. Richard. From verdy_p at wanadoo.fr Tue Feb 25 22:34:43 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 26 Feb 2014 05:34:43 +0100 Subject: Sorting notation In-Reply-To: <20140226000827.6e189530@JRWUBU2> References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> <20140224193821.23fa0cee@JRWUBU2> <20140226000827.6e189530@JRWUBU2> Message-ID: 2014-02-26 1:08 GMT+01:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > > And I still don't handle some of the preprocessing needed > > for some Indic scripts (includng Thai),... > > Are you aware that Thai can be handled by contractions? Compared with > how it might have been, Thai collation is extremely computer friendly. 
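[As an aside for readers of the archive: the contraction-based handling of Thai that is being discussed can be approximated by a preprocessing step. This is a purely illustrative sketch, not how ICU implements it (ICU uses contractions inside the tailoring itself), and it stands in for primary weights with raw code-point order, which happens to match for the modern Thai consonants and vowels used here.]

```python
# Toy sketch of Thai collation preprocessing (illustrative only).
# The preposed vowels U+0E40..U+0E44 are written *before* the consonant
# they logically follow, so swap each one with the next character before
# comparing; code-point order then approximates the primary weights.
PREPOSED = {chr(c) for c in range(0x0E40, 0x0E45)}  # เ แ โ ใ ไ

def thai_sort_key(s: str) -> str:
    chars = list(s)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PREPOSED:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

words = ["แมว", "ไก่", "กา"]  # cat, chicken, crow
print(sorted(words, key=thai_sort_key))  # ['กา', 'ไก่', 'แมว']
```

[A real tailoring would additionally give the tone marks secondary weights; this toy key just sorts them by code point.]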
I did not write anything here about contraction but only about preprocessing.
The "computer friendly" feature of Thai is basically for its rendering (not
part of this topic); I'm not sure this is really true when discussing
collations. Though as I said, I've not invested time to test it in real
cases. Only basic tests were performed (using some of the test cases listed
in CLDR data or in ICU, only for comparison of results).

Also I absolutely don't care about composite weights or fractional weights
used in ICU. For me these are implementation tricks and are irrelevant to how
a collator may work; they are one possible solution which in fact just
complicates the expression of the problems to solve. These are the kind of
things that are (IMHO) overspecified only for documenting how ICU works.
(Note: I don't oppose ICU; but ICU is not universal and cannot be used
universally, even if it is integrated in more projects today.)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From tomasek at etf.cuni.cz  Wed Feb 26 03:47:37 2014
From: tomasek at etf.cuni.cz (Petr Tomasek)
Date: Wed, 26 Feb 2014 10:47:37 +0100
Subject: Hebrew Extended Block(s)
In-Reply-To: <6ABEE65D-7D4B-4A53-ACFD-4325DC93EE92@evertype.com>
References: <6ABEE65D-7D4B-4A53-ACFD-4325DC93EE92@evertype.com>
Message-ID: <20140226094737.GA20110@ebed.etf.cuni.cz>

On Sat, Feb 22, 2014 at 12:55:33PM -0800, Michael Everson wrote:
> On 22 Feb 2014, at 11:46, Robert Wheelock wrote:
>
> > There's an empty subblock (U+0860 - U+089F) with 64 empty codepoints
> > where we could put needed additional Hebrew characters in?
>
> We're not going to add a rake of pre-composed Hebrew characters though.

What about the Babylonian and Palestinian punctuation? Anything new
currently? Thanks!
Petr Tomasek

From qsjn4ukr at gmail.com  Wed Feb 26 07:30:05 2014
From: qsjn4ukr at gmail.com (QSJN 4 UKR)
Date: Wed, 26 Feb 2014 15:30:05 +0200
Subject: Old Cyrillic Yest
In-Reply-To: 
References: <20121112092156.665a7a7059d7ee80bb4d670165c8327d.cea44632cc.wbe@email03.secureserver.net> <6725ADA5AC2341D9B1AEF2F398F81BDC@DougEwell> <7B5C469C-1EC7-4DCE-A1AC-2F22E7C69230@evertype.com>
Message-ID: 

2012/11/12 QSJN 4 UKR wrote:
> Old Cyrillic letter YEST (Є) has two variants: broad (also called
> Yakornoye Yest) and narrow. They are preserved in modern Ukrainian script
> (only), where U+0404/0454 UKRAINIAN IE is used for the inherited BROAD
> YEST and the modern, rectangle form of U+0415/0435 IE for the NARROW
> YEST. The Unicode Standard has a remark to use U+0404 for the Old Cyrillic
> YEST, but it is unclear how to distinguish the BROAD YEST and the
> NARROW YEST. Unfortunately some fonts use U+0404/0454 for any YEST and
> U+0415/0435 for the modern rectangle IE, some old-style fonts use only
> the old YEST but with codepoint U+0415/0435 and do not use U+0404/0454
> at all, some use U+0404/0454 for the BROAD YEST and U+0415/0435 for
> the NARROW YEST...

2012/11/23 Doug Ewell
> How many truly different letters, old and new, are we talking about? On
> November 12 you wrote, "UKRAINIAN IE and BROAD YEST is the same letter in
> fact." It would not make sense to assign a new BROAD YEST letter if it is
> really the same as UKRAINIAN IE, and if existing texts already use
> UKRAINIAN IE to represent it.
Full picture (Meaning - Glyph - Codepoint):

Old Church Slavonic:
Narrow Yest (regular form) - very narrow halfmoon - 0404/0454 (ambiguous) and
0415/0435 (probably the wrong glyph will be rendered) (there are no certain
codepoints)
Broad Yest (special form: initial, plural disambiguator) - broad halfmoon,
identical to Ukrainian Ie or maybe somewhat greater (breaking the baseline) -
0404/0454 indeed

Modern imitation of Church Slavonic, or really old texts, or texts where it
is hard to distinguish Broad and Narrow Yest:
Ambiguous Yest - identical to Ukrainian Ie or maybe like Narrow Yest (in an
old-style font) - 0404/0454 sure

Modern languages:
Ie - rectangle capital / closed rounded small (identical to Latin) - 0415/0435
Ukrainian Ie - identical to ambiguous Yest - 0404/0454

So there are two steps.

First. Required. A separate codepoint for Narrow Yest. It is just impossible
to work with Church Slavonic texts without it. Because: the wrong glyph is
rendered almost always (you must understand, we can't rely on language
detection, because the text certainly contains a mix of old text with modern
translation) - or - there is no way to show Broad Yest at all.

Second. Optional. A separate codepoint for Broad Yest. That's only necessary
if one part of a text contains the ambiguous Yests (coded as now, 0404/0454,
without changes!) but another part contains the Broad Yests and the author
can/wants to show this feature.

Am I the only person in the world who thinks that Unicode is poorly adapted
for Church Slavonic?

From samjnaa at gmail.com  Thu Feb 27 04:32:49 2014
From: samjnaa at gmail.com (Shriramana Sharma)
Date: Thu, 27 Feb 2014 16:02:49 +0530
Subject: ?MP = Multi*lingual* plane?
Message-ID: 

Given that Unicode encodes scripts and not languages, how appropriate is it
to call the BMP and the SMP the multi*lingual* planes?

-- 
Shriramana Sharma ???????????? ????????????

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From everson at evertype.com  Thu Feb 27 09:23:53 2014
From: everson at evertype.com (Michael Everson)
Date: Thu, 27 Feb 2014 07:23:53 -0800
Subject: ?MP = Multi*lingual* plane?
In-Reply-To: 
References: 
Message-ID: 

On 27 Feb 2014, at 02:32, Shriramana Sharma wrote:

> Given that Unicode encodes scripts and not languages, how appropriate is it
> to call the BMP and the SMP as the multi*lingual* planes?

You are more than two decades late in asking this. It may have seemed more
appropriate in an 8-bit code page world where rather small subsets limited
the number of languages accessible by one or another part of ISO/IEC 8859. A
new term like 'multiscriptal' would not have been appropriate. File this
under "We know the term 'ideograph' is a misnomer."

Michael Everson * http://www.evertype.com/

From asmusf at ix.netcom.com  Thu Feb 27 09:30:59 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 27 Feb 2014 07:30:59 -0800
Subject: ?MP = Multi*lingual* plane?
In-Reply-To: 
References: 
Message-ID: <530F5A33.1010005@ix.netcom.com>

On 2/27/2014 2:32 AM, Shriramana Sharma wrote:
> Given that Unicode encodes scripts and not languages, how appropriate
> is it to call the BMP and the SMP as the multi*lingual* planes?

Isn't it lovely how these things work?

A./

From richard.wordingham at ntlworld.com  Thu Feb 27 16:00:09 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 27 Feb 2014 22:00:09 +0000
Subject: Sorting notation
In-Reply-To: 
References: <93FF5E56-A343-4C88-B8CB-16081868CEC1@evertype.com> <20140223213245.26f99657@JRWUBU2> <20140224193821.23fa0cee@JRWUBU2> <20140226000827.6e189530@JRWUBU2>
Message-ID: <20140227220009.69655291@JRWUBU2>

On Wed, 26 Feb 2014 05:34:43 +0100
Philippe Verdy wrote:
> 2014-02-26 1:08 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> > Compared
> > with how it might have been, Thai collation is extremely computer
> > friendly.
> The "computer friendly" feature of Thai is basically for its
> rendering (not part of this topic), I'm not sure this is really true
> when discussing about collations.

You just swap the preposed vowels with the immediately following consonant
(which can be done by a contraction), and then it's a straightforward sort of
a system having characters with a secondary weight. You don't need to know
anything more about the structure of Thai words.

However, the first Thai-Thai dictionary had a very different collation order -
see http://www.sealang.net/dictionary/bradley/theraphan1991lexicography.htm .
I think that order needs a very large collation element table. It may well be
beyond the capability of the UCA - the description I cited barely hints at
the problems.

Richard.

From adam at nohejl.name  Fri Feb 28 12:56:43 2014
From: adam at nohejl.name (Adam Nohejl)
Date: Fri, 28 Feb 2014 19:56:43 +0100
Subject: CJK stroke order data: kRSUnicode v. kRSKangXi
Message-ID: 

Hello, I am comparing radical data for CJK characters from different sources,
including the Unihan database. According to the Unihan documentation* the
kRSUnicode radical should correspond to the kRSKangXi radical, which in turn
should be based on the Kang Xi dictionary. Is there any explanation for the
following discrepancies? Did I miss any other rules or reasoning behind the
content of these two fields?

Examples of the discrepancies:

(1) A very common character for "most, maximum".
U+6700 kRSKangXi 73.8
U+6700 kRSUnicode 13.10

(2) A funny character for autumn containing the turtle component.
U+9F9D kRSKangXi 115.16
U+9F9D kRSKanWa 115.16
U+9F9D kRSUnicode 213.5

There are also characters that actually are not included in the Kang Xi
dictionary**, but the Unihan data contain both a purported Kang Xi radical
and, in addition to that, a _different_ Unicode radical.
(3) The simplified turtle character (commonly assigned to the traditional
radical #213):
U+4E80 kRSKangXi 213.0
U+4E80 kRSUnicode 5.10

(4) Character with the radical #72/73 at the top, i.e. IMHO an arbitrary
decision, but unexpectedly the fields differ:
U+66FB kRSKangXi 72.7
U+66FB kRSUnicode 73.7

- - -

[*] : "Property: kRSUnicode // Description: (...) The first value is
intended to reflect the same radical as the kRSKangXi field and the stroke
count of the glyph used to print the character within the Unicode Standard."

[**] The two characters are missing from the '89 edition of Kang Xi (which
should be the same as used for Unihan) according to search on this site: 

-- 
Adam Nohejl
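[Archive note: the field comparison described in the message above can be made concrete with a short sketch. The parsing follows the documented "radical.residual-strokes" syntax of the kRSKangXi/kRSUnicode fields, where an apostrophe after the radical number marks a simplified radical; the sample values are the ones quoted in the message, hard-coded rather than read from the real Unihan files.]

```python
# Minimal sketch: flag Unihan entries where kRSKangXi and kRSUnicode
# disagree on the radical.  Field values use "radical.residual-strokes"
# syntax; an apostrophe after the radical number marks a simplified radical.
def parse_rs(value: str) -> tuple[int, bool, int]:
    radical, strokes = value.split(".")
    simplified = radical.endswith("'")
    return int(radical.rstrip("'")), simplified, int(strokes)

# The discrepancies quoted in the message above.
unihan = {
    0x6700: {"kRSKangXi": "73.8", "kRSUnicode": "13.10"},
    0x9F9D: {"kRSKangXi": "115.16", "kRSUnicode": "213.5"},
    0x4E80: {"kRSKangXi": "213.0", "kRSUnicode": "5.10"},
    0x66FB: {"kRSKangXi": "72.7", "kRSUnicode": "73.7"},
}

for cp, fields in sorted(unihan.items()):
    kx_rad, _, _ = parse_rs(fields["kRSKangXi"])
    uni_rad, _, _ = parse_rs(fields["kRSUnicode"])
    if kx_rad != uni_rad:
        print(f"U+{cp:04X}: radical {kx_rad} (KangXi) != {uni_rad} (Unicode)")
```

[All four characters above are reported, since each pair of fields names a different radical.]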