"textels"

Eric Muller eric.muller at efele.net
Fri Sep 16 10:47:27 CDT 2016


On 9/16/2016 8:30 AM, Janusz S. Bień wrote:
> Quote - Eric Muller <eric.muller at efele.net> (Fri, 16 Sep 2016, 
> 17:03:54):
>
>> On 9/16/2016 6:52 AM, Janusz S. Bień wrote:
>>> (when working on a corpus of historical Polish we
>>> noticed some cases where standard Unicode equivalence was not
>>> convenient).
>>
>> I'm very interested to know more about those cases.
>
> For our search engine we were unable to use compatibility equivalence 
> "out of the box" for splitting the ligature because it also converted 
> long s to short s while we wanted to preserve the distinction.

I am interested in the problems with *canonical* equivalence. I thought 
that you were talking about those before.

Compatibility equivalence is a completely different beast. It is, IMHO, 
too coarse a tool and best forgotten. For any particular task, it's 
typically doing too much (e.g. long/short s folding in your case) and 
too little (not everything you need). There was an attempt at improving 
the situation by providing a whole bunch of fine-grained, targeted 
transformations (http://www.unicode.org/reports/tr30/), but that did not 
pan out.
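
To make the "too much" point concrete, here is a minimal Python sketch 
using only the standard unicodedata module. The function name 
split_latin_ligatures is made up for illustration, and restricting it to 
the Latin ligature block U+FB00..U+FB06 is an assumption about what the 
search engine needs, not a description of what was actually deployed. It 
expands each ligature by one level of its compatibility mapping, so a 
long s (U+017F) is left alone, whereas NFKC folds it to a short s.

```python
import unicodedata

def split_latin_ligatures(text):
    """Expand only the Latin ligatures U+FB00..U+FB06, one level deep.

    Hypothetical helper: every other character, including long s
    (U+017F), is passed through unchanged.
    """
    out = []
    for ch in text:
        if "\uFB00" <= ch <= "\uFB06":
            # decomposition() returns a string such as "<compat> 0066 0069";
            # drop the tag and turn the hex code points back into characters.
            fields = unicodedata.decomposition(ch).split()
            out.append("".join(chr(int(cp, 16)) for cp in fields[1:]))
        else:
            out.append(ch)
    return "".join(out)

sample = "\uFB01\u017F"  # fi ligature followed by long s

# Full compatibility normalization splits the ligature but also folds long s.
nfkc = unicodedata.normalize("NFKC", sample)

# The targeted expansion splits the ligature and keeps the long s.
targeted = split_latin_ligatures(sample)
```

With this input, nfkc is "fis" while targeted is "fi" followed by 
U+017F. The one-level expansion also matters for U+FB05 (the long s t 
ligature), whose mapping itself contains a long s.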

Eric.





