block to script

Philippe Verdy via CLDR-Users cldr-users at unicode.org
Wed Feb 14 08:42:10 CST 2018


We were told that blocks cannot be split into units smaller than a single
column of 16 code points: if any one position is assigned to a block, the
remaining code points in that column cannot be assigned to another block.
So unassigned positions in an allocated column should still belong to the
same block, and a default script property may be inferred for them from
that block (which may turn out to be a wrong guess only if that unassigned
position gets assigned a COMMON/INHERITED script).
Note however that some characters (notably currency signs, symbols or
punctuation) are sometimes used across several scripts without necessarily
being given a COMMON/INHERITED script. Most of these symbols are
bidi-neutral and should not form complex ligatures or clusters: this means
you can almost safely assume some properties for the unassigned positions
in these allocated columns (for example to tune the default behavior of a
text rendering engine, if it ever has to render a character that was once
unallocated but finally gets assigned and is found to be mapped in some new
font).
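
As a rough illustration of the column-based guess described above, here is a
minimal Python sketch. It is only a sketch under stated assumptions: the
function names and the `known_scripts` table are hypothetical, the table is
assumed to be built by the caller (for example from the UCD Scripts.txt
file), and the sample entries are illustrative rather than a complete
listing.

```python
from collections import Counter

COLUMN_SIZE = 16  # assumption from the text: blocks are allocated in whole columns of 16


def column_range(cp):
    """Return the range of code points in the 16-wide column containing cp."""
    start = cp & ~0xF
    return range(start, start + COLUMN_SIZE)


def guess_script_from_column(cp, known_scripts, neutral=("Common", "Inherited")):
    """Guess a script for cp from its assigned neighbours in the same column.

    known_scripts is a hypothetical dict {code point: script name} supplied
    by the caller. Returns None when the column offers no usable evidence.
    """
    if cp in known_scripts:
        return known_scripts[cp]
    votes = Counter(
        known_scripts[n]
        for n in column_range(cp)
        if n in known_scripts and known_scripts[n] not in neutral
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Illustrative sample only: a few assigned positions in the column U+0520..U+052F.
sample = {0x0520: "Cyrillic", 0x0521: "Cyrillic", 0x0522: "Cyrillic"}
print(guess_script_from_column(0x052A, sample))
print(guess_script_from_column(0x0530, sample))
```

The second call looks at a different column with no known neighbours, so the
sketch declines to guess instead of returning a default.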

2018-02-14 3:55 GMT+01:00 Asmus Freytag via CLDR-Users <
cldr-users at unicode.org>:

> On 2/13/2018 6:38 PM, Martin Hosken via CLDR-Users wrote:
>
>> Dear All,
>>
>> Is there a way to get from a UBlockCode to a UScriptCode?
>>
>> What? Aargh! No! Surely not! I hear you cry. But hold on a second. What
>> I'm wanting to do is to add some (not perfect) future proofing to my
>> application. When a new character is added to a block in Unicode, one can
>> infer the script of that character, even if the character itself is
>> unknown, from the block. But blocks get split! Yes they do. And this isn't
>> a perfect solution. But block splits are rare, and this solution will give
>> me a much better chance of an unknown character being handled
>> 'appropriately' than being sure that the run break will break and having to
>> wait however long until the next version of Unicode is released, ICU is
>> updated and the application updated to that version of ICU.
>>
>> Hence my question :)
>>
>
> Very simply count all the code points in the block that have a definite
> script assignment that's not COMMON/INHERITED (and not unassigned).
>
> If a single script far outweighs both the COMMON/INHERITED and any other
> scripts, then "guessing" that a new character will end up with that script
> assignment will give you results that are better than "random".
>
> And even if there is a combining mark assigned to a free spot, in many
> cases, whether you treat it as INHERITED or as having the script of its
> base character assigned to it makes no big difference (think script runs in
> a complex script).
>
> Your algorithm will detect symbol and punctuation blocks and can predict
> COMMON as a likely script value.
>
> Best thing is that for each revision, your guesses will get better, that
> is, when you upgrade your application, it will improve not only assigned
> code points but the probabilistic guesses for some of the unassigned ones
> as well.
>
> As long as you are aware that it's a probabilistic gamble, you should be
> fine.
>
> Enjoy,
>
> A./
>
>
>> Yours,
>> Martin
>> _______________________________________________
>> CLDR-Users mailing list
>> CLDR-Users at unicode.org
>> http://unicode.org/mailman/listinfo/cldr-users
>>
>>
>
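
The counting approach quoted above could be sketched in Python as follows.
This is a hedged illustration, not anyone's actual implementation: the
function name `guess_block_script` is hypothetical, the input is assumed to
be the list of script values of the assigned code points in one block, and
the 0.8 dominance threshold is an arbitrary stand-in for "far outweighs".

```python
from collections import Counter


def guess_block_script(scripts, dominance=0.8):
    """Guess the script a newly assigned code point in a block is likely to get.

    scripts: script names of the code points already assigned in the block
             (unassigned positions are simply left out by the caller).
    dominance: assumed share of assigned code points that one script (or
               Common/Inherited taken together) must reach to be trusted.
    Returns a script name, "Common", or None when nothing dominates.
    """
    counts = Counter(scripts)
    total = sum(counts.values())
    if total == 0:
        return None  # empty block: no evidence at all

    # Common and Inherited are pooled, as in the quoted description.
    neutral = counts.pop("Common", 0) + counts.pop("Inherited", 0)
    if neutral / total >= dominance:
        return "Common"  # looks like a symbol or punctuation block

    if counts:
        script, n = counts.most_common(1)[0]
        if n / total >= dominance:
            return script
    return None  # mixed block: better not to guess


print(guess_block_script(["Greek"] * 9 + ["Common"]))
print(guess_block_script(["Common"] * 10 + ["Greek"]))
print(guess_block_script(["Latin"] * 5 + ["Greek"] * 5))
```

Rebuilding the table for each new Unicode version should make the guesses
improve over time, as the quoted message points out, but the result remains
a probabilistic guess.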