block to script

Martin Hosken via CLDR-Users cldr-users at unicode.org
Wed Feb 14 22:15:02 CST 2018


Dear Asmus,

> The probability of block-boundary change is far less than the probability that the "guess" of the future script property for a code point turns out wrong for any of the other possible reasons. Therefore, it disappears in the noise. As long as you are willing to engage in "guessing" in the first place, small changes in probability simply don't matter.
> 
> There's also the question of whether you are better off "guessing" based on code point value alone, or whether it makes more sense to also use the surrounding context. If assembling script runs, for example, an unassigned code point in the middle of a run should have a higher probability of continuing the run when it is also in (one of) the blocks that cover the script, but unless the remainder of text is marked by large script variability, that probability should normally already be high.

This is true for the complexities of Latin/Common, but when it comes to non-Roman scripts, things become a lot clearer.
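
To make the context idea from the quoted paragraph concrete, here is a rough sketch of one way a run assembler might combine the run's script with a block-based guess. It is only an illustration under assumptions: Script, the SC_* values and resolve_in_run are hypothetical stand-ins (not ICU names), and the rule "a block guess that names a different script breaks the run" is one possible policy, not something stated in this thread.

```c
/* Hypothetical stand-in for ICU's UScriptCode, reduced to a few values. */
typedef enum { SC_COMMON, SC_INHERITED, SC_UNKNOWN, SC_KHMER, SC_THAI } Script;

/*
 * run:         script of the run being assembled
 * actual:      script the library reports for the next code point
 *              (SC_UNKNOWN is assumed to mean "unassigned")
 * block_guess: script guessed from the code point's block
 * Returns the script to treat the next code point as.
 */
Script resolve_in_run(Script run, Script actual, Script block_guess)
{
    /* Neutral characters simply join the current run. */
    if (actual == SC_COMMON || actual == SC_INHERITED)
        return run;
    /* A known script always wins over any guess. */
    if (actual != SC_UNKNOWN)
        return actual;
    /* Unassigned, and the block gives no usable hint: continue the run. */
    if (block_guess == SC_UNKNOWN || block_guess == SC_COMMON ||
        block_guess == SC_INHERITED)
        return run;
    /* Unassigned with a block hint: same script continues the run,
     * a different script starts a new one (assumed policy). */
    return block_guess;
}
```

With this policy an unassigned code point in a Khmer run stays Khmer unless its block points clearly at some other script.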

> Whether it's worth making all these guesses is questionable, but I'm willing to go along and assume that some credible scenarios might exist.

For the use cases that do call for this, things are much more clear cut.

Here is my take on a block to script list. As you can see:

1. There are many (34%) full blocks, for which any value, or none, is just fine.
2. UNKNOWN vs SYMBOLS vs COMMON is ambiguous, and I've made a best guess (30%).

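#include "unicode/uscript.h"

// FULL marks a full block: no guess is needed there, so any value would do.
#define USCRIPT_FULL USCRIPT_INVALID_CODE
#define USCRIPT_MATH USCRIPT_MATHEMATICAL_NOTATION
// Shorthand so the table entries can drop the USCRIPT_ prefix.
#define _(x) USCRIPT_##x

// Guessed script for each block, indexed by UBlockCode.
UScriptCode block_script[] = {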
    _(INVALID_CODE), _(FULL), _(FULL),  _(FULL),    _(FULL),    _(FULL),    _(FULL),    _(FULL),
    _(GREEK),   _(FULL),    _(ARMENIAN), _(HEBREW), _(ARABIC),  _(SYRIAC),  _(THAANA),  _(FULL),
    _(BENGALI), _(GURMUKHI), _(GUJARATI), _(ORIYA), _(TAMIL),   _(TELUGU),  _(KANNADA), _(MALAYALAM),
    _(SINHALA), _(THAI),    _(LAO),     _(TIBETAN), _(FULL),    _(GEORGIAN), _(HANGUL), _(ETHIOPIC),
    _(CHEROKEE), _(UCAS),   _(OGHAM),   _(RUNIC),   _(KHMER),   _(MONGOLIAN), _(FULL),  _(GREEK),
    _(COMMON),  _(COMMON),  _(COMMON),  _(INHERITED), _(FULL),  _(UNKNOWN), _(FULL),    _(FULL),
    _(FULL),    _(UNKNOWN), _(COMMON),  _(FULL),    _(FULL),    _(FULL),    _(FULL),    _(FULL),
    _(FULL),    _(FULL),    _(HAN),     _(HAN),     _(HAN),     _(FULL),    _(KATAKANA_OR_HIRAGANA), _(FULL),
    _(BOPOMOFO), _(HANGUL), _(FULL),    _(BOPOMOFO), _(HAN),    _(FULL),    _(HAN),     _(HAN),
    _(YI),      _(YI),      _(HANGUL),  _(UNKNOWN), _(UNKNOWN), _(UNKNOWN), _(UNKNOWN), _(HAN),
    _(UNKNOWN), _(ARABIC),  _(FULL),    _(FULL),    _(COMMON),  _(ARABIC),  _(UNKNOWN), _(UNKNOWN),
// Unicode 3.1
    _(OLD_ITALIC), _(GOTHIC), _(DESERET), _(SYMBOLS), _(SYMBOLS), _(MATH),  _(HAN),     _(HAN),
    _(UNKNOWN), _(FULL),    _(TAGALOG), _(HANUNOO), _(BUHID),   _(TAGBANWA), _(FULL),   _(FULL),
    _(FULL),    _(FULL),    _(FULL),    _(FULL),    _(FULL),    _(UNKNOWN), _(UNKNOWN), _(LIMBU),
// Unicode 4
    _(TAI_LE),  _(KHMER),   _(FULL),    _(SYMBOLS), _(FULL),    _(LINEAR_B), _(LINEAR_B), _(UNKNOWN),
    _(UGARITIC), _(FULL),   _(OSMANYA), _(CYPRIOT), _(UNKNOWN), _(FULL),    _(UNKNOWN), _(UNKNOWN),
    _(FULL),    _(BUGINESE), _(HAN),    _(INHERITED), _(COPTIC), _(ETHIOPIC), _(ETHIOPIC), _(GEORGIAN),
    _(GLAGOLITIC), _(KHAROSHTHI), _(FULL), _(NEW_TAI_LUE), _(OLD_PERSIAN), _(FULL), _(UNKNOWN), _(SYLOTI_NAGRI),
    _(TIFINAGH), _(UNKNOWN), _(NKO),    _(BALINESE), _(FULL),   _(FULL),    _(PHAGS_PA), _(PHOENICIAN),
    _(CUNEIFORM), _(CUNEIFORM), _(UNKNOWN), _(SUNDANESE), _(LEPCHA), _(OL_CHIKI), _(FULL), _(VAI),
    _(FULL),    _(SAURASHTRA), _(FULL), _(REJANG),  _(CHAM),    _(UNKNOWN), _(UNKNOWN), _(LYCIAN),
    _(CARIAN),  _(LYDIAN),  _(SYMBOLS), _(SYMBOLS), _(SAMARITAN), _(UCAS),  _(LANNA),   _(DEVANAGARI),
    _(FULL),    _(BAMUM),   _(DEVANAGARI), _(DEVANAGARI), _(HANGUL), _(JAVANESE), _(FULL), _(TAI_VIET),
    _(MEITEI_MAYEK), _(HANGUL), _(IMPERIAL_ARAMAIC), _(FULL), _(AVESTAN), _(INSCRIPTIONAL_PARTHIAN), _(INSCRIPTIONAL_PAHLAVI), _(ORKHON),
    _(UNKNOWN), _(KAITHI),  _(EGYPTIAN_HIEROGLYPHS), _(UNKNOWN), _(HAN), _(HAN), _(MANDAIC), _(BATAK),
    _(ETHIOPIC), _(BRAHMI), _(BAMUM),   _(KATAKANA_OR_HIRAGANA), _(SYMBOLS), _(SYMBOLS), _(SYMBOLS), _(SYMBOLS),
    _(SYMBOLS), _(HAN),     _(ARABIC),  _(SYMBOLS), _(CHAKMA),  _(MEITEI_MAYEK), _(MEROITIC_CURSIVE), _(FULL),
    _(MIAO),    _(SHARADA), _(SORA_SOMPENG), _(SUNDANESE), _(TAKRI), _(BASSA_VAH), _(CAUCASIAN_ALBANIAN), _(COPTIC),
    _(INHERITED), _(DUPLOYAN_SHORTAND), _(ELBASAN), _(SYMBOLS), _(GRANTHA), _(KHOJKI), _(KHUDAWADI), _(LATIN),
    _(LINEAR_A), _(MAHAJANI), _(MANICHAEAN), _(MENDE), _(MODI), _(MRO),     _(MYANMAR), _(NABATAEAN),
    _(FULL),    _(OLD_PERMIC), _(SYMBOLS), _(PAHAWH_HMONG), _(FULL), _(PAU_CIN_HAU), _(PSALTER_PAHLAVI), _(COMMON),
    _(SIDDHAM), _(SINHALA), _(SYMBOLS), _(TIRHUTA), _(WARANG_CITI)
};
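
A lookup over a table like the one above might be sketched as follows. This is an illustration under stated assumptions, not a tested ICU integration: Script, the SC_* values and guess_script are hypothetical stand-ins so that the fragment compiles without ICU. In a real application the actual argument would presumably come from uscript_getScript() and block from ublock_getCode(), and the sketch assumes the library reports UNKNOWN for a code point it has no script for.

```c
#include <stddef.h>

/* Hypothetical stand-in for ICU's UScriptCode, reduced to what is needed. */
typedef enum {
    SC_INVALID = -1,  /* plays the role of USCRIPT_INVALID_CODE (and FULL) */
    SC_COMMON,
    SC_TELUGU,
    SC_UNKNOWN        /* plays the role of USCRIPT_UNKNOWN */
} Script;

/*
 * actual:     script the library reports for the code point
 * block:      block index of the code point
 * table, len: block-to-script table in the style of block_script[]
 * Returns the real script when there is one, else the block-based guess,
 * else SC_UNKNOWN when the table has nothing to offer.
 */
Script guess_script(Script actual, int block, const Script *table, size_t len)
{
    if (actual != SC_UNKNOWN)
        return actual;              /* a known script always wins */
    if (block < 0 || (size_t)block >= len)
        return SC_UNKNOWN;          /* block is newer than the table */
    if (table[block] == SC_INVALID)
        return SC_UNKNOWN;          /* FULL or no-block entry: nothing to guess */
    return table[block];
}
```

The bounds check matters because a newer library may report block codes past the end of a table that was written against an older Unicode version.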


> 
> A./
> 
> On 2/14/2018 6:42 AM, Philippe Verdy via CLDR-Users wrote:
> We were told the blocks cannot be split to smaller units than a single column of 16 codepoints: if any one position is assigned to a block, the remaining codepoints in that column cannot be assigned to another block...
> > So unassigned positions in an allocated column should still belong to the same block and may infer a default script property from that block (which may turn out to be a wrong guess only if that unassigned position gets assigned a COMMON/INHERITED script).
> > Note however that some characters (notably currency signs, symbols or punctuation) sometimes get used across several scripts without necessarily being given a COMMON/INHERITED script. Most of these symbols are bidi-neutral and should not form complex ligatures or clusters: it means you can almost safely assume some properties for the unassigned positions in these allocated columns (for example to tune the default behavior of a text rendering engine, if it ever has to render a character that was once unallocated but finally gets assigned and is found to be mapped in some new font).
> > 
> > 2018-02-14 3:55 GMT+01:00 Asmus Freytag via CLDR-Users <cldr-users at unicode.org>:
> > On 2/13/2018 6:38 PM, Martin Hosken via CLDR-Users wrote:
> >> Dear All,
> >>> 
> >>> Is there a way to get from a UBlockCode to a UScriptCode?
> >>> 
> >>> What? Aargh! No! Surely not! I hear you cry. But hold on a second. What I'm wanting to do is to add some (not perfect) future proofing to my application. When a new character is added to a block in Unicode, one can infer the script of that character, even if the character itself is unknown, from the block. But blocks get split! Yes they do. And this isn't a perfect solution. But block splits are rare, and this solution will give me a much better chance of an unknown character being handled 'appropriately' than being sure that the run break will break and having to wait however long until the next version of Unicode is released, ICU is updated and the application updated to that version of ICU.
> >>> 
> >>> Hence my question :)
> >>> 
> >> Very simply count all the code points in the block that have a definite script assignment that's not COMMON/INHERITED (and not unassigned).
> >> 
> >> If a single script far outweighs both the COMMON/INHERITED and any other scripts, then "guessing" that a new character will end up with that script assignment will give you results that are better than "random".
> >> 
> >> And even if there is a combining mark assigned to a free spot, in many cases, whether you treat it as INHERITED or as having the script of its base character assigned to it makes no big difference (think script runs in a complex script).
> >> 
> >> Your algorithm will detect symbol and punctuation blocks and can predict COMMON as a likely script value.
> >> 
> >> Best thing is that for each revision, your guesses will get better, that is, when you upgrade your application, it will improve not only assigned code points but the probabilistic guesses for some of the unassigned ones as well.
> >> 
> >> As long as you are aware that it's a probabilistic gamble, you should be fine.
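
The counting approach described in the quoted advice above could look roughly like this. It is a sketch under assumptions: Script, the SC_* values and dominant_script are hypothetical names, the caller is assumed to have filled the array with one script value per code point of a block (for instance from uscript_getScript()), and the ratio parameter is an invented stand-in for "far outweighs", which the thread does not quantify.

```c
#include <stddef.h>

/* Hypothetical stand-in for ICU's UScriptCode, reduced to a few values. */
typedef enum {
    SC_COMMON, SC_INHERITED, SC_UNKNOWN,  /* UNKNOWN marks unassigned here */
    SC_TELUGU, SC_KANNADA,                /* two sample "real" scripts */
    SC_LIMIT
} Script;

/*
 * scripts[0..n-1] holds the script of every code point in one block.
 * ratio is an assumed threshold for "far outweighs" (for example 4).
 */
Script dominant_script(const Script *scripts, size_t n, size_t ratio)
{
    size_t counts[SC_LIMIT] = {0};
    size_t i, neutral, real_total = 0, best_n = 0;
    Script best = SC_UNKNOWN;
    int s;

    /* Tally every code point by script, ignoring out-of-range values. */
    for (i = 0; i < n; i++) {
        int v = (int)scripts[i];
        if (v >= 0 && v < SC_LIMIT)
            counts[v]++;
    }

    neutral = counts[SC_COMMON] + counts[SC_INHERITED];

    /* Find the most frequent real script; unassigned slots are skipped. */
    for (s = SC_UNKNOWN + 1; s < SC_LIMIT; s++) {
        real_total += counts[s];
        if (counts[s] > best_n) {
            best_n = counts[s];
            best = (Script)s;
        }
    }

    /* One script far outweighs COMMON/INHERITED plus all other scripts. */
    if (best_n > 0 && best_n >= ratio * (neutral + real_total - best_n))
        return best;
    /* Symbol or punctuation block: neutral code points dominate. */
    if (neutral > 0 && neutral >= ratio * real_total)
        return SC_COMMON;
    return SC_UNKNOWN;  /* no clear winner, so do not guess */
}
```

Rerunning such a count against each new Unicode release is what would let the guesses improve over time, as the quoted message suggests.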
> >> 
> >> Enjoy,
> >> 
> >> A./
> >> 
> >> 
> >>> Yours,
> >>> Martin
> >>> _______________________________________________
> >>> CLDR-Users mailing list
> >>> CLDR-Users at unicode.org
> >>> http://unicode.org/mailman/listinfo/cldr-users
> >>> 
> >>> 
> 
> 


