metric for block coverage

Thu Mar 8 08:18:19 CST 2018

Hi !

    I’ll just add two points to those already raised in the 
previous conversation about block coverage:

On 17/02/2018 at 23:18, Adam Borowski via Unicode wrote:
> Hi!
> As a part of Debian fonts team work, we're trying to improve fonts review:
> ways to organize them, add metadata, pick which fonts are installed by
> default and/or recommended to users, etc.
>
> I'm looking for a way to determine a font's coverage of available scripts.
> It's probably reasonable to do this per Unicode block.  [...]
>
> A naïve way would be to count codepoints present in the font vs the number
> of all codepoints in the block.  Alas, there's way too much chaff for such
> an approach to be reasonable: þ or ą count the same as LATIN TURNED CAPITAL
> LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH CARON.
A slightly less naïve way would be to take into account when the code 
points were added to Unicode, with the rough idea that the most widely 
used characters were added first. It also has the nice feature that 
the metric is less ambiguous for blocks which are not yet complete.

For example, if you have 100% coverage of Armenian for Unicode 10.0 
(which I’ll call Armenian10.0 for short), it only implies a coverage 
of 89/91=97.8% of Armenian11.0, which will see the addition of two 
characters used in Armenian dialectology (ARMENIAN SMALL LETTER TURNED 
AYB and ARMENIAN SMALL LETTER YI WITH STROKE).
If you look at the history of the Armenian block (e.g. here 
https://en.wikipedia.org/wiki/Armenian_(Unicode_block)), 
most (84) characters were added in 1.0, a ligature was added in 1.1, 
ARMENIAN HYPHEN was added in 3.0, a currency symbol in 6.1, two 
decorative symbols in 7.0, and two characters used in dialectology are 
planned for 11.0. I guess this roughly corresponds to a ranking of the 
characters from the most used to the least used.
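
To make the idea concrete, here is a minimal Python sketch of such an 
age-based metric. It is only an illustration under stated assumptions: 
the names parse_ages, coverage_by_version and SAMPLE are mine, the 
sample is hand-written in the layout of the UCD file DerivedAge.txt 
rather than copied from it, and the "font" is a hypothetical one that 
covers exactly Armenian10.0. Since DerivedAge.txt starts at age 1.1, 
the 84 earliest characters and the ligature end up in the same bucket.

```python
# Hand-written sample in the layout of DerivedAge.txt, limited to the
# Armenian block. Ages start at 1.1 in that file, so everything from
# 1.0 and 1.1 shares one bucket (85 characters here).
SAMPLE = """\
0531..0556 ; 1.1
0559..055F ; 1.1
0561..0587 ; 1.1
0589       ; 1.1
058A       ; 3.0
058F       ; 6.1
058D..058E ; 7.0
0560       ; 11.0
0588       ; 11.0
"""

def parse_ages(text):
    """Map each code point to the (major, minor) version that added it."""
    ages = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        cps, version = (part.strip() for part in line.split(";"))
        first, _, last = cps.partition("..")
        major, minor = (int(n) for n in version.split("."))
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            ages[cp] = (major, minor)
    return ages

def coverage_by_version(ages, font_cps):
    """For each version V, return (covered, total) over the characters
    of the block whose age is <= V."""
    result = {}
    for v in sorted(set(ages.values())):
        block = {cp for cp, age in ages.items() if age <= v}
        result[v] = (len(block & font_cps), len(block))
    return result

ages = parse_ages(SAMPLE)
# Hypothetical font covering every Armenian character of Unicode 10.0.
font = {cp for cp, age in ages.items() if age <= (10, 0)}
for version, (covered, total) in coverage_by_version(ages, font).items():
    print(version, f"{covered}/{total} = {covered / total:.1%}")
```

For this hypothetical font every line reports full coverage except the 
last one, (11, 0), which gives 89/91 = 97.8%. With a real font, 
font_cps would come from the font's character map instead.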

To take your examples, both þ and ą have been in Unicode since 1.1 
(and, I guess, 1.0), while LATIN TURNED CAPITAL 
LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH CARON is not 
encoded yet, so they are not the same according to this metric...  To 
see what this means for another Latin example, you can look at the 
Latin Extended-D block (history here 
https://en.wikipedia.org/wiki/Latin_Extended-D ), with new characters 
in 5.0, 5.1, 6.1, 7.0, 8.0, 9.0, some accepted for 11.0 (SMALL CAPITAL 
Q, CAPITAL/SMALL LETTER U WITH STROKE), and 15 more for later versions 
(for Egyptology, Assyriology, medieval English and historical Pinyin).

Of course, this measure is only rough. A counterexample is in the 
Currency Symbols block, where € U+20AC EURO SIGN (in Unicode since 2.1) 
is much more used than ₣ U+20A3 FRENCH FRANC SIGN, which has been 
encoded since Unicode 1.1 (1.0?) but which I have never seen, despite 
living in France for more than four decades.
> [...]

> I don't think I'm the first to have this question.  Any suggestions?

For the Han (CJK) script, the IRG (Ideographic Rapporteur Group) defined 
a set of fewer than 10k essential Han characters, IICore (International 
Ideographs Core, 
https://en.wikipedia.org/wiki/International_Ideographs_Core). It is 
recorded in the Unihan database, in the kIICore field of the 
Unihan_IRGSources.txt file (https://www.unicode.org/reports/tr38/#kIICore ). 
This field also includes a letter (A, B or C) indicating a priority value 
and some regional information. For Unicode 10.0, a simple grep tells that 
there are 9810 IICore characters, 7772 of which have priority A, 417 
priority B and 1621 priority C.

Note that IICore has been stable (as version 2.2) since 2004. Ken 
Lunde, from Adobe, has recently proposed an update to it 
(https://www.unicode.org/L2/L2018/18066-iicore-changes.pdf), but it 
only affects the region tags, not the priorities or the list of 
characters. However, reading Ken Lunde's associated blog post, it 
seems a few characters could be added to IICore in the future.

    Cheers,

             French