metric for block coverage
Frédéric Grosshans via Unicode
unicode at unicode.org
Thu Mar 8 08:18:19 CST 2018
I’ll just add two points to the various points raised in the
previous conversation about block coverage:
On 17/02/2018 at 23:18, Adam Borowski via Unicode wrote:
> As a part of Debian fonts team work, we're trying to improve fonts review:
> ways to organize them, add metadata, pick which fonts are installed by
> default and/or recommended to users, etc.
> I'm looking for a way to determine a font's coverage of available scripts.
> It's probably reasonable to do this per Unicode block. [...]
> A naïve way would be to count codepoints present in the font vs the number
> of all codepoints in the block. Alas, there's way too much chaff for such
> an approach to be reasonable: þ or ą count the same as LATIN TURNED CAPITAL
> LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH CARON.
A slightly less naïve way would be to take into account when the code points
were added to Unicode, with the rough idea that the most widely used
characters were added first. It also adds the nice feature that this
metric is less ambiguous for blocks which are not yet complete.
For example, if you have a 100% coverage of
Armenian for Unicode 10.0 (which I’ll call Armenian10.0 for short), it
only implies a coverage of 89/91=97.8% of Armenian11.0, which will see
the addition of two characters used in Armenian dialectology (ARMENIAN
SMALL LETTER TURNED AYB and YI WITH STROKE).
If you look at the history of Armenian Block (e.g. here
Most (84) characters were added in 1.0, a ligature was added in 1.1,
ARMENIAN HYPHEN was added in 3.0, a currency symbol in 6.1, two
decorative symbols in 7.0 and two characters used in dialectology are
planned for 11.0. I guess this roughly corresponds to a ranking of the
characters from the most used to the least used.
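As a rough illustration of this age-based metric, here is a minimal Python sketch. It assumes you already have a mapping from each code point of a block to the Unicode version that introduced it (for instance parsed from the UCD file DerivedAge.txt) and the set of code points the font covers. The function name, the toy block and the toy font below are all hypothetical, made up only for illustration.

```python
from collections import defaultdict


def cumulative_coverage(font_cps, ages):
    """Return cumulative coverage of a block, version by version.

    font_cps: set of code points present in the font.
    ages: dict mapping each code point of the block to the
          (major, minor) Unicode version that introduced it.
    Result: list of (version, covered_so_far, total_so_far),
            sorted by version.
    """
    per_version = defaultdict(lambda: [0, 0])  # version -> [covered, total]
    for cp, version in ages.items():
        per_version[version][1] += 1
        if cp in font_cps:
            per_version[version][0] += 1

    rows = []
    covered = total = 0
    for version in sorted(per_version):
        c, t = per_version[version]
        covered += c
        total += t
        rows.append((version, covered, total))
    return rows


# Hypothetical toy block (NOT real Unicode data): four characters from 1.1,
# one from 3.0 and two from 11.0.
toy_ages = {
    0xE000: (1, 1), 0xE001: (1, 1), 0xE002: (1, 1), 0xE003: (1, 1),
    0xE004: (3, 0),
    0xE005: (11, 0), 0xE006: (11, 0),
}
# Hypothetical font covering everything except the 11.0 additions.
toy_font = {0xE000, 0xE001, 0xE002, 0xE003, 0xE004}

rows = cumulative_coverage(toy_font, toy_ages)
for version, covered, total in rows:
    print(f"{version[0]}.{version[1]}: {covered}/{total} = {covered / total:.1%}")
```

Reading the rows from the top, a font that is complete for the early versions and only drops at the latest one is probably missing just the rarest characters, which is the situation in the Armenian example above.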
To take your examples, both þ and ą have been in Unicode since 1.1 (and, I
guess, 1.0), while LATIN TURNED CAPITAL
LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH CARON is not
encoded yet, so they are not the same according to this metric... To
see what this means for other Latin examples, you can look at the Latin
Extended-D block (history here
https://en.wikipedia.org/wiki/Latin_Extended-D ), with new characters in
5.0, 5.1, 6.1, 7.0, 8.0, 9.0, some accepted for 11.0 (SMALL CAPITAL
Q, CAPITAL/SMALL LETTER U WITH STROKE), and more later (15, for Egyptology,
Assyriology, medieval English and historical Pinyin).
Of course, this measure is only rough. A counterexample is in the
Currency Symbols block, where € U+20AC EURO SIGN (in Unicode since 2.1)
is much more used than ₣ U+20A3 FRENCH FRANC SIGN, encoded since Unicode
1.1 (1.0?), which I have never seen, despite living in France for more than
> I don't think I'm the first to have this question. Any suggestions?
For the Han (CJK) script, the IRG (Ideographic Rapporteur Group) defined
a set of fewer than 10k essential Han characters, IICore (International
Ideographs Core,
https://en.wikipedia.org/wiki/International_Ideographs_Core). This is
described in the Unihan database in the Unihan_IRGSources.txt file,
kIICore field (https://www.unicode.org/reports/tr38/#kIICore ). This
field also includes a letter (A, B or C) indicating a priority value and
some regional information. For Unicode 10.0, a simple grep shows that
there are 9810 IICore characters, 7772 of which are priority A, 417
priority B and 1621 priority C.
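For those who prefer a script to grep, the same count can be sketched in Python. This assumes the usual Unihan layout of tab-separated lines (code point, field name, value) and assumes that the priority letter is the first character of the kIICore value; check UTR #38 for the exact syntax in your version of the file. The function name is hypothetical and the sample lines are made-up values in that layout, not actual Unihan data.

```python
from collections import Counter


def count_iicore_priorities(lines):
    """Count kIICore entries per priority letter (A, B or C).

    lines: iterable of text lines in the Unihan tab-separated layout
           "U+XXXX<TAB>field<TAB>value"; comment lines start with '#'.
    Assumption: the first character of the kIICore value is the priority.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) == 3 and fields[1] == "kIICore":
            counts[fields[2][0]] += 1
    return counts


# Made-up sample lines for illustration only (not real Unihan values).
sample = [
    "# sample comment line",
    "U+4E00\tkIICore\tAGTJ",
    "U+4E00\tkIRG_GSource\tG0-0000",
    "U+4E01\tkIICore\tBGT",
    "U+4E02\tkIICore\tCH",
]

counts = count_iicore_priorities(sample)
print(dict(counts), "total:", sum(counts.values()))
```

On the real file, one would pass an open file object instead of the sample list, for example open("Unihan_IRGSources.txt", encoding="utf-8").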
Note that IICore has been stable (as version 2.2) since 2004. Ken
Lunde, from Adobe, has recently proposed an update to it
(https://www.unicode.org/L2/L2018/18066-iicore-changes.pdf), but only to
the region tags, not to the priorities or the list of
characters. However, reading Ken Lunde's associated blog post, it
seems a few characters could be added to IICore in the future.