metric for block coverage

Tue Feb 27 09:36:55 CST 2018

You haven't clarified what exactly the usage is; you've only asked what it means to cover a script.

James Kass mentioned a font's OS/2 table. That approach is obsolete: as Khaled pointed out, there has never been a clear definition of "supported", and practice has been inconsistent. Moreover, the available bits were exhausted after Unicode 5.2, and we're now working on Unicode 11. Both Apple and Microsoft have started to use 'dlng' and 'slng' values in the 'meta' table of OpenType fonts to convey what a font can support and what it is designed to support, a distinction that the OS/2 table never allowed for but that is actually more useful. (I'd also point out that, in the upcoming Windows 10 feature update, the 'dlng' entries in fonts are used to determine what preview strings to use in the Fonts settings UI.) For scripts like Latin that have a large set of characters, most of which are used infrequently, there can still be a challenge in characterizing the font, but the mechanism does provide flexibility in what is declared.
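To make the 'dlng'/'slng' idea concrete, here is a minimal sketch of how such a value could be read once it has been extracted from a font. It assumes the value is an ASCII string of comma-separated tags in which the script is the four-letter subtag (for example "Latn, Grek, Cyrl" or "sr-Cyrl"); the function names are hypothetical and not part of any library.

```python
from typing import List, Optional


def parse_script_lang_tags(value: str) -> List[str]:
    """Split a 'dlng' or 'slng' string into individual tags, ignoring spaces."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def script_of(tag: str) -> Optional[str]:
    """Return the first four-letter alphabetic subtag, assumed to be the script."""
    for subtag in tag.split("-"):
        if len(subtag) == 4 and subtag.isalpha():
            return subtag.title()
    return None


def designed_for(dlng: str, script: str) -> bool:
    """True if the given script code appears among the declared tags."""
    declared = {script_of(tag) for tag in parse_script_lang_tags(dlng)}
    return script in declared


# Hypothetical 'dlng' value, not taken from a real font.
example_dlng = "en-Latn, Grek,sr-Cyrl"
tags = parse_script_lang_tags(example_dlng)
```

A font could list "Latn" in 'slng' but leave it out of 'dlng', which is exactly the "can support" versus "designed to support" distinction.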

But again, you haven't said that what data to put into fonts is your issue. If you are trying to determine whether a given font supports a particular language, the OS/2 and 'meta' tables provide heuristics, with 'meta' being recommended; but the only way to know with absolute certainty is to compare an exemplar character list for the particular language with the font's cmap table. But note that this can only tell you that a font _is able to support_ the language, which doesn't necessarily imply that it's actually a good choice for users of that language. For example, every font in Windows includes Basic Latin characters, but that definitely doesn't mean that all of those fonts are useful for an English speaker. This is why the 'dlng' entry in the 'meta' table was created.
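The exemplar-versus-cmap comparison can be sketched roughly as follows. This assumes the font's cmap has already been reduced to a set of integer code points (with fontTools, the keys of `TTFont(path).getBestCmap()` would be one way to get that) and that every exemplar entry is a single code point; the Polish list below is illustrative only and is not copied from CLDR.

```python
def missing_for_language(exemplar_chars, font_codepoints):
    """Return the exemplar characters that have no cmap entry, sorted by code point."""
    return sorted(ch for ch in set(exemplar_chars) if ord(ch) not in font_codepoints)


def can_support(exemplar_chars, font_codepoints):
    """True only if every exemplar character is mapped by the font."""
    return not missing_for_language(exemplar_chars, font_codepoints)


# Illustrative data: a lowercase Polish alphabet and a font mapping only printable ASCII.
polish_exemplar = "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"
ascii_only_font = set(range(0x20, 0x7F))

missing = missing_for_language(polish_exemplar, ascii_only_font)
```

An empty result only says the font is able to render the language, not that it was designed for it.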

Peter

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Adam Borowski via Unicode
Sent: Saturday, February 17, 2018 2:18 PM
To: unicode at unicode.org
Subject: metric for block coverage

Hi!
As a part of Debian fonts team work, we're trying to improve fonts review:
ways to organize them, add metadata, pick which fonts are installed by default and/or recommended to users, etc.

I'm looking for a way to determine a font's coverage of available scripts. 
It's probably reasonable to do this per Unicode block.  Also, it's a safe assumption that a font which doesn't know a codepoint can do no complex shaping of such a glyph, thus looking at just codepoints should be adequate for our purposes.

A naïve way would be to count codepoints present in the font vs the number of all codepoints in the block.  Alas, there's way too much chaff for such an approach to be reasonable: þ or ą count the same as LATIN TURNED CAPITAL LETTER SAMPI WITH HORNS AND TAIL WITH SMALL LETTER X WITH CARON.
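For reference, the naïve ratio could be computed along these lines. This is only a sketch: it assumes the font's codepoints are available as a set of integers, that block boundaries are supplied by the caller, and that "all codepoints in the block" means those the local Python `unicodedata` database considers assigned.

```python
import unicodedata


def assigned_in_block(start, end):
    """Code points in [start, end] that are assigned in Python's Unicode database."""
    return {cp for cp in range(start, end + 1)
            if unicodedata.category(chr(cp)) != "Cn"}


def naive_block_coverage(font_codepoints, start, end):
    """Fraction of the block's assigned code points that the font maps."""
    assigned = assigned_in_block(start, end)
    if not assigned:
        return 0.0
    return len(assigned & set(font_codepoints)) / len(assigned)


# Hypothetical font that maps only printable ASCII.
font = set(range(0x20, 0x7F))
basic_latin = naive_block_coverage(font, 0x0000, 0x007F)
latin_ext_a = naive_block_coverage(font, 0x0100, 0x017F)
```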

Another idea would be giving every codepoint a weight equal to the number of languages which currently use such a letter.
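As a sketch of that weighting, assuming some table mapping each codepoint to the number of languages using it (the counts below are invented for illustration, not real data):

```python
def weighted_coverage(font_codepoints, weights):
    """Share of the total weight carried by code points the font maps."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    covered = sum(w for cp, w in weights.items() if cp in font_codepoints)
    return covered / total


# Invented weights: number of languages assumed to use each letter.
weights = {
    ord("þ"): 3,
    ord("ą"): 2,
    0xA7AE: 0,  # a rarely used letter, given zero weight here
}

full = weighted_coverage({ord("þ"), ord("ą")}, weights)
partial = weighted_coverage({ord("þ")}, weights)
```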

Too bad, that wouldn't work for symbols, or for dead scripts: a good runic font will have complete coverage of Elder Futhark, Anglo-Saxon, Younger Futhark and medieval runes, while only a completionist would care about the Franks Casket or Tolkien's inventions.

I don't think I'm the first to have this question.  Any suggestions?

ᛗᛖᛟᚹ!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄⠀⠀⠀⠀ A master species delegates.