Usage stats?

Michael Norton michaelanortonster at
Fri Mar 27 14:57:57 CDT 2015

Why wouldn't Unicode itself have it?

On Fri, Mar 27, 2015 at 1:07 PM, Ken Whistler <kenwhistler at> wrote:

> Search engine companies (and in particular, Google) have such
> information squirreled away in their index databases, at least as
> far as usage stats for Unicode characters on the web go -- but it
> is proprietary information, and they generally don't publish
> information about such statistics.
>
> Perhaps there are researchers out there who have set web crawlers
> on a mission to generate such web statistics for publication, and maybe
> somebody on this list knows of such research -- but it would be
> virtually impossible to generate such information for the much
> wider collection of documents and data that are not easily accessible
> for web indexing (behind password walls, in PDF document archives,
> in proprietary databases, ...). As an example of why this is a problem,
> consider the fact that there are *peta*bytes of information picked up
> and stored in databases from scanners and other devices used at
> tens of millions of retail points of sale. Such data, by its nature, would
> tend to skew heavily towards use of ASCII a-z and digits 0-9 in its
> character data. How would you end up weighting such (mostly
> publicly inaccessible) data in trying to count up for overall statistics
> on character use?
>
> There are more traditional usage count studies that focus on
> counts of character frequency within single language orthographies
> in single scripts (e.g., letter frequencies for French text), but I don't
> think that is what you were asking about.
>
> Here is some discussion of a similar question posted on stackoverflow:
> character-usage-statistics
> --Ken
> On 3/27/2015 9:31 AM, Michael Norton wrote:
>> Hello and thank you for an incredible service (just joining the list).
>>  Is there a list of usage statistics per character of the Unicode set
>> available somewhere?
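
For a corpus you do have access to, the per-character counting discussed above is simple to do yourself. What follows is only a minimal sketch, not a method anyone in this thread proposed: the function names `char_usage` and `report` are hypothetical, the sample strings are made up, and normalizing to NFC before counting is an assumption (it makes precomposed and decomposed forms count as the same character). It does nothing about the weighting problem Ken raises.

```python
from collections import Counter
import unicodedata


def char_usage(texts):
    """Count code point occurrences across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Assumption: fold to NFC so "é" and "e" + U+0301 are counted alike.
        counts.update(unicodedata.normalize("NFC", text))
    return counts


def report(counts, top=10):
    """Return (code point, name, count, share of total) for the most common characters."""
    total = sum(counts.values())
    rows = []
    for ch, n in counts.most_common(top):
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{ord(ch):04X}", name, n, n / total))
    return rows


# Made-up sample: the same phrase in precomposed and decomposed form.
sample = ["déjà vu", "de\u0301ja\u0300 vu"]
counts = char_usage(sample)

for code_point, name, n, share in report(counts, top=5):
    print(f"{code_point}  {name:<40} {n:>4}  {share:.1%}")
```

The result describes only the texts you feed in, so any claim about "overall" usage still depends on how the corpus was chosen.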


Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home:

"All great actors are mere mathematical masters of speech and the human
