verdy_p at wanadoo.fr
Sat Mar 28 19:33:20 CDT 2015
[Note: message resent using another domain. Apparently the Unicode mailing list rejects as spam all emails posted from Gmail's webmail, even though they contain all the relevant tracking MIME
headers and are properly signed by Google with my proven identity.]
2015-03-28 12:30 GMT+01:00 Michael Norton :
> Thanks Doug. I did not know there exists a representative sample of the world's text. :)
> I do know that 400 years ago there were about 10,000 languages; now there are about 6,500.
> Time flies!
> Your frequency chart is great. The average char appearance is 2.91%. Only 34% from your list exceed 10% of it.
> Therefore, U+0020 is the elephant in the room (i.e. 15.05% is far > 2.91%).
> In fact, it's almost >50% greater than the next most-appearing character.
> So from the two frequency lists you've given me (my email and yours) we begin to see some patterns emerge.
> Provided prior data and observation, most useful patterns prevail over other more obscure ones
> and present a provocative opportunity for webbers out there...
> While this is probably out of context for most of the 700 Unicode members, I can report that it's good news.
A long time ago I learned a "word" (or is it an acronym? It is not really an abbreviation in itself, even though it is pronounceable) used by French cryptanalysts (working on simple substitution
ciphers): "ESARTINULOC" (some older sources gave "ESANTIRULO"). It is the list of the most frequently used basic letters in French, in decreasing order of frequency (ignoring case and diacritic
differences). It is also used implicitly by gamers (e.g. when solving or composing crosswords, or when playing games such as Scrabble(TM), where the top letters of the list have lower scoring
values, which differ between French Scrabble and English Scrabble).
That "word" is slightly different in English, or in the limited "global" count Doug did (over an extremely limited set of source texts); but of course in French the SPACE would also lead the
list, ahead of that "word" (though that is not taken into account for crosswords or Scrabble, even in languages that don't use spaces for word separation).
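To make the idea concrete, here is a minimal Python sketch of how such a "word" could be derived from a sample text. The function name letter_order is my own (hypothetical), and the folding rule
is an assumption: case is ignored, diacritics are stripped through NFD decomposition, and only the basic letters A to Z are counted (so SPACE is left out, and a ligature such as "œ" is simply
dropped rather than split into "OE").

```python
import unicodedata
from collections import Counter

def letter_order(text):
    """Return the basic letters A-Z found in text, most frequent first."""
    # NFD splits "é" into "e" + a combining accent; uppercasing the whole
    # string also expands "ß" to "SS" before we look at single characters.
    decomposed = unicodedata.normalize("NFD", text).upper()
    # Keep only basic Latin letters; combining marks, digits, spaces and
    # punctuation all fall outside the A-Z range and are ignored.
    counts = Counter(ch for ch in decomposed if "A" <= ch <= "Z")
    # Ties are broken alphabetically so the result is deterministic.
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))
```

On a large enough French corpus the result should start with something close to the "word" above, but the exact order depends entirely on the sample used.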
More accurate statistics may be obtained from databases with plain-text search capabilities (from the structure of their index), provided that they correctly track the language used
and that their data covers a large enough set of domains (e.g. statistics from the plain-text search engines of each **localized** edition of Wikipedia, Wiktionary, or Wikisource). If you want
"global" statistics it will be more difficult (Wikimedia Commons is insufficiently translated, and English is overrepresented there), but what you can do is estimate the usage rate of each main
language (or macrolanguage) and weight the statistics collected for each language accordingly to produce an estimated "global" frequency list.
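The weighting step can be sketched as follows. This is only an illustration: the function name global_frequencies and both of its inputs are hypothetical, and it assumes that the per-language
rates were all computed with the same letter-folding convention.

```python
def global_frequencies(per_language, language_share):
    """Combine per-language letter rates into one weighted estimate.

    per_language:   {language: {letter: rate within that language}}
    language_share: {language: estimated share of usage of that language}
    """
    # Normalize the shares over the languages we actually have data for,
    # so the weights sum to 1 even if the shares do not.
    total_share = sum(language_share[lang] for lang in per_language)
    combined = {}
    for lang, rates in per_language.items():
        weight = language_share[lang] / total_share
        for letter, rate in rates.items():
            combined[letter] = combined.get(letter, 0.0) + weight * rate
    return combined
```

The quality of the result is bounded by the quality of the language shares, which are themselves only estimates.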
But be careful: each language has its own set of collation rules, so letters that are considered to have the same primary weight in one language are distinguished and counted separately
in another. You may find that a source "ü" or "ä" had its rate actually computed as "UE" or "AE" in German, but only as "U" or "A" in English or French, and this will not allow you
to correctly estimate the global frequency rates of "U", "A" and "E". A simple linear mathematical transform (scalar products of the usage rates of languages and the usage rates of letters per
language) would not work: the global usage rate of "E" would be underestimated where it also represents the German umlaut, and both "U" and "A" would be overestimated...
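A small Python sketch shows how the same source text yields different counts under the two conventions. The table GERMAN_EXPANSIONS and the function count_letters are hypothetical names of
mine, and the table is only a simplified stand-in for real German collation rules, not a copy of any actual tailoring.

```python
import unicodedata
from collections import Counter

# Assumed, simplified German-style expansions (illustration only).
GERMAN_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def count_letters(text, expansions=None):
    """Count basic letters A-Z, optionally expanding some letters first."""
    # NFC first, so that "u" + combining diaeresis matches the "ü" key.
    text = unicodedata.normalize("NFC", text).lower()
    if expansions:
        text = "".join(expansions.get(ch, ch) for ch in text)
    # Any remaining diacritics are stripped: NFD, then keep A-Z only.
    decomposed = unicodedata.normalize("NFD", text).upper()
    return Counter(ch for ch in decomposed if "A" <= ch <= "Z")

plain = count_letters("für")                        # "ü" counted as "U"
expanded = count_letters("für", GERMAN_EXPANSIONS)  # "ü" counted as "UE"
```

Here expanded contains one "E" that plain does not, so rates collected under different conventions cannot simply be added together without first mapping them to a common convention.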