Submitted by orgtre t3_xujlqk in dataisbeautiful
orgtre OP t1_iqvqi5z wrote
This was created from the Google Books Ngram Corpus Version 3 using Python (seaborn/matplotlib). The code is available in this repository. It's a simple-looking graph, but it is based on the analysis of hundreds of billions of words!
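For anyone curious, a minimal sketch of how a plot like this could be produced with pandas and seaborn/matplotlib is below. The file names, column layout, and axis choices are my own assumptions for illustration, not the repository's actual code:

```python
# Illustrative sketch only, not the repository code.
# Assumes one CSV per language with columns "word" and "cumshare",
# where "cumshare" is the cumulative relative frequency in percent.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

languages = {"English": "en.csv", "German": "de.csv"}  # hypothetical file names

sns.set_theme(style="whitegrid")
fig, ax = plt.subplots()
for name, path in languages.items():
    df = pd.read_csv(path)
    # x: word rank (1 = most frequent), y: cumulative share of all word occurrences
    ax.plot(range(1, len(df) + 1), df["cumshare"], label=name)

ax.set_xscale("log")  # a log rank axis is a common choice; the original may differ
ax.set_xlabel("Number of most frequent words known")
ax.set_ylabel("% words in a typical book one can understand")
ax.legend()
plt.show()
```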
draypresct t1_iqwk3zv wrote
Where did the data on the most common words per language come from? The same books you used in your comparison?
In other words, if I used similar methods on a bunch of statistics textbooks, would I show high levels of comprehensibility with relatively small vocabularies based disproportionately on statistical jargon?
orgtre OP t1_iqx4i2s wrote
Yes, the data comes from the same books. For each language I create an ordered list of the most frequent words, which looks like this. The graph then simply plots each word's rank on the x-axis and the cumulative relative frequency (the "cumshare" column in the CSV files) on the y-axis.
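As a rough sketch of how a "cumshare" column could be derived from raw word counts (file and column names are assumed for illustration; the repository may do this differently):

```python
import pandas as pd

# Hypothetical input: one row per word with its total count in the corpus.
df = pd.read_csv("en_wordcounts.csv")          # assumed columns: "word", "count"
df = df.sort_values("count", ascending=False)  # rank words by frequency
df["rank"] = range(1, len(df) + 1)
# Cumulative relative frequency: the share of all word occurrences
# covered by the top-k most frequent words, in percent.
df["cumshare"] = df["count"].cumsum() / df["count"].sum() * 100
df.to_csv("en.csv", index=False)
```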
The answer to your last question is hence also yes. It raises the question of how representative the underlying corpus is; I wrote a bit about that here, and there is also this paper. To be very precise, the y-axis title should be "% words in a typical book from the Google Books Ngram corpus one can understand"; to the extent that one thinks the corpus is representative of a typical book one might read, the "from the Google Books Ngram corpus" part can be omitted.