Submitted by orgtre t3_xujlqk in dataisbeautiful
orgtre OP t1_iqvqi5z wrote
This was created from the Google Books Ngram Corpus Version 3 using Python (seaborn/matplotlib). The code is available in this repository. It's a simple-looking graph, but it is based on the analysis of hundreds of billions of words!
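For anyone curious, a minimal sketch of how a plot like this could be produced with pandas and seaborn/matplotlib is below. The file names, column layout, and axis choices are my own assumptions for illustration, not the repository's actual code:

```python
# Illustrative sketch only, not the repository code.
# Assumes one CSV per language with columns "word" and "cumshare",
# where "cumshare" is the cumulative relative frequency in percent.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

languages = {"English": "en.csv", "German": "de.csv"}  # hypothetical file names

sns.set_theme(style="whitegrid")
fig, ax = plt.subplots()
for name, path in languages.items():
    df = pd.read_csv(path)
    # x: word rank (1 = most frequent), y: cumulative share of all word occurrences
    ax.plot(range(1, len(df) + 1), df["cumshare"], label=name)

ax.set_xscale("log")  # a log rank axis is a common choice; the original may differ
ax.set_xlabel("Number of most frequent words known")
ax.set_ylabel("% words in a typical book one can understand")
ax.legend()
plt.show()
```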
draypresct t1_iqwk3zv wrote
Where did the data on the most common words per language come from? The same books you used in your comparison?
In other words, if I used similar methods on a bunch of statistics textbooks, would I show high levels of comprehensibility with relatively small vocabularies based disproportionately on statistical jargon?
orgtre OP t1_iqx4i2s wrote
Yes, the data comes from the same books. For each language I create an ordered list of the most frequent words, which looks like this. The graph then simply plots each word's rank on the x-axis and the cumulative relative frequency (the "cumshare" column in the CSV files) on the y-axis.
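As a rough sketch of how a "cumshare" column could be derived from raw word counts (file and column names are assumed for illustration; the repository may do this differently):

```python
import pandas as pd

# Hypothetical input: one row per word with its total count in the corpus.
df = pd.read_csv("en_wordcounts.csv")          # assumed columns: "word", "count"
df = df.sort_values("count", ascending=False)  # rank words by frequency
df["rank"] = range(1, len(df) + 1)
# Cumulative relative frequency: the share of all word occurrences
# covered by the top-k most frequent words, in percent.
df["cumshare"] = df["count"].cumsum() / df["count"].sum() * 100
df.to_csv("en.csv", index=False)
```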
The answer to your last question is hence also yes. It raises the question of how representative the underlying corpus is; I wrote a bit about that here, and there is also this paper. To be very precise, the y-axis title should be "% words in a typical book from the Google Books Ngram corpus one can understand"; to the extent that one thinks the corpus is representative of a typical book one might read, the "from the Google Books Ngram corpus" part can be omitted.