Visualization of frequency distributions
ZipfExplorer tool
Discourse comparison
Lexical diversity
If the word types in a text are ranked in decreasing order of their frequencies, the frequency $n$ of a word with rank $r$ is approximately inversely proportional to its rank (Zipf, 1935, 1949)
$$ n \propto r^{-z} $$
where $z$ typically has a value close to $1$
Taking the logarithm of both sides
$$ \log n = -z \log r + c $$
yields Zipf's familiar rank-frequency plot: in log-log space, a straight line with slope $\approx -z$ (the constant $c$ comes from the proportionality)
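As a rough illustration of the rank-frequency relation, the sketch below ranks word types by frequency and estimates $z$ as the negated least-squares slope of $\log n$ against $\log r$. The function names `rank_frequency` and `zipf_exponent` are hypothetical helpers, not part of ZipfExplorer, and plain least squares in log-log space is assumed here only for simplicity; Clauset et al. (2009) recommend maximum-likelihood fitting for power laws.

```python
from collections import Counter
import math


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent word type first."""
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_exponent(pairs):
    """Estimate z as minus the least-squares slope of log n on log r.

    Assumption: an ordinary least-squares fit over all ranks; this is a
    quick visual-style estimate, not a rigorous power-law fit.
    """
    xs = [math.log(r) for r, _ in pairs]
    ys = [math.log(n) for _, n in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


if __name__ == "__main__":
    # Synthetic frequencies that follow n = 1000 / r exactly, so z should be 1.
    ideal = [(r, 1000.0 / r) for r in range(1, 101)]
    print(round(zipf_exponent(ideal), 6))
```

For a real text, `rank_frequency(text.lower().split())` gives the pairs to plot on log-log axes; the tokenisation used here is a placeholder.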
https://zipfexplorer.herokuapp.com
Zipf's rank-frequency function is equivalent to the complementary cumulative distribution function of the Pareto distribution (Newman, 2005)
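If the frequency $n$ of a word with rank $r$ is proportional to $r^{-z}$, then the rank, i.e. the number of types with frequency at least $n$, is proportional to $n^{-1/z}$; its derivative with respect to $n$ is a continuous approximation of the probability of encountering a type with frequency $n$, $p(n) \propto n^{-\alpha}$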
The new exponent $\alpha$ is related to the exponent of Zipf's Law:
$$ \alpha = 1+\frac{1}{z} $$
Low entropy: words re-used
Maximum entropy: $\log_{2} n$
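The two entropy extremes can be checked with a minimal sketch of Shannon entropy (Shannon, 1948) over word-type probabilities. It assumes that $n$ in the maximum-entropy bound is the number of distinct word types, which equals the number of tokens when no word is repeated; `shannon_entropy` and `max_entropy` are hypothetical helper names.

```python
from collections import Counter
import math


def shannon_entropy(tokens):
    """Entropy in bits of the relative-frequency distribution over word types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def max_entropy(tokens):
    """Upper bound log2(n), assuming n is the number of distinct types."""
    return math.log2(len(set(tokens)))


if __name__ == "__main__":
    repeated = ["the", "the", "the", "the"]        # one type re-used
    varied = ["alpha", "beta", "gamma", "delta"]   # every token a new type
    print(shannon_entropy(repeated), shannon_entropy(varied), max_entropy(varied))
```

A text that re-uses a single word has entropy 0 bits, while a text in which all $n$ types are equally frequent reaches the bound $\log_{2} n$.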
Alstott, J., Bullmore, E. & Plenz, D. (2014). Powerlaw: a Python package for analysis of heavy-tailed distributions. PLoS ONE 9(1).
Bird, S., Loper, E. & Klein, E. (2019). Natural language processing with Python. Newton, MA: O'Reilly.
Bokeh Development Team. (2020). Bokeh: Python library for interactive visualization. http://www.bokeh.pydata.org, last accessed 01/03/2020.
Clauset, A., Shalizi, C. R. & Newman, M. E. J. (2009). Power-Law distributions in empirical data. SIAM Review 51(4), 661–703.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19, 61–74.
Francis, W. N. & Kučera, H. (1979). A standard corpus of present-day edited American English, for use with digital computers. Providence, RI: Brown University.
Hundt, M., Sand, A. & Skandera, P. (1999). Manual of information to accompany The Freiburg – Brown Corpus of American English ('Frown'). Freiburg, Germany: Department of English, Albert-Ludwigs-Universität.
Moreno-Sánchez, I., Font-Clos, F. & Corral, Á. (2016). Large-scale analysis of Zipf's Law in English texts. PLoS ONE 11(1): e0147073. https://doi.org/10.1371/journal.pone.0147073
Newman, M. E. J. (2005). Power laws, Pareto distributions and Zipf's law. Contemporary Physics 46(5), 323–351.
Rayson, P. & Garside, R. (2000). Comparing corpora using frequency profiling. In: WCC '00 proceedings of the workshop on comparing corpora, pp. 1–6. New York: ACM.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal 27, 379–423; 623–656.
Stubbs, M. (2010). Three concepts of keywords. In: Bondi, M., Scott, M. (eds.), Keyness in texts, pp. 21–42. Amsterdam: John Benjamins.
Zipf, G. K. (1935). The psycho-biology of language. Cambridge, MA: Riverside Press.
Zipf, G. K. (1949). Human behavior and the principle of least effort. Cambridge, MA: Addison-Wesley.