Large-scale analysis of Zipf's law in English texts

Moreno-Sánchez, Isabel; Font-Clos, Francesc; Corral, Álvaro

doi:10.1371/journal.pone.0147073


arXiv:1509.04486 (stat)

[Submitted on 15 Sep 2015]



Abstract: Despite being a paradigm of quantitative linguistics, Zipf's law for words suffers from three main problems: its formulation is ambiguous, its validity has not been tested rigorously from a statistical point of view, and it has not been confronted with a representatively large number of texts. So, we can summarize the current support of Zipf's law in texts as anecdotal.
We try to solve these issues by studying three different versions of Zipf's law and fitting them to all available English texts in the Project Gutenberg database (consisting of more than 30,000 texts). To do so, we use state-of-the-art tools in fitting and goodness-of-fit testing, carefully tailored to the peculiarities of text statistics. Remarkably, one of the three versions of Zipf's law, consisting of a pure power-law form in the complementary cumulative distribution function of word frequencies, is able to fit more than 40% of the texts in the database (at the 0.05 significance level), for the whole domain of frequencies (from 1 to the maximum value) and with only one free parameter (the exponent).
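
To make the one-parameter fit concrete, here is a minimal Python sketch of one way it could be set up. It is not the authors' code. It assumes the pure power-law form can be written as S(n) = P(frequency >= n) = n^(-alpha) for n = 1, 2, ..., which gives the probability mass function f(n) = n^(-alpha) - (n+1)^(-alpha). The tokenizer, the function names, the search bounds, and the symbol alpha are all illustrative choices rather than details from the abstract. The exponent is estimated by maximum likelihood with a golden-section search, which is adequate here because the log-likelihood is concave in alpha under this parameterization. The goodness-of-fit test mentioned in the abstract is not reproduced.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Return the absolute frequency of each distinct word in the text.

    The regular-expression tokenizer is a simplifying assumption.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return list(Counter(tokens).values())


def log_likelihood(alpha, freq_counts):
    """Log-likelihood under f(n) = n^-alpha - (n+1)^-alpha for n >= 1.

    freq_counts maps a frequency n to the number of words having it.
    log f(n) is rewritten as -alpha*log(n) + log(1 - (n/(n+1))^alpha)
    and evaluated with expm1/log1p for numerical stability.
    """
    total = 0.0
    for n, count in freq_counts.items():
        x = alpha * math.log1p(1.0 / n)  # x > 0 whenever alpha > 0
        total += count * (-alpha * math.log(n) + math.log(-math.expm1(-x)))
    return total


def fit_alpha(frequencies, lo=0.01, hi=10.0, tol=1e-6):
    """Maximum-likelihood exponent found by golden-section search.

    The bounds lo and hi are arbitrary illustrative choices.
    """
    counts = Counter(frequencies)
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - g * (b - a)
        d = a + g * (b - a)
        if log_likelihood(c, counts) > log_likelihood(d, counts):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2
```

A call such as `fit_alpha(word_frequencies(open("book.txt").read()))` would return the estimated exponent for a single text, where `book.txt` is a hypothetical file name. Deciding whether the fit is acceptable at the 0.05 significance level would need a separate goodness-of-fit step, for example one based on simulated samples from the fitted distribution.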

Subjects:	Applications (stat.AP); Physics and Society (physics.soc-ph)
Cite as:	arXiv:1509.04486 [stat.AP]
	(or arXiv:1509.04486v1 [stat.AP] for this version)
	https://doi.org/10.48550/arXiv.1509.04486
Related DOI:	https://doi.org/10.1371/journal.pone.0147073

Submission history

From: Isabel Moreno-Sánchez
[v1] Tue, 15 Sep 2015 10:41:03 UTC (673 KB)
