Information Theory and the Length Distribution of all Discrete Systems

Hatton, Les; Warr, Gregory

Abstract: We begin with the extraordinary observation that the length distribution of 80 million proteins in UniProt, the Universal Protein Resource, measured in amino acids, is qualitatively identical to the length distribution of large collections of computer functions measured in programming language tokens, at all scales. That two such disparate discrete systems share important structural properties suggests that yet other apparently unrelated discrete systems might share the same properties, and certainly invites an explanation.
We demonstrate that this is inevitable for all discrete systems of components built from tokens or symbols. Departing from existing work by embedding the Conservation of Hartley-Shannon information (CoHSI) in a classical statistical mechanics framework, we identify two kinds of discrete system, heterogeneous and homogeneous. Heterogeneous systems contain components built from a unique alphabet of tokens and yield an implicit CoHSI distribution with a sharp unimodal peak asymptoting to a power-law. Homogeneous systems contain components each built from just one kind of token unique to that component and yield a CoHSI distribution corresponding to Zipf's law.
This theory is applied to heterogeneous systems (proteome, computer software, music); to homogeneous systems (language texts, abundance of the elements); and to systems in which heterogeneous and homogeneous behaviour co-exist (word frequencies and word length frequencies in language texts). In each case, the predictions of the theory are tested and supported to high levels of statistical significance. We also show that in the same heterogeneous system, different but consistent alphabets must be related by a power-law. We demonstrate this on a large body of music by excluding and including note duration in the definition of the unique alphabet of notes.
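
As a rough illustration of what a Zipf-like rank-frequency relation looks like in practice, the sketch below counts tokens, ranks them by frequency, and estimates the slope of log(frequency) against log(rank) by ordinary least squares. It is not the authors' method or code: the function names `rank_frequency` and `loglog_slope` and the synthetic token list are assumptions made for this example, and a least-squares fit on log-log axes is only a crude estimator of a power-law exponent.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent token first (rank 1)."""
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic data (hypothetical): token k occurs about 1000 / k times,
# which is an idealised Zipf distribution with exponent 1.
tokens = [f"w{k}" for k in range(1, 51) for _ in range(round(1000 / k))]

pairs = rank_frequency(tokens)
slope = loglog_slope(pairs)
print(f"estimated log-log slope: {slope:.3f}")  # close to -1 for Zipf-like data
```

For a real text, `tokens` would be replaced by the list of words in the text; a slope near -1 over a wide range of ranks is the usual informal signature of Zipf's law.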

Comments:	70 pages, 53 figures, inc. 30 pages of Appendices
Subjects:	Other Quantitative Biology (q-bio.OT); Information Theory (cs.IT); Biological Physics (physics.bio-ph); Physics and Society (physics.soc-ph); Populations and Evolution (q-bio.PE)
ACM classes:	D.3.2
Cite as:	arXiv:1709.01712 [q-bio.OT]
	(or arXiv:1709.01712v1 [q-bio.OT] for this version)
	https://doi.org/10.48550/arXiv.1709.01712
