Mathematics > Dynamical Systems
[Submitted on 27 Sep 2011 (v1), revised 12 Apr 2013 (this version, v2), latest version 8 Jan 2014 (v4)]
Title: Coding Sequence Density Estimation Via Topological Pressure
Abstract: Inspired by concepts from ergodic theory, we give a new approach to coding sequence (CDS) density estimation, focusing on the human genome. Our approach is based on the introduction and study of topological pressure, which assigns a real number to any finite sequence based on an appropriate notion of 'weighted information content'. For human DNA sequences, each codon is assigned a suitable weight, and we can train these parameters so that the topological pressure and the observed CDS density correlate arbitrarily well. We cross-validate our results to check that we are not overfitting, and we use the parameters obtained by training on the human genome for ab initio estimation of CDS density on the genomes of the rhesus macaque and Mus musculus. We demonstrate that our predictions compare well with those given by the standard single-genome ab initio gene-finding programs. Our method is entirely combinatorial and extracts information from our training data without reference to any biological model. As such, our method can be adapted for the analysis of any eukaryote. An anticipated application is the identification of gene-rich regions, which is of particular importance to the study of plant genomes. Inspired again by ergodic theory, we use the weightings on the codons to define a probability distribution on finite sequences, which we show to be effective in distinguishing between coding and non-coding human DNA sequences of lengths between 750 bp and 5,000 bp. At the end of the paper, we explain the theoretical underpinning for our approach, which is the theory of thermodynamic formalism from the dynamical systems literature. Mathematica and MATLAB implementations of our method are available at this http URL.
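To make the idea of 'weighted information content' concrete, here is a minimal Python sketch. It is not taken from the paper: the abstract does not state the formula, so the definition used here is an assumption, chosen in the spirit of thermodynamic formalism. For a window length n, it sums, over the distinct length-n subwords of the sequence, the product of the weights of the overlapping 3-mers in each subword, then takes (1/n) log base 4 of that sum. The names `topological_pressure` and `uniform_weights` are hypothetical.

```python
import math
from itertools import product


def uniform_weights(value=1.0):
    """Hypothetical helper: give all 64 codons the same weight."""
    return {"".join(c): value for c in product("ACGT", repeat=3)}


def topological_pressure(seq, weights, n):
    """Assumed, simplified pressure of a finite DNA sequence.

    Computes (1/n) * log_4 of the sum, over the distinct length-n
    subwords u of seq, of the product of the weights of the
    overlapping 3-mers in u. This formula is an illustrative
    assumption, not the paper's stated definition.
    """
    if n < 3:
        raise ValueError("n must be at least 3 so each subword contains a codon")
    if len(seq) < n:
        raise ValueError("sequence is shorter than the window length n")

    # Distinct subwords of length n (each counted once).
    subwords = {seq[i:i + n] for i in range(len(seq) - n + 1)}

    total = 0.0
    for u in subwords:
        w = 1.0
        # Multiply the weights of every overlapping 3-mer in the subword.
        for j in range(len(u) - 2):
            w *= weights[u[j:j + 3]]
        total += w

    return math.log(total, 4) / n
```

With every weight set to 1, this assumed quantity reduces to (1/n) log base 4 of the number of distinct length-n subwords, which is an unweighted count of subword complexity. Raising the weights of selected codons increases the value on sequences in which those codons occur, and this is the kind of parameter the abstract describes training against observed CDS density.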
Submission history
From: David Koslicki
[v1] Tue, 27 Sep 2011 19:38:52 UTC (644 KB)
[v2] Fri, 12 Apr 2013 14:32:20 UTC (290 KB)
[v3] Mon, 6 Jan 2014 18:17:29 UTC (259 KB)
[v4] Wed, 8 Jan 2014 17:40:04 UTC (245 KB)