Quantitative Biology
Showing new listings for Friday, 7 March 2025
- [1] arXiv:2503.03773 [pdf, html, other]
Title: A Phylogenetic Approach to Genomic Language Modeling
Comments: 15 pages, 7 figures
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG)
Genomic language models (gLMs) have shown mostly modest success in identifying evolutionarily constrained elements in mammalian genomes. To address this issue, we introduce a novel framework for training gLMs that explicitly models nucleotide evolution on phylogenetic trees using multispecies whole-genome alignments. Our approach integrates an alignment into the loss function during training but does not require it for making predictions, thereby enhancing the model's applicability. We applied this framework to train PhyloGPN, a model that excels at predicting functionally disruptive variants from a single sequence alone and demonstrates strong transfer learning capabilities.
- [2] arXiv:2503.03783 [pdf, html, other]
Title: Passive Heart Rate Monitoring During Smartphone Use in Everyday Life
Authors: Shun Liao, Paolo Di Achille, Jiang Wu, Silviu Borac, Jonathan Wang, Xin Liu, Eric Teasley, Lawrence Cai, Yun Liu, Daniel McDuff, Hao-Wei Su, Brent Winslow, Anupam Pathak, Shwetak Patel, Jameson K. Rogers, Ming-Zher Poh
Subjects: Tissues and Organs (q-bio.TO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Resting heart rate (RHR) is an important biomarker of cardiovascular health and mortality, but tracking it longitudinally generally requires a wearable device, limiting its availability. We present PHRM, a deep learning system for passive heart rate (HR) and RHR measurements during everyday smartphone use, using facial video-based photoplethysmography. Our system was developed using 225,773 videos from 495 participants and validated on 185,970 videos from 205 participants in laboratory and free-living conditions, representing the largest validation study of its kind. Compared to reference electrocardiogram, PHRM achieved a mean absolute percentage error (MAPE) < 10% for HR measurements across three skin tone groups of light, medium and dark pigmentation; MAPE for each skin tone group was non-inferior versus the others. Daily RHR measured by PHRM had a mean absolute error < 5 bpm compared to a wearable HR tracker, and was associated with known risk factors. These results highlight the potential of smartphones to enable passive and equitable heart health monitoring.
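For readers unfamiliar with the two error metrics quoted above, the following is a minimal sketch of how MAPE and MAE are conventionally computed against a reference signal. The function names and the heart-rate values are illustrative assumptions; they are not data or code from the paper.

```python
def mean_absolute_error(reference, estimate):
    """Average absolute difference, in the units of the inputs (e.g. bpm)."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(r - e) for r, e in zip(reference, estimate)) / len(reference)


def mean_absolute_percentage_error(reference, estimate):
    """Average absolute difference relative to the reference, in percent."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(r == 0 for r in reference):
        raise ValueError("reference values must be non-zero")
    ratios = [abs(r - e) / abs(r) for r, e in zip(reference, estimate)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical readings in beats per minute (made up for illustration).
ecg_hr = [60.0, 80.0, 100.0]    # reference measurements
video_hr = [63.0, 76.0, 100.0]  # estimates to be evaluated

mae = mean_absolute_error(ecg_hr, video_hr)
mape = mean_absolute_percentage_error(ecg_hr, video_hr)
print(f"MAE = {mae:.2f} bpm, MAPE = {mape:.2f}%")
```

MAPE normalizes each error by the reference value, so it is unit-free, whereas MAE stays in bpm; this is why the abstract reports a percentage threshold for HR and a bpm threshold for daily RHR.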
- [3] arXiv:2503.03784 [pdf, html, other]
Title: Neural Models of Task Adaptation: A Tutorial on Spiking Networks for Executive Control
Comments: 6 pages
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Understanding cognitive flexibility and task-switching mechanisms in neural systems requires biologically plausible computational models. This tutorial presents a step-by-step approach to constructing a spiking neural network (SNN) that simulates task-switching dynamics within the cognitive control network. The model incorporates biologically realistic features, including lateral inhibition, adaptive synaptic weights through unsupervised Spike Timing-Dependent Plasticity (STDP), and precise neuronal parameterization within physiologically relevant ranges. The SNN is implemented using Leaky Integrate-and-Fire (LIF) neurons, which represent excitatory (glutamatergic) and inhibitory (GABAergic) populations. We utilize two real-world datasets as tasks, demonstrating how the network learns and dynamically switches between them. Experimental design follows cognitive psychology paradigms to analyze neural adaptation, synaptic weight modifications, and emergent behaviors such as Long-Term Potentiation (LTP), Long-Term Depression (LTD), and Task-Set Reconfiguration (TSR). Through a series of structured experiments, this tutorial illustrates how variations in task-switching intervals affect performance and multitasking efficiency. The results align with empirically observed neuronal responses, offering insights into the computational underpinnings of executive function. By following this tutorial, researchers can develop and extend biologically inspired SNN models for studying cognitive processes and neural adaptation.
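As background for the Leaky Integrate-and-Fire neurons mentioned above, here is a minimal sketch of a single LIF neuron integrated with forward Euler. All parameter values and the function name are assumptions chosen for illustration; the sketch omits STDP, lateral inhibition, and the excitatory/inhibitory populations described in the tutorial, and is not the authors' implementation.

```python
def simulate_lif(input_current, dt=1e-3, tau_m=20e-3, v_rest=-70e-3,
                 v_reset=-75e-3, v_thresh=-50e-3, resistance=1e7):
    """Forward-Euler simulation of one leaky integrate-and-fire neuron.

    Integrates tau_m * dV/dt = -(V - v_rest) + resistance * I(t).
    input_current: list of input currents in amperes, one per time step.
    Returns (list of membrane voltages in volts, list of spike step indices).
    """
    v = v_rest
    voltages, spikes = [], []
    for step, current in enumerate(input_current):
        # Leak pulls V toward rest; the input current drives it upward.
        dv = (-(v - v_rest) + resistance * current) * (dt / tau_m)
        v += dv
        if v >= v_thresh:
            # Threshold crossing: record a spike and reset the membrane.
            spikes.append(step)
            v = v_reset
        voltages.append(v)
    return voltages, spikes


# Illustrative constant inputs (assumed values): weak and strong drive.
_, weak_spikes = simulate_lif([1e-9] * 500)    # settles below threshold
_, strong_spikes = simulate_lif([3e-9] * 500)  # settles above threshold
print(len(weak_spikes), len(strong_spikes))
```

With these assumed parameters the weak input drives the membrane toward -60 mV, which stays below the -50 mV threshold, while the strong input drives it toward -40 mV and therefore produces repeated spikes.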
- [4] arXiv:2503.03786 [pdf, html, other]
Title: Self is the Best Learner: CT-free Ultra-Low-Dose PET Organ Segmentation via Collaborating Denoising and Segmentation Learning
Comments: 8 pages, 5 figures
Subjects: Tissues and Organs (q-bio.TO); Image and Video Processing (eess.IV)
Organ segmentation in Positron Emission Tomography (PET) plays a vital role in cancer quantification. Low-dose PET (LDPET) provides a safer alternative by reducing radiation exposure. However, the inherent noise and blurred boundaries make organ segmentation more challenging. Additionally, existing PET organ segmentation methods rely on co-registered Computed Tomography (CT) annotations, overlooking the problem of modality mismatch. In this study, we propose LDOS, a novel CT-free ultra-LDPET organ segmentation pipeline. Inspired by Masked Autoencoders (MAE), we reinterpret LDPET as a naturally masked version of Full-Dose PET (FDPET). LDOS adopts a simple yet effective architecture: a shared encoder extracts generalized features, while task-specific decoders independently refine outputs for denoising and segmentation. By integrating CT-derived organ annotations into the denoising process, LDOS improves anatomical boundary recognition and alleviates the PET/CT misalignments. Experiments demonstrate that LDOS achieves state-of-the-art performance with mean Dice scores of 73.11% (18F-FDG) and 73.97% (68Ga-FAPI) across 18 organs in 5% dose PET. Our code is publicly available.
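The mean Dice scores reported above follow a standard overlap metric. The sketch below shows one common way to compute a per-organ Dice score and average it over organs; the function names, the flat label lists, and the convention for empty masks are assumptions made for illustration, not details taken from the paper.

```python
def dice_score(pred_mask, target_mask):
    """Dice = 2 * |A and B| / (|A| + |B|) for two flat binary masks."""
    if len(pred_mask) != len(target_mask):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred_mask, target_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in target_mask if t)
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


def mean_dice(pred_labels, target_labels, organ_ids):
    """Average the per-organ Dice score over the given organ label ids."""
    scores = []
    for organ in organ_ids:
        pred_mask = [p == organ for p in pred_labels]
        target_mask = [t == organ for t in target_labels]
        scores.append(dice_score(pred_mask, target_mask))
    return sum(scores) / len(scores)


# Toy flattened label maps (0 = background, 1 and 2 = hypothetical organs).
predicted = [1, 1, 2, 2, 0, 0]
annotated = [1, 2, 2, 2, 0, 0]
toy_mean = mean_dice(predicted, annotated, organ_ids=[1, 2])
print(f"mean Dice = {toy_mean:.4f}")
```

In this toy case organ 1 scores 2/3 and organ 2 scores 4/5, so the printed mean is their average; a real evaluation would apply the same calculation to 3D volumes for each of the annotated organs.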
- [5] arXiv:2503.03790 [pdf, html, other]
Title: DDCSR: A Novel End-to-End Deep Learning Framework for Cortical Surface Reconstruction from Diffusion MRI
Authors: Chengjin Li, Yuqian Chen, Nir A. Sochen, Wei Zhang, Carl-Fredrik Westin, Rathi Yogesh, Lauren J. O'Donnell, Ofer Pasternak, Fan Zhang
Comments: 9 pages, 3 figures
Subjects: Tissues and Organs (q-bio.TO); Graphics (cs.GR); Image and Video Processing (eess.IV)
Diffusion MRI (dMRI) plays a crucial role in studying brain white matter connectivity. Cortical surface reconstruction (CSR), including the inner white matter (WM) and outer pial surfaces, is one of the key tasks in dMRI analyses such as fiber tractography and multimodal MRI analysis. Existing CSR methods rely on anatomical T1-weighted data and map them into the dMRI space through inter-modality registration. However, due to the low resolution and image distortions of dMRI data, inter-modality registration faces significant challenges. This work proposes a novel end-to-end learning framework, DDCSR, which for the first time enables CSR directly from dMRI data. DDCSR consists of two major components: (1) an implicit learning module to predict a voxel-wise intermediate surface representation, and (2) an explicit learning module to predict the 3D mesh surfaces. Compared to several baseline and advanced CSR methods, we show that the proposed DDCSR can largely increase both accuracy and efficiency. Furthermore, we demonstrate a high generalization ability of DDCSR to data from different sources, despite the differences in dMRI acquisitions and populations.
- [6] arXiv:2503.03913 [pdf, other]
Title: Proton Flows, Proton Gradients and Subcellular Architecture in Biological Energy Conversion
Subjects: Subcellular Processes (q-bio.SC)
Hydrogen ions, or protons, provide the medium by which energy is stored and converted in biological systems. Such pre-eminence relies on the interplay between interfacial and bulk chemical transformations, according to mechanisms that are shared by organisms in all phyla of life. The present work provides an introduction to the fundamental aspects of biological energy management by focusing on the relationship between vectorial proton flows and the geometry of energy producing organelles in eukaryotes. The leading models of proton-mediated energy conversion, the delocalised proton (or chemiosmotic) model and the localised proton model, are presented in a complementary perspective. While the delocalised model provides a description that relies on equilibrium thermodynamics, the localised model addresses dynamic processes that are better described using out-of-equilibrium thermodynamics. The work reviews the salient aspects of such mechanisms, traces the development of our present understanding, and highlights areas that are open to future developments.
- [7] arXiv:2503.03950 [pdf, html, other]
Title: The Nature of Organization in Living Systems
Comments: 22 pages, 7 figures
Subjects: Quantitative Methods (q-bio.QM); Populations and Evolution (q-bio.PE)
Living systems are thermodynamically open but closed in their organization. In other words, even though their material components turn over constantly, a material-independent property persists, which we call organization. Moreover, organization comes from within organisms themselves, which requires us to explain how this self-organization is established and maintained. In this paper we propose a mathematical and conceptual framework to understand the kinds of organized systems that living systems are, aiming to explain how self-organization emerges from more basic elemental processes. Additionally, we map our own notions to existing traditions in theoretical biology and philosophy, aiming to bring the main formal ideas into conceptual congruence.
- [8] arXiv:2503.03989 [pdf, html, other]
Title: Integrating Protein Dynamics into Structure-Based Drug Design via Full-Atom Stochastic Flows
Authors: Xiangxin Zhou, Yi Xiao, Haowei Lin, Xinheng He, Jiaqi Guan, Yang Wang, Qiang Liu, Feng Zhou, Liang Wang, Jianzhu Ma
Comments: Accepted to ICLR 2025
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
The dynamic nature of proteins, influenced by ligand interactions, is essential for comprehending protein function and progressing drug discovery. Traditional structure-based drug design (SBDD) approaches typically target binding sites with rigid structures, limiting their practical application in drug development. While molecular dynamics simulation can theoretically capture all the biologically relevant conformations, the transition rate is dictated by the intrinsic energy barrier between them, making the sampling process computationally expensive. To overcome the aforementioned challenges, we propose to use generative modeling for SBDD considering conformational changes of protein pockets. We curate a dataset of apo and multiple holo states of protein-ligand complexes, simulated by molecular dynamics, and propose a full-atom flow model (and a stochastic version), named DynamicFlow, that learns to transform apo pockets and noisy ligands into holo pockets and corresponding 3D ligand molecules. Our method uncovers promising ligand molecules and corresponding holo conformations of pockets. Additionally, the resultant holo-like states provide superior inputs for traditional SBDD approaches, playing a significant role in practical drug discovery.
- [9] arXiv:2503.04069 [pdf, other]
Title: Integrating network pharmacology, metabolomics, and gut microbiota analysis to explore the effects of Jinhong tablets on chronic superficial gastritis
Authors: Lihao Xiao, Tingyu Zhang, Yun Liu, Chayanis Sutcharitchan, Qingyuan Liu, Xiaoxue Fan, Jian Feng, Huifang Gao, Tong Zhang, Shao Li
Subjects: Molecular Networks (q-bio.MN)
Chronic superficial gastritis (CSG) severely affects quality of life and can progress to worse gastric pathologies. Traditional Chinese Medicine (TCM) effectively treats CSG, as exemplified by Jinhong Tablets (JHT) with known anti-inflammatory properties, though their mechanism remains unclear. This study integrated network pharmacology, untargeted metabolomics, and gut microbiota analyses to investigate how JHT alleviates CSG. A rat CSG model was established and evaluated via H&E staining. We identified JHT's target profiles and constructed a multi-layer biomolecular network. Differential metabolites in plasma were determined by untargeted metabolomics, and gut microbiota diversity/composition in fecal and cecal samples was assessed via 16S rRNA sequencing. JHT markedly reduced gastric inflammation. Network pharmacology highlighted metabolic pathways, particularly lipid and nitric oxide metabolism, as essential to JHT's therapeutic effect. Metabolomics identified key differential metabolites including betaine (enhancing gut microbiota), phospholipids, and citrulline (indicating severity of CSG). Pathway enrichment supported the gut microbiota's involvement. Further microbiota analysis showed that JHT increased betaine abundance, improved short-chain fatty acid production, and elevated Faecalibaculum and Bifidobacterium, thereby alleviating gastric inflammation. In conclusion, JHT alleviates CSG via diverse metabolic processes, especially lipid and energy metabolism, and influences metabolites like betaine alongside gut microbes such as Faecalibaculum and Bifidobacterium. These findings underscore JHT's therapeutic potential and deepen our understanding of TCM's role in CSG management.
- [10] arXiv:2503.04200 [pdf, html, other]
Title: DeepSilencer: A Novel Deep Learning Model for Predicting siRNA Knockdown Efficiency
Subjects: Biomolecules (q-bio.BM)
Background: Small interfering RNA (siRNA) is a promising therapeutic agent due to its ability to silence disease-related genes via RNA interference. While traditional machine learning and early deep learning methods have made progress in predicting siRNA efficacy, there remains significant room for improvement. Advanced deep learning techniques can enhance prediction accuracy, reducing the reliance on extensive wet-lab experiments and accelerating the identification of effective siRNA sequences. This approach also provides deeper insights into the mechanisms of siRNA efficacy, facilitating more targeted and efficient therapeutic strategies.
Methods: We introduce DeepSilencer, an innovative deep learning model designed to predict siRNA knockdown efficiency. DeepSilencer utilizes advanced neural network architectures to capture the complex features of siRNA sequences. Our key contributions include a specially designed deep learning model, an innovative online data sampling method, and an improved loss function tailored for siRNA prediction. These enhancements collectively boost the model's prediction accuracy and robustness.
Results: Extensive evaluations on multiple test sets demonstrate that DeepSilencer achieves state-of-the-art performance using only siRNA sequences and basic physicochemical properties. Our model surpasses several other methods and shows superior predictive performance, particularly when incorporating thermodynamic parameters.
Conclusion: The advancements in data sampling, model design, and loss function significantly enhance the predictive capabilities of DeepSilencer. These improvements underscore its potential to advance RNAi therapeutic design and development, offering a powerful tool for researchers and clinicians.
- [11] arXiv:2503.04339 [pdf, other]
Title: Reproductive system and interaction with fauna in a Mediterranean Pyrophite shrub
Subjects: Other Quantitative Biology (q-bio.OT)
The ULEX model, in its present state, involves the study of the biomass and the population of the shrub Ulex parviflorus Pourret, but while being a dynamic model, it is static in the sense that it does not imply the appearance of new specimens of this plant. As a complement to the ULEX model in its two dynamic and spatial aspects, and with the idea of extending the model, the authors have introduced, from a biological and statistical point of view, four characteristics of this species: flowering, pollination, fructification, and dispersion of seeds, taking special interest in the role played by the pollinators (bees).
- [12] arXiv:2503.04477 [pdf, html, other]
Title: Exact first passage time distribution for nonlinear chemical reaction networks II: monomolecular reactions and a A + B - C type of second-order reaction with arbitrary initial conditions
Comments: 13 pages, 5 figures, 4 tables
Subjects: Molecular Networks (q-bio.MN)
In biochemical reaction networks, the first passage time (FPT) of a reaction quantifies the time it takes for the reaction to first occur, from the initial state. While the mean FPT historically served as a summary metric, a far more comprehensive characterization of the dynamics of the network is contained within the complete FPT distribution. The relatively uncommon theoretical treatments of the FPT distribution that have been given in the past have been confined to linear systems, with zero and first-order processes. Recently, we presented theoretically exact solutions for the FPT distribution, within nonlinear systems involving two-particle collisions, such as A+B - C. Although this research yielded invaluable results, it was based upon the assumption of initial conditions in the form of a Poisson distribution. This somewhat restricts its relevance to real-world biochemical systems, which frequently display intricate behaviour and initial conditions that are non-Poisson in nature. Our current study extends prior analyses to accommodate arbitrary initial conditions, thereby expanding the applicability of our theoretical framework and providing a more adaptable tool for capturing the dynamics of biochemical reaction networks.
- [13] arXiv:2503.04648 [pdf, html, other]
Title: Assessing the performance of compartmental and renewal models for learning $R_{t}$ using spatially heterogeneous epidemic simulations on real geographies
Authors: Matthew Ghosh, Yunli Qi, Abbie Evans, Tom Reed, Lara Herriott, Ioana Bouros, Ben Lambert, David J. Gavaghan, Katherine M. Shepherd, Richard Creswell, Kit Gallagher
Comments: 48 pages, 4 figures, 7 supplementary figures
Subjects: Populations and Evolution (q-bio.PE)
The time-varying reproduction number ($R_t$) gives an indication of the trajectory of an infectious disease outbreak. Commonly used frameworks for inferring $R_t$ from epidemiological time series include those based on compartmental models (such as the SEIR model) and renewal equation models. These inference methods are usually validated using synthetic data generated from a simple model, often from the same class of model as the inference framework. However, in a real outbreak the transmission processes, and thus the infection data collected, are much more complex. The performance of common $R_t$ inference methods on data with similar complexity to real world scenarios has been subject to less comprehensive validation. We therefore propose evaluating these inference methods on outbreak data generated from a sophisticated, geographically accurate agent-based model. We illustrate this proposed method by generating synthetic data for two outbreaks in Northern Ireland: one with minimal spatial heterogeneity, and one with additional heterogeneity. We find that the simple SEIR model struggles with the greater heterogeneity, while the renewal equation model demonstrates greater robustness to spatial heterogeneity, though is sensitive to the accuracy of the generation time distribution used in inference. Our approach represents a principled way to benchmark epidemiological inference tools and is built upon an open-source software platform for reproducible epidemic simulation and inference.
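To illustrate the renewal equation idea mentioned above, here is a minimal deterministic sketch based on the textbook relation $I_t = R_t \sum_{s \ge 1} w_s I_{t-s}$, where $w_s$ is the generation time distribution. The function names, the toy generation time weights, and the naive ratio estimator are assumptions made for illustration; the paper's actual inference procedure and software platform are not reproduced here.

```python
def total_infectiousness(incidence, gen_time, t):
    """Lambda_t = sum over s >= 1 of w_s * I_{t-s}, with w_s = gen_time[s - 1]."""
    return sum(w * incidence[t - s]
               for s, w in enumerate(gen_time, start=1)
               if t - s >= 0)


def simulate_renewal(r_values, gen_time, seed_cases):
    """Deterministic forward simulation of I_t = R_t * Lambda_t."""
    incidence = [float(seed_cases)]
    for t in range(1, len(r_values)):
        lam = total_infectiousness(incidence, gen_time, t)
        incidence.append(r_values[t] * lam)
    return incidence


def naive_rt(incidence, gen_time):
    """Point estimate R_t = I_t / Lambda_t; None where Lambda_t is zero."""
    estimates = []
    for t in range(len(incidence)):
        lam = total_infectiousness(incidence, gen_time, t)
        estimates.append(incidence[t] / lam if lam > 0 else None)
    return estimates


# Toy scenario (assumed values): R_t drops from 1.5 to 0.8 halfway through.
true_r = [1.5] * 5 + [0.8] * 5
weights = [0.2, 0.5, 0.3]  # hypothetical generation time distribution
cases = simulate_renewal(true_r, weights, seed_cases=10)
recovered = naive_rt(cases, weights)
print(recovered)
```

When the same `weights` are used for simulation and estimation, the ratio recovers the true $R_t$ at every step after the seed. Passing a different generation time distribution to `naive_rt` will generally shift the estimates, which is a simple way to see the sensitivity noted in the abstract.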
New submissions (showing 13 of 13 entries)
- [14] arXiv:2503.04221 (cross-list from cond-mat.stat-mech) [pdf, html, other]
Title: Random search with stochastic resetting: when finding the target is not enough
Comments: 19 pages, 11 figures
Subjects: Statistical Mechanics (cond-mat.stat-mech); Quantitative Methods (q-bio.QM)
In this paper we consider a random search process with stochastic resetting and a partially accessible target $\mathcal{U}$. That is, when the searcher finds the target by attaching to its surface $\partial \mathcal{U}$ it does not have immediate access to the resources within the target interior. After a random waiting time, the searcher either gains access to the resources within or detaches and continues its search process. We also assume that the searcher requires an alternating sequence of periods of bulk diffusion interspersed with local surface interactions before being able to attach to the surface. The attachment, detachment and target entry events are the analogs of adsorption, desorption and absorption of a particle by a partially reactive surface in physical chemistry. In applications to animal foraging, the resources could represent food or shelter while resetting corresponds to an animal returning to its home base. We begin by considering a Brownian particle on the half-line with a partially accessible target at the origin $x=0$. We calculate the non-equilibrium stationary state (NESS) in the case of reversible adsorption and obtain the corresponding first passage time (FPT) density for absorption when adsorption is only partially reversible. We then reformulate the stochastic process in terms of a pair of renewal equations that relate the probability density and FPT density for absorption in terms of the corresponding quantities for irreversible adsorption. The renewal equations allow us to incorporate non-Markovian models of absorption and desorption. They also provide a useful decomposition of quantities such as the mean FPT (MFPT) in terms of the number of desorption events and the statistics of the waiting time density. Finally, we consider various extensions of the theory, including higher-dimensional search processes and an encounter-based model of absorption.
- [15] arXiv:2503.04347 (cross-list from cs.LG) [pdf, html, other]
Title: Large Language Models for Zero-shot Inference of Causal Structures in Biology
Comments: ICLR 2025 Workshop on Machine Learning for Genomics Explorations
Subjects: Machine Learning (cs.LG); Genomics (q-bio.GN)
Genes, proteins and other biological entities influence one another via causal molecular networks. Causal relationships in such networks are mediated by complex and diverse mechanisms, through latent variables, and are often specific to cellular context. It remains challenging to characterise such networks in practice. Here, we present a novel framework to evaluate large language models (LLMs) for zero-shot inference of causal relationships in biology. In particular, we systematically evaluate causal claims obtained from an LLM using real-world interventional data. This is done over one hundred variables and thousands of causal hypotheses. Furthermore, we consider several prompting and retrieval-augmentation strategies, including large, and potentially conflicting, collections of scientific articles. Our results show that with tailored augmentation and prompting, even relatively small LLMs can capture meaningful aspects of causal structure in biological systems. This supports the notion that LLMs could act as orchestration tools in biological discovery, by helping to distil current knowledge in ways amenable to downstream analysis. Our approach to assessing LLMs with respect to experimental data is relevant for a broad range of problems at the intersection of causal learning, LLMs and scientific discovery.
- [16] arXiv:2503.04362 (cross-list from cs.LG) [pdf, html, other]
Title: A Generalist Cross-Domain Molecular Learning Framework for Structure-Based Drug Discovery
Authors: Yiheng Zhu, Mingyang Li, Junlong Liu, Kun Fu, Jiansheng Wu, Qiuyi Li, Mingze Yin, Jieping Ye, Jian Wu, Zheng Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Structure-based drug discovery (SBDD) is a systematic scientific process that develops new drugs by leveraging the detailed physical structure of the target protein. Recent advancements in pre-trained models for biomolecules have demonstrated remarkable success across various biochemical applications, including drug discovery and protein engineering. However, in most approaches, the pre-trained models primarily focus on the characteristics of either small molecules or proteins, without delving into their binding interactions which are essential cross-domain relationships pivotal to SBDD. To fill this gap, we propose a general-purpose foundation model named BIT (an abbreviation for Biomolecular Interaction Transformer), which is capable of encoding a range of biochemical entities, including small molecules, proteins, and protein-ligand complexes, as well as various data formats, encompassing both 2D and 3D structures. Specifically, we introduce Mixture-of-Domain-Experts (MoDE) to handle the biomolecules from diverse biochemical domains and Mixture-of-Structure-Experts (MoSE) to capture positional dependencies in the molecular structures. The proposed mixture-of-experts approach enables BIT to achieve both deep fusion and domain-specific encoding, effectively capturing fine-grained molecular interactions within protein-ligand complexes. Then, we perform cross-domain pre-training on the shared Transformer backbone via several unified self-supervised denoising tasks. Experimental results on various benchmarks demonstrate that BIT achieves exceptional performance in downstream tasks, including binding affinity prediction, structure-based virtual screening, and molecular property prediction.
- [17] arXiv:2503.04365 (cross-list from stat.AP) [pdf, other]
Title: A Protocol to Exposure Path Analysis for Multiple Stressors Associated with Cardiovascular Disease Risk: A Novel Approach Using NHANES Data
Comments: 20 pages, 4 figures
Subjects: Applications (stat.AP); Quantitative Methods (q-bio.QM)
Background: Multiple medical and non-medical stressors, along with the complexity of their exposure pathways, have posed significant challenges to the epidemiological interpretation of non-communicable diseases, including cardiovascular disease (CVD). Objective: To develop a protocol for deconstructing the complex exposure pathways linking various stressors to adverse outcomes and to elucidate in depth the sequential determinants contributing to CVD risk. Methods: In this study, we developed a Path-Lasso approach, rooted in Adaptive Lasso regression, to construct the network and paths for interpreting the determinants of CVD in depth, using data from the National Health and Nutrition Examination Survey (NHANES). Univariate logistic regression was initially employed to screen for all potential factors influencing CVD. Then a programmed approach, using the Path-Lasso technique, stratified covariates and established a causal network to predict CVD risk. Results: Age, smoking and waist circumference were identified as the most significant predictors of CVD risk. Other factors, such as race, marital status, physical activity, cadmium exposure and diabetes, acted as intermediary or proximal variables. All these stressors (or nodes) formed the network with paths (or edges) linking to CVD, in which the latent layer variables that are causally associated with the outcome are formed linearly from the stressors in each layer. Discussion: The Path-Lasso approach revealed the epidemiological pathways linking covariates to CVD risk, which is instrumental in elucidating the inter-covariate transitions in their prediction of the outcome and in providing a hierarchical network as a foundation for the assessment of CVD risk and beyond.
- [18] arXiv:2503.04483 (cross-list from stat.ML) [pdf, html, other]
Title: InfoSEM: A Deep Generative Model with Informative Priors for Gene Regulatory Network Inference
Comments: ICLR 2025 AI4NA Oral, ICLR 2025 MLGenX Spotlight, ICLR 2025 LMRL
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Inferring Gene Regulatory Networks (GRNs) from gene expression data is crucial for understanding biological processes. While supervised models are reported to achieve high performance for this task, they rely on costly ground truth (GT) labels and risk learning gene-specific biases, such as class imbalances of GT interactions, rather than true regulatory mechanisms. To address these issues, we introduce InfoSEM, an unsupervised generative model that leverages textual gene embeddings as informative priors, improving GRN inference without GT labels. InfoSEM can also integrate GT labels as an additional prior when available, avoiding biases and further enhancing performance. Additionally, we propose a biologically motivated benchmarking framework that better reflects real-world applications such as biomarker discovery and reveals learned biases of existing supervised methods. InfoSEM outperforms existing models by 38.5% across four datasets using textual embeddings prior and further boosts performance by 11.1% when integrating labeled data as priors.
- [19] arXiv:2503.04490 (cross-list from cs.CL) [pdf, html, other]
Title: Large Language Models in Bioinformatics: A Survey
Subjects: Computation and Language (cs.CL); Genomics (q-bio.GN)
Large Language Models (LLMs) are revolutionizing bioinformatics, enabling advanced analysis of DNA, RNA, proteins, and single-cell data. This survey provides a systematic review of recent advancements, focusing on genomic sequence modeling, RNA structure prediction, protein function inference, and single-cell transcriptomics. Meanwhile, we also discuss several key challenges, including data scarcity, computational complexity, and cross-omics integration, and explore future directions such as multimodal learning, hybrid AI models, and clinical applications. By offering a comprehensive perspective, this paper underscores the transformative potential of LLMs in driving innovations in bioinformatics and precision medicine.
- [20] arXiv:2503.04527 (cross-list from math.DS) [pdf, html, other]
Title: The nexus between disease surveillance, adaptive human behavior and epidemic containment
Comments: 17 pages, 12 figures
Subjects: Dynamical Systems (math.DS); Populations and Evolution (q-bio.PE)
Epidemics exhibit interconnected processes that operate at multiple time and organizational scales, a hallmark of complex adaptive systems. Modern epidemiological modeling frameworks incorporate feedback between individual-level behavioral choices and centralized interventions. Nonetheless, the realistic operational course for disease detection, planning, and response is often overlooked. Disease detection is a dynamic challenge, shaped by the interplay between surveillance efforts and transmission characteristics. It serves as a tipping point that triggers emergency declarations, information dissemination, adaptive behavioral responses, and the deployment of public health interventions. Evaluating the impact of disease surveillance systems as triggers for adaptive behavior and public health interventions is key to designing effective control policies.
We examine the multiple behavioral and epidemiological dynamics generated by the feedback between disease surveillance and the intertwined dynamics of information and disease propagation. Specifically, we study the intertwined dynamics between: $(i)$ disease surveillance triggering health emergency declarations, $(ii)$ risk information dissemination producing decentralized behavioral responses, and $(iii)$ centralized interventions. Our results show that robust surveillance systems that quickly detect a disease outbreak can trigger an early response from the population, leading to large epidemic sizes. The key result is that the response scenarios that minimize the final epidemic size are determined by the trade-off between the risk information dissemination and disease transmission, with the triggering effect of surveillance mediating this trade-off. Finally, our results confirm that behavioral adaptation can create a hysteresis-like effect on the final epidemic size.
- [21] arXiv:2503.04572 (cross-list from physics.soc-ph) [pdf, html, other]
-
Title: Social Imitation Dynamics of Vaccination Driven by Vaccine Effectiveness and BeliefsComments: 10 pages, 6 figures, comments are welcomeSubjects: Physics and Society (physics.soc-ph); Populations and Evolution (q-bio.PE)
Declines in vaccination coverage for vaccine-preventable diseases, such as measles and chickenpox, have enabled their surprising comebacks and pose significant public health challenges in the wake of growing vaccine hesitancy. Vaccine opt-outs and refusals are often fueled by beliefs concerning perceptions of vaccine effectiveness and exaggerated risks. Here, we quantify the impact of competing beliefs -- vaccine-averse versus vaccine-neutral -- on social imitation dynamics of vaccination, alongside the epidemiological dynamics of disease transmission. These beliefs may be pre-existing and fixed, or coevolving attitudes. This interplay among beliefs, behaviors, and disease dynamics demonstrates that individuals are not perfectly rational; rather, they base their vaccine uptake decisions on beliefs, personal experiences, and social influences. We find that the presence of a small proportion of fixed vaccine-averse beliefs can significantly exacerbate the vaccination dilemma, making the tipping point in the hysteresis loop more sensitive to changes in individuals' perceived costs of vaccination and vaccine effectiveness. However, in scenarios where competing beliefs spread concurrently with vaccination behavior, their double-edged impact can lead to self-correction and alignment between vaccine beliefs and behaviors. The results show that coevolution of vaccine beliefs and behaviors makes populations more sensitive to abrupt changes in perceptions of vaccine cost and effectiveness compared to scenarios without beliefs. Our work provides valuable insights into harnessing the social contagion of even vaccine-neutral attitudes to overcome vaccine hesitancy.
- [22] arXiv:2503.04659 (cross-list from cond-mat.soft) [pdf, html, other]
-
Title: Predicting Heteropolymer Phase Separation Using Two-Chain Contact MapsSubjects: Soft Condensed Matter (cond-mat.soft); Biomolecules (q-bio.BM)
Phase separation in polymer solutions often correlates with single-chain and two-chain properties, such as the single-chain radius of gyration, Rg, and the pairwise second virial coefficient, B22. However, recent studies have shown that these metrics can fail to distinguish phase-separating from non-phase-separating heteropolymers, including intrinsically disordered proteins (IDPs). Here we introduce an approach to predict heteropolymer phase separation from two-chain simulations by analyzing contact maps, which capture how often specific monomers from the two chains are in physical proximity. Whereas B22 summarizes the overall attraction between two chains, contact maps preserve spatial information about their interactions. To compare these metrics, we train phase-separation classifiers for both a minimal heteropolymer model and a chemically specific, residue-level IDP model. Remarkably, simple statistical properties of two-chain contact maps predict phase separation with high accuracy, vastly outperforming classifiers based on Rg and B22 alone. Our results thus establish a transferable and computationally efficient method to uncover key driving forces of IDP phase behavior based on their physical interactions in dilute solution.
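As a rough illustration of what a two-chain contact map records, the sketch below counts how often each monomer pair from the two chains lies within a cutoff distance, averaged over simulation frames. The function name `contact_map`, the frame layout, and the cutoff value are assumptions made for this example and are not taken from the paper.

```python
import math

def contact_map(frames, cutoff):
    """Return the contact frequency for every monomer pair (i on chain A, j on chain B).

    frames: list of (chain_a, chain_b) tuples; each chain is a list of (x, y, z) positions.
    cutoff: distance below which two monomers count as being in contact (assumed value).
    """
    n_a = len(frames[0][0])
    n_b = len(frames[0][1])
    counts = [[0] * n_b for _ in range(n_a)]
    for chain_a, chain_b in frames:
        for i, p in enumerate(chain_a):
            for j, q in enumerate(chain_b):
                if math.dist(p, q) < cutoff:
                    counts[i][j] += 1
    # Normalise counts to the fraction of frames in which each pair is in contact.
    return [[c / len(frames) for c in row] for row in counts]

# Two toy frames with two monomers per chain (hypothetical coordinates).
frames = [
    ([(0, 0, 0), (1, 0, 0)], [(0, 0, 0.5), (5, 0, 0)]),
    ([(0, 0, 0), (1, 0, 0)], [(10, 0, 0), (1, 0, 0.5)]),
]
cmap = contact_map(frames, cutoff=1.0)
print(cmap)
```

Unlike a single scalar such as B22, the resulting matrix keeps track of which monomer pairs interact, which is the spatial information the abstract refers to.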
- [23] arXiv:2503.04677 (cross-list from cond-mat.soft) [pdf, html, other]
-
Title: Capacitive response of biological membranesSubjects: Soft Condensed Matter (cond-mat.soft); Biological Physics (physics.bio-ph); Subcellular Processes (q-bio.SC)
We present a minimal model to analyze the capacitive response of a biological membrane subjected to a step voltage via blocking electrodes. Through a perturbative analysis of the underlying electrolyte transport equations, we show that the leading-order relaxation of the transmembrane potential is governed by a capacitive timescale, ${\tau_{\rm C} =\dfrac{\lambda_{\rm D}L}{D}\left(\dfrac{2+\Gamma\delta^{\rm M}/L}{4+\Gamma\delta^{\rm M}/\lambda_{\rm D}}\right)}$, where $\lambda_{\rm D}$ is the Debye screening length, $L$ is the electrolyte width, $\Gamma$ is the ratio of the dielectric permittivity of the electrolyte to the membrane, $\delta^{\rm M}$ is the membrane thickness, and $D$ is the ionic diffusivity. This timescale is considerably shorter than the traditional RC timescale ${\lambda_{\rm D} L / D}$ for a bare electrolyte due to the membrane's low dielectric permittivity and finite thickness. Beyond the linear regime, however, salt diffusion in the bulk electrolyte drives a secondary, nonlinear relaxation process of the transmembrane potential over a longer timescale ${\tau_{\rm L} =L^2/4\pi^2 D}$. A simple equivalent-circuit model accurately captures the linear behavior, and the perturbation expansion remains applicable across the entire range of observed physiological transmembrane potentials. Together, these findings underscore the importance of the faster capacitive timescale and nonlinear effects on the bulk diffusion timescale in determining transmembrane potential dynamics for a range of biological systems.
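The three timescales quoted in this abstract can be compared numerically. The sketch below evaluates the stated expressions for one set of illustrative parameter values; those values are assumptions chosen for this example, not figures from the paper, and the diffusion timescale is read as $L^2/(4\pi^2 D)$.

```python
import math

def bare_rc_timescale(lambda_d, L, D):
    """Traditional RC timescale lambda_D * L / D for a bare electrolyte."""
    return lambda_d * L / D

def capacitive_timescale(lambda_d, L, D, gamma, delta_m):
    """tau_C = (lambda_D L / D) * (2 + Gamma*delta_M/L) / (4 + Gamma*delta_M/lambda_D)."""
    factor = (2.0 + gamma * delta_m / L) / (4.0 + gamma * delta_m / lambda_d)
    return bare_rc_timescale(lambda_d, L, D) * factor

def bulk_diffusion_timescale(L, D):
    """tau_L, read here as L^2 / (4 pi^2 D)."""
    return L**2 / (4.0 * math.pi**2 * D)

# Illustrative (assumed) values in SI units.
lambda_d = 1e-9   # Debye length, m
L = 1e-6          # electrolyte width, m
D = 1e-9          # ionic diffusivity, m^2/s
gamma = 40.0      # permittivity ratio, electrolyte to membrane
delta_m = 4e-9    # membrane thickness, m

tau_c = capacitive_timescale(lambda_d, L, D, gamma, delta_m)
tau_rc = bare_rc_timescale(lambda_d, L, D)
tau_l = bulk_diffusion_timescale(L, D)
print(f"tau_C = {tau_c:.3e} s, tau_RC = {tau_rc:.3e} s, tau_L = {tau_l:.3e} s")
```

For these assumed values the ordering is tau_C < tau_RC < tau_L, in line with the abstract's description of a fast capacitive relaxation followed by a slower diffusive one.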
- [24] arXiv:2503.04716 (cross-list from physics.bio-ph) [pdf, html, other]
-
Title: Optimal Cell Shape for Accurate Chemical Gradient Sensing in Eukaryote ChemotaxisSubjects: Biological Physics (physics.bio-ph); Cell Behavior (q-bio.CB)
Accurate gradient sensing is crucial for efficient chemotaxis in noisy environments, but the relationship between cell shape deformations and sensing accuracy is not well understood. Using a theoretical framework based on maximum likelihood estimation, we show that the receptor dispersion, quantified by the convex hull of the cell shape, fundamentally limits gradient sensing accuracy. Cells with a concave shape and isotropic error space achieve optimal performance in gradient detection. This concave shape, resulting from active protrusions or contractions, can significantly improve gradient sensing accuracy at the cost of increased energy expenditure. By balancing sensing accuracy and deformation cost, we predict that a concave, three-branched shape is optimal for cells in shallow gradients. To achieve efficient chemotaxis, our theory suggests that a cell should adopt a repeating "run-and-expansion" cycle. Our theoretical predictions align well with experimental observations, implying that the fast amoeboid cell motion is optimized near the physical limit for chemotaxis. This study highlights the crucial role of active cell shape deformation in facilitating accurate chemotaxis.
Cross submissions (showing 11 of 11 entries)
- [25] arXiv:2309.00061 (replaced) [pdf, other]
-
Title: GeneFEAST: the pivotal, gene-centric step in functional enrichment analysis interpretationComments: This article has been accepted for publication in Bioinformatics Published by Oxford University Press. This version has been peer-reviewed, is the Version of Record, and replaces the previous version deposited here. Main text: 5 pages, 2 figures. Supplementary information is available at Bioinformatics onlineJournal-ref: Bioinformatics, 2025, btaf100Subjects: Quantitative Methods (q-bio.QM)
Summary: GeneFEAST, implemented in Python, is a gene-centric functional enrichment analysis summarisation and visualisation tool that can be applied to large functional enrichment analysis (FEA) results arising from upstream FEA pipelines. It produces a systematic, navigable HTML report, making it easy to identify sets of genes putatively driving multiple enrichments and to explore gene-level quantitative data first used to identify input genes. Further, GeneFEAST can compare FEA results from multiple studies, making it possible, for example, to highlight patterns of gene expression amongst genes commonly differentially expressed in two sets of conditions, and giving rise to shared enrichments under those conditions. GeneFEAST offers a novel, effective way to address the complexities of linking up many overlapping FEA results to their underlying genes and data, advancing gene-centric hypotheses, and providing pivotal information for downstream validation experiments.
Availability: GeneFEAST is available at this https URL
Contact: this http URL@well.this http URL
- [26] arXiv:2407.08974 (replaced) [pdf, html, other]
-
Title: Topology-enhanced machine learning model (Top-ML) for anticancer peptide predictionSubjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); General Topology (math.GN); Biomolecules (q-bio.BM)
Recently, therapeutic peptides have demonstrated great promise for cancer treatment. To explore powerful anticancer peptides, artificial intelligence (AI)-based approaches have been developed to systematically screen potential candidates. However, the lack of efficient featurization of peptides has become a bottleneck for these machine-learning models. In this paper, we propose a topology-enhanced machine learning model (Top-ML) for anticancer peptide prediction. Our Top-ML employs peptide topological features derived from its sequence "connection" information characterized by vector and spectral descriptors. Our Top-ML model, employing an Extra-Trees classifier, has been validated on the AntiCP 2.0 and mACPpred 2.0 benchmark datasets, achieving state-of-the-art performance or results comparable to existing deep learning models, while providing greater interpretability. Our results highlight the potential of leveraging novel topology-based featurization to accelerate the identification of anticancer peptides.
- [27] arXiv:2410.04542 (replaced) [pdf, html, other]
-
Title: Generative Flows on Synthetic Pathway for Drug DesignComments: Accepted to ICLR 2025, 32 pages, 17 figures, code: this https URLSubjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Generative models in drug discovery have recently gained attention as efficient alternatives to brute-force virtual screening. However, most existing models do not account for synthesizability, limiting their practical use in real-world scenarios. In this paper, we propose RxnFlow, which sequentially assembles molecules using predefined molecular building blocks and chemical reaction templates to constrain the synthetic chemical pathway. We then train on this sequential generating process with the objective of generative flow networks (GFlowNets) to generate both highly rewarded and diverse molecules. To mitigate the large action space of synthetic pathways in GFlowNets, we implement a novel action space subsampling method. This enables RxnFlow to learn generative flows over extensive action spaces comprising combinations of 1.2 million building blocks and 71 reaction templates without significant computational overhead. Additionally, RxnFlow can employ modified or expanded action spaces for generation without retraining, allowing for the introduction of additional objectives or the incorporation of newly discovered building blocks. We experimentally demonstrate that RxnFlow outperforms existing reaction-based and fragment-based models in pocket-specific optimization across various target pockets. Furthermore, RxnFlow achieves state-of-the-art performance on CrossDocked2020 for pocket-conditional generation, with an average Vina score of -8.85 kcal/mol and 34.8% synthesizability.
- [28] arXiv:2410.05972 (replaced) [pdf, html, other]
-
Title: Node-reconfiguring multilayer networks of human brain functionTarmo Nurmi (1), Pietro De Luca (1), Maria Hakonen (2,3), Mikko Kivelä (1), Onerva Korhonen (1,4) ((1) Department of Computer Science, Aalto University, Helsinki, Finland, (2) Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital Charlestown, Boston, MA, USA, (3) Department of Radiology, Harvard Medical School, Boston, MA, USA, (4) Faculty of Science, Forestry and Technology, University of Eastern Finland, Joensuu, Finland)Comments: 25+6 pages, 8+4 figures; textual edits and additional references in various sections of the manuscript, results unchanged; added Maria Hakonen who collected and preprocessed the data as an authorSubjects: Neurons and Cognition (q-bio.NC); Computational Physics (physics.comp-ph)
Functional brain network properties are heavily influenced by how the network nodes are defined. A common approach uses Regions of Interest (ROIs), i.e., predetermined collections of functional magnetic resonance imaging (fMRI) measurement voxels, as nodes. Their definition is always a compromise, as static ROIs cannot capture the dynamics and temporal reconfigurations of the brain areas. Consequently, the ROIs do not align with the functionally homogeneous regions, which can explain the low functional homogeneity values observed for the ROIs. This is in violation of the underlying homogeneity assumption in functional brain network analysis pipelines, which can cause serious problems such as spurious network structure. We introduce the node-reconfiguring multilayer network model, where nodes represent ROIs with boundaries optimized for high functional homogeneity in each time window. In this representation, network layers correspond to time windows, intralayer links depict functional connectivity between ROIs, and interlayer links quantify the overlap between ROIs on different layers. The ROI optimization approach increases functional homogeneity notably, yielding an over 10-fold increase in the fraction of ROIs with high homogeneity compared to static ROIs from the Brainnetome atlas. The optimized ROIs reorganize non-trivially at short time scales of consecutive time windows and across several windows. The amount of reorganization across time windows is connected to intralayer hubness: ROIs with intermediate levels of reorganization have stronger intralayer links than extremely stable or unstable ROIs. Our results demonstrate that reconfiguring parcellations yield more accurate network models of brain function. This supports the ongoing paradigm shift towards the chronnectome that sees the brain as a set of sources with continuously reconfiguring spatial and connectivity profiles.
- [29] arXiv:2411.15078 (replaced) [pdf, other]
-
Title: Functional dissociations versus post-hoc selection: Moving beyond the Stockart et al. (2025) compromiseComments: 30 pages, 4 figures. In this version, we added a footnote in response to comments in Stockart et al.'s (2025) new version of the paper. The most important change is the new Figure 3, which shows an empirical ROC curve illustrating our major argument that post-hoc sorting is based on a statistical fallacy. All other changes are minorSubjects: Neurons and Cognition (q-bio.NC)
Stockart et al. (2025) recommend guidelines for best practices in the field of unconscious cognition. However, they condone the repeatedly criticized technique of excluding trials with high visibility ratings or of participants with high sensitivity for the critical stimulus. Based on standard signal detection theory for discrimination judgments, we show that post-hoc trial selection only isolates points of neutral response bias but remains consistent with uncomfortably high levels of sensitivity. We argue that post-hoc selection constitutes a sampling fallacy that capitalizes on chance, generates regression artifacts, and wrongly ascribes unconscious processing to stimulus conditions that may be far from indiscriminable. As an alternative, we advocate the study of functional dissociations, where direct (D) and indirect (I) measures are conceptualized as spanning up a two-dimensional D-I-space and where single, sensitivity, and double dissociations appear as distinct curve patterns. While Stockart et al.'s recommendations cover only a single line of that space where D is close to zero, functional dissociations can utilize the entire space, circumventing requirements like null visibility and exhaustive reliability, and allowing for the planful measurement of theoretically meaningful functional relationships between experimentally controlled variables.
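To make the distinction between response bias and sensitivity concrete, here is a minimal sketch of the textbook equal-variance Gaussian signal detection measures. It is not the authors' analysis; the function name `sdt_measures` and the example hit and false-alarm rates are assumptions chosen for illustration.

```python
from statistics import NormalDist

def sdt_measures(hit_rate, false_alarm_rate):
    """Return (d_prime, criterion) under the equal-variance Gaussian model.

    d_prime   = z(H) - z(FA)          (sensitivity)
    criterion = -0.5 * (z(H) + z(FA)) (response bias; 0 means neutral)
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    z_hit, z_fa = z(hit_rate), z(false_alarm_rate)
    return z_hit - z_fa, -0.5 * (z_hit + z_fa)

# Hypothetical rates: symmetric around 0.5, so the bias is neutral.
d_prime, criterion = sdt_measures(0.84, 0.16)
print(f"d' = {d_prime:.2f}, c = {criterion:.2f}")
```

In this example the criterion is essentially zero while d' is roughly 2, which shows that neutral response bias by itself says nothing about sensitivity being low.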
- [30] arXiv:2411.15684 (replaced) [pdf, html, other]
-
Title: Disentangling the Complex Multiplexed DIA Spectra in De Novo Peptide SequencingSubjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Data-Independent Acquisition (DIA) was introduced to improve sensitivity by covering all peptides in a range rather than only sampling high-intensity peaks as in Data-Dependent Acquisition (DDA) mass spectrometry. However, it is not clear how useful DIA data are for de novo peptide sequencing, as they are marred by coeluted peptides, high noise, and varying data quality. We present a new deep learning method, DIANovo, that addresses each of these difficulties and improves on the previously established system DeepNovo-DIA by 25% to 81% (averaging 48%) for amino acid recall and by 27% to 89% (averaging 57%) for peptide recall, by equipping the model with a deeper understanding of coeluted DIA spectra. This paper also provides criteria for when DIA data can and cannot be used for de novo peptide sequencing, by comparing DDA and DIA in both de novo and database search modes. We find that while DIA excels with narrow isolation windows on older-generation instruments, it loses its advantage with wider windows. However, with Orbitrap Astral, DIA consistently outperforms DDA because narrow window mode is enabled. We also provide a theoretical explanation of this phenomenon, emphasizing the critical role of the signal-to-noise profile in the successful application of de novo sequencing.
- [31] arXiv:2502.11395 (replaced) [pdf, other]
-
Title: Targeting C99 Mediated Metabolic Disruptions with Ketone Therapy in Alzheimer's DiseaseSubjects: Neurons and Cognition (q-bio.NC)
The role of ketone bodies in Alzheimer's disease (AD) remains incompletely understood, particularly regarding their influence on amyloid pathology. While beta-hydroxybutyrate (BHB) has been implicated in neuroprotection, direct evidence for its effects on amyloid beta (Abeta) deposition, aggregation, or clearance is lacking. Furthermore, whether BHB acts as a disease-modifying factor or merely confers transient metabolic benefits remains unclear. Addressing this gap is crucial for evaluating the therapeutic potential of ketone metabolism in AD. Here, we investigated the impact of ketone bodies on amyloidogenic toxicity using a Drosophila melanogaster model with targeted expression of human amyloid precursor protein (APP), beta secretase 1 (BACE1), Abeta, and the C99 fragment, an essential intermediate in Abeta generation. Surprisingly, we found that Abeta alone elicited minimal neurotoxicity, whereas C99 expression induced pronounced pathological effects, suggesting a critical, underappreciated role of C99 in AD progression. Further analysis revealed that C99-driven toxicity was associated with autophagic and lysosomal dysfunction, leading to impaired protein clearance, oxidative stress, and mitochondrial abnormalities. Using confocal microscopy and lysosomal pH-sensitive markers, we demonstrated that BHB treatment restored lysosomal function and alleviated these pathological changes. Protein-protein interaction (PPI) network analysis in C99-expressing Drosophila brains identified protein phosphatase methylesterase 1 (PPME1) activation as a key driver of autophagic impairment, further supported by machine learning predictions. Finally, mathematical similarity analysis of PPI networks suggested that BHB may exert its neuroprotective effects through mTOR inhibition, positioning it as a potential endogenous modulator of AD-related pathology.
- [32] arXiv:2502.17504 (replaced) [pdf, other]
-
Title: Protein Large Language Models: A Comprehensive SurveyYijia Xiao, Wanjia Zhao, Junkai Zhang, Yiqiao Jin, Han Zhang, Zhicheng Ren, Renliang Sun, Haixin Wang, Guancheng Wan, Pan Lu, Xiao Luo, Yu Zhang, James Zou, Yizhou Sun, Wei WangComments: 24 pages, 4 figures, 5 tablesSubjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Protein-specific large language models (Protein LLMs) are revolutionizing protein science by enabling more efficient protein structure prediction, function annotation, and design. While existing surveys focus on specific aspects or applications, this work provides the first comprehensive overview of Protein LLMs, covering their architectures, training datasets, evaluation metrics, and diverse applications. Through a systematic analysis of over 100 articles, we propose a structured taxonomy of state-of-the-art Protein LLMs, analyze how they leverage large-scale protein sequence data for improved accuracy, and explore their potential in advancing protein engineering and biomedical research. Additionally, we discuss key challenges and future directions, positioning Protein LLMs as essential tools for scientific discovery in protein science. Resources are maintained at this https URL.
- [33] arXiv:2403.09193 (replaced) [pdf, html, other]
-
Title: Can We Talk Models Into Seeing the World Differently?Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, M. Jehanzeb Mirza, Margret Keuper, Janis KeuperComments: Accepted at ICLR 2025Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Unlike traditional vision-only models, vision language models (VLMs) offer an intuitive way to access visual content through language prompting by combining a large language model (LLM) with a vision encoder. However, both the LLM and the vision encoder come with their own set of biases, cue preferences, and shortcuts, which have been rigorously studied in uni-modal models. A timely question is how such (potentially misaligned) biases and cue preferences behave under multi-modal fusion in VLMs. As a first step towards a better understanding, we investigate a particularly well-studied vision-only bias - the texture vs. shape bias and the dominance of local over global information. As expected, we find that VLMs inherit this bias to some extent from their vision encoders. Surprisingly, the multi-modality alone proves to have important effects on the model behavior, i.e., the joint training and the language querying change the way visual cues are processed. While this direct impact of language-informed training on a model's visual perception is intriguing, it raises further questions on our ability to actively steer a model's output so that its prediction is based on particular visual cues of the user's choice. Interestingly, VLMs have an inherent tendency to recognize objects based on shape information, which is different from what a plain vision encoder would do. Further active steering towards shape-based classifications through language prompts is however limited. In contrast, active VLM steering towards texture-based decisions through simple natural language prompts is often more successful.
URL: this https URL
- [34] arXiv:2406.19787 (replaced) [pdf, html, other]
-
Title: Approximate solutions of a general stochastic velocity-jump model subject to discrete-time noisy observationsComments: Main: 36 pages, 9 figures. Supplementary Information: 25 pages, 5 figuresSubjects: Data Analysis, Statistics and Probability (physics.data-an); Quantitative Methods (q-bio.QM)
Advances in experimental techniques allow the collection of high-resolution spatio-temporal data that track individual motile entities over time. These tracking data motivate the use of mathematical models to characterise the motion observed. In this paper, we aim to describe the solutions of velocity-jump models for single-agent motion in one spatial dimension, characterised by successive Markovian transitions within a finite network of n states, each with a specified velocity and a fixed rate of switching to every other state. In particular, we focus on obtaining the solutions of the model subject to discrete-time noisy observations, with no direct access to the agent state. The lack of direct observation of the hidden state makes the problem of finding the exact distributions generally intractable. Therefore, we derive a series of approximations for the data distributions. We verify the accuracy of these approximations by comparing them to the empirical distributions generated through simulations of four example model structures. These comparisons confirm that the approximations are accurate given sufficiently infrequent state switching relative to the imaging frequency. The approximate distributions computed can be used to obtain fast forwards predictions, to give guidelines on experimental design, and as likelihoods for inference and model selection.
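The model class described here can be simulated directly, which is how empirical distributions of the kind mentioned in the abstract are typically generated. The sketch below is a minimal illustration and not the authors' code: the function name, the Gaussian form of the observation noise, and all parameter values are assumptions.

```python
import random

def simulate_velocity_jump(velocities, rates, dt, n_obs, noise_sd, rng, state=0):
    """Simulate a 1D velocity-jump process and return noisy positions at dt, 2*dt, ...

    velocities[i]: velocity while in state i
    rates[i][j]:   rate of switching from state i to state j (diagonal ignored)
    noise_sd:      standard deviation of the assumed Gaussian observation noise
    """
    n = len(velocities)

    def draw_wait(i):
        # Exponential waiting time with the total exit rate of state i.
        total = sum(rates[i][j] for j in range(n) if j != i)
        return rng.expovariate(total) if total > 0 else float("inf")

    x = 0.0
    time_to_switch = draw_wait(state)
    observations = []
    for _ in range(n_obs):
        remaining = dt
        # Process every switch that occurs before the next observation time.
        while time_to_switch <= remaining:
            x += velocities[state] * time_to_switch
            remaining -= time_to_switch
            others = [j for j in range(n) if j != state]
            weights = [rates[state][j] for j in others]
            state = rng.choices(others, weights=weights)[0]
            time_to_switch = draw_wait(state)
        x += velocities[state] * remaining
        time_to_switch -= remaining
        # Only the noisy position is recorded; the state itself stays hidden.
        observations.append(x + rng.gauss(0.0, noise_sd))
    return observations

# Hypothetical two-state example: forward and backward motion.
rng = random.Random(0)
track = simulate_velocity_jump([2.0, -1.0], [[0.0, 1.5], [0.7, 0.0]],
                               dt=0.1, n_obs=50, noise_sd=0.05, rng=rng)
print(track[:5])
```

Because the waiting times are exponential, carrying the leftover `time_to_switch` across observation intervals is statistically equivalent to redrawing it, so the observation grid does not alter the underlying process.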
- [35] arXiv:2502.07272 (replaced) [pdf, html, other]
-
Title: GENERator: A Long-Context Generative Genomic Foundation ModelSubjects: Computation and Language (cs.CL); Genomics (q-bio.GN)
Advancements in DNA sequencing technologies have significantly improved our ability to decode genomic sequences. However, the prediction and interpretation of these sequences remain challenging due to the intricate nature of genetic material. Large language models (LLMs) have introduced new opportunities for biological sequence analysis. Recent developments in genomic language models have underscored the potential of LLMs in deciphering DNA sequences. Nonetheless, existing models often face limitations in robustness and application scope, primarily due to constraints in model structure and training data scale. To address these limitations, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of eukaryotic DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. The model adheres to the central dogma of molecular biology, accurately generating protein-coding sequences that translate into proteins structurally analogous to known families. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles. These capabilities position the GENERator as a pivotal tool for genomic research and biotechnological advancement, enhancing our ability to interpret and predict complex biological systems and enabling precise genomic interventions. Implementation details and supplementary resources are available at this https URL.
- [36] arXiv:2503.00638 (replaced) [pdf, other]
-
Title: POSERS: Steganography-Driven Molecular Tagging Using Randomized DNA SequencesSubjects: Cryptography and Security (cs.CR); Probability (math.PR); Biomolecules (q-bio.BM)
Counterfeiting poses a significant challenge across multiple industries, leading to financial losses and health risks. While DNA-based molecular tagging has emerged as a promising anti-counterfeiting strategy, existing methods rely on predefined DNA sequences, making them vulnerable to replication as sequencing and synthesis technologies advance. To address these limitations, we introduce POSERS (Position-Oriented Scattering of Elements among a Randomized Sequence), a steganographic tagging system embedded within DNA sequences. POSERS ensures copy- and forgery-proof authentication by adding restrictions within randomized DNA libraries, enhancing security against counterfeiting attempts. The POSERS design allows the complexity of the libraries to be adjusted based on the customer's needs while ensuring they withstand the ongoing improvements in DNA synthesis and sequencing technologies. We mathematically validate its security properties and experimentally demonstrate its effectiveness using Next-Generation Sequencing and an authentication test, successfully distinguishing genuine POSERS tags from counterfeit ones. Our results highlight the potential of POSERS as a long-term, adaptable solution for secure product authentication.