Quantitative Methods
- [1] arXiv:2406.12949 [pdf, html, other]
Title: Integrating time-resolved $nrf2$ gene-expression data into a full GUTS model as a proxy for toxicodynamic damage in zebrafish embryo
Subjects: Quantitative Methods (q-bio.QM); Dynamical Systems (math.DS); Applications (stat.AP)
The immense production of the chemical industry requires an improved predictive risk assessment that can handle constantly evolving challenges while reducing the dependency of risk assessment on animal testing. Integrating 'omics data into mechanistic models offers a promising solution by linking cellular processes triggered after chemical exposure with observed effects in the organism. With the emerging availability of time-resolved RNA data, the goal of integrating gene expression data into mechanistic models can be approached. We propose a biologically anchored TKTD model, which describes key processes that link the gene expression level of the stress regulator $nrf2$ to detoxification and lethality by associating toxicodynamic damage with $nrf2$ expression. Fitting such a model to complex datasets consisting of multiple endpoints required the combination of methods from molecular biology, mechanistic dynamic systems modeling and Bayesian inference. In this study we successfully integrate time-resolved gene expression data into TKTD models, and thus provide a method for assessing the influence of molecular markers on survival. This novel method was used to test whether $nrf2$ can be applied to predict lethality in zebrafish embryos. With the presented approach we outline a method to successively approach the goal of a predictive risk assessment based on molecular data.
- [2] arXiv:2406.12950 [pdf, html, other]
Title: MolecularGPT: Open Large Language Model (LLM) for Few-Shot Molecular Property Prediction
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Molecular property prediction (MPP) is a fundamental and crucial task in drug discovery. However, prior methods are limited by the requirement for a large number of labeled molecules and their restricted ability to generalize to unseen and new tasks, both of which are essential for real-world applications. To address these challenges, we present MolecularGPT for few-shot MPP. From a perspective on instruction tuning, we fine-tune large language models (LLMs) based on curated molecular instructions spanning over 1000 property prediction tasks. This enables building a versatile and specialized LLM that can be adapted to novel MPP tasks without any fine-tuning through zero- and few-shot in-context learning (ICL). MolecularGPT exhibits competitive in-context reasoning capabilities across 10 downstream evaluation datasets, setting new benchmarks for few-shot molecular prediction tasks. More importantly, with just two-shot examples, MolecularGPT can outperform standard supervised graph neural network methods on 4 out of 7 datasets. It also outperforms state-of-the-art LLM baselines under zero-shot settings, with up to a 16.6% increase in classification accuracy and a decrease of 199.17 in regression metrics (e.g., RMSE). This study demonstrates the potential of LLMs as effective few-shot molecular property predictors. The code is available at this https URL.
- [3] arXiv:2406.13292 [pdf, html, other]
Title: An interpretable generative multimodal neuroimaging-genomics framework for decoding Alzheimer's disease
Authors: Giorgio Dolci (1,2), Federica Cruciani (1), Md Abdur Rahaman (2), Anees Abrol (2), Jiayu Chen (2), Zening Fu (2), Ilaria Boscolo Galazzo (1), Gloria Menegaz (1), Vince D. Calhoun (2) ((1) Department of Engineering for Innovation Medicine, University of Verona, Verona, Italy, (2) Tri-Institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, Emory University, Atlanta, GA, USA)
Comments: 27 pages, 7 figures, submitted to a journal
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Alzheimer's disease (AD) is the most prevalent form of dementia with a progressive decline in cognitive abilities. The AD continuum encompasses a prodromal stage known as Mild Cognitive Impairment (MCI), where patients may either progress to AD or remain stable. In this study, we leveraged structural and functional MRI to investigate the disease-induced grey matter and functional network connectivity changes. Moreover, considering AD's strong genetic component, we introduce SNPs as a third channel. Given such diverse inputs, missing one or more modalities is a typical concern of multimodal methods. We hence propose a novel deep learning-based classification framework in which a generative module employing Cycle GANs is adopted to impute missing data within the latent space. Additionally, we adopted an Explainable AI method, Integrated Gradients, to extract the relevance of input features, enhancing our understanding of the learned representations. Two critical tasks were addressed: AD detection and MCI conversion prediction. Experimental results showed that our model was able to reach the state of the art in the classification of CN/AD, with an average test accuracy of $0.926\pm0.02$. For the MCI task, we achieved an average prediction accuracy of $0.711\pm0.01$ using the pre-trained model for CN/AD. The interpretability analysis revealed significant grey matter modulations in cortical and subcortical brain areas well known for their association with AD. Moreover, impairments in sensory-motor and visual resting state network connectivity along the disease continuum, as well as mutations in SNPs defining biological processes linked to amyloid-beta and cholesterol formation, clearance, and regulation, were identified as contributors to the achieved performance. Overall, our integrative deep learning approach shows promise for AD detection and MCI prediction, while shedding light on important biological insights.
- [4] arXiv:2406.13489 [pdf, html, other]
Title: Efficient gPC-based quantification of probabilistic robustness for systems in neuroscience
Subjects: Quantitative Methods (q-bio.QM)
We introduce and analyze generalised polynomial chaos (gPC), considering both intrusive and non-intrusive approaches, as an uncertainty quantification method in studies of probabilistic robustness. The considered gPC methods are complementary to Monte Carlo (MC) methods and are shown to be fast and scalable, allowing for comprehensive and efficient exploration of parameter spaces. These properties enable robustness analysis of a wider set of models, compared to computationally expensive MC methods, while retaining desired levels of accuracy. We discuss the application of gPC methods to systems in biology and neuroscience, notably subject to multiple parametric uncertainties, and we examine a well-known model of neural dynamics as a case study.
- [5] arXiv:2406.13889 [pdf, html, other]
Title: Network-community analysis of cellular senescence
Authors: Alda Sabalic, Victoria Moiseeva, Andres Cisneros, Oleg Deryagin, Eusebio Perdiguero, Pura Muñoz-Canoves, Jordi Garcia-Ojalvo
Comments: 20 pages, 11 figures
Subjects: Quantitative Methods (q-bio.QM)
Most cellular phenotypes are genetically complex. Identifying the set of genes that are most closely associated with a specific cellular state is still an open question in many cases. Here we study the transcriptional profile of cellular senescence using a combination of network-based approaches, which include eigenvector centrality feature selection and community detection. We apply our method to cell-type-resolved RNA sequencing data obtained from injured muscle tissue in mice. The analysis identifies some genetic markers consistent with previous findings, and other previously unidentified ones, which are validated with previously published single-cell RNA sequencing data in a different type of tissue. The key identified genes, both those previously known and the newly identified ones, are transcriptional targets of factors known to be associated with established hallmarks of senescence, and can thus be interpreted as molecular correlates of such hallmarks. The method proposed here could be applied to any complex cellular phenotype even when only bulk RNA sequencing is available, provided the data is resolved by cell type.
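As a rough illustration of the eigenvector centrality step mentioned in this abstract, the sketch below ranks nodes of a small graph by power iteration. The 4-node graph, the function name, and the use of a shifted matrix (A + I) are assumptions made for illustration; this is not the authors' pipeline.

```python
def eigenvector_centrality(adjacency, iterations=200, tol=1e-10):
    """Power iteration on (A + I). The identity shift keeps the eigenvectors
    of A but avoids oscillation on bipartite graphs."""
    n = len(adjacency)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = [
            scores[i] + sum(adjacency[i][j] * scores[j] for j in range(n))
            for i in range(n)
        ]
        norm = sum(v * v for v in updated) ** 0.5
        updated = [v / norm for v in updated]
        if max(abs(a - b) for a, b in zip(updated, scores)) < tol:
            return updated
        scores = updated
    return scores


# Hypothetical co-expression graph: genes 0, 1, 2 form a triangle,
# and gene 3 is connected only to gene 2.
toy_graph = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
centrality = eigenvector_centrality(toy_graph)
# Genes ordered from most to least central.
ranked_genes = sorted(range(len(centrality)), key=lambda g: centrality[g], reverse=True)
```

In a feature-selection setting, the top-ranked nodes would be the candidates kept for downstream analysis; how many to keep is a separate choice not shown here.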
- [6] arXiv:2406.14062 [pdf, html, other]
Title: An agent-based model of behaviour change calibrated to reversal learning data
Comments: 23 pages, 5 figures
Subjects: Quantitative Methods (q-bio.QM); Biological Physics (physics.bio-ph); Computation (stat.CO)
Behaviour change lies at the heart of many observable collective phenomena such as the transmission and control of infectious diseases, adoption of public health policies, and migration of animals to new habitats. Representing the process of individual behaviour change in computer simulations of these phenomena remains an open challenge. Often, computational models use phenomenological implementations with limited support from behavioural data. Without a strong connection to observable quantities, such models have limited utility for simulating observed and counterfactual scenarios of emergent phenomena because they cannot be validated or calibrated. Here, we present a simple stochastic individual-based model of reversal learning that captures fundamental properties of individual behaviour change, namely, the capacity to learn based on accumulated reward signals, and the transient persistence of learned behaviour after rewards are removed or altered. The model has only two parameters, and we use approximate Bayesian computation to demonstrate that they are fully identifiable from empirical reversal learning time series data. Finally, we demonstrate how the model can be extended to account for the increased complexity of behavioural dynamics over longer time scales involving fluctuating stimuli. This work is a step towards the development and evaluation of fully identifiable individual-level behaviour change models that can function as validated submodels for complex simulations of collective behaviour change.
- [7] arXiv:2406.14142 [pdf, other]
Title: Geometric Self-Supervised Pretraining on 3D Protein Structures using Subgraphs
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Protein representation learning aims to learn informative protein embeddings capable of addressing crucial biological questions, such as protein function prediction. Although sequence-based transformer models have shown promising results by leveraging the vast amount of protein sequence data in a self-supervised way, there is still a gap in applying these methods to 3D protein structures. In this work, we propose a pre-training scheme going beyond trivial masking methods leveraging 3D and hierarchical structures of proteins. We propose a novel self-supervised method to pretrain 3D graph neural networks on 3D protein structures, by predicting the distances between local geometric centroids of protein subgraphs and the global geometric centroid of the protein. The motivation for this method is twofold. First, the relative spatial arrangements and geometric relationships among different regions of a protein are crucial for its function. Second, proteins are often organized in a hierarchical manner, where smaller substructures, such as secondary structure elements, assemble into larger domains. By considering subgraphs and their relationships to the global protein structure, the model can learn to reason about these hierarchical levels of organization. We experimentally show that our proposed pretraining strategy leads to significant improvements in the performance of 3D GNNs in various protein classification tasks.
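The pretraining target described in this abstract, the distance between each subgraph's centroid and the global centroid, can be sketched as follows. The coordinates, the subgraph partition, and the function names are hypothetical; the sketch only computes the regression targets and does not include the graph neural network.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of 3D coordinates."""
    n = len(points)
    return tuple(sum(p[axis] for p in points) / n for axis in range(3))


def centroid_distance_targets(coords, subgraphs):
    """Distance from each subgraph's centroid to the whole-structure centroid."""
    global_centroid = centroid(coords)
    return [
        math.dist(centroid([coords[i] for i in nodes]), global_centroid)
        for nodes in subgraphs
    ]


# Hypothetical residue coordinates and two hypothetical subgraphs (node index lists).
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 4.0, 0.0), (2.0, 4.0, 0.0)]
subgraphs = [[0, 1], [2, 3]]
targets = centroid_distance_targets(coords, subgraphs)
```

Each value in `targets` would serve as a self-supervised label for the corresponding subgraph.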
- [8] arXiv:2406.14246 [pdf, html, other]
Title: Non-Negative Universal Differential Equations With Applications in Systems Biology
Comments: 6 pages, This work has been submitted to IFAC for possible publication. Initial submission was March 18, 2024
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML)
Universal differential equations (UDEs) leverage the respective advantages of mechanistic models and artificial neural networks and combine them into one dynamic model. However, these hybrid models can suffer from unrealistic solutions, such as negative values for biochemical quantities. We present non-negative UDE (nUDEs), a constrained UDE variant that guarantees non-negative values. Furthermore, we explore regularisation techniques to improve generalisation and interpretability of UDEs.
New submissions for Friday, 21 June 2024 (showing 8 of 8 entries)
- [9] arXiv:2406.13162 (cross-list from cs.LG) [pdf, html, other]
Title: AntibodyFlow: Normalizing Flow Model for Designing Antibody Complementarity-Determining Regions
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Therapeutic antibodies have been extensively studied in drug discovery and development in the past decades. Antibodies are specialized protective proteins that bind to antigens in a lock-and-key manner. The binding strength/affinity between an antibody and a specific antigen is heavily determined by the complementarity-determining regions (CDRs) on the antibodies. Existing machine learning methods cast in silico development of CDRs as either sequence or 3D graph (with a single chain) generation tasks and have achieved initial success. However, with CDR loops having specific geometric shapes, learning the 3D geometric structures of CDRs remains a challenge. To address this issue, we propose AntibodyFlow, a 3D flow model to design antibody CDR loops. Specifically, AntibodyFlow first constructs the distance matrix, then predicts amino acids conditioned on the distance matrix. Also, AntibodyFlow conducts constraint learning and constrained generation to ensure valid 3D structures. Experimental results indicate that AntibodyFlow consistently outperforms the best baseline, with up to 16.0% relative improvement in validity rate and 24.3% relative reduction in geometric graph level error (root mean square deviation, RMSD).
- [10] arXiv:2406.13284 (cross-list from physics.med-ph) [pdf, other]
Title: The association of domain-specific physical activity and sedentary activity with stroke: A prospective cohort study
Subjects: Medical Physics (physics.med-ph); Quantitative Methods (q-bio.QM)
Background: The incidence of stroke places a heavy burden on both society and individuals. Physical activity (PA) is closely related to cardiovascular health. This study aimed to investigate the relationship of the domains of PA, namely occupation-related Physical Activity (OPA), transportation-related Physical Activity (TPA), and leisure-time Physical Activity (LTPA), and of Sedentary Activity (SA) with stroke. Methods: Our analysis included 30,400 participants aged 20+ years from the 2007 to 2018 National Health and Nutrition Examination Survey (NHANES). Stroke was identified based on the participant's self-reported diagnoses from previous medical consultations, and PA and SA were self-reported. Multivariable logistic and restricted cubic spline models were used to assess the associations. Results: Participants achieving PA guidelines (performing PA more than 150 min/week) were 35.7% less likely to have a stroke based on both the total PA (odds ratio [OR] 0.643, 95% confidence interval [CI] 0.523-0.790) and LTPA (OR 0.643, 95% CI 0.514-0.805), while OPA or TPA did not demonstrate lower stroke risk. Furthermore, participants with less than 7.5 h/day SA levels were 21.6% (OR 0.784, 95% CI 0.665-0.925) less likely to have a stroke. The intensities of total PA and LTPA exhibited nonlinear U-shaped associations with stroke risk. In contrast, those of OPA and TPA showed negative linear associations, while SA intensities were positively linearly correlated with stroke risk. Conclusions: LTPA, but not OPA or TPA, was associated with a lower risk of stroke at any amount, suggesting that cardiovascular health would benefit significantly from increased PA. Additionally, the positive association between SA and stroke indicated that prolonged sitting was detrimental to cardiovascular health. Overall, increased PA within a reasonable range reduces the risk of stroke, while increased SA elevates it.
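The percentages quoted in this abstract follow from the reported odds ratios as (1 - OR) x 100. The short sketch below reproduces that arithmetic; the function name is hypothetical, and strictly speaking the result is a reduction in odds, which approximates a reduction in risk only when the outcome is rare.

```python
def percent_lower_odds(odds_ratio):
    """Percentage reduction in odds implied by an odds ratio below 1."""
    return round((1.0 - odds_ratio) * 100.0, 1)


# Odds ratios as reported in the abstract.
total_pa = percent_lower_odds(0.643)   # total PA and LTPA, meeting PA guidelines
sedentary = percent_lower_odds(0.784)  # SA below 7.5 h/day
```

With the reported values, `total_pa` evaluates to 35.7 and `sedentary` to 21.6, matching the figures in the text.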
- [11] arXiv:2406.13644 (cross-list from math.NA) [pdf, html, other]
Title: Kinetic Monte Carlo methods for three-dimensional diffusive capture problems in exterior domains
Comments: 32 pages, 10 figures
Subjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP); Biological Physics (physics.bio-ph); Quantitative Methods (q-bio.QM)
Cellular scale decision making is modulated by the dynamics of signalling molecules and their diffusive trajectories from a source to small absorbing sites on the cellular surface. Diffusive capture problems are computationally challenging due to the complex geometry and the applied boundary conditions together with intrinsically long transients that occur before a particle is captured. This paper reports on a particle-based Kinetic Monte Carlo (KMC) method that provides rapid accurate simulation of arrival statistics for (i) a half-space bounded by a surface with a finite collection of absorbing traps and (ii) the domain exterior to a convex cell again with absorbing traps. We validate our method by replicating classical results and in addition, newly developed boundary homogenization theories and matched asymptotic expansions on capture rates. In the case of non-spherical domains, we describe a new shielding effect in which geometry can play a role in sharpening cellular estimates on the directionality of diffusive sources.
- [12] arXiv:2406.14021 (cross-list from cs.CL) [pdf, html, other]
Title: HIGHT: Hierarchical Graph Tokenization for Graph-Language Alignment
Comments: Preliminary version of an ongoing project: this https URL
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Recently there has been a surge of interest in extending the success of large language models (LLMs) to the graph modality, such as social networks and molecules. As LLMs are predominantly trained with 1D text data, most existing approaches adopt a graph neural network to represent a graph as a series of node tokens and feed these tokens to LLMs for graph-language alignment. Despite achieving some successes, existing approaches have overlooked the hierarchical structures that are inherent in graph data. In molecular graphs especially, the high-order structural information contains rich semantics of molecular functional groups, which encode crucial biochemical functionalities of the molecules. We establish a simple benchmark showing that neglecting the hierarchical information in graph tokenization will lead to subpar graph-language alignment and severe hallucination in generated outputs. To address this problem, we propose a novel strategy called HIerarchical GrapH Tokenization (HIGHT). HIGHT employs a hierarchical graph tokenizer that extracts and encodes the hierarchy of node, motif, and graph levels of informative tokens to improve the graph perception of LLMs. HIGHT also adopts an augmented graph-language supervised fine-tuning dataset, enriched with the hierarchical graph information, to further enhance the graph-language alignment. Extensive experiments on 7 molecule-centric benchmarks confirm the effectiveness of HIGHT in reducing hallucination by 40%, as well as significant improvements in various molecule-language downstream tasks.
- [13] arXiv:2406.14287 (cross-list from eess.IV) [pdf, html, other]
Title: Segmentation of Non-Small Cell Lung Carcinomas: Introducing DRU-Net and Multi-Lens Distortion
Authors: Soroush Oskouei, Marit Valla, André Pedersen, Erik Smistad, Vibeke Grotnes Dale, Maren Høibø, Sissel Gyrid Freim Wahl, Mats Dehli Haugum, Thomas Langø, Maria Paula Ramnefjell, Lars Andreas Akslen, Gabriel Kiss, Hanne Sorger
Comments: 16 pages, 7 figures, submitted to Scientific Reports
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Considering the increased workload in pathology laboratories today, automated tools such as artificial intelligence models can help pathologists with their tasks and ease the workload. In this paper, we propose a segmentation model (DRU-Net) that can provide a delineation of human non-small cell lung carcinomas and an augmentation method that can improve classification results. The proposed model is a fused combination of truncated pre-trained DenseNet201 and ResNet101V2 as a patch-wise classifier followed by a lightweight U-Net as a refinement model. We have used two datasets (Norwegian Lung Cancer Biobank and Haukeland University Hospital lung cancer cohort) to create our proposed model. The DRU-Net model achieves an average Dice similarity coefficient of 0.91. The proposed spatial augmentation method (multi-lens distortion) improved the network performance by 3%. Our findings show that choosing image patches that specifically include regions of interest leads to better results for the patch-wise classifier compared to other sampling methods. The qualitative analysis showed that the DRU-Net model is generally successful in detecting the tumor. On the test set, some of the cases showed areas of false positive and false negative segmentation in the periphery, particularly in tumors with inflammatory and reactive changes.
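For readers unfamiliar with the metric reported in this abstract, the Dice similarity coefficient between a predicted mask P and a reference mask R is 2|P ∩ R| / (|P| + |R|). The sketch below computes it for two small binary masks; the masks and the function name are hypothetical and unrelated to the DRU-Net data.

```python
def dice_coefficient(predicted, reference):
    """Dice = 2 * |P ∩ R| / (|P| + |R|) for two binary masks of equal shape."""
    pred_flat = [px for row in predicted for px in row]
    ref_flat = [px for row in reference for px in row]
    overlap = sum(1 for p, r in zip(pred_flat, ref_flat) if p and r)
    total = sum(1 for p in pred_flat if p) + sum(1 for r in ref_flat if r)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * overlap / total


# Hypothetical 2x3 masks: 1 marks tumor pixels, 0 marks background.
predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]
score = dice_coefficient(predicted, reference)
```

Here two pixels overlap and each mask has three foreground pixels, so the score is 4/6.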
Cross submissions for Friday, 21 June 2024 (showing 5 of 5 entries)
- [14] arXiv:2312.01646 (replaced) [pdf, html, other]
Title: Enhancing data-limited assessments with random effects: A case study on Korea chub mackerel (Scomber japonicus)
Authors: Kyuhan Kim (1), Nokuthaba Sibanda (2), Richard Arnold (2), Teresa A'mar (1) ((1) Dragonfly Data Science, Wellington, New Zealand, (2) School of Mathematics and Statistics, Victoria University of Wellington, Wellington, New Zealand)
Comments: 78 pages, 21 figures
Subjects: Populations and Evolution (q-bio.PE); Quantitative Methods (q-bio.QM)
In a state-space framework, temporal variations in fishery-dependent processes can be modeled as random effects. This modeling flexibility makes state-space models (SSMs) powerful tools for data-limited assessments. Though SSMs enable the model-based inference of the unobserved processes, their flexibility can lead to overfitting and non-identifiability issues. To address these challenges, we developed a suite of state-space length-based age-structured models and applied them to the Korean chub mackerel (Scomber japonicus) stock. Our research demonstrated that incorporating temporal variations in fishery-dependent processes can rectify model mis-specification but may compromise robustness, which can be diagnosed through a series of model checking processes. To tackle non-identifiability, we used a non-degenerate estimator, implementing a gamma distribution as a penalty for the standard deviation parameters of observation errors. This penalty function enabled the simultaneous estimation of both process and observation error variances with minimal bias, a notably challenging task in SSMs. These results highlight the importance of model checking and the effectiveness of the penalized approach in estimating SSMs. Additionally, we discussed novel assessment outcomes for the mackerel stock.
- [15] arXiv:2312.06824 (replaced) [pdf, html, other]
Title: A picture guide to cancer progression and monotonic accumulation models: evolutionary assumptions, plausible interpretations, and alternative uses
Comments: Abstract 200 words; added details to BML; consistent British spelling. [Previous changes: Iain G. Johnston coauthor; clarified LOD/POM; clarified scenarios by moving some text to new section; comment Schill et al. 2024 selection bias; clarifications and fixed typos; additional annotation in some figures and figure legends. Added URLs and DOIs to references; corrected typos; added URL to software]
Subjects: Populations and Evolution (q-bio.PE); Quantitative Methods (q-bio.QM)
Cancer progression and monotonic accumulation models were developed to discover dependencies in the irreversible acquisition of binary traits from cross-sectional data. They have been used in computational oncology and virology but also in widely different problems such as malaria progression. These methods have been applied to predict future states of the system, identify routes of feature acquisition, and improve patient stratification, and they hold promise for evolutionary-based treatments. New methods continue to be developed.
But these methods have shortcomings, which are yet to be systematically critiqued, regarding key evolutionary assumptions and interpretations. After an overview of the available methods, we focus on why inferences might not be about the processes we intend. Using fitness landscapes, we highlight difficulties that arise from bulk sequencing and reciprocal sign epistasis, from conflating lines of descent, path of the maximum, and mutational profiles, and from ambiguous use of the idea of exclusivity. We examine how the previous concerns change when bulk sequencing is explicitly considered, and underline opportunities for addressing dependencies due to frequency-dependent selection. This review identifies major standing issues, and should encourage the use of these methods in other areas with a better alignment between entities and model assumptions.
- [16] arXiv:2402.11729 (replaced) [pdf, html, other]
Title: Prospector Heads: Generalized Feature Attribution for Large Models & Data
Authors: Gautam Machiraju, Alexander Derry, Arjun Desai, Neel Guha, Amir-Hossein Karimi, James Zou, Russ Altman, Christopher Ré, Parag Mallick
Comments: 30 pages, 16 figures, 8 tables. Accepted to ICML 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Feature attribution, the ability to localize regions of the input data that are relevant for classification, is an important capability for ML models in scientific and biomedical domains. Current methods for feature attribution, which rely on "explaining" the predictions of end-to-end classifiers, suffer from imprecise feature localization and are inadequate for use with small sample sizes and high-dimensional datasets due to computational challenges. We introduce prospector heads, an efficient and interpretable alternative to explanation-based attribution methods that can be applied to any encoder and any data modality. Prospector heads generalize across modalities through experiments on sequences (text), images (pathology), and graphs (protein structures), outperforming baseline attribution methods by up to 26.3 points in mean localization AUPRC. We also demonstrate how prospector heads enable improved interpretation and discovery of class-specific patterns in input data. Through their high performance, flexibility, and generalizability, prospectors provide a framework for improving trust and transparency for ML models in complex domains.
- [17] arXiv:2405.18051 (replaced) [pdf, other]
Title: Predicting Progression Events in Multiple Myeloma from Routine Blood Work
Authors: Maximilian Ferle, Nora Grieb, Markus Kreuz, Uwe Platzbecker, Thomas Neumuth, Kristin Reiche, Alexander Oeser, Maximilian Merz
Comments: 18 pages, 8 figures, 4 tables
Subjects: Applications (stat.AP); Quantitative Methods (q-bio.QM)
The ability to accurately predict disease progression is paramount for optimizing multiple myeloma patient care. This study introduces a hybrid neural network architecture, combining Long Short-Term Memory networks with a Conditional Restricted Boltzmann Machine, to predict future blood work of affected patients from a series of historical laboratory results. We demonstrate that our model can replicate the statistical moments of the time series ($0.95~\pm~0.01~\geq~R^2~\geq~0.83~\pm~0.03$) and forecast future blood work features with high correlation to actual patient data ($0.92\pm0.02~\geq~r~\geq~0.52~\pm~0.09$). Subsequently, a second Long Short-Term Memory network is employed to detect and annotate disease progression events within the forecasted blood work time series. We show that these annotations enable the prediction of progression events with significant reliability (AUROC$~=~0.88~\pm~0.01$), up to 12 months in advance (AUROC($t+12~$mos)$~=0.65~\pm~0.01$). Our system is designed in a modular fashion, featuring separate entities for forecasting and progression event annotation. This structure not only enhances interpretability but also facilitates the integration of additional modules to perform subsequent operations on the generated outputs. Our approach utilizes a minimal set of routine blood work measurements, which avoids the need for expensive or resource-intensive tests and ensures accessibility of the system in clinical routine. This capability allows for individualized risk assessment and informed treatment decisions tailored to a patient's unique disease kinetics. The presented approach contributes to the development of a scalable and cost-effective virtual human twin system for optimized healthcare resource utilization and improved patient outcomes in multiple myeloma care.
- [18] arXiv:2406.06479 (replaced) [pdf, html, other]
Title: Graph-Based Bidirectional Transformer Decision Threshold Adjustment Algorithm for Class-Imbalanced Molecular Data
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Data sets with imbalanced class sizes, often where one class size is much smaller than that of others, occur extremely often in various applications, including those with biological foundations, such as drug discovery and disease diagnosis. Thus, it is extremely important to be able to identify data elements of classes of various sizes, as a failure to detect can result in heavy costs. However, many data classification algorithms do not perform well on imbalanced data sets as they often fail to detect elements belonging to underrepresented classes. In this paper, we propose the BTDT-MBO algorithm, incorporating Merriman-Bence-Osher (MBO) techniques and a bidirectional transformer, as well as distance correlation and decision threshold adjustments, for data classification problems on highly imbalanced molecular data sets, where the sizes of the classes vary greatly. The proposed method not only integrates adjustments in the classification threshold for the MBO algorithm in order to help deal with the class imbalance, but also uses a bidirectional transformer model based on an attention mechanism for self-supervised learning. Additionally, the method implements distance correlation as a weight function for the similarity graph-based framework on which the adjusted MBO algorithm operates. The proposed model is validated using six molecular data sets, and we also provide a thorough comparison to other competing algorithms. The computational experiments show that the proposed method performs better than competing techniques even when the class imbalance ratio is very high.
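The general idea of decision threshold adjustment named in this abstract can be illustrated with a toy example: lowering the cut-off applied to classifier scores recovers more members of the minority class, at the cost of more false positives. The scores, labels, threshold values, and function names below are all hypothetical, and the sketch is not the BTDT-MBO algorithm itself.

```python
def classify(scores, threshold=0.5):
    """Assign the minority class (1) when the score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def recall(predictions, labels):
    """Fraction of true minority-class items that were detected."""
    true_pos = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    positives = sum(labels)
    return true_pos / positives if positives else 0.0


# Hypothetical imbalanced data: 8 majority items (0) and 2 minority items (1).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.45, 0.40, 0.60]

default_recall = recall(classify(scores, threshold=0.5), labels)
adjusted_recall = recall(classify(scores, threshold=0.38), labels)
```

With the default threshold only one of the two minority items is found; the lower threshold finds both but also flags the majority item scored 0.45, which shows the trade-off any adjusted threshold has to balance.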