Infusing Linguistic Knowledge of SMILES into Chemical Language Models

Lee, Ingoo; Nam, Hojung

arXiv:2205.00084 (q-bio)

[Submitted on 20 Apr 2022]

Abstract: The simplified molecular-input line-entry system (SMILES) is the most popular representation of chemical compounds. Therefore, many SMILES-based molecular property prediction models have been developed. In particular, transformer-based models show promising performance because they utilize a massive chemical dataset for self-supervised learning. However, no transformer-based model overcomes the inherent limitations of SMILES, which result from the generation process of SMILES. In this study, we grammatically parsed SMILES to obtain the connectivity between substructures and their types, which we call the grammatical knowledge of SMILES. First, we pretrained the transformers with substructural tokens, which were parsed from SMILES. Then, we used the training strategy 'same compound model' to help the model better understand SMILES grammar. In addition, we injected knowledge of connectivity and type into the transformer with knowledge adapters. As a result, our representation model outperformed previous compound representations for the prediction of molecular properties. Finally, we analyzed the attention of the transformer model and adapters, demonstrating that the proposed model understands the grammar of SMILES.
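
To make the idea of "connectivity and type" concrete, the following is a minimal illustrative sketch and not the authors' parser. It splits a SMILES string into atom-level tokens (the paper works with substructural tokens, which are not reproduced here), labels each token with a coarse type, and recovers which atom tokens are bonded by tracking branches and ring closures. The names `SMILES_TOKEN`, `token_type`, and `parse` are hypothetical, and the regular expression covers only a common subset of SMILES (no disconnected fragments, no wildcard atoms).

```python
import re

# Assumed, simplified token grammar: bracket atoms, two-letter halogens,
# organic-subset and aromatic atoms, branches, bond symbols, ring labels.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|[()]|[-=#:/\\]|%\d{2}|\d"
)
BOND_CHARS = set("-=#:/\\")


def token_type(tok):
    """Return a coarse grammatical type for a single SMILES token."""
    if tok in ("(", ")"):
        return "branch"
    if tok[0].isdigit() or tok[0] == "%":
        return "ring"
    if tok in BOND_CHARS:
        return "bond"
    return "atom"


def parse(smiles):
    """Tokenize `smiles` and return (tokens, types, edges).

    `edges` holds pairs of token indices whose atom tokens are bonded.
    """
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in SMILES: {smiles!r}")
    types = [token_type(t) for t in tokens]

    edges = []
    prev = None        # most recent atom token on the current chain
    stack = []         # atoms at which a branch was opened
    open_rings = {}    # ring label -> atom token that opened it
    for i, (tok, typ) in enumerate(zip(tokens, types)):
        if typ == "atom":
            if prev is not None:
                edges.append((prev, i))
            prev = i
        elif tok == "(":
            stack.append(prev)
        elif tok == ")":
            prev = stack.pop()   # return to the atom where the branch began
        elif typ == "ring":
            if tok in open_rings:
                edges.append((open_rings.pop(tok), prev))
            else:
                open_rings[tok] = prev
    return tokens, types, edges


tokens, types, edges = parse("CC(=O)O")  # acetic acid
print(tokens)  # ['C', 'C', '(', '=', 'O', ')', 'O']
print(edges)   # [(0, 1), (1, 4), (1, 6)]
```

In this example the final `O` (token 6) is adjacent in the string to `)` but bonded to the second carbon (token 1), which is the kind of mismatch between string order and molecular connectivity that the abstract attributes to the SMILES generation process.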

Comments:	8 pages, 4 figures
Subjects:	Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2205.00084 [q-bio.QM]
	(or arXiv:2205.00084v1 [q-bio.QM] for this version)
	https://doi.org/10.48550/arXiv.2205.00084

Submission history

From: Ingoo Lee
[v1] Wed, 20 Apr 2022 01:25:18 UTC (936 KB)
