Completing Partial Reaction Equations with Rule and Language Model-based Methods

Journal article (2024)

Authors

Matthijs van Wijngaarden (Student)

G. Vogel (Pattern Recognition and Bioinformatics)

J.M. Weber (Pattern Recognition and Bioinformatics)

Research Group

Pattern Recognition and Bioinformatics (TU Delft)

Keywords

Chemical reaction completion, Language models, Molecular transformer, Reaction SMILES, Rule-based methods

To reference this document use:

http://resolver.tudelft.nl/uuid:5b70ada9-60f6-4a06-bc60-e4b7a50375a6

Published Date

2024

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Faculty

Electrical Engineering, Mathematics and Computer Science

Department

Intelligent Systems

Abstract

Large chemical reaction data sets are often incomplete, with missing molecules or stoichiometric information. Incomplete reaction equations currently prevent automated mass balancing across large sets of chemical reactions. In this work, we integrate two approaches for the computational completion of partial reaction equations: a rule-based method and a machine learning model, a tailored version of the pre-trained Molecular Transformer. The rule-based method passes sets of helper species to a linear solver and thereby balances some of the incomplete reactions. The machine learning model is trained to take partial reactions as input and to predict the missing molecules and stoichiometries. We apply our methodology to the USPTO STEREO chemical reaction data set. The rule-based method completes about 50% of the reactions. The language model achieves a top-1 accuracy of 88.3% on our test set and high validity (> 99% of outputs are valid SMILES).
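The idea of balancing a partial reaction with helper species and a linear solver can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: molecules are given as element-count dictionaries rather than reaction SMILES, every known species has a stoichiometric coefficient of 1, the helper set (water and hydrogen chloride) is a hypothetical example, and the function names `solve_exact` and `complete_with_helpers` are invented for illustration.

```python
from fractions import Fraction


def solve_exact(rows, rhs):
    """Solve rows @ x = rhs exactly by Gauss-Jordan elimination.

    Returns the unique solution as a list of Fractions, or None if the
    system is inconsistent or under-determined.
    """
    n = len(rows[0])
    aug = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(rows, rhs)]
    pivot_row = 0
    for col in range(n):
        # Find a row at or below pivot_row with a non-zero entry in this column.
        found = next((r for r in range(pivot_row, len(aug)) if aug[r][col] != 0), None)
        if found is None:
            return None  # free variable: no unique solution
        aug[pivot_row], aug[found] = aug[found], aug[pivot_row]
        pivot = aug[pivot_row][col]
        aug[pivot_row] = [v / pivot for v in aug[pivot_row]]
        for r in range(len(aug)):
            if r != pivot_row and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[pivot_row])]
        pivot_row += 1
    # Remaining rows have all-zero coefficients; a non-zero right-hand side
    # means the element balance cannot be satisfied.
    if any(aug[r][n] != 0 for r in range(pivot_row, len(aug))):
        return None
    return [aug[i][n] for i in range(n)]


def complete_with_helpers(reactants, products, helpers):
    """Return {helper: coefficient} that balances the reaction, or None.

    A positive coefficient means the helper is added to the product side,
    a negative one means it is added to the reactant side.
    """
    species = [*reactants, *products, *helpers.values()]
    elements = sorted({e for s in species for e in s})
    # Per-element surplus of atoms on the reactant side.
    deficit = [
        sum(s.get(e, 0) for s in reactants) - sum(s.get(e, 0) for s in products)
        for e in elements
    ]
    names = list(helpers)
    rows = [[helpers[name].get(e, 0) for name in names] for e in elements]
    solution = solve_exact(rows, deficit)
    if solution is None or any(c.denominator != 1 for c in solution):
        return None  # assumption: only integer helper coefficients are accepted
    return {name: int(c) for name, c in zip(names, solution) if c != 0}


# Hypothetical helper set and an esterification with the water by-product missing.
HELPERS = {"H2O": {"H": 2, "O": 1}, "HCl": {"H": 1, "Cl": 1}}
acetic_acid = {"C": 2, "H": 4, "O": 2}
ethanol = {"C": 2, "H": 6, "O": 1}
ethyl_acetate = {"C": 4, "H": 8, "O": 2}

print(complete_with_helpers([acetic_acid, ethanol], [ethyl_acetate], HELPERS))
```

For the esterification above, the element balance leaves two hydrogen atoms and one oxygen atom unaccounted for on the product side, so the sketch adds one water molecule as a product. A reaction whose imbalance cannot be expressed with the chosen helpers (for example, missing carbon atoms) yields `None`, which is consistent with the abstract's statement that the rule-based method balances only some of the incomplete reactions.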

Files

1-s2.0-B978044328824150524X-ma... (pdf | 0.42 MB)

File under embargo until 31-12-2024