Completing Partial Reaction Equations with Rule and Language Model-based Methods

Abstract

Large chemical reaction data sets are often incomplete: molecules or stoichiometric information may be missing. Incomplete reaction equations currently prevent us from performing automated mass balances across large sets of chemical reactions. In this work, we integrate two approaches for the computational completion of partial reaction equations. Specifically, we combine a rule-based method with a machine learning model, a tailored version of the pre-trained Molecular Transformer. The rule-based method passes sets of helper species to a linear solver and thereby balances some of the incomplete reactions. The machine learning model is trained to take partial reactions as input and to predict the missing molecules and stoichiometries. We apply our methodology to the USPTO STEREO chemical reaction data set. The rule-based method completes about 50 % of the reactions. The language model achieves a top-1 accuracy of 88.3 % on our test set and high validity (> 99 % of outputs are valid SMILES).
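The idea behind the rule-based step can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: species are given as hand-written element-count dictionaries rather than parsed from SMILES, every species already present has a coefficient of 1, and the helper set (here water and hydrogen chloride) and all function names are hypothetical. The sketch computes the element imbalance of a partial reaction and solves an exact linear system for the helper coefficients. A positive coefficient places the helper on the product side, a negative one on the reactant side.

```python
from fractions import Fraction


def element_totals(species):
    """Sum element counts over a list of composition dicts."""
    totals = {}
    for comp in species:
        for element, count in comp.items():
            totals[element] = totals.get(element, 0) + count
    return totals


def solve_helpers(deficit, helpers):
    """Solve sum_j x_j * helpers[j] == deficit for every element.

    Returns {helper_name: integer coefficient} with zero entries dropped,
    or None if the system is inconsistent, ambiguous, or non-integer.
    """
    names = list(helpers)
    elements = sorted(set(deficit) | {e for c in helpers.values() for e in c})
    # Augmented matrix: one row per element, one column per helper, plus RHS.
    rows = [
        [Fraction(helpers[n].get(e, 0)) for n in names] + [Fraction(deficit.get(e, 0))]
        for e in elements
    ]
    pivot_row, pivot_cols = 0, []
    for col in range(len(names)):
        found = next((r for r in range(pivot_row, len(rows)) if rows[r][col] != 0), None)
        if found is None:
            continue
        rows[pivot_row], rows[found] = rows[found], rows[pivot_row]
        pivot = rows[pivot_row][col]
        rows[pivot_row] = [v / pivot for v in rows[pivot_row]]
        for r in range(len(rows)):
            if r != pivot_row and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[pivot_row])]
        pivot_cols.append(col)
        pivot_row += 1
    # A zero row with a nonzero right-hand side means no helper combination fits.
    if any(all(v == 0 for v in row[:-1]) and row[-1] != 0 for row in rows):
        return None
    # Fewer pivots than helpers means the completion is not unique.
    if len(pivot_cols) < len(names):
        return None
    solution = {names[c]: rows[i][-1] for i, c in enumerate(pivot_cols)}
    if any(v.denominator != 1 for v in solution.values()):
        return None
    return {name: int(v) for name, v in solution.items() if v != 0}


def complete(reactants, products, helpers):
    """Return helper coefficients that balance reactants -> products."""
    lhs, rhs = element_totals(reactants), element_totals(products)
    deficit = {e: lhs.get(e, 0) - rhs.get(e, 0) for e in set(lhs) | set(rhs)}
    return solve_helpers(deficit, helpers)


# Hypothetical helper set, chosen only for illustration.
HELPERS = {"H2O": {"H": 2, "O": 1}, "HCl": {"H": 1, "Cl": 1}}

acetic_acid = {"C": 2, "H": 4, "O": 2}
ethanol = {"C": 2, "H": 6, "O": 1}
ethyl_acetate = {"C": 4, "H": 8, "O": 2}

# Esterification recorded without its water by-product.
print(complete([acetic_acid, ethanol], [ethyl_acetate], HELPERS))  # {'H2O': 1}
```

Exact rational arithmetic avoids the rounding decisions a floating-point solver would require before accepting a coefficient as an integer. Returning None for underdetermined systems is a deliberate simplification: a reaction that no combination of helpers can balance uniquely is left for another method.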

Files

1-s2.0-B978044328824150524X-ma... (pdf)
(pdf | 0.42 Mb)

File under embargo until 31-12-2024