Chemical reaction completion: a hybrid rule-based and language model-based approach

Abstract

Large chemical reaction databases are often incomplete: entries may lack molecules or stoichiometric information. At the same time, many computational models in predictive chemistry rely on reaction databases and would benefit greatly from complete reaction equations. Research in sustainable chemistry likewise often focuses on automated mass-balance tasks, which require a full reaction equation to evaluate properly. In this work, we present a hybrid approach to the computational completion of reaction equations that combines a rule-based method with a machine learning (ML) model. The rule-based method balances atoms and charge across the two sides of the reaction in an attempt to identify missing molecules. We tailor a pre-trained transformer model for the chemical language domain to take partial reactions as input and predict the missing molecules. Furthermore, we present a novel approach to measuring the correctness of our model's predictions, which is useful when the model is applied to an uncurated dataset for which the ground truth is unknown.
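
To make the atom and charge balance idea concrete, the following is a minimal illustrative sketch in Python. It is not the implementation described in this work. It assumes molecules are supplied as flat molecular formulas (no parentheses or isotopes) paired with an integer charge, and the function names `parse_formula`, `side_totals`, and `balance_gap` are hypothetical. It only reports which atoms and how much charge are unaccounted for; proposing actual candidate molecules from that gap would be a further step.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a flat formula such as 'C2H6O' (assumption: no parentheses)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_totals(molecules):
    """Sum atom counts and net charge over (formula, charge) pairs."""
    atoms, charge = Counter(), 0
    for formula, q in molecules:
        atoms.update(parse_formula(formula))
        charge += q
    return atoms, charge


def balance_gap(reactants, products):
    """Return (atom gap, charge gap), each computed as reactants minus products.

    Positive atom counts are missing from the product side,
    negative counts are missing from the reactant side.
    """
    r_atoms, r_charge = side_totals(reactants)
    p_atoms, p_charge = side_totals(products)
    elements = set(r_atoms) | set(p_atoms)
    gap = {el: r_atoms[el] - p_atoms[el] for el in elements}
    gap = {el: n for el, n in gap.items() if n != 0}
    return gap, r_charge - p_charge


# Esterification recorded without its by-product:
# acetic acid + ethanol -> ethyl acetate
gap, charge_gap = balance_gap(
    reactants=[("C2H4O2", 0), ("C2H6O", 0)],
    products=[("C4H8O2", 0)],
)
print(gap, charge_gap)
```

In this example the gap is two hydrogen atoms and one oxygen atom with zero net charge, which is consistent with a missing water molecule on the product side.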