Hunting for chemicals with favorable pharmacological, toxicological, and pharmacokinetic properties remains a formidable challenge for drug discovery. Deep learning provides us with powerful tools to build predictive models that are appropriate for the rising amounts of data, but the gap between what these neural networks learn and what human beings can comprehend is growing. Moreover, this gap may induce distrust and restrict deep learning applications in practice. Here, we introduce a new graph neural network architecture called Attentive FP for molecular representation that uses a graph attention mechanism to learn from relevant drug discovery data sets. We demonstrate that Attentive FP achieves state-of-the-art predictive performances on a variety of data sets and that what it learns is interpretable. The feature visualization for Attentive FP suggests that it automatically learns nonlocal intramolecular interactions from specified tasks, which can help us gain chemical insights directly from data beyond human perception.
Motivation
Identifying compound-protein interaction (CPI) is a crucial task in drug discovery and chemogenomics studies, and proteins without three-dimensional (3D) structure account for a large part of potential biological targets, which requires developing methods using only protein sequence information to predict CPI. However, sequence-based CPI models may face some specific pitfalls, including using inappropriate datasets, hidden ligand bias, and splitting datasets inappropriately, resulting in overestimation of their prediction performance.
Results
To address these issues, we here constructed new datasets specific for CPI prediction, proposed a novel transformer neural network named TransformerCPI, and introduced a more rigorous label reversal experiment to test whether a model learns true interaction features. TransformerCPI achieved much improved performance on the new experiments, and it can be deconvolved to highlight important interacting regions of protein sequences and compound atoms, which may contribute chemical biology studies with useful guidance for further ligand structural optimization.
Supplementary information
Supplementary data are available at Bioinformatics online.
Availability and implementation
https://github.com/lifanchen-simm/transformerCPI
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.