Machine learning methods have revolutionized modern science, providing
fast and accurate solutions to many problems. However, they are
commonly treated as “black boxes”. As a result, in important
scientific fields such as medicinal chemistry and drug discovery,
machine learning methods are restricted almost exclusively to
making predictions on large and heterogeneous data sets
of chemicals. This lack of interpretability prevents the full exploitation
of machine learning models as generators of new chemical knowledge.
This work focuses on the development of an ensemble learning model
for the prediction and design of potent dual heat shock protein 90
(Hsp90) inhibitors. The model achieves an accuracy above 80% on
both the training and test sets. To use the ensemble model as a generator
of new chemical knowledge, three steps were followed. First, a physicochemical
and/or structural interpretation was provided for each molecular descriptor
present in the ensemble learning model. Second, the concept of a “pseudolinear
equation” was introduced within the context of machine learning
to calculate the relative quantitative contributions of different
molecular fragments to the inhibitory activity against the two Hsp90
isoforms studied here. Finally, by assembling the fragments with positive
contributions, new molecules were designed and predicted to be potent
Hsp90 inhibitors. According to Lipinski’s rule of five, the
designed molecules were found to exhibit potentially good oral bioavailability,
an essential property that chemicals must have to pass the early stages
of drug discovery. The present approach, based on the combination of
ensemble learning and fragment-based topological design, holds great
promise in drug discovery, and it can be adapted and applied to many
different scientific disciplines.
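
The Lipinski rule-of-five screen mentioned above can be illustrated with a minimal sketch. It assumes that the four properties have already been computed with a cheminformatics toolkit. The class name, function names, and example values are hypothetical and are not taken from this work. The thresholds are the standard rule-of-five limits, and tolerating at most one violation is a common convention assumed here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MolecularProperties:
    """Precomputed properties of one molecule (hypothetical container)."""
    molecular_weight: float  # in daltons
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(props: MolecularProperties) -> List[str]:
    """Return the rule-of-five criteria that the molecule violates."""
    violations = []
    if props.molecular_weight > 500:
        violations.append("molecular weight > 500 Da")
    if props.log_p > 5:
        violations.append("logP > 5")
    if props.h_bond_donors > 5:
        violations.append("more than 5 hydrogen-bond donors")
    if props.h_bond_acceptors > 10:
        violations.append("more than 10 hydrogen-bond acceptors")
    return violations


def passes_rule_of_five(props: MolecularProperties,
                        max_violations: int = 1) -> bool:
    """Assumed convention: at most one violation is tolerated."""
    return len(lipinski_violations(props)) <= max_violations


if __name__ == "__main__":
    # Illustrative values only; not a molecule from this work.
    candidate = MolecularProperties(
        molecular_weight=412.5, log_p=3.4,
        h_bond_donors=2, h_bond_acceptors=6,
    )
    print(lipinski_violations(candidate))
    print(passes_rule_of_five(candidate))
```

Returning the list of violated criteria, rather than only a pass/fail flag, keeps the screen interpretable, in line with the emphasis on interpretability above.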