The Simplified Molecular Input Line Entry System (SMILES) represents
molecules at the atomic level and is neither easy for humans to read
nor easy to edit. In contrast, IUPAC nomenclature is the closest to
natural language and is well suited to human reading and molecular
editing. We can therefore manipulate IUPAC names to generate new
molecules and then convert them into the programming-friendly SMILES
form. In addition, antiviral
drug design, especially analogue-based drug design, is better served
by editing directly at the functional-group level of IUPAC names
than at the atomic level of SMILES, since designing analogues involves
altering only the R group, which is closer to the knowledge-based
molecular design of a chemist. Herein, we present a novel data-driven
self-supervised pretrained generative model called “TransAntivirus”
that makes select-and-replace edits to convert organic molecules into
antiviral candidate analogues with the desired properties.
The results indicated that TransAntivirus is significantly superior
to the control models in terms of novelty, validity, uniqueness, and
diversity. TransAntivirus showed excellent performance in the design
and optimization of nucleoside and non-nucleoside analogues, as assessed
by chemical space analysis and property prediction analysis. Furthermore,
to validate
the applicability of TransAntivirus in the design of antiviral drugs,
we conducted two case studies on the design of nucleoside analogues
and non-nucleoside analogues and screened four candidate lead compounds
against coronavirus disease (COVID-19). Finally, we recommend
this framework for accelerating antiviral drug discovery.