Large amounts of high-quality
data on the biological activity of chemical
compounds are required throughout the drug discovery process,
from the development of computational models of structure–activity
relationships to the experimental testing of lead compounds and their
validation in the clinic. Currently, a large amount of such data is available from
databases, scientific publications, and patents. Biological data are
characterized by incompleteness, uncertainty, and low reproducibility.
Although free and commercial databases of the biological activities
of compounds exist, they usually lack unambiguous information about
the specific features of the biological assays. Scientific papers, on the
other hand, are the primary source in which new data are first disclosed
to the scientific community. In this study, we
have developed and validated a data-mining approach for extracting
text fragments that contain descriptions of bioassays. We have used
this approach to evaluate compounds and their biological activity
reported in scientific publications. We have found that papers can be
categorized as relevant or irrelevant on the basis of
machine-learning analysis of their abstracts. Text fragments extracted
from the full texts of publications allow further partitioning
into several classes according to the specific features of the bioassays.
We demonstrate the applicability of our approach to the comparison
of the endpoint values of biological activity and cytotoxicity of
reference compounds.
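
As an illustration of the abstract-based categorization step, the following minimal sketch classifies abstracts as relevant or irrelevant with a multinomial naive Bayes model. The choice of classifier, the function names (tokenize, train, classify), and the toy training abstracts are assumptions made for this example only; they are not the method or the data of this study.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_abstracts):
    """Collect per-class token and document counts from (text, label) pairs."""
    word_counts = {}        # label -> Counter of token frequencies
    doc_counts = Counter()  # label -> number of training abstracts
    for text, label in labeled_abstracts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(model, text):
    """Return the label with the highest posterior log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    # Tokens never seen in training carry no information and are skipped.
    tokens = [t for t in tokenize(text) if t in vocabulary]
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        score = math.log(doc_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy abstracts, invented purely for illustration.
TRAINING = [
    ("The compound inhibited the enzyme with an IC50 of 5 nM in a cell-based assay", "relevant"),
    ("Cytotoxicity was measured in MT-4 cells and EC50 values are reported", "relevant"),
    ("We review the history of drug regulation and policy in Europe", "irrelevant"),
    ("A survey of patient attitudes toward clinical trial enrollment", "irrelevant"),
]

model = train(TRAINING)
prediction = classify(model, "IC50 values were determined in an enzyme inhibition assay")
```

In practice, a model of this kind would need a much larger labeled set of abstracts and a held-out validation set before its relevant/irrelevant assignments could be trusted.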