With advances in
machine learning (ML) techniques, the quantitative
structure–activity relationship (QSAR) approach is becoming
popular for evaluating chemicals. However, the QSAR approach requires
that the chemical structure of the target compound be known and
convertible to molecular descriptors. These requirements
limit the prediction of the properties and toxicities of
chemicals distributed in the environment; in the PubChem database, for
example, structural information is available for only 14% of compounds.
This study proposes a new ML-based QSAR approach that can predict
the properties and toxicities of compounds using analytical descriptors
of mass spectrum and retention index obtained via gas chromatography–mass
spectrometry without requiring exact structural information. The model
was developed based on the XGBoost ML method. The root-mean-square
errors (RMSEs) for log Ko-w, log (molecular weight), melting point,
boiling point, log (vapor pressure), log (water solubility), log (LD50)
(rat, oral), and log (LD50) (mouse, oral) are
0.97, 0.052, 51, 23, 0.74, 1.1, 0.74, and 0.6, respectively. The model
performed well on a measurement of a chemical standard mixture, with
results similar to those of model validation. It also performed well on a
measurement of contaminated oil with spectral deconvolution. These
results indicate that the model is suitable for investigating chemicals
of unknown structure detected in measurements. The model is available to
any online user through a web application named Detective-QSAR.
The analytical descriptor-based approach is expected to create
new opportunities for the evaluation of unknown chemicals around us.
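
To make the idea of analytical descriptors concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a mass spectrum is summarized as unit-m/z bins scaled to the base peak and that the retention index is appended as the final feature; the function names, the m/z range of 50 to 500, and the example peaks are all hypothetical. A vector of this kind could serve as input to a gradient-boosted regressor such as XGBoost, and the `rmse` helper shows how a root-mean-square error of the kind reported above is computed.

```python
import math


def build_descriptor(peaks, retention_index, mz_min=50, mz_max=500):
    """Turn one GC-MS measurement into a fixed-length feature vector.

    peaks: iterable of (m/z, intensity) pairs from a single mass spectrum.
    The m/z range and the unit-width binning are illustrative assumptions,
    not settings taken from the study.
    """
    n_bins = mz_max - mz_min + 1
    bins = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(round(mz)) - mz_min  # nearest integer m/z bin
        if 0 <= idx < n_bins:          # peaks outside the range are ignored
            bins[idx] += intensity
    base_peak = max(bins)
    if base_peak > 0:
        bins = [value / base_peak for value in bins]  # scale to base peak
    # The retention index is appended as the last feature.
    return bins + [float(retention_index)]


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical spectrum with three peaks and a retention index of 1200.
example_peaks = [(57.1, 200.0), (71.0, 120.0), (85.1, 60.0)]
example_vector = build_descriptor(example_peaks, retention_index=1200)
```

When reading the log-scale errors, note that if the logarithms are base 10 (an assumption; the text does not state the base), an RMSE of 0.74 in log (LD50) corresponds to a typical multiplicative error of about 10^0.74, or roughly a factor of 5.5.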