Predictive modeling (calibration or training) with various data formats, such as near-infrared (NIR) spectra and quantitative structure-activity relationship (QSAR) data, provides essential information if a proper model is selected. Similarly, with a general model selection approach, spectral model maintenance (updating) from original modeling conditions to new conditions can be performed for dynamic modeling. Fundamental modeling (partial least-squares (PLS) and others) and maintenance processes (domain adaptation or transfer learning and others) require selection of tuning parameter values to isolate models that can accurately predict new samples or molecules, e.g., the number of PLS latent variables to predict analyte concentration. Regardless of the modeling task, model selection is complex and lacks a reliable protocol. Tuning parameter selection typically depends on only one model quality measure assessing model bias using prediction accuracy. Developed in this paper is a generic model selection process using concepts from consensus modeling and QSAR activity landscapes. The process, termed MDPS, is a consensus filtering approach that prioritizes model diversity (MD) while conserving prediction similarity (PS), fused with a common bias-variance trade-off measure. A significant feature of MDPS is that a cross-validation scheme is not needed because models are selected relative to predicting new samples or molecules, i.e., model selection uses unlabeled samples (without reference values) for active predictions. The versatility and reliability of MDPS model selection are shown using four NIR data sets and a QSAR data set. The study also substantiates the Rashomon effect, where there is not one best model tuning parameter value that provides accurate predictions.
Updating
a calibration model formed in original (primary)
sample and spectral measurement conditions to predict analyte values
in novel (secondary) conditions is an essential activity
in analytical chemistry in order to avoid a complete recalibration.
Established model updating methods require sample analyte reference
values for a small set of secondary domain samples (labeled data)
to be used in updating processes. Because obtaining reference values
is time-consuming and the costly part of any calibration, methods
are needed that do not require labeled secondary samples, thereby
allowing on-demand model updating. This paper compares model updating
methods with and without labeled secondary samples. A hybrid model
updating approach is also developed and evaluated. Unfortunately,
a major impediment to adapting a model without secondary analyte reference
values has been model selection. Because multiple tuning parameters
are commonly involved in model updating methods, thousands of models
are formed, making model selection complex. A recently developed framework
is evaluated for automatic model selection across several model updating
methods, each with two to three tuning parameters, without secondary analyte
reference values (labels). The model selection method is based on
model diversity and prediction similarity (MDPS) of the unlabeled
samples to be predicted. The new secondary samples to be predicted
can be used to form the updated models and again to select the final
predicting models. Because models are formed and selected on demand
to directly predict target samples, complicated cross-validation processes
are not needed. Four near-infrared data sets covering 40 model updating
situations are evaluated, showing that MDPS can select reliable updated
models with prediction errors that rival or improve on those from total
recalibrations with secondary reference values.
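The idea of selecting models by model diversity and prediction similarity on unlabeled samples can be illustrated with a minimal sketch. It is not the MDPS procedure itself, whose details are not given here. It assumes ridge regression as a stand-in for the modeling and updating methods, cosine similarity between regression vectors as the diversity measure, correlation between predictions of the unlabeled samples as the similarity measure, and a simple median filter followed by averaging. The bias-variance trade-off measure is omitted, and all function names and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_models(X, y, lambdas):
    """Return an (n_features, n_models) matrix: one ridge regression vector per penalty."""
    p = X.shape[1]
    XtX, Xty = X.T @ X, X.T @ y
    return np.column_stack(
        [np.linalg.solve(XtX + lam * np.eye(p), Xty) for lam in lambdas]
    )

def pairwise_cosine(M):
    """Cosine similarity between every pair of columns of M."""
    U = M / np.linalg.norm(M, axis=0)
    return U.T @ U

def mean_offdiag(S):
    """Row-wise mean of a square matrix, ignoring the diagonal."""
    k = S.shape[0]
    return (S.sum(axis=1) - np.diag(S)) / (k - 1)

def score_models(B, X_unlabeled):
    """Per model: mean prediction similarity and mean model diversity vs. the other models."""
    preds = X_unlabeled @ B                      # (n_unlabeled, n_models)
    centered = preds - preds.mean(axis=0)        # cosine of centered columns = correlation
    pred_sim = mean_offdiag(pairwise_cosine(centered))
    model_div = mean_offdiag(1.0 - pairwise_cosine(B))
    return pred_sim, model_div

# Synthetic stand-in data (assumption): labeled primary samples, unlabeled secondary samples.
rng = np.random.default_rng(0)
X_cal = rng.normal(size=(40, 15))
y_cal = X_cal @ rng.normal(size=15) + 0.1 * rng.normal(size=40)
X_new = rng.normal(size=(25, 15)) + 0.2          # shifted measurement conditions, no labels

lambdas = np.logspace(-3, 3, 13)                 # one tuning parameter, 13 candidate models
B = ridge_models(X_cal, y_cal, lambdas)
pred_sim, model_div = score_models(B, X_new)

# Illustrative filter: keep models whose predictions agree with the rest (at or above
# the median similarity), then order the survivors from most to least diverse.
keep = np.flatnonzero(pred_sim >= np.median(pred_sim))
selected = keep[np.argsort(model_div[keep])[::-1]]
consensus_prediction = (X_new @ B[:, selected]).mean(axis=1)
```

No reference values for `X_new` are used at any point, which is why a scheme of this kind needs no cross-validation. The median threshold and the averaging step are arbitrary choices made for the sketch.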