A novel method called sampling error profile analysis (SEPA), based on Monte Carlo sampling and error profile analysis, is proposed for outlier detection, cross validation, pretreatment method and wavelength selection, and model evaluation in multivariate calibration. With the Monte Carlo sampling in SEPA, a number of submodels are built, and the subsequent error profile analysis yields the median and standard deviation of the root-mean-square error (RMSE) across the submodels. The median coupled with the standard deviation gives an estimate of the RMSE that is more predictive and robust because it draws on representative submodels produced by Monte Carlo sampling, unlike the conventional approach, which uses only a single model. The error profile analysis also calculates skewness and kurtosis as auxiliary criteria for judging the estimated RMSE, which is useful for model optimization and model evaluation. The proposed method is evaluated with 3 near-infrared datasets for wheat, corn, and tobacco. The results show that SEPA can diagnose outliers with more parameters, select more reasonable pretreatment methods and wavelength points, and evaluate the model more accurately and precisely. Compared with the results reported in published papers, a better model could be obtained with SEPA in terms of RMSECV, RMSEC, and RMSEP estimated with an independent prediction set.

KEYWORDS
model evaluation, Monte Carlo sampling, multivariate calibration, near-infrared, sampling error profile analysis (SEPA)
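The core idea summarized above (repeated Monte Carlo sampling, one submodel per draw, then summary statistics of the resulting RMSE values) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `error_profile`, the number of submodels, the 80/20 split, the synthetic data, and the use of ordinary least squares in place of a latent-variable model such as PLS are all assumptions made for illustration.

```python
import numpy as np


def error_profile(X, y, n_submodels=200, train_fraction=0.8, seed=0):
    """Monte Carlo sampling followed by summary statistics of submodel RMSEs.

    Hypothetical helper: each submodel is an ordinary least-squares fit
    (an assumed stand-in for the calibration model used in practice).
    """
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    n_train = int(round(train_fraction * n_samples))
    rmses = np.empty(n_submodels)

    for i in range(n_submodels):
        # Random split of the samples into calibration and validation parts.
        perm = rng.permutation(n_samples)
        train, test = perm[:n_train], perm[n_train:]

        # Fit the submodel (with intercept) on the calibration part.
        A_train = np.column_stack([np.ones(train.size), X[train]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)

        # RMSE of the submodel on the held-out part.
        A_test = np.column_stack([np.ones(test.size), X[test]])
        residuals = y[test] - A_test @ coef
        rmses[i] = np.sqrt(np.mean(residuals ** 2))

    # Error profile: location, spread, and shape of the RMSE distribution.
    centered = rmses - rmses.mean()
    sd_pop = np.sqrt(np.mean(centered ** 2))
    return {
        "median": float(np.median(rmses)),
        "std": float(np.std(rmses, ddof=1)),
        "skewness": float(np.mean(centered ** 3) / sd_pop ** 3),
        "kurtosis": float(np.mean(centered ** 4) / sd_pop ** 4 - 3.0),  # excess
    }


# Synthetic data (assumed): 60 samples, 5 predictors, small noise.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + data_rng.normal(scale=0.1, size=60)

profile = error_profile(X, y)
print(profile)
```

In this sketch, a small standard deviation together with skewness and excess kurtosis near zero would suggest that the median RMSE is a stable summary of the submodels, whereas a heavy right tail could hint at influential or outlying samples.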
INTRODUCTION

Multivariate calibration is an important chemometric technique and an effective tool for mining the intrinsic quantitative relations between spectra and the properties of samples of interest. In common research and practical applications, where the aim of calibration is to construct a robust and precise model, near-infrared (NIR) spectroscopy has gained increasing interest [1-4] in quantitative and qualitative spectroscopic analyses. In NIR spectroscopic analysis, model optimization (such as selecting the number of latent variables (LVs) and selecting the spectral pretreatment method and wavelengths) and model evaluation are significant concerns.

Cross validation (CV) is a commonly used method for selecting the number of LVs (nLVs) and can be used for model optimization. In a typical CV, the calibration and CV sets must cross over in successive rounds such that each sample has a chance of being validated against [5]. Leave-one-out (LOO) CV is the simplest and most commonly used method, but it often causes overfitting and underestimation of the true predictive error [6-8]. K-fold CV was then proposed to resolve such problems [9,10]; in this process, samples are stratified prior to being split into K folds. Stratification is the process of rearranging the data so as to ensure that each fold is a good representative of the whole. The CV yields a PRESS versus nLVs curve, where PRESS is the predicted residual sum of squares. With the curve, not only can the nLVs be selected but the optimized pretreatment method and wavelength selection can also be carried o...