This short report proposes a sequential regression implementation for the successive projections algorithm (SPA), which is a variable selection technique for multiple linear regression. An example involving the near-infrared determination of protein in wheat is presented for illustration. The resulting model predictions exhibited a correlation coefficient of 0.989 and an RMSEP (root-mean-square error of prediction) value of 0.2% m/m in the range 10.2-16.2% m/m. The proposed implementation provided computational gains of up to five-fold.

Keywords: successive projections algorithm, multivariate calibration, sequential regressions, computational efficiency, near-infrared spectrometry, wheat
Introduction

The successive projections algorithm (SPA) is a variable selection technique designed to minimize multicollinearity problems in multiple linear regression (MLR).1 In several applications concerning UV-Vis,1,2 ICP-OES,3 FT-IR4 and NIR spectrometry,4-8 SPA was found to provide models with good predictive performance. It has also been successfully employed in other fields, such as QSAR (quantitative structure-activity relationships)9 and classification.10,11 A graphical user interface for SPA is freely available at .

SPA comprises three main phases.7,12 Phase 1 consists of projection operations carried out on the matrix of instrumental responses. These projections are used to generate chains of variables with successively more elements. Each element in a chain is selected so as to display the least collinearity with the previous ones. In Phase 2, candidate subsets of variables are extracted from the chains and evaluated according to the predictive performance of the resulting MLR model. This performance can be assessed by using cross-validation or a separate validation set.13 Finally, Phase 3 consists of a variable elimination procedure aimed at improving the parsimony of the model.7,12 Because an MLR model must be built for each subset of variables under consideration, Phase 2 may be considerably more demanding, in computational terms, than Phases 1 and 3. For example, in a problem involving 389 calibration samples, 193 validation samples and 690 variables, Phases 1, 2 and 3 accounted for 1.9, 98.1 and 0.02% of the total time, respectively. These results were obtained by using the setup described in the Experimental section and may be slightly different if anothe...

Soares et al. 761 Vol. 21, No. 4, 2010
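The Phase 1 chain-building step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual formulation of SPA projections, in which each remaining column of the response matrix is projected onto the orthogonal complement of the most recently selected variable, and the column with the largest residual norm (i.e., the least collinear one) is appended to the chain. The function name `spa_chain` and its arguments are illustrative.

```python
import numpy as np

def spa_chain(X, start, n_select):
    """Sketch of SPA Phase 1: build a chain of variable (column) indices.

    At each step, every column of the working matrix is projected onto
    the orthogonal complement of the last selected column; the column
    with the largest projected norm is the least collinear candidate.
    """
    Xp = X.astype(float).copy()
    chain = [start]
    for _ in range(n_select - 1):
        v = Xp[:, chain[-1]]
        # Project all columns onto the orthogonal complement of v:
        # Xp <- Xp - v (v^T Xp) / (v^T v)
        Xp = Xp - np.outer(v, v @ Xp) / (v @ v)
        norms = np.linalg.norm(Xp, axis=0)
        norms[chain] = -1.0  # exclude already-selected variables
        chain.append(int(np.argmax(norms)))
    return chain
```

Note that the projections are applied cumulatively to the working matrix, so each new selection is orthogonalized against all previous ones, which is what keeps collinearity in the chain low.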
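Phase 2, the computational bottleneck discussed above, repeatedly fits an MLR model on a candidate variable subset and scores it on a separate validation set. A hedged sketch of one such evaluation, using an ordinary least-squares fit and the RMSEP figure of merit (the function name `rmsep` and the data names are illustrative, not from the paper):

```python
import numpy as np

def rmsep(X_cal, y_cal, X_val, y_val, subset):
    """Fit an MLR model (with intercept) on the selected variables and
    return the root-mean-square error of prediction on the validation set."""
    # Calibration: solve y_cal ≈ [1 | X_cal[:, subset]] b by least squares
    A = np.column_stack([np.ones(len(y_cal)), X_cal[:, subset]])
    b, *_ = np.linalg.lstsq(A, y_cal, rcond=None)
    # Validation: predict and compute the RMS prediction error
    Av = np.column_stack([np.ones(len(y_val)), X_val[:, subset]])
    resid = y_val - Av @ b
    return float(np.sqrt(np.mean(resid ** 2)))
```

Because Phase 2 calls a routine like this once per candidate subset, and the subsets grow one variable at a time along each chain, reusing intermediate regression quantities between consecutive subsets (the sequential-regression idea proposed in this report) is where the computational gains arise.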