The pls package implements principal component regression (PCR) and partial least squares regression (PLSR) in R (R Development Core Team 2006b), and is freely available from the Comprehensive R Archive Network (CRAN), licensed under the GNU General Public License (GPL). The user interface is modelled after the traditional formula interface, as exemplified by lm. This was done so that people used to R would not have to learn yet another interface, and also because we believe the formula interface is a good way of working interactively with models. The package thus has methods for generic functions like predict, update and coef. It also has more specialised functions like scores, loadings and RMSEP, and a flexible cross-validation system. Visual inspection and assessment are important in chemometrics, and the pls package has a number of plot functions for plotting scores, loadings, predictions, coefficients and RMSEP estimates. The package implements PCR and several algorithms for PLSR. The design is modular, so that it should be easy to use the underlying algorithms in other functions. It is our hope that the package will serve well both for interactive data analysis and as a building block for other functions or packages using PLSR or PCR. Here we describe the package and how it is used for data analysis, as well as how it can be used as part of other packages. Also included is a section on formulas and data frames, for readers not used to the R modelling idioms.
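For concreteness, one of the PLSR variants such a package can implement, orthogonal-scores PLS1 for a single response, can be sketched in a few lines of NumPy. This is an illustrative re-implementation (function name and details are our own), not the pls package's internal code:

```python
import numpy as np

def pls1_nipals(X, y, ncomp):
    """Orthogonal-scores PLS1 for a single response (NIPALS-style).

    Returns regression coefficients for the centred data. Illustrative
    sketch; not the pls package's own implementation.
    """
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n, p = X.shape
    W = np.zeros((p, ncomp))   # weight vectors
    P = np.zeros((p, ncomp))   # X loadings
    q = np.zeros(ncomp)        # y loadings
    for a in range(ncomp):
        w = X.T @ y
        w /= np.linalg.norm(w)     # unit-length weight vector
        t = X @ w                  # score vector
        tt = t @ t
        p_a = X.T @ t / tt
        q_a = y @ t / tt
        X = X - np.outer(t, p_a)   # deflate X ...
        y = y - t * q_a            # ... and y before the next component
        W[:, a], P[:, a], q[a] = w, p_a, q_a
    # Coefficients in the original variables: B = W (P'W)^{-1} q
    return W @ np.linalg.solve(P.T @ W, q)
```

With as many components as predictors and a noise-free response, the coefficients coincide with ordinary least squares; using fewer components, as in practice, is what gives PLSR its regularising effect.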
Baseline corrections are often chosen by visual inspection of their effect on selected spectra. Here, a more objective procedure for choosing baseline correction algorithms and their parameter values for use in statistical analysis is presented. When the goal of baseline correction is spectra with a pleasing appearance, visual inspection can be a satisfactory approach. If the spectra are to be used in a statistical analysis, however, objectivity and reproducibility are essential for good prediction. Variation in baselines from dataset to dataset means there is no guarantee that the best-performing algorithm from one analysis will also be the best on a new dataset. This paper focuses on choosing baseline correction algorithms and optimizing their parameter values based on the performance of the quality measure from the given analysis. The results presented here illustrate the potential benefits of this optimization and point out some possible pitfalls of baseline correction.
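The selection procedure described above amounts to a search over candidate parameter values, scored by the quality measure of the downstream analysis. A minimal NumPy sketch, assuming a simple polynomial baseline as the correction algorithm and a user-supplied `quality` callable — both hypothetical stand-ins for the paper's actual algorithms and measures:

```python
import numpy as np

def polynomial_baseline(spectrum, degree):
    """Fit a polynomial of the given degree to the spectrum and subtract it.

    A simple stand-in for a real baseline-correction algorithm.
    """
    x = np.arange(len(spectrum))
    coef = np.polyfit(x, spectrum, degree)
    return spectrum - np.polyval(coef, x)

def choose_baseline_degree(spectra, y, degrees, quality):
    """Pick the polynomial degree that minimises the downstream quality
    measure (e.g. a cross-validated prediction error) on corrected spectra.

    `quality(corrected_spectra, y)` is a user-supplied callable; lower is
    assumed better.
    """
    scores = {}
    for d in degrees:
        corrected = np.array([polynomial_baseline(s, d) for s in spectra])
        scores[d] = quality(corrected, y)
    best = min(scores, key=scores.get)
    return best, scores
```

The same loop applies unchanged when the candidates are different correction algorithms rather than parameter values of one algorithm.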
Background: Large multigene sequence alignments have in recent years been increasingly employed for phylogenomic reconstruction of the eukaryote tree of life. Such supermatrices of sequence data are preferred over single-gene alignments because they contain vastly more information about ancient sequence characteristics, and are thus better suited to resolving deeply diverging relationships. However, as alignments are expanded, increasing numbers of sites with misleading phylogenetic information are added as well. A major goal in phylogenomic analyses is therefore to maximize the ratio of information to noise; this can be achieved by removing fast-evolving sites.
This paper presents results from simulations based on real data, comparing several competing estimators of the mean squared error of prediction (MSEP) for principal component regression (PCR) and partial least squares regression (PLSR): leave-one-out cross-validation, K-fold and adjusted K-fold cross-validation, the ordinary bootstrap estimate, the bootstrap smoothed cross-validation (BCV) estimate, and the 0.632 bootstrap estimate. The overall performance of the estimators is compared in terms of their bias, variance and squared error. The results indicate that the 0.632 estimate and leave-one-out cross-validation are preferable when one can afford the computation. Otherwise, adjusted 5- or 10-fold cross-validation is a good candidate because of its computational efficiency.
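Two of the compared estimators are easy to state generically. A NumPy sketch, using ordinary least squares as the regression method to keep the example self-contained (the paper applies the estimators to PCR and PLSR fits instead):

```python
import numpy as np

def kfold_msep(X, y, k=5, seed=0):
    """Estimate MSEP of a least-squares fit by K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    sq_err = np.empty(n)
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        # Fit on the training folds, predict the held-out fold.
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        sq_err[fold] = (y[fold] - X[fold] @ coef) ** 2
    return sq_err.mean()

def bootstrap_632_msep(X, y, B=100, seed=0):
    """0.632 bootstrap: a weighted mix of the apparent (resubstitution)
    error and the out-of-bag error over B bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    apparent = np.mean((y - X @ coef) ** 2)
    oob_errs = []
    for _ in range(B):
        boot = rng.integers(0, n, n)             # resample with replacement
        oob = np.setdiff1d(np.arange(n), boot)   # observations left out
        if len(oob) == 0:
            continue
        c, *_ = np.linalg.lstsq(X[boot], y[boot], rcond=None)
        oob_errs.append(np.mean((y[oob] - X[oob] @ c) ** 2))
    return 0.368 * apparent + 0.632 * np.mean(oob_errs)
```

The apparent error is optimistic and the out-of-bag error pessimistic (each bootstrap fit sees about 63.2% of the distinct observations), which motivates the fixed 0.368/0.632 weighting.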
SUMMARY: This paper presents a discussion of the collinearity problem in regression and discriminant analysis. It describes why collinearity is a problem for the prediction ability and classification ability of the classical methods. The discussion is based on established formulae for prediction errors. Special emphasis is put on differences and similarities between regression and classification. Some typical ways of handling collinearity problems based on PCA are described. The theoretical discussion is accompanied by empirical illustrations.
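One PCA-based remedy of the kind discussed above is principal component regression: regressing the response on the leading principal component scores and discarding the small-variance directions that make the ordinary least-squares solution unstable. An illustrative NumPy sketch (names and interface are our own, not the paper's notation):

```python
import numpy as np

def pcr_fit_predict(Xtr, ytr, Xte, ncomp):
    """Principal component regression: regress y on the first ncomp
    principal component scores of the (centred) training predictors.

    Truncating the small-variance directions stabilises prediction
    when the columns of X are (near-)collinear.
    """
    mu = Xtr.mean(axis=0)
    Xc = Xtr - mu
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:ncomp].T               # loadings of the leading components
    T = Xc @ V                     # component scores
    q, *_ = np.linalg.lstsq(T, ytr - ytr.mean(), rcond=None)
    beta = V @ q                   # coefficients in the original variables
    return (Xte - mu) @ beta + ytr.mean()
```

With ncomp equal to the number of predictors this reduces to ordinary least squares; the gain under collinearity comes precisely from leaving the trailing components out.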