Classifying groups of individuals based on their metabolic profile is one of the main topics in metabolomics research. Because the number of individuals is low compared to the number of variables, this is not an easy task. PLSDA is one of the data analysis methods used for this classification. Unfortunately, this method readily overfits the data, so rigorous validation is necessary. The validation, however, is far from straightforward. In this paper we discuss a strategy based on cross model validation and permutation testing to validate the classification models. It is also shown that overly optimistic results are obtained when the validation is not done properly. Furthermore, we advocate against the use of PLSDA score plots for inference of class differences.
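The permutation-testing idea mentioned above can be illustrated with a minimal sketch. It is not the authors' procedure: a nearest-centroid classifier is swapped in for PLSDA, and a single leave-one-out loop stands in for full cross model validation (which would nest an inner loop for model selection). All function names, the toy data, and the parameters `n_permutations` and `seed` are assumptions made for illustration.

```python
import random

def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x (stand-in for PLSDA)."""
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, float("inf")
    for label in sorted(sums):
        centroid = [s / counts[label] for s in sums[label]]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_misclassifications(X, y):
    """Count leave-one-out errors: each sample is predicted by a model that never saw it."""
    errors = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i]) != y[i]:
            errors += 1
    return errors

def permutation_p_value(X, y, n_permutations=200, seed=0):
    """Compare the observed error count with counts obtained after shuffling the labels."""
    rng = random.Random(seed)
    observed = loo_misclassifications(X, y)
    as_good = 0
    for _ in range(n_permutations):
        shuffled = y[:]
        rng.shuffle(shuffled)
        # The whole validation is repeated for every permutation.
        if loo_misclassifications(X, shuffled) <= observed:
            as_good += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return observed, (as_good + 1) / (n_permutations + 1)

# Hypothetical toy data: two well-separated groups of four samples each.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.1, 0.1],
     [5.0, 5.0], [5.1, 5.2], [5.2, 5.1], [5.1, 5.1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
observed_errors, p_value = permutation_p_value(X, y)
```

The key design point is that the error count for each permuted label set comes from the same validation loop as the observed one, so any optimism in the procedure affects both sides of the comparison equally.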
A new method is introduced for the analysis of 'omics' data derived from crossover-designed drug or nutritional intervention studies. The method aims at finding systematic variations in metabolic profiles after a drug or nutritional challenge and takes advantage of the crossover design in the data. The method, which can be considered a multivariate extension of a paired t-test, generates different multivariate submodels for the between- and the within-subject variation in the data. A major advantage of this variation splitting is that each submodel can be analyzed separately without being confounded with the other variation sources. The power of the multilevel approach is demonstrated in a human nutritional intervention study which used NMR-based metabolomics to assess the metabolic impact of grape/wine extract consumption. The variations in the urine metabolic profiles are studied between and within the human subjects using the multilevel analysis. After variation splitting, multilevel PCA is used to investigate the experimental and biological differences between the subjects, whereas a multilevel PLS-DA model is used to reveal the net treatment effect within the subjects. The observed treatment effect is validated with cross model validation and permutations. It is shown that the statistical significance of the multilevel classification model (p << 0.0002) is a major improvement compared to an ordinary PLS-DA model (p = 0.058) without variation splitting. Finally, rank products are used to determine which NMR signals are most important in the multilevel classification model.
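The variation splitting described above can be sketched as follows, assuming a balanced design in which every subject is measured once per treatment period. The function names and toy data are hypothetical and only illustrate the decomposition of each profile into a grand mean, a between-subject part, and a within-subject part; the PCA and PLS-DA submodels that would then be fitted on those parts are not shown.

```python
def column_means(rows):
    """Mean of each variable (column) over a list of profiles."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def split_variation(data):
    """Split profiles into grand mean, between-subject and within-subject parts.

    data maps a subject id to a list of profiles, one per treatment period.
    Assumes a balanced design (same number of profiles for every subject).
    """
    all_rows = [row for rows in data.values() for row in rows]
    grand_mean = column_means(all_rows)
    between, within = {}, {}
    for subject, rows in data.items():
        subject_mean = column_means(rows)
        # Between-subject part: how this subject's average differs from the grand mean.
        between[subject] = [m - g for m, g in zip(subject_mean, grand_mean)]
        # Within-subject part: how each period differs from the subject's own average.
        within[subject] = [[v - m for v, m in zip(row, subject_mean)] for row in rows]
    return grand_mean, between, within

# Hypothetical toy data: two subjects, two treatment periods, two variables.
data = {
    "s1": [[1.0, 2.0], [3.0, 4.0]],
    "s2": [[5.0, 6.0], [9.0, 10.0]],
}
grand_mean, between, within = split_variation(data)
```

With two periods per subject, the within-subject rows are plus and minus half the paired difference, which is why the approach can be read as a multivariate extension of a paired t-test.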
SELDI-TOF-MS is rapidly gaining popularity as a screening tool for clinical applications of proteomics. Application of adequate statistical techniques in all the stages from measurement to information is obligatory. One of the statistical methods often used in proteomics is classification: the assignment of subjects to discrete categories, for example healthy or diseased. Lately, many new classification methods have been developed, often specifically for the analysis of X-omics data. For proteomics studies a good strategy for evaluating classification results is of prime importance, because usually the number of objects will be small and it would be wasteful to set aside part of these as a 'mere' test set. The present paper offers such a strategy in the form of a protocol which can be used for choosing among different statistical classification methods and obtaining figures of merit of their performance. This paper also illustrates the usefulness of proteomics in a clinical setting (serum samples from Gaucher disease patients) when used in combination with an appropriate classification method.