Many modern analytical methods are used to analyse samples from designed experiments, for example in the medical, biological, or agronomic fields. Most of the time, these methods generate highly multivariate data such as spectra or images. This is the case for "omics" technologies, which are used to detect genes (genomics), mRNA (transcriptomics), proteins (proteomics), or metabolites (metabolomics) in a specific biological sample. These technologies produce high-dimensional multivariate data sets in which the number of variables (descriptors) tends to be much larger than the number of experimental units. Moreover, omics experiments often follow designs aimed at understanding the effect of several factors on biological systems. Multivariate statistical tools are therefore needed to highlight variables that are consistently modified by different biological states. In this context, two recent methods combine analysis of variance (ANOVA) and principal component analysis (PCA): ASCA (ANOVA-simultaneous component analysis) and APCA (ANOVA-PCA). They provide powerful tools to visualize multivariate structures in the space of each effect of the statistical model linked to the experimental design. Their main limitation is that they provide biased estimators of the factor effects when the experimental design is unbalanced. This paper introduces two new methods, ASCA+ and APCA+, which extend ASCA and APCA, respectively, to unbalanced designs using several principles from the theory of general linear models. Both methods are applied to real-life metabolomics data, demonstrating the capacity of ASCA+ and APCA+ to highlight correct biomarkers corresponding to effects of interest in unbalanced designs.
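The idea of estimating effect matrices with a general linear model before applying PCA can be illustrated with a minimal sketch. It is not the authors' implementation: the sum-to-zero coding, the main-effects-only model, the function names, and the toy 2x2 design are all assumptions made for illustration. On noise-free toy data, the least-squares effect matrix recovers the true factor effect, whereas level means minus the grand mean (the classical balanced-design estimate) are contaminated by the other factor.

```python
import numpy as np

def sum_coding(levels):
    """Sum-to-zero coding: k levels -> k-1 columns, last level coded as -1."""
    levels = np.asarray(levels)
    uniq = np.unique(levels)
    k = len(uniq)
    codes = np.vstack([np.eye(k - 1), -np.ones((1, k - 1))])
    return codes[np.searchsorted(uniq, levels)]

def glm_effect_matrices(Y, factors):
    """Fit Y = X B + E by least squares (main effects only, an assumption)
    and split the fitted values into one effect matrix per model term."""
    n = Y.shape[0]
    blocks = {"mean": np.ones((n, 1))}
    for name, levels in factors.items():
        blocks[name] = sum_coding(levels)
    X = np.hstack(list(blocks.values()))
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    effects, start = {}, 0
    for name, Xb in blocks.items():
        stop = start + Xb.shape[1]
        effects[name] = Xb @ B[start:stop]
        start = stop
    return effects, Y - X @ B

def pca_scores_loadings(M, n_comp=1):
    """PCA of an effect matrix through the singular value decomposition."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :n_comp] * s[:n_comp], Vt[:n_comp].T

# Toy unbalanced 2x2 design with cell sizes 4, 1, 1, 4 (hypothetical numbers).
A = np.array([0] * 5 + [1] * 5)
B = np.array([0] * 4 + [1] + [0] + [1] * 4)
a = np.array([1.0, 0.0, 0.0])  # true effect of factor A on 3 variables
b = np.array([0.0, 2.0, 0.0])  # true effect of factor B on 3 variables
true_A = np.outer(np.where(A == 0, 1.0, -1.0), a)
Y = 5.0 + true_A + np.outer(np.where(B == 0, 1.0, -1.0), b)  # noise-free

effects, residuals = glm_effect_matrices(Y, {"A": A, "B": B})

# Classical estimate: level means of A minus the grand mean.
grand_mean = Y.mean(axis=0)
naive_A = np.vstack([Y[A == lev].mean(axis=0) - grand_mean for lev in A])

# PCA on the pure effect matrix (ASCA-style); adding the residuals to the
# effect matrix before the PCA would give the APCA-style variant.
scores, loadings = pca_scores_loadings(effects["A"], n_comp=1)
```

In this toy design, `naive_A` picks up part of the factor B effect on the second variable because the cells are unequal in size, while `effects["A"]` does not.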
Compared with the widely used ¹H-NMR spectroscopy, two-dimensional NMR experiments provide more sophisticated spectra, which should facilitate the identification of relevant spectral zones or biomarkers in metabolomics. This paper focuses on ¹H-¹H COrrelation SpectroscopY (COSY) spectral data. Despite longer inherent acquisition times, users (biologists, healthcare professionals) commonly accept that the additional dimension probably represents a huge qualitative step for investigations in terms of metabolite identification. Moreover, it seems natural that more information leads to more predictive power. Until now, however, very few statistical studies have clearly proved this assumption. A fundamental question is therefore "Is this supplementary information relevant?". To extend the statistical properties developed for 1D spectroscopy to the challenges raised by 2D spectra, a rigorous study of the performance of COSY spectra is needed as a prerequisite. After introducing new pre-processing concepts, such as the Global Peak List and an ad hoc 2D "bucketing", this paper presents an innovative methodology based on multivariate clustering algorithms to address this question. Numerical clustering quality indexes and graphical results are proposed, based both on the spectral presence or absence of peaks (binary position vectors) and on peak intensities, at different levels of spectral resolution. The second goal of this paper is to compare the clustering performance obtained on COSY and on ¹H-NMR spectra, in order to understand to what extent COSY spectra carry more Metabolomic Informative Content about the signal than 1D ones. The methodology is applied to two real experimental designs involving different groups of spectra (which define the signal): a 4-mixture cell culture media design containing various supervised metabolites and a complex human-serum-based design.
It is shown that COSY spectra appear to be statistically powerful and, in addition, provide better clustering results than the corresponding ¹H-NMR spectra when using unlabeled information. Consequently, the additional information appears to be relevant for metabolomics applications.
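To make the notion of a clustering quality index on binary position vectors concrete, here is a minimal sketch. The choice of the Jaccard distance and the Dunn index, the function names, and the toy peak lists are assumptions for illustration; the abstract does not say which indexes or distances the paper uses.

```python
import numpy as np

def jaccard_distances(B):
    """Pairwise Jaccard distance between rows of a 0/1 peak-presence matrix."""
    B = np.asarray(B, dtype=bool)
    inter = (B[:, None, :] & B[None, :, :]).sum(axis=2)
    union = (B[:, None, :] | B[None, :, :]).sum(axis=2)
    # Two rows without any peak are treated as identical (distance 0).
    return np.where(union > 0, 1.0 - inter / np.maximum(union, 1), 0.0)

def euclidean_distances(X):
    """Pairwise Euclidean distance between rows of a peak-intensity matrix."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def dunn_index(D, labels):
    """Smallest between-group distance divided by the largest within-group
    distance; larger values mean better separated, more compact groups."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    return D[~same].min() / D[same & off_diag].max()

# Hypothetical peak lists: 4 spectra x 6 buckets, two known groups.
presence = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 0],
])
groups = np.array([0, 0, 1, 1])

dunn_binary = dunn_index(jaccard_distances(presence), groups)
```

The same `dunn_index` function can be applied to `euclidean_distances` computed on an intensity matrix, which gives one simple way to compare the two representations, or 1D and 2D spectra, on a common scale.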
of PLS, which promotes an inner variable/feature selection, is an interesting existing solution. In this paper, however, a new intuitive algorithm is proposed to combine sparsity with the advantages of an orthogonalization step: the "Light-sparse-OPLS" (L-sOPLS). L-sOPLS promotes sparsity on a previously optimized deflated matrix, which implies the removal of the Y-orthogonal components. Results: A discussion of the compromise between sparsity and predictive modelling performance is provided, and it is shown that L-sOPLS produces convincing results, illustrated principally on ¹H-NMR spectral data but also on genomic RT-qPCR data. Conclusion: The L-sOPLS algorithm achieves better predictive performance than (O)PLS and sPLS while taking into account only a very small number of relevant descriptors.
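The two-stage idea described above (remove Y-orthogonal variation, then impose sparsity on the deflated matrix) can be sketched as follows. This is an illustration under stated assumptions, not the published L-sOPLS algorithm: it handles a single centred response, removes exactly one orthogonal component with the standard single-response OPLS filter, and obtains sparsity by soft-thresholding with a hypothetical tuning parameter `eta`. The function names and the simulated data are likewise hypothetical.

```python
import numpy as np

def soft_threshold(z, thr):
    """Shrink every entry towards zero by thr; small entries become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def sparse_after_opls_filter(X, y, eta=0.5):
    """Remove one Y-orthogonal component from centred X, then compute one
    sparse predictive weight vector on the deflated matrix (0 <= eta < 1)."""
    # Predictive direction, scores, and loadings of ordinary PLS.
    w = X.T @ y
    w /= np.linalg.norm(w)
    t = X @ w
    p = X.T @ t / (t @ t)
    # Orthogonal weight: the part of the loading not aligned with w.
    w_o = p - (w @ p) * w
    w_o /= np.linalg.norm(w_o)
    t_o = X @ w_o  # orthogonal scores, uncorrelated with y by construction
    p_o = X.T @ t_o / (t_o @ t_o)
    X_defl = X - np.outer(t_o, p_o)  # deflated (filtered) matrix
    # Sparse weight on the deflated matrix via soft-thresholding.
    z = X_defl.T @ y
    w_sparse = soft_threshold(z, eta * np.abs(z).max())
    w_sparse /= np.linalg.norm(w_sparse)
    t_pred = X_defl @ w_sparse
    b = (t_pred @ y) / (t_pred @ t_pred)  # regression of y on the sparse score
    return w_sparse, t_o, X_defl, b

# Simulated data: 40 samples, 20 descriptors, only the first two informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 20))
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=40)
X = X - X.mean(axis=0)
y = y - y.mean()

w_sparse, t_o, X_defl, b = sparse_after_opls_filter(X, y, eta=0.5)
selected = np.flatnonzero(w_sparse)  # indices of the retained descriptors
```

In practice, the number of orthogonal components and the sparsity level would both need to be tuned, for example by cross-validation; the sketch fixes them only to keep the example short.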