A strategy for optimizing LC-MS metabolomics data processing is proposed. We applied this strategy to XCMS, an open-source package written in R, using both human and plant biology data. The strategy is a sequential design of experiments (DoE) based on a dilution series from a pooled sample and a measure of correlation between diluted concentrations and integrated peak areas. The reliability index, the metric used to define peak quality, simultaneously favors reliable peaks and disfavors unreliable peaks using a weighted ratio between peaks with high and low response linearity. In the case studies, DoE optimization improved the reliability index by more than 57% compared with the default settings. The proposed strategy can be applied to any other data processing software with tunable parameters, e.g., MZmine 2. It can also be fully automated and used as a module in a complete metabolomics data processing pipeline.
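The idea of scoring peaks by response linearity across a dilution series can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything specific here is an assumption made for illustration: the function names `pearson_r` and `reliability_index`, the r² thresholds of 0.9 and 0.05, and the form of the ratio (reliable count squared over unreliable count). It is not the published definition.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A constant series carries no linear response; treat as uncorrelated.
        return 0.0
    return sxy / sqrt(sxx * syy)


def reliability_index(concentrations, peak_areas, hi=0.9, lo=0.05):
    """Score a peak table by response linearity (assumed, illustrative form).

    concentrations: relative concentrations of the dilution series.
    peak_areas: one sequence of integrated areas per detected peak.
    hi, lo: hypothetical r^2 thresholds for reliable / unreliable peaks.
    """
    reliable = 0
    unreliable = 0
    for areas in peak_areas:
        r = pearson_r(concentrations, areas)
        r2 = r * r
        if r > 0 and r2 >= hi:
            reliable += 1      # area increases linearly with concentration
        elif r2 <= lo:
            unreliable += 1    # area essentially unrelated to concentration
        # Peaks between the two thresholds are ignored in this sketch.
    # Squaring the reliable count rewards settings that find many linear
    # peaks; dividing by the unreliable count penalizes noisy ones.
    return reliable ** 2 / max(unreliable, 1)


concentrations = [1, 2, 4, 8]
peak_areas = [
    [10, 20, 40, 80],   # perfectly linear response
    [11, 19, 42, 79],   # near-linear response
    [7, 7, 7, 7],       # flat response, no dependence on concentration
]
score = reliability_index(concentrations, peak_areas)
```

In a DoE setting, a score of this kind would be computed once per parameter combination of the processing software, and the design would then move toward the settings that maximize it.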
The volume of data generated in metabolomics studies is clearly increasing. This calls for both a battery of suitable analysis methods and validation procedures able to handle large amounts of data. In this review, an overview of the metabolomics data processing pipeline is presented. A selection of recently developed and most cited data processing methods is discussed. In addition, commonly used chemometric and machine learning analysis methods as well as validation approaches are described.
We have developed a multistep strategy that integrates data from several large-scale experiments that suffer from systematic between-experiment variation. The strategy removes this variation, which would otherwise mask differences of interest. It was applied to the evaluation of wood chemical analysis of 736 hybrid aspen trees: wild-type controls and transgenic trees potentially involved in wood formation. The trees were grown in four different greenhouse experiments, which introduced significant variation between experiments. Pyrolysis coupled to gas chromatography/mass spectrometry (Py-GC/MS) was used as a high-throughput screening platform for fingerprinting of wood chemotype. Our proposed strategy includes quality control, outlier detection, gene-specific classification, and consensus analysis. The orthogonal projections to latent structures discriminant analysis (OPLS-DA) method was used to generate the consensus chemotype profiles for each transgenic line. These were thereafter compiled to generate a global dataset. Multivariate analysis and cluster analysis techniques revealed a drastic reduction in between-experiment variation that enabled a global analysis of all transgenic lines from the four independent experiments. Information from in-depth analysis of specific transgenic lines and independent peak identification validated our proposed strategy.