Microbial communities are ubiquitous and often influence macroscopic properties of the ecosystems they inhabit. However, deciphering the functional relationship between specific microbes and ecosystem properties is an ongoing challenge owing to the complexity of the communities. This challenge can be addressed, in part, by integrating the advances in DNA sequencing technology with computational approaches like machine learning. Although machine learning techniques have been applied to microbiome data, use of these techniques remains rare, and user-friendly platforms to implement such techniques are not widely available. We developed a tool that implements neural network and random forest models to perform regression and feature selection tasks on microbiome data. In this study, we applied the tool to analyze soil microbiome (16S rRNA gene profiles) and dissolved organic carbon (DOC) data from a 44-day plant litter decomposition experiment. The microbiome data include 1709 total bacterial operational taxonomic units (OTUs) from 300+ microcosms. Regression analysis of predicted and actual DOC for a held-out test set of 51 samples yields Pearson's correlation coefficients of 0.636 and 0.676 for the neural network and random forest approaches, respectively. Important taxa identified by the machine learning techniques are compared to results from a standard tool (indicator species analysis) widely used by microbial ecologists. Of 1709 bacterial taxa, indicator species analysis identified 285 taxa as significant determinants of DOC concentration. Of the top 285 ranked features determined by machine learning methods, a subset of 86 taxa is common to all feature selection techniques. Using this subset of features, prediction results for random permutations of the data set are at least as accurate as predictions determined using the entire feature set.
Our results suggest that integration of multiple methods can aid identification of a robust subset of taxa within complex communities that may drive specific functional outcomes of interest.
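The two evaluation ideas in this abstract, scoring predictions with Pearson's correlation coefficient and keeping only the taxa that rank highly under every feature selection method, can be illustrated with a minimal sketch. It is not the authors' tool: the function names, the OTU labels, and the toy importance scores are all hypothetical, and `k=2` stands in for the top 285 used in the study.

```python
from math import sqrt


def pearson_r(predicted, actual):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    if n != len(actual) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_p * var_a)


def top_k(scores, k):
    """Return the set of the k highest-scoring feature names."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def consensus_features(score_tables, k):
    """Features that appear in the top k of every method's ranking."""
    return set.intersection(*(top_k(scores, k) for scores in score_tables))


# Toy importance scores (hypothetical values, not data from the study).
nn_scores = {"OTU_1": 0.9, "OTU_2": 0.1, "OTU_3": 0.7, "OTU_4": 0.3}
rf_scores = {"OTU_1": 0.8, "OTU_2": 0.6, "OTU_3": 0.5, "OTU_4": 0.1}
isa_scores = {"OTU_1": 0.95, "OTU_2": 0.2, "OTU_3": 0.9, "OTU_4": 0.4}

shared = consensus_features([nn_scores, rf_scores, isa_scores], k=2)
```

In this toy case only one OTU survives all three rankings, which mirrors how the intersection step shrinks a large candidate list to a smaller consensus subset.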
Predicting the dynamics and functions of microbiomes constructed from the bottom up is a key challenge in exploiting them to our benefit. Current models based on ecological theory fail to capture complex community behaviors that arise from higher-order interactions, and they scale poorly as community complexity and the number of functions considered increase. We develop and apply a long short-term memory (LSTM) framework to advance our understanding of community assembly and health-relevant metabolite production using a synthetic human gut community. A mainstay of recurrent neural networks, the LSTM learns a high-dimensional, data-driven, nonlinear dynamical system model. We show that the LSTM model can outperform the widely used generalized Lotka-Volterra model based on ecological theory. We build methods to decipher microbe-microbe and microbe-metabolite interactions from an otherwise black-box model. These methods highlight that Actinobacteria, Firmicutes and Proteobacteria are significant drivers of metabolite production whereas Bacteroides shape community dynamics. We use the LSTM model to navigate a large multidimensional functional landscape to design communities with unique health-relevant metabolite profiles and temporal behaviors. In sum, the accuracy of the LSTM model can be exploited for experimental planning and to guide the design of synthetic microbiomes with target dynamic functions.
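For readers unfamiliar with the baseline mentioned above, the generalized Lotka-Volterra model in its standard textbook form is dx_i/dt = x_i (r_i + sum_j A_ij x_j), where r_i is a growth rate and A_ij a pairwise interaction coefficient. The sketch below integrates that form with a simple forward-Euler scheme. The parameter values, function names, and step size are assumptions chosen for illustration, not values from the paper.

```python
def glv_step(x, r, A, dt):
    """One forward-Euler step of dx_i/dt = x_i * (r_i + sum_j A[i][j] * x_j)."""
    n = len(x)
    new_x = []
    for i in range(n):
        interaction = sum(A[i][j] * x[j] for j in range(n))
        dx = x[i] * (r[i] + interaction)
        # Clamp at zero so a coarse step cannot produce negative abundances.
        new_x.append(max(x[i] + dt * dx, 0.0))
    return new_x


def simulate_glv(x0, r, A, dt=0.01, steps=5000):
    """Return the full trajectory as a list of abundance vectors."""
    x = list(x0)
    trajectory = [x]
    for _ in range(steps):
        x = glv_step(x, r, A, dt)
        trajectory.append(x)
    return trajectory


# Hypothetical two-species competitive community (toy parameters).
growth = [1.0, 0.8]
interactions = [[-1.0, -0.5],
                [-0.4, -1.0]]
final_state = simulate_glv([0.1, 0.1], growth, interactions)[-1]
```

Because the interactions are strictly pairwise, every term in the model is at most quadratic in the abundances, which is one way to see why higher-order effects fall outside what this baseline can represent.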
Discovering widespread microbial processes that drive unexpected variation in carbon cycling may improve modeling and management of soil carbon (Prescott, 2010; Wieder et al., 2015a, 2018). A first step is to identify community features linked to carbon cycle variation. We addressed this challenge using an epidemiological approach with 206 soil communities decomposing Ponderosa pine litter in 618 microcosms. Carbon flow from litter decomposition was measured over a 6-week incubation. Cumulative CO2 from microbial respiration varied twofold among microcosms and dissolved organic carbon (DOC) from litter decomposition varied fivefold, demonstrating large functional variation despite constant environmental conditions where strong selection is expected. To investigate microbial features driving DOC concentration, two microbial community cohorts were delineated as "high" and "low" DOC. For each cohort, communities from the original soils and from the final microcosm communities after the 6-week incubation with litter were taxonomically profiled. A logistic model including total biomass, fungal richness, and bacterial richness measured in the original soils or in the final microcosm communities predicted the DOC cohort with 72 (P < 0.05) and 80 (P < 0.001) percent accuracy, respectively. The strongest predictors of the DOC cohort were biomass and either fungal richness (in the original soils) or bacterial richness (in the final microcosm communities). Successful forecasting of functional patterns after lengthy community succession in a new environment reveals strong historical contingencies. Forecasting future community function is a key advance beyond correlation of functional variance with end-state community features.
The importance of taxon richness (the same feature linked to carbon fate in gut microbiome studies) underscores the need for increased understanding of biotic mechanisms that can shape richness in microbial communities independent of physicochemical conditions.
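As a rough illustration of how a logistic model turns three community measurements into a "high" or "low" DOC cohort call, and how percent accuracy is then computed, consider the sketch below. The coefficients, the standardized feature values, and the function names are all hypothetical. Nothing here is fitted to, or taken from, the study's data.

```python
from math import exp


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + exp(-z))


def predict_cohort(features, weights, bias, threshold=0.5):
    """Return 1 ("high" DOC) or 0 ("low" DOC) for one community."""
    score = bias + sum(w * f for w, f in zip(weights, features))
    return 1 if sigmoid(score) >= threshold else 0


def accuracy(samples, weights, bias):
    """Fraction of samples whose predicted cohort matches the label."""
    correct = sum(
        1 for features, label in samples
        if predict_cohort(features, weights, bias) == label
    )
    return correct / len(samples)


# Hypothetical coefficients for (biomass, fungal richness, bacterial richness),
# applied to standardized toy measurements. These are assumptions only.
toy_weights = [1.0, -0.5, 0.8]
toy_bias = 0.0
toy_samples = [
    ([1.2, -0.3, 0.5], 1),
    ([-0.8, 0.4, -1.0], 0),
    ([0.3, 1.0, 0.1], 1),
    ([-0.5, -0.2, -0.3], 0),
]

toy_accuracy = accuracy(toy_samples, toy_weights, toy_bias)
```

In practice the coefficients would be estimated from labeled communities, and accuracy would be assessed on data not used for fitting.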
Introduction
Microbial communities mediate essential functions in diverse ecosystems. While the microbiome controls many interesting macroscopic properties, elucidating the relationship between specific microbes and ecosystem functions remains a complex problem in ecology. Recent advances in DNA sequencing technology make it easy to acquire metagenomic data representing the taxonomic profile of bacteria and fungi in microbial communities. This opens the door to deciphering which components of the microbiome can drive changes in macroscopic properties. However, analysis of metagenomic microbial data poses several difficulties. The data are typically high dimensional (many taxa) with a small number of samples collected in each study. Additionally, sequencing results are noisy and yield sparse data sets [1].
Machine learning techniques provide a means to analyze high-dimensional data [2, 3] and could be used to elucidate relationships be...