The cryptic diversity of microbial communities represents an untapped biotechnological resource for biomining, biorefining and synthetic biology. Revealing this information requires the recovery of the exact sequence of DNA bases (or "haplotype") that constitutes the genes and genomes of every individual present. This is a computationally difficult problem, complicated by the need for environmental sequencing approaches (metagenomics) because the constituent organisms resist culturing in vitro. Haplotypes are identified by their unique combination of DNA variants. However, standard approaches for working with metagenomic data require simplifications that violate assumptions made when identifying such variation. Furthermore, current haplotyping methods lack objective mechanisms for choosing between alternative haplotype reconstructions from microbial communities. To address this, we have developed a novel probabilistic approach for reconstructing haplotypes from complex microbial communities and propose the "metahaplome" as a definition for the set of haplotypes for any particular genomic region of interest within a metagenomic dataset. Implemented in the twin software tools Hansel and Gretel, the algorithm performs incremental probabilistic haplotype recovery using Naive Bayes, an efficient and effective technique. Our approach is capable of reconstructing the haplotypes with the highest likelihoods from metagenomic datasets without a priori knowledge of, or assumptions about, the distribution or number of variants. Additionally, the algorithm is robust to sequencing and alignment error without altering or discarding observed variation, and uses all available evidence from aligned reads. We validate our approach using synthetic metahaplomes constructed from sets of real genes, and demonstrate its capability using metagenomic data from a complex HIV-1 strain mix. The results show that the likelihood framework allows cryptic functional isoforms of genes to be recovered from microbial communities with 100% accuracy.
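To make the idea of incremental, Naive Bayes-driven haplotype recovery concrete, the following is a minimal illustrative sketch and not the Hansel and Gretel implementation. It assumes reads have already been reduced to the bases they carry at numbered variant sites, and every name in it (`count_pairs`, `reconstruct`, `example_reads`, the `pseudo` smoothing constant) is hypothetical. At each site it selects the base that maximises a smoothed prior multiplied by the likelihood of each previously selected base given that candidate, using pairwise co-occurrence counts taken from the reads.

```python
import math
from collections import Counter


def count_pairs(reads):
    """Count how often base x at site i co-occurs with base y at site j on one read."""
    pairs = Counter()
    for read in reads:
        sites = sorted(read)
        for a, i in enumerate(sites):
            for j in sites[a + 1:]:
                pairs[(i, read[i], j, read[j])] += 1
    return pairs


def reconstruct(reads, n_sites, alphabet="ACGT", pseudo=1.0):
    """Greedily build one haplotype, site by site, with a Naive Bayes score.

    `reads` is a list of dicts mapping variant-site index -> observed base.
    Returns the haplotype string and the sum of the (unnormalised) log scores.
    """
    pairs = count_pairs(reads)
    k = len(alphabet)

    def marginal(j):
        # Observed base counts at site j across all reads covering it.
        return Counter(read[j] for read in reads if j in read)

    # Seed the haplotype with the most frequently observed base at site 0.
    first = marginal(0)
    haplotype = [max(alphabet, key=lambda b: first[b])]
    log_score = 0.0

    for j in range(1, n_sites):
        site_counts = marginal(j)
        depth = sum(site_counts.values())

        def score(y):
            # Smoothed prior P(y at site j).
            total = math.log((site_counts[y] + pseudo) / (depth + pseudo * k))
            # Naive Bayes: treat each earlier choice as independent given y.
            for i, x in enumerate(haplotype):
                seen = sum(pairs[(i, b, j, y)] for b in alphabet)
                total += math.log((pairs[(i, x, j, y)] + pseudo) / (seen + pseudo * k))
            return total

        best = max(alphabet, key=score)
        log_score += score(best)
        haplotype.append(best)

    return "".join(haplotype), log_score


# Toy data (invented for illustration): three reads from "ACGT", two from "TGCA".
example_reads = [
    {0: "A", 1: "C", 2: "G"},
    {1: "C", 2: "G", 3: "T"},
    {0: "A", 1: "C"},
    {0: "T", 1: "G", 2: "C"},
    {1: "G", 2: "C", 3: "A"},
]

if __name__ == "__main__":
    print(reconstruct(example_reads, n_sites=4))
```

The sketch returns only a single greedy path through the variant sites. Recovering further haplotypes from the same data would require an additional step, such as reweighting the evidence already explained, which is not shown here.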
bioRxiv preprint first posted online Mar. 17, 2017; doi: http://dx.doi.org/10.1101/117838. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.

1. Genomic research is progressing beyond the use of consensus DNA sequences to represent species, towards the ultimate goal of completely characterising the genetic diversity that exists across their populations. So far, research has focused on characterising specific aspects of this diversity, for example: identifying the entire gene-set of all strains of a species (the pangenome) [1]; identifying the groups of genes (or genetic variants within them) that are inherited together in organisms across entire populations (the haplome) [2]; or, in viruses, identifying strains related by mutations in a highly mutagenic environment (the quasispecies) [3]. However, many communities (and especially microbial communities) maintain a fine balance b...