Identifying and quantifying the microbial composition of a complex biological or environmental sample is one of the primary challenges in microbiology. Many software tools have been developed to classify metagenomic sequencing reads originating from a mixture of bacterial or viral genomes, and to estimate the microbial abundance profile of the mixture. Unfortunately, the accuracy of these tools degrades significantly when the genomes in the mixture, or in the genomic database in use, share large portions of their content. Here we introduce CAMMiQ, a novel combinatorial solution to the microbial identification and abundance estimation problem, which improves on all available tools with respect to the number of correctly classified reads (i.e., specificity) by an order of magnitude and resolves mixtures of similar genomes, possibly at the strain level. The key contribution of CAMMiQ is its use of arbitrary-length, doubly-unique substrings, i.e., substrings that appear in exactly two genomes in the input database, in addition to the commonly used fixed-length, unique substrings. To resolve the ambiguity in the genomic origin of doubly-unique substrings, CAMMiQ employs a combinatorial optimization formulation, which can be solved surprisingly quickly. CAMMiQ's index consists of a sparsified subset of the shortest unique and doubly-unique substrings of each genome in the database, within a user-specified length range, and as such it is fairly compact. In short, CAMMiQ offers more accurate genomic identification and abundance estimation than the best known k-mer based and marker gene based alternatives, while using comparable computational resources. Availability: https://github.com/algo-cancer/CAMMiQ
Computational identification and quantification of distinct microbes from high-throughput sequencing data is crucial for our understanding of human health. Existing methods use either accurate but computationally expensive alignment-based approaches or fast but less accurate alignment-free approaches, which often fail to correctly assign reads to genomes. Here we introduce CAMMiQ, a combinatorial optimization framework to identify and quantify distinct genomes (specified by a database) in a metagenomic dataset. As a key methodological innovation, CAMMiQ uses substrings of variable length, including those that appear in two genomes in the database, as opposed to the commonly used fixed-length, unique substrings. These substrings allow CAMMiQ to accurately decouple mixtures of highly similar genomes, resulting in higher accuracy than the leading alternatives without requiring additional computational resources, as demonstrated on commonly used benchmarking datasets. Importantly, we show that CAMMiQ can distinguish closely related bacterial strains in simulated metagenomic and real single-cell metatranscriptomic data.
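To make the notion of unique and doubly-unique substrings concrete, the following is a minimal Python sketch on a toy database. It is an illustration only, not CAMMiQ's implementation: the function names, the toy genomes, and the brute-force enumeration are assumptions, and the sketch ignores reverse complements, the selection of shortest substrings, the sparsification of the index, and the optimization step that assigns reads to genomes.

```python
from collections import defaultdict


def substring_genome_sets(genomes, l_min, l_max):
    """Map every substring of length l_min..l_max to the set of genomes containing it.

    Brute-force enumeration, suitable only for toy inputs (hypothetical helper).
    """
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for length in range(l_min, l_max + 1):
            for i in range(len(seq) - length + 1):
                owners[seq[i:i + length]].add(name)
    return owners


def classify(genomes, l_min, l_max):
    """Split substrings into unique (one genome) and doubly-unique (exactly two genomes)."""
    owners = substring_genome_sets(genomes, l_min, l_max)
    unique = {s: next(iter(g)) for s, g in owners.items() if len(g) == 1}
    doubly_unique = {s: tuple(sorted(g)) for s, g in owners.items() if len(g) == 2}
    return unique, doubly_unique


# Toy database: A and B share a prefix, B and C share a suffix.
toy_db = {"A": "ACGTAC", "B": "ACGTTT", "C": "GGGTTT"}
unique, doubly_unique = classify(toy_db, l_min=3, l_max=4)
```

In this toy database, a read containing `GTA` points to genome A alone, whereas a read containing `GTT` is ambiguous between B and C. Restricting the index to unique substrings would discard the second read; keeping doubly-unique substrings retains it as evidence that must then be apportioned between the two candidate genomes.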
Understanding the relationship between transposable elements (TEs) and their associated genes in the host genome is key to exploring their potential role in genome evolution. Transposable elements can regulate and affect gene expression not only because of their mobility within the genome but also because of their transcriptional activity. Gene expression can be suppressed, decreased or increased, and cellular signalling pathways can be activated, through the expression of a nearby TE itself or through subsequent TE replication intermediates. We implemented a pipeline that reveals the relationship between TEs and the distribution of adjacent genes in the host genome.
Our tool is freely available here: https://github.com/marieBvr/TEs_genes_relationship_pipeline
Understanding the relationship between transposable elements (TEs) and their closest positional genes in the host genome is key to exploring their potential role in genome evolution. Transposable elements can regulate and affect gene expression not only because of their mobility within the genome but also because of their transcriptional activity. A comprehensive knowledge of the structural organization of transposable elements and neighboring genes is important for studying the functional role of TEs in gene regulation. We implemented a pipeline that reveals the positional and directional relationship between TEs and the distribution of adjacent genes in the host genome. Our tool is freely available here: https://github.com/marieBvr/TEs_genes_relationship_pipeline
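As an illustration of what a positional and directional relationship between a TE and its closest gene can look like, here is a minimal Python sketch. It is not the pipeline's code: the `Feature` record, the `gap` and `closest_gene` helpers, the 1-based inclusive coordinates, and the "before"/"after"/"overlap" labels are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A genomic interval (hypothetical record; 1-based, inclusive coordinates)."""
    name: str
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def gap(a, b):
    """Number of bases separating two intervals; 0 if they overlap or touch."""
    return max(0, max(a.start, b.start) - min(a.end, b.end) - 1)


def closest_gene(te, genes):
    """Return (gene name, distance, side, orientation) for the gene nearest to a TE.

    side is given in genomic coordinates: the TE lies "before" or "after" the gene,
    or in "overlap" with it. orientation compares the strands of the TE and the gene.
    Returns None when no gene lies on the TE's chromosome.
    """
    same_chrom = [g for g in genes if g.chrom == te.chrom]
    if not same_chrom:
        return None
    gene = min(same_chrom, key=lambda g: gap(te, g))
    if te.end < gene.start:
        side = "before"
    elif te.start > gene.end:
        side = "after"
    else:
        side = "overlap"
    orientation = "same" if te.strand == gene.strand else "opposite"
    return gene.name, gap(te, gene), side, orientation


# Toy annotation: two genes and one TE on the same chromosome.
genes = [
    Feature("g1", "chr1", 100, 200, "+"),
    Feature("g2", "chr1", 500, 600, "-"),
]
te = Feature("te1", "chr1", 250, 300, "-")
result = closest_gene(te, genes)
```

Here the TE sits 49 bases past the end of `g1` and on the opposite strand, so `g1` is reported as its closest gene. A real analysis would typically read such intervals from annotation files and would likely express position relative to the gene's strand (upstream or downstream) rather than in raw coordinates.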