Background Next Generation Sequencing (NGS) experiments produce millions of short sequences that, mapped to a reference genome, provide biological insights at genomic, transcriptomic and epigenomic level. Typically the amount of reads that correctly maps to the reference genome ranges between 70% and 90%, leaving in some cases a consistent fraction of unmapped sequences. This ’misalignment’ can be ascribed to low quality bases or sequence differences between the sample reads and the reference genome. Investigating the source of the unmapped reads is definitely important to better assess the quality of the whole experiment and to check for possible downstream or upstream ’contamination’ from exogenous nucleic acids. Results Here we propose DecontaMiner, a tool to unravel the presence of contaminating sequences among the unmapped reads. It uses a subtraction approach to identify bacteria, fungi and viruses genome contamination. DecontaMiner generates several output files to track all the processed reads, and to provide a complete report of their characteristics. The good quality matches on microorganism genomes are counted and compared among samples. DecontaMiner builds an offline HTML page containing summary statistics and plots. The latter are obtained using the state-of-the-art D3 javascript libraries. DecontaMiner has been mainly used to detect contamination in human RNA-Seq data. The software is freely available at http://www-labgtp.na.icar.cnr.it/decontaminer . Conclusions DecontaMiner is a tool designed and developed to investigate the presence of contaminating sequences in unmapped NGS data. It can suggest the presence of contaminating organisms in sequenced samples, that might derive either from laboratory contamination or from their biological source, and in both cases can be considered as worthy of further investigation and experimental validation. The novelty of DecontaMiner is mainly represented by its easy integration with the standard procedures of NGS data analysis, while providing a complete, reliable, and automatic pipeline. Electronic supplementary material The online version of this article (10.1186/s12859-019-2684-x) contains supplementary material, which is available to authorized users.
The sea represents a major source of biodiversity. It exhibits many different ecosystems in a huge variety of environmental conditions where marine organisms have evolved with extensive diversification of structures and functions, making the marine environment a treasure trove of molecules with potential for biotechnological applications and innovation in many different areas. Rapid progress of the omics sciences has revealed novel opportunities to advance the knowledge of biological systems, paving the way for an unprecedented revolution in the field and expanding marine research from model organisms to an increasing number of marine species. Multi-level approaches based on molecular investigations at genomic, metagenomic, transcriptomic, metatranscriptomic, proteomic, and metabolomic levels are essential to discover marine resources and further explore key molecular processes involved in their production and action. As a consequence, omics approaches, accompanied by the associated bioinformatic resources and computational tools for molecular analyses and modeling, are boosting the rapid advancement of biotechnologies. In this review, we provide an overview of the most relevant bioinformatic resources and major approaches, highlighting perspectives and bottlenecks for an appropriate exploitation of these opportunities for biotechnology applications from marine resources.
Arabidopsis thaliana became the model organism for plant studies because of its small diploid genome, rapid lifecycle and short adult size. Its genome was the first among plants to be sequenced, becoming the reference in plant genomics. However, the Arabidopsis genome is characterized by an inherently complex organization, since it has undergone ancient whole genome duplications, followed by gene reduction, diploidization events and extended rearrangements, which relocated and split up the retained portions. These events, together with probable chromosome reductions, dramatically increased the genome complexity, limiting its role as a reference. The identification of paralogs and single copy genes within a highly duplicated genome is a prerequisite to understand its organization and evolution and to improve its exploitation in comparative genomics. This is still controversial, even in the widely studied Arabidopsis genome. This is also due to the lack of a reference bioinformatics pipeline that could exhaustively identify paralogs and singleton genes. We describe here a complete computational strategy to detect both duplicated and single copy genes in a genome, discussing all the methodological issues that may strongly affect the results, their quality and their reliability. This approach was used to analyze the organization of Arabidopsis nuclear protein coding genes, and besides classifying computationally defined paralogs into networks and single copy genes into different classes, it unraveled further intriguing aspects concerning the genome annotation and the gene relationships in this reference plant species. Since our results may be useful for comparative genomics and genome functional analyses, we organized a dedicated web interface to make them accessible to the scientific community
Background Obesity is a complex disorder associated with an increased risk of developing several comorbid chronic diseases, including postmenopausal breast cancer. Although many studies have investigated this issue, the link between body weight and either risk or poor outcome of breast cancer is still to characterize. Systems biology approaches, based on the integration of multiscale models and data from a wide variety of sources, are particularly suitable for investigating the underlying molecular mechanisms of complex diseases. In this scenario, GEnome-scale metabolic Models (GEMs) are a valuable tool, since they represent the metabolic structure of cells and provide a functional scaffold for simulating and quantifying metabolic fluxes in living organisms through constraint-based mathematical methods. The integration of omics data into the structural information described by GEMs allows to build more accurate descriptions of metabolic states. Results In this work, we exploited gene expression data of postmenopausal breast cancer obese and lean patients to simulate a curated GEM of the human adipocyte, available in the Human Metabolic Atlas database. To this aim, we used a published algorithm which exploits a data-driven approach to overcome the limitation of defining a single objective function to simulate the model. The flux solutions were used to build condition-specific graphs to visualise and investigate the reaction networks and their properties. In particular, we performed a network topology differential analysis to search for pattern differences and identify the principal reactions associated with significant changes across the two conditions under study. Conclusions Metabolic network models represent an important source to study the metabolic phenotype of an organism in different conditions. Here we demonstrate the importance of exploiting Next Generation Sequencing data to perform condition-specific GEM analyses. In particular, we show that the qualitative and quantitative assessment of metabolic fluxes modulated by gene expression data provides a valuable method for investigating the mechanisms associated with the phenotype under study, and can foster our interpretation of biological phenomena. Electronic supplementary material The online version of this article (10.1186/s12859-019-2685-9) contains supplementary material, which is available to authorized users.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.