Vectorial and alignment-free approaches to biological sequence representation have been explored in bioinformatics to handle big data efficiently. Even so, most current methods involve sequence comparisons via alignment-based heuristics and fail when applied to the analysis of large data sets. Here, we present "Spaced Words Projection (SWeeP)", a method for representing biological sequences using relatively small vectors while preserving intersequence comparability. SWeeP scans the sequences with spaced words and generates indices to create a high-dimensional vector that is later projected onto a smaller, randomly oriented orthonormal basis. We constructed phylogenetic trees for all organisms with mitochondrial and bacterial protein data in the NCBI database. SWeeP quickly built complete and accurate trees for these organisms at low computational cost. We compared SWeeP to other alignment-free methods, and SWeeP was 10 to 100 times faster than the other techniques. A tool to build SWeeP vectors is available at https://sourceforge.net/projects/spacedwordsprojection/.
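The idea described in the abstract above (spaced words mapped to indices of a high-dimensional vector, then projected onto a small random orthonormal basis) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mask strings, the use of occurrence counts, the 20-letter amino acid alphabet ordering, and the QR-based construction of the basis are all assumptions made for illustration.

```python
import numpy as np

# Assumption: standard 20-letter amino acid alphabet, in an arbitrary fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def spaced_word_counts(seq, mask="11011"):
    """Count spaced words in seq.

    Positions where mask is '1' are read; positions where it is '0' are skipped.
    Each spaced word is turned into an index in base 20, so the vector has
    20 ** (number of '1's) entries.
    """
    keep = [i for i, m in enumerate(mask) if m == "1"]
    base = len(AMINO_ACIDS)
    counts = np.zeros(base ** len(keep))
    for start in range(len(seq) - len(mask) + 1):
        index = 0
        for offset in keep:
            aa = AA_INDEX.get(seq[start + offset])
            if aa is None:
                break  # skip windows with a non-standard residue at a read position
            index = index * base + aa
        else:
            counts[index] += 1
    return counts


def random_orthonormal_basis(high_dim, low_dim, seed=0):
    """Return a (high_dim, low_dim) matrix with orthonormal columns.

    Assumption: the basis is obtained by QR decomposition of a Gaussian matrix.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((high_dim, low_dim)))
    return q


def sweep_like_vector(seq, basis, mask="11011"):
    """Project the high-dimensional spaced-word vector onto the small basis."""
    return spaced_word_counts(seq, mask) @ basis


# Small usage example with a short mask so the high-dimensional space stays tiny.
demo_mask = "101"
demo_basis = random_orthonormal_basis(20 ** 2, 16)
vec_a = sweep_like_vector("MKVLAAGIVGLLLAQ", demo_basis, demo_mask)
vec_b = sweep_like_vector("MKVLAAGIVGLVLAQ", demo_basis, demo_mask)
distance = np.linalg.norm(vec_a - vec_b)
```

Because every sequence is projected with the same basis, the resulting short vectors can be compared directly, for example by Euclidean distance as in the last line.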
Deoxynivalenol (DON) is a toxic secondary metabolite produced by fungi that contaminates many crops, mainly wheat, maize, and barley. It affects animal health, causing intestinal barrier impairment and immunostimulatory effects at low doses, and emesis, reduced feed conversion rate, and immunosuppression at high doses. As it is very hard to completely avoid DON production in the field, mitigation methods have been developed. Biodegradation has become a promising method as new microorganisms are studied and new enzymatic routes are described. Understanding the common root of bacteria with DON degradation capability and the relationship with their place of isolation may bring insights into more effective ways to find DON-degrading microorganisms. The purpose of this review is to provide an overview of the occurrence, regulation, metabolism, and toxicology of DON as addressed in recent publications focusing on animal production, as well as to explore the enzymatic routes described for DON degradation by microorganisms and the phylogenetic relationship among them.
Tools for genomic island prediction use strategies for genomic comparison analysis and sequence composition analysis. The goal of comparative analysis is to identify unique regions in the genomes of related organisms, whereas sequence composition analysis evaluates and relates the composition of specific regions with other regions in the genome. The goal of this study was to qualitatively and quantitatively evaluate extant genomic island predictors. We chose tools reported to produce significant results using sequence composition prediction, comparative genomics, and hybrid genomics methods. To maintain diversity, the tools were applied to eight complete genomes of organisms with distinct characteristics and belonging to different families. Escherichia coli CFT073 was used as a control and considered as the gold standard because its islands were previously curated in vitro. The results of predictions with the gold standard were manually curated, and the content and characteristics of each predicted island were analyzed. For other organisms, we created GenBank (GBK) files using Artemis software for each predicted island. We copied only the amino acid sequences from the coding sequence and constructed a multi-FASTA file for each predictor. We used BLASTp to compare all results and generate hits to evaluate similarities and differences among the predictions. Comparison of the results with the gold standard revealed that GIPSy produced the best results, covering ~91% of the composition and regions of the islands, followed by Alien Hunter (81%), IslandViewer (47.8%), Predict Bias (31%), GI Hunter (17%), and Zisland Explorer (16%). The tools with the best results in the analyses of the set of organisms were the same ones that presented better performance in the tests with the gold standard.
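To make the notion of "covering" a gold standard concrete, the following is a simplified sketch of one way a region-coverage percentage could be computed from genome coordinates. It is an assumption for illustration only: the study above compared predictions through manual curation and BLASTp hits, and the function names, the half-open coordinate convention, and the example coordinates here are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def region_coverage(gold_islands, predicted_islands):
    """Fraction of gold-standard island bases that fall inside any prediction."""
    gold = merge_intervals(gold_islands)
    predicted = merge_intervals(predicted_islands)
    total = sum(end - start for start, end in gold)
    covered = 0
    for g_start, g_end in gold:
        for p_start, p_end in predicted:
            # Length of the overlap between the two intervals, or 0 if disjoint.
            covered += max(0, min(g_end, p_end) - max(g_start, p_start))
    return covered / total if total else 0.0


# Hypothetical coordinates, not taken from the study.
gold_example = [(100, 200), (300, 400)]
predicted_example = [(150, 250), (300, 350)]
coverage_example = region_coverage(gold_example, predicted_example)
```

Merging each set first keeps overlapping predictions from being counted twice against the same gold-standard bases.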
Among other attributes, the Betaproteobacterial genus Azoarcus has biotechnological importance for plant growth-promotion and remediation of petroleum waste-polluted water and soils. It comprises at least two phylogenetically distinct groups. The “plant-associated” group includes strains that are isolated from the rhizosphere or root interior of the C4 plant Kallar Grass, but also strains from soil and/or water; all are considered to be obligate aerobes and all are diazotrophic. The other group (now partly incorporated into the new genus Aromatoleum) comprises a diverse range of species and strains that live in water or soil that is contaminated with petroleum and/or aromatic compounds; all are facultative or obligate anaerobes. Some are diazotrophs. A comparative genome analysis of 32 genomes from 30 Azoarcus-Aromatoleum strains was performed in order to delineate generic boundaries more precisely than the single gene, 16S rRNA, that has been commonly used in bacterial taxonomy. The origin of diazotrophy in Azoarcus-Aromatoleum was also investigated by comparing full-length sequences of nif genes, and by physiological measurements of nitrogenase activity using the acetylene reduction assay. Based on average nucleotide identity (ANI) and whole genome analyses, three major groups could be discerned: (i) Azoarcus comprising Az. communis, Az. indigens and Az. olearius, and two unnamed species complexes, (ii) Aromatoleum Group 1 comprising Ar. anaerobium, Ar. aromaticum, Ar. bremense, and Ar. buckelii, and (iii) Aromatoleum Group 2 comprising Ar. diolicum, Ar. evansii, Ar. petrolei, Ar. toluclasticum, Ar. tolulyticum, Ar. toluolicum, and Ar. toluvorans. Single strain lineages such as Azoarcus sp. KH32C, Az. pumilus, and Az. taiwanensis were also revealed. 
Full-length sequences of nif-cluster genes revealed two groups of diazotrophs in Azoarcus-Aromatoleum, with nif being derived from Dechloromonas in Azoarcus sensu stricto (and two Thauera strains) and from Azospira in Aromatoleum Group 2. Diazotrophy was confirmed in several strains, and for the first time in Az. communis LMG5514, Azoarcus sp. TTM-91 and Ar. toluolicum TT. In terms of ecology, with the exception of a few plant-associated strains in Azoarcus (s.s.), most strains/species across the group are found in soil and water (often contaminated with petroleum or related aromatic compounds), sewage sludge, and seawater. The possession of nar, nap, nir, nor, and nos genes by most Azoarcus-Aromatoleum strains suggests that they have the potential to derive energy through anaerobic nitrate respiration, so this ability cannot usefully serve as a phenotypic marker to distinguish genera. However, the possession of bzd genes, indicating the ability to degrade benzoate anaerobically, plus the type of diazotrophy (aerobic vs. anaerobic) could, after confirmation of their functionality, be considered as distinguishing phenotypes in any new generic delineations. The taxonomy of the Azoarcus-Aromatoleum group should be revisited; retaining the generic name Azoarcus for its entirety and creating additional genera are both possible outcomes.
Background: Clustering methods are essential for partitioning biological samples and are useful for reducing the information complexity of large datasets. Tools in this context usually generate data with greedy algorithms, which solve some data mining difficulties but can degrade biologically relevant information during the clustering process. The lack of standardized metrics and consistent databases also raises questions about the clustering efficiency of some methods. Benchmarks are needed to explore the full potential of clustering methods, among which alignment-free methods stand out, and a good choice of dataset is essential for them. Results: Here we present a new approach to data mining in large protein sequence datasets, the Rapid Alignment Free Tool for Sequences Similarity Search to Groups (RAFTS3G), a clustering method that aims to lose less biological information while generating groups. The strategy developed in our algorithm is optimized to be more stringent, which is reflected in increased accuracy and sensitivity in cluster generation across a wide range of similarity. RAFTS3G is the better choice compared with three main methods when the user wants a more reliable result, even without knowing the ideal clustering threshold. Conclusion: In general, RAFTS3G is able to group up to millions of biological sequences in large datasets, which makes it a remarkably efficient option for clustering. Compared with other gold-standard methods for clustering large biological data, RAFTS3G maintains the balance between reducing the redundancy of biological information and creating consistent groups. We apply the binary search concept to grouped sequences, which maintains the sensitivity/accuracy relationship and reduces the time needed to generate data with RAFTS3G.
Electronic supplementary material: The online version of this article (10.1186/s12859-019-2973-4) contains supplementary material, which is available to authorized users.