Source code and binaries are freely available for download at http://lyle.smu.edu/IDA/EMMSA/. The software is implemented in Java and supported on MS Windows.
Genomic sequencing projects are generating vast stores of data that present both opportunities and challenges for data analysis. Investigations of trends in codon usage have proven to be a rich area of study in this field. There are a number of methods for isolating codon usage bias in microbial organisms, each designed to capture a specific aspect of the bias. We posit that each species has evolved under the influence of a unique set of environmental constraints that has shaped the organism's codon usage. Analysis of codon usage data should, therefore, provide insights into the selection processes that influence genomic composition. To this end, we describe the large-scale mining of genome-level data from several codon usage bias isolation techniques to determine whether this information can be used to predict the phylum and class to which each organism belongs. Successful prediction indicates that the forces molding the codon usage of a given phylum or class are indeed distinctive, and that codon usage data would be of use in understanding the evolutionary forces involved. Additionally, it supports using this method to aid in, and validate, existing taxonomic classification techniques.
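As an illustration of what a codon usage bias measure can look like, the sketch below computes relative synonymous codon usage (RSCU) for one family of synonymous codons. RSCU is a commonly used measure chosen here as an assumption; the abstract does not name the specific techniques it mines. The function names `codon_counts` and `rscu` and the toy sequence are hypothetical.

```python
from collections import Counter

# Standard genetic code: the six codons that encode leucine.
LEUCINE_CODONS = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]


def codon_counts(cds):
    """Count in-frame codons of a coding sequence, ignoring a trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def rscu(counts, synonymous):
    """Observed count of each codon divided by the count expected
    if all codons in the synonymous family were used equally."""
    total = sum(counts[c] for c in synonymous)
    if total == 0:
        return {c: 0.0 for c in synonymous}
    expected = total / len(synonymous)
    return {c: counts[c] / expected for c in synonymous}


# Toy sequence (not real data): codons CTG, CTG, CTG, CTA, ATG.
toy_counts = codon_counts("CTGCTGCTGCTAATG")
toy_rscu = rscu(toy_counts, LEUCINE_CODONS)
```

An RSCU value above 1 marks a codon used more often than equal usage would predict. A vector of such values per genome is one plausible kind of feature for a phylum or class predictor.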
Many classes of algorithms that suffer from high complexity when implemented on conventional computers may be reformulated, resulting in greatly reduced complexity when implemented on quantum computers. The dramatic reductions in complexity for certain types of quantum algorithms, coupled with the computationally challenging problems found in bioinformatics, motivate researchers to devise efficient quantum algorithms for sequence (DNA, RNA, protein) analysis. This chapter shows that the important sequence classification problem in bioinformatics is suitable for formulation as a quantum algorithm. It leverages earlier research on sequence classification based on the Extensible Markov Model (EMM) and proposes a quantum computing alternative. The authors use sequence family profiles built with the EMM methodology, which relies on pre-counted word data for each sequence. A new method, termed quantum seeding, is then proposed for generating a key based on high-frequency words. The key is applied in a quantum search based on Grover's algorithm to determine a candidate set of models, resulting in a significantly reduced search space. Given Z as a function of M models of size N, the quantum version of the seeding algorithm has a time complexity in the order of […] as opposed to O(Z) for the standard classical version, for large values of Z.
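The search step can be made concrete with a small classical simulation of textbook Grover search. This is a sketch under stated assumptions, not the chapter's quantum seeding method: the function `grover_search`, the item count, and the marked index are all hypothetical. It shows only the general behaviour of Grover's algorithm, which needs roughly (π/4)·√(N/k) oracle calls to find one of k marked items among N, and it does not reproduce the chapter's own complexity expression.

```python
import math


def grover_search(n_items, marked):
    """Classically simulate Grover's algorithm over n_items basis states.

    `marked` is the set of indices the oracle recognises. Returns the number
    of Grover iterations performed and the final measurement probabilities.
    """
    # Start in the uniform superposition (amplitudes stay real throughout).
    amp = [1.0 / math.sqrt(n_items)] * n_items
    iterations = int(math.floor(math.pi / 4 * math.sqrt(n_items / len(marked))))
    for _ in range(iterations):
        # Oracle: flip the sign of every marked amplitude.
        for m in marked:
            amp[m] = -amp[m]
        # Diffusion: reflect every amplitude about the mean amplitude.
        mean = sum(amp) / n_items
        amp = [2 * mean - a for a in amp]
    return iterations, [a * a for a in amp]


# Hypothetical example: 64 candidate models, one of which matches the key.
steps, probs = grover_search(64, {37})
```

With 64 items and a single marked index the simulation runs 6 iterations, after which nearly all of the measurement probability sits on the marked item. An exhaustive classical scan would inspect up to 64 items.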
Phylogenetic tree analysis using molecular sequences continues to expand beyond the 16S rRNA marker. By addressing the multi-copy issue known as intra-heterogeneity, this paper restores the focus on using the 16S rRNA marker. Through use of a novel learning and model-building algorithm, the multiple gene copies are integrated into a compact complex signature using the Extensible Markov Model (EMM). The method clusters related sequence segments while preserving their inherent order to create an EMM signature for a microbial organism. A library of EMM signatures is generated, from which samples are drawn for phylogenetic analysis. By matching the components of two signatures, referred to as quasi-alignment, the differences are highlighted and scored. Quasi-alignments are scored using adapted Karlin-Altschul statistics to compute a novel distance metric. The metric satisfies the conditions of identity, symmetry, triangle inequality and the four-point rule required for a valid evolutionary distance metric. The resulting distance matrix is input to the PHYLogeny Inference Package (PHYLIP) to generate phylogenies using the neighbor-joining algorithm. Through control of clustering in signature creation, the diversity of similar organisms and their placement in the phylogeny is explained. The experiments include analysis of the genus Burkholderia, a random microbial sample spanning several phyla, and a diverse sample that includes RNA of eukaryotic origin. The NCBI sequence data for 16S rRNA is used for validation.
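The four conditions named for the distance metric can be checked mechanically on any distance matrix. The sketch below is a generic checker, not the paper's scoring method: the function `check_metric` and both example matrices are hypothetical. It treats identity as a zero diagonal, and it tests the four-point rule in its usual form, where for every four taxa the two largest of the three pairwise sums must be equal.

```python
from itertools import combinations, permutations


def check_metric(d, tol=1e-9):
    """Report which distance-metric conditions hold for a square matrix d."""
    idx = range(len(d))
    identity = all(abs(d[i][i]) <= tol for i in idx)
    symmetry = all(abs(d[i][j] - d[j][i]) <= tol
                   for i, j in combinations(idx, 2))
    triangle = all(d[i][k] <= d[i][j] + d[j][k] + tol
                   for i, j, k in permutations(idx, 3))
    four_point = True
    for i, j, k, l in combinations(idx, 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        if sums[2] - sums[1] > tol:  # the two largest sums must be equal
            four_point = False
            break
    return {"identity": identity, "symmetry": symmetry,
            "triangle": triangle, "four_point": four_point}


# Distances read off a toy tree: leaves A and B (branches 1, 2) and
# leaves C and D (branches 3, 4), joined by an internal edge of length 5.
tree_like = [[0, 3, 9, 10],
             [3, 0, 10, 11],
             [9, 10, 0, 7],
             [10, 11, 7, 0]]

# Shortest-path distances on a 4-cycle: a metric, but not tree-like.
cycle_like = [[0, 1, 2, 1],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [1, 2, 1, 0]]

tree_report = check_metric(tree_like)
cycle_report = check_metric(cycle_like)
```

The tree-derived matrix passes all four checks. The 4-cycle matrix passes identity, symmetry and the triangle inequality but fails the four-point rule, which is the condition that separates tree-like distances from general metrics.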