The main way of analyzing biological sequences is by comparing and aligning them to each other. It remains difficult, however, to compare modern multi-billion-base DNA data sets. The difficulty is caused by the nonuniform (oligo)nucleotide composition of these sequences, rather than their size per se. To solve this problem, we modified the standard seed-and-extend approach (e.g., BLAST) to use adaptive seeds. Adaptive seeds are matches that are chosen based on their rareness, instead of using fixed-length matches. This method guarantees that the number of matches, and thus the running time, increases linearly, instead of quadratically, with sequence length. LAST, our open source implementation of adaptive seeds, enables fast and sensitive comparison of large sequences with arbitrarily nonuniform composition.

[Supplemental material is available for this article. LAST software is freely available at http://last.cbrc.jp.]

Biomedical research is being revolutionized by multi-gigabase DNA data sets. This began with the sequencing of whole large genomes, such as the human (~3 billion bases), allowing us to see our species' genetic blueprint. More recently, new sequencing technologies have enabled small-scale laboratories to produce gigabases of DNA sequence. These technologies have been used to explore DNA from environmental samples, transcribed RNA in tissues and cell lines, chromatin structure, and personal genomes, to name just a few applications (Metzker 2010).

In all cases, the data largely remain an uninterpretable sea of As, Cs, Gs, and Ts, unless we make connections by comparing the sequences to each other. For example, we can predict the taxonomy and function of environmental DNA reads by comparing them to all known protein sequences (via the genetic code). We can interpret DNA reads from an extinct organism (e.g., the saber tooth tiger) by mapping them to the genome of a surviving organism (e.g., the cat).
In all cases, the initial task is to find similar regions between huge sequence data sets.

The classic tool for this task is BLAST (and similar methods such as PatternHunter, BLAT, BLASTZ, YASS, and many others) (Altschul et al. 1997; Kent 2002; Ma et al. 2002; Schwartz et al. 2003; Kucherov et al. 2006). These methods rely on a seed-and-extend heuristic. They rapidly find similarities between the "query" sequence and the "target" sequence by using short matches called seeds. These seeds act as starting points for the subsequent time-consuming alignment extensions. The simplest kind of seed consists of exact matches of a fixed length (e.g., 12 bases). Short seed lengths can improve sensitivity, but at a high cost in running time, because they yield more seed matches and thus more extensions. On the other hand, long seeds are matched rarely and lead to decreased sensitivity.

In this work, we propose adaptive seeds as an alternative to fixed-length seeds. As implied by the name, fixed-length seeds have a constant length l. In contrast, adaptive seeds vary in length: seeds are lengthened until the number of matches in...
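
The idea of lengthening a seed until it becomes rare can be illustrated with a small sketch. This is not the LAST implementation. It assumes that "rare" means the seed occurs at most max_hits times in the target, and the names find_all, adaptive_seeds, and max_hits are hypothetical, chosen only for this example. The naive string search is suitable only for toy inputs.

```python
def find_all(target, pattern):
    """Return the start positions of all (possibly overlapping) occurrences."""
    hits = []
    start = target.find(pattern)
    while start != -1:
        hits.append(start)
        start = target.find(pattern, start + 1)
    return hits


def adaptive_seeds(query, target, max_hits=2):
    """For each query start position, lengthen an exact match until it occurs
    at most max_hits times in the target (assumed stopping rule).

    Returns a list of (query_start, seed_length, target_positions).
    """
    seeds = []
    for q in range(len(query)):
        for length in range(1, len(query) - q + 1):
            hits = find_all(target, query[q:q + length])
            if not hits:
                break  # a longer seed cannot match either
            if len(hits) <= max_hits:
                seeds.append((q, length, hits))
                break  # rare enough: stop lengthening
    return seeds


if __name__ == "__main__":
    for seed in adaptive_seeds("CGTT", "ACGTACGTTT", max_hits=1):
        print(seed)
```

Under this assumed rule, each query position contributes at most max_hits target hits, so the total number of seed hits is bounded by max_hits times the query length, however repetitive the target is. A fixed-length seed has no such bound: a short seed that falls in a repeated region can match many times.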
Supporting information and the CentroidFold software are available online at: http://www.ncrna.org/software/centroidfold/.
The CentroidFold web server (http://www.ncrna.org/centroidfold/) is a web application for RNA secondary structure prediction powered by one of the most accurate prediction engines. The server accepts two kinds of sequence data: a single RNA sequence and a multiple alignment of RNA sequences. It responds with a prediction result shown in a popular base-pair notation and as a graph representation. A PDF version of the graph representation is also available. For a multiple alignment, the server predicts a common secondary structure. Usage of the server is simple: paste a single RNA sequence (FASTA or plain sequence text) or a multiple alignment (CLUSTAL-W format) into the text area, then click the 'execute CentroidFold' button. The server quickly responds with a prediction result. The major advantage of this server is that it employs our original CentroidFold software as its prediction engine, which scores the best accuracy in our benchmark results. Our web server is freely available with no login requirement.