Summary
Multiple sequence alignment is a foundational technique in bioinformatics, and is often the first step in DNA and protein sequence analyses. However, it can be a slow step for genomic-scale datasets, a problem that will only get worse as the scale of biological sequence analyses continues to increase. Sequence alignment is also potentially inappropriate when there have been many small- and large-scale rearrangements among the sequences to be aligned, and subsequent analyses may be sensitive to uncertainties in the alignment. In this paper, we propose an alignment-free methodology for sequence comparison, based on n-gram frequency vectors, and demonstrate its ability to detect ontological relationships in biological literature and DNA sequence families (specifically kinases, Alu repeats and promoter sequences of co-expression networks). The methodology works with clustering methods such as classical hierarchical clustering, as well as non-negative matrix factorization. It is also highly efficient in terms of computational time and space requirements, and we foresee it becoming an indispensable tool in genomic sequence analysis.
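As a rough illustration of what an n-gram frequency vector looks like for DNA, the sketch below counts overlapping trigrams over the alphabet ACGT and compares the resulting profiles. The function names (`ngram_profile`, `euclidean`), the use of overlapping windows, the normalization to relative frequencies, the Euclidean distance, and the toy sequences are all assumptions made for this example, not details taken from the paper.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumed alphabet; other symbols (e.g. N) are ignored


def ngram_profile(seq, n=3):
    """Return relative frequencies of all 4**n possible n-grams in seq.

    Overlapping windows are counted (an assumption for this sketch).
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    keys = ["".join(p) for p in product(ALPHABET, repeat=n)]
    total = sum(counts[k] for k in keys)
    # Fixed key order, so profiles of different sequences are comparable.
    return [counts[k] / total if total else 0.0 for k in keys]


def euclidean(u, v):
    """Euclidean distance between two equal-length profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy sequences, for illustration only.
sequences = {
    "a": "ACGTACGTACGT",
    "b": "ACGTACGTACGA",
    "c": "TTTTTTGGGGGG",
}
profiles = {name: ngram_profile(s) for name, s in sequences.items()}

d_ab = euclidean(profiles["a"], profiles["b"])
d_ac = euclidean(profiles["a"], profiles["c"])
print(f"d(a, b) = {d_ab:.3f}, d(a, c) = {d_ac:.3f}")
```

Because every sequence maps to a vector of the same fixed length (64 entries for trigrams), sequences of different lengths can be compared without alignment, and the resulting distance matrix or profile matrix can be passed to a clustering routine of one's choice.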
Introduction
August Schleicher, Ernst Haeckel, and other 19th century
Abstract
In the post-genomic era, drawing inferences from multiple massive data sets is a ubiquitous challenge in the computational life sciences. Multiple sequence alignment has played a key role in genomics (and other "omics") as a means of summarizing and representing relationships between sequences. However, two problems with alignment-based strategies are apparent: the computational expense of constructing alignments and the sensitivity of subsequent analyses to alignment uncertainties.

Here we present a novel alignment-free alternative. We use frequency profiles (or n-gram vectors) for sequence comparison, a method inspired by lexical statistics. Such profiles can be used to infer relationships between texts or between biological sequences, and we demonstrate that two statistical techniques, hierarchical clustering (HC) and non-negative matrix factorization (NMF), provide invaluable insights in both contexts.

We present four case studies. First, we show that bigram frequency profiles can be used to reconstruct the ontology of 102,402 PubMed titles selected for their relevance to nine drugs and nine therapeutic proteins. Second, we apply the same methodology to classify 63 protein kinase coding DNA sequences into functional categories, based on trigram frequency profiles. The two major classes (Tyr vs Ser/Thr) are correctly identified. Third, and similarly, we show that Alu subfamilies can be identified in 58,122 Alu sequences, in perfect agreement with the accepted topology of the Alu phylogeny, again based only on trigram frequency profiles. Fourth, we clustered 8,885 human promoters using trigram frequency profiles for ab initio discovery of co-expression networks associated with disease.

We demonstrate that "lexical" statistics offers a viable alignment-free approach to identifying and representing...