There are two approaches to identifying genomic and pathogenesis islands (GI/PAIs) in bacterial genomes: the compositional and the functional, based on DNA or protein level composition and gene function, respectively. We applied n-gram analysis in addition to other compositional features, combined them by union and intersection and defined two measures for evaluating the results-recall and precision. Using the best criteria (by training on the Escherichia coli O157:H7 EDL933 genome), we predicted GIs for 14 Enterobacteriaceae family members and for 21 randomly selected bacterial genomes. These predictions were compared with results obtained from HGT DB (based on the compositional approach) and PAI DB (based on the combined approach). The results obtained show that intersecting n-grams with other compositional features improves relative precision by up to 10% in case of HGT DB and up to 60% in case of PAI DB. In addition, it was demonstrated that the union of all compositional features results in maximum recall (up to 37%). Thus, the application of n-gram analysis alongside existing or newly developed methods may improve the prediction of GI/PAIs.
The paper presents a novel, n-gram-based method for analysis of bacterial genome segments known as genomic islands (GIs). Identification of GIs in bacterial genomes is an important task since many of them represent inserts that may contribute to bacterial evolution and pathogenesis. In order to characterize and distinguish GIs from rest of the genome, binary classification of islands based on n-gram frequency distribution have been performed. It consists of testing the agreement of islands n-gram frequency distributions with the complete genome and backbone sequence. In addition, a statistic based on the maximal order Markov model is used to identify significantly overrepresented and underrepresented n-grams in islands. The results may be used as a basis for Zipf-like analysis suggesting that some of the n-grams are overrepresented in a subset of islands and underrepresented in the backbone, or vice versa, thus complementing the binary classification. The method is applied to strain-specific regions in the Escherichia coli O157:H7 EDL933 genome (O-islands), resulting in two groups of O-islands with different n-gram characteristics. It refines a characterization based on other compositional features such as G+C content and codon usage, and may help in identification of GIs, and also in research and development of adequate drugs targeting virulence genes in them.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.