Modern benchtop DNA synthesis techniques and growing concern about emerging pathogens have elevated the importance of screening oligonucleotides for pathogens of concern. However, accurate and sensitive characterization of oligonucleotides remains an open challenge for many current techniques and ontology-based tools. To address this gap, we have developed a novel software tool, SeqScreen, that can accurately and sensitively characterize short DNA sequences using a set of curated Functions of Sequences of Concern (FunSoCs), novel functional labels specific to microbial pathogenesis that describe the pathogenic potential of individual proteins. We show that our ensemble machine learning model, after training on these curations, can label sequences with FunSoCs via an imbalanced multi-class and multi-label classification task with high accuracy. In summary, SeqScreen represents a first step towards a novel paradigm of functionally informed pathogen characterization from genomic and metagenomic datasets. SeqScreen is open-source and freely available for download at: https://www.gitlab.com/treangenlab/seqscreen .
Background: Increasingly, researchers use protein-coding genes from targeted PCR amplification or direct metagenomic sequencing in community and population ecology. Analysis of protein-coding genes presents different challenges from those encountered in traditional SSU rRNA studies. Most protein-coding sequences are annotated based on homology to other computationally annotated sequences, which can lead to inaccurate annotations. Therefore, the results of sensitive homology searches must be validated to remove false positives and assess functionality. Multiple lines of in silico evidence can be gathered by examining conserved domains and residues identified through biochemical investigations. However, manually validating sequences in this way can be time-consuming and error-prone, especially in large environmental studies. Results: An automated pipeline for protein active site validation (PASV) was developed to improve validation and partitioning accuracy for protein-coding sequences, combining multiple sequence alignment with expert domain knowledge. PASV was tested using commonly misannotated proteins: ribonucleotide reductase (RNR), alternative oxidase (AOX), and plastid terminal oxidase (PTOX). PASV partitioned 9,906 putative Class I alpha and Class II RNR sequences from bycatch in a global viral metagenomic investigation with >99% true positive and true negative rates. PASV predicted the class of 2,579 RNR sequences in >98% agreement with manual annotations. PASV correctly partitioned all 336 tested AOX and PTOX sequences. Conclusions: PASV provides an automated and accurate way to address post-homology search validation and partitioning of protein-coding marker genes. Source code is released under the MIT license and is found with documentation and usage examples on GitHub at https://github.com/mooreryan/pasv.
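The idea of validating sequences by their residues at known key positions can be illustrated with a minimal Python sketch. This is not PASV's code or interface: the function names, the toy sequences, and the assumption that the query has already been aligned against a reference are all hypothetical, chosen only to show how reference positions map to alignment columns and how a residue "signature" partitions sequences.

```python
def column_for_reference_position(aligned_ref, position):
    """Map a 1-based position in the ungapped reference to a 0-based alignment column."""
    seen = 0
    for col, ch in enumerate(aligned_ref):
        if ch != "-":
            seen += 1
            if seen == position:
                return col
    raise ValueError(f"reference has no position {position}")


def key_residue_signature(aligned_ref, aligned_query, positions):
    """Concatenate the query residues found at the reference's key positions."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    cols = [column_for_reference_position(aligned_ref, p) for p in positions]
    return "".join(aligned_query[c] for c in cols)


def partition(aligned_ref, aligned_queries, positions, expected):
    """Return {name: (signature, passes)} for each aligned query (hypothetical helper)."""
    results = {}
    for name, seq in aligned_queries.items():
        sig = key_residue_signature(aligned_ref, seq, positions)
        results[name] = (sig, sig == expected)
    return results


# Toy data (made up): key residues C and E sit at reference positions 3 and 5.
ALIGNED_REF = "MK-CDEY"
QUERIES = {
    "q_ok": "MKACDEY",       # both key residues conserved
    "q_mutant": "MK-SDQY",   # both key residues substituted
    "q_gapped": "MK--DEY",   # first key residue missing (gap)
}
RESULTS = partition(ALIGNED_REF, QUERIES, positions=[3, 5], expected="CE")

if __name__ == "__main__":
    for name, (sig, ok) in RESULTS.items():
        print(name, sig, "pass" if ok else "fail")
```

Grouping sequences by signature rather than by a single pass/fail flag would give the kind of partitioning the abstract describes, where different residue combinations indicate different classes.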
The throughput of DNA sequencing continues to increase, allowing researchers to analyze genomes of interest at greater depths. An unintended consequence of this data deluge is the increased cost of analyzing these datasets. As a result, genome and metagenome annotation pipelines are left with a few options: (i) search against smaller reference databases, (ii) use faster, but less sensitive, algorithms to assess sequence similarities, or (iii) invest in computing hardware specifically designed to improve BLAST searches such as GPGPU systems and/or large CPU-rich clusters. We present a pipeline that improves the speed of amino acid sequence homology searches with a minimal decrease in sensitivity and specificity by searching against hierarchical clusters. Briefly, the pipeline requires two homology searches: the first search is against a clustered version of the database and the second is against sequences belonging to clusters with a hit from the first search. We tested this method using two assembled viral metagenomes and three databases ([...], Metagenomes Online, and UniRef100). Hierarchical cluster homology searching proved to be 12-times faster than BLASTp and produced alignments that were nearly identical to BLASTp (precision=0.99; recall=0.97). This approach is ideal when searching large collections of sequences against large databases.
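The two-search scheme described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: a shared 3-mer Jaccard score stands in for BLASTp, the clusters are given by hand, and every name, sequence, and threshold below is hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets (a toy stand-in for an alignment score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def search(query, database, threshold):
    """Return {name: score} for database entries scoring at or above threshold."""
    hits = {}
    for name, seq in database.items():
        score = similarity(query, seq)
        if score >= threshold:
            hits[name] = score
    return hits


def two_stage_search(query, sequences, clusters, rep_threshold=0.2, final_threshold=0.5):
    """Search cluster representatives first, then only members of the hit clusters."""
    representatives = {rep: sequences[rep] for rep in clusters}
    rep_hits = search(query, representatives, rep_threshold)
    candidates = {}
    for rep in rep_hits:
        for member in clusters[rep]:
            candidates[member] = sequences[member]
    return search(query, candidates, final_threshold)


# Toy database (made up): two clusters, each keyed by its representative.
SEQUENCES = {
    "a1": "MKTAYIAKQRQISFVK",
    "a2": "MKTAYIAKQRQLSFVK",
    "b1": "GSHMLEDPVDAFWRRT",
    "b2": "GSHMLEDPVDAFWKKT",
}
CLUSTERS = {"a1": ["a1", "a2"], "b1": ["b1", "b2"]}
QUERY = "MKTAYIAKQRQLSFVK"

HITS = two_stage_search(QUERY, SEQUENCES, CLUSTERS)
FULL_HITS = search(QUERY, SEQUENCES, 0.5)

if __name__ == "__main__":
    print(sorted(HITS))
```

In this toy case the second search touches only the two members of the cluster whose representative matched, yet returns the same hits as searching all four sequences. The looser first-stage threshold is a design choice in the sketch: it reduces the chance that a true hit is lost because its cluster representative scores slightly below the final cutoff.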