2023
DOI: 10.1101/2023.07.09.547343
Preprint

Indexing and searching petabyte-scale nucleotide resources

Abstract: Searching vast and rapidly growing sets of nucleotide content in data resources, such as runs in Sequence Read Archive and assemblies for whole genome shotgun sequencing projects in GenBank, is currently impractical in any reasonable amount of time or resources available to most researchers. We present Pebblescout, a tool that navigates such content by providing indexing and search capabilities. Indexing uses dense sampling of the sequences in the resource. Search finds subjects that have short sequence matche…
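The abstract's description of indexing by sampling sequences and searching for short matches can be pictured with a toy example. The sketch below is only an illustration of a generic k-mer inverted index with exact-match lookup; it is not Pebblescout's actual algorithm, data structures, or scoring. The function names, the `step` sampling parameter, the tiny `k`, and the hit-count ranking are all assumptions made for this example.

```python
from collections import defaultdict


def sampled_kmers(seq, k, step):
    """Yield (offset, k-mer) pairs taken every `step` bases along seq."""
    seq = seq.upper()
    for i in range(0, len(seq) - k + 1, step):
        yield i, seq[i:i + k]


def build_index(subjects, k=5, step=1):
    """Map each sampled k-mer to the set of subject IDs that contain it.

    step=1 samples every position; a larger (hypothetical) step would
    shrink the index at the cost of sensitivity.
    """
    index = defaultdict(set)
    for subject_id, seq in subjects.items():
        for _, kmer in sampled_kmers(seq, k, step):
            index[kmer].add(subject_id)
    return index


def search(index, query, k=5):
    """Rank subjects by how many distinct query k-mers they match exactly."""
    hits = defaultdict(int)
    query_kmers = {kmer for _, kmer in sampled_kmers(query, k, 1)}
    for kmer in query_kmers:
        for subject_id in index.get(kmer, ()):
            hits[subject_id] += 1
    # Highest hit count first; ties broken by subject ID for stable output.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Made-up subject sequences, purely for illustration.
subjects = {"run_A": "ACGTACGTTGCA", "run_B": "TTTTGGGGCCCC"}
index = build_index(subjects, k=5, step=1)
result = search(index, "GTACGTTG", k=5)
```

Here the query shares all four of its 5-mers with `run_A` and none with `run_B`, so only `run_A` is returned. A real resource-scale index would need far more compact storage and a more careful scoring scheme than a raw hit count.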

Cited by 6 publications (5 citation statements)
References 35 publications
“…Polinton candidate sequences were further validated by the presence of at least 5 Polinton proteins through ORF prediction on candidate sequences and protein-protein BLAST (BLASTp, bitscore cutoff: 100) of the ORFs against query Polinton proteins. To ensure that the Polintons are from nematode (and not from other contaminating species), we scanned all NCBI-available genomes (WGS; all assemblies for the Whole Genome Shotgun sequencing projects available as of Feb 14, 2022) for closely matching sequences using the PebbleScout resource (https://pebblescout.ncbi.nlm.nih.gov/) (Shiryev and Agarwala 2023). The search across preindexed WGS nucleotide resources (17.33 terabases) detected only sequences from Nematoda genomes, arguing against any significant signal from contamination by unrelated species (this search does not rule out the possibility that other related nematodes have contaminated some assemblies; while unlikely, this situation would not substantially affect any of the conclusions of this work).…”
Section: Methods (mentioning)
confidence: 99%
“…Using Obelisk-ɑ as a starting point, 21 additional full-length examples of Obelisk-ɑ (<4% sequence variation, Supplementary Table 1) were found in 7 datasets using a k-mer search (25) of ∼3.2 million “metagenomic” annotated sequence read archive (SRA) datasets. All 7 datasets were human-derived metatranscriptome (metagenomic RNA) BioProjects (Table 3, see Obelisk homologue detection in additional public data); 0 sequences were found in metagenomic DNA samples.…”
Section: Results (mentioning)
confidence: 99%
“…Close Obelisk-ɑ homologues were identified in the Short Read Archive (SRA) 52 using (“Metagenomic” database, default settings) 25, a recently released tool that efficiently queries ∼3.2 million (mid 2022) raw sequencing data for exact 42 k-mer matches. 9 metatranscriptome BioProjects (comprising 34 short read datasets) were identified ( > 65) with close (∼1% nucleotide divergence) matches to Obelisk-ɑ, of which 3 were part of iHMP or its predecessor 71, 5 were from other human stool studies 72–76, and 1 was from a fox gut autopsy 77.…”
Section: Methods (mentioning)
confidence: 99%
“…The PRA-24 strain was isolated from a salt marsh in Virginia, but details on the original habitat and locality were not reported by Grant et al (2009) and are not available from the metadata provided by ATCC. To illuminate the ecological range of the respective organism, we extracted from its genome assembly the sequence of the highly variable rRNA ITS2 region and explored the occurrence of this “barcode” in raw metagenomic data using the PebbleScout search tool (Shiryev and Agarwala 2023). 461 metagenomic samples were retrieved matching the query with the maximal PBSscore value (i.e.…”
Section: Results (mentioning)
confidence: 99%