2021
DOI: 10.7717/peerj.11348

ToRQuEMaDA: tool for retrieving queried Eubacteria, metadata and dereplicating assemblies

Abstract: TQMD is a tool for high-performance computing clusters which downloads, stores and produces lists of dereplicated prokaryotic genomes. It has been developed to counter the ever-growing number of prokaryotic genomes and their uneven taxonomic distribution. It is based on word-based alignment-free methods (k-mers), an iterative single-linkage approach and a divide-and-conquer strategy to remain both efficient and scalable. We studied the performance of TQMD by verifying the influence of its parameters and heuris…

Citations: cited by 6 publications (7 citation statements)
References: 49 publications (63 reference statements)
“…We first reduced the number of genomes based on genomic signatures [27] to regroup similar genomes into genome clusters with a prerelease version of our new software ToRQuEMaDA [28]. Briefly, for five different k-mer sizes (from 2 to 6-nt), we computed the frequency of each word in each genome using the program compseq from the EMBOSS software package [29].…”
Section: Methods (mentioning)
confidence: 99%
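
The word-frequency step described in this statement can be illustrated with a minimal Python sketch. It is not the compseq program from EMBOSS that the citing authors used; the function names `kmer_frequencies` and `genomic_signature` are hypothetical, and the handling of ambiguous bases (words containing anything other than A, C, G, T are ignored) is an assumption made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping DNA word of length k in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all 4**k words over A, C, G, T so absent words get 0.0.
    # Assumption: words containing ambiguous bases (e.g. N) are ignored.
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words)
    return {w: (counts[w] / total if total else 0.0) for w in words}


def genomic_signature(seq, sizes=range(2, 7)):
    """One frequency table per k-mer size (2 to 6 nt, as in the quoted text)."""
    return {k: kmer_frequencies(seq, k) for k in sizes}


signature = genomic_signature("ACGTACGTAC")
print(len(signature[2]), len(signature[6]))
```

Each table has 4^k entries, so the 6-nt table holds 4,096 frequencies per genome.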
“…To choose the best combination of methods and parameters, the available taxonomic information was used to evaluate the quality of the clustering. Briefly, we computed how many different taxa of each rank (phylum, class, order, family, genus, species) were found in each individual cluster or each set of clusters and chose the combination that best separated the higher-level taxa (phylum, class, order, family) while merging the lower-level taxa (genus, species) [28]. This led us to settle on the following set of methods and parameters: 6-nt k-mer, 900 clusters, Pearson distance and ascending hierarchical clustering algorithm.…”
Section: Methods (mentioning)
confidence: 99%
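
Two ingredients of this evaluation can be sketched in Python: a Pearson distance between frequency vectors and a count of distinct taxa per cluster. This is an illustration under assumptions, not the citing authors' code: the Pearson distance is taken here as 1 minus the correlation coefficient (one common definition; the statement does not give the formula), and the helper names and toy data are hypothetical.

```python
import math
from collections import defaultdict


def pearson_distance(x, y):
    """1 - Pearson correlation; assumes equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(var_x * var_y)


def taxa_per_cluster(cluster_of, taxon_of):
    """Number of distinct taxa (at one chosen rank) found in each cluster."""
    taxa = defaultdict(set)
    for genome, cluster in cluster_of.items():
        taxa[cluster].add(taxon_of[genome])
    return {cluster: len(names) for cluster, names in taxa.items()}


# Hypothetical toy data: three genomes, two clusters, phylum-level labels.
clusters = {"g1": 0, "g2": 0, "g3": 1}
phyla = {"g1": "Firmicutes", "g2": "Proteobacteria", "g3": "Firmicutes"}
print(taxa_per_cluster(clusters, phyla))
```

In this toy case cluster 0 mixes two phyla, which is the kind of outcome the described selection procedure would penalise at the higher ranks.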
“…Three local mirrors of NCBI RefSeq were used during this study: 1) an archaeal database composed of the 819 whole genomes that were available on March 7, 2019, 2) a bacterial database of 598 representative genomes selected by the ToRQuEMaDA pipeline (Léonard et al 2021) and 3) a prokaryotic database of 80,490 genomes, already used in (Lupo et al 2022). To assemble the bacterial database, ToRQuEMaDA was run in June 2018, according to a ‘direct’ strategy and using the following parameters: dist-metric set to JI (Jaccard Index), dist-threshold set to 0.86, clustering-mode set to ‘loose’, and pack size set to 200.…”
Section: Methods (mentioning)
confidence: 99%
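
To make the Jaccard Index setting concrete, here is a minimal Python sketch of single-linkage grouping over k-mer sets. It is not ToRQuEMaDA's implementation: it assumes the 0.86 threshold is read as a minimum Jaccard similarity, it does not reproduce the tool's 'loose' clustering mode or its pack-based divide-and-conquer strategy, and all function names are hypothetical.

```python
def kmer_set(seq, k):
    """Set of distinct overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_index(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def single_linkage_clusters(kmer_sets, threshold):
    """Group genomes linked by any chain of pairs with JI >= threshold."""
    names = list(kmer_sets)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard_index(kmer_sets[a], kmer_sets[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Hypothetical toy input: g1 and g2 share all words, g3 shares none.
toy = {"g1": {"AA", "AC", "CA"}, "g2": {"AA", "AC", "CA"}, "g3": {"GG", "GT"}}
print(single_linkage_clusters(toy, 0.86))
```

Dereplication would then keep one representative per group; how that representative is chosen is not described in the quoted statement.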
“…Taxonomic affiliation is based on a MEGAN-like algorithm [37] that infers a last common ancestor (LCA) from the set of reference sequences best matching each contig or transcript of the evaluated dataset. Forty-Two can be used on prokaryotic or eukaryotic datasets, depending on the reference database considered, RiboDB [38, 39] or a set of manually curated eukaryotic alignments [40], respectively. Recently, an inter-domain dataset has been assembled.…”
Section: Overview Of Algorithms (mentioning)
confidence: 99%
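
The LCA idea mentioned in this statement can be shown with a short Python sketch. It is a simplified illustration, not the MEGAN or Forty-Two algorithm: it assumes each best-matching reference is given as a lineage listed from root to leaf, and it returns the longest shared prefix. The function name and the example lineages are hypothetical.

```python
def last_common_ancestor(lineages):
    """Longest shared root-to-leaf prefix across all given lineages."""
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so ranks are compared level by level.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical best hits for one contig.
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
]
print(last_common_ancestor(hits))
```

Here the two hits agree down to the phylum, so the contig would be assigned at that rank rather than to either class.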