2018
DOI: 10.1093/bib/bby090

Sequence clustering in bioinformatics: an empirical study

Abstract: Sequence clustering is a basic bioinformatics task that is attracting renewed attention with the development of metagenomics and microbiomics. The latest sequencing techniques have decreased costs and as a result, massive amounts of DNA/RNA sequences are being produced. The challenge is to cluster the sequence data using stable, quick and accurate methods. For microbiome sequencing data, 16S ribosomal RNA operational taxonomic units are typically used. However, there is often a gap between algorithm developers…

citations
Cited by 128 publications
(96 citation statements)
references
References 58 publications
“…Additionally, we used 10-fold cross validation method and jackknife test to evaluate the predictive performance ( Wei et al, 2017a ; Zeng et al, 2017a , b ; Liao et al, 2018 ; Zou et al, 2018 ). The two evaluation methods were chosen since existing methods in the literature used them for performance evaluation.…”
Section: Methods (mentioning)
confidence: 99%
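
The two evaluation schemes named in this statement can be sketched as index splits. This is a minimal illustration, not the citing authors' code: it assumes "jackknife test" means leave-one-out validation, as the term is commonly used in this literature, and the function names `k_fold_indices` and `jackknife_indices` are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Samples are shuffled once, dealt into k folds, and each fold
    serves as the test set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def jackknife_indices(n_samples):
    """Yield (train, test) index lists with one held-out sample per round.

    Assumption: the jackknife test is treated as leave-one-out validation.
    """
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]
```

With 10 folds each sample is tested once after training on roughly 90% of the data; the jackknife variant trains one model per sample, which is more expensive but leaves no randomness in the split.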
“…This hyperplane can maximize the margin between the two classes, and support vectors define the hyperplane. SVM has been applied to many tasks in bioinformatics (Wei et al, 2014, 2016, 2018; Ding et al, 2017; He et al, 2018; Zou et al, 2018; Fang et al, 2019; Zeng et al, 2019b,c; Zhang M. et al, 2019; Zhang X. et al, 2019; Zhu et al, 2019).…”
Section: Support Vector Machine (mentioning)
confidence: 99%
“…Second, we removed the proteins that contained unknown residues or less than 50 residues because unknown residues may confuse the prediction model, and sequences of less than 50 residues tend to be peptides rather than complete protein sequences. Next, to reduce the negative influence of data redundancy and homology bias [17], we removed the homologous proteins with >30% similarity. CD-HIT [18] was used to cluster the remaining proteins with a sequence identity cut-off of 0.3, and the representative protein of each cluster was selected.…”
Section: Benchmark Datasets (mentioning)
confidence: 99%
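
The pre-filtering step described in this statement (dropping proteins with unknown residues or fewer than 50 residues) might look like the sketch below. It is an illustration under stated assumptions, not the citing authors' pipeline: "unknown residue" is taken to mean any character outside the 20 standard amino acid letters, and the names `parse_fasta`, `keep_protein`, and `prefilter` are hypothetical. The redundancy-reduction step is left to CD-HIT itself and is not reproduced here.

```python
# Assumption: a residue is "unknown" if it is not one of the 20 standard letters.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def keep_protein(seq, min_length=50):
    """Keep sequences of at least min_length residues with no unknown residues."""
    return len(seq) >= min_length and set(seq) <= STANDARD_RESIDUES


def prefilter(text, min_length=50):
    """Return the (header, sequence) pairs that pass both filters."""
    return [(h, s) for h, s in parse_fasta(text) if keep_protein(s, min_length)]
```

The surviving sequences would then be written back to FASTA and passed to the clustering tool, with one representative kept per cluster as the statement describes.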