2006
DOI: 10.1093/bioinformatics/btl158

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

Abstract: In 2001 and 2002, we published two papers (Bioinformatics, 17, 282-283, Bioinformatics, 18, 77-82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to only protein sequences clustering, here we present several new programs using the same algorithm including cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein …

Citations: cited by 9,356 publications (7,220 citation statements)
References: 4 publications
“…The Swiss‐Prot and Pfam databases were used to identify the ORFs, and 107,242 transcripts (56.3% of total transcripts) were used to successfully predict ORFs with an average size of 196 amino acids and a N50 size of 314 amino acids. As the complement to TransDecoder, GeneMarkS‐T program (Tang et al., 2015) was used to predict ORFs, and 63,831 ORFs were obtained, after removing redundant ORFs determined by Cd‐hit package (Li & Godzik, 2006), 2,240 ORFs were added to TransDecoder prediction set, and finally 109,482 transcripts were successfully predicted ORFs. We then used several complementary routes to annotate the transcript sequences.…”
Section: Results | Citation type: mentioning
confidence: 99%
“…We set out to merge contigs derived from the same population into clusters representing population genomes. To this end, contig sequences were first clustered at 95% global average nucleotide identity (ANI) with cd-hit-est [58] (options -c 0.95 -G 1 -n 10 -mask NX, Supplementary Fig. 7B), resulting in 10,578,271 non-redundant genome fragments.…”
Section: Genome Binning and Re-assembly | Citation type: mentioning
confidence: 99%
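
The options quoted in the statement above can be assembled into a command line roughly as follows. This is a minimal sketch, not the citing authors' pipeline: the file names `contigs.fasta` and `contigs_nr95` and the helper `build_cd_hit_est_command` are hypothetical, and `-i`/`-o` (input and output) are added here because the quoted option string does not show them. The command is only built and printed, not executed.

```python
import shlex


def build_cd_hit_est_command(input_fasta, output_prefix,
                             identity=0.95, word_size=10, mask="NX"):
    """Build an argv list for cd-hit-est using the quoted options.

    -c     sequence identity threshold (0.95 in the quoted statement)
    -G 1   use global sequence identity
    -n     word length (10 in the quoted statement)
    -mask  letters to mask (NX in the quoted statement)
    The input/output paths are placeholders chosen for this sketch.
    """
    return [
        "cd-hit-est",
        "-i", input_fasta,
        "-o", output_prefix,
        "-c", str(identity),
        "-G", "1",
        "-n", str(word_size),
        "-mask", mask,
    ]


# Hypothetical file names, for illustration only.
cmd = build_cd_hit_est_command("contigs.fasta", "contigs_nr95")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each option and its value separate, which avoids shell-quoting problems if the list is later passed to `subprocess.run`.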
“…These analyses, in addition, were performed on a non-redundant data set at 90% sequence identity cutoff. The clustering was done using a CD-hit program [13]. The highest resolution structure which contains a specific metal was chosen as the representative of each cluster for atom and amino acid profile analysis.…”
Section: Data Set Under Investigation | Citation type: mentioning
confidence: 99%
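
The representative-selection step described in the statement above (keep the highest-resolution structure from each cluster) might look like the sketch below. It assumes the usual cd-hit `.clstr` text layout, with a `>Cluster N` header followed by member lines containing `>ID...`. The structure IDs and resolution values are made up for illustration, and the function names are hypothetical rather than taken from the cited study.

```python
import re
from collections import defaultdict

# Member lines are assumed to look like: "0\t150aa, >structA... *"
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(text):
    """Map cluster number -> list of member IDs from .clstr-style text."""
    clusters = defaultdict(list)
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
        else:
            match = MEMBER_RE.search(line)
            if match and current is not None:
                clusters[current].append(match.group(1))
    return dict(clusters)


def best_resolution_representatives(clusters, resolution):
    """Pick, per cluster, the member with the smallest resolution value.

    A smaller value in angstroms means a higher-resolution structure.
    """
    return {cid: min(members, key=lambda m: resolution[m])
            for cid, members in clusters.items()}


# Made-up example data, for illustration only.
example = (
    ">Cluster 0\n"
    "0\t150aa, >structA... *\n"
    "1\t148aa, >structB... at 92.00%\n"
    ">Cluster 1\n"
    "0\t200aa, >structC... *\n"
)
resolution = {"structA": 2.1, "structB": 1.5, "structC": 1.8}

reps = best_resolution_representatives(parse_clstr(example), resolution)
print(reps)
```

Note that this replaces cd-hit's own representative (the member marked `*`) with a resolution-based choice, which is what the quoted statement describes.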