2003
DOI: 10.1093/nar/gkg379

Efficient clustering of large EST data sets on parallel computers

Abstract: Clustering expressed sequence tags (ESTs) is a powerful strategy for gene identification, gene expression studies and identifying important genetic variations such as single nucleotide polymorphisms. To enable fast clustering of large-scale EST data, we developed PaCE (for Parallel Clustering of ESTs), a software program for EST clustering on parallel computers. In this paper, we report on the design and development of PaCE and its evaluation using Arabidopsis ESTs. The novel features of our approach include: …

Cited by 76 publications (51 citation statements)
References 15 publications
“…In contrast, some of the largest species-specific EST collections are from plants, including wheat (Triticum aestivum; more than 415,000), barley (Hordeum vulgare; more than 310,000), soybean (Glycine max; more than 305,000), maize (Zea mays; more than 195,000), and Medicago truncatula (more than 180,000; http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html). Kalyanaraman et al (2003) present a novel algorithm and software program (PaCE) to cluster large sets of ESTs into contigs that represent distinct gene fragments and its application to 22 plant species EST sets. Our motivation for the mapping of Arabidopsis ESTs onto the Arabidopsis genome was in part derived from the need for a confirmed standard of proven EST clusters against which to gauge the success of EST clustering programs that do not incorporate genome sequence data.…”
Section: Discussion
mentioning
confidence: 99%
“…Challenges of EST clustering arise from poor average sequence quality, incomplete EST sampling, polymorphisms, alternative transcript isoforms, representation of highly similar transcripts from distinct members of multigene families, and cloning artifacts. Different strategies for EST clustering and the associated gene indexing databases have been reviewed by Bouck et al (1999); for a recent method for EST clustering on parallel computers, see Kalyanaraman et al (2003). For Arabidopsis, up-to-date EST clusters are available in form of the UniGene clusters at NCBI (http://www.ncbi.nlm.nih.gov/UniGene/) and as a The Institute for Genome Research (TIGR) Gene Index (AtGI; http://www.tigr.org/tdb/tgi/agi/; Quackenbush et al, 2001). The current UniGene build (no.…”
mentioning
confidence: 99%
“…Vmatch program [32] was used to identify contaminations and repetitive elements by comparison of the mRNA sequences to vector, bacterial and repeat databases. Cleaned EST sequences were first clustered by the PaCE program [33] and then for each clusters, clustering algorithm (CAP3) [34] is used to perform the assembly. In order to minimize such potential false negatives, the above resulted CAP3 contigs/singlets are self-clustered using the Vmatch program.…”
Section: Sequence Data
mentioning
confidence: 99%
“…These ESTs were then clustered using PaCE (Kalyanaraman et al 2003) under default parameters, and contigs were generated using CAP3 from each resulting cluster as previously described. Polymorphic sites with representation in ≥25% of participating ESTs, which also violated random expectation for sequencing errors (P < 0.01), were selected; 28 primer pairs were designed to flank the 24 previously unreported duplications using Primer3.…”
Section: Methods
mentioning
confidence: 99%
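
The site-selection rule in the last citation statement can be illustrated with a small sketch. It reads the quoted thresholds as "at least 25% of participating ESTs" and "P < 0.01", which is an interpretation of the excerpt rather than something it spells out. The statistical test (a one-sided binomial tail), the per-base error rate of 0.01, and the names `binomial_tail` and `is_candidate_site` are all assumptions made for illustration; the excerpt does not say which test or error model the citing authors used.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_candidate_site(
    minor_count: int,
    depth: int,
    min_fraction: float = 0.25,  # "at least 25% of participating ESTs" (as read from the excerpt)
    error_rate: float = 0.01,    # hypothetical per-base sequencing error rate (assumption)
    alpha: float = 0.01,         # "P < 0.01" (as read from the excerpt)
) -> bool:
    """Keep a site if the variant base is frequent enough and unlikely to be
    explained by sequencing errors alone under a simple binomial model."""
    if depth == 0:
        return False
    # Criterion 1: the variant must appear in enough of the ESTs covering the site.
    if minor_count / depth < min_fraction:
        return False
    # Criterion 2: seeing this many variant calls by error alone must be improbable.
    return binomial_tail(depth, minor_count, error_rate) < alpha


if __name__ == "__main__":
    # Each tuple is (ESTs carrying the variant base, ESTs covering the site).
    for minor, depth in [(3, 8), (1, 8), (1, 4)]:
        print(minor, depth, is_candidate_site(minor, depth))
```

Under these assumed settings, a single variant read among four ESTs meets the 25% threshold but is still rejected, because one error in four reads is not improbable enough under the assumed error rate.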