2011
DOI: 10.1371/journal.pone.0018029

Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

Abstract: Background: We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question…

citations
Cited by 233 publications
(233 citation statements)
references
References 40 publications
“…The development of publication-level classification systems is currently a subject of research. Boyack et al (2011) clustered a corpus of 2.15 million biomedical publications from Medline database (2004-2008) which generated coherent and concentrated cluster solution of text-based similarity approaches based on keywords extracted from titles and abstracts. They found their approach more precise than the Medical Subject Headings.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Various clustering approaches such as suffix tree clustering were supplemented with ontological information in [26], whereas the accuracy of similarity metrics is discussed in [27]. A knowledge domain scheme based on bipartite graphs with MeSH is presented in [28].…”
Section: Previous Work
Citation type: mentioning
confidence: 99%
“…Ahlgren and Colliander (2009) tested the performance of the complete-linkage clustering method for visualizing and classifying a set of 43 documents of the journal 'Information Retrieval' according to several similarity measures based on document text, coupling and a hybrid approach. A combination of graphic presentations and clustering was also adopted by Boyack et al (2011), yet they applied average-link clustering on several similarity matrices based on significant words extracted from the title, abstract and keywords of the Medical Subject Headings (MeSH) of over 2 million scientific articles gathered from the Medline database. More recently, Waltman and Van Eck (2012) employed a new multilevel clustering algorithm on a direct citation network comprising nearly 10 million publications in order to create an automatic classification from clusters detected.…”
Section: Clustering and Information Visualization
Citation type: mentioning
confidence: 99%