Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

Boyack, Kevin W.; Newman, David J.; Duhon, Russell J.; Klavans, Richard; Patek, Michael; Biberstine, Joseph R.; Schijvenaars, Bob J. A.; Skupin, André; Ma, Nianli; Börner, Katy

doi:10.1371/journal.pone.0018029

Cited by 233 publications

(233 citation statements)

References 40 publications


“…The development of publication-level classification systems is currently a subject of research. Boyack et al (2011) clustered a corpus of 2.15 million biomedical publications from Medline database (2004-2008) which generated coherent and concentrated cluster solution of text-based similarity approaches based on keywords extracted from titles and abstracts. They found their approach more precise than the Medical Subject Headings.…”

Section: Introduction (mentioning)

confidence: 99%

A delineating procedure to retrieve relevant publication data in research areas: the case of nanocellulose

2016


Advances in publication-level classification systems have demonstrated striking results in dealing properly with emergent, complex and interdisciplinary research areas, such as nanotechnology and nanocellulose. However, less attention has been paid to proposing a delineating method to retrieve relevant research areas on specific subjects. This study aims to propose a procedure to delineate research areas, taking nanocellulose as the case. We investigate how a bibliometric analysis could provide interesting insights into research about this sustainable nanomaterial. The research topics clustered by a Publication-level Classification System were used. The procedure involves an iterative process, which includes developing and cleaning a set of core publications regarding the subject and an analysis of the clusters they are associated with. Nanocellulose was selected as the subject of study, but the methodology may be applied to any other research area or topic. A discussion about each step of the procedure is provided. The proposed delineation procedure enables us to retrieve relevant publications from research areas involving nanocellulose. Seventeen research topics were mapped and associated with current research challenges on nanocellulose.


“…Various clustering approaches such as suffix tree clustering were supplemented with ontological information in [26], whereas the accuracy of similarity metrics is discussed in [27]. A knowledge domain scheme based on bipartite graphs with MeSH is presented in [28].…”

Section: Previous Work (mentioning)

confidence: 99%

Tensor-Based Semantically-Aware Topic Clustering of Biomedical Documents

et al. 2017


Abstract: Biomedicine is a pillar of the collective, scientific effort of human self-discovery, as well as a major source of humanistic data codified primarily in biomedical documents. Despite their rigid structure, maintaining and updating a considerably-sized collection of such documents is a task of overwhelming complexity mandating efficient information retrieval for the purpose of the integration of clustering schemes. The latter should work natively with inherently multidimensional data and higher order interdependencies. Additionally, past experience indicates that clustering should be semantically enhanced. Tensor algebra is the key to extending the current term-document model to more dimensions. In this article, an alternative keyword-term-document strategy, based on scientometric observations that keywords typically possess more expressive power than ordinary text terms, whose algorithmic cornerstones are third order tensors and MeSH ontological functions, is proposed. This strategy has been compared against a baseline using two different biomedical datasets, the TREC (Text REtrieval Conference) genomics benchmark and a large custom set of cognitive science articles from PubMed.


“…Ahlgren and Colliander (2009) tested the performance of the complete-linkage clustering method for visualizing and classifying a set of 43 documents of the journal 'Information Retrieval' according to several similarity measures based on document text, coupling and a hybrid approach. A combination of graphic presentations and clustering was also adopted by Boyack et al (2011), yet they applied average-link clustering on several similarity matrices based on significant words extracted from the title, abstract and keywords of the Medical Subject Headings (MeSH) of over 2 million scientific articles gathered from the Medline database. More recently, Waltman and Van Eck (2012) employed a new multilevel clustering algorithm on a direct citation network comprising nearly 10 million publications in order to create an automatic classification from clusters detected.…”

Section: Clustering and Information Visualization (mentioning)

confidence: 99%

Visualization and analysis of SCImago Journal & Country Rank structure via journal clustering

Gómez-Núñez, Vargas-Quesada, Chinchilla-Rodríguez et al. 2016

AJIM


Purpose: The objective was to visualize the structure of SCImago Journal & Country Rank (SJR) coverage of the extensive citation network of Scopus journals, examining this bibliometric portal through an alternative approach, applying clustering and visualization techniques to a combination of citation-based links. Methodology: Three SJR journal-journal networks containing direct citation, co-citation and bibliographic coupling links are built. The three networks were then combined into a new one by summing up their values, which were later normalized through geo-normalization measure. Finally, the VOS clustering algorithm was executed and the journal clusters obtained were labeled using original SJR category tags and significant words from journal titles. Findings: The resultant scientogram displays the SJR structure through a set of communities equivalent to SJR categories that represent the subject contents of the journals they cover. A higher level of aggregation by areas provides a broad view of the SJR structure, facilitating its analysis and visualization at the same time. Value: This is the first study using Persson's combination of most popular citation-based links (direct citation, co-citation and bibliographic coupling) in order to develop a scientogram based on Scopus journals from SJR. The integration of the three measures along with performance of the VOS community detection algorithm gave a balanced set of clusters. The resulting scientogram is useful for assessing and validating previous classifications as well as for information retrieval and domain analysis.
