Enhanced clustering of biomedical documents using ensemble non-negative matrix factorization

Huang, Xiaodi; Zheng, Xiaokun; Yuan, Wei; Wang, Fei; Zhu, Shanfeng

doi:10.1016/j.ins.2011.01.029

Cited by 41 publications

(20 citation statements)

References 33 publications


“…Thus, document clustering techniques, being an efficient way of navigating and summarizing documents, have been intensively investigated in biomedical research. As a dimension reduction method, non-negative matrix factorization [11] has been widely applied to medical document clustering [12,13]. By imposing nonnegativity constraints in both basis and weight factorization matrices, NMF guarantees to preserve the local structure of the original data.…”

Section: Related Work (mentioning)

confidence: 99%

“…Many extensions of the basic NMF method have also been explored for clustering biomedical documents. For instance, in [13], Multi-view NMF, which can integrate different data sources, was applied for clustering clinical document, based on medication/symptom names, whereas, in [12], ensemble NMF, able to achieve a consensus solution from a set of runs with different initial conditions, was tested on the TREC genomic 2004 track. Finally, also more complex techniques were recently introduced in order to cope with graph representations of medical documents [14].…”

Section: Related Work (mentioning)

confidence: 99%
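The ensemble idea described in the statement above (a consensus solution drawn from several NMF runs with different initial conditions) can be illustrated with a minimal sketch. This is not the procedure from the cited paper: the multiplicative-update NMF, the co-association matrix, the 0.5 threshold, and the names `nmf`, `ensemble_nmf_labels`, and `V_demo` are all assumptions chosen for illustration.

```python
import numpy as np


def nmf(V, k, rng, n_iter=200, eps=1e-9):
    """Basic NMF via multiplicative updates: V (docs x terms) ~ W @ H."""
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, k)) + eps   # random non-negative start
    H = rng.random((k, n_terms)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def ensemble_nmf_labels(V, k, n_runs=20, seed=0):
    """Combine several randomly initialised NMF runs into one clustering.

    Assumption: consensus is taken as the connected components of the
    graph linking documents that share a cluster in more than half of
    the runs. Other consensus functions are equally possible.
    """
    rng = np.random.default_rng(seed)
    n_docs = V.shape[0]
    co = np.zeros((n_docs, n_docs))
    for _ in range(n_runs):
        W, _ = nmf(V, k, rng)
        labels = W.argmax(axis=1)                  # dominant factor per document
        co += labels[:, None] == labels[None, :]   # co-association counts
    co /= n_runs

    consensus = -np.ones(n_docs, dtype=int)
    next_label = 0
    for start in range(n_docs):
        if consensus[start] >= 0:
            continue
        consensus[start] = next_label
        stack = [start]
        while stack:                               # depth-first component search
            i = stack.pop()
            for j in np.flatnonzero((co[i] > 0.5) & (consensus < 0)):
                consensus[j] = next_label
                stack.append(j)
        next_label += 1
    return consensus, co


# Hypothetical toy document-term matrix: two groups with disjoint vocabularies.
V_demo = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [1, 2, 3, 0, 0, 0],
    [0, 0, 0, 3, 1, 2],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 0, 1, 2, 3],
], dtype=float)
demo_labels, demo_co = ensemble_nmf_labels(V_demo, k=2)
```

Thresholding the co-association matrix keeps the sketch dependency-free beyond NumPy; a hierarchical or spectral clustering of the same matrix would be a natural alternative.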


Discovering Potential Clinical Profiles of Multiple Sclerosis from Clinical and Pathological Free Text Data with Constrained Non-negative Matrix Factorization

Acquarelli, Bianchini, Marchiori

2016

Lecture Notes in Computer Science


Abstract. Constrained non-negative matrix factorization (CNMF) is an effective machine learning technique to cluster documents in the presence of class label constraints. In this work, we provide a novel application of this technique in research on neuro-degenerative diseases. Specifically, we consider a dataset of documents from the Netherlands Brain Bank containing free text describing clinical and pathological information about donors affected by Multiple Sclerosis. The goal is to use CNMF for identifying clinical profiles with pathological information as constraints. After pre-processing the documents by means of standard filtering techniques, a feature representation of the documents in terms of bi-grams is constructed. The high dimensional feature space is reduced by applying a trimming procedure. The resulting datasets of clinical and pathological bi-grams are then clustered using non-negative matrix factorization (NMF) and, next, clinical data are clustered using CNMF with constraints induced by the clustering of pathological data. Results indicate the presence of interesting clinical profiles, for instance related to vision or movement problems. In particular, the use of CNMF leads to the identification of a clinical profile related to diabetes mellitus. Pathological characteristics and duration of disease of the identified profiles are analysed. Although highly promising, results of this investigation should be interpreted with care due to the relatively small size of the considered datasets.


“…In general, textual data clustering has been affected by the 'vector space model' [62] where each document is considered a "bag of words" and represented by a weighted vector to facilitate the similarity computation [63].…”

Section: Medline Clustering (mentioning)

confidence: 99%
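The vector space model mentioned in the statement above can be sketched as follows: each document becomes a weighted term vector, and similarity is the cosine between two vectors. The whitespace tokenisation, the smoothed TF-IDF weighting, and the names `tfidf_vectors`, `cosine`, and `demo_docs` are assumptions made for illustration, not details taken from the cited work.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (bag of words)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF (one common variant, assumed here).
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical example documents.
demo_docs = [
    "gene expression in cancer cells",
    "cancer gene mutation analysis",
    "clinical trial of insulin therapy",
]
demo_vecs = tfidf_vectors(demo_docs)
sim_related = cosine(demo_vecs[0], demo_vecs[1])
sim_unrelated = cosine(demo_vecs[0], demo_vecs[2])
```

In this toy example the first two documents share the terms "gene" and "cancer", so their similarity is positive, while the first and third share no terms at all.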

MEDLINE Text Mining: An Enhancement Genetic Algorithm Based Approach for Document Clustering

Karâa, Ashour, Sassi, et al.

2015

Intelligent Systems Reference Library


MEDLINE is the largest biomedical literature database. It is updated daily with 200-4,000 citations. This permanent growth induces the need of a good MEDLINE abstract clustering to accelerate the procedure of research and information retrieval. Several works have been developed in this context, but clustering MEDLINE abstracts are still an area where researchers are trying to propose new approaches to better clustering. Over the last few years, evolutionary algorithms have been widely applied to clustering problems because of their ability to avoid local optimal solutions and converge to a global one. In this article, a new approach is proposed for clustering MEDLINE abstracts based on an extension of an evolutionary algorithm which is the genetic algorithm combined with a Vector Space Model and an agglomerative algorithm.


“…MEDLINE (See Note 2) is the largest biomedical literature database in the world, which contains more than 24 million citations. MeSH terms are used to index almost all MEDLINE citations [1], which is crucial in biomedical text mining and information retrieval [2][3][4][5][6][7][8]. The NLM annotators who are responsible for annotating the MeSHs need to review the full text of a citation, which costs lots of time and money.…”

Section: Introduction (mentioning)

confidence: 99%

MeSHLabeler and DeepMeSH: Recent Progress in Large-Scale MeSH Indexing

Peng, Mamitsuka, Zhu

2018

Methods in Molecular Biology

Self Cite


The US National Library of Medicine (NLM) uses the Medical Subject Headings (MeSH) (see Note 1) to index almost all 24 million citations in MEDLINE, which greatly facilitates the application of biomedical information retrieval and text mining. Large-scale automatic MeSH indexing has two challenging aspects: the MeSH side and citation side. For the MeSH side, each citation is annotated by only 12 (on average) out of all 28,000 MeSH terms. For the citation side, all existing methods, including Medical Text Indexer (MTI) by NLM, deal with text by bag-of-words, which cannot capture semantic and context-dependent information well. To solve these two challenges, we developed the MeSHLabeler and DeepMeSH. By utilizing "learning to rank" (LTR) framework, MeSHLabeler integrates multiple types of information to solve the challenge in the MeSH side, while DeepMeSH integrates deep semantic representation to solve the challenge in the citation side. MeSHLabeler achieved the first place in both BioASQ2 and BioASQ3, and DeepMeSH achieved the first place in both BioASQ4 and BioASQ5 challenges. DeepMeSH is available at http://datamining-iip.fudan.edu.cn/deepmesh.
