Accelerating text mining workloads in a MapReduce-based distributed GPU environment

Wittek, Peter; Daranyi, Sándor

doi:10.1016/j.jpdc.2012.10.001

Cited by 19 publications (6 citation statements)

References 49 publications

“…It is an abstraction that allows users to easily create parallel applications while hiding the details of data distribution, load balancing, and fault tolerance. At present, it is popular in text mining of various applications, especially natural language processing (NLP) and machine learning [8], [31], [37]. Laclavik et al presented a pattern of annotation tool based on the MapReduce architecture to process large amount of text data [13].…”

Section: Related Work (mentioning)

confidence: 99%

Hadoop Recognition of Biomedical Named Entity Using Conditional Random Fields

Tang et al. 2015

IEEE Trans. Parallel Distrib. Syst.

Processing large volumes of data has presented a challenging issue, particularly in data-redundant systems. As one of the most recognized models, the conditional random fields (CRF) model has been widely applied in biomedical named entity recognition (Bio-NER). Due to the internally sequential feature, performance improvement of the CRF model is nontrivial, which requires new parallelized solutions. By combining and parallelizing the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) and Viterbi algorithms, we propose a parallel CRF algorithm called MRCRF (MapReduce CRF) in this paper, which contains two parallel sub-algorithms to handle two time-consuming steps of the CRF model. The MRLB (MapReduce L-BFGS) algorithm leverages the MapReduce framework to enhance the capability of estimating parameters. Furthermore, the MRVtb (MapReduce Viterbi) algorithm infers the most likely state sequence by extending the Viterbi algorithm with another MapReduce job. Experimental results show that the MRCRF algorithm outperforms other competing methods by exhibiting significant performance improvement in terms of time efficiency as well as preserving a guaranteed level of correctness. Index Terms: biomedical named entity recognition, conditional random fields, MapReduce, parallel algorithm.

“…So MapReduce is able to handle large amount of data processing problem which is difficult to use general servers. Now it is popular in text mining of various applications [18], especially Natural Language Processing (NLP) and Machine Learning (ML), as the MapReduce paradigm has emerged as a highly successful programing model for large-scale data-intensive computing applications [19]. Laclavik et al presented a pattern of annotation tool based on MapReduce architecture to process large amount of text data [20].…”

Section: Related Work (mentioning)

confidence: 99%

CRFs based parallel biomedical named entity recognition algorithm employing MapReduce framework

et al. 2015

With the rapid growth of the biomedical literature, the model training time in biomedical named entity recognition increases sharply when dealing with large-scale training samples. How to increase the efficiency of named entity recognition in biomedical big data has become one of the key problems in biomedical text mining. To improve recognition performance and reduce training time, this paper proposes an optimization method for two-phase recognition using conditional random fields. In the first stage, each named entity boundary is detected to distinguish all real entities. In the second stage, we label the semantic class of the entity detected. To speed up training in these two phases, we implement the model training process on a parallel optimization program framework based on MapReduce. By dividing the training set into several parts, the iterations in the training algorithm are designed as map tasks which can be executed simultaneously in a cluster, where each map function is designed to complete the calculation of a gradient vector component for each part in the training set. Our experiments show that the proposed method can achieve high performance with short training time, which has important implications for current biological big data processing.

“…This flexibility allows Lucene's API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, can all be indexed as long as their textual information can be extracted [14]. Since the search operations of Lucene are performed in the indexed file, the metadata records, which are stored in relational database, should be converted to the indexed file in advance.…”

Section: Metadata Retrieval (mentioning)

confidence: 99%