Distributed Classification of Text Documents on Apache Spark Platform

Semberecki, Piotr; Maciejewski, Henryk

doi:10.1007/978-3-319-39378-0_53

Cited by 19 publications

(12 citation statements)

References 10 publications

“…This work is an extension of the previous research [2], where subject classification was done using standard Machine Learning such as Decision Trees, Naive Bayes classifier etc., with the focus on distributed implementation, in order to manage large volumes of data. The best results in the previous work was obtained using Bag-of-Words model with TF-IDF and Naive Bayes, where recognition of three categories: History, Arts and Law was done with ca 75.28% accuracy on the testing corpus.…”

Section: Results Of Bow Methods For This Dataset In Previous Work

mentioning

confidence: 99%

“…This approach was similar to previous work [2] and is a part of traditional NLP processing chain. We used English Punkt as sentence tokenizer for segmentation task.…”

Section: B Sample Data For Empirical Verification Of the Methods

mentioning

confidence: 90%

“…The dataset used for this work has been introduced in [2]. It is based on English Wikipedia articles.…”

Section: B Sample Data For Empirical Verification Of the Methods

mentioning

confidence: 99%

“…As this corpus was also used in our previous study [2], where subject classification was based on bag-of-words approach, we have a chance to compare performance of these two methods.…”

Section: Introduction

mentioning

confidence: 99%

Deep Learning methods for Subject Text Classification of Articles

Semberecki

Maciejewski

2017

Annals of Computer Science and Information Systems

Self Cite

Abstract: This work presents a method of classification of text documents using a deep neural network with LSTM (long short-term memory) units. We have tested different approaches to build feature vectors, which represent documents to be classified: we used feature vectors constructed as sequences of words included in the documents, or, alternatively, we first converted words into vector representations using the word2vec tool and used sequences of these vector representations as features of documents. We evaluated the feasibility of this approach for the task of subject classification of documents using a collection of Wikipedia articles representing 7 subject categories. Our experiments show that the approach based on an LSTM network with documents represented as sequences of words coded into word2vec vectors outperformed a standard bag-of-words approach with documents represented as frequency-of-words feature vectors.

“…It also aims to store in memory all the assigned training data partitions on particular nodes. Another implementation based on Apache Spark uses similar techniques such as Hadoop MapReduce implementations [10], while several other works leverage the computational power of GPUs (Graphics Processing Units) to improve the performance of the MapReduce implementations. Caragea et al [11] describe a multi-agent approach to building tree-based classifiers.…”

mentioning

confidence: 99%

Improvement in the Efficiency of a Distributed Multi-Label Text Classification Algorithm Using Infrastructure and Task-Related Data

Sarnovský

Olejnik

2019

Informatics

Distributed computing technologies allow a wide variety of tasks that use large amounts of data to be solved. Various paradigms and technologies are already widely used, but many of them are lacking when it comes to the optimization of resource usage. The aim of this paper is to present the optimization methods used to increase the efficiency of distributed implementations of a text-mining model utilizing information about the text-mining task extracted from the data and information about the current state of the distributed environment obtained from a computational node, and to improve the distribution of the task on the distributed infrastructure. Two optimization solutions are developed and implemented, both based on the prediction of the expected task duration on the existing infrastructure. The solutions are experimentally evaluated in a scenario where a distributed tree-based multi-label classifier is built based on two standard text data collections.

Big data and machine learning framework for clouds and its usage for text classification

Pintye

Kail

Kacsuk

et al.

2020

Concurrency and Computation

Reference architectures for big data and machine learning include not only interconnected building blocks but important considerations (among others) for scalability, manageability and usability issues as well. Leveraging on such reference architectures, the automated deployment of distributed toolsets and frameworks on various clouds is still challenging due to the diversity of technologies and protocols. The paper focuses particularly on the widespread Apache Spark cluster with Jupyter as the particularly addressed framework, and the Occopus cloud-agnostic orchestrator tool for automating its deployment and maintenance stages. The presented approach has been demonstrated and validated with a new, promising text classification application on the Hungarian academic research infrastructure, the OpenStack-based MTA Cloud. The paper explains the concept, the applied components, and illustrates their usage with real use-case measurements.

KEYWORDS: big data, cloud, machine learning, parallel and distributed execution, reference architectures, text classification

1 INTRODUCTION

Research in different scientific fields (e.g., natural and social sciences) often requires extremely large computational resources and storage capacity to handle Big Data problems. Traditional sequential data processing algorithms are not sufficient to analyze this large volume of data. For efficient processing and analysis, new approaches, techniques and tools are necessary. Moreover, cloud infrastructures and services are becoming even more popular and are nowadays widely used to address the computation and storage requirements of many scientific and commercial Big Data applications. Their widespread usage is a consequence of the dynamic and scalable nature of the services maintained by cloud providers.

However, there are several challenges that a data scientist has to face when planning the use or deployment of any Big Data platform on cloud(s). The selection of the appropriate cloud provider(s) is always a tiresome process since several factors have to be considered, even when only a generic Infrastructure-as-a-Service (IaaS) provider is required: private (e.g., Agrodat Cloud), federated (e.g., MTA Cloud or pan-European EGI FedCloud), or public cloud (e.g., Amazon AWS).

The Hungarian Academy of Sciences (MTA) provides free IaaS cloud (MTA Cloud) services for research communities and easy-to-use, dynamic infrastructures adapted to the actual project requirements. MTA Cloud was established to accelerate research for the scientists of MTA. Nearly 100 projects have been deployed on MTA Cloud since its opening, and more and more projects require the use of Big Data and machine learning applications.

However, the large number of artificial intelligence (AI) tools available for clouds are very complex, and their proper deployment and configuration
