2016 IEEE International Conference on Big Data (Big Data)
DOI: 10.1109/bigdata.2016.7841068
Large-scale text processing pipeline with Apache Spark

Abstract: In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills due to computational intensity. As a substitute, scholars have studied single topic areas. We provide an implementation of this analysis workflow as a distributed text processing pipeline with Spark dataframes and Scala…
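The visible portion of the abstract does not spell out how the all-pairs bill comparison is made tractable, so the following is only a minimal Scala sketch of one common way to avoid a quadratic join on Spark DataFrames: hashing bill text into term-frequency vectors and using MinHash LSH to generate candidate pairs. The input path bills.parquet and its id/text columns are hypothetical.

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{RegexTokenizer, HashingTF, MinHashLSH}

object BillPairs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("BillSimilarity").getOrCreate()

    // Hypothetical input: one row per bill, columns "id" and "text".
    val bills = spark.read.parquet("bills.parquet")

    // Split raw text on non-word characters.
    val tokenizer = new RegexTokenizer()
      .setInputCol("text").setOutputCol("words").setPattern("\\W+")

    // Hash tokens into sparse term-frequency vectors. MinHashLSH requires
    // at least one non-zero entry, so empty bills are filtered out.
    val tf = new HashingTF()
      .setInputCol("words").setOutputCol("features").setNumFeatures(1 << 18)

    val featurized = tf.transform(tokenizer.transform(bills))
      .filter("size(words) > 0")

    // MinHash LSH produces candidate pairs without comparing every pair.
    val lsh = new MinHashLSH()
      .setInputCol("features").setOutputCol("hashes").setNumHashTables(5)
    val model = lsh.fit(featurized)

    // Approximate self-join: pairs within Jaccard distance 0.8,
    // keeping each unordered pair once.
    val pairs = model
      .approxSimilarityJoin(featurized, featurized, 0.8, "jaccardDist")
      .filter("datasetA.id < datasetB.id")

    pairs.selectExpr("datasetA.id as idA", "datasetB.id as idB", "jaccardDist").show()
  }
}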

Cited by 14 publications (3 citation statements)
References 16 publications
“…To tackle this, we used Apache Spark for parallel data processing of costly operations, like the calculation of complex aggregations [13]. We also applied NLP techniques [14] for conceptual text parsing.…”
Section: B. Deployment to the Cloud
Confidence: 99%
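As a hedged illustration of the pattern this citing paper describes, a distributed aggregation in Spark might look like the Scala sketch below. The events.parquet input and its columns are hypothetical, and the percentile_approx call stands in for an arbitrary "complex aggregation".

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AggregationExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ParallelAggregation").getOrCreate()

    // Hypothetical event log with columns userId, eventType, durationMs.
    val events = spark.read.parquet("events.parquet")

    // groupBy/agg is planned as a distributed shuffle, so the per-group
    // aggregates are computed in parallel across executors.
    val stats = events
      .groupBy("userId", "eventType")
      .agg(
        count(lit(1)).as("nEvents"),
        avg("durationMs").as("avgDurationMs"),
        expr("percentile_approx(durationMs, 0.95)").as("p95DurationMs")
      )

    stats.show()
  }
}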
“…For instance, if one wants to couple Java and Python clients, there is a bridge named Py4J (https://www.py4j.org/), referenced from both sides through a gateway server application. Several solutions use this approach (Svyatkovsky et al., 2016) for advanced text processing. Moreover, many flexible solutions offer different ways for cross-language communication.…”
Section: Proposed Solution
Confidence: 99%
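To make the gateway-server pattern concrete, here is a minimal sketch of the JVM side of a Py4J bridge written in Scala (Py4J's GatewayServer is a plain Java class, so it works from Scala unchanged). The TextProcessor class and its tokenCount method are hypothetical examples, not part of Py4J itself.

import py4j.GatewayServer

class TextProcessor {
  // Hypothetical method the Python client will invoke over the bridge.
  def tokenCount(text: String): Int = text.split("\\s+").count(_.nonEmpty)
}

object BridgeApp {
  def main(args: Array[String]): Unit = {
    // Expose TextProcessor as the gateway's entry point.
    val server = new GatewayServer(new TextProcessor)
    server.start() // listens on Py4J's default port, 25333
    println("Gateway server started")
  }
}

// On the Python side (shown as comments to keep a single language here):
//   from py4j.java_gateway import JavaGateway
//   gateway = JavaGateway()                        # connects to port 25333
//   print(gateway.entry_point.tokenCount("a b c")) # -> 3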
“…The most basic data structure in Spark is the Resilient Distributed Dataset (RDD), its fundamental unit of operation [15]. An RDD scales across a cluster and supports parallel processing.…”
Section: Introduction to Spark
Confidence: 99%
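A short Scala word-count sketch illustrates the RDD properties this statement refers to: the collection is partitioned, each partition is processed in parallel, and transformations are evaluated lazily until an action runs. The input strings and partition count are illustrative only.

import org.apache.spark.sql.SparkSession

object RddExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("RddBasics").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // An RDD is an immutable, partitioned collection; each partition can be
    // processed in parallel on a different executor core.
    val lines = sc.parallelize(
      Seq("spark makes rdds", "rdds are partitioned"), numSlices = 2)

    // Classic word count: flatMap/map/reduceByKey are lazy transformations;
    // collect() is the action that actually triggers the job.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach { case (w, n) => println(s"$w -> $n") }
    spark.stop()
  }
}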