2016 IEEE International Conference on Big Data (Big Data)
DOI: 10.1109/bigdata.2016.7841068
Large-scale text processing pipeline with Apache Spark

Abstract: In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills due to computational intensity. As a substitute, scholars have studied single topic areas. We provide an implementation of this analysis workflow as a distributed text processing pipeline with Spark dataframes and Scala…
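The visible portion of the abstract does not spell out how the all-pairs bill comparison is made tractable, so the following is only a minimal Scala sketch of one common way to avoid a quadratic join on Spark DataFrames: hashing bill text into term-frequency vectors and using MinHash LSH to generate candidate pairs. The input path bills.parquet and its id/text columns are hypothetical.

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{RegexTokenizer, HashingTF, MinHashLSH}

object BillPairs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("BillSimilarity").getOrCreate()

    // Hypothetical input: one row per bill, columns "id" and "text".
    val bills = spark.read.parquet("bills.parquet")

    // Split raw text on non-word characters.
    val tokenizer = new RegexTokenizer()
      .setInputCol("text").setOutputCol("words").setPattern("\\W+")

    // Hash tokens into sparse term-frequency vectors. MinHashLSH requires
    // at least one non-zero entry, so empty bills are filtered out.
    val tf = new HashingTF()
      .setInputCol("words").setOutputCol("features").setNumFeatures(1 << 18)

    val featurized = tf.transform(tokenizer.transform(bills))
      .filter("size(words) > 0")

    // MinHash LSH produces candidate pairs without comparing every pair.
    val lsh = new MinHashLSH()
      .setInputCol("features").setOutputCol("hashes").setNumHashTables(5)
    val model = lsh.fit(featurized)

    // Approximate self-join: pairs within Jaccard distance 0.8,
    // keeping each unordered pair once.
    val pairs = model
      .approxSimilarityJoin(featurized, featurized, 0.8, "jaccardDist")
      .filter("datasetA.id < datasetB.id")

    pairs.selectExpr("datasetA.id as idA", "datasetB.id as idB", "jaccardDist").show()
  }
}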

Cited by 14 publications (3 citation statements)
References 16 publications
“…To tackle this, we used Apache Spark for parallel data processing of costly operations, like the calculation of complex aggregations [13]. We also applied NLP techniques [14] for conceptual text parsing.…”
Section: B. Deployment to the Cloud
Confidence: 99%
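As a hedged illustration of the pattern this citing paper describes, a distributed aggregation in Spark might look like the Scala sketch below. The events.parquet input and its columns are hypothetical, and the percentile_approx call stands in for an arbitrary "complex aggregation".

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AggregationExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ParallelAggregation").getOrCreate()

    // Hypothetical event log with columns userId, eventType, durationMs.
    val events = spark.read.parquet("events.parquet")

    // groupBy/agg is planned as a distributed shuffle, so the per-group
    // aggregates are computed in parallel across executors.
    val stats = events
      .groupBy("userId", "eventType")
      .agg(
        count(lit(1)).as("nEvents"),
        avg("durationMs").as("avgDurationMs"),
        expr("percentile_approx(durationMs, 0.95)").as("p95DurationMs")
      )

    stats.show()
  }
}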
“…For instance, if one wants to couple Java and Python clients, there is a bridge named Py4J (https://www.py4j.org/), referenced from both sides through a gateway server application. Several solutions use this approach (Svyatkovsky et al., 2016) for advanced text processing. Moreover, many flexible solutions offer different ways for cross-language communication.…”
Section: Proposed Solution
Confidence: 99%
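To make the gateway-server pattern concrete, here is a minimal sketch of the JVM side of a Py4J bridge written in Scala (Py4J's GatewayServer is a plain Java class, so it works from Scala unchanged). The TextProcessor class and its tokenCount method are hypothetical examples, not part of Py4J itself.

import py4j.GatewayServer

class TextProcessor {
  // Hypothetical method the Python client will invoke over the bridge.
  def tokenCount(text: String): Int = text.split("\\s+").count(_.nonEmpty)
}

object BridgeApp {
  def main(args: Array[String]): Unit = {
    // Expose TextProcessor as the gateway's entry point.
    val server = new GatewayServer(new TextProcessor)
    server.start() // listens on Py4J's default port, 25333
    println("Gateway server started")
  }
}

// On the Python side (shown as comments to keep a single language here):
//   from py4j.java_gateway import JavaGateway
//   gateway = JavaGateway()                        # connects to port 25333
//   print(gateway.entry_point.tokenCount("a b c")) # -> 3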
“…The most basic data structure in Spark is the Resilient Distributed Dataset (RDD), its fundamental unit of operation [15]. An RDD scales across a cluster and supports parallel processing.…”
Section: Introduction to Spark
Confidence: 99%
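A short Scala word-count sketch illustrates the RDD properties this statement refers to: the collection is partitioned, each partition is processed in parallel, and transformations are evaluated lazily until an action runs. The input strings and partition count are illustrative only.

import org.apache.spark.sql.SparkSession

object RddExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("RddBasics").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // An RDD is an immutable, partitioned collection; each partition can be
    // processed in parallel on a different executor core.
    val lines = sc.parallelize(
      Seq("spark makes rdds", "rdds are partitioned"), numSlices = 2)

    // Classic word count: flatMap/map/reduceByKey are lazy transformations;
    // collect() is the action that actually triggers the job.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach { case (w, n) => println(s"$w -> $n") }
    spark.stop()
  }
}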