RStorm: Developing and Testing Streaming Algorithms in R

Kaptein, Maurits

doi:10.32614/rj-2014-012

Cited by 3 publications

(2 citation statements)

References 12 publications

“…RStorm (Kaptein 2013) provides an environment to prototype bolts in R. Spouts are represented as data frames. Bolts developed in RStorm can currently not directly be used in Storm, but this is planned for the future (Kaptein 2014).…”

Section: Distributed Computing Framework (mentioning)

confidence: 99%

Introduction to stream: An Extensible Framework for Data Stream Clustering Research with R

Hahsler, Bolanos, Forrest

2017

J. Stat. Soft.

In recent years, data streams have become an increasingly important area of research for the computer science, database and statistics communities. Data streams are ordered and potentially unbounded sequences of data points created by a typically non-stationary data generating process. Common data mining tasks associated with data streams include clustering, classification and frequent pattern mining. New algorithms for these types of data are proposed regularly and it is important to evaluate them thoroughly under standardized conditions. In this paper we introduce stream, a research tool that includes modeling and simulating data streams as well as an extensible framework for implementing, interfacing and experimenting with algorithms for various data stream mining tasks. The main advantage of stream is that it seamlessly integrates with the large existing infrastructure provided by R. In addition to data handling, plotting and easy scripting capabilities, R also provides many existing algorithms and enables users to interface code written in many programming languages popular among data mining researchers (e.g., C/C++, Java and Python). In this paper we describe the architecture of stream and focus on its use for data stream clustering research. stream was implemented with extensibility in mind and will be extended in the future to cover additional data stream mining tasks like classification and frequent pattern mining.

“…Practically, it has to be noted that at this moment not many off-the-shelf statistical packages are available to actually analyze data streams. The currently available software, for instance (and not exhaustive) Apache Storm (Toshniwal et al., 2014), Apache Spark (Karau, Konwinski, Wendell, & Zaharia, 2015), RStorm (Kaptein, 2014), S4 (Neumeyer, Robbins, Nair, & Kesari, 2010), RapidMiner (Hofmann & Klinkenberg, 2013), KNIME (Berthold et al., 2009), and MOA (Bifet, Holmes, Kirkby, & Pfahringer, 2010), often require extensive programming knowledge and focus mainly on the infrastructure of analyzing large datasets. There is still a large gap between the methods and software developed by computer scientists, and those that can be used by social scientists to analyze their data streams using models that they are accustomed to.…”

Section: Considerations Analyzing Big Data and Data Streams (mentioning)

confidence: 99%

Dealing With Data Streams

2016

Self Cite

Abstract. Novel technological advances allow distributed and automatic measurement of human behavior. While these technologies provide exciting new research opportunities, they also provide challenges: datasets collected using new technologies grow increasingly large, and in many applications the collected data are continuously augmented. These data streams make the standard computation of well-known estimators inefficient as the computation has to be repeated each time a new data point enters. In this tutorial paper, we detail online learning, an analysis method that facilitates the efficient analysis of Big Data and continuous data streams. We illustrate how common analysis methods can be adapted for use with Big Data using an online, or “row-by-row,” processing approach. We present several simple (and exact) examples of the online estimation and discuss Stochastic Gradient Descent as a general (approximate) approach to estimate more complex models. We end this article with a discussion of the methodological challenges that remain.
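
The "row-by-row" idea in this abstract can be illustrated with a minimal sketch. It is not code from the cited paper: it assumes the running mean and sample variance are a representative exact online estimator, and it uses Welford's update, a standard technique chosen here for illustration. The names `OnlineMoments` and `consume` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OnlineMoments:
    """Running summary of a stream: count, mean, and sum of squared deviations."""

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Welford's update: fold one new data point into the summary
        # without revisiting any earlier point.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Unbiased sample variance; undefined for fewer than two points.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def consume(stream):
    """Process a stream one row at a time and return the final summary."""
    state = OnlineMoments()
    for x in stream:
        state.update(x)
    return state


state = consume([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(state.n, state.mean, state.variance)
```

Each update takes constant time and constant memory, so the estimate can be refreshed whenever a new data point arrives instead of being recomputed over the full dataset.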