2019
DOI: 10.1111/rssb.12352

Renewable Estimation and Incremental Inference in Generalized Linear Models with Streaming Data Sets

Abstract: The paper presents an incremental updating algorithm to analyse streaming data sets using generalized linear models. The method proposed is formulated within a new framework of renewable estimation and incremental inference, in which the maximum likelihood estimator is renewed with current data and summary statistics of historical data. Our framework can be implemented within a popular distributed computing environment, known as Apache Spark, to scale up computation. Consisting of two data‐processing l…

citations
Cited by 73 publications
(74 citation statements)
references
References 55 publications
“…Although there have been some recent articles expressing concern about the online updating method for analysis of datastreams (see Schifano et al (2016); Luo and Song (2020) and the references therein), the issues of developing effective methodologies and theories for statistical modeling and inference of massive datastreams still remain. As most of the existing procedures and formulae were mainly developed based on the assumption that the observations come from the same model across time and sources.…”
Section: Model Setup and Our Contribution
Citation type: mentioning
confidence: 99%
“…Ongoing efforts have been made to develop modern distributed computing frameworks such as Hadoop, Spark and Storm (Ghemawat et al, 2003; Dean and Ghemawat, 2004; Bifet et al, 2015; Chintapalli et al, 2016). Aligned with these architectures, various statistical methods and algorithms for online estimation and inference have been proposed, including aggregated estimating equation (Lin and Xi, 2011), stochastic gradient descent (Robbins and Monro, 1951; Sakrison, 1965; Toulis and Airoldi, 2015), cumulative estimating equation (Schifano et al, 2016), as well as renewable estimator (Luo and Song, 2020).…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…An approach with more and smaller blocks sacrifices statistical efficiency for lower computational cost, whereas an approach with fewer and larger blocks retains desirable statistical efficiency at the price of increased computational cost. For the latter approach with large data blocks, using a computationally inexpensive and statistically efficient estimation procedure in each large data block, such as renewable learning Luo and Song (2020), would mitigate the trade-off between computational speed and statistical efficiency.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…The renewable learning proposed by Luo and Song (2020) is an online estimation and inference method that uses the Rho architecture to maintain low computing cost with no loss of statistical efficiency. This streaming approach divides data into smaller data batches and updates parameter estimates sequentially from the first to the last data batch using only summary statistics and without re-accessing individual-level raw data from previous data batches.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
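
The batch-by-batch updating described in the last statement can be illustrated with a minimal sketch. It assumes the simplest generalized linear model, a Gaussian response with identity link, where the running sums of X^T X and X^T y are enough to reproduce the full-data estimate exactly. It is not the algorithm of Luo and Song (2020), which handles general GLMs and incremental inference; the class name `StreamingLeastSquares`, the simulated data, and the batch count are all hypothetical choices made for illustration.

```python
import numpy as np


class StreamingLeastSquares:
    """Hypothetical helper: stores only p x p and length-p summaries of past batches."""

    def __init__(self, n_features):
        self.xtx = np.zeros((n_features, n_features))  # running sum of X^T X
        self.xty = np.zeros(n_features)                # running sum of X^T y
        self.n_seen = 0                                # observations processed so far

    def update(self, X, y):
        """Fold one batch into the summaries and return the refreshed estimate."""
        self.xtx += X.T @ X
        self.xty += X.T @ y
        self.n_seen += X.shape[0]
        # Solve the accumulated normal equations; no earlier raw data is needed.
        return np.linalg.solve(self.xtx, self.xty)


# Simulated stream (assumption: 3000 observations, 3 covariates, 10 batches).
rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])
X_all = rng.normal(size=(3000, 3))
y_all = X_all @ beta_true + rng.normal(scale=0.1, size=3000)

model = StreamingLeastSquares(n_features=3)
for X_b, y_b in zip(np.array_split(X_all, 10), np.array_split(y_all, 10)):
    beta_stream = model.update(X_b, y_b)  # the raw batch can be discarded afterwards

# Reference fit that re-accesses every observation at once.
beta_full, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)
```

In this linear special case `beta_stream` and `beta_full` agree up to floating-point error, which is the "no loss of statistical efficiency" property in its most elementary form. For non-Gaussian GLMs the score is nonlinear in the parameters, so a fixed pair of sums no longer suffices and a more careful update is required.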