2019
DOI: 10.1101/647560
Preprint

Optimal Recovery of Missing Values for Non-negative Matrix Factorization

Abstract: We extend the approximation-theoretic technique of optimal recovery to the setting of imputing missing values in clustered data, specifically for non-negative matrix factorization (NMF), and develop an implementable algorithm. Under certain geometric conditions, we prove tight upper bounds on NMF relative error, which is the first bound of this type for missing values. We also give probabilistic bounds for the same geometric assumptions. Experiments on image data and biological data show that this theoreticall…

Cited by 1 publication
(1 citation statement)
References 38 publications
“…Examining, more broadly, data imputation techniques, initial attempts simply replaced missing data with global statistics [19,20,21], though recent efforts are exploring probabilistic and machine learning methods to learn from the existing observed patterns in the incomplete data. Examples include k-nearest neighbor (KNN) methods [24,25], Support Vector Machine applications (SVN) [26], Matrix completion and factorization [27,28,29] and MissForest [30,31] approaches, Principal component analysis (PCA) [38,39,40], Kriging-based [32,33,34] or Gaussian Process (GP) [4,5] methods. The latter family (Kriging and GP) are particularly attractive for spatio-temporal problems, like the one considered here, though they might face few important challenges: a) to efficiently handle large datasets (many nodes and many time instances) some covariance approximation/simplification will be needed [35,36,37] that might reduce predictive accuracy; b) approach assumes correlation of surge between all nodes in close distance to one another, which might not be the case for all near-shore coastal regions, since complex local geomorphologies (for example existence of barriers or riverine systems) might change the storm inundation characteristics even for nodes in geographic close proximity; c) missing data for storm surge imputation is not randomly distributed in space and time, rather it appears in structured format as will be shown later, with substantial part of nodes in the same geographical domain remaining dry for same time period, providing challenges in the calibration (proper selection of length and temporal correlation scales).…”
Section: Introduction
mentioning
confidence: 99%