Spatial process models for analyzing geostatistical data entail computations that become prohibitive as the number of spatial locations becomes large. This article develops a class of highly scalable nearest-neighbor Gaussian process (NNGP) models to provide fully model-based inference for large geostatistical datasets. We establish that the NNGP is a well-defined spatial process providing legitimate finite-dimensional Gaussian densities with sparse precision matrices. We embed the NNGP as a sparsity-inducing prior within a rich hierarchical modeling framework and outline how computationally efficient Markov chain Monte Carlo (MCMC) algorithms can be executed without storing or decomposing large matrices. The number of floating point operations (flops) per iteration of this algorithm is linear in the number of spatial locations, thereby delivering substantial scalability. We illustrate the computational and inferential benefits of the NNGP over competing methods using simulation studies, and we also analyze forest biomass from a massive U.S. Forest Inventory dataset at a scale that precludes alternative dimension-reducing methods. Supplementary materials for this article are available online.
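The sparsity described in this abstract comes from writing the joint Gaussian density as a product of conditionals, each conditioned on at most m previously ordered nearest neighbors rather than on all preceding locations. The following is a minimal illustrative sketch of that factorization, not the authors' implementation: it assumes an exponential covariance with a nugget and uses the given row order of the coordinates as the ordering; the function names are hypothetical.

```python
import numpy as np

def exponential_cov(d, sigma2=1.0, phi=1.0):
    """Exponential covariance C(d) = sigma2 * exp(-phi * d)."""
    return sigma2 * np.exp(-phi * np.asarray(d))

def nngp_log_density(w, coords, m=10, sigma2=1.0, phi=1.0, tau2=1e-6):
    """NNGP log-density of w at n locations: a product of Gaussian
    conditionals, each on at most m nearest earlier-ordered neighbors.
    Cost is O(n * m^3) flops, i.e., linear in n for fixed m."""
    n = coords.shape[0]
    logdens = 0.0
    for i in range(n):
        if i == 0:
            # First location has an empty neighbor set: marginal density.
            mean, var = 0.0, sigma2 + tau2
        else:
            # Neighbor set: up to m nearest among locations 0..i-1.
            d = np.linalg.norm(coords[:i] - coords[i], axis=1)
            nb = np.argsort(d)[: min(m, i)]
            pts = coords[:i][nb]
            # Covariance among neighbors (with nugget on the diagonal)
            # and cross-covariance with location i.
            C_nn = exponential_cov(
                np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2),
                sigma2, phi) + tau2 * np.eye(len(nb))
            c_in = exponential_cov(d[nb], sigma2, phi)
            # Kriging weights give the conditional mean and variance.
            b = np.linalg.solve(C_nn, c_in)
            mean = b @ w[:i][nb]
            var = sigma2 + tau2 - c_in @ b
        logdens += -0.5 * (np.log(2 * np.pi * var) + (w[i] - mean) ** 2 / var)
    return logdens
```

When m is at least n - 1, every conditional uses all preceding locations and the product recovers the exact full Gaussian process density; smaller m trades a controlled approximation for the linear-in-n cost the abstract highlights.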
The Gaussian process is an indispensable tool for spatial data analysts. The onset of the “big data” era, however, has led to the traditional Gaussian process being computationally infeasible for modern spatial data. As such, various alternatives to the full Gaussian process that are more amenable to handling big spatial data have been proposed. These modern methods often exploit low-rank structures and/or multi-core and multi-threaded computing environments to facilitate computation. This study provides, first, an introductory overview of several methods for analyzing large spatial data. Second, this study describes the results of a predictive competition among the described methods as implemented by different groups with strong expertise in the methodology. Specifically, each research group was provided with two training datasets (one simulated and one observed) along with a set of prediction locations. Each group then wrote their own implementation of their method to produce predictions at the given locations, and each implementation was subsequently run on a common computing environment. The methods were then compared in terms of various predictive diagnostics. Supplementary materials regarding implementation details of the methods and code are available for this article online at 10.1007/s13253-018-00348-w.
Our ability to understand and predict the response of ecosystems to a changing environment depends on quantifying vegetation functional diversity. However, representing this diversity at the global scale is challenging. Typically, in Earth system models, characterization of plant diversity has been limited to grouping related species into plant functional types (PFTs), with all trait variation in a PFT collapsed into a single mean value that is applied globally. Using the largest global plant trait database and state-of-the-art Bayesian modeling, we created fine-grained global maps of plant trait distributions that can be applied to Earth system models. Focusing on a set of plant traits closely coupled to photosynthesis and foliar respiration, namely specific leaf area (SLA) and dry mass-based concentrations of leaf nitrogen (Nm) and phosphorus (Pm), we characterize how traits vary within and among over 50,000 ∼50 × 50-km cells across the entire vegetated land surface. We do this in several ways: without defining the PFT of each grid cell, and using 4 or 14 PFTs; each model's predictions are evaluated against out-of-sample data. This endeavor advances prior trait mapping by generating global maps that preserve variability across scales, using modern Bayesian spatial statistical modeling in combination with a database over three times larger than that in previous analyses. Our maps reveal that the most diverse grid cells possess trait variability close to the range of global PFT means.
Modeling global climate and the carbon cycle with Earth system models (ESMs) requires maps of plant traits that play key roles in leaf- and ecosystem-level metabolic processes (1-4). Multiple traits are critical to both photosynthesis and respiration, foremost leaf nitrogen concentration (Nm) and specific leaf area (SLA) (5-7). More recently, variation in leaf phosphorus concentration (Pm) has also been linked to variation in photosynthesis and foliar respiration (7-12).
Estimating detailed global geographic patterns of these traits and corresponding trait-environment relationships has been hampered by limited measurements (13), but recent improvements in data coverage (14) allow for greater detail in spatial estimates of these key traits. Previous work has extrapolated trait measurements across continental or larger regions through three methodologies: (i) grouping measurements of individuals into larger categories that share a set of properties [a working definition of plant functional types (PFTs)] (4, 15), (ii) exploiting trait-environment relationships (e.g., leaf Nm and mean annual temperature) (1, 16-20), or (iii) restricting the analysis to species whose presence has been widely estimated on the ground (21-24). Each of these methods has limitations; for example, trait-environment relationships do not explain observed trait spatial patterns well (1, 25), while species-based approaches limit the scope of extrapolation to only areas with well-measured species abundance. More critically, the first two global methodologies emp...
We consider alternate formulations of recently proposed hierarchical Nearest Neighbor Gaussian Process (NNGP) models (Datta et al., 2016a) for improved convergence, faster computing time, and more robust and reproducible Bayesian inference. Algorithms are defined that improve CPU memory management and exploit existing high-performance numerical linear algebra libraries. Computational and inferential benefits are assessed for alternate NNGP specifications using simulated datasets and remotely sensed light detection and ranging (LiDAR) data collected over the US Forest Service Tanana Inventory Unit (TIU) in a remote portion of Interior Alaska. The resulting data product is the first statistically robust map of forest canopy for the TIU.
Much theoretical and applied work has been devoted to high-dimensional regression with clean data. However, we often face corrupted data in many applications where missing data and measurement errors cannot be ignored. Loh and Wainwright (2012) proposed a non-convex modification of the Lasso for high-dimensional regression with noisy and missing data. It is generally agreed that the virtues of convexity contribute fundamentally to the success and popularity of the Lasso. In light of this, we propose a new method named CoCoLasso that is convex and can handle a general class of corrupted datasets, including the cases of additive measurement error and random missing data. We establish the estimation error bounds of CoCoLasso and its asymptotic sign-consistent selection property. We further elucidate how standard cross-validation techniques can be misleading in the presence of measurement error and develop a novel corrected cross-validation technique using the basic idea in CoCoLasso. The corrected cross-validation is of independent interest. We demonstrate the superior performance of our method over the non-convex approach through simulation studies.
1. Introduction. High-dimensional regression has wide applications in various fields such as genomics, finance, medical imaging, climate science, and sensor networks. The current inventory of high-dimensional regression methods includes the Lasso [23], SCAD [11], the elastic net [30], the adaptive lasso [29], and the Dantzig selector [7], among others. The articles [12] and [13] provide an overview of these existing methods, while the book by [5] discusses their statistical properties in finer detail. The canonical high-dimensional linear regression model assumes that the number of available predictors (p) is larger than the sample size (n), although the true number of relevant predictors (s) is much less than n. The model is expressed as y = Xβ* + w, where y = (y_1, . . . , y_n)′ is the vector of responses, X = ((x_ij)) is the n × p matrix of covariates, β* is a p × 1 sparse coefficient vector with only s non-zero entries, and w = (w_1, . . . , w_n)′ is the noise vector. Much of the existing theoretical and applied work on high-dimensional ...
MSC 2010 subject classifications: Primary 62J07; secondary 62F12
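The convexity that distinguishes CoCoLasso from the Loh-Wainwright estimator can be illustrated concretely. Under additive measurement error Z = X + A with error covariance τ²I, the bias-corrected surrogate Gram matrix Z′Z/n − τ²I may be indefinite; projecting it onto the positive semidefinite cone restores a convex Lasso problem that any standard solver can handle. The sketch below is an assumption-laden illustration of that idea, not the authors' implementation: the eigenvalue-clipping projection, the simple coordinate-descent solver, and all function names are hypothetical.

```python
import numpy as np

def nearest_psd(S, eps=1e-8):
    """Project a symmetric matrix onto the PSD cone by clipping
    eigenvalues: the convexity-restoring step behind CoCoLasso."""
    vals, vecs = np.linalg.eigh((S + S.T) / 2)
    return (vecs * np.maximum(vals, eps)) @ vecs.T

def lasso_cd(Sigma, rho, lam, n_iter=500):
    """Coordinate descent for min_b 0.5 b'Sigma b - rho'b + lam*||b||_1,
    which is convex whenever Sigma is positive semidefinite."""
    p = Sigma.shape[0]
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding coordinate j, then soft-threshold.
            r = rho[j] - Sigma[j] @ b + Sigma[j, j] * b[j]
            b[j] = np.sign(r) * max(abs(r) - lam, 0.0) / Sigma[j, j]
    return b

def cocolasso(Z, y, tau2, lam):
    """Illustrative corrected Lasso for additive errors Z = X + A with
    known error variance tau2: correct the Gram matrix, project to PSD,
    then solve an ordinary Lasso in (Sigma, rho) form."""
    n, p = Z.shape
    Sigma = nearest_psd(Z.T @ Z / n - tau2 * np.eye(p))
    rho = Z.T @ y / n
    return lasso_cd(Sigma, rho, lam)
```

With an identity Gram matrix the coordinate-descent solver reduces to elementwise soft-thresholding, which makes the solver easy to sanity-check before applying the corrected version to noisy covariates.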