As our ability to sense increases, we are experiencing a transition from data-poor problems, in which the central issue is a lack of relevant data, to data-rich problems, in which the central issue is identifying a few relevant features in a sea of observations. Motivated by applications in gravitational-wave astrophysics, we study a problem in which the goal is to predict the presence of transient noise artifacts in a gravitational-wave detector from a rich collection of measurements of the detector and its environment. We argue that feature learning, in which relevant features are optimized from data, is critical to achieving high accuracy. We introduce models that reduce the error rate by over 60% compared with the previous state of the art, which used fixed, hand-crafted features. Feature learning is useful not only because it improves performance on prediction tasks: the learned features also provide valuable information about patterns associated with phenomena of interest that would otherwise be impossible to discover. In our motivating application, features found to be associated with transient noise provide diagnostic information about its origin and suggest mitigation strategies. Learning in such a high-dimensional setting is challenging. Through experiments with a variety of architectures, we identify two key factors in high-performing models: sparsity, for selecting relevant variables within the high-dimensional observations; and depth, which confers flexibility for handling complex interactions and robustness with respect to temporal variations. We illustrate their significance through a systematic series of experiments on real gravitational-wave detector data. Our results provide experimental corroboration of common assumptions in the machine-learning community and have direct applicability to improving our ability to sense gravitational waves, as well as to a wide variety of problem settings with similarly high-dimensional, noisy, or partly irrelevant data.
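The abstract does not describe the architectures used, so the following is only an illustrative sketch of how the two ingredients it highlights can be combined: an L1-penalized first linear layer for selecting a few relevant input channels (sparsity), feeding a small stack of nonlinear layers (depth). All class names, layer sizes, and hyperparameters are hypothetical assumptions, not the authors' design.

```python
# Illustrative sketch only (assumed design, not the paper's model):
# a deep classifier whose first layer carries an L1 penalty so that
# most weights on irrelevant input channels are driven toward zero.
import torch
import torch.nn as nn

class SparseDeepClassifier(nn.Module):
    def __init__(self, n_channels, hidden=128, l1_weight=1e-4):
        super().__init__()
        self.select = nn.Linear(n_channels, hidden)   # sparsity penalty applied to this layer
        self.l1_weight = l1_weight
        self.body = nn.Sequential(                    # depth: stacked nonlinear layers
            nn.ReLU(), nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x):
        return self.body(self.select(x))

    def l1_penalty(self):
        # Encourages first-layer weights to shrink to (near) zero,
        # effectively selecting a small subset of input channels.
        return self.l1_weight * self.select.weight.abs().sum()

# Toy usage with random high-dimensional inputs and binary labels.
model = SparseDeepClassifier(n_channels=1000)
x = torch.randn(32, 1000)
targets = torch.randint(0, 2, (32,)).float()
logits = model(x).squeeze(1)
loss = nn.BCEWithLogitsLoss()(logits, targets) + model.l1_penalty()
```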
Background: Environmental health researchers often aim to identify sources or behaviors that give rise to potentially harmful environmental exposures. Objective: We adapted principal component pursuit (PCP)—a robust and well-established technique for dimensionality reduction in computer vision and signal processing—to identify patterns in environmental mixtures. PCP decomposes the exposure mixture into a low-rank matrix containing consistent patterns of exposure across pollutants and a sparse matrix isolating unique or extreme exposure events. Methods: We adapted PCP to accommodate nonnegative data, missing data, and values below a given limit of detection (LOD). We simulated data to represent environmental mixtures of two sizes with increasing proportions of values below the LOD and three noise structures. We applied the resulting method, principal component pursuit with limit of detection (PCP-LOD), to the simulated data to evaluate its performance in comparison with principal component analysis (PCA). We next applied PCP-LOD to an exposure mixture of 21 persistent organic pollutants (POPs) measured in 1,000 U.S. adults from the 2001–2002 National Health and Nutrition Examination Survey (NHANES). We applied singular value decomposition to the estimated low-rank matrix to characterize the patterns. Results: PCP-LOD recovered the true number of patterns through cross-validation for all simulations; based on an a priori specified criterion, PCA recovered the true number of patterns in 32% of simulations. PCP-LOD achieved lower relative predictive error than PCA for all simulated data sets with up to 50% of the data below the LOD. When 75% of values were below the LOD, PCP-LOD outperformed PCA only when noise was low. In the POP mixture, PCP-LOD identified a rank-three underlying structure and separated 6% of values as extreme events. One pattern represented comprehensive exposure to all POPs. The other patterns grouped chemicals based on known structure and toxicity. Discussion: PCP-LOD serves as a useful tool to express multidimensional exposures as consistent patterns that, if found to be related to adverse health, are amenable to targeted public health messaging. https://doi.org/10.1289/EHP10479
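For readers unfamiliar with PCP, the sketch below implements the classical low-rank-plus-sparse decomposition via ADMM (singular-value thresholding for the nuclear norm, soft-thresholding for the l1 norm), which is what the abstract's "low-rank matrix" and "sparse matrix" refer to. It deliberately omits the paper's nonnegativity, missing-data, and below-LOD extensions; the function names, default parameters, and the synthetic example are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of classical principal component pursuit (robust PCA) via ADMM.
# Solves: min ||L||_* + lam * ||S||_1  subject to  L + S = M.
# NOTE: standard PCP only; does not implement the PCP-LOD adaptations from the paper.
import numpy as np

def soft_threshold(X, tau):
    """Element-wise soft-thresholding (proximal operator of the l1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Singular-value thresholding (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def pcp(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Decompose M (observations x variables) into low-rank L and sparse S."""
    n, m = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(n, m))      # common default
    mu = mu if mu is not None else (n * m) / (4.0 * np.abs(M).sum())  # common heuristic
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    norm_M = np.linalg.norm(M, 'fro')
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y += mu * residual
        if np.linalg.norm(residual, 'fro') <= tol * norm_M:
            break
    return L, S

# Synthetic example loosely mirroring the setting: a rank-3 "pattern" matrix for
# 1,000 participants and 21 pollutants, plus rare extreme-exposure spikes.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 21))
spikes = (rng.random((1000, 21)) < 0.05) * rng.normal(5.0, 1.0, (1000, 21))
L_hat, S_hat = pcp(low_rank + spikes)
```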