Tensor decomposition is a fundamental unsupervised machine learning method in data science, with applications including network analysis and sensor data processing. This work develops a generalized canonical polyadic (GCP) low-rank tensor decomposition that allows other loss functions besides squared error. For instance, we can use logistic loss or Kullback-Leibler divergence, enabling tensor decomposition for binary or count data. We present a variety of statistically motivated loss functions for various scenarios. We provide a generalized framework for computing gradients and handling missing data that enables the use of standard optimization methods for fitting the model. We demonstrate the flexibility of GCP on several real-world examples including interactions in a social network, neural activity in a mouse, and monthly rainfall measurements in India.
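To make the GCP objective concrete, the sketch below is a minimal illustration (not the authors' implementation; the factor names, rank, step size, and random count data are all assumptions) of the Poisson loss f(x, m) = m - x log m evaluated over a rank-2 CP model of a small 3-way count tensor, together with the gradients with respect to each factor matrix, which is all a standard gradient-based optimizer needs.

```python
# Minimal GCP-style sketch: Poisson loss on a CP model (illustrative only).
import numpy as np

def cp_full(A, B, C):
    """Dense I x J x K tensor from rank-R CP factor matrices A, B, C."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def gcp_poisson_loss_grads(X, A, B, C, eps=1e-10):
    """Poisson GCP loss sum(m - x*log(m)) and gradients w.r.t. each factor."""
    M = cp_full(A, B, C) + eps                    # low-rank model, kept positive
    loss = float(np.sum(M - X * np.log(M)))
    Y = 1.0 - X / M                               # elementwise dLoss/dM
    grad_A = np.einsum('ijk,jr,kr->ir', Y, B, C)  # chain rule through the CP model
    grad_B = np.einsum('ijk,ir,kr->jr', Y, A, C)
    grad_C = np.einsum('ijk,ir,jr->kr', Y, A, B)
    return loss, (grad_A, grad_B, grad_C)

# Illustrative use: a few projected gradient steps on a random count tensor.
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(10, 8, 6)).astype(float)
A, B, C = [rng.uniform(0.1, 1.0, size=(n, 2)) for n in X.shape]
for _ in range(500):
    loss, grads = gcp_poisson_loss_grads(X, A, B, C)
    A, B, C = [np.maximum(F - 1e-3 * G, 1e-6) for F, G in zip((A, B, C), grads)]
print("final Poisson loss:", loss)
```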
The ongoing pandemic of SARS-CoV-2, a novel coronavirus, caused over 3 million reported cases of coronavirus disease 2019 (COVID-19) and 200,000 reported deaths between December 2019 and April 2020 [1]. Cases and deaths will increase as the virus continues its global march outward. In the absence of effective pharmaceutical interventions or a vaccine, widespread virological screening is required to inform where restrictive isolation measures should be targeted and when they can be lifted [2-6]. However, limitations on testing capacity have restricted the ability of governments and institutions to identify individual clinical cases, appropriately measure community prevalence, and mitigate transmission. Group testing offers a way to increase efficiency by combining samples and testing a small number of pools [7-9]. Here, we evaluate the effectiveness of group testing designs for individual identification or prevalence estimation of SARS-CoV-2 infection when testing capacity is limited. To do this, we developed mathematical models for epidemic spread, incorporating empirically measured individual-level viral kinetics to simulate changing viral loads in a large population over the course of an epidemic. We used these to construct representative populations and assess pooling strategies for community screening, accounting for variability in viral load samples, dilution effects, changing prevalence, and resource constraints. We confirmed our group testing framework through pooled tests on de-identified human nasopharyngeal specimens with viral loads representative of the larger population. We show that group testing designs can both accurately estimate overall prevalence using a small number of measurements and substantially increase the identification rate of infected individuals in resource-limited settings.

We aimed to evaluate the effectiveness of group testing for overall prevalence estimation and individual case identification. In the classical version of the identification problem [7], samples from multiple individuals are combined and tested as a single pool (Fig. 1a). If the test is negative (which is likely if the prevalence is low and the pool is not too large), then each of the individuals is assumed to have been negative. If the test is positive, it is assumed that at least one individual in the pool was positive; each of the pooled samples is then tested individually. This strategy leverages the low frequency of cases, which would otherwise cause substantial inefficiency, as the majority of pools will test negative when prevalence is low. The simple pooling method can be expanded to combinatorial pooling (each sample represented in multiple pools) for direct sample identification [8,9] (Fig. 1b) and to pooled testing for prevalence estimation [10,11] (Fig. 1c). To deploy group testing in the current pandemic, we need designs that can account for (i) the prevalence of infection within the population, (ii) the position along the epidemic curve, and (iii) within-host viral kinetics.
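As a concrete illustration of the classical identification scheme described above, the following sketch is a simplified simulation with assumed prevalence and pool size; it ignores the dilution and sensitivity effects modeled in the study and is meant only to show why pooling saves tests at low prevalence.

```python
# Simple (Dorfman) pooling: one test per pool, plus individual retests for
# every member of a positive pool. Assumed parameters, illustrative only.
import numpy as np

def dorfman_tests(status, pool_size):
    """Tests needed to classify every sample with simple pooling."""
    tests = 0
    for start in range(0, len(status), pool_size):
        pool = status[start:start + pool_size]
        tests += 1                       # one test for the pooled sample
        if pool.any():                   # positive pool: retest each member
            tests += len(pool)
    return tests

rng = np.random.default_rng(1)
prevalence, n, pool_size = 0.01, 10_000, 10      # assumed values
status = rng.random(n) < prevalence              # True = infected sample
print(dorfman_tests(status, pool_size), "pooled tests vs", n, "individual tests")
```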
Virological testing is central to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) containment, but many settings face severe limitations on testing. Group testing offers a way to increase throughput by testing pools of combined samples; however, most proposed designs have not yet addressed key concerns over sensitivity loss and implementation feasibility. Here, we combined a mathematical model of epidemic spread and empirically derived viral kinetics for SARS-CoV-2 infections to identify pooling designs that are robust to changes in prevalence, and to ratify sensitivity losses against the time course of individual infections. We show that prevalence can be accurately estimated across a broad range, from 0.02% to 20%, using only a few dozen pooled tests and up to 400 times fewer tests than would be needed for individual identification. We then exhaustively evaluated the ability of different pooling designs to maximize the number of detected infections under various resource constraints, finding that simple pooling designs can identify up to 20 times as many true positives as individual testing with a given budget. We illustrate how pooling affects sensitivity and overall detection capacity during an epidemic and on each day post infection, finding that only 3% of false negative tests occurred when individuals were sampled during the first week of infection following peak viral load, and that sensitivity loss is mainly attributable to individuals sampled at the end of infection, when detection has minimal benefit for limiting transmission. Crucially, we confirmed that our theoretical results can be translated into practice using pooled human nasopharyngeal specimens, by accurately estimating a 1% prevalence among 2,304 samples using only 48 tests, and through pooled sample identification in a panel of 960 samples. Our results show that accounting for variation in sampled viral loads provides a nuanced picture of how pooling affects sensitivity to detect infections. Using simple, practical group testing designs can vastly increase surveillance capabilities in resource-limited settings.
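For the prevalence-estimation side, the following is a minimal sketch under simplifying assumptions the study relaxes (perfect test sensitivity, no dilution): with m pools of size b, a pool is positive with probability 1 - (1 - p)^b, which inverts to the estimator p_hat = 1 - (1 - k/m)^(1/b) when k of the m pools test positive. The 48-pool, 2,304-sample layout mirrors the numbers quoted above; the simulated samples themselves are illustrative.

```python
# Prevalence estimation from pooled outcomes alone (idealized sketch).
import numpy as np

def estimate_prevalence(k_positive, n_pools, pool_size):
    """Prevalence estimate from the fraction of positive pools."""
    return 1.0 - (1.0 - k_positive / n_pools) ** (1.0 / pool_size)

rng = np.random.default_rng(2)
true_p, n_pools, pool_size = 0.01, 48, 48        # 48 pools of 48 = 2,304 samples
samples = rng.random((n_pools, pool_size)) < true_p
k = int(samples.any(axis=1).sum())               # pools with at least one positive
print(f"{k} positive pools -> estimated prevalence "
      f"{estimate_prevalence(k, n_pools, pool_size):.2%}")
```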
Cytoplasmic sequestration of the p53 tumor suppressor protein has been proposed as a mechanism involved in abolishing p53 function. However, the mechanisms regulating p53 subcellular localization remain unclear. In this report, we analyzed the possible existence of cis-acting sequences involved in intracellular trafficking of the p53 protein. To study p53 trafficking, the jellyfish green fluorescent protein (GFP) was fused to the wild-type or mutated p53 proteins for fast and sensitive analysis of protein localization in human MCF-7 breast cancer, RKO colon cancer, and SAOS-2 sarcoma cells. The wild-type p53/GFP fusion protein was localized in the cytoplasm, the nucleus, or both compartments in a subset of the cells. Mutagenesis analysis demonstrated that a single amino acid mutation of Lys-305 (mt p53) caused cytoplasmic sequestration of the p53 protein in the MCF-7 and RKO cells, whereas the fusion protein was distributed in both the cytoplasm and the nucleus of SAOS-2 cells. In SAOS-2 cells, the mutant p53 was a less efficient inducer of p21/CIP1/WAF1 expression. Cytoplasmic sequestration of the mt p53 was dependent upon the C-terminal region (residues 326-355) of the protein. These results indicated the involvement of cis-acting sequences in the regulation of p53 subcellular localization. Lys-305 is needed for nuclear import of the p53 protein, and amino acid residues 326-355 can sequester mt p53 in the cytoplasm.
Principal Component Analysis (PCA) is a classical method for reducing the dimensionality of data by projecting them onto a subspace that captures most of their variation. Effective use of PCA in modern applications requires understanding its performance for data that are both high-dimensional and heteroscedastic. This paper analyzes the statistical performance of PCA in this setting, i.e., for high-dimensional data drawn from a low-dimensional subspace and degraded by heteroscedastic noise. We provide simplified expressions for the asymptotic PCA recovery of the underlying subspace, subspace amplitudes and subspace coefficients; the expressions enable both easy and efficient calculation and reasoning about the performance of PCA. We exploit the structure of these expressions to show that, for a fixed average noise variance, the asymptotic recovery of PCA for heteroscedastic data is always worse than that for homoscedastic data (i.e., for noise variances that are equal across samples). Hence, while average noise variance is often a practically convenient measure for the overall quality of data, it gives an overly optimistic estimate of the performance of PCA for heteroscedastic data.
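The qualitative claim can be checked with a small simulation. The sketch below uses assumed dimensions and a two-group noise model (it does not implement the paper's asymptotic expressions): data from a planted low-dimensional subspace are observed under heteroscedastic versus homoscedastic noise with the same average variance, and PCA's subspace recovery is compared.

```python
# Illustrative comparison of PCA subspace recovery under heteroscedastic vs.
# homoscedastic noise with equal average variance (assumed toy parameters).
import numpy as np

def subspace_recovery(Y, U_true, k):
    """Average squared cosine between the true subspace and top-k PCA directions."""
    U_hat = np.linalg.svd(Y, full_matrices=False)[0][:, :k]
    return np.linalg.norm(U_true.T @ U_hat) ** 2 / k    # 1.0 means perfect recovery

rng = np.random.default_rng(3)
d, n, k = 200, 1000, 3
U_true = np.linalg.qr(rng.standard_normal((d, k)))[0]   # planted subspace basis
signal = U_true @ rng.standard_normal((k, n))           # low-dimensional signal

noise_vars = {
    "heteroscedastic": np.where(rng.random(n) < 0.5, 0.1, 1.9),  # average ~1.0
    "homoscedastic": np.full(n, 1.0),                            # same average
}
for name, var in noise_vars.items():
    Y = signal + rng.standard_normal((d, n)) * np.sqrt(var)      # per-sample noise
    print(f"{name}: recovery = {subspace_recovery(Y, U_true, k):.4f}")
```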