There has been an explosion in the amount of digital text information available in recent years, leading to challenges of scale for traditional inference algorithms for topic models. Recent advances in stochastic variational inference algorithms for latent Dirichlet allocation (LDA) have made it feasible to learn topic models on very large-scale corpora, but these methods do not currently take full advantage of the collapsed representation of the model. We propose a stochastic algorithm for collapsed variational Bayesian inference for LDA, which is simpler and more efficient than the state-of-the-art method. In experiments on large-scale text corpora, the algorithm was found to converge faster and often to a better solution than previous methods. Human-subject experiments also demonstrated that the method can learn coherent topics in seconds on small corpora, facilitating the use of topic models in interactive document analysis software.
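As a rough illustration of the collapsed representation mentioned in this abstract, the sketch below computes the per-token topic responsibilities used in CVB0-style collapsed variational inference for LDA. It is a minimal sketch, not the authors' implementation: the function name, argument names, and hyperparameter values are assumptions made for illustration, and it assumes the expected count statistics are already available (a stochastic variant would maintain them as running estimates from minibatches rather than full passes over the corpus).

```python
def cvb0_responsibilities(n_word_topic, n_topic, n_doc_topic,
                          alpha, eta, vocab_size):
    """Return the variational topic distribution for one token.

    n_word_topic[k]: expected count of this token's word in topic k
    n_topic[k]:      expected total count of all words in topic k
    n_doc_topic[k]:  expected count of topic k in this token's document
    alpha, eta:      Dirichlet hyperparameters (hypothetical values below)
    vocab_size:      number of distinct words in the vocabulary
    """
    weights = [
        (n_wk + eta) / (n_k + vocab_size * eta) * (n_jk + alpha)
        for n_wk, n_k, n_jk in zip(n_word_topic, n_topic, n_doc_topic)
    ]
    total = sum(weights)
    # Normalize so the responsibilities sum to one across topics.
    return [w / total for w in weights]


# Two topics; the word is three times as common in topic 0 as in topic 1.
gamma = cvb0_responsibilities(
    n_word_topic=[3.0, 1.0],
    n_topic=[10.0, 10.0],
    n_doc_topic=[2.0, 2.0],
    alpha=0.1,
    eta=0.01,
    vocab_size=100,
)
print(gamma)
```

Because the topic totals and document counts are equal across the two topics in this toy input, the responsibilities differ only through the word-topic counts, so topic 0 receives the larger share.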
Contextual bandits are a common problem faced by machine learning practitioners in domains ranging from hypothesis testing to product recommendations. Many approaches have tried to exploit rich data representations for contextual bandit problems, with varying degrees of success. Self-supervised learning is a promising approach to finding rich data representations without explicit labels. In a typical self-supervised learning scheme, the primary task is defined by the problem objective (e.g., clustering, classification, embedding generation) and the secondary task is defined by the self-supervision objective (e.g., rotation prediction, words in neighborhood, colorization). In the usual self-supervision setting, we learn implicit labels from the training data for a secondary task. However, in the contextual bandit setting, implicit labels are not available because data is scarce in the initial phase of learning. We provide a novel approach to tackle this issue by combining a contextual bandit objective with a self-supervision objective. Augmenting contextual bandit learning with self-supervision yields a better cumulative reward. Our results on eight popular computer vision datasets show substantial gains in cumulative reward. We also present cases where the proposed scheme does not perform optimally and give alternative methods for better learning in these cases.
We have developed a new method that uses high-throughput reads that span multiple somatic point mutations to reconstruct multiple, genetically diverse subclonal populations from one or more heterogeneous tumor samples. Tumors often contain multiple, genetically diverse subclonal populations, as predicted by the clonal theory of cancer. These subclonal populations develop through successive waves of expansion and selection and have differing abilities to metastasize and resist treatment. Identifying these sub-populations and their evolutionary relationships can help identify driver mutations associated with cancer development and progression. Subclonal reconstruction algorithms attempt to infer the prevalence and genotype of multiple, genetically related subclonal populations using the variant allele frequency (VAF) of somatic variants. To date, these algorithms exclusively use data on individual somatic mutations. This restriction greatly reduces their ability to fully resolve phylogenetic ambiguities. In some cases, it is possible to determine the mutation status of more than one mutation in a single cell, for example, when single reads cover multiple single nucleotide variants (SNVs). This type of information can add considerable power to the phylogenetic reconstruction of the tumor subclonal population. We have developed the PhyloSpan algorithm, which attempts to infer the states of multiple SNVs in single cells and then exploits that information in subclonal reconstruction. Our algorithm starts by phasing somatic SNVs, looking for reads or read pairs that cover both a somatic mutation and a germline heterozygous single nucleotide polymorphism (SNP). These germline SNPs are often available through profiling of normal tissue. PhyloSpan then identifies SNVs that are on the same chromosome and close enough to be covered by a single read or paired reads. These pairs of mutations provide more phylogenetic certainty than can be found by looking at mutations independently.
For example, if those SNVs are found in the same evolutionary branch, then we expect to see some reads containing both mutations. If, however, the SNVs are on separate branches, then no reads should show both SNVs. PhyloSpan integrates this phylogenetic information with the VAF of each somatic SNV in order to perform subclonal reconstruction. Incorporating these various types of information, especially given the substantial uncertainty in phasing and NGS read content, requires a rigorous statistical approach, and so we have developed a Bayesian non-parametric tree-based clustering algorithm based on our existing PhyloWGS method. This algorithm not only infers the number of subclonal populations and their genotype but also provides a measure of uncertainty about this inference, enabling users to determine which parts of the subclonal reconstruction are certain and which parts remain ambiguous. While the number of SNVs a short-read length distance away from another SNV is small, a handful of such pairs is all that is needed to eliminate a substantial amount of ambiguity in subclonal reconstruction. Furthermore, long (>10k) read technologies, such as PacBio, can be used to supplement short-read sequencing. Our approach generalizes to permit the integration of single-cell sequencing with bulk tumor sequencing. We can also use our framework to identify a small number of SNVs for which low-throughput assays would be most useful to resolve subclonal reconstruction ambiguity. We will present results applying our algorithm to whole genome sequencing data showing the added value of considering multiple SNVs compared to independent SNVs. Citation Format: Amit G. Deshwar, Levi Boyles, Jeff Wintersinger, Paul C. Boutros, Yee Whye Teh, Quaid Morris. PhyloSpan: Using multi-mutation reads to resolve subclonal architectures from heterogeneous tumor samples. [abstract]. In: Proceedings of the AACR Special Conference on Computational and Systems Biology of Cancer; Feb 8-11 2015; San Francisco, CA. Philadelphia (PA): AACR; Cancer Res 2015;75(22 Suppl 2):Abstract nr B2-59.
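The spanning-read reasoning in the PhyloSpan abstract above (some reads carry both mutations when two SNVs share a branch, none when they sit on separate branches) can be illustrated with a small sketch. This is not the PhyloSpan model, which is described as a Bayesian non-parametric method that accounts for uncertainty in phasing and read content. The hard threshold `min_both_reads`, the function names, and the output labels are all hypothetical and stand in for that probabilistic treatment. The sketch also assumes the two SNVs have already been phased to the same chromosome copy.

```python
from collections import Counter


def count_spanning_reads(reads):
    """Tally reads covering both SNV positions by which mutations they carry.

    Each read is a (has_snv_a, has_snv_b) pair of booleans.
    """
    counts = Counter(reads)
    return {
        "both": counts[(True, True)],
        "only_a": counts[(True, False)],
        "only_b": counts[(False, True)],
        "neither": counts[(False, False)],
    }


def branch_evidence(reads, min_both_reads=2):
    """Label what the spanning reads suggest about a phased SNV pair.

    min_both_reads is a hypothetical cutoff meant to guard against
    sequencing errors; a real method would model the error rate instead.
    """
    counts = count_spanning_reads(reads)
    mutated = counts["both"] + counts["only_a"] + counts["only_b"]
    if mutated == 0:
        return "uninformative"
    if counts["both"] >= min_both_reads:
        return "same_branch"
    if counts["both"] == 0 and counts["only_a"] > 0 and counts["only_b"] > 0:
        return "separate_branches"
    return "ambiguous"


reads = [(True, True)] * 3 + [(True, False)] * 4 + [(False, False)] * 10
print(branch_evidence(reads))
```

In this toy input, three reads carry both mutations, so the pair is labeled as lying on the same branch. A pair seen only in separate reads would instead be labeled as consistent with separate branches.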
We have developed a new method that uses high-throughput reads that span multiple somatic point mutations to reconstruct multiple, genetically diverse subclonal populations from one or more heterogeneous tumor samples. Subclonal reconstruction algorithms attempt to infer the prevalence and genotype of multiple, genetically related subclonal populations using the variant allele frequency (VAF) of somatic variants. To date, these algorithms exclusively use data on individual somatic mutations. This restriction greatly reduces their ability to fully resolve phylogenetic ambiguities. In some cases, it is possible to determine the mutation status of more than one mutation in a single cell, for example, when single reads cover multiple single nucleotide variants (SNVs). This type of information can add considerable power to the phylogenetic reconstruction of the tumor subclonal population. We have developed the PhyloSpan algorithm, which attempts to infer the states of multiple SNVs in single cells and then exploits that information in subclonal reconstruction. Our algorithm starts by phasing somatic SNVs, looking for reads or read pairs that cover both a somatic mutation and a germline heterozygous single nucleotide polymorphism (SNP). These germline SNPs are often available through profiling of normal tissue. PhyloSpan then identifies SNVs that are on the same chromosome and close enough to be covered by a single read or paired reads. These pairs of mutations provide more phylogenetic certainty than can be found by looking at mutations independently. For example, if those SNVs are found in the same evolutionary branch, then we expect to see some reads containing both mutations. If, however, the SNVs are on separate branches, then no reads should show both SNVs. PhyloSpan integrates this phylogenetic information with the VAF of each somatic SNV in order to perform subclonal reconstruction.
Incorporating these various types of information requires a rigorous statistical approach, and so we have developed a Bayesian non-parametric tree-based clustering algorithm. This algorithm not only infers the number of subclonal populations and their genotype but also provides a measure of uncertainty about this inference, enabling users to determine which parts of the subclonal reconstruction are certain and which parts remain ambiguous. While the number of SNVs a short-read length distance away from another SNV is small, a handful of such pairs is all that is needed to eliminate a substantial amount of ambiguity in subclonal reconstruction. Furthermore, long-read technologies, such as PacBio, can be used to supplement short reads. Our approach generalizes to permit the integration of single-cell sequencing with bulk tumor sequencing. We will present results applying our algorithm to whole genome sequencing data showing the added value of considering multiple SNVs compared to independent SNVs. Citation Format: Amit G. Deshwar, Levi Boyles, Jeff Wintersinger, Paul C. Boutros, Yee Whye Teh, Quaid Morris. PhyloSpan: using multi-mutation reads to resolve subclonal architectures from heterogeneous tumor samples. [abstract]. In: Proceedings of the 106th Annual Meeting of the American Association for Cancer Research; 2015 Apr 18-22; Philadelphia, PA. Philadelphia (PA): AACR; Cancer Res 2015;75(15 Suppl):Abstract nr 4865. doi:10.1158/1538-7445.AM2015-4865
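For readers unfamiliar with the VAF input that this abstract refers to, the variant allele frequency at a locus is the fraction of reads covering that locus that carry the variant allele. The helper below is a minimal sketch of that definition only. The function name and the read counts in the example are illustrative assumptions, not values from the study.

```python
def variant_allele_frequency(variant_reads, total_reads):
    """Fraction of reads covering a locus that carry the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= variant_reads <= total_reads:
        raise ValueError("variant_reads must lie between 0 and total_reads")
    return variant_reads / total_reads


# Hypothetical counts: 12 of 40 reads at a locus carry the somatic SNV.
print(variant_allele_frequency(12, 40))
```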