In many human diseases, associated genetic changes tend to occur within non-coding regions, whose effect might be related to transcriptional control. A central goal in human genetics is to understand the function of such non-coding regions: Given a region that is statistically associated with changes in gene expression (expression Quantitative Trait Locus; eQTL), does it in fact play a regulatory role? And if so, how is this role “coded” in its sequence? These questions were the subject of the Critical Assessment of Genome Interpretation eQTL challenge. Participants were given a set of sequences that flank eQTLs in humans and were asked to predict whether these are capable of regulating transcription (as evaluated by massively parallel reporter assays), and whether this capability changes between alternative alleles. Here, we report lessons learned from this community effort. By inspecting predictive properties in isolation, and conducting meta-analysis over the competing methods, we find that using chromatin accessibility and transcription factor binding as features in an ensemble of classifiers or regression models leads to the most accurate results. We then characterize the loci that are harder to predict, putting the spotlight on areas of weakness, which we expect to be the subject of future studies.
We design algorithms for minimizing $\max_{i \in [n]} f_i(x)$ over a $d$-dimensional Euclidean or simplex domain. When each $f_i$ is 1-Lipschitz and 1-smooth, our method computes an $\epsilon$-approximate solution using $O(n\epsilon^{-1/3} + \epsilon^{-2})$ gradient and function evaluations, and $O(n\epsilon^{-4/3})$ additional runtime. For large $n$, our evaluation complexity is optimal up to polylogarithmic factors. In the special case where each $f_i$ is linear, which corresponds to finding a near-optimal primal strategy in a matrix game, our method finds an $\epsilon$-approximate solution in runtime $O(n(d/\epsilon)^{2/3} + nd + d\epsilon^{-2})$. For $n > d$ and $\epsilon = 1/\sqrt{n}$ this improves over all existing first-order methods. When additionally $d = \omega(n^{8/11})$ our runtime also improves over all known interior point methods. Our algorithm combines three novel primitives: (1) A dynamic data structure which enables efficient stochastic gradient estimation in small $\ell_2$ or $\ell_1$ balls. (2) A mirror descent algorithm tailored to our data structure implementing an oracle which minimizes the objective over these balls. (3) A simple ball oracle acceleration framework suitable for non-Euclidean geometry.
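To make the linear special case concrete, the following is a minimal sketch of a plain entropic mirror descent baseline for minimizing $\max_i \langle a_i, x \rangle$ over the simplex. It is not the accelerated ball-oracle method described in the abstract. The function names, the step size, and the assumption that all entries of the matrix lie in $[-1, 1]$ are illustrative choices made here.

```python
import math

def max_loss(A, x):
    """Return max_i <a_i, x> over the rows a_i of A."""
    return max(sum(a * xj for a, xj in zip(row, x)) for row in A)

def mirror_descent_max_linear(A, steps=2000):
    """Entropic mirror descent on the simplex for f(x) = max_i <a_i, x>.

    A textbook subgradient baseline (assumes |A[i][j]| <= 1),
    NOT the accelerated method from the abstract.
    """
    n, d = len(A), len(A[0])
    x = [1.0 / d] * d                      # start at the uniform distribution
    eta = math.sqrt(2.0 * math.log(d) / steps)
    avg = [0.0] * d
    for _ in range(steps):
        # Accumulate the running average of the iterates.
        avg = [s + xj / steps for s, xj in zip(avg, x)]
        # A subgradient of the max is any row attaining it.
        vals = [sum(a * xj for a, xj in zip(row, x)) for row in A]
        g = A[max(range(n), key=vals.__getitem__)]
        # Multiplicative-weights step, then renormalize onto the simplex.
        w = [xj * math.exp(-eta * gj) for xj, gj in zip(x, g)]
        total = sum(w)
        x = [wj / total for wj in w]
    return avg

A = [[1.0, 0.0], [0.0, 1.0]]
x = mirror_descent_max_linear(A)
print(max_loss(A, x))  # close to the optimal value 0.5
```

This baseline needs on the order of $\epsilon^{-2}$ iterations, each touching all $n$ rows, which is the kind of cost the abstract's method is designed to reduce.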
The representation and discovery of transcription factor (TF) sequence binding specificities is critical for understanding gene regulatory networks and interpreting the impact of disease-associated noncoding genetic variants. We present a novel TF binding motif representation, the k-mer set memory (KSM), which consists of a set of aligned k-mers that are overrepresented at TF binding sites, and a new method called KMAC for de novo discovery of KSMs. We find that KSMs more accurately predict in vivo binding sites than position weight matrix (PWM) models and other more complex motif models across a large set of ChIP-seq experiments. Furthermore, KSMs outperform PWMs and more complex motif models in predicting in vitro binding sites. KMAC also identifies correct motifs in more experiments than five state-of-the-art motif discovery methods. In addition, KSM-derived features outperform both PWM and deep learning model derived sequence features in predicting differential regulatory activities of expression quantitative trait loci (eQTL) alleles. Finally, we have applied KMAC to 1600 ENCODE TF ChIP-seq data sets and created a public resource of KSM and PWM motifs. We expect that the KSM representation and KMAC method will be valuable in characterizing TF binding specificities and in interpreting the effects of noncoding genetic variations.
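The idea of a set of k-mers that are overrepresented at binding sites can be illustrated with a toy sketch. It is not KMAC and not the full KSM representation: it ignores k-mer alignment and positional information. The function names, the pseudocount, and the enrichment threshold are all hypothetical choices made for illustration.

```python
from collections import Counter

def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def enriched_kmers(pos_seqs, neg_seqs, k, min_ratio=3.0):
    """k-mers found in many more bound (positive) than unbound (negative) sequences.

    Each sequence counts a k-mer at most once; a +1 pseudocount
    avoids division by zero. Toy criterion, not the KMAC statistic.
    """
    pos = Counter(km for s in pos_seqs for km in set(kmers(s, k)))
    neg = Counter(km for s in neg_seqs for km in set(kmers(s, k)))
    return {km for km, c in pos.items() if (c + 1) / (neg[km] + 1) >= min_ratio}

def score(seq, kmer_set, k):
    """Number of positions in seq whose k-mer belongs to the set."""
    return sum(km in kmer_set for km in kmers(seq, k))

bound = ["ACGTGAC", "TTCGTGA", "CGTGAAA"]      # made-up example sequences
unbound = ["AAAAAAA", "TTTTTTT", "ACACACA"]
motif = enriched_kmers(bound, unbound, k=4)
print(sorted(motif), score("GGCGTGAGG", motif, 4))
```

Unlike a PWM, which scores each position independently, a set of whole k-mers can capture dependencies between neighboring bases, which is one plausible reading of why the abstract compares the two.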
In this paper we provide faster algorithms for approximately solving $\ell_\infty$ regression, a fundamental problem prevalent in both combinatorial and continuous optimization (for example, it was shown in [LS15a] that $\ell_\infty$ regression is equivalent to solving linear programming). In particular we provide an accelerated coordinate descent method which converges in $k$ iterations at an $O(1/k)$ rate independent of the dimension of the problem, and whose iterations can be implemented cheaply for many structured matrices. Our algorithm can be viewed as an alternative approach to the recent breakthrough result of Sherman [She17] which achieves a similar running time improvement over classic algorithmic approaches, i.e. smoothing and gradient descent, which either converge at an $O(1/\sqrt{k})$ rate or have running times with a worse dependence on problem parameters. Our running times match those of [She17] across a broad range of parameters and, in certain cases, improve upon it. We demonstrate the efficacy of our result by providing faster algorithms for the well-studied maximum flow problem. We show how to leverage our algorithm to achieve a runtime of $O(m + \sqrt{ns}/\epsilon)$ to compute an $\epsilon$-approximate maximum flow, for an undirected graph with $m$ edges, $n$ vertices, and where $s$ is the squared $\ell_2$ norm of the congestion of any optimal flow. As $s = O(m)$ this yields a running time of $\tilde{O}(m + \sqrt{nm}/\epsilon)$, generically improving upon the previous best known runtime of $\tilde{O}(m/\epsilon)$ in [She17] whenever the graph is slightly dense. Moreover, we show how to leverage this result to achieve improved exact algorithms for maximum flow on a variety of unit capacity graphs. We achieve these results by providing an accelerated coordinate descent method capable of provably exploiting dynamic measures of coordinate smoothness for smoothed versions of $\ell_\infty$ regression.
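As an illustration of what a smoothed version of $\ell_\infty$ regression can look like, the sketch below uses the standard log-sum-exp surrogate for $\|Ax - b\|_\infty$. The abstract does not state which smoothing the authors use, so this choice, the parameter name `t`, and the function names are assumptions. For a residual with $m$ entries the surrogate lies between $\|r\|_\infty$ and $\|r\|_\infty + \log(2m)/t$.

```python
import math

def residual(A, x, b):
    """Residual r = Ax - b for a dense list-of-rows matrix A."""
    return [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]

def linf_norm(r):
    """Exact l_infinity norm of a vector."""
    return max(abs(v) for v in r)

def smoothed_linf(r, t):
    """Log-sum-exp surrogate (1/t) * log(sum_i exp(t r_i) + exp(-t r_i)).

    Shifting by t * max|r_i| before exponentiating keeps the
    computation numerically stable for large t.
    """
    shift = t * linf_norm(r)
    total = sum(math.exp(t * v - shift) + math.exp(-t * v - shift) for v in r)
    return (shift + math.log(total)) / t

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # made-up example data
r = residual(A, [1.0, 2.0], [0.0, 0.0, 0.0])
print(linf_norm(r), smoothed_linf(r, t=10.0))
```

Larger `t` tightens the approximation but makes the surrogate less smooth, which is the trade-off that smoothing-based methods have to balance.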
Our analysis leverages the structure of the Hessian of the smoothed problem via a simple bound on its trace, as well as techniques for exploiting column sparsity of the constraint matrix for faster sampling and improved smoothness estimates. We hope that the work of this paper can serve as an important step towards achieving even faster maximum flow algorithms.
[...] time penalty in terms of dimension or domain size, is the recent breakthrough result of [She17]. Our running times match those of [She17] across a broad range of parameters, and in certain cases improve upon it, due to our algorithm's tighter dependence on the $\ell_2$-norm and therefore sparsity of the optimal solution, as well as a more fine-grained dependence on the problem's smoothness parameters. Because of these tighter dependences, in many parameter regimes including the maximum flow problem for even slightly dense graphs, our result improves upon [She17]. Interestingly, our work provides an alternative approach to [She17] for accelerating $\ell_\infty$ gradient descent for certain highly structured optimization problems, i.e. $\ell_\infty$ regression. Whereas Sherman's work introduced an intriguing notion of area convexity and new regularizations of $\ell_\infty$ regression, our results are achi...
The representation and discovery of transcription factor (TF) sequence binding specificities is critical for understanding gene regulatory networks and interpreting the impact of disease-associated non-coding genetic variants. We present a novel TF binding motif representation, the K-mer Set Memory (KSM), which consists of a set of aligned k-mers that are over-represented at TF binding sites, and a new method called KMAC for de novo discovery of KSMs. We find that KSMs more accurately predict in vivo binding sites than position weight matrix models (PWMs) and other more complex motif models across a large set of ChIP-seq experiments. KMAC also identifies correct motifs in more experiments than four state-of-the-art motif discovery methods. In addition, KSM-derived features outperform both PWM and deep learning model derived sequence features in predicting differential regulatory activities of expression quantitative trait loci (eQTL) alleles. Finally, we have applied KMAC to 1488 ENCODE TF ChIP-seq datasets and created a public resource of KSM and PWM motifs. We expect that the KSM representation and KMAC method will be valuable in characterizing TF binding specificities and in interpreting the effects of non-coding genetic variations.