We consider the task of inferring is-a relationships from large text corpora. For this purpose, we propose a new method combining hyperbolic embeddings and Hearst patterns. This approach allows us to set appropriate constraints for inferring concept hierarchies from distributional contexts while also being able to predict missing is-a relationships and to correct wrong extractions. Moreover, and in contrast with other methods, the hierarchical nature of hyperbolic space allows us to learn highly efficient representations and to improve the taxonomic consistency of the inferred hierarchies. Experimentally, we show that our approach achieves state-of-the-art performance on several commonly used benchmarks.
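To make the "hierarchical nature of hyperbolic space" concrete, the following is a minimal sketch of the distance function in the Poincaré ball, one common model of hyperbolic space. The abstract does not state which model or which training objective the authors use, so the choice of model, the function name `poincare_distance`, and the toy coordinates are all assumptions made for illustration.

```python
import math

def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)

# Hypothetical 2-D embeddings (not taken from the paper): a general concept
# placed at the origin and two specific concepts placed near the boundary.
animal = (0.0, 0.0)
dog = (0.9, 0.0)
cat = (0.0, 0.9)

d_animal_dog = poincare_distance(animal, dog)
d_dog_cat = poincare_distance(dog, cat)
```

In this toy layout the two specific concepts are much farther from each other than either is from the general concept, although all three lie within the unit disc. Distances grow rapidly towards the boundary, which is the property that lets tree-like hierarchies fit into few dimensions.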
Predicting effects of gene regulatory elements (GREs) is a longstanding challenge in biology. Machine learning may address this, but requires large datasets linking GREs to their quantitative function. However, experimental methods to generate such datasets are either application-specific or technically complex and error-prone. Here, we introduce DNA-based phenotypic recording as a widely applicable, practicable approach to generate large-scale sequence-function datasets. We use a site-specific recombinase to directly record a GRE's effect in DNA, enabling readout of both sequence and quantitative function for extremely large GRE-sets via next-generation sequencing. We record translation kinetics of over 300,000 bacterial ribosome binding sites (RBSs) in >2.7 million sequence-function pairs in a single experiment. Further, we introduce a deep learning approach employing ensembling and uncertainty modelling that predicts RBS function with high accuracy, outperforming state-of-the-art methods. DNA-based phenotypic recording combined with deep learning represents a major advance in our ability to predict function from genetic sequence.
We present a novel algorithm, Westfall-Young light, for detecting patterns, such as itemsets and subgraphs, which are statistically significantly enriched in one of two classes. Our method corrects rigorously for multiple hypothesis testing and correlations between patterns through the Westfall-Young permutation procedure, which empirically estimates the null distribution of pattern frequencies in each class via permutations. In our experiments, Westfall-Young light dramatically outperforms the current state-of-the-art approach in terms of both runtime and memory efficiency on popular real-world benchmark datasets for pattern mining. The key to this efficiency is that unlike all existing methods, our algorithm neither needs to solve the underlying frequent itemset mining problem anew for each permutation nor needs to store the occurrence list of all frequent patterns. Westfall-Young light opens the door to significant pattern mining on large datasets that previously led to prohibitive runtime or memory costs.
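For orientation, the generic Westfall-Young permutation procedure that the abstract builds on can be sketched as follows. This is the naive baseline, not Westfall-Young light: it re-tests every pattern for every permutation and keeps all occurrence lists in memory, which is precisely the cost the authors report avoiding. It also assumes that the candidate patterns are already given as binary occurrence vectors (no itemset mining is performed) and that a one-sided Fisher's exact test is used. The function names and the toy data are hypothetical.

```python
import math
import random

def fisher_one_sided(a, x, n1, n):
    """P(at least `a` of the `x` pattern occurrences fall in class 1),
    under the hypergeometric null with n1 class-1 samples out of n."""
    upper = min(x, n1)
    hits = sum(math.comb(n1, k) * math.comb(n - n1, x - k)
               for k in range(a, upper + 1))
    return hits / math.comb(n, x)

def min_p_value(occurrences, labels):
    """Smallest p-value over all patterns for one labelling."""
    n, n1 = len(labels), sum(labels)
    best = 1.0
    for occ in occurrences:
        x = sum(occ)
        a = sum(o for o, y in zip(occ, labels) if y == 1)
        best = min(best, fisher_one_sided(a, x, n1, n))
    return best

def westfall_young_threshold(occurrences, labels, alpha=0.05,
                             n_perm=200, seed=0):
    """Estimate a significance threshold from the permutation null
    distribution of the minimum p-value."""
    rng = random.Random(seed)
    permuted = list(labels)
    min_ps = []
    for _ in range(n_perm):
        rng.shuffle(permuted)  # break the pattern/label association
        min_ps.append(min_p_value(occurrences, permuted))
    min_ps.sort()
    # Patterns with p strictly below this value are reported; at most
    # floor(alpha * n_perm) permutations have a minimum p-value below it.
    k = min(math.floor(alpha * n_perm), n_perm - 1)
    return min_ps[k]

# Toy data (hypothetical): 12 samples, 2 candidate patterns.
labels = [1] * 6 + [0] * 6
occurrences = [
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # occurs only in class 1
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],  # unrelated to the class
]
threshold = westfall_young_threshold(occurrences, labels)
```

Because the same permuted labels are applied to all patterns at once, correlations between patterns are carried into the null distribution of the minimum p-value. This is what the abstract means by correcting for "correlations between patterns".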
Predicting quantitative effects of gene regulatory elements (GREs) on gene expression is a longstanding challenge in biology. Machine learning models for gene expression prediction may be able to address this challenge, but they require experimental datasets that link large numbers of GREs to their quantitative effect. However, current methods to generate such datasets experimentally are either restricted to specific applications or limited by their technical complexity and error-proneness. Here we introduce DNA-based phenotypic recording as a widely applicable and practical approach to generate very large datasets linking GREs to quantitative functional readouts of high precision, temporal resolution, and dynamic range, solely relying on sequencing. This is enabled by a novel DNA architecture comprising a site-specific recombinase, a GRE that controls recombinase expression, and a DNA substrate modifiable by the recombinase. Both GRE sequence and substrate state can be determined in a single sequencing read, and the frequency of modified substrates amongst constructs harbouring the same GRE is a quantitative, internally normalized readout of this GRE's effect on recombinase expression. Using next-generation sequencing, the quantitative expression effect of extremely large GRE sets can be assessed in parallel. As a proof of principle, we apply this approach to record translation kinetics of more than 300,000 bacterial ribosome binding sites (RBSs), collecting over 2.7 million sequence-function pairs in a single experiment. Further, we generalize from these large-scale ...
Recent progress in DNA sequencing and synthesis has facilitated reading and (re-)writing of the genetic makeup of biological systems on a massive scale [1,2]. Despite this progress, the relationship between a genetic sequence and its functional properties is poorly understood, and thus the question "what to write" remains largely unanswered [3,4]. Since the number of possible sequences scales exponentially with their length, the theoretical sequence space cannot be exhaustively explored by experiments, even for small GREs [5-7]. Therefore, innovative high-throughput (HTP) approaches are required that allow the collection of a quantitative functional readout for large numbers of genetic sequences [7,8]. At the same time, novel methods are required that identify statistical patterns and dependencies in the resulting datasets to generate models that accurately predict the properties of untested sequences. Deep learning maximizes the benefit of data collection at large scale owing to its ability to capture complex, nonlinear dependencies and to its computational scalability [9], which has led to several successful applications in computational biology, from genomics to proteomics [10-15]. These methods promise to be able to model sequence-function dependencies with minimal prior assumptions, provided that large experimental training datasets that link sequence to quantitative measure ...
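The readout described in the abstract above, the frequency of modified substrates among constructs carrying the same GRE, amounts to a per-sequence fraction. A minimal sketch follows, assuming each sequencing read has already been parsed into a (GRE sequence, substrate modified?) pair. The parsing step, the function name `modified_fraction`, and the example sequences are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def modified_fraction(reads):
    """Map each GRE sequence to the fraction of its reads whose
    recombinase substrate was observed in the modified state."""
    counts = defaultdict(lambda: [0, 0])  # GRE -> [modified reads, total reads]
    for gre, is_modified in reads:
        counts[gre][0] += int(is_modified)
        counts[gre][1] += 1
    return {gre: modified / total for gre, (modified, total) in counts.items()}

# Made-up reads for two hypothetical RBS variants at a single time point.
reads = [
    ("AGGAGG", True), ("AGGAGG", True), ("AGGAGG", False), ("AGGAGG", True),
    ("TTTTTT", False), ("TTTTTT", False),
]
fractions = modified_fraction(reads)
```

Because numerator and denominator come from reads of the same GRE in the same sample, the ratio is independent of how deeply each variant was sequenced, which is one way to read "internally normalized". Repeating the computation on samples taken at successive time points would give a kinetic profile per GRE.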