Selectivity estimation, the problem of estimating the result size of queries, is a fundamental problem in databases. Accurately estimating the selectivity of queries involving multiple correlated attributes is especially challenging, and poor cardinality estimates can lead the query optimizer to select bad plans. Recently, deep learning has been applied to this problem with promising results. However, many of the proposed approaches struggle to provide accurate results for multi-attribute queries involving a large number of predicates and with low selectivity. In this paper, we propose two complementary approaches that are effective for this scenario. Our first approach models selectivity estimation as a density estimation problem, in which one seeks to estimate the joint probability distribution from a finite number of samples. We leverage techniques from neural density estimation to build an accurate selectivity estimator. The key idea is to decompose the joint distribution into a set of tractable conditional probability distributions that satisfy the autoregressive property. Our second approach formulates selectivity estimation as a supervised deep learning problem that predicts the selectivity of a given query. We describe how to extend our algorithms to range queries. We also introduce and address a number of practical challenges that arise when adapting deep learning to relational data. These include query/data featurization, incorporating
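The autoregressive decomposition mentioned above can be illustrated with a minimal sketch: by the chain rule of probability, p(a1, a2, a3) = p(a1) · p(a2 | a1) · p(a3 | a1, a2), and a point query's selectivity is this product of conditionals. The toy table and exact empirical conditionals below are illustrative assumptions; a neural estimator would replace the empirical counts with learned conditional models.

```python
# Illustrative sketch: selectivity via the autoregressive (chain-rule)
# decomposition p(a1, a2, a3) = p(a1) * p(a2 | a1) * p(a3 | a1, a2).
# The conditionals here are exact empirical estimates over a toy table;
# a neural density estimator would learn each conditional instead.

table = [
    ("NY", "sedan", "red"),
    ("NY", "sedan", "blue"),
    ("NY", "suv", "red"),
    ("CA", "suv", "red"),
]

def selectivity(query):
    """Estimate P(row == query) as a product of conditional probabilities."""
    est = 1.0
    for i in range(len(query)):
        prefix = query[:i]
        # Rows matching the conditioning prefix (a1 .. a_{i-1}).
        matching = [r for r in table if r[:i] == prefix]
        if not matching:
            return 0.0
        # p(attr_i = query[i] | prefix)
        count = sum(1 for r in matching if r[i] == query[i])
        est *= count / len(matching)
    return est

print(selectivity(("NY", "sedan", "red")))  # 0.25, the row's true frequency
```

Because each factor conditions only on earlier attributes, the decomposition captures correlations (e.g., between state and vehicle type) that independence-based estimators miss.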
Cancer registries collect unstructured and structured cancer data for surveillance purposes, providing important insights regarding cancer characteristics, treatments, and outcomes. Cancer registry data typically (1) categorize each reportable cancer case or tumor at the time of diagnosis, (2) contain demographic information about the patient, such as age, gender, and location at time of diagnosis, (3) include planned and completed primary treatment information, and (4) may contain survival outcomes. As structured data is extracted from various unstructured sources, such as pathology reports, radiology reports, and medical records, and stored for reporting and other needs, the associated information representing a reportable cancer is constantly expanding and evolving. While popular analytic approaches such as SEER*Stat and SAS exist, we provide a knowledge graph approach to organizing cancer registry data. Our approach offers unique advantages for timely data analysis, presentation, and visualization of valuable information. This knowledge graph approach semantically enriches the data and easily enables linking with third-party data, which can help explain variation in cancer incidence patterns, disparities, and outcomes. We developed a prototype knowledge graph based on the Louisiana Tumor Registry dataset. We present the advantages of the knowledge graph approach by examining: i) scenario-specific queries, ii) links with openly available external datasets, iii) schema evolution for iterative analysis, and iv) data visualization. Our results demonstrate that this graph-based solution can perform complex queries and improve query run-time performance.
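The kind of scenario-specific, linkable query the abstract describes can be sketched with a minimal in-memory triple store. All entity names, relations, and the third-party attribute below are invented for illustration; a real deployment would use a graph database and query language such as SPARQL or Cypher over the registry's actual schema.

```python
# Minimal sketch of a knowledge-graph scenario query over (subject,
# predicate, object) triples. Entity and relation names are hypothetical.
triples = [
    ("patient1", "diagnosedWith", "tumorA"),
    ("tumorA",   "hasSite",       "lung"),
    ("tumorA",   "diagnosedIn",   "2019"),
    ("patient1", "residesIn",     "parishX"),
    ("parishX",  "medianIncome",  "42000"),  # linked third-party data
]

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# Scenario query: which patients were diagnosed with lung tumors?
lung_tumors = {s for (s, _, _) in query(predicate="hasSite", obj="lung")}
patients = [s for (s, _, o) in query(predicate="diagnosedWith") if o in lung_tumors]
```

Linking external data is just adding triples (here, a parish-level income attribute), which is the schema-evolution flexibility the abstract highlights.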
Data is generated at an unprecedented rate, surpassing our ability to analyze it. The database community has pioneered many novel techniques for Approximate Query Processing (AQP) that can give approximate results in a fraction of the time needed to compute exact results. In this work, we explore the use of deep learning (DL) for answering aggregate queries, specifically for interactive applications such as data exploration and visualization. We use deep generative models, an unsupervised learning approach, to learn the data distribution faithfully, so that aggregate queries can be answered approximately by generating samples from the learned model. The model is often compact (a few hundred KBs), so arbitrary AQP queries can be answered on the client side without contacting the database server. Our other contributions include identifying model bias and minimizing it through a rejection sampling based approach, and an algorithm for building model ensembles for AQP to improve accuracy. Our extensive experiments show that our proposed approach can provide answers with high accuracy and low latency.
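The rejection-sampling correction mentioned above can be sketched as follows: draw from the (possibly biased) model distribution q and accept each sample with probability p(x) / (M · q(x)), where M bounds p/q, so accepted samples follow the target distribution p. The discrete distributions below are toy stand-ins; in the paper's setting p is not known exactly and the bias would be characterized against the data, not hard-coded.

```python
# Illustrative rejection sampling to correct model bias: accepted samples
# from a biased model q follow the target distribution p. Both distributions
# here are hypothetical toy examples.
import random

random.seed(0)

p_true  = {"A": 0.5, "B": 0.3, "C": 0.2}   # target data distribution
q_model = {"A": 0.6, "B": 0.3, "C": 0.1}   # distribution the model learned

# Envelope constant: M >= p(x)/q(x) for all x.
M = max(p_true[x] / q_model[x] for x in p_true)

def sample_corrected(n):
    """Draw n samples from p by rejection sampling against q."""
    keys, weights = zip(*q_model.items())
    accepted = []
    while len(accepted) < n:
        x = random.choices(keys, weights)[0]          # draw from the model
        if random.random() < p_true[x] / (M * q_model[x]):
            accepted.append(x)                         # accept with p/(Mq)
    return accepted

samples = sample_corrected(20000)
freq_A = samples.count("A") / len(samples)  # close to p_true["A"] = 0.5
```

The acceptance rate is 1/M, so a model that is only mildly biased (M near 1) wastes few samples, which matters for the interactive latencies the work targets.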
Our society is struggling with an unprecedented amount of falsehoods, hyperboles, and half-truths. Politicians and organizations repeatedly make the same false claims. Fake news floods cyberspace and even allegedly influenced the 2016 election. In the fight against false information, the number of active fact-checking organizations grew from 44 in 2014 to 114 in early 2017. Fact-checkers vet claims by investigating relevant data and documents and publish their verdicts. For instance, PolitiFact.com, one of the earliest and most popular fact-checking projects, gives factual claims truthfulness ratings such as True, Mostly True, Half True, Mostly False, False, and even "Pants on Fire". In the U.S., the election year made fact-checking a part of household terminology. For example, during the first presidential debate on September 26, 2016, NPR.org's live fact-checking website drew 7.4 million page views and delivered its biggest traffic day ever.
Sentiment analysis is an emerging field concerned with the detection of human emotions in textual data. Sentiment analysis seeks to characterize opinionated or evaluative aspects of natural language text, helping people discover valuable information in large amounts of unstructured data. Sentiment analysis can be used for grouping search engine results and analyzing news content, reviews of books, movies, and sports, blogs, web forums, and more. Sentiment (i.e., a good or bad opinion) expressed in text has been studied widely, at three different levels: word, sentence, and document. Several methods have been proposed for sentiment analysis, mostly based on common machine learning techniques such as Support Vector Machines (SVM), Naive Bayes (NB), and Maximum Entropy (ME). In this thesis we explore a new methodology for sentiment analysis called proximity-based sentiment analysis. We take a different approach by considering a new set of features based on word proximities in written text. We focus on three different word-proximity-based features, namely proximity distribution, mutual information between proximity types, and proximity patterns. We apply this approach to the movie reviews domain and perform empirical research to demonstrate the performance of the proposed approach. The experimental results show that proximity-based sentiment analysis is able to extract sentiments from a specific domain with performance comparable to the state-of-the-art. To the best of our knowledge, this is the first attempt at focusing on proximity-based features as the primary features in sentiment analysis.
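One of the three features named above, the proximity distribution, can be sketched as a histogram of token distances from a seed opinion word to other opinion words. The word list, example sentence, and window size below are illustrative assumptions, not the thesis's actual lexicon or parameters.

```python
# Illustrative proximity-distribution feature: for each occurrence of a seed
# word, count nearby opinion words by token distance. Word lists and the
# window size are invented for this example.
text = "the plot was good but the acting was terribly bad and the ending bad".split()
opinion_words = {"good", "bad", "terribly"}

def proximity_distribution(tokens, seed, max_dist=5):
    """Histogram of distances from occurrences of `seed` to opinion words."""
    hist = {d: 0 for d in range(1, max_dist + 1)}
    positions = [i for i, t in enumerate(tokens) if t == seed]
    for i in positions:
        for j, t in enumerate(tokens):
            d = abs(i - j)
            if t in opinion_words and 0 < d <= max_dist:
                hist[d] += 1
    return hist
```

The resulting histogram (one per seed word, or aggregated over a lexicon) can then serve as a feature vector for a standard classifier such as an SVM.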