We propose a novel approach, based on representation learning, to extracting structured information from form-like document images. Our extraction system uses knowledge of the types of the target fields to generate extraction candidates, and a neural network architecture learns a dense representation of each candidate based on neighboring words in the document. These learned representations are not only useful in solving the extraction task for unseen document templates from two different domains, but are also interpretable, as we show using loss cases.
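The two ingredients named in the abstract above, type-driven candidate generation and a neighborhood of nearby words for each candidate, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Token` type, the regular expressions in `TYPE_PATTERNS`, the Euclidean `radius`, and the function names are hypothetical and are not taken from the paper. The neural network that would turn each neighborhood into a dense representation is not shown.

```python
import math
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One OCR'd word with its (x, y) position on the page (hypothetical layout)."""
    text: str
    x: float
    y: float


# Assumed, deliberately simple patterns per field type; a real system
# would likely use far more robust type detectors.
TYPE_PATTERNS = {
    "date": re.compile(r"\d{1,2}/\d{1,2}/\d{4}"),
    "amount": re.compile(r"\$?\d[\d,]*\.\d{2}"),
}


def generate_candidates(tokens, field_type):
    """Return every token whose full text matches the target field's type."""
    pattern = TYPE_PATTERNS[field_type]
    return [t for t in tokens if pattern.fullmatch(t.text)]


def neighborhood(tokens, candidate, radius):
    """Return (text, dx, dy) for each other token within `radius` of the candidate.

    The relative offsets are the kind of input a learned model could
    consume to build a representation of the candidate.
    """
    result = []
    for t in tokens:
        if t is candidate:
            continue
        dx, dy = t.x - candidate.x, t.y - candidate.y
        if math.hypot(dx, dy) <= radius:
            result.append((t.text, dx, dy))
    return result
```

In this sketch, a date candidate printed next to the words "Invoice" and "Date:" would carry those two words, with their relative offsets, as its neighborhood, while words far down the page would be excluded.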
Query optimizers and query execution engines cooperate to deliver high performance on complex analytic queries. Typically, the optimizer searches through the plan space and sends a selected plan to the execution engine. However, optimizers may at times miss the optimal plan, with sometimes disastrous impact on performance. In this paper, we develop the notion of robustness of a query evaluation strategy with respect to a space of query plans. We also propose a novel query execution strategy called Lookahead Information Passing (LIP) that is robust with respect to the space of (fully pipeline-able) left-deep query plan trees for in-memory star schema data warehouses. LIP ensures that execution times for the best and the worst case plans are far closer than without LIP. In fact, under certain assumptions of independent and uniform distributions, every plan in that space is theoretically guaranteed to execute in near-optimal time. We also evaluate these claims using workloads that include skew and correlation. With LIP we make an initial foray into a novel way of thinking about robustness from the perspective of query evaluation, where we develop strategies (like LIP) that collapse plan sub-spaces in the overall global plan space.
Many modern applications deal with exponentially increasing data volumes and aid business-critical decisions in near real-time. Particularly in exploratory data analysis, the focus is on interactive querying and some degree of error in estimated results is tolerable. A common response to this challenge is approximate query processing, where the user is presented with a quick confidence interval estimate based on a sample of the data. In this work, we highlight some of the problems that are associated with this probabilistic approach when extended to more complex queries, both in semantic interpretation and the lack of a formal algebra. As an alternative, we propose deterministic approximate querying (DAQ) schemes, formalize a closed deterministic approximation algebra, and outline some design principles for DAQ schemes. We also illustrate the utility of this approach with an example deterministic online approximation scheme which uses a bitsliced index representation and computes the most significant bits of the result first. Our prototype scheme delivers speedups over exact aggregation and predicate evaluation, and outperforms sampling-based schemes for extreme value aggregations.
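The idea of computing the most significant bits of a result first over a bit-sliced representation can be sketched for a SUM aggregate. This is a minimal illustration under stated assumptions, not the prototype scheme from the paper: it assumes non-negative integers of a fixed bit width, stores each bit slice as a Python integer used as a bitmap, and the names `build_bit_slices` and `approximate_sum` are hypothetical. After each slice is read, the true sum is guaranteed to lie inside the reported interval, so the bounds are deterministic rather than probabilistic.

```python
def build_bit_slices(values, width):
    """Return slices where bit i of slices[b] equals bit b of values[i]."""
    slices = [0] * width
    for i, v in enumerate(values):
        if not 0 <= v < (1 << width):
            raise ValueError(f"value {v} does not fit in {width} bits")
        for b in range(width):
            if (v >> b) & 1:
                slices[b] |= 1 << i
    return slices


def approximate_sum(slices, n_rows):
    """Yield (low, high) bounds on SUM, refined from the top bit slice down.

    After reading slice b, each row's unread lower bits can contribute
    at most 2**b - 1, which bounds the remaining error.
    """
    partial = 0
    for b in range(len(slices) - 1, -1, -1):
        ones = bin(slices[b]).count("1")  # rows with bit b set
        partial += ones << b
        slack = n_rows * ((1 << b) - 1)
        yield partial, partial + slack
```

For the values 5, 3, 12 and 7 stored in 4 bits, the successive bounds are (8, 36), (20, 32), (24, 28) and finally (27, 27), the exact sum. A caller can stop as soon as the interval is tight enough for its purpose.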
Modern servers pack as much storage and computing power as was spread across a modest-sized cluster just a decade ago. This paper presents a prototype system, called Quickstep, to exploit the large amount of parallelism that is packed inside modern servers. Quickstep builds on a vast body of previous methods for organizing data, optimizing, scheduling and executing queries, and brings them together in a single system. Quickstep also includes new query processing methods that go beyond previous approaches. To keep the project focused, its initial target is read-mostly in-memory data warehousing workloads in single-node settings. In this paper, we describe the design and implementation of Quickstep for this target application space. We also present experimental results comparing the performance of Quickstep to a number of other systems, demonstrating that Quickstep is often faster than many other contemporary systems, and in some cases faster by orders of magnitude. Quickstep is an Apache (incubating) project.