We present several methods for mining knowledge from the query logs of the MSN search engine. Using the query logs, we build a time series for each query word or phrase (e.g., 'Thanksgiving' or 'Christmas gifts'), where each element of the series is the number of times the query was issued on a given day. All of the methods we describe use sequences of this form and can be applied to time series data generally. Our primary goal is the discovery of semantically similar queries, which we achieve by identifying queries with similar demand patterns. By using the best Fourier coefficients together with the energy of the omitted components, we improve upon the state of the art in time series similarity matching. The extracted sequence features are then organized in an efficient metric tree index structure. We also demonstrate how to efficiently and accurately discover the important periods in a time series. Finally, we propose a simple but effective method for identifying bursts (long- or short-term). Using the burst information extracted from a sequence, we can efficiently perform 'query-by-burst' on the database of time series. We conclude with a description of a tool that uses the described methods and serves as an interactive exploratory data discovery tool for the MSN query database.
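The three ingredients named in this abstract (best Fourier coefficients with omitted energy, period discovery, and burst identification) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the naive O(n^2) transform, the periodogram-peak rule, and the moving-average burst threshold (`window`, `n_std`) are all assumptions made for illustration, since the abstract does not specify them.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform, scaled by 1/sqrt(n) so that
    the sum of squared magnitudes equals the energy of the series."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]


def best_coefficients(x, k):
    """Keep the k largest-magnitude coefficients (hypothetical helper).
    Returns {frequency index: coefficient} and the energy of what was dropped."""
    X = dft(x)
    order = sorted(range(len(X)), key=lambda f: abs(X[f]), reverse=True)
    kept = {f: X[f] for f in order[:k]}
    omitted_energy = sum(abs(X[f]) ** 2 for f in order[k:])
    return kept, omitted_energy


def dominant_period(x):
    """Period, in samples, of the strongest non-constant frequency.
    Assumption: the periodogram peak is taken as 'the important period'."""
    n = len(x)
    X = dft(x)
    peak = max(range(1, n // 2 + 1), key=lambda f: abs(X[f]))
    return n / peak


def burst_days(x, window=3, n_std=1.5):
    """Indices whose trailing moving average exceeds mean + n_std * std of
    that moving average. The threshold rule is an illustrative assumption."""
    ma = []
    for i in range(len(x)):
        chunk = x[max(0, i - window + 1): i + 1]
        ma.append(sum(chunk) / len(chunk))
    mean = sum(ma) / len(ma)
    std = math.sqrt(sum((v - mean) ** 2 for v in ma) / len(ma))
    return [i for i, v in enumerate(ma) if v > mean + n_std * std]
```

For a 28-day series with a pure weekly cycle, `dominant_period` returns 7.0, and three coefficients capture essentially all of the energy. Storing the omitted energy alongside the kept coefficients is what lets a similarity search bound the error introduced by compression.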
Probabilistic programming languages are used for developing statistical models, and they typically consist of two components: a specification of a stochastic process (the prior), and a specification of observations that restrict the probability space to a conditional subspace (the posterior). Use cases of such formalisms include the development of algorithms in machine learning and artificial intelligence. We propose and investigate an extension of Datalog for specifying statistical models, and establish a declarative probabilistic-programming paradigm over databases. Our proposed extension provides convenient mechanisms to include common numerical probability functions; in particular, conclusions of rules may contain values drawn from such functions. The semantics of a program is a probability distribution over the possible outcomes of the input database with respect to the program. Observations are naturally incorporated by means of integrity constraints over the extensional and intensional relations. The resulting semantics is robust under different chases and invariant to rewritings that preserve logical equivalence.
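To make the idea of rules with sampled conclusions and observations as integrity constraints concrete, here is a toy sketch in plain Python. It is not the proposed Datalog extension. The rule `Earthquake(c, Flip[0.3]) :- City(c)`, the derived relation `Alarm`, and the use of rejection sampling to approximate the conditional distribution are assumptions chosen only for illustration.

```python
import random


def sample_outcome(cities, rng):
    """Draw one possible outcome of the program (hypothetical rules):
         Earthquake(c, Flip[0.3]) :- City(c)
         Alarm(c)                 :- Earthquake(c, 1)
    """
    earthquake = {c: int(rng.random() < 0.3) for c in cities}
    alarm = {c for c, flag in earthquake.items() if flag == 1}
    return earthquake, alarm


def posterior(cities, constraint, query, n=20000, seed=0):
    """Estimate P(query | constraint) by rejection sampling: outcomes that
    violate the integrity constraint are discarded, the rest are counted."""
    rng = random.Random(seed)
    accepted = hits = 0
    for _ in range(n):
        outcome = sample_outcome(cities, rng)
        if constraint(outcome):
            accepted += 1
            hits += int(query(outcome))
    return hits / accepted if accepted else float("nan")


def some_alarm(outcome):
    """Observation: at least one alarm fact was derived."""
    return len(outcome[1]) > 0


def quake_in_a(outcome):
    """Query: did city 'a' have an earthquake?"""
    return outcome[0]["a"] == 1
```

Calling `posterior(["a", "b"], some_alarm, quake_in_a)` approximates the exact value 0.3 / (1 - 0.7 * 0.7), which is about 0.588. The sketch conditions by filtering samples, whereas the abstract defines the semantics declaratively as a distribution over possible outcomes.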
This paper describes the SimSQL system, which allows for SQL-based specification, simulation, and querying of database-valued Markov chains, i.e., chains whose value at any time step comprises the contents of an entire database. SimSQL extends the earlier Monte Carlo database system (MCDB), which permitted Monte Carlo simulation of static database-valued random variables. Like MCDB, SimSQL uses user-specified "VG functions" to generate the simulated data values that are the building blocks of a simulated database. The enhanced functionality of SimSQL is enabled by the ability to parametrize VG functions using stochastic tables, so that one stochastic database can be used to parametrize the generation of another stochastic database, which can parametrize another, and so on. Other key extensions include the ability to explicitly define recursive versions of a stochastic table and the ability to execute the simulation in a MapReduce environment. We focus on applying SimSQL to Bayesian machine learning.
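A database-valued Markov chain can be pictured with a small Python sketch in which each state is a whole (tiny) database and the next state is generated by a VG-style function parametrized by the previous state's table. This is only an analogy, not SimSQL syntax or its API. The table name `position`, the function `normal_vg`, and the random-walk update are hypothetical.

```python
import random


def normal_vg(rng, mean, std):
    """Hypothetical VG function: produce one simulated value from its parameters."""
    return rng.gauss(mean, std)


def step(db, rng):
    """Build the next database state. Each row of the stochastic table is
    regenerated by a VG function parametrized by the previous version of it."""
    return {
        "position": {
            key: normal_vg(rng, previous, 1.0)
            for key, previous in db["position"].items()
        }
    }


def simulate(db0, n_steps, seed=0):
    """Return the chain [db0, db1, ..., db_n]; every element is a full database."""
    rng = random.Random(seed)
    chain = [db0]
    for _ in range(n_steps):
        chain.append(step(chain[-1], rng))
    return chain
```

Running `simulate({"position": {"p1": 0.0, "p2": 10.0}}, 5)` yields six database states. Repeating the run with different seeds gives independent Monte Carlo realizations of the chain, which is the kind of repeated simulation that MCDB-style systems automate.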
In this paper we present the Threshold Join Algorithm (TJA), an efficient top-k query processing algorithm for distributed sensor networks. The objective of a top-k query is to find the k highest-ranked answers to a user-defined similarity function. Evaluating such a query in a sensor network requires transferring data over an extremely expensive communication medium. TJA uses a non-uniform threshold on the queried attribute to minimize the number of tuples that must be transferred toward the querying node. Additionally, TJA resolves queries in the network rather than in a centralized fashion, which further reduces bandwidth consumption and delay. Our preliminary experimental results, obtained with our trace-driven simulator, show that TJA is both practical and efficient.
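The benefit of resolving a top-k query inside the network can be shown with a simplified sketch. It is not TJA itself, and in particular it omits the threshold phase the abstract mentions. It assumes a routing tree rooted at the querying node, one reading per sensor, and ranking by raw reading value. Each node merges its children's partial answers and forwards at most k tuples to its parent. All names are hypothetical.

```python
import heapq


def in_network_top_k(tree, readings, node, k):
    """Return (top-k tuples for the subtree rooted at node,
               number of tuples sent over links inside that subtree).

    tree:     {parent: [children]} routing tree (assumed given)
    readings: {node: value}, one reading per sensor (assumed)
    """
    candidates = [(readings[node], node)]
    sent = 0
    for child in tree.get(node, []):
        child_top, child_sent = in_network_top_k(tree, readings, child, k)
        candidates.extend(child_top)
        # The child forwards only its pruned list over the link to this node.
        sent += child_sent + len(child_top)
    return heapq.nlargest(k, candidates), sent
```

With a tree `{"sink": ["a", "b"], "a": ["c", "d"], "b": ["e"]}` and k = 2, seven tuples cross links in total, compared with eight tuple-hops if every reading were relayed unpruned to the sink. The gap widens as the tree deepens, because pruning at each level caps every link at k tuples.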