With the vast amount of data available on online social networks, enabling efficient analytics over them has become an increasingly important research problem. Many existing studies resort to sampling techniques that draw random nodes from an online social network through its restrictive web/API interface. Almost all of these techniques share the same underlying mechanism: the random walk, a Markov chain Monte Carlo (MCMC) based method that iteratively transits from one node to a randomly chosen neighbor. Random walks fit this problem naturally because, for most online social networks, the only query one can issue through the interface is to retrieve the neighbors of a given node (i.e., there is no access to the full graph topology). A problem with random walks, however, is the "burn-in" period, which requires a large number of transitions/queries before the sampling distribution converges to the stationary distribution that enables drawing samples in a statistically valid manner. In this paper, we consider the novel problem of speeding up the fundamental design of random walks (i.e., reducing the number of queries required) without changing the stationary distribution they achieve, thereby enabling a more efficient "drop-in" replacement for existing sampling-based analytics techniques over online social networks. Technically, our main idea is to leverage the history of a random walk to construct a higher-order Markov chain. We develop two algorithms, Circulated Neighbors Random Walk (CNRW) and Groupby Neighbors Random Walk (GNRW), and rigorously prove that, regardless of the social network topology, CNRW and GNRW offer better efficiency than the baseline random walk while achieving the same stationary distribution. We demonstrate the superiority of our techniques over existing ones through extensive experiments on real-world social networks and synthetic graphs.
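The baseline design the paper speeds up can be illustrated with a minimal sketch of a simple random walk that uses only a neighbor-retrieval query, mirroring the restrictive API access model described above. This is not the paper's CNRW/GNRW algorithms; `neighbors` is a hypothetical stand-in for the real API call, and the burn-in/sampling split is the standard MCMC practice the abstract refers to.

```python
import random

def neighbors(graph, node):
    # Stand-in for the restrictive API call: the only operation
    # available is retrieving the neighbors of a given node.
    return graph[node]

def random_walk_sample(graph, start, burn_in, num_samples, seed=0):
    """Baseline random walk: discard the first `burn_in` transitions,
    then record each visited node as a sample. Its stationary
    distribution is proportional to node degree."""
    rng = random.Random(seed)
    node = start
    samples = []
    for step in range(burn_in + num_samples):
        node = rng.choice(neighbors(graph, node))  # one API query per step
        if step >= burn_in:
            samples.append(node)
    return samples
```

Every transition costs one interface query, which is why reducing the number of transitions needed for convergence, as CNRW and GNRW do, directly reduces query cost.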
Time series are the simplest form of temporal data. A time series is a sequence of real numbers collected regularly in time, where each number represents a value measured at that point in time. Time series data come up in a variety of domains, including stock market analysis, environmental data, telecommunications data, medical data and financial data. Web data that count the number of clicks on given sites, or model the usage of different pages, are also modeled as time series. Time series therefore account for a large fraction of the data stored in commercial databases. There is increasing recognition of this fact, and support for time series as a distinct data type in commercial database management systems is growing. IBM DB2, for example, implements support for time series using data-blades. The pervasiveness and importance of time series data has sparked a lot of research work on the topic. While the statistics literature on time series is vast, it has not studied methods that would be appropriate for the time series similarity and indexing problems we discuss here; much of the relevant work on these problems has been done by the computer science community. One interesting problem with time series data is determining whether different time series display similar behavior. More formally, the problem can be stated as: given two time series X and Y, determine whether they are similar or not (in other words, define and compute a distance function dist(X, Y)). Typically each time series describes the evolution of an object, for example the price of a stock, or the pollution levels over time at a given data collection station. The objective can be to cluster the different objects into similar groups, or to classify an object based on a set of known object examples. The problem is hard because the similarity model should allow for imprecise matches.
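The distance-function formulation above can be made concrete with a minimal sketch of the simplest such measure, the L_p norm between two equal-length series (the function name and signature are illustrative, not from the tutorial itself):

```python
def lp_distance(x, y, p=2):
    """L_p norm distance between two equal-length time series.
    p=2 gives the familiar Euclidean distance; p=1 the Manhattan
    distance."""
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```

Its rigidity, requiring equal lengths and point-by-point alignment, is exactly why the more flexible models surveyed below (time warping, longest common subsequence) were proposed.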
One interesting variation is the subsequence similarity problem, where given two time series X and Y, we have to determine those subsequences of X that are similar to pattern Y. To address these problems, different notions of similarity between time series have been proposed in data mining research. In the tutorial we examine the different time series similarity models that have been proposed, in terms of efficiency and accuracy. The solutions encompass techniques from a wide variety of disciplines, such as databases, signal processing, speech recognition, pattern matching, combinatorics and statistics. We survey proposed similarity techniques, including the L_p norms, time warping, longest common subsequence measures, baselines, moving averages, and deformable Markov model templates. Another problem that comes up in applications is the indexing problem: given a time series X and a set of time series S = {Y_1, …, Y_N}, find the time series in S that are most similar to the query X. A variation is the subsequence indexing problem, where given a set of sequences S and a query sequence (pattern) X, find the sequences in S that contain subsequences similar to X. To solve these problems efficiently, appropriate indexing techniques have to be used. Typically, the similarity problem is related to the indexing problem: simple (and possibly inaccurate) similarity measures are usually easy to build indexes for, while more sophisticated similarity measures make the indexing problem hard and interesting. We examine the indexing techniques that can be used for the different models, and the dimensionality reduction techniques that have been proposed to improve indexing performance. A time series of length n can be considered as a tuple in an n-dimensional space. Indexing this space directly is inefficient because of the very high dimensionality.
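Among the elastic measures surveyed, time warping is the canonical example; a minimal sketch of the standard dynamic-programming formulation of dynamic time warping (DTW) follows. This is the textbook O(n·m) recurrence, not a specific algorithm from the tutorial:

```python
def dtw(x, y):
    """Dynamic time warping distance between two series of possibly
    different lengths, computed by dynamic programming. Each point of
    x may align with several consecutive points of y (and vice versa),
    allowing elastic matches that the L_p norms forbid."""
    n, m = len(x), len(y)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three admissible alignments.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

The quadratic cost of this recurrence is one reason sophisticated measures make the indexing problem "hard and interesting", as noted above.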
The main idea for improving on this is to use a dimensionality reduction technique that takes the time series of n items and maps it to a lower-dimensional space with k dimensions (hopefully, k << n). We give a detailed description of the most important techniques used for dimensionality reduction. These include: the SVD decomposition, the Fourier transform (and the similar Discrete Cosine transform), the Wavelet decomposition, Multidimensional Scaling, random projection techniques, FastMap (and variants), and Linear partitioning. These techniques have specific strengths and weaknesses, making some of them better suited for particular applications and settings. Finally, we consider extensions to the problem of indexing subsequences, as well as to the problem of finding similar high-dimensional sequences, such as trajectories or video frame sequences.
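The Fourier-based reduction mentioned above can be sketched as keeping only the first k coefficients of the discrete Fourier transform; the implementation below is a minimal illustrative version (the normalization and function name are assumptions, not from the tutorial):

```python
import cmath

def dft_reduce(x, k):
    """Reduce series x to its first k DFT coefficients. With the
    1/sqrt(n) normalization, Parseval's theorem implies the Euclidean
    distance between reduced representations lower-bounds the true
    distance, so an index built on them produces no false dismissals."""
    n = len(x)
    coeffs = []
    for f in range(k):
        c = sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                for t in range(n)) / (n ** 0.5)
        coeffs.append(c)
    return coeffs
```

In practice k is a small constant (often under 10 for smooth series), so the k-dimensional representations can be indexed with a standard spatial index.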
The ability to answer aggregation queries approximately, yet accurately and efficiently, is of great benefit for decision support and data mining tools. In contrast to previous sampling-based studies, we treat the problem as an optimization problem whose goal is to minimize the error in answering the queries of a given workload. A key novelty of our approach is that we can tailor the choice of samples to be robust even for workloads that are "similar" but not necessarily identical to the given workload. Finally, our techniques take into account the variance in the data distribution in a principled manner. We show how our solution can be implemented on a database system, and present results of extensive experiments on Microsoft SQL Server 2000 that demonstrate the superior quality of our method compared to previous work.
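For context, the baseline the paper improves upon is uniform-sampling estimation of aggregates; a minimal sketch follows. This shows only the generic scale-up estimator, not the paper's workload-aware sample selection, and the function name is illustrative:

```python
import random

def approximate_sum(table, fraction, seed=0):
    """Approximate SUM over a column: draw a uniform sample containing
    roughly `fraction` of the rows and scale the sample sum by
    1/fraction. Workload-aware methods instead bias the sample toward
    rows the expected queries touch, reducing error for that workload."""
    rng = random.Random(seed)
    sample = [v for v in table if rng.random() < fraction]
    if not sample:
        return 0.0
    return sum(sample) / fraction
```

Uniform sampling ignores both the workload and the data's variance, which is precisely the gap the optimization-based formulation above addresses.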
Machine learning has become an essential toolkit for complex analytic processing. Data is typically stored in large data warehouses with multiple dimension hierarchies. Often, the data used for building an ML model is aligned on OLAP hierarchies such as location or time. In this paper, we investigate the feasibility of efficiently constructing approximate ML models for new queries from previously constructed ML models by leveraging the concepts of model materialization and reuse. For example, is it possible to construct an approximate ML model for data from the year 2017 if one already has ML models for each of its quarters? We propose algorithms that support a wide variety of ML models, including generalized linear models for classification along with K-Means and Gaussian Mixture models for clustering. We propose a cost-based optimization framework that identifies appropriate ML models to combine at query time, and conduct extensive experiments on real-world and synthetic datasets. Our results indicate that our framework can support analytic queries on ML models with superior performance, achieving dramatic speedups of several orders of magnitude on very large datasets.
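The quarters-to-year example above rests on the general idea of merging per-partition summaries without revisiting the raw data. A minimal sketch for the simplest such summary, a (count, mean) pair, is shown below; this illustrates the reuse principle only, not the paper's algorithms for GLMs or mixture models:

```python
def merge_means(partitions):
    """Merge per-partition (count, mean) sufficient statistics into
    the (count, mean) of their union -- e.g. combining four quarterly
    summaries into a yearly one without touching the raw rows."""
    total_n = sum(n for n, _ in partitions)
    if total_n == 0:
        return 0, 0.0
    merged_mean = sum(n * m for n, m in partitions) / total_n
    return total_n, merged_mean
```

Richer models require richer statistics (e.g. per-cluster counts, sums, and squared sums for K-Means or Gaussian mixtures), but the query-time pattern is the same: combine materialized pieces instead of retraining from scratch.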