Yan-Nei Law scite author profile

Abstract. In this paper, we propose an incremental classification algorithm which uses a multi-resolution data representation to find adaptive nearest neighbors of a test point. The algorithm achieves excellent performance by using small classifier ensembles where approximation error bounds are guaranteed for each ensemble size. The very low update cost of our incremental classifier makes it highly suitable for data stream applications. Tests performed on both synthetic and real-life data indicate that our new classifier outperforms existing algorithms for data streams in terms of accuracy and computational costs.

show abstract

Query Languages and Data Models for Database Sequences and Data Streams

Law¹,

Wang

Zaniolo³

2004

View full text Add to dashboard Cite

Improving the accuracy of continuous aggregates and mining queries on data streams under load shedding

Law

Zaniolo²

2008

IJBIDM

View full text Add to dashboard Cite

Random samples are common in data streams applications due to limitations in data sources and transmission lines, or to load-shedding policies. Here we introduce a formal error model and show that, besides providing accurate estimates, it improves query answer accuracy by exploiting past statistics. The method is general, robust in the presence of concept drift, and minimises uncertainties due to sampling with negligible time and space overhead. We describe the application of the method, and the results obtained for SQL window aggregates, statistical aggregates such as quantiles, and data mining functions such as k-means clustering and naive Bayesian classifiers.

show abstract

Relational languages and data models for continuous queries on sequences and data streams

Law

Wang

Zaniolo

2011

ACM Trans. Database Syst.

View full text Add to dashboard Cite

Most data stream management systems are based on extensions of the relational data model and query languages, but rigorous analyses of the problems and limitations of this approach, and how to overcome them, are still wanting. In this article, we elucidate the interaction between stream-oriented extensions of the relational model and continuous query language constructs, and show that the resulting expressive power problems are even more serious for data streams than for databases. In particular, we study the loss of expressive power caused by the loss of blocking query operators, and characterize nonblocking queries as monotonic functions on the database. Thus we introduce the notion of N B-completeness to assure that a query language is as suitable for continuous queries as it is for traditional database queries. We show that neither RA nor SQL are N B-complete on unordered sets of tuples, and the problem is even more serious when the data model is extended to support order-a sine-qua-non in data stream applications. The new limitations of SQL, compounded with well-known problems in applications such as sequence queries and data mining, motivate our proposal of extending the language with user-defined aggregates (UDAs). These can be natively coded in SQL, according to simple syntactic rules that set nonblocking aggregates apart from blocking ones. We first prove that SQL with UDAs is Turing complete. We then prove that SQL with monotonic UDAs and union operators can express all monotonic set functions computable by a Turing machine (N B-completeness) and finally extend this result to queries on sequences ordered by their timestamps. The proposed approach supports data stream models that are more sophisticated than append-only relations, along with data mining queries, and other complex applications.

show abstract

Load Shedding for Window Joins on Multiple Data Streams

Law

Zaniolo²

2007

View full text Add to dashboard Cite

We consider the problem of semantic load shedding for continuous queries containing window joins on multiple data streams and propose a robust approach that is effective with the different semantic accuracy criteria that are required in different applications. In fact, our approach can be used to (i) maximize the number of output tuples produced by joins, and (ii) optimize the accuracy of complex aggregates estimates under uniform random sampling. We first consider the problem of computing maximal subsets of approximate window joins over multiple data streams. Previously proposed approaches are based on multiple pairwise joins and, in their load-shedding decisions, disregard the content of streams outside the joined pairs. To overcome these limitations, we optimize our load-shedding policy using various predictors of the productivity of each tuple in the window. To minimize processing costs, we use a fastand-light sketching technique to estimate the productivity of the tuples. We then show that our method can be generalized to produce statistically accurate samples, as needed in, e.g., the computation of averages, quantiles, and stream mining queries. Tests performed on both synthetic and reallife data demonstrate that our method outperforms previous approaches, while requiring comparable amounts of time and space.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Yan-Nei Law

An Adaptive Nearest Neighbor Classification Algorithm for Data Streams

Query Languages and Data Models for Database Sequences and Data Streams

Improving the accuracy of continuous aggregates and mining queries on data streams under load shedding

Relational languages and data models for continuous queries on sequences and data streams

Load Shedding for Window Joins on Multiple Data Streams

Contact Info

Product

Resources

About