The skyline query over uncertain data streams, as an important aspect of big data analysis, plays a significant role in domains such as environment monitoring, decision-making, and data mining. Under the sliding window model, the skyline query over uncertain data streams always focuses on the most recent N streaming items, so it cannot serve queries over different window scales at the same time. To improve query flexibility and efficiency, we propose an efficient parallel method for processing uncertain n-of-N skyline queries; that is, computing the skyline for the most recent n (∀n ≤ N) items in parallel. Specifically, we first propose a framework for parallelizing the query computation for uncertain n-of-N skylines. Furthermore, we put forward a sliding window partitioning strategy and a streaming item mapping strategy to balance the load across nodes. In addition, we define a spatial index structure, RST, based on the R-tree to organize the elements within each individual sliding window and the candidate set in each window, which can significantly speed up dominance tests. Most importantly, we provide an interval encoding scheme that transforms the n-of-N query into a stabbing query in each compute node, which greatly reduces the query scope and improves query efficiency. We also use a red-black tree, named RBI, to store all stabbing intervals. Extensive experimental results demonstrate that the proposals are efficient and can meet the query requirements of users in real applications.

KEYWORDS
data streams, n-of-N model, parallel queries, skyline queries, uncertain data
INTRODUCTION

With the fast development of computer technology and easily available network services, uncertain data queries have received extensive attention in a large number of practical applications in domains such as location-based services,1 RFID networks,2 online shopping,3 and radar detection.4 Uncertain data is inherent in these applications due to various factors,5 such as data randomness and incompleteness, limitations of measuring equipment, data loss during transmission, and interference from the external environment. Moreover, uncertain data in these applications is often generated dynamically and continuously and gradually evolves into uncertain data streams. For example, in online shopping applications, information about goods is usually updated continuously, and uncertain data such as satisfaction scores from customer feedback is collected from multiple web sites dynamically. As another example, in resource detection with radar detection and ranging, large volumes of geological and oceanographic data are generated continuously and transmitted to the processing systems in real time. Therefore, it is very important to analyze large collections of uncertain streaming data efficiently, given the significance of such real applications and the characteristics of uncertain data streams, such as real-time arrival, data uncertainty, and single-pass scanning.

The skyline query is a typical que...