Abstract. Considering the scalability and semantic requirements, the Resource Description Framework (RDF) and its de facto query language SPARQL are well suited to managing and querying online social network (OSN) data. Although some existing works have introduced distributed frameworks for querying large-scale data, improving online query performance remains a challenging task. To address this problem, this paper proposes a scalable RDF data framework that uses a key-value store for offline RDF storage and a pipelined in-memory query strategy. The proposed framework efficiently supports SPARQL Basic Graph Pattern (BGP) queries on large-scale datasets. Experiments on a benchmark dataset demonstrate that the online SPARQL query performance of our framework exceeds that of existing distributed RDF solutions.
Keywords: RDF · SPARQL · Social networks · Query processing
Introduction. With the rapid development of web social network applications such as Facebook, Twitter and Microblog, a large amount of user-linked data is generated. Such data is characterized by large volume and complicated structure, so how to manage OSN data effectively is a hot topic in academic and industrial research. RDF, which was designed for the Semantic Web, offers scalability and flexibility, and SPARQL can express BGP queries over RDF that apply directly to OSN subgraph queries. In general, the nature of the RDF model makes it suitable for managing large-scale, complex OSN data. Figure 1 illustrates an example of a fraction of an OSN graph representing relations between users and user-generated content. A query that finds pairs of users connected by a friend relationship, where user1 likes blog1 and blog1 was created by user2, can be expressed in SPARQL as a BGP.
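To make the shape of such a subgraph query concrete, the following is a minimal sketch in Python rather than the paper's own query or framework. It assumes a direct friend edge between the two users, and every name in it (`friendOf`, `likes`, `createdBy`, the sample users and posts, and the helper functions) is hypothetical. It evaluates a three-pattern BGP over a tiny in-memory triple list by joining on shared variables.

```python
# Toy triples. Every name here (friendOf, likes, createdBy, alice, ...) is
# hypothetical and chosen only for illustration.
TRIPLES = [
    ("alice", "friendOf", "bob"),
    ("bob", "friendOf", "carol"),
    ("alice", "likes", "post1"),
    ("post1", "createdBy", "bob"),
    ("bob", "likes", "post2"),
    ("post2", "createdBy", "dave"),
]

# Illustrative BGP, roughly: ?u1 friendOf ?u2 . ?u1 likes ?b . ?b createdBy ?u2
BGP = [
    ("?u1", "friendOf", "?u2"),
    ("?u1", "likes", "?b"),
    ("?b", "createdBy", "?u2"),
]


def is_var(term):
    """A term is a variable if it starts with '?', as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Return `binding` extended so that `pattern` matches `triple`, else None."""
    new_binding = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check it against its existing value.
            if new_binding.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new_binding


def evaluate_bgp(bgp, triples):
    """Evaluate triple patterns left to right, joining on shared variables."""
    bindings = [{}]
    for pattern in bgp:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


results = evaluate_bgp(BGP, TRIPLES)
print(results)
```

The nested-loop join here scans every triple for every partial binding, which is only workable for toy data. A framework of the kind described would presumably replace the scan with indexed lookups in the key-value store.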
Relational databases are widely adopted for RDF (Resource Description Framework) data management. For efficient SPARQL query evaluation, the legacy query optimizer needs to be reconsidered. One vital problem is how to tackle the suboptimal query plans caused by error-prone cardinality estimation. Given the schema-free nature of RDF data and the join-intensive character of SPARQL queries, determining an optimal execution order before the query is actually evaluated is costly or even infeasible, especially for complex queries on large-scale data. In this paper, we propose ROSIE, a Runtime Optimization framework that iteratively re-optimizes the SPARQL query plan according to the actual cardinality derived from Incremental partial query Evaluation. By introducing an approach for heuristic-based plan generation, as well as a mechanism to detect cardinality estimation errors at runtime, ROSIE efficiently relieves the problem of biased cardinality propagation and is thus more resilient in complex query evaluation. Extensive experiments on real and benchmark data show that ROSIE consistently outperforms the state of the art on complex queries by orders of magnitude.