Efficient management of transitive relationships in large data and knowledge bases

Agrawal, Rakesh; Borgida, Alex; Jagadish, H. V.

doi:10.1145/67544.66950

Cited by 250 publications

(202 citation statements)

References 9 publications

Supporting

Mentioning

202

Contrasting

“…First, we explain important notations that we will use in our algorithm description. QueryList is Algorithm 1 ReachableToUnReachable(QueryList) 1: QueryList ← Given query set 2: EvalQueryList ← Declare empty list 3: QueryList ← Pre-process(QueryList) 4: for all Edit e (s → t) in edit stream do…”

Section: B Algorithm (mentioning)

confidence: 99%

CoUPE: Continuous Query Processing Engine for Evolving Graphs

Mullangi, Ramaswamy

2015

2015 IEEE International Congress on Big Data

Continuously Evolving Graphs (CEGs), graphs whose connectivity constantly changes over time, are important in many domains such as social networks, evolutionary genomics, and communication networks. In many of these domains it is important to keep track of connectivity among the nodes of interest as the underlying structure changes over time. While interval-based indexing has been a popular strategy for testing reachability in static graphs, it cannot be directly applied in the context of evolving graphs. In this paper, we propose CoUPE (Continuous qUery Processing Engine), which, to the best of our knowledge, is the first time-efficient framework for answering continuous reachability queries in evolving graphs. The main idea is to maintain the indices of the evolving graph and recalculate only a subset of reachability queries, using a novel heuristic to determine which queries change state because of the most recent change in the graph. We make three novel contributions in designing CoUPE. First, we introduce a generic indexing and querying framework for answering continuous queries in large time-evolving graphs. Second, we design a highly efficient, scalable, and provably correct algorithm for updating the indices of the graph by analyzing the changes happening on the graph. Third, we present a novel heuristic-based technique for identifying which subset of existing queries might be affected by the most recent edit. This paper also presents a detailed experimental study demonstrating the scalability and efficiency of the processing engine.
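
The idea of re-evaluating only a subset of continuous reachability queries after each edit can be illustrated with a minimal Python sketch. It is not CoUPE's index or heuristic: it uses plain BFS and relies only on the general fact that inserting an edge can never make a reachable pair unreachable, and deleting an edge can never make an unreachable pair reachable. The function names and the `("add" | "del", u, v)` edit format are assumptions made for this illustration.

```python
from collections import defaultdict, deque


def reachable(adj, s, t):
    """Plain BFS from s; returns True if t can be reached (s reaches itself)."""
    seen, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False


def process_edits(edges, queries, edits):
    """Apply edits in order; return a snapshot of all query states after each edit.

    edits is a list of ("add" | "del", u, v) tuples (hypothetical format).
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
    state = {q: reachable(adj, *q) for q in queries}
    history = []
    for op, u, v in edits:
        if op == "add":
            adj[u].add(v)
            # An insertion can only turn unreachable queries reachable.
            candidates = [q for q in queries if not state[q]]
        else:
            adj[u].discard(v)
            # A deletion can only turn reachable queries unreachable.
            candidates = [q for q in queries if state[q]]
        for q in candidates:
            state[q] = reachable(adj, *q)
        history.append(dict(state))
    return history
```

For example, with edges `a -> b` and `c -> d` and the queries `(a, d)` and `(a, b)`, adding `b -> c` triggers a re-check of `(a, d)` only, while a later deletion of `a -> b` triggers a re-check of both queries, since both are reachable at that point.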

“…In the graph database community, researchers have been designing algorithms for efficiently indexing graph databases for answering reachability (e.g., [30, 31, 32, 33, 34, 35, 36, 8, 28, 37]), distance (e.g., [9, 10, 11, 12]) and shortest path queries (e.g., [38, 39]). Indexing schemes use additional labels built for a graph database to quickly answer queries.…”

Section: Related Work (mentioning)

confidence: 99%

k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery

Xiang, James, et al.

2012

Journal of Biomedical Informatics

The Unified Medical Language System (UMLS) is the largest thesaurus in the biomedical informatics domain. Previous works have shown that knowledge constructs comprised of transitively-associated UMLS concepts are effective for discovering potentially novel biomedical hypotheses. However, the extremely large size of the UMLS becomes a major challenge for these applications. To address this problem, we designed a k-neighborhood Decentralization Labeling Scheme (kDLS) for the UMLS, and the corresponding method to effectively evaluate the kDLS indexing results. kDLS provides a comprehensive solution for indexing the UMLS for very efficient large scale knowledge discovery. We demonstrated that it is highly effective to use kDLS paths to prioritize disease-gene relations across the whole genome, with extremely high fold-enrichment values. To our knowledge, this is the first indexing scheme capable of supporting efficient large scale knowledge discovery on the UMLS as a whole. Our expectation is that kDLS will become a vital engine for retrieving information and generating hypotheses from the UMLS for future medical informatics applications.

“…A multi-interval code to encode all reachability information in DAGs is given in [24]. Wang et al. studied processing $T_X \bowtie_{X \hookrightarrow Y} T_Y$ over a directed graph [23] and proposed a join algorithm, called IGMJ.…”

Section: Sort-merge-based Multijoin (mentioning)

confidence: 99%

“…First, it constructs a DAG $G'$ by condensing a maximal strongly connected component in $G_D$ as a node in $G'$. Second, it generates a multi-interval code for a node in $G'$ in [24]. As its name implies, the multi-interval code for encoding DAG [24] is to assign a set of intervals and a postorder number to each node in DAG $G'$.…”

Section: Sort-merge-based Multijoin (mentioning)

confidence: 99%
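
The multi-interval idea described in the statement above (a postorder number plus a set of intervals per DAG node) can be sketched in Python as follows. This is a simplified illustration, not the exact construction of the cited work: it assumes the input is already a DAG (strongly connected components already condensed), uses the DFS tree as the spanning forest, and merges overlapping or adjacent intervals. The names `build_interval_labels`, `merge`, and `reaches` are hypothetical.

```python
def merge(pairs):
    """Merge overlapping or adjacent integer intervals into a minimal sorted list."""
    merged = []
    for low, high in sorted(pairs):
        if merged and low <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))
    return merged


def build_interval_labels(dag):
    """dag maps each node to a list of successors and must be acyclic.

    Returns (post, intervals): a postorder number per node and, per node,
    a list of (low, high) intervals covering the postorder numbers of
    every node reachable from it.
    """
    post, intervals = {}, {}
    counter = 0

    def visit(u):
        nonlocal counter
        low = counter + 1  # smallest postorder number in u's DFS subtree
        inherited = []
        for v in dag.get(u, ()):
            if v not in post:
                visit(v)
            # In a DAG every successor is finished here, so its label is complete.
            inherited.extend(intervals[v])
        counter += 1
        post[u] = counter
        intervals[u] = merge([(low, counter)] + inherited)

    for node in dag:
        if node not in post:
            visit(node)
    return post, intervals


def reaches(post, intervals, u, v):
    """u reaches v iff v's postorder number falls inside one of u's intervals."""
    return any(low <= post[v] <= high for low, high in intervals[u])
```

A reachability test then reduces to a lookup in a short interval list instead of a graph traversal. The recursive DFS is adequate for small examples; a large graph would need an iterative traversal.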

Graph Pattern Matching: A Join/Semijoin Approach

Cheng

2011

IEEE Trans. Knowl. Data Eng.

Due to rapid growth of the Internet and new scientific/technological advances, there exist many new applications that model data as graphs, because graphs have sufficient expressiveness to model complicated structures. The dominance of graphs in real-world applications demands new graph processing techniques to access large data graphs effectively and efficiently. In this paper, we study a graph pattern matching problem, which is to find all patterns in a large data graph that match a user-given graph pattern. We propose new two-step R-join (reachability join) algorithms with a filter step (R-semijoin) and a fetch step (R-join) by utilizing a new cluster-based join index with graph codes in a relational database context. We also propose two optimization approaches to further optimize sequences of R-joins/R-semijoins. The first approach is based on R-join order selection followed by R-semijoin enhancement, and the second approach is to interleave R-joins with R-semijoins. We conducted extensive performance studies, and confirm the efficiency of our proposed new approaches.

Given the transitive closure $TC$ computed, a reachability condition $X \hookrightarrow Y$ can be processed as an equijoin in SQL, and graph pattern matching can be processed using a sequence of equijoins. However, this requires either computing $TC$ online or materializing $TC$ by precomputing. Both are infeasible, because the former incurs high computational cost and the latter requires huge space.

In this work, instead, we maintain a data graph $G_D$ with $|\Sigma|$ labels in a relational database $G_{DB}$ using $|\Sigma|$ relations. In brief, for each label $X \in \Sigma$, we create a relation, denoted $T_X$, to maintain the extent of $X$-labeled nodes in $G_D$. Because transitive closure is essential for processing graph pattern matching, we maintain the transitive closure, $TC$, using graph coding, called 2-hop labeling [8], in the relations in $G_{DB}$.

A 2-hop labeling is a compressed representation of transitive closure [8], which assigns every node $v$ in graph $G_D$ a label $L(v) = (L_{in}(v), L_{out}(v))$, where $L_{in}(v), L_{out}(v) \subseteq V(G_D)$, and $u \mapsto v$ is true if and only if $L_{out}(u) \cap L_{in}(v) \neq \emptyset$. A 2-hop labeling for $G_D$ is derived from a 2-hop cover of $G_D$, that minimizes a set of $S(U_w, w, V_w)$, as a set cover problem. Here, $w \in V(G_D)$ is called a center, and $U_w, V_w \subseteq V(G_D)$. $S(U_w, w, V_w)$ implies that, for every node $u \in U_w$ and $v \in V_w$, $u \mapsto w$ and $w \mapsto v$, and therefore $u \mapsto v$. Consider Fig. 2: an example is $S(U_w, w, V_w) = S(\{b_3, b_4\}, c_2, \{e_0\})$. Here, $c_2$ is the center. It indicates: $b_3 \mapsto c_2$, $b_4 \mapsto c_2$, $c_2 \mapsto e_0$, $b_3 \mapsto e_0$, and $b_4 \mapsto e_0$. Several algorithms were proposed to quickly compute a 2-hop cover for $G_D$ [9], [10], [11], [12] and to maintain such a computed 2-hop cover [10], [13]. Let $H = \{S_{w_1}, S_{w_2}, \ldots\}$ be the set of 2-hop cover computed, where $S_{w_i} = S(U_{w_i}, w_i, V_{w_i})$ and all $w_i$ are centers. The 2-hop labeling for a node $v$ is $L(v) = (L_{in}(v), L_{out}(v))$. Here, $L_{in}(v)$ i...
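
The 2-hop reachability test described in this text (u reaches v exactly when the out-label of u and the in-label of v share a center) can be sketched in Python. The sketch derives labels from an already computed 2-hop cover and does not attempt the set-cover optimization that produces the cover. The function names are hypothetical, and the code assumes the common convention that every node implicitly belongs to its own in-label and out-label, which the excerpt does not state explicitly.

```python
def labels_from_cover(cover):
    """cover is an iterable of (U_w, w, V_w) triples: sources, center, targets.

    Returns (l_in, l_out), two dicts mapping a node to its set of centers.
    """
    l_in, l_out = {}, {}
    for sources, center, targets in cover:
        for u in sources:
            l_out.setdefault(u, set()).add(center)  # u reaches the center
        for v in targets:
            l_in.setdefault(v, set()).add(center)   # the center reaches v
    return l_in, l_out


def reaches(l_in, l_out, u, v):
    """True iff L_out(u) and L_in(v) intersect.

    Assumption: each node counts as a member of its own labels, so that a
    center is reachable from its sources and reaches its targets.
    """
    out_u = l_out.get(u, set()) | {u}
    in_v = l_in.get(v, set()) | {v}
    return bool(out_u & in_v)


# The single cover element quoted in the text: S({b3, b4}, c2, {e0}).
cover = [({"b3", "b4"}, "c2", {"e0"})]
l_in, l_out = labels_from_cover(cover)
```

With these labels, `reaches(l_in, l_out, "b3", "e0")` holds through the shared center `c2`, while `reaches(l_in, l_out, "e0", "b3")` does not, because the two label sets are disjoint.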
