Scaling up all pairs similarity search

Bayardo, Roberto J.; Ma, Yiming; Srikant, Ramakrishnan

doi:10.1145/1242572.1242591

Cited by 568 publications

(710 citation statements)

References 20 publications

Supporting

Mentioning

697

Contrasting

Unclassified


“…4) and token sim (Eq. 5) that use different string similarity measures; we also compare to AllPairs [2], PP-Join(+) [20] and Ed-Join [19]; lastly, we compare to Naive [15] that detects owl:sameAs links without candidate selection. Since Ed-Join is not compatible with our Sun machine, we run it on a Linux machine (dual-core 2GHz processor and 4GB memory), and estimate its runtime on the Sun machine by examining runtime difference of bigram on the two machines.…”

Section: Evaluation Results On RDF Datasets (mentioning)

confidence: 99%


Automatically Generating Data Linkages Using a Domain-Independent Candidate Selection Approach

Song

Heflin

2011

The Semantic Web – ISWC 2011


Abstract. One challenge for Linked Data is scalably establishing high-quality owl:sameAs links between instances (e.g., people, geographical locations, publications, etc.) in different data sources. Traditional approaches to this entity coreference problem do not scale because they exhaustively compare every pair of instances. In this paper, we propose a candidate selection algorithm for pruning the search space for entity coreference. We select candidate instance pairs by computing a character-level similarity on discriminating literal values that are chosen using domain-independent unsupervised learning. We index the instances on the chosen predicates' literal values to efficiently look up similar instances. We evaluate our approach on two RDF and three structured datasets. We show that the traditional metrics don't always accurately reflect the relative benefits of candidate selection, and propose additional metrics. We show that our algorithm frequently outperforms alternatives and is able to process 1 million instances in under one hour on a single Sun Workstation. Furthermore, on the RDF datasets, we show that the entire entity coreference process scales well by applying our technique. Surprisingly, this high-recall, low-precision filtering mechanism frequently leads to higher F-scores in the overall system.


“…All-Pairs [2], PP-Join(+) [20] and Ed-Join [19] are all inverted index based approaches. All-Pairs is a simple index based algorithm with certain optimization strategies.…”

Section: Related Work (mentioning)

confidence: 99%

Automatically Generating Data Linkages Using a Domain-Independent Candidate Selection Approach

Song

Heflin

2011

The Semantic Web – ISWC 2011
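The citation statement above describes AllPairs, PP-Join(+) and Ed-Join as inverted-index-based approaches. As a rough illustration of that general idea only, the following Python sketch builds a token-to-record index incrementally and compares a record only against earlier records that share at least one token with it. It is a minimal baseline under stated assumptions: the function names, the token-set representation, and the choice of Jaccard similarity are hypothetical and chosen for illustration, and the sketch does not reproduce the optimization strategies of any of the cited algorithms.

```python
def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def all_pairs_inverted_index(records, threshold):
    """Return {(i, j): sim} for i < j with Jaccard similarity >= threshold.

    Illustrative baseline (hypothetical helper, not a published algorithm):
    candidates come from an inverted index (token -> ids of earlier records),
    so pairs that share no token are never compared. Empty records therefore
    never match anything.
    """
    index = {}      # token -> list of record ids indexed so far
    results = {}
    token_sets = [set(tokens) for tokens in records]
    for j, token_set in enumerate(token_sets):
        # Look up every earlier record sharing at least one token.
        candidates = set()
        for tok in token_set:
            candidates.update(index.get(tok, ()))
        # Verify each candidate with the exact similarity.
        for i in candidates:
            sim = jaccard(token_sets[i], token_set)
            if sim >= threshold:
                results[(i, j)] = sim
        # Index the current record only after probing, so each pair is seen once.
        for tok in token_set:
            index.setdefault(tok, []).append(j)
    return results


if __name__ == "__main__":
    docs = [["a", "b", "c"], ["a", "b", "d"], ["x", "y"], ["a", "b", "c"]]
    print(all_pairs_inverted_index(docs, 0.5))
```

Probing the index before inserting the current record means each qualifying pair is reported once, with the smaller record id first. The cost still depends on how many records share frequent tokens, which is the kind of overhead the optimizations in the cited algorithms aim to reduce.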


“…FastJoin [22] adopts fuzzy matching techniques that consider both token and character level similarity. Similar algorithms also include AllPairs [2] and IndexChunk [14]. Although our proposed candidate selection algorithm also adopts indexing techniques, a secondary filtering on the looked-up candidates from the index significantly reduces the size of the final candidate set.…”

Section: Related Work (mentioning)

confidence: 99%

Scalable and Domain-Independent Entity Coreference: Establishing High Quality Data Linkages across Heterogeneous Data Sources

Song

2012

The Semantic Web – ISWC 2012


Abstract. Due to the decentralized nature of the Semantic Web, the same real world entity may be described in various data sources and assigned syntactically distinct identifiers. In order to facilitate data utilization in the Semantic Web, without compromising the freedom of people to publish their data, one critical problem is to appropriately interlink such heterogeneous data. This interlinking process can also be referred to as Entity Coreference, i.e., finding which identifiers refer to the same real world entity. This proposal will investigate algorithms to solve this entity coreference problem in the Semantic Web in several aspects. The essence of entity coreference is to compute the similarity of instance pairs. Given the diversity of domains of existing datasets, it is important that an entity coreference algorithm be able to achieve good precision and recall across domains represented in various ways. Furthermore, in order to scale to large datasets, an algorithm should be able to intelligently select what information to utilize for comparison and determine whether to compare a pair of instances to reduce the overall complexity. Finally, appropriate evaluation strategies need to be chosen to verify the effectiveness of the algorithms.


“…The indexing scheme we developed is inspired by redundant indexing methods such as LSH [4], RBV [7], OMEDRANK [3] or PvS [5], and by proposals addressing similarity joins, like [1] and [2]. We divide the database of keyframe signatures into segments (or buckets) such that, in each segment, the similarity between any two signatures is above a threshold; the search for similar keyframes is then only performed within each bucket.…”

Section: Keyframe Indexing For Off-line or Online Mining (mentioning)

confidence: 99%

Fast Content-Based Mining of Web2.0 Videos

Poullot

Crucianu

Buisson

2008

Advances in Multimedia Information Processing - PCM 2008


Abstract. The accumulation of many transformed versions of the same original videos on Web2.0 sites has a negative impact on the quality of the results presented to the users and on the management of content by the provider. An automatic identification of such content links between video sequences can address these difficulties. We put forward a fast solution to this video mining problem, relying on a compact keyframe descriptor and an adapted indexing solution. Two versions are developed, an off-line one for mining large databases and an online one to quickly post-process the results of keyword-based interactive queries. After demonstrating the reliability of the method on a ground truth, the scalability on a database of 10,000 hours of video and the speed on 3 interactive queries, some results obtained on Web2.0 content are illustrated.

