Building a distributed full-text index for the Web

Melnik, Sergey; Raghavan, S. V.; Yang, Beverly; García-Molina, Héctor

doi:10.1145/371920.372095

Cited by 104 publications

(93 citation statements)

References 30 publications


“…The disadvantages are: (1) the entire document collection has to be crawled at periodic intervals, and (2) every word in the collection has to be scanned to construct the inverted file. Distributed rebuilding techniques, such as [24], parallelizes (and pipelines) the process, but does not eliminate the need of scanning every word in every document. If the magnitude of change is small, scanning and re-indexing the words in documents that did not change is wasteful and unnecessary.…”

Section: Current Methods (mentioning)

confidence: 99%

Efficient Update of Indexes for Dynamically Changing Web Documents

et al. 2007


“…To build distributed index [4,5], the matrix should be partitioned, and then each sub matrix is distributed to each index server. Currently, there are two partitioning schemes for distributed inverted files.…”

Section: Two Inverted File Partitioning Scheme For Distributed Index (mentioning)

confidence: 99%

A Two-Tier Distributed Full-Text Indexing System

Zhang, Chen, He et al. 2014

Appl. Math. Inf. Sci.


Abstract: The performance of indexing systems is very important for a search engine. Usually, indexing systems on large-scale clusters can provide high search efficiency, but it brings expensive hardware costs. The costs would be greatly reduced if a distributed indexing system runs on small-scale clusters connected by the Internet. Two current inverted file partitioning schemes: document partitioning and term partitioning, have their merits individually. A two-tier distributed full-text indexing system is implemented, which uses document partitioning among the clusters and term partitioning inside each cluster. Our experiments show that the system performs well in search efficiency, resource consuming and load balance.
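The two partitioning schemes named in this abstract can be illustrated with a minimal Python sketch. It is not the implementation of any of the cited systems: the function names, the toy documents, the modulo assignment of documents to servers, and the CRC32 assignment of terms to servers are all assumptions made for illustration.

```python
from collections import defaultdict
import zlib


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def partition_by_document(docs, num_servers):
    """Document partitioning: each server indexes only its own documents."""
    shards = [{} for _ in range(num_servers)]
    for doc_id, text in docs.items():
        # Assumed placement rule: document id modulo the server count.
        shards[doc_id % num_servers][doc_id] = text
    return [build_inverted_index(shard) for shard in shards]


def partition_by_term(docs, num_servers):
    """Term partitioning: each server holds full posting lists for some terms."""
    full_index = build_inverted_index(docs)
    shards = [{} for _ in range(num_servers)]
    for term, postings in full_index.items():
        # Assumed placement rule: a stable hash of the term.
        server = zlib.crc32(term.encode("utf-8")) % num_servers
        shards[server][term] = postings
    return shards


# Hypothetical toy collection.
docs = {
    0: "distributed full text index",
    1: "inverted index for the web",
    2: "web search engine",
    3: "distributed search",
}
by_doc = partition_by_document(docs, 2)
by_term = partition_by_term(docs, 2)
```

Under document partitioning a query term such as "distributed" has a partial posting list on every server that holds a matching document, so a query is sent to all servers and the results are merged. Under term partitioning the complete posting list for a term lives on exactly one server, so a query touches only the servers that own its terms.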


“…[15]). In a real (Semantic) Web search engine, advanced techniques such as MapReduce [16] can be applied on a cluster of machines to speed up the index building process, thanks to the simplicity of the IR index structure.…”

Section: Fig 2 Posidx Index Structure Example (mentioning)

confidence: 99%

Semplore: An IR Approach to Scalable Hybrid Query of Semantic Web Data

et al. 2007


Abstract. As an extension to the current Web, Semantic Web will not only contain structured data with machine understandable semantics but also textual information. While structured queries can be used to find information more precisely on the Semantic Web, keyword searches are still needed to help exploit textual information. It thus becomes very important that we can combine precise structured queries with imprecise keyword searches to have a hybrid query capability. In addition, due to the huge volume of information on the Semantic Web, the hybrid query must be processed in a very scalable way. In this paper, we define such a hybrid query capability that combines unary tree-shaped structured queries with keyword searches. We show how existing information retrieval (IR) index structures and functions can be reused to index semantic web data and its textual information, and how the hybrid query is evaluated on the index structure using IR engines in an efficient and scalable manner. We implemented this IR approach in an engine called Semplore. Comprehensive experiments on its performance show that it is a promising approach. It leads us to believe that it may be possible to evolve current web search engines to query and search the Semantic Web. Finally, we briefly describe how Semplore is used for searching Wikipedia and an IBM customer's product information.
