Proceedings of the 10th International Conference on World Wide Web 2001
DOI: 10.1145/371920.372095

Building a distributed full-text index for the Web

Abstract: We identify crucial design issues in building a distributed inverted index for a large collection of web pages. We introduce a novel pipelining technique for structuring the core index-building system that substantially reduces the index construction time. We also propose a storage scheme for creating and managing inverted files using an embedded database system. We suggest and compare different strategies for collecting global statistics from distributed inverted indexes. Finally, we present performance resul…

Cited by 104 publications (93 citation statements)
References: 30 publications
“…The disadvantages are: (1) the entire document collection has to be crawled at periodic intervals, and (2) every word in the collection has to be scanned to construct the inverted file. Distributed rebuilding techniques, such as [24], parallelizes (and pipelines) the process, but does not eliminate the need of scanning every word in every document. If the magnitude of change is small, scanning and re-indexing the words in documents that did not change is wasteful and unnecessary.…”
Section: Current Methods (mentioning)
confidence: 99%
“…To build distributed index [4,5], the matrix should be partitioned, and then each sub matrix is distributed to each index server. Currently, there are two partitioning schemes for distributed inverted files.…”
Section: Two Inverted File Partitioning Scheme For Distributed Index (mentioning)
confidence: 99%
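
The statement above refers to two partitioning schemes without naming them. The two usually described for distributed inverted files are document partitioning (local inverted files) and term partitioning (global inverted files); that these are the two the citing paper means is an assumption. The following is a minimal Python sketch of both under that assumption. The function names, the round-robin assignment, and the toy documents are hypothetical and are not taken from the cited paper.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to a sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def partition_by_document(docs, num_servers):
    """Local inverted files: each server indexes a disjoint subset of documents."""
    shards = [{} for _ in range(num_servers)]
    for doc_id, text in docs.items():
        # Hypothetical assignment rule: document id modulo server count.
        shards[doc_id % num_servers][doc_id] = text
    return [build_index(shard) for shard in shards]


def partition_by_term(docs, num_servers):
    """Global inverted files: each server holds complete posting lists
    for a disjoint subset of terms."""
    full = build_index(docs)
    shards = [{} for _ in range(num_servers)]
    # Hypothetical assignment rule: round-robin over the sorted vocabulary.
    for i, term in enumerate(sorted(full)):
        shards[i % num_servers][term] = full[term]
    return shards


# Toy collection (illustrative data only).
docs = {0: "web index", 1: "web crawler", 2: "inverted index", 3: "web search"}
local_shards = partition_by_document(docs, 2)
global_shards = partition_by_term(docs, 2)
```

In this sketch, a query for "web" under document partitioning must be sent to every server and the partial posting lists merged, whereas under term partitioning a single server holds the complete list for that term.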
“…[15]). In a real (Semantic) Web search engine, advanced techniques such as MapReduce [16] can be applied on a cluster of machines to speed up the index building process, thanks to the simplicity of the IR index structure.…”
Section: Fig 2 Posidx Index Structure Example (mentioning)
confidence: 99%