Scalability of the Nutch search engine

Moreira, José E.; Michael, Maged M.; Silva, Dilma Da; Shiloach, Doron; Dube, Parijat; Zhang, Li

doi:10.1145/1274971.1274975

Cited by 22 publications

(15 citation statements)

References 9 publications

“…We studied the scalability of P_orgl and P_ghtl by doing experiments on IBM intranet website data as used in [1]. The text data was extracted from HTML files and loaded equally into the memory of the producer nodes, before, the indexing time measurement is started.…”

Section: Methods (mentioning)

confidence: 99%

“…We get peak indexing rate within a single-index group(G = 64) of about 2.44 GB/min. Now assuming search is scalable to around 2K nodes [1], if we have 2K such independent indexgroups, each of size 64 nodes, we will get a peak indexing rate of around 5 TB/min, while maintaining acceptable search performance. As part of our experiment, we instead used 8K nodes and got a peak indexing rate 312 GB/min.…”

Section: Strong Scalability Study (mentioning)

confidence: 99%
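
The figures in the statement above can be checked with a short calculation. The following is only a sketch of that arithmetic, assuming that the aggregate rate scales linearly with the number of independent index groups, that "8K" and "2K" mean 8192 and 2048, and that 1 TB is 1024 GB; none of these conventions is stated in the quoted text, and the function and constant names are illustrative.

```python
# Back-of-the-envelope check of the indexing rates quoted above.
# Assumptions (not stated in the source): 8K = 8192 nodes, 2K = 2048 groups,
# 1 TB = 1024 GB, and linear scaling across independent index groups.

PEAK_RATE_PER_GROUP_GB_MIN = 2.44  # quoted peak rate for one 64-node index group
NODES_PER_GROUP = 64               # quoted group size G


def aggregate_rate_gb_min(num_groups: int) -> float:
    """Aggregate indexing rate if every group sustains the quoted peak rate."""
    return num_groups * PEAK_RATE_PER_GROUP_GB_MIN


# Experiment reported in the quote: 8K nodes split into 64-node groups.
groups_in_experiment = 8192 // NODES_PER_GROUP
experiment_rate_gb_min = aggregate_rate_gb_min(groups_in_experiment)

# Extrapolation in the quote: 2K independent groups of 64 nodes each.
projected_rate_tb_min = aggregate_rate_gb_min(2048) / 1024

print(f"{groups_in_experiment} groups: {experiment_rate_gb_min:.2f} GB/min")
print(f"2048 groups: {projected_rate_tb_min:.2f} TB/min")
```

Under these assumptions, 128 groups give roughly 312 GB/min, which agrees with the reported measurement, and 2048 groups give just under 5 TB/min, which agrees with the quoted estimate.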

“…Indexing Latency Variation 2) Using multiple index groups of same size (8) with total number of nodes increasing up to 512 that indexed total 8GB of data. Both experiments used 1000 queries from query-set used in [1]. For the first experiment P_ghtl got better or similar search performance (2.34s for 64 nodes/group, 1.75s for 32 nodes/group) compared to P_orgl (2.85s for 64 nodes/group, 2.13s for 32 nodes/group).…”

Section: Distributed Search Performance (mentioning)

confidence: 99%

“…[1] claims, using experiments and queuing theory based analytical model, that after 2K nodes in a single cluster the search performance degrades at a fast rate. After this threshold, we need to have hierarchical indexing and search.…”

Section: Introduction (mentioning)

confidence: 99%

Highly scalable algorithm for distributed real-time text indexing

Narang

Agarwal

Kedia

et al. 2009

2009 International Conference on High Performance Computing (HiPC)

“…Nutch is particularly well suited for scaling out with a large number of commodity hardware [32,33].…”

Section: Building An Inverted Index (mentioning)

confidence: 99%

Accelerating text mining workloads in a MapReduce-based distributed GPU environment

Wittek

Daranyi

2013

Journal of Parallel and Distributed Computing

Scientific computations have been using GPU-enabled computers successfully, often relying on distributed nodes to overcome the limitations of device memory. Only a handful of text mining applications benefit from such infrastructure. Since the initial steps of text mining are typically data-intensive, and the ease of deployment of algorithms is an important factor in developing advanced applications, we introduce a flexible, distributed, MapReduce-based text mining workflow that performs I/O-bound operations on CPUs with industry-standard tools and then runs compute-bound operations on GPUs which are optimized to ensure coalesced memory access and effective use of shared memory. We have performed extensive tests of our algorithms on a cluster of eight nodes with two NVidia Tesla M2050 attached to each, and we achieve considerable speedups for random projection and self-organizing maps.
