This work performs a thorough characterization and analysis of the open-source Lucene search library. The article describes in detail the architecture, functionality, and micro-architectural behavior of the search engine, and investigates prominent online document search research issues. In particular, we study how intra-server index partitioning affects response time and throughput, explore the potential use of low-power servers for document search, and examine the sources of performance degradation and the causes of tail latencies. Some of our main conclusions are the following: (a) intra-server index partitioning can reduce tail latencies, but with diminishing benefits as incoming query traffic increases; (b) low-power servers, given enough partitioning, can provide the same average and tail response times as conventional high-performance servers; (c) index search is a CPU-intensive, cache-friendly application; and (d) C-states are the main culprits for performance degradation in document search.

Search services are required to provide tight QoS guarantees, such as tail latencies below 500 ms [2], even at peak traffic loads. Previous work has aimed at improving the latency, efficiency, and cost of operation of search services. In the work of Meisner et al. [27], full-system power management is evaluated for a web search workload. To improve energy efficiency, Lo et al. [20] proposed running each server just fast enough to satisfy global latency requirements, whereas Vamanan et al. [33] proposed exploiting time slack by slowing down individual sub-queries. The possibility of using mobile cores for web search, for improved cost and energy efficiency, is studied in the work of Reddi et al. [30]. Ren et al. [31] examined how web search can benefit from heterogeneous cores, whereas Haque et al. [10] and Jeon et al. [15] looked at adaptive parallelism for improving response times. Work stealing for meeting web search target latencies is proposed by Li et al. [17]. Hsu et al. [14] proposed a turbo boost framework that increases CPU voltage and frequency at fine-grained time intervals to reduce the latency of computationally heavy search queries. Other work has collocated search applications with other types of workloads to increase data center utilization [25, 26, 35].

This article presents a thorough top-down characterization of an open-source search engine to improve the overall understanding of search engines. In particular, this work presents a characterization of the Lucene-based Nutch web search benchmark [8] on real hardware, providing insights into the application-level and micro-architectural behavior of this benchmark. This workload is based on the popular Lucene document search engine. Previous characterization efforts of this benchmark focused only on query stream characterization [34] and micro-architectural characterization [8]. Another work conducted with the Nutch benchmark [9] evaluated the performance of intra-server index partitioning and slower cores. However, that work used a small index...
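
To make the partitioning study concrete, the following is a minimal sketch, not the paper's actual harness, of how intra-server index partitioning can be expressed directly with Lucene's API: several on-disk index partitions are wrapped in a MultiReader and searched through an IndexSearcher backed by a thread pool, so a single query is scored across the partitions in parallel within one server. The partition count, index paths, field name, and query term below are illustrative assumptions.

    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class PartitionedSearch {
        public static void main(String[] args) throws Exception {
            int partitions = 4; // assumed partition count
            IndexReader[] readers = new IndexReader[partitions];
            for (int i = 0; i < partitions; i++) {
                // Each partition is an independent on-disk Lucene index
                // (paths are hypothetical).
                readers[i] = DirectoryReader.open(
                        FSDirectory.open(Paths.get("/data/index-part-" + i)));
            }

            // One logical index over all partitions; the executor lets the
            // searcher score the partitions concurrently on separate threads,
            // which is what shortens per-query latency.
            MultiReader multi = new MultiReader(readers);
            ExecutorService pool = Executors.newFixedThreadPool(partitions);
            IndexSearcher searcher = new IndexSearcher(multi, pool);

            // "content" is an assumed field name; the query term is illustrative.
            TopDocs top = searcher.search(
                    new TermQuery(new Term("content", "lucene")), 10);
            System.out.println("total hits: " + top.totalHits);

            pool.shutdown();
            multi.close(); // also closes the per-partition readers
        }
    }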