This paper targets the growing area of interactive data analytics engines. We present a system called Getafix that intelligently decides replication levels and replica placement for data segments, in a way that is responsive to the changing popularity of data accessed by incoming queries. We first present an optimal solution to the static version of the problem, which achieves minimality in both makespan and replication factor. Building on the intuition behind this solution, we build the Getafix system to handle queries and segments arriving in real time. We integrated Getafix into Druid, a modern open-source interactive data analytics engine, and we present experimental results using workloads from Yahoo!'s production Druid cluster. Compared to existing work, Getafix achieves comparable query latency (both average and tail) while using 1.45–2.15× less memory in a private cloud. In a public cloud, for a 100 TB hot dataset, Getafix can cut costs by as much as $10 million annually with negligible performance impact.

CCS CONCEPTS • Information systems → Online analytical processing engines; Cloud based storage; Distributed storage; Data warehouses;