Boduo Li scite author profile

A platform for scalable one-pass analytics using MapReduce

Today's one-pass analytics applications tend to be data-intensive in nature and require the ability to process high volumes of data efficiently. MapReduce is a popular programming model for processing large datasets using a cluster of machines. However, the traditional MapReduce model is not well-suited for one-pass analytics, since it is geared towards batch processing and requires the data set to be fully loaded into the cluster before running analytical queries. This paper examines, from a systems standpoint, what architectural design changes are necessary to bring the benefits of the MapReduce model to incremental one-pass analytics. Our empirical and theoretical analyses of Hadoop-based MapReduce systems show that the widely-used sort-merge implementation for partitioning and parallel processing poses a fundamental barrier to incremental one-pass analytics, despite various optimizations. To address these limitations, we propose a new data analysis platform that employs hash techniques to enable fast in-memory processing, and a new frequent key based technique to extend such processing to workloads that require a large key-state space. Evaluation of our Hadoop-based prototype using real-world workloads shows that our new platform significantly improves the progress of map tasks, allows the reduce progress to keep up with the map progress, with up to 3 orders of magnitude reduction of internal data spills, and enables results to be returned continuously during the job.
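To make the contrast with sort-merge concrete, here is a minimal Python sketch of hash-based incremental aggregation, assuming a simple sum per key and a fixed in-memory key budget. The class name `IncrementalHashAggregator`, the `capacity` parameter, and the in-memory `spilled` list (a stand-in for a spill file) are hypothetical and are not taken from the paper. The sketch keeps whichever keys arrive first rather than tracking key frequency, so it does not reproduce the frequent-key technique the abstract mentions.

```python
class IncrementalHashAggregator:
    """Updates per-key state as each record arrives, with no sort or merge step."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed budget: max number of keys held in memory
        self.states = {}          # key -> running sum, updated in place
        self.spilled = []         # stand-in for an on-disk spill file

    def add(self, key, value):
        if key in self.states:
            # Key already has in-memory state: update it immediately.
            self.states[key] += value
        elif len(self.states) < self.capacity:
            # Room left in the hash table: start tracking this key.
            self.states[key] = value
        else:
            # Memory budget exhausted: defer this record to the spill area.
            self.spilled.append((key, value))

    def snapshot(self):
        """Partial results for in-memory keys, available at any point in the job."""
        return dict(self.states)

    def finish(self):
        """Fold spilled records back in to produce the final totals."""
        totals = dict(self.states)
        for key, value in self.spilled:
            totals[key] = totals.get(key, 0) + value
        return totals


agg = IncrementalHashAggregator(capacity=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]:
    agg.add(key, value)

partial = agg.snapshot()  # results for resident keys before the job ends
final = agg.finish()      # totals once spilled records are folded in
```

In this toy run only the record for key `c` is deferred, which hints at why retaining the most frequently occurring keys in memory would keep spills small.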

Supporting scalable analytics with latency constraints

Diao

Shenoy

2015

Proc. VLDB Endow.

Recently there has been a significant interest in building big data analytics systems that can handle both "big data" and "fast data". Our work is strongly motivated by recent real-world use cases that point to the need for a general, unified data processing framework to support analytical queries with different latency requirements. Toward this goal, we start with an analysis of existing big data systems to understand the causes of high latency. We then propose an extended architecture with mini-batches as granularity for computation and shuffling, and augment it with new model-driven resource allocation and runtime scheduling techniques to meet user latency requirements while maximizing throughput. Results from real-world workloads show that our techniques, implemented in Incremental Hadoop, reduce its latency from tens of seconds to sub-second, with 2x-5x increase in throughput. Our system also outperforms state-of-the-art distributed stream systems, Storm and Spark Streaming, by 1-2 orders of magnitude when combining latency and throughput.
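The idea of using mini-batches as the unit of computation can be illustrated with a short Python sketch. It assumes a fixed batch size and a simple per-key count; the function names `mini_batches` and `running_counts` are hypothetical, and the model-driven resource allocation and scheduling described in the abstract are not modeled here.

```python
from itertools import islice


def mini_batches(records, batch_size):
    """Group an input stream into consecutive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_counts(records, batch_size):
    """Yield an updated per-key count after each mini-batch is processed."""
    counts = {}
    for batch in mini_batches(records, batch_size):
        for key in batch:
            counts[key] = counts.get(key, 0) + 1
        # Emit a copy so earlier results are not mutated by later batches.
        yield dict(counts)


results = list(running_counts(["a", "b", "a", "c", "a"], batch_size=2))
```

A smaller batch size makes results available sooner but adds per-batch overhead, which is the kind of latency and throughput trade-off the abstract refers to.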

Exploiting the Interplay between Memory and Flash Storage in Embedded Sensor Devices

Agrawal

Cao

et al. 2010

Although memory is an important constraint in embedded sensor nodes, existing sensor applications and systems are typically designed to work under the memory constraints of a single platform and do not consider the interplay between memory and flash storage. In this paper, we present the design of a memory-adaptive flash-based sensor system that allows an application to exploit the presence of flash and adapt to different amounts of RAM on the embedded device. We describe how such a system can be exploited by sensor data management applications. Our design involves several novel features: flash and memory-efficient storage and indexing, techniques for efficient storage reclamation, and intelligent buffer management to maximize write coalescing. Our results show that our system is highly energy-efficient under different workloads, and can be configured for sensor platforms with memory constraints ranging from a few kilobytes to hundreds of kilobytes.

Towards Scalable One-Pass Analytics Using MapReduce

Mazur

Diao

et al. 2011

Scalla

Mazur

Diao

et al. 2012

ACM Trans. Database Syst.

Today’s one-pass analytics applications tend to be data-intensive in nature and require the ability to process high volumes of data efficiently. MapReduce is a popular programming model for processing large datasets using a cluster of machines. However, the traditional MapReduce model is not well-suited for one-pass analytics, since it is geared towards batch processing and requires the dataset to be fully loaded into the cluster before running analytical queries. This article examines, from a systems standpoint, what architectural design changes are necessary to bring the benefits of the MapReduce model to incremental one-pass analytics. Our empirical and theoretical analyses of Hadoop-based MapReduce systems show that the widely used sort-merge implementation for partitioning and parallel processing poses a fundamental barrier to incremental one-pass analytics, despite various optimizations. To address these limitations, we propose a new data analysis platform that employs hash techniques to enable fast in-memory processing, and a new frequent key based technique to extend such processing to workloads that require a large key-state space. Evaluation of our Hadoop-based prototype using real-world workloads shows that our new platform significantly improves the progress of map tasks, allows the reduce progress to keep up with the map progress, with up to 3 orders of magnitude reduction of internal data spills, and enables results to be returned continuously during the job.
