2012 IEEE Conference on High Performance Extreme Computing
DOI: 10.1109/hpec.2012.6408678

Driving big data with big compute

Abstract: Big Data (as embodied by Hadoop clusters) and Big Compute (as embodied by MPI clusters) provide unique capabilities for storing and processing large volumes of data. Hadoop clusters make distributed computing readily accessible to the Java community, and MPI clusters provide high parallel efficiency for compute-intensive workloads. Bringing the big data and big compute communities together is an active area of research. The LLGrid team has developed and deployed a number of technologies that aim to pro…

Cited by 30 publications (24 citation statements)
References 7 publications (6 reference statements)
“…We have previously demonstrated data ingest rates in excess of four million records per second for our 8-node Accumulo instance, with speedup for up to 256 client processes parsing raw data and inserting records [11]. These ingest rates on the LLCySA system are roughly an order of magnitude faster than insert rates for traditional relational databases reported on the web [12], and the Accumulo architecture offers significantly more scalability.…”
Section: Case Study: Network Situational Awareness (mentioning)
confidence: 93%
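
The ingest pattern the citing authors describe (many parallel client processes parsing raw records and writing them into Accumulo) maps onto Accumulo's standard BatchWriter client. Below is a minimal sketch, assuming an Accumulo 1.x-era Java client API; the instance name, ZooKeeper address, credentials, table name, and record contents are placeholders, not the actual LLCySA configuration:

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;

public class IngestSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- not the actual LLCySA deployment.
        Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
                .getConnector("user", new PasswordToken("secret"));

        // One BatchWriter per client process; Accumulo buffers and batches the writes.
        BatchWriter writer = conn.createBatchWriter("events", new BatchWriterConfig());

        for (long i = 0; i < 1_000_000; i++) {
            Mutation m = new Mutation("row_" + i);  // row key parsed from raw data
            m.put("meta", "source", new Value("sensor".getBytes()));
            writer.addMutation(m);
        }
        writer.close();  // flushes any remaining buffered mutations
    }
}
```

In the setup quoted above, up to 256 such client processes insert records concurrently, which is where the reported aggregate ingest rate comes from.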
“…The reduce task will wait until all the mapper tasks are completed by setting a job dependency between the mapper tasks and the reducer task. LLGrid MapReduce is covered in more detail in [16].…”
Section: Batch Computing Jobs and LLMapReduce Jobs (mentioning)
confidence: 99%
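
LLGrid MapReduce realizes this barrier by registering the reducer job with a scheduler dependency on all of the mapper jobs [16]. As a rough in-process analogue only, the Java sketch below shows the same wait-for-all-mappers pattern using futures; the thread pool, task counts, and the squaring "map" function are illustrative placeholders, not LLGrid's actual scheduler mechanism:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MapThenReduce {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Submit the "mapper" tasks; each returns a partial result.
        List<Future<Integer>> mappers = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final int chunk = i;
            mappers.add(pool.submit((Callable<Integer>) () -> chunk * chunk));
        }

        // The "reducer" blocks on every mapper future, mirroring the
        // job dependency LLGrid MapReduce sets between mapper and reducer jobs.
        int total = 0;
        for (Future<Integer> f : mappers) {
            total += f.get();  // waits until that mapper task has finished
        }
        System.out.println("reduced result = " + total);

        pool.shutdown();
    }
}
```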
“…To achieve parallelism, we utilized LLGrid's LLGrid MapReduce facility [7]. To evaluate the algorithms, we used the macro-averaged precision and recall to obtain an F1 score for each algorithm as defined in [8].…”
Section: A. Experiments Setup (mentioning)
confidence: 99%
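
For reference, combining macro-averaged precision and recall into an F1 score is conventionally computed as below (the citing authors' exact definition is the one given in their reference [8]); here C is the number of classes and TP_c, FP_c, FN_c are the per-class true positives, false positives, and false negatives:

```latex
P_{\mathrm{macro}} = \frac{1}{C}\sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c}, \qquad
R_{\mathrm{macro}} = \frac{1}{C}\sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c}, \qquad
F_1 = \frac{2\, P_{\mathrm{macro}}\, R_{\mathrm{macro}}}{P_{\mathrm{macro}} + R_{\mathrm{macro}}}
```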