D4M 2.0 schema: A general purpose high performance schema for the Accumulo database

Kepner, Jeremy; Anderson, Caroline F.; Arcand, William; Bestor, David; Bergeron, Bill; Byun, Chansup; Hubbell, Matthew; Michaleas, Peter; Mullen, Julie; O'Gwynn, David; Prout, Andrew; Reuther, Albert; Rosa, Antonio; Yee, Charles

doi:10.1109/hpec.2013.6670318

Cited by 37 publications

(32 citation statements)

References 15 publications


“…The code snippet below describes the D4M syntax for loading the incidence matrix file, inserting into a table called Tedge, generating the degree table and inserting it into TedgeDeg. Details about the general schema and table design can be found in [33]. mat.E.mat'],'E'); put(Tedge,putVal(E,'1,')); Edeg = putCol(sum(E.',2),'degree,'); put(TedgeDeg,num2str(Edeg));…”

Section: F Step 6: Ingest (mentioning)

confidence: 99%

Hyperscaling Internet Graph Analysis with D4M on the MIT SuperCloud

Gadepally, Kepner, Milechin et al. 2018

2018 IEEE High Performance Extreme Computing Conference (HPEC)

Self Cite


Detecting anomalous behavior in network traffic is a major challenge due to the volume and velocity of network traffic. For example, a 10 Gigabit Ethernet connection can generate over 50 MB/s of packet headers. For global network providers, this challenge can be amplified by many orders of magnitude. Development of novel computer network traffic analytics requires: high level programming environments, massive amount of packet capture (PCAP) data, and diverse data products for "at scale" algorithm pipeline development. D4M (Dynamic Distributed Dimensional Data Model) combines the power of sparse linear algebra, associative arrays, parallel processing, and distributed databases (such as SciDB and Apache Accumulo) to provide a scalable data and computation system that addresses the big data problems associated with network analytics development. Combining D4M with the MIT SuperCloud manycore processors and parallel storage system enables network analysts to interactively process massive amounts of data in minutes. To demonstrate these capabilities, we have implemented a representative analytics pipeline in D4M and benchmarked it on 96 hours of Gigabit PCAP data with MIT SuperCloud. The entire pipeline from uncompressing the raw files to database ingest was implemented in 135 lines of D4M code and achieved speedups of over 20,000.
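The ingest step quoted in this citation statement (insert the incidence matrix into Tedge with every value set to '1', then sum per column to build the degree table for TedgeDeg) can be illustrated with a minimal sketch. This is plain Python with dictionaries, not D4M or Accumulo code. The sample records, the `field|value` column-key convention, and the helper names `build_edge_table` and `build_degree_table` are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical sample records; field names and values are invented for this sketch.
records = {
    "pkt1": {"src": "10.0.0.1", "dst": "10.0.0.2"},
    "pkt2": {"src": "10.0.0.1", "dst": "10.0.0.3"},
    "pkt3": {"src": "10.0.0.4", "dst": "10.0.0.2"},
}


def build_edge_table(records):
    """Map each (row key, 'field|value' column key) pair to the string '1'.

    The 'field|value' column naming is an assumed convention for this sketch.
    """
    tedge = {}
    for row, fields in records.items():
        for field, value in fields.items():
            tedge[(row, f"{field}|{value}")] = "1"
    return tedge


def build_degree_table(tedge):
    """Count entries per column key and store each count under a 'degree' column.

    Counts are stored as strings, mirroring the num2str call in the quoted snippet.
    """
    counts = Counter(col for (_row, col) in tedge)
    return {(col, "degree"): str(n) for col, n in counts.items()}


tedge = build_edge_table(records)
tedge_deg = build_degree_table(tedge)
```

In this sketch the degree table is keyed by the column keys of the edge table, so a lookup such as `tedge_deg[("src|10.0.0.1", "degree")]` reports how many records contain that field value without scanning the edge table.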


“…of the four largest computing ecosystems: supercomputing, enterprise computing, big data, and traditional databases. The MIT SuperCloud has spurred the development of a number of cross-ecosystem innovations in high performance databases [31], [32], database management [33], data protection [34], database federation [35], [36], data analytics [37] and system monitoring [38].…”

Section: Experimental Environment (mentioning)

confidence: 99%

Interactive Launch of 16,000 Microsoft Windows Instances on a Supercomputer

Jones, Kepner, Orchard et al. 2018

2018 IEEE High Performance Extreme Computing Conference (HPEC)

Self Cite


Simulation, machine learning, and data analysis require a wide range of software which can be dependent upon specific operating systems, such as Microsoft Windows. Running this software interactively on massively parallel supercomputers can present many challenges. Traditional methods of scaling Microsoft Windows applications to run on thousands of processors have typically relied on heavyweight virtual machines that can be inefficient and slow to launch on modern manycore processors. This paper describes a unique approach using the Lincoln Laboratory LLMapReduce technology in combination with the Wine Windows compatibility layer to rapidly and simultaneously launch and run Microsoft Windows applications on thousands of cores on a supercomputer. Specifically, this work demonstrates launching 16,000 Microsoft Windows applications in 5 minutes running on 16,000 processor cores. This capability significantly broadens the range of applications that can be run at large scale on a supercomputer.


“…The SuperCloud is a fusion of the four large computing ecosystems: supercomputing, enterprise computing, big data and traditional databases into a coherent, unified platform. The MIT SuperCloud has spurred the development of a number of cross-ecosystem innovations in high performance databases [3], [13]; database management [19]; data protection [14]; database federation [11], [6]; data analytics [12]; dynamic virtual machines [23], [8] and system monitoring [7].…”

Section: Introduction (mentioning)

confidence: 99%

Lessons Learned from a Decade of Providing Interactive, On-Demand High Performance Computing to Scientists and Engineers

Mullen, Reuther, Arcand et al. 2018

Lecture Notes in Computer Science

Self Cite


For decades, the use of HPC systems was limited to those in the physical sciences who had mastered their domain in conjunction with a deep understanding of HPC architectures and algorithms. During these same decades, consumer computing device advances produced tablets and smartphones that allow millions of children to interactively develop and share code projects across the globe. As the HPC community faces the challenges associated with guiding researchers from disciplines using high productivity interactive tools to effective use of HPC systems, it seems appropriate to revisit the assumptions surrounding the necessary skills required for access to large computational systems. For over a decade, MIT Lincoln Laboratory has been supporting interactive, on-demand high performance computing by seamlessly integrating familiar high productivity tools to provide users with an increased number of design turns, rapid prototyping capability, and faster time to insight. In this paper, we discuss the lessons learned while supporting interactive, on-demand high performance computing from the perspectives of the users and the team supporting the users and the system. Building on these lessons, we present an overview of current needs and the technical solutions we are building to lower the barrier to entry for new users from the humanities, social, and biological sciences.

