Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++

Bingmann, Timo; Axtmann, Michael; Jöbstl, Emanuel; Lamm, Sebastian; Nguyen, Huyen Chau; Noe, Alexander; Schlag, Sebastian; Stumpp, Matthias; Sturm, Tobias; Sanders, Peter

doi:10.48550/arxiv.1608.05634

Cited by 3 publications

(6 citation statements)

References 0 publications


“…We propose two distributed graph clustering algorithms, DSLM-Mod and DSLM-Map, that optimize modularity and map equation, respectively. Our algorithms are the first graph clustering algorithms based on Thrill [4], a distributed big data processing framework written in C++ that implements an extended MapReduce model. Our algorithms are easy to extend for optimizing different density-based quality measures.…”

Section: Contribution (mentioning)

confidence: 99%


Distributed Graph Clustering Using Modularity and Map Equation

Hamann, Strasser, Wagner, et al. 2018

Euro-Par 2018: Parallel Processing


We study large-scale, distributed graph clustering. Given an undirected graph, our objective is to partition the nodes into disjoint sets called clusters. A cluster should contain many internal edges while being sparsely connected to other clusters. In the context of a social network, a cluster could be a group of friends. Modularity and map equation are established formalizations of this internally-dense-externally-sparse principle. We present two versions of a simple distributed algorithm to optimize both measures. They are based on Thrill, a distributed big data processing framework that implements an extended MapReduce model. The algorithms for the two measures, DSLM-Mod and DSLM-Map, differ only slightly. Adapting them for similar quality measures is straightforward. We conduct an extensive experimental study on real-world graphs and on synthetic benchmark graphs with up to 68 billion edges. Our algorithms are fast while detecting clusterings similar to those detected by other sequential, parallel and distributed clustering algorithms. Compared to the distributed GossipMap algorithm, DSLM-Map needs less memory, is up to an order of magnitude faster and achieves better quality.
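To make the internally-dense-externally-sparse principle concrete, the sketch below computes modularity for a given clustering using the standard textbook definition, Q = sum over clusters c of [ m_c / m − (vol_c / (2m))^2 ], where m is the number of edges, m_c the number of edges inside c, and vol_c the sum of node degrees in c. It is plain single-machine Python, not the DSLM-Mod implementation and not Thrill code; the function name `modularity` and the toy graph are illustrative assumptions.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Modularity of a clustering of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once (non-empty).
    cluster_of: dict mapping every node to its cluster id.
    """
    m = len(edges)
    internal = defaultdict(int)  # edges with both endpoints in the cluster
    volume = defaultdict(int)    # sum of node degrees in the cluster
    for u, v in edges:
        volume[cluster_of[u]] += 1
        volume[cluster_of[v]] += 1
        if cluster_of[u] == cluster_of[v]:
            internal[cluster_of[u]] += 1
    return sum(internal[c] / m - (volume[c] / (2 * m)) ** 2 for c in volume)


# Toy graph (hypothetical): two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {v: "all" for v in range(6)}

q_two = modularity(edges, two_clusters)  # each triangle is its own cluster
q_one = modularity(edges, one_cluster)   # everything in a single cluster
```

On this toy graph, splitting along the bridge edge scores higher than lumping all nodes together, which is the behaviour a modularity-optimizing algorithm exploits.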


“…Thrill [4] is a distributed C++ big data processing framework. It can distribute the program execution over multiple machines and threads within a machine.…”

Section: Thrill (mentioning)

confidence: 99%

Distributed Graph Clustering Using Modularity and Map Equation

Hamann, Strasser, Wagner, et al. 2018

Euro-Par 2018: Parallel Processing


“…We explored using the Thrill [26] library to track the most energetic particles for the results of VPIC plasma physics simulation [32]. Thrill is a research project that aims to provide a bridge between big data analytics and HPC platforms.…”

Section: Solution Approach (mentioning)

confidence: 99%

The ISTI Rapid Response on Exploring Cloud Computing 2018

Coffrin, Arnold, Eidenbenz, et al. 2018


CloudFront: CloudFront provides a fast and secure content delivery service for web hosting. CloudFront simplifies the process of delivering content with low latency and high bandwidth across the globe and provides basic threat mitigation tools, for example to protect the web service from DDoS attacks.

Route 53: Route 53 is a reliable and scalable DNS service that makes it easy to route users to web applications hosted at specific IP addresses.


“…We implemented five suffix array construction algorithms using the distributed big data batch computation framework Thrill [2]. Thrill works with distributed immutable arrays (DIAs) storing tuples.…”

Section: A Short Introduction Into Thrill (mentioning)

confidence: 99%

“…= [ (0, 3), (4, 3), (8, 2), (1, 7), (5, 7), (2, 1), (6, 0), (3, 6), (7, 5) ] // 3.
= [ (0, 3, 3), (4, 3, 2), (8, 2, 0), (1, 7, 7), (5, 7, 0), (2, 1, 0), (6, 0, 0), (3, 6, 5), (7, 5, 0) ] // 3.
= [ (6, 0, 0), (2, 1, 0), (8, 2, 0), (4, 3, 2), (0, 3, 3), (7, 5, 0), (3, 6, 5), (5, 7, 0), (1, 7, 7) ] // 1.
= [ (6, 0), (2, 1), (8, 2), (4, 3), (0, 4), (7, 5), (3, 6), (5, 7), (1, 8) ] // 1.5 1 item with rank 0 // 1.6…”

mentioning

confidence: 99%

Scalable Construction of Text Indexes with Thrill

Bingmann, Gog, Kurpicz 2018

2018 IEEE International Conference on Big Data (Big Data)

Self Cite


The suffix array is the key to efficient solutions for myriads of string processing problems in different application domains, like data compression, data mining, or bioinformatics. With the rapid growth of available data, suffix array construction algorithms had to be adapted to advanced computational models such as external memory and distributed computing. In this article, we present five suffix array construction algorithms utilizing the new algorithmic big data batch processing framework Thrill, which allows us to process input sizes in orders of magnitude that have not been considered before.

Window_k(w) and FlatWindow_k(w′): Window takes an input DIA⟨A⟩ X and a window function w : N_0 × A^k → B. The operation scans over X with a window of size k and applies w once to each set of k consecutive items from X and their index in X. The final k − 1 indexes with fewer than k consecutive items are delivered to w as partial windows padded with sentinel values. The result of all invocations of w is returned as a DIA⟨B⟩ containing |X| items in the order. FlatWindow is a variant of Window which takes an input DIA⟨A⟩ X and a window function w′ : N_0 × A^k → list(B). The only difference compared to Window is that w′ can emit zero or more items, which are concatenated in the resulting DIA⟨B⟩ in the order they are emitted.

PrefixSum(s): Given an input DIA⟨A⟩ X and an associative operation s : […]

[…] If Sort is called without a comparison function, we assume the tuples are compared component-wise with the first component being most significant, the second component the second most significant, and so on.

Merge(X_1, …, X_n, c): Given a set of sorted DIA⟨A⟩s X_1, …, X_n and a less-comparison function c : A × A → bool, Merge returns DIA⟨A⟩ Y that contains all tuples of X_1, …, X_n and is sorted with respect to c. If Merge is called without a comparison function, we compare the tuples component-wise (see Sort).

Union(X_1, …, X_n): Given a set of DIA⟨A⟩s X_1, …, X_n, Union returns DIA⟨A⟩ Y = ⋃_{i=1}^{n} X_i containing all items of the input in an arbitrary order.

Zip(X_1, …, X_n, f): Given a set of DIAs X_1, …, X_n of type A_1, …, A_n of equal size (|X_1| = ⋯ = |X_n|) and a function f : […], X_n[i]) for all i = 0, …, |X_1| − 1.

ZipWithIndex(f): Given an input DIA⟨A⟩ X and a function f : (N_0, A) […] for all i = 0, …, |X| − 1.

Max(c): Given an input DIA⟨A⟩ X, Max returns the maximum item m = max_c X with respect to a less-comparison function c : A × A → bool. By default (if Max is called without a comparison function), the tuples are compared component-wise (see Sort).

Size(): Given an input DIA⟨A⟩ X, Size returns the number of items in X, i.e., |X|.

Algorithm 1: Generic Prefix Doubling algorithm.

[…]
for k := 1 to ⌈log_2 |T|⌉ − 1 do
    S := S.Sort((i, r0, r1) by (r0, r1))    // Sort triples by name pair.
    N := S.FlatWindow_2((i, [a, b]) → CmpName(i, a, b))    // Map to names 0 or i.
    if N.Filter((i, r) → (r = 0)).Size() = 1 then    // If all names distinct, then
        return N.Map((i, r) → i)    // return names a...
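The Window and FlatWindow semantics described above, and the general idea of prefix doubling, can be sketched on ordinary Python lists. This is a single-machine illustration under stated assumptions, not Thrill's C++ API and not the paper's Algorithm 1: the function names `window`, `flat_window`, and `suffix_array_prefix_doubling` are hypothetical, the sentinel value is a caller-supplied parameter, and the suffix array routine is the textbook prefix-doubling scheme (sort suffixes by rank pairs, rename, double the prefix length).

```python
def window(xs, k, w, sentinel=None):
    """Call w(i, items) for every index i of xs with k consecutive items.

    The last k - 1 windows are partial and padded with `sentinel`,
    so the result has exactly len(xs) items.
    """
    padded = list(xs) + [sentinel] * (k - 1)
    return [w(i, padded[i:i + k]) for i in range(len(xs))]


def flat_window(xs, k, w, sentinel=None):
    """Like window, but w returns a list; all emitted items are concatenated."""
    out = []
    for emitted in window(xs, k, w, sentinel):
        out.extend(emitted)
    return out


def suffix_array_prefix_doubling(text):
    """Textbook prefix doubling: after each round, suffixes are ranked
    by their first 2*h characters; stop once all ranks are distinct."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]  # initial names: first character
    sa = list(range(n))
    h = 1
    while True:
        def key(i):
            # Name pair (rank of suffix i, rank of suffix i + h);
            # -1 marks "past the end" and sorts before any real rank.
            return (rank[i], rank[i + h] if i + h < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            same = key(sa[j]) == key(sa[j - 1])
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (0 if same else 1)
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all names distinct
            return sa
        h *= 2


pairs = window([1, 2, 3], 2, lambda i, items: (i, items), sentinel=0)
changes = flat_window([1, 1, 2], 2,
                      lambda i, items: [i] if items[0] != items[1] else [])
sa_banana = suffix_array_prefix_doubling("banana")
```

Here `pairs` shows the padded final window, `changes` shows a window function that emits zero or one item per position, and `sa_banana` lists the starting positions of the suffixes of "banana" in lexicographic order.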
