A Tight I/O Lower Bound for Matrix Multiplication

Smith, Tyler; Lowery, Bradley R.; Langou, Julien; van de Geijn, Robert A.

doi:10.48550/arxiv.1702.02017

Cited by 7 publications

(17 citation statements)

References 17 publications

“…Smith et al [Smith et al 2019] introduced a generalization of this argument, leading to tighter bounds in many cases. The idea is to decompose the execution into segments with T loads.…”

Section: Partitioning (mentioning)

confidence: 99%

“…3.3 on a CDAG G = (V, E), using its compact representation as a DFG. We also present, in 5.1.1, a generalization of one of the techniques introduced in [Dongarra et al 2008; Lowery and Langou 2014; Smith et al 2019; Smith and van de Geijn 2017] that these authors used to derive a tighter lower bound for matrix multiplication.…”

Section: K-partition Bound Derivation (mentioning)

confidence: 99%

Automated Derivation of Parametric Data Movement Lower Bounds for Affine Programs

Olivry, Langou, Pouchet, et al. 2019

Preprint

Self Cite

For most relevant computation, the energy and time needed for data movement dominates that for performing arithmetic operations on all computing systems today. Hence it is of critical importance to understand the minimal total data movement achievable during the execution of an algorithm. The achieved total data movement for different schedules of an algorithm can vary widely depending on how efficiently the cache is used, e.g., untiled versus effectively tiled matrix-matrix multiplication. A significant current challenge is that no existing tool is able to meaningfully quantify the potential reduction to the data movement of a computation that can be achieved by more effective use of the cache through operation rescheduling. Asymptotic parametric expressions of data movement lower bounds have previously been manually derived for a limited number of algorithms, often without scaling constants. In this paper, we present the first compile-time approach for deriving non-asymptotic parametric expressions of data movement lower bounds for arbitrary affine computations. The approach has been implemented in a fully automatic tool (IOLB) that can generate these lower bounds for input affine programs. IOLB's use is demonstrated by exercising it on all the benchmarks of the PolyBench suite. The advantages of IOLB are many: (1) IOLB enables us to derive bounds for few dozens of algorithms for which these lower bounds have never been derived. This reflects an increase of productivity by automation. (2) Anyone is able to obtain these lower bounds through IOLB, no expertise is required. (3) For some of the most well-studied algorithms, the lower bounds obtained by IOLB are higher than any previously reported manually derived lower bounds.

“…Smith et al [26] starts with a simple model of memory with two layers of memory: a small, fast memory with capacity of M elements and a large, slow memory with unlimited capacity. It shows that any algorithm for ordinary MMM must read at least 2mnk/√M − 2M elements from slow memory and additionally write at least mn − M elements to slow memory.…”

Section: An I/O Lower Bound For MMM (mentioning)

confidence: 99%
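
The bound quoted in this statement can be evaluated directly for concrete problem sizes. The following is a minimal sketch, not code from the cited paper: the function name `mmm_io_lower_bounds` and the clamping of negative values to zero are assumptions made for illustration; only the two expressions, 2mnk/√M − 2M reads and mn − M writes, come from the quoted text.

```python
import math


def mmm_io_lower_bounds(m, n, k, M):
    """Evaluate the quoted I/O lower bounds for an m x k by k x n product.

    M is the fast-memory capacity in elements. Hypothetical helper:
    the name and the clamping below are illustrative assumptions.
    """
    # Reads from slow memory: at least 2mnk / sqrt(M) - 2M elements.
    reads = 2 * m * n * k / math.sqrt(M) - 2 * M
    # Writes to slow memory: at least mn - M elements.
    writes = m * n - M
    # A negative lower bound carries no information, so clamp at zero.
    return max(reads, 0.0), max(writes, 0)


# Example: 1024 x 1024 x 1024 product with room for 4096 elements in fast memory.
reads, writes = mmm_io_lower_bounds(m=1024, n=1024, k=1024, M=4096)
print(reads, writes)
```

For small problems where the whole computation fits in fast memory, both expressions go negative, which is why the sketch clamps them to zero.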

“…In [26], it is shown that three algorithms, named Resident A, Resident B, and Resident C, attain the lower bound on the number of reads from slow memory. Additionally, Resident C attains the lower bound on the number of writes to slow memory.…”

Section: Resident Algorithms For MMM (mentioning)

confidence: 99%

The MOMMS Family of Matrix Multiplication Algorithms

Smith, van de Geijn 2019

Preprint

Self Cite

As the ratio between the rate of computation and rate with which data can be retrieved from various layers of memory continues to deteriorate, a question arises: Will the current best algorithms for computing matrix-matrix multiplication on future CPUs continue to be (near) optimal? This paper provides compelling analytical and empirical evidence that the answer is "no". The analytical results guide us to a new family of algorithms of which the current state-of-the-art "Goto's algorithm" is but one member. The empirical results, on architectures that were custom built to reduce the amount of bandwidth to main memory, show that under different circumstances, different and particular members of the family become more superior. Thus, this family will likely start playing a prominent role going forward.
