2015
DOI: 10.1007/s11227-015-1409-9

A methodology for speeding up matrix vector multiplication for single/multi-core architectures

Abstract: In this paper, a new methodology for computing the Dense Matrix Vector Multiplication (MVM), for both embedded processors (without SIMD unit) and general-purpose processors (single- and multi-core, with SIMD unit), is presented. This methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedups from 1.2 up to 1.45). This is achieved by fully exploiting the combination of software parameters (e.g., data reuse) and hardware parameters (e.g., data cache associativity)…
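The abstract's mention of data reuse as a software parameter can be made concrete with a small kernel. The sketch below shows 4-way register blocking over the rows of a dense MVM, so each element of x loaded from memory is reused from a register four times. The fixed blocking factor and the kernel itself are illustrative assumptions, not the paper's tuned code, which derives such factors analytically from the hardware parameters.

```c
/* Illustrative sketch only: dense MVM with 4-way register blocking
 * over rows, so each loaded x[j] is reused four times from a
 * register. The blocking factor 4 is an assumption for clarity. */
#include <stddef.h>

void mvm_register_blocked(size_t n, size_t m,
                          const double A[n][m],
                          const double x[m],
                          double y[n])
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (size_t j = 0; j < m; j++) {
            double xj = x[j];          /* one load of x[j] ... */
            s0 += A[i][j]     * xj;    /* ... reused four times */
            s1 += A[i + 1][j] * xj;
            s2 += A[i + 2][j] * xj;
            s3 += A[i + 3][j] * xj;
        }
        y[i] = s0; y[i + 1] = s1; y[i + 2] = s2; y[i + 3] = s3;
    }
    for (; i < n; i++) {               /* remainder rows */
        double s = 0.0;
        for (size_t j = 0; j < m; j++)
            s += A[i][j] * x[j];
        y[i] = s;
    }
}
```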

Cited by 11 publications (8 citation statements)
References 43 publications
“…[15] The technique employs analytical models to test a smaller number of implementations instead of the whole search space [44,45]. These models can exploit knowledge about modern processor architectures and can be based on the data-usage pattern. However, these refinements do not alleviate the previously considered drawbacks.…”
Section: Identification of the Optimal Parameter Values for MVM (mentioning, confidence: 99%)
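To make the contrast in that statement concrete, here is a minimal sketch of the two selection strategies it compares: an analytical model that derives a tile size directly from cache parameters versus an empirical search that benchmarks every candidate. The model's formula and the time_mvm_with_tile benchmark hook are hypothetical placeholders, not taken from the cited works.

```c
/* Hedged sketch: analytical model vs. empirical auto-tuning for
 * choosing a tile size. The model below (reserve one cache way each
 * for rows of A and for y, then fit the x tile into the remaining
 * ways) is a simplified assumption, not any cited work's formula. */
#include <stddef.h>

/* Analytical choice, directly from hardware parameters. */
size_t tile_from_model(size_t l1_bytes, size_t ways, size_t elem_bytes)
{
    size_t way_bytes = l1_bytes / ways;
    size_t usable = l1_bytes - 2 * way_bytes;  /* 2 ways reserved (assumption) */
    return usable / elem_bytes;
}

/* Empirical choice: time every power-of-two candidate. The benchmark
 * routine is a hypothetical hook, supplied elsewhere. */
extern double time_mvm_with_tile(size_t tile);

size_t tile_from_search(size_t max_tile)
{
    size_t best = 1;
    double best_t = time_mvm_with_tile(1);
    for (size_t t = 2; t <= max_tile; t *= 2) {
        double dt = time_mvm_with_tile(t);
        if (dt < best_t) { best_t = dt; best = t; }
    }
    return best;
}
```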
“…Many research works, as well as ATLAS [51] (one of the state-of-the-art high-performance libraries), apply loop tiling by taking into account only the cache size: the accumulated size of three rectangular tiles (one from each matrix) must be smaller than or equal to the cache size. However, the elements of these tiles are not written in consecutive main-memory locations (the elements of each tile sub-row lie in different main-memory locations) and thus do not occupy consecutive data-cache locations; on a set-associative cache, the three tiles therefore cannot simultaneously fit in the data cache because of the cache modulo effect. Moreover, even if the tile elements are written in consecutive main-memory locations (a different data-array layout), the three tiles cannot simultaneously fit in the data cache if the cache is two-way associative or direct mapped [52], [53]. Thus, loop tiling is efficient only when cache size, cache associativity, and data-array layouts are addressed together as one problem and not separately.…”
Section: Loop Tiling and Data Array Layouts (mentioning, confidence: 99%)
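A small feasibility test can illustrate this argument: checking only the accumulated tile size admits tilings that a set-associative cache cannot actually hold. The associativity-aware rule below (each tile gets a whole number of dedicated cache ways, which also presumes the tiles are laid out contiguously) is a simplified assumption for illustration, not the precise condition of [52], [53].

```c
/* Hedged sketch: tile-size feasibility that considers associativity,
 * not just total cache capacity. */
#include <stdbool.h>
#include <stddef.h>

bool tiles_fit(size_t tileA_bytes, size_t tileB_bytes, size_t tileC_bytes,
               size_t cache_bytes, size_t ways)
{
    size_t way_bytes = cache_bytes / ways;

    /* Naive test: only the accumulated size is checked. */
    bool naive = tileA_bytes + tileB_bytes + tileC_bytes <= cache_bytes;

    /* Associativity-aware test: each tile must fit in a whole number
     * of dedicated ways, and at least one way is needed per tile, so
     * a direct-mapped or 2-way cache can never hold all three. */
    size_t waysA = (tileA_bytes + way_bytes - 1) / way_bytes;
    size_t waysB = (tileB_bytes + way_bytes - 1) / way_bytes;
    size_t waysC = (tileC_bytes + way_bytes - 1) / way_bytes;
    bool aware = ways >= 3 && waysA + waysB + waysC <= ways;

    return naive && aware;
}
```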
“…The strategy to efficiently parallelize this operation depends on the sparseness of the connectivity matrix. Depending on this sparseness, multiple methods are available, including single-instruction-multiple-data (SIMD) operations, cache blocking, loop unrolling, prefetching, and auto-tuning (Williams et al., 2007; Kelefouras et al., 2015). Thanks to the code-generation approach used in ANNarchy, we will be able in future versions to implement these improvements depending on the connectivity known before compilation.…”
Section: Benchmarks (mentioning, confidence: 99%)
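For reference, the baseline that such techniques start from is an untuned sparse matrix-vector product. The CSR sketch below is a generic kernel under assumed field names; it is not ANNarchy's generated code, which would specialize this loop once the connectivity is known before compilation.

```c
/* Hedged sketch: plain CSR sparse matrix-vector product, the kind of
 * kernel that SIMD, cache blocking, unrolling, prefetching, and
 * auto-tuning would specialize. Field names are illustrative. */
#include <stddef.h>

typedef struct {
    size_t n_rows;
    const size_t *row_ptr;   /* length n_rows + 1 */
    const size_t *col_idx;   /* column index per nonzero */
    const double *val;       /* value per nonzero */
} csr_t;

void spmv_csr(const csr_t *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->n_rows; i++) {
        double s = 0.0;
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            s += A->val[k] * x[A->col_idx[k]];  /* irregular access to x */
        y[i] = s;
    }
}
```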