2015
DOI: 10.1007/s11227-015-1409-9

A methodology for speeding up matrix vector multiplication for single/multi-core architectures

Abstract: In this paper, a new methodology for computing the Dense Matrix Vector Multiplication (MVM), for both embedded processors (without SIMD unit) and general-purpose processors (single- and multi-core, with SIMD unit), is presented. This methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedups from 1.2 up to 1.45). This is achieved by fully exploiting the combination of software parameters (e.g., data reuse) and hardware parameters (e.g., data cache associativity)…
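The abstract's mention of data reuse as a software parameter can be made concrete with a small kernel. The sketch below shows 4-way register blocking over the rows of a dense MVM, so each element of x loaded from memory is reused from a register four times. The fixed blocking factor and the kernel itself are illustrative assumptions, not the paper's tuned code, which derives such factors analytically from the hardware parameters.

```c
/* Illustrative sketch only: dense MVM with 4-way register blocking
 * over rows, so each loaded x[j] is reused four times from a
 * register. The blocking factor 4 is an assumption for clarity. */
#include <stddef.h>

void mvm_register_blocked(size_t n, size_t m,
                          const double A[n][m],
                          const double x[m],
                          double y[n])
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (size_t j = 0; j < m; j++) {
            double xj = x[j];          /* one load of x[j] ... */
            s0 += A[i][j]     * xj;    /* ... reused four times */
            s1 += A[i + 1][j] * xj;
            s2 += A[i + 2][j] * xj;
            s3 += A[i + 3][j] * xj;
        }
        y[i] = s0; y[i + 1] = s1; y[i + 2] = s2; y[i + 3] = s3;
    }
    for (; i < n; i++) {               /* remainder rows */
        double s = 0.0;
        for (size_t j = 0; j < m; j++)
            s += A[i][j] * x[j];
        y[i] = s;
    }
}
```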

Cited by 11 publications (8 citation statements)
References 43 publications
“…[15] The technique employs analytical models to test a smaller number of implementations instead of the whole search space [44,45]. These models can exploit knowledge about modern processor architectures and can be based on the data-usage pattern. However, these refinements do not alleviate the previously considered drawbacks.…”
Section: Identification of the Optimal Parameter Values for MVM (mentioning, confidence: 99%)
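To make the contrast in that statement concrete, here is a minimal sketch of the two selection strategies it compares: an analytical model that derives a tile size directly from cache parameters versus an empirical search that benchmarks every candidate. The model's formula and the time_mvm_with_tile benchmark hook are hypothetical placeholders, not taken from the cited works.

```c
/* Hedged sketch: analytical model vs. empirical auto-tuning for
 * choosing a tile size. The model below (reserve one cache way each
 * for rows of A and for y, then fit the x tile into the remaining
 * ways) is a simplified assumption, not any cited work's formula. */
#include <stddef.h>

/* Analytical choice, directly from hardware parameters. */
size_t tile_from_model(size_t l1_bytes, size_t ways, size_t elem_bytes)
{
    size_t way_bytes = l1_bytes / ways;
    size_t usable = l1_bytes - 2 * way_bytes;  /* 2 ways reserved (assumption) */
    return usable / elem_bytes;
}

/* Empirical choice: time every power-of-two candidate. The benchmark
 * routine is a hypothetical hook, supplied elsewhere. */
extern double time_mvm_with_tile(size_t tile);

size_t tile_from_search(size_t max_tile)
{
    size_t best = 1;
    double best_t = time_mvm_with_tile(1);
    for (size_t t = 2; t <= max_tile; t *= 2) {
        double dt = time_mvm_with_tile(t);
        if (dt < best_t) { best_t = dt; best = t; }
    }
    return best;
}
```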
“…Many research works, as well as ATLAS [51] (one of the state-of-the-art high-performance libraries), apply loop tiling by taking into account only the cache size: the accumulated size of three rectangular tiles (one from each matrix) must be smaller than or equal to the cache size. However, the elements of these tiles are not written in consecutive main-memory locations (the elements of each tile sub-row lie in different main-memory locations) and thus do not occupy consecutive data-cache locations; on a set-associative cache, the three tiles therefore cannot simultaneously fit in the data cache because of the cache modulo effect. Moreover, even if the tile elements are written in consecutive main-memory locations (a different data-array layout), the three tiles cannot simultaneously fit in the data cache if the cache is two-way associative or direct mapped [52], [53]. Thus, loop tiling is efficient only when cache size, cache associativity, and data-array layouts are addressed together as one problem and not separately.…”
Section: Loop Tiling and Data Array Layouts (mentioning, confidence: 99%)
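A small feasibility test can illustrate this argument: checking only the accumulated tile size admits tilings that a set-associative cache cannot actually hold. The associativity-aware rule below (each tile gets a whole number of dedicated cache ways, which also presumes the tiles are laid out contiguously) is a simplified assumption for illustration, not the precise condition of [52], [53].

```c
/* Hedged sketch: tile-size feasibility that considers associativity,
 * not just total cache capacity. */
#include <stdbool.h>
#include <stddef.h>

bool tiles_fit(size_t tileA_bytes, size_t tileB_bytes, size_t tileC_bytes,
               size_t cache_bytes, size_t ways)
{
    size_t way_bytes = cache_bytes / ways;

    /* Naive test: only the accumulated size is checked. */
    bool naive = tileA_bytes + tileB_bytes + tileC_bytes <= cache_bytes;

    /* Associativity-aware test: each tile must fit in a whole number
     * of dedicated ways, and at least one way is needed per tile, so
     * a direct-mapped or 2-way cache can never hold all three. */
    size_t waysA = (tileA_bytes + way_bytes - 1) / way_bytes;
    size_t waysB = (tileB_bytes + way_bytes - 1) / way_bytes;
    size_t waysC = (tileC_bytes + way_bytes - 1) / way_bytes;
    bool aware = ways >= 3 && waysA + waysB + waysC <= ways;

    return naive && aware;
}
```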
“…The strategy to efficiently parallelize this operation depends on the sparseness of the connectivity matrix. Depending on this sparseness, multiple methods are available, including single-instruction-multiple-data (SIMD) operations, cache blocking, loop unrolling, prefetching, and auto-tuning (Williams et al., 2007; Kelefouras et al., 2015). Thanks to the code-generation approach used in ANNarchy, we will be able in future versions to implement these improvements depending on the connectivity known before compilation.…”
Section: Benchmarks (mentioning, confidence: 99%)
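For reference, the baseline that such techniques start from is an untuned sparse matrix-vector product. The CSR sketch below is a generic kernel under assumed field names; it is not ANNarchy's generated code, which would specialize this loop once the connectivity is known before compilation.

```c
/* Hedged sketch: plain CSR sparse matrix-vector product, the kind of
 * kernel that SIMD, cache blocking, unrolling, prefetching, and
 * auto-tuning would specialize. Field names are illustrative. */
#include <stddef.h>

typedef struct {
    size_t n_rows;
    const size_t *row_ptr;   /* length n_rows + 1 */
    const size_t *col_idx;   /* column index per nonzero */
    const double *val;       /* value per nonzero */
} csr_t;

void spmv_csr(const csr_t *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->n_rows; i++) {
        double s = 0.0;
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            s += A->val[k] * x[A->col_idx[k]];  /* irregular access to x */
        y[i] = s;
    }
}
```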