Small SIMD Matrices for CERN High Throughput Computing

Lemaître, F.; Couturier, B.; Lacassagne, Lionel

doi:10.1145/3178433.3178434

Cited by 3 publications

(2 citation statements)

References 19 publications

“…The key elements in the process included having flexible data structures that can be grown and shrunk at run time using dynamic memory allocation, but also the possibility of traversing decay trees for the analysis of multi-staged particle decays. In order to reach a high computational speed, the model needed to allow easy vectorisation [10][11][12][13]. At the same time, the new model had to be compatible with the old event model also during the development phase to not break the workflow of the full reconstruction sequence and for quality assurance.…”

Section: The LHCb Event Model (mentioning)

confidence: 99%

Event and data persistency models for the LHCb Real Time Analysis System

De Cian, Esen, Hennequin et al. 2024

EPJ Web of Conf.

Starting in 2022, the upgraded LHCb detector is collecting data with a pure software trigger. In its first stage, reducing the rate from 30 MHz to about 1 MHz, GPUs are used to reconstruct and trigger on B and D meson topologies and high-pT objects in the event. In its second stage, a CPU farm is used to reconstruct the full event and perform candidate selections, which are persisted for offline use with an output rate of about 10 GB/s. Fast data processing, flexible and custom-designed data structures tailored for SIMD architectures, and efficient storage of the intermediate data at various steps of the processing pipeline onto persistent media, e.g. tapes, are essential to guarantee the full physics program of LHCb. We present the event model and data persistency developments for the trigger of LHCb in Run 3. Particular emphasis is given to the novel software-design aspects with respect to the Run 1+2 data taking, the performance improvements which can be achieved and the experience of restructuring a major part of the reconstruction software in a large HEP experiment.

“…DSLs can either generate high-level code in a more general language or directly go to an IR level such as LLVM-IR. For batched Cholesky factorization and Kalman filters, Lemaitre et al [34] propose a template system. Rodrigues et al [44] specify a small DSL for static tensor multiplications; even parallelizing error correction in 5G base stations [14] warrants a DSL.…”

Section: Related Work (mentioning)

confidence: 99%

Flynn’s Reconciliation

Thuerck, Weber, Bifulco 2021

ACM Trans. Archit. Code Optim.

A large portion of the recent performance increase in the High Performance Computing (HPC) and Machine Learning (ML) domains is fueled by accelerator cards. Many popular ML frameworks support accelerators by organizing computations as a computational graph over a set of highly optimized, batched general-purpose kernels. While this approach simplifies the kernels’ implementation for each individual accelerator, the increasing heterogeneity among accelerator architectures for HPC complicates the creation of portable and extensible libraries of such kernels. Therefore, using a generalization of the CUDA community’s warp register cache programming idiom, we propose a new programming idiom (CoRe) and a virtual architecture model (PIRCH), abstracting over SIMD and SIMT paradigms. We define and automate the mapping process from a single source to PIRCH’s intermediate representation and develop backends that issue code for three different architectures: Intel AVX512, NVIDIA GPUs, and NEC SX-Aurora. Code generated by our source-to-source compiler for batched kernels, borG, competes favorably with vendor-tuned libraries and is up to 2× faster than hand-tuned kernels across architectures.
