Awkward Arrays in Python, C++, and Numba

Future analysis of ATLAS data will involve new small-sized analysis formats to cope with the increased storage needs. The smallest of these, named DAOD_PHYSLITE, has calibrations already applied to allow fast downstream analysis and avoid the need for further analysis-specific intermediate formats. This allows for application of the “columnar analysis” paradigm where operations are applied on a per-array instead of a per-event basis. We will present methods to read the data into memory, using Uproot, and also discuss I/O aspects of columnar data and alternatives to the ROOT data format. Furthermore, we will show a representation of the event data model using the Awkward Array package and present proof of concept for a simple analysis application.
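To make the per-array versus per-event distinction concrete, the following is a minimal pure-Python sketch. It assumes the jagged event data is stored as a flat `content` list plus an `offsets` list that marks event boundaries, which is the layout idea behind Awkward Array. The event values and the function names `count_per_event` and `count_columnar` are hypothetical and chosen only for illustration; a real analysis would use the library's compiled operations rather than Python loops.

```python
# Hypothetical data: jet transverse momenta for four events (one is empty).
events = [[50.0, 30.0], [], [20.0], [40.0, 10.0, 5.0]]

# Columnar layout: one flat list of values plus offsets marking event boundaries.
content = [pt for event in events for pt in event]
offsets = [0]
for event in events:
    offsets.append(offsets[-1] + len(event))


def count_per_event(events, threshold):
    """Event-loop style: visit each event and count values above threshold."""
    return [sum(1 for pt in event if pt > threshold) for event in events]


def count_columnar(content, offsets, threshold):
    """Per-array style: one pass over the flat content, then a reduction
    that uses the offsets to recover one count per event."""
    passed = [1 if pt > threshold else 0 for pt in content]
    # Prefix sums turn each event's count into a difference of two entries.
    cumulative = [0]
    for flag in passed:
        cumulative.append(cumulative[-1] + flag)
    return [
        cumulative[stop] - cumulative[start]
        for start, stop in zip(offsets[:-1], offsets[1:])
    ]


per_event_counts = count_per_event(events, 25.0)
columnar_counts = count_columnar(content, offsets, 25.0)
```

Both functions give the same per-event counts. The columnar version touches the values only as one contiguous array, which is the property that vectorized libraries exploit.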


“…This confirms the findings shown in Ref. [5]. Notably, the loading time is always lower than reading with ROOT's TTree::Draw function.…”

Section: I/O and Storage Formats (supporting)

confidence: 91%

“…Many of them are collected under the umbrella of the Scikit-HEP [3] project. Currently, the Coffea framework [4], together with the Awkward Array package [5], provide the most complete set of tools for columnar data analysis in HEP.…”

Section: Introduction (mentioning)

confidence: 99%

Columnar data analysis with ATLAS analysis formats

2021


“…Previously, we quantified this slow-down [5] using ROOT TTrees containing float, std::vector<float>, and vectors of vectors up to three levels deep, reading them with the then-current Python codebase and with custom C++ code, which represents what is possible now that Awkward Array is implemented in C++. The read performance of float data is identical for Python and C++, the C++ is several times faster for std::vector<float>, and the gap widens to factors of hundreds for the doubly-nested and triply-nested cases.…”

Section: Motivation (mentioning)

confidence: 99%

“…Even the TypedArrayBuilder use-case works by "wiring" its fixed suite of commands to algorithmically generated Forth subroutines. Some TypedArrayBuilder commands must change the state of its finite-state machine: for instance, when filling an array of doubly nested lists of integers like [[[1, 2], [3]], [], [[4], [5]]], the first begin_list ([) puts it into a state that expects another begin_list ([) or end_list (]); the second puts it into a state that expects integer or end_list (]). And yet, each of these commands must return control-flow to its caller and remember its state for the next call.…”

Section: AwkwardForth Virtual Machine (mentioning)

confidence: 99%
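The builder behaviour described in the excerpt above can be sketched as a small state machine. This is a pure-Python illustration under stated assumptions, not the TypedArrayBuilder implementation: the class name `NestedListBuilder`, the helper `fill`, and the choice of output buffers are hypothetical. The only state carried between calls is the current nesting depth, which decides which commands are legal next.

```python
class NestedListBuilder:
    """Hypothetical builder for an array of doubly nested lists of integers.

    depth 0: expects begin_list
    depth 1: expects begin_list or end_list
    depth 2: expects integer or end_list
    """

    def __init__(self):
        self.depth = 0              # state remembered between calls
        self.outer_offsets = [0]    # boundaries of outer lists, in inner lists
        self.inner_offsets = [0]    # boundaries of inner lists, in integers
        self.content = []           # flat integer values

    def begin_list(self):
        if self.depth >= 2:
            raise ValueError("expected integer or end_list")
        self.depth += 1

    def integer(self, value):
        if self.depth != 2:
            raise ValueError("integer is only valid inside an inner list")
        self.content.append(value)

    def end_list(self):
        if self.depth == 2:
            # Closing an inner list: record how many integers exist so far.
            self.inner_offsets.append(len(self.content))
        elif self.depth == 1:
            # Closing an outer list: record how many inner lists exist so far.
            self.outer_offsets.append(len(self.inner_offsets) - 1)
        else:
            raise ValueError("end_list without a matching begin_list")
        self.depth -= 1


def fill(builder, data):
    """Drive the builder with one command per bracket or integer."""
    for outer in data:
        builder.begin_list()
        for inner in outer:
            builder.begin_list()
            for value in inner:
                builder.integer(value)
            builder.end_list()
        builder.end_list()
    return builder


builder = fill(NestedListBuilder(), [[[1, 2], [3]], [], [[4], [5]]])
```

Each command returns to its caller immediately; the depth counter is what lets the next call know which commands are valid, matching the requirement in the excerpt that state survives between calls.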

AwkwardForth: accelerating Uproot with an internal DSL

et al. 2021

Self Cite


File formats for generic data structures, such as ROOT, Avro, and Parquet, pose a problem for deserialization: it must be fast, but its code depends on the type of the data structure, not known at compile-time. Just-in-time compilation can satisfy both constraints, but we propose a more portable solution: specialized virtual machines. AwkwardForth is a Forth-driven virtual machine for deserializing data into Awkward Arrays. As a language, it is not intended for humans to write, but it loosens the coupling between Uproot and Awkward Array. AwkwardForth programs for deserializing record-oriented formats (ROOT and Avro) are about as fast as C++ ROOT and 10–80× faster than fastavro. Columnar formats (simple TTrees, RNTuple, and Parquet) only require specialization to interpret metadata and are therefore faster with precompiled code.
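The idea of a specialized virtual machine can be illustrated with a toy interpreter. The sketch below is not AwkwardForth and does not use Forth syntax: the instruction names (`push`, `read`, `read_many`, `mark`), the type-description strings, and the byte layout (a little-endian 32-bit count followed by that many 64-bit floats per record) are all assumptions invented for this example. It shows only the general pattern: a type known at runtime is turned into a short program, and a fixed, precompiled interpreter runs that program over the raw bytes.

```python
import struct


def compile_program(type_description):
    """Turn a type description known only at runtime into VM instructions."""
    if type_description == "var * float64":
        return [
            ("read", "<i"),                    # push the record's item count
            ("read_many", "<d", "content"),    # pop count, read that many floats
            ("mark", "offsets", "content"),    # record the new list boundary
        ]
    if type_description == "float64":
        return [("push", 1), ("read_many", "<d", "content")]
    raise NotImplementedError(type_description)


def run(program, data):
    """Interpret the program repeatedly until the input bytes are consumed."""
    outputs = {"content": [], "offsets": [0]}
    stack = []
    pos = 0
    while pos < len(data):
        for instruction in program:
            op = instruction[0]
            if op == "push":
                stack.append(instruction[1])
            elif op == "read":
                (value,) = struct.unpack_from(instruction[1], data, pos)
                pos += struct.calcsize(instruction[1])
                stack.append(value)
            elif op == "read_many":
                count = stack.pop()
                fmt, target = instruction[1], instruction[2]
                size = struct.calcsize(fmt)
                for _ in range(count):
                    (value,) = struct.unpack_from(fmt, data, pos)
                    pos += size
                    outputs[target].append(value)
            elif op == "mark":
                outputs[instruction[1]].append(len(outputs[instruction[2]]))
            else:
                raise ValueError(f"unknown instruction: {op}")
    return outputs


# Hypothetical record-oriented input: three variable-length records.
records = [[1.5, 2.5], [], [3.5]]
data = b"".join(
    struct.pack("<i", len(r)) + struct.pack(f"<{len(r)}d", *r) for r in records
)
result = run(compile_program("var * float64"), data)
```

The interpreter itself never changes; only the instruction list depends on the data type, which is why this approach needs no compiler at runtime.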


“…Many scientists are taking advantage of Jupyter notebooks [6], which provide a straightforward way of documenting a data analysis. Column-wise data analysis [7], in which a single operation on a vector of events replaces calculations on individual events serially, is seen as a way for the field to take advantage of vector processing units in modern CPUs, leading to significant speed-ups in throughput. Also, declarative programming paradigms [8] can make it simpler for physicists to intuitively code their analysis.…”

Section: Introduction (mentioning)

confidence: 99%

Coffea-casa: an analysis facility prototype

et al. 2021


Data analysis in HEP has often relied on batch systems and event loops; users are given a non-interactive interface to computing resources and consider data event-by-event. The “Coffea-casa” prototype analysis facility is an effort to provide users with alternate mechanisms to access computing resources and enable new programming paradigms. Instead of the command-line interface and asynchronous batch access, a notebook-based web interface and interactive computing is provided. Instead of writing event loops, the column-based Coffea library is used. In this paper, we describe the architectural components of the facility, the services offered to end users, and how it integrates into a larger ecosystem for data access and authentication.



Cited by 18 publications

References 6 publications
