Parsing gigabytes of JSON per second

Langdale, Geoff; Lemire, Daniel

doi:10.1007/s00778-019-00578-5

Cited by 46 publications

(58 citation statements)

References 26 publications

“…We expect that such new instruction sets should be applicable to base64 decoding and encoding. Future work could also integrate fast base64 decoders inside vectorized parsers such as simdjson [15].…”

Section: Results (mentioning)

confidence: 99%

Base64 encoding and decoding at almost the speed of a memory copy

Muła

Lemire

2019

Softw Pract Exp

Self Cite

Many common document formats on the Internet are text-only, such as email (MIME) and the Web (HTML, JavaScript, JSON and XML). To include images or executable code in these documents, we first encode them as text using base64. Standard base64 encoding uses 64 ASCII characters: both lower and upper case Latin letters, digits and two other symbols. We show how we can encode and decode base64 data at nearly the speed of a memory copy (memcpy) on recent Intel processors, as long as the data does not fit in the first-level (L1) cache. We use the SIMD (Single Instruction Multiple Data) instruction set AVX-512 available on commodity processors. Our implementation generates several times fewer instructions than previous SIMD-accelerated base64 codecs. It is also more versatile, as it can be adapted, even at runtime, to any base64 variant by changing only constants.
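As an illustration of the abstract's point that base64 variants differ only in constants, here is a minimal scalar sketch in Python. It is not the AVX-512 codec from the paper; the names `STANDARD`, `URL_SAFE`, and `encode_base64` are hypothetical and chosen only for this example.

```python
# Scalar illustration only; the paper's codec uses AVX-512, which this does not model.
# The two alphabets differ in their last two symbols and nothing else.
STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
URL_SAFE = STANDARD[:62] + "-_"


def encode_base64(data: bytes, alphabet: str = STANDARD, pad: str = "=") -> str:
    """Encode bytes as base64 text using the given 64-character alphabet."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Pack up to three bytes into a 24-bit integer, zero-filling a short tail.
        n = int.from_bytes(chunk + b"\x00" * (3 - len(chunk)), "big")
        # Split into four 6-bit indexes and map each one through the alphabet.
        chars = [alphabet[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # A chunk of k bytes yields k + 1 significant characters; pad the rest.
        keep = len(chunk) + 1
        out.append("".join(chars[:keep]) + pad * (4 - keep))
    return "".join(out)
```

Switching variants means passing a different alphabet constant, for example `encode_base64(payload, URL_SAFE)`; the bit manipulation is unchanged.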

“…After loading v1, we detect all invalid 2-byte sequences at once using vectorized classification, a concept we documented in earlier work [11]. If a bit in the range 0-6 is set in all three looked-up patterns for a byte as checked with the AND instruction, that byte (and the UTF-8) is considered invalid.…”

Section: Invalid 2-byte Sequences (mentioning)

confidence: 99%
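The classification step quoted above (three table lookups combined with AND) can be sketched in scalar Python. This is a simplified illustration under stated assumptions, not the paper's tables: it covers only three error classes, the table contents and the names `PREV_HIGH`, `PREV_LOW`, `CUR_HIGH`, `classify_pair`, and `find_pair_errors` are hypothetical, and a SIMD implementation would perform the lookups on many bytes at once.

```python
# Error classes, one bit each (a reduced, hypothetical set for illustration).
TOO_SHORT = 1 << 0   # lead byte (11______) followed by a non-continuation byte
TOO_LONG = 1 << 1    # ASCII byte (0_______) followed by a continuation byte
OVERLONG_2 = 1 << 2  # 0xC0 or 0xC1 followed by a continuation byte

# Errors that do not depend on the low nibble of the previous byte must be
# set in every entry of that table, otherwise the AND would clear them.
CARRY = TOO_SHORT | TOO_LONG

# 16-entry tables indexed by a nibble of the previous or the current byte.
PREV_HIGH = [TOO_LONG] * 8 + [0] * 4 + [TOO_SHORT | OVERLONG_2] + [TOO_SHORT] * 3
PREV_LOW = [CARRY | OVERLONG_2] * 2 + [CARRY] * 14
CUR_HIGH = [TOO_SHORT] * 8 + [TOO_LONG | OVERLONG_2] * 4 + [TOO_SHORT] * 4


def classify_pair(prev: int, cur: int) -> int:
    """Return the error bits for the byte pair (prev, cur); 0 means none found."""
    # An error bit survives only if all three looked-up patterns contain it.
    return PREV_HIGH[prev >> 4] & PREV_LOW[prev & 0x0F] & CUR_HIGH[cur >> 4]


def find_pair_errors(data: bytes) -> list[int]:
    """Return indexes of bytes whose pair with the preceding byte is flagged."""
    return [i for i in range(1, len(data)) if classify_pair(data[i - 1], data[i])]
```

For instance, `classify_pair(0xC0, 0x80)` reports `OVERLONG_2`, while the valid pair `(0xC2, 0xA9)` yields 0. Errors that need more than two bytes of context, such as stray continuation bytes or truncated 3- and 4-byte sequences, are outside this sketch.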

“…There has been much work on the acceleration of text content using SIMD instructions (e.g., base64 [14,15], JSON [11], XML [16], HTML [17], CSV [18]). We are not aware of any published work directly related to Unicode validation using SIMD instructions other than our own [11]. Cameron [19] has worked on the related problem of UTF-8 to UTF-16 transcoding using SIMD instructions, but their approach is not applicable to high-speed validation.…”

Section: Related Work (mentioning)

confidence: 99%

Validating UTF‐8 in less than one instruction per byte

Keiser

Lemire

2020

Softw Pract Exp

Self Cite

“…However, with the release of CityJSON 1.0, an effort was made to optimize the CityJSON parser in azul to the same level as the CityGML parser. For this, azul 0.9 has a new parser based on the highly optimized experimental simdjson library (https://github.com/lemire/simdjson) (Langdale & Lemire, 2019), which uses modern processors' SIMD instructions to speed up parsing. It is worth noting that despite spending less time developing azul's CityJSON parser than the CityGML parser, azul is now able to parse CityJSON files twice or three times faster than the same files in CityGML.…”

Section: CityJSON (mentioning)

confidence: 99%