Highly Efficient Implementation of Block Ciphers on Graphic Processing Units for Massively Large Data

An, SangWoo; Seo, Seog Chung

doi:10.3390/app10113711

Cited by 16 publications

(9 citation statements)

References 10 publications


“…Research on optimizing the AES encryption process on GPUs is still ongoing. Recently, various optimization methods and results for AES on GPUs have been presented in [4], [5], and [6].…”

Section: Related Work (mentioning)

confidence: 99%

Designing a New XTS-AES Parallel Optimization Implementation Technique for Fast File Encryption

Seo

2022

IEEE Access

Self Cite


XTS-AES is a disk encryption mode of operation built on the block cipher AES. As disk sizes have grown, several studies have sought to speed up XTS-AES encryption, including work on parallel XTS-AES encryption using GPUs. Although these studies focus on parallel encryption of AES, they do not optimize the XTS mode as a whole, because the α^j computation included in XTS mode is not suited to parallel operation. Therefore, in this paper, we propose several techniques for high-speed encryption on GPUs by modifying XTS-AES into a form that is advantageous for parallel operation. The core idea is to pre-calculate the α^j values on the CPU in a form that is easy to process on the GPU. To achieve this goal, we analyze the α^j calculation process and identify the parts that can be optimized. First, based on this analysis, we present a method that replaces multiple operations with a single table lookup. We then propose a method that uses the table lookup technique to skip parts of the α^j computation, which otherwise must be calculated sequentially. We present various results for evaluating the proposed optimized implementation. In addition, we compare the performance of the OpenSSL XTS-AES implementation on a CPU with that of our proposed optimized implementation on a GPU.
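To make the table idea in this abstract concrete, here is a minimal Python sketch of the XTS tweak update. It is not the authors' implementation: the function names, the 256-entry `REDUCE` table, and the byte-sized jump are assumptions chosen for illustration. It relies only on the standard XTS definition, in which the tweak is a 128-bit value read as a little-endian integer and multiplying by α is a one-bit left shift followed by a reduction with 0x87.

```python
# Illustrative sketch only: XTS tweak update in GF(2^128), not the paper's code.
MASK128 = (1 << 128) - 1
POLY = 0x87  # x^128 is congruent to x^7 + x^2 + x + 1 in the XTS field


def mul_alpha(t: int) -> int:
    """Multiply a 128-bit tweak by alpha: one bit shift plus a reduction."""
    t <<= 1
    if t >> 128:
        t = (t & MASK128) ^ POLY
    return t


def _clmul_byte(b: int) -> int:
    """Carry-less product of one byte with POLY (degree at most 14)."""
    r = 0
    for i in range(8):
        if (b >> i) & 1:
            r ^= POLY << i
    return r


# Hypothetical lookup table: reduction of the byte shifted out past bit 127.
REDUCE = [_clmul_byte(b) for b in range(256)]


def mul_alpha8(t: int) -> int:
    """Multiply by alpha^8 with one byte shift and one table lookup."""
    overflow = t >> 120
    return ((t << 8) & MASK128) ^ REDUCE[overflow]


def tweak_sequential(t0: int, j: int) -> int:
    """Reference: reach t0 * alpha^j with j single-bit steps."""
    t = t0
    for _ in range(j):
        t = mul_alpha(t)
    return t


def tweak_at(t0: int, j: int) -> int:
    """Reach t0 * alpha^j with byte-sized jumps, then the leftover bit steps."""
    t = t0
    for _ in range(j // 8):
        t = mul_alpha8(t)
    for _ in range(j % 8):
        t = mul_alpha(t)
    return t
```

In this sketch, `tweak_at(t0, j)` gives the same result as `tweak_sequential(t0, j)` while performing roughly one eighth as many steps, which is the kind of saving a lookup table can provide when many tweaks have to be prepared ahead of time.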


“…Algorithm 2 is a summary of the θ process. Algorithm 2 proposes a calculation method for one lane of the initial θ process. That is, a value used repeatedly in the θ process is calculated using lanes belonging to the same sheet.…”

χ and ι process for state[0] (algorithm listing captured with the quoted passage):

    asm("{\n\t"
        ".reg .b64 value;\n\t"             // register setting
        "not.b64 value, %2;\n\t"           // NOT operation: value = ~%2
        "and.b64 value, value, %3;\n\t"    // AND operation: value &= %3
        "xor.b64 %0, value, %1;\n\t"       // end of the χ process
        "xor.b64 %0, %0, %4; }\n\t"        // ι process
        : "=l"(state[0])                   // output parameter: %0 = state[0]
        : "l"(Buffer[0]), "l"(Buffer[6]), "l"(Buffer[12]), "l"(RC[round])
                                           // input parameters: Buffer[0], Buffer[6], Buffer[12], and RC[round]
    );

Section: Optimization of SHA-3 Internal Process (mentioning)

confidence: 99%

Fast Implementation of SHA-3 in GPU Environment

Choi

Seo²

2021

IEEE Access

Self Cite


Recently, Graphics Processing Units (GPUs) have been widely used for general-purpose applications such as machine learning and the acceleration of cryptographic applications (especially blockchains). The development of CUDA made this general-purpose computing on GPUs possible. In particular, GPU technology is now widely used in server-side applications to provide fast and efficient service to many clients. In other words, servers need to process a large amount of user data and execute authentication processes. Verifying the integrity of transmitted data is essential for ensuring that the data is not modified during transmission. Hash functions are cryptographic algorithms that can verify the integrity of data; the standard hash functions are SHA-1, SHA-2, and SHA-3. In 2015, NIST standardized the Keccak algorithm, the winner of its SHA-3 competition, as SHA-3. However, until now, software implementations of SHA-3 have not provided enough performance for various applications. In addition, SHA-3 and SHAKE, which is based on SHA-3, are used in many Post-Quantum Cryptosystems (PQC) submitted to the NIST PQC competition. Therefore, research on optimizing SHA-3 in software is required. We propose an optimized SHA-3 software implementation for the GPU environment. For performance efficiency, we propose several techniques, including optimization of the SHA-3 internal process, inline PTX optimization, optimized memory usage, and the application of asynchronous CUDA streams. With the proposed optimization methods applied, our SHA-3(512) (resp. SHA-3(256)) implementation without CUDA streams provides a maximum throughput of 88.51 Gb/s (resp. 171.62 Gb/s) on an RTX2080Ti GPU. Furthermore, without CUDA streams, our SHA-3(512) software on a GTX1070 provides about 49.73% higher throughput than the previous best work on a GTX1080, which shows the superiority of our proposed optimization methods. Our optimized SHA-3 software on GPU can be used efficiently for blockchain applications and several PQCs (especially the key generation process in lattice-based cryptosystems).
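The χ and ι listing in the citation statement above can be restated in plain Python to make the lane arithmetic easier to check. This is a sketch for illustration, not the authors' code: the function name `chi_iota_lane0` is an assumption, and the three lane arguments stand for the `Buffer[0]`, `Buffer[6]`, and `Buffer[12]` operands named in the quoted listing. The two round constants are the first two Keccak-f[1600] constants.

```python
# Illustrative restatement of the quoted chi/iota step for a single 64-bit lane.
MASK64 = (1 << 64) - 1

# First two Keccak-f[1600] round constants (iota step).
RC = [0x0000000000000001, 0x0000000000008082]


def chi_iota_lane0(b0: int, b1: int, b2: int, rc: int) -> int:
    """Compute b0 ^ (~b1 & b2) ^ rc on 64-bit lanes.

    Each line mirrors one instruction of the quoted inline PTX.
    """
    value = ~b1 & MASK64  # not.b64: Python ints are unbounded, so mask to 64 bits
    value &= b2           # and.b64
    out = value ^ b0      # xor.b64: end of the chi step
    return out ^ rc       # xor.b64: iota step adds the round constant
```

For example, `chi_iota_lane0(0x0F, 0xFF, 0xF0F0, RC[0])` evaluates `0x0F ^ (~0xFF & 0xF0F0) ^ 0x1`. Fusing χ and ι in this way means the lane is written once rather than once per step, which appears to be the motivation for handling both steps in a single inline assembly block.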


“…By actively exploiting the characteristics of warps, terabit-level throughput was reported for various block ciphers. In [14], CHAM and LEA were optimized for the GPU environment. Terabit-level throughput was achieved by jointly resolving various memory problems that can occur in the GPU environment.…”

Section: Related Work (mentioning)

confidence: 99%

Parallel Implementations of ARX-Based Block Ciphers on Graphic Processing Units

Kim

Kwon

et al. 2020

Mathematics

Self Cite


With the development of information and communication technology, various types of Internet of Things (IoT) devices have come into wide use for convenient services. Many users request various services from servers through their IoT devices. Thus, the amount of users' personal information that servers need to protect has increased dramatically. To protect users' personal information quickly and safely, it is necessary to optimize the speed of the encryption process. Since it is difficult for a server to provide its basic services while encrypting a large amount of data on a conventional CPU, several parallel optimization methods using Graphics Processing Units (GPUs) have been considered. In this paper, we propose several optimization techniques using GPUs for the efficient implementation of lightweight block cipher algorithms on the server side. As target algorithms, we select HIGh security and light weigHT (HIGHT), Lightweight Encryption Algorithm (LEA), and revised CHAM, which are Add-Rotate-Xor (ARX)-based block ciphers, because they are widely used on IoT devices. We utilize the features of the counter (CTR) operation mode to reduce unnecessary memory copying and operations in the GPU environment. In addition, we optimize memory usage by making full use of the GPU's on-chip memory, such as registers and shared memory, and implement the core function of each target algorithm with inline PTX assembly code to maximize performance. With our optimization methods and handcrafted PTX code, we achieve encryption throughput of 468, 2593, and 3063 Gbps for HIGHT, LEA, and revised CHAM, respectively, on an NVIDIA RTX 2070 GPU. We also present optimized implementations of the Counter Mode Based Deterministic Random Bit Generator (CTR_DRBG), one of the widely used deterministic random bit generators, to provide a large amount of random data to connected IoT devices. We apply several optimization techniques to maximize the performance of CTR_DRBG and achieve performance improvements of 52.2, 24.8, and 34.2 times over a CPU-side CTR_DRBG implementation when HIGHT-64/128, LEA-128/128, and CHAM-128/128, respectively, are used as the underlying block cipher.
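One property of CTR mode that fits this description is that the counter block for position i depends only on the nonce and i, so a worker can derive it from its own index instead of reading a pre-filled counter buffer. The Python sketch below illustrates that property under stated assumptions: `toy_arx_block` is a hypothetical add-rotate-xor mixing function invented for this example, not HIGHT, LEA, or CHAM, and it offers no security. The abstract does not say exactly how the authors exploit CTR mode, so this shows only the general idea.

```python
# Illustrative sketch: per-index counter derivation in CTR mode (toy cipher).
MASK32 = 0xFFFFFFFF
BLOCK_BYTES = 8


def rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit word left by r bits (0 < r < 32)."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def toy_arx_block(key, nonce: int, counter: int, rounds: int = 8) -> bytes:
    """Toy add-rotate-xor mixing of (nonce, counter). NOT a real cipher."""
    x, y = nonce & MASK32, counter & MASK32
    for r in range(rounds):
        x = (x + (y ^ key[r % len(key)])) & MASK32  # add
        x = rotl32(x, 7)                            # rotate
        y ^= x                                      # xor
        y = rotl32(y, 13)
    return x.to_bytes(4, "little") + y.to_bytes(4, "little")


def ctr_block(key, nonce: int, index: int, data_block: bytes) -> bytes:
    """Process block `index`; the counter is derived from the index itself."""
    keystream = toy_arx_block(key, nonce, index)
    return bytes(d ^ k for d, k in zip(data_block, keystream))


def ctr_process(key, nonce: int, data: bytes, order=None) -> bytes:
    """Encrypt or decrypt `data`; blocks may be visited in any order."""
    blocks = [data[i:i + BLOCK_BYTES] for i in range(0, len(data), BLOCK_BYTES)]
    out = [b""] * len(blocks)
    for i in (order if order is not None else range(len(blocks))):
        out[i] = ctr_block(key, nonce, i, blocks[i])
    return b"".join(out)
```

Because every block depends only on `(key, nonce, index)`, visiting the blocks in a different order gives the same ciphertext, and applying `ctr_process` a second time restores the plaintext. On a GPU, the loop index would correspond to a thread's global index, which is why no counter array has to be copied to the device in such a design.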
