SUMMARY: This paper presents a software-based parallel cryptographic solution built on a massive-parallel memory-embedded SIMD matrix (MTX) for data-storage systems. MTX can contain up to 2,048 2-bit processing elements, connected by a flexible switching network, and supports 2-bit, 2,048-way bit-serial and word-parallel operations with a single command. A next-generation SIMD matrix, MX-2, has also been developed by expanding the processing-element capability of MTX from 2-bit to 4-bit processing. These SIMD matrix architectures have been verified as a better alternative for repeated arithmetic and logical operations in multimedia applications with low power consumption. Moreover, we propose combining Content Addressable Memory (CAM) technology with the massive-parallel memory-embedded SIMD matrix architecture to enable fast pipelined table-lookup coding. Since both arithmetic-logical operations and table-lookup coding execute extremely fast on these architectures, encryption and decryption algorithms can be executed efficiently. Evaluation of the CAM-less and CAM-enhanced massive-parallel SIMD matrix processors on the Advanced Encryption Standard (AES), a widely used cryptographic algorithm, shows that a throughput of up to 2.19 Gbps is possible. This covers several standard data-storage transfer specifications, such as SD, CF (Compact Flash), USB (Universal Serial Bus), and SATA (Serial Advanced Technology Attachment). Consequently, the massive-parallel SIMD matrix architecture is well suited to protecting private information on several data-storage media. A further advantage of the software-based solution is that the implemented cryptographic algorithm can be flexibly updated to a safer future algorithm. The massive-parallel memory-embedded SIMD matrix architecture (MTX and MX-2) is therefore a promising solution for the integrated realization of real-time cryptographic algorithms with low power dissipation and small silicon-area consumption. Key words: matrix-processing architecture, SIMD, bit-serial and word-parallel, CAM, table-lookup coding, cryptographic algorithm, AES
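As an illustration of the table-lookup coding that the abstract above credits for the AES throughput, the following C sketch models the SubBytes step in plain scalar code: the S-box is built once from the standard Rijndael construction, and each byte substitution is then a pure table lookup, the kind of operation a CAM-enhanced SIMD matrix would issue across all processing elements with a single command rather than in a loop. The function and table names here are hypothetical and are not part of the MTX/MX-2 instruction set.

```c
#include <stdint.h>

static uint8_t SBOX[256];

/* Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1) p ^= a;
        uint8_t carry = a & 0x80;
        a = (uint8_t)(a << 1);
        if (carry) a ^= 0x1B;
        b >>= 1;
    }
    return p;
}

static uint8_t rotl8(uint8_t x, int n) {
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* Build the S-box once: S(x) = affine(inverse(x)), with inverse(0) := 0. */
static void init_sbox(void) {
    for (int x = 0; x < 256; x++) {
        uint8_t inv = 0;
        for (int y = 1; y < 256 && x != 0; y++)
            if (gf_mul((uint8_t)x, (uint8_t)y) == 1) { inv = (uint8_t)y; break; }
        SBOX[x] = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                            rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
    }
}

/* SubBytes on a 16-byte AES state: nothing but 16 independent table lookups. */
static void sub_bytes(uint8_t state[16]) {
    for (int i = 0; i < 16; i++)
        state[i] = SBOX[state[i]];
}
```

In the scalar model the lookup is a loop; on the architectures described above, the same table resides in the CAM and every processing element performs its lookup in the same pipelined cycle, which is where the reported throughput comes from.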
This paper presents secure data processing with a massive-parallel single-instruction multiple-data (SIMD) matrix for embedded system-on-chip (SoC) designs in digital-convergence mobile devices. Recent mobile devices must employ technology for securing private information, such as cipher processing, to prevent the leakage of personal data. However, this adds to the device's required specifications, particularly for the cipher implementation: fast processing, low power consumption, low hardware cost, adaptability, and simple end-user operation for maintaining a safe state. To satisfy these security-related requirements, we propose interleaved-bitslice processing, which combines two processing concepts (bitslice processing and interleaved processing), for parallel block-cipher processing with five confidentiality modes on mobile processors. Furthermore, we adopt a massive-parallel SIMD matrix processor (MX-1) for interleaved-bitslice processing to verify the effectiveness of the parallel block-cipher implementation. From the block ciphers approved by the Federal Information Processing Standards (FIPS), the Data Encryption Standard (DES), Triple-DES, and Advanced Encryption Standard (AES) algorithms are selected as implementation targets. For the AES algorithm, which is the main focus of this paper, the MX-1 implementation requires up to 93% fewer clock cycles per byte than other conventional mobile processors, and the MX-1 results are almost constant across all confidentiality modes. The practical energy efficiency of parallel block-cipher processing on the MX-1 evaluation board was found to be about 4.8 times higher than that of a BeagleBoard-xM, a single-board computer based on the ARM Cortex-A8 mobile processor. Furthermore, to improve on single-bit logical operations, we propose a multi-bit logical library for interleaved-bitslice cipher processing on MX-1, which yields the smallest clock-cycle count among related studies. Consequently, interleaved-bitslice block-cipher processing with five confidentiality modes on MX-1 is effective for implementing parallel block-cipher processing on several digital-convergence mobile devices.
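The bitslice half of the interleaved-bitslice method described above can be sketched in portable C: bit i of many independent blocks is packed into machine word i, so one bitwise instruction acts on all blocks at once, and interleaving supplies those independent blocks even under feedback confidentiality modes. This is a minimal conceptual sketch with hypothetical helper names; it does not use the MX-1 instruction set or its register model.

```c
#include <stdint.h>

#define LANES 64  /* one independent block (or stream) per bit of a uint64_t */

/* Transpose 64 bytes, one per lane, into 8 bit-plane words:
 * slice[i] holds bit i of every lane's value. */
static void to_bitslice(const uint8_t in[LANES], uint64_t slice[8]) {
    for (int bit = 0; bit < 8; bit++) {
        uint64_t plane = 0;
        for (int lane = 0; lane < LANES; lane++)
            plane |= (uint64_t)((in[lane] >> bit) & 1u) << lane;
        slice[bit] = plane;
    }
}

/* Inverse transpose, back to one byte per lane. */
static void from_bitslice(const uint64_t slice[8], uint8_t out[LANES]) {
    for (int lane = 0; lane < LANES; lane++) {
        uint8_t v = 0;
        for (int bit = 0; bit < 8; bit++)
            v |= (uint8_t)(((slice[bit] >> lane) & 1u) << bit);
        out[lane] = v;
    }
}

/* Example round step: XOR one key byte into all 64 lanes with at most
 * 8 word-wide operations instead of 64 byte-wide ones. */
static void xor_key_bitsliced(uint64_t slice[8], uint8_t key_byte) {
    for (int bit = 0; bit < 8; bit++)
        if ((key_byte >> bit) & 1u)
            slice[bit] ^= ~(uint64_t)0;  /* flip this bit in every lane at once */
}
```

In a full bitsliced cipher, the S-boxes and permutations are likewise expressed as bitwise word operations, so the cost per block is dominated by uniform logic instructions, which is consistent with the nearly constant cycle counts across confidentiality modes reported above.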
Several multimedia applications have recently been implemented on mobile devices, including digital image compression, video compression, and audio processing. Furthermore, Artificial Intelligence (AI) processing has grown in popularity, requiring mobile devices to handle large amounts of data. The processing core in a mobile device therefore requires high performance, programmability, and versatility. Multimedia applications for mobile devices typically consist of repeated arithmetic and table-lookup coding operations. A Content Addressable Memory-based massive-parallel SIMD matriX core (CAMX) is presented to increase the processing speed of both operation types. The CAMX serves as a CPU-core accelerator for mobile devices: it supports highly parallel processing and is equipped with two CAM modules for high-speed repeated arithmetic and table-lookup coding. Because it can handle logical, arithmetic, search, and shift operations in parallel, the CAMX offers strong performance, programmability, and versatility on mobile devices. This paper shows that the CAMX can process repeated arithmetic and table-lookup coding operations in parallel; single-precision floating-point addition over 1024 entries completes in 5613 clock cycles without embedding a dedicated floating-point arithmetic unit. This cycle count, obtained with the two's-complement, instruction-reduced floating-point addition, is 59% lower than that of the straightforward floating-point addition implementation. The straightforward floating-point addition is improved into a two's-complement, instruction-reduced algorithm; to support this, the paper proposes an instruction-reduction architecture that modifies the CAMX so that data in the left and right CAM modules can be accessed directly from the preserve register. The CAMX thus achieves high performance, programmability, and versatility without embedding a dedicated processing unit. Moreover, assuming the CAMX operates at 0.1, 0.5, 1.0, or 1.5 GHz, it outperforms an ARM core using NEON and the Vector Floating-Point (VFP) unit for floating-point additions over approximately 4500 or more parallel data elements. In addition, the CAMX is compared, at the same operating frequency, with related works that use software instructions, a dedicated floating-point arithmetic unit, or both. The results show that a CAMX with 128-bit, 1024-entry CAM modules achieves higher performance than related works that use only software instructions and those that combine software instructions with a dedicated floating-point arithmetic unit. © 2023 Institute of Electrical Engineers of Japan. Published by Wiley Periodicals LLC.
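The "two's-complement, instruction-reduced" floating-point addition mentioned above can be illustrated with a simplified scalar C model: instead of selecting between separate add and subtract paths based on the operands' sign bits, both significands are folded into signed (two's-complement) form so that a single signed addition covers every sign combination. This is an assumption-laden sketch for normal, finite inputs only (no rounding, subnormals, overflow, infinities, or NaNs) and is not the CAMX algorithm itself.

```c
#include <stdint.h>
#include <string.h>

/* Simplified IEEE-754 single-precision addition for normal, finite inputs. */
static float fadd_sketch(float fa, float fb) {
    uint32_t a, b;
    memcpy(&a, &fa, 4);
    memcpy(&b, &fb, 4);

    int32_t ea = (int32_t)((a >> 23) & 0xFF);
    int32_t eb = (int32_t)((b >> 23) & 0xFF);

    /* If the exponents are far apart, the smaller operand cannot change the
     * (unrounded) result, so return the larger one directly. */
    if (ea - eb > 27) return fa;
    if (eb - ea > 27) return fb;

    /* Align both significands (implicit leading 1 plus 3 guard bits)
     * to the larger exponent while they are still non-negative. */
    int32_t e = ea > eb ? ea : eb;
    int64_t ma = ((int64_t)((a & 0x7FFFFFu) | 0x800000u) << 3) >> (e - ea);
    int64_t mb = ((int64_t)((b & 0x7FFFFFu) | 0x800000u) << 3) >> (e - eb);

    /* Fold each sign bit into its significand: the two's-complement trick
     * that lets one signed addition replace separate add/subtract paths. */
    if (a >> 31) ma = -ma;
    if (b >> 31) mb = -mb;
    int64_t m = ma + mb;     /* covers +/+, +/-, -/+ and -/- uniformly */

    uint32_t sign = 0;
    if (m < 0) { sign = 1u << 31; m = -m; }
    if (m == 0) return 0.0f;

    /* Renormalize so the leading 1 sits at bit 26 (23 fraction + 3 guard bits). */
    while (m >= (int64_t)1 << 27) { m >>= 1; e++; }
    while (m <  (int64_t)1 << 26) { m <<= 1; e--; }

    uint32_t r = sign | ((uint32_t)e << 23) | ((uint32_t)(m >> 3) & 0x7FFFFFu);
    float out;
    memcpy(&out, &r, 4);
    return out;
}
```

Merging the sign handling into the significand arithmetic removes the compare-and-branch work per element, which is the kind of instruction reduction that matters most when the same operation is replayed across a thousand CAM entries in lockstep.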