Hardware-Efficient 2D-DCT/IDCT Architecture for Portable HEVC-Compliant Devices

Singhadia, Ashish; Mamillapalli, Meghan; Chakrabarti, Indrajit

doi:10.1109/tce.2020.3006213

Cited by 21 publications

(7 citation statements)

References 27 publications

“…Although the processing time of a single 8×8 block is 7 cycles, it consumes a large amount of on-chip DSP resources. Both the architectures proposed by Mert et al [35] and Singhadia et al [37] consume a lot of LUT and register resources. The architecture of [35] supports 2D DCT transformation units as 4×4 and 8×8, while [37] supports 4-/8-/16-/32point length 2D DCT/IDCT.…”

Section: Results (mentioning)

confidence: 99%

“…Both the architectures proposed by Mert et al [35] and Singhadia et al [37] consume a lot of LUT and register resources. The architecture of [35] supports 2D DCT transformation units as 4×4 and 8×8, while [37] supports 4-/8-/16-/32point length 2D DCT/IDCT. In [36], a new method called multiple transform selection (MTS) is proposed and selects the appropriate transform type for 2D DCT, running at 164 MHz on Arria 10 FPGA.…”

Section: Results (mentioning)

confidence: 99%

Effective Hardware Accelerator for 2D DCT/IDCT Using Improved Loeffler Architecture

Zhou, Zhang. 2022. IEEE Access

This paper proposes an effective hardware accelerator for 2D 8×8 discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) using an improved Loeffler architecture. The accelerator optimizes the data stream of the Loeffler 8-point 1D DCT/IDCT according to the characteristics of image and video processing. An 8-stage pipeline structure greatly improves the processing speed by reasonably dividing the number of clock cycles and simplifying the arithmetic operations in each cycle. The multiplication-free approximation of the DCT coefficients is implemented through adders and shifters, combined with both fixed-point and canonic signed digit (CSD) coding. In particular, the proposed fast parallel transposed matrix architecture performs row-column coefficient conversion with lower circuit complexity. The FPGA implementation of the proposed architecture uses a Virtex-7 XC7VX330T device, running at 288 MHz with a throughput of 558 Mpixel/s and a Full HD real-time frame rate of up to 269 fps. Only 33 cycles are required to complete the 2D DCT/IDCT of an 8×8 block, so the design can be used as a high-performance hardware accelerator for image and video compression encoding.
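The abstract's idea of replacing constant multipliers with adders and shifters through CSD coding can be illustrated with a minimal sketch. The function names `to_csd` and `csd_multiply` are hypothetical, and the constant 181 (roughly cos(π/4) scaled by 2^8) is an assumed example value, not a coefficient taken from the paper.

```python
def to_csd(value):
    """Return the canonic signed digit form of a non-negative integer.

    Digits are -1, 0 or +1, least significant first, with no two
    adjacent non-zero digits.
    """
    digits = []
    while value != 0:
        if value & 1:
            # value mod 4 == 1 gives +1, value mod 4 == 3 gives -1, so the
            # remainder is divisible by 4 and the next digit is zero.
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def csd_multiply(x, digits):
    """Multiply x by the constant encoded in digits using shifts and adds only."""
    acc = 0
    for shift, digit in enumerate(digits):
        if digit == 1:
            acc += x << shift
        elif digit == -1:
            acc -= x << shift
    return acc


# Assumed example: 181 approximates cos(pi/4) with 8 fractional bits.
COEFF_DIGITS = to_csd(181)
scaled_product = csd_multiply(100, COEFF_DIGITS)  # same value as 100 * 181
```

Each non-zero digit costs one adder or subtractor in hardware, so a constant with fewer non-zero CSD digits is cheaper to multiply by. The sketch models only the arithmetic, not the paper's pipeline or fixed-point word lengths.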

“…Therefore, an efficient hardware circuit is needed to implement fast IDCT. For 2D image blocks, a variety of fast IDCT technologies have been proposed in many researches [24][25][26][27][28], which can significantly improve the throughput and reduce computation. For 8×8 image blocks, 2D IDCT can be expressed as:…”

Section: Fast IDCT (mentioning)

confidence: 99%

An FPGA‐based JPEG preprocessing accelerator for image classification

Zhang, Guo, et al. 2022. The Journal of Engineering

FPGA‐based image classification accelerators have achieved success in many practical applications. However, most accelerators focus on solving the problem of convolution computation efficiency. End‐to‐end image classification involves many non‐convolutional operations, which can also become performance bottlenecks. Therefore, the authors propose an FPGA‐based JPEG preprocessing accelerator, which can accelerate non‐convolution operations of JPEG before feature extraction. To improve throughput and energy efficiency, four hardware structures are adopted in the design: 1) adaptive image block; 2) fast IDCT; 3) image block buffer; and 4) image block self‐location. The proposed design is evaluated on Xilinx XCZU7EV. Compared with optimized CPU and GPU implementations, the energy efficiency is improved by 23.07 times and 4.21 times, respectively. The throughput is 2.52 times better than the CPU implementation. The authors also demonstrate its practicality through a case study of image classification. These experimental results demonstrate its superior performance in terms of throughput and energy efficiency.

“…The low-area cost design for multiple transform size HEVC applications using shifts and additions is presented in [22], in which 112 K gate counts are required for a 2-D IDCT transform. The 2-D DCT/IDCT [24] computes 2-D 4-/8-/16-/32-point DCT/IDCT and consumes 120 K gates supporting the 4K HEVC video sequences. As presented in Table 2, the proposed design achieves the smallest area cost when supporting multiple transform dimensions.…”

Section: Comparison With Existing Studies (mentioning)

confidence: 99%

A low-area high-efficiency video coding inverse transform core using resource and time sharing architecture

Chen, Liu. 2020. EURASIP J. Adv. Signal Process.

In this paper, a very-large-scale integration (VLSI) design that can support high-efficiency video coding inverse discrete cosine transform (IDCT) for multiple transform sizes is proposed. The proposed two-dimensional (2-D) IDCT is implemented at a low area by using a single one-dimensional (1-D) IDCT core with a transpose memory. The proposed 1-D IDCT core decomposes a 32-point transform into 16-, 8-, and 4-point matrix products according to the symmetric property of the transform coefficients. Moreover, a shift-and-add unit is used to share hardware resources between the matrix products of the different transform sizes. The 1-D IDCT core can simultaneously calculate the first- and second-dimensional data. The results indicate that the proposed 2-D IDCT core has a throughput rate of 250 MP/s with a gate count of only 110 K when implemented in Taiwan Semiconductor Manufacturing Company (TSMC) 90-nm complementary metal-oxide-semiconductor (CMOS) technology. The results show that the proposed circuit has the smallest area among designs supporting multiple transform sizes.
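The two structural ideas in this abstract, an even/odd split that exploits coefficient symmetry and a 2-D transform built from one 1-D core plus a transpose, can be sketched in software on the smallest case. The sketch below uses the standard HEVC 4-point core matrix. The function names are hypothetical, and the paper's 32-point decomposition, shift-and-add sharing, intermediate scaling, rounding and simultaneous two-dimension scheduling are deliberately left out.

```python
# HEVC 4-point core transform matrix (integer approximation of the DCT).
M4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]


def idct4(y):
    """1-D 4-point inverse transform x = M4^T * y using the even/odd split.

    Even-indexed inputs feed the even part, odd-indexed inputs feed the odd
    part, and symmetry lets each part be reused for two outputs.
    """
    e0 = 64 * (y[0] + y[2])
    e1 = 64 * (y[0] - y[2])
    o0 = 83 * y[1] + 36 * y[3]
    o1 = 36 * y[1] - 83 * y[3]
    return [e0 + o0, e1 + o1, e1 - o1, e0 - o0]


def transpose(a):
    """Swap rows and columns (the role played by the transpose memory)."""
    return [list(col) for col in zip(*a)]


def one_pass(a):
    """Run the single 1-D core on every row, then transpose the result."""
    return transpose([idct4(row) for row in a])


def idct2d_4x4(y):
    """Unscaled 2-D inverse transform M4^T * Y * M4 from two identical passes."""
    return one_pass(one_pass(y))
```

Calling `idct2d_4x4` on a 4×4 coefficient block returns the same integers as the direct matrix product M4ᵀ·Y·M4, while the even/odd split needs 6 multiplications per 1-D transform instead of 16. A real decoder would also apply the standard's right shifts and clipping after each pass, which this sketch omits.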
