Versatile Video Coding (VVC) will be released by 2020, and it is expected to be the next-generation video coding standard. One of its enhancements is multiple transform selection (MTS) for the core transform. MTS uses three different types of 2D discrete sine/cosine transforms (DCT-II, DCT-VIII and DST-VII) and transform unit sizes of up to 64 × 64. With this scheme, significant improvements in compression ratio are obtained at the expense of higher computational complexity in both encoders and decoders. In this paper, a deeply pipelined high-performance architecture is proposed that implements the three transforms for sizes from 4 × 4 to 64 × 64 according to working draft 4 (WD 4) of the standard. The design has been described in very high-speed integrated circuit hardware description language (VHDL), and it has been prototyped in a system on a programmable chip (SoPC). It is able to process up to 64 fps at 3840 × 2160 for 4 × 4 transform sizes. To the best of our knowledge, this is the first implementation of an architecture for VVC MTS supporting the 64 × 64 size.
INDEX TERMS: FPGA, hardware architecture, multiple transform selection, pipeline, SoPC, versatile video coding.
* The architecture proposed in this paper has been implemented and tested in accordance with WD 4.
† The number of multiplications required by a direct implementation of a 2D N × N point DCT/DST is N².
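To make the three MTS transform types concrete, the following is a minimal floating-point sketch of the textbook orthonormal DCT-II, DCT-VIII and DST-VII basis matrices and of a separable 2D transform built from them. It is an illustration under stated assumptions, not the architecture described in the abstract: the VVC drafts specify scaled integer approximations of these matrices, and the function names (`dct2`, `dct8`, `dst7`, `forward_2d`, `inverse_2d`) are hypothetical.

```python
import math

def dct2(n):
    # Orthonormal DCT-II basis; row i is the i-th basis function.
    rows = []
    for i in range(n):
        w = math.sqrt(0.5) if i == 0 else 1.0
        rows.append([w * math.sqrt(2.0 / n)
                     * math.cos(math.pi * i * (2 * j + 1) / (2 * n))
                     for j in range(n)])
    return rows

def dct8(n):
    # Orthonormal DCT-VIII basis.
    s = math.sqrt(4.0 / (2 * n + 1))
    return [[s * math.cos(math.pi * (2 * i + 1) * (2 * j + 1) / (4 * n + 2))
             for j in range(n)] for i in range(n)]

def dst7(n):
    # Orthonormal DST-VII basis.
    s = math.sqrt(4.0 / (2 * n + 1))
    return [[s * math.sin(math.pi * (2 * i + 1) * (j + 1) / (2 * n + 1))
             for j in range(n)] for i in range(n)]

def matmul(a, b):
    # Plain matrix product of two lists of rows.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def forward_2d(block, t_ver, t_hor):
    # Separable 2D transform: Y = T_ver * X * T_hor^T.
    return matmul(matmul(t_ver, block), transpose(t_hor))

def inverse_2d(coeffs, t_ver, t_hor):
    # Inverse for orthonormal bases: X = T_ver^T * Y * T_hor.
    return matmul(matmul(transpose(t_ver), coeffs), t_hor)
```

Because the vertical and horizontal bases are passed separately, a call such as `forward_2d(block, dst7(4), dct8(4))` mixes transform types across the two dimensions, which is the kind of per-direction choice that MTS makes available.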
In this paper, we show how to derive all the optimum multi-path delay commutator (MDC) fast Fourier transform (FFT) hardware architectures in terms of delays and multiplexers, and we calculate the number of such architectures. The proposed approach is based on analyzing the orders at the FFT stages that lead to the optimum number of delays and multiplexers. The results show that there exists a large number of optimum MDC FFTs. This large design space can be explored in the future in order to design efficient MDC architectures that optimize not only the number of delays and multiplexers, but also other figures of merit such as the number of rotators or the input/output data order.
In this paper, we propose two efficient implementations of complex multipliers on field-programmable gate arrays (FPGAs) using DSP slices. The first implementation aims for high throughput and the second one for low area. By mapping these circuits to the DSP slices in the FPGA, the proposed implementations have the advantage that they only require three DSP slices. Experimental results show that the proposed high-throughput implementation saves hardware resources with respect to previous approaches, while reaching the highest achievable clock frequency. Alternatively, the proposed low-area implementation reduces the amount of hardware resources even further at the cost of reducing the clock frequency.
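The abstract does not say which arithmetic formulation is mapped onto the three DSP slices. One well-known way to obtain a complex product with three real multiplications instead of four is sketched below; treating it as the structure used in the paper is an assumption, and the function name `complex_mul_3` is hypothetical.

```python
def complex_mul_3(a, b, c, d):
    """Return (re, im) of (a + jb) * (c + jd) using three multiplications.

    Assumption: this is the classic three-multiplier identity, shown only
    to illustrate why three multipliers can suffice; it is not taken from
    the paper.
    """
    k1 = c * (a + b)   # shared term
    k2 = a * (d - c)
    k3 = b * (c + d)
    re = k1 - k3       # = a*c - b*d
    im = k1 + k2       # = a*d + b*c
    return re, im
```

The saving of one multiplier is paid for with extra additions before the multiplications, and in hardware a pre-adder in front of each multiplier is one way such additions might be absorbed.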
This paper presents a novel CORDIC-based approach for computing arcsine and arccosine functions. Previous approaches based on CORDIC either calculate double iterations, which increases the complexity of stages, or have a high approximation error. By contrast, the proposed approach presents a novel compensation of the gain in the rotations that allows for an accurate computation of the arcsine and arccosine that does not increase the complexity of the stages. The proposed approach has been implemented on an FPGA to demonstrate its benefits.
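For context, the double-iteration scheme that the abstract names as a prior approach can be sketched in floating point as follows. This is the baseline, not the gain compensation proposed in the paper: each stage applies the same micro-rotation twice, so the vector gain per stage becomes 1 + 2^(-2i) and the comparison target can be rescaled exactly. The function names and the default of 40 iterations are assumptions chosen for illustration.

```python
import math

def cordic_asin(t, iterations=40):
    """Approximate asin(t) for -1 <= t <= 1 with double-iteration CORDIC."""
    x, y, z = 1.0, 0.0, 0.0
    for i in range(iterations):
        p = 2.0 ** -i
        sign_x = -1.0 if x < 0.0 else 1.0
        # Rotate towards the target; flip direction if x went negative.
        sigma = sign_x if y <= t else -sign_x
        for _ in range(2):  # the same micro-rotation is applied twice
            x, y = x - sigma * y * p, y + sigma * x * p
        z += 2.0 * sigma * math.atan(p)
        # Two rotations scale the vector by (1 + 2^-2i); scale the target too.
        t *= 1.0 + p * p
    return z

def cordic_acos(t, iterations=40):
    # arccos follows from the identity acos(t) = pi/2 - asin(t).
    return math.pi / 2.0 - cordic_asin(t, iterations)
```

The two rotations per stage are what the abstract refers to as increased stage complexity: in hardware, each stage needs roughly twice the shift-and-add logic of a single-rotation CORDIC stage.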