This paper presents two new architecture optimizations to improve the processing performance of video applications with a high degree of data parallelism in VLIW processors. First, a new register file access mechanism, called X4 operation mode, allows access to wide operands made up of several consecutive registers in the register file, while keeping the normal functionality of the register file (i.e., single-register read/write access). Second, a new functional unit is proposed to efficiently process a typical stereoscopic video application based on a rank transformation and a semi-global-matching algorithm. These mechanisms are evaluated using a generic VLIW architecture, and the resulting VLIW processor is compared with other CPU, GPU, and FPGA implementations. The proposed architecture provides the full flexibility of a programmable processor while processing 640x480 stereo video sequences under real-time conditions, which is not possible with the compared CPUs or GPUs.
I. INTRODUCTION

Stereo matching is one of the most researched topics in computer vision and is used to extract detailed depth information (disparity information) from passive stereo-camera systems. This is required in numerous machine vision applications, such as driver assistance systems or 3D video processing. On the one hand, systems based on passive stereo cameras can provide highly accurate three-dimensional details at a lower cost than those based on active radar or lidar sensors. Moreover, the same camera system can be used to perform other kinds of video processing, such as pedestrian, vehicle, or road detection. On the other hand, the algorithms used to extract the depth information from those stereo-camera systems impose a high computational load and require dedicated embedded hardware to provide real-time processing capabilities. Several autonomous embedded system solutions based on dedicated hardware structures, specially designed for an FPGA or ASIC technology, can be found in the literature.

The use of a programmable embedded processor in computer vision systems makes it easy to introduce algorithm modifications by changing only the application code, without redesigning the complete hardware architecture. At the Institute of Microelectronic Systems, the research project RAPANUI [4] provides a complete design space exploration environment based on a generic VLIW architecture, which includes a parameterized pipeline architecture simulator, an aggressive instruction scheduler, and a parameterized HDL description of the generic VLIW architecture [5]. By using this environment, the generic VLIW architecture can be optimized and improved in terms of performance and hardware cost for a specific group of applications. This paper focuses on novel VLIW architecture optimizations for efficient processing of stereoscopic video applications.
More specifically, it proposes a new mechanism, called X4 operation mode, that makes it possible to concurrently use more functional units (FUs) or special FUs with wi...