A Multi-Shared Register File Structure for VLIW Processors

Payá-Vayá, Guillermo; Martin-Langerwerf, J.; Pirsch, P.

doi:10.1007/s11265-009-0355-2

Cited by 10 publications

(2 citation statements)

References 38 publications

(51 reference statements)

“…The generic VLIW architecture allows to implement different kinds of monolithic and partitioned RF configurations in order to study the architecture trade-offs between performance and hardware cost. In general, partitioned RF configurations are desirable in VLIW architectures with a high number of issue-slots [14]. VLIW architectures specially designed for processing video applications are optimized to take advantage of the high […] A mechanism, called X4 operation mode, can be used to concurrently use four identical FUs without increasing the number of instructions to decode.…”

Section: VLIW Architecture Optimizations (mentioning)

confidence: 99%

VLIW architecture optimization for an efficient computation of stereoscopic video applications

Payá-Vayá

Martin-Langerwerf

Banz

et al. 2010

The 2010 International Conference on Green Circuits and Systems

This paper presents two new architecture optimizations to improve the processing performance of video applications with a high degree of data parallelism in VLIW processors. On the one hand, a new register file access mechanism, called X4 operation mode, allows to access wide operands made up of several consecutive registers in the register file, while keeping its normal functionality (i.e. single read/write register access). On the other hand, a new functional unit is proposed to efficiently process a typical stereoscopic video application based on a rank transformation and a semi-global-matching algorithm. An evaluation of those enhanced mechanisms is performed using a generic VLIW architecture and the resulting VLIW processor is compared with other CPU/GPU and FPGA implementations. The proposed architecture provides the full flexibility of a programmable processor, while processing 640x480 stereo video sequences under real-time conditions, what is not possible with the compared CPUs or GPUs.

I. INTRODUCTION

Stereo matching is one of the most researched topics in computer vision and is used to extract detailed depth information (disparity information) from passive stereo-camera systems. This is required in numerous machine vision applications like driver assistance systems or 3D video processing. On the one hand, systems based on passive stereo-cameras can provide highly accurate three dimensional details with a lower cost than those based on active radar or lidar sensors. Moreover, the same camera system can be used to perform other kinds of video processing, such as pedestrian, vehicle, or road detection. On the other hand, the algorithms used to extract the depth information from those video stereo-camera systems require a high computational load and the use of dedicated hardware embedded systems to provide real-time processing capabilities.

Several autonomous embedded system solutions based on dedicated hardware structures, specially designed for an FPGA or ASIC technology, can be found in literature. The use of a programmable embedded processor in computer vision systems allows to easily introduce new algorithm modifications by just modifying the application code without requiring to redesign the complete hardware architecture. At the Institute of Microelectronic Systems, the research project RAPANUI [4] provides a complete design space exploration environment based on a generic VLIW architecture, which includes a parameterized pipeline architecture simulator, an aggressive instruction scheduler, and a parameterized HDL description of the generic VLIW architecture [5]. By using this environment, the generic VLIW architecture can be optimized and improved in terms of performance and hardware cost for a specific group of applications.

This paper is focused on novel VLIW architecture optimizations for efficient processing of stereoscopic video applications. More narrowly, it proposes a new mechanism, called X4 operation mode, that allows to concurrently use more functional units (FU) or special FUs with wi...
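The X4 idea described in the abstract above (one access that covers several consecutive registers, alongside ordinary single-register access) can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the authors' hardware: the class name `RegisterFile`, the helper `add_x4`, the register count, the 32-bit width, and the requirement that a wide access starts at a multiple of four are all hypothetical choices made for the example.

```python
from typing import List


class RegisterFile:
    """Toy register file: normal single-register access plus a 4-wide access.

    Hypothetical model; sizes and the alignment rule are assumptions.
    """

    GROUP = 4  # number of consecutive registers in one wide operand

    def __init__(self, num_regs: int = 64, width_bits: int = 32) -> None:
        if num_regs % self.GROUP != 0:
            raise ValueError("num_regs must be a multiple of 4")
        self.mask = (1 << width_bits) - 1
        self.regs: List[int] = [0] * num_regs

    # Normal functionality: one register per access.
    def read(self, idx: int) -> int:
        return self.regs[idx]

    def write(self, idx: int, value: int) -> None:
        self.regs[idx] = value & self.mask

    # Wide functionality: four consecutive registers per access.
    def _check_base(self, base: int) -> None:
        # Assumed rule: a wide operand starts at an aligned register index.
        if base % self.GROUP != 0 or not 0 <= base <= len(self.regs) - self.GROUP:
            raise ValueError(f"invalid base register for wide access: {base}")

    def read_x4(self, base: int) -> List[int]:
        self._check_base(base)
        return self.regs[base:base + self.GROUP]

    def write_x4(self, base: int, values: List[int]) -> None:
        self._check_base(base)
        if len(values) != self.GROUP:
            raise ValueError("a wide operand holds exactly 4 values")
        for offset, value in enumerate(values):
            self.regs[base + offset] = value & self.mask


def add_x4(rf: RegisterFile, dst: int, src_a: int, src_b: int) -> None:
    """One decoded 'instruction' drives four identical adders, one per lane."""
    a = rf.read_x4(src_a)
    b = rf.read_x4(src_b)
    rf.write_x4(dst, [x + y for x, y in zip(a, b)])
```

In this model a single call to `add_x4` stands in for one instruction that keeps four functional units busy, while `read` and `write` still address individual registers, which is the property the abstract attributes to the mechanism.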

“…The benefit of this concept is that the FU utilization increases because all data register can be swapped in a cycle. Otherwise, a respective VLIW approach would require several of read and write ports on a state-of-the-art register file or sophisticated optimization affecting also the code translation process [15]. However, the interconnects between the FPE registers must also be taken into account at the code translation process.…”

Section: B. Data Path Organization (mentioning)

confidence: 99%

A fully programmable FSM-based Processing Engine for Gigabytes/s header parsing

Septinus

Pirsch

Blume

et al. 2010

2010 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation

Self Cite

In this paper we discuss a new architecture, which is deployed for multi-standard packet inspection and basic network processing tasks in a high-performance network coprocessor. Thereby, concepts, architecture, compiler tool-chain and VLSI area estimation for this programmable finite state machine based (FSM-based) Processing Engine, FPE, are presented. The microarchitecture comprises an FSM-controlled instruction sequencing mechanism, a novel register organization scheme and a short pipeline instead of a typical multi-staged processor pipeline. This introduces several advantages for efficient handling of conditional branches and small look-ups. Those advantages can be utilized for packet classification applications. The FPE data path performance is compared to an ARM9-type processor in two exemplary header parsing kernels from the "CommBench" benchmark suite. According to the results, the presented engine provides a speed-up of 4 to 10 in terms of required computation cycles to the ARM9. Using a 65 nm VLSI technology, the FPE design is supposed to run at clock frequencies up to 2 GHz and requires about 1.8 mm² chip area. Based on the specific transition rule memory organization, which is an essential element of the programmable FSM, a memory utilization of around 95% can be achieved. However, the FPE micro-architecture requires a customized code translation chain in order to transfer high-level program code into an FSM representation. Basically, this is achieved by three steps: (1) generation of sequential, assembly-like macroinstructions, (2) scheduling and generation of FSM-based horizontal (parallel) micro-code and (3) organization of respective FSM rules in the "instruction" memory. Our studies confirm the advantages of the FPE as a fully programmable high-performance header parsing engine.