An important question is whether emerging and future applications exhibit sufficient parallelism, in particular thread-level parallelism, to exploit the large number of cores that future chip multiprocessors (CMPs) are expected to contain. As a case study, we investigate the parallelism available in video decoders, an important application domain now and in the future. Specifically, we analyze the parallel scalability of the H.264 decoding process. First, we discuss the data structures and dependencies of H.264 and show which types of parallelism can be exploited. We also show that previously proposed parallelization strategies, such as slice-level, frame-level, and intra-frame macroblock (MB) level parallelism, are not sufficiently scalable. Based on the observation that inter-frame dependencies have a limited spatial range, we propose a new parallelization strategy called Dynamic 3D-Wave, which allows certain MBs of consecutive frames to be decoded in parallel. Using this new strategy, we analyze the limits of the available MB-level parallelism in H.264. Using real movie sequences, we find a maximum MB parallelism ranging from 4000 to 7000. We also perform a case study to assess the practical value and possibilities of a highly parallelized H.264 application. The results show that H.264 exhibits sufficient parallelism to efficiently exploit the capabilities of future manycore CMPs.
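To illustrate why intra-frame MB-level parallelism on its own is limited, the following minimal sketch counts how many MBs of a single frame could be decoded concurrently. It rests on assumptions that are not stated in the abstract: each MB is taken to depend on its left, up-left, up, and up-right neighbours, and every MB is taken to need exactly one time slot. The function name `earliest_slots` and the 120x68 MB frame size (1920x1088 pixels with 16x16 MBs) are chosen for illustration only.

```python
from collections import Counter


def earliest_slots(mb_cols, mb_rows):
    """Map each MB (x, y) to the earliest time slot in which it can be decoded.

    Assumption: an MB depends on its left, up-left, up and up-right
    neighbours, and every MB takes exactly one time slot.
    """
    slot = {}
    # Raster order guarantees that all dependencies are already scheduled.
    for y in range(mb_rows):
        for x in range(mb_cols):
            deps = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
            # Neighbours outside the frame are simply absent from `slot`.
            ready = [slot[d] + 1 for d in deps if d in slot]
            slot[(x, y)] = max(ready, default=0)
    return slot


# Hypothetical frame size: 1920x1088 pixels, i.e. 120x68 MBs of 16x16 pixels.
slots = earliest_slots(120, 68)
profile = Counter(slots.values())      # time slot -> number of ready MBs

max_parallel = max(profile.values())   # peak number of concurrent MBs
num_slots = len(profile)               # time slots needed for one frame
print(max_parallel, num_slots)         # prints "60 254"
```

Under these assumptions the peak for one frame is 60 MBs, far below the 4000 to 7000 reported for the Dynamic 3D-Wave, which additionally overlaps MBs of consecutive frames. The sketch does not model those inter-frame dependencies.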
The SARC architecture is composed of multiple processor types and a set of user-managed direct memory access (DMA) engines that let the runtime scheduler overlap data transfer and computation. The runtime system automatically allocates tasks on the heterogeneous cores and schedules the data transfers through the DMA engines. SARC's programming model supports various highly parallel applications, with matching support from specialized accelerator processors. On-chip parallel computation shows great promise for scaling raw processing performance within a given power budget. However, chip multiprocessors (CMPs) often struggle with programmability and scalability issues such as cache coherency and off-chip memory bandwidth and latency.
A case for hardware task management support for the StarSS programming model
Conference object, Postprint version. This version is available at http://dx.doi.org/10.14279/depositonce-5776.
Suggested Citation: Meenderinck, Cor; Juurlink, Ben: A case for hardware task management support for the StarSS programming model. - In: 2010 13th Euromicro Conference on Digital System Design: Architectures,
Terms of Use: © 2010 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract: This paper investigates the scalability of macroblock (MB) level parallelization of the H.264 decoder for High Definition (HD) applications. The study has three parts. First, a formal model predicts the maximum performance that can be obtained, taking into account the variable processing time of tasks and the thread synchronization overhead. Second, an implementation on a real multiprocessor architecture includes a comparison of different scheduling strategies and a profiling analysis that identifies the performance bottlenecks. Finally, a trace-driven simulation methodology is used to identify acceleration opportunities for removing the main bottlenecks, including the acceleration potential for the entropy decoding stage and for thread synchronization and scheduling. Our study presents a quantitative analysis of the main bottlenecks of the application and estimates the acceleration levels that are required to make the MB-level parallel decoder scalable.