The growing power of processors allows us to implement increasingly complex multimedia algorithms. However, this processing power is only available if the algorithms are implemented in a way that exploits the multi-core parallelism of these processors. Today, this requires tightly combining the skillsets for algorithm development and for parallel programming. By providing a language, compiler, and runtime that allow algorithm developers to specify algorithms as a series of data-transforming kernels written in C++, with the parallelization opportunities built into the compiler and runtime, we hope to alleviate this need for a dual skillset. In this paper, we focus on the performance improvements that our system can achieve by combining language design, compiler knowledge, and runtime decisions to overcome performance bottlenecks from fine-grained kernel scheduling and cache-line contention without adapting the algorithms themselves.