Programming modern embedded vision systems poses various challenges, owing to the steep learning curve for programmers and the differing characteristics of the devices. Quasar, a new high-level programming language and development environment, considerably simplifies this development. Quasar comprises a compiler that detects and optimizes parallel programming patterns and a heterogeneous runtime that distributes the computational load over the available compute devices (CPUs and graphics processing units [GPUs]). In this paper, we focus on the runtime aspects of Quasar. We show that, to a good approximation, the execution time of a GPU kernel function can be factorized into a compile-time-specific component and a runtime-specific component, and that this approximation leads to a computationally simple runtime load balancing rule. Moreover, the load balancing rule permits efficient implicit concurrency of kernel functions and automatic scaling to multiple compute devices (eg, multi-CPU/GPU systems). Based on an appropriate mathematical scheduling model, we investigate how the command queue size trades off memory usage against device utilization. The result is a programming environment for embedded vision systems in which automatic parallelization and implicit concurrency detection allow programs to scale efficiently to multi-CPU/GPU systems. Finally, benchmark results demonstrate the performance of our approach compared with OpenACC and CUDA (Compute Unified Device Architecture).