The OpenCL specification tightly binds a command queue to a specific device. For best performance, the user has to find the ideal queuedevice mapping at command queue creation time, an effort that requires a thorough understanding of the underlying device architectures and kernels in the program. In this paper, we propose to add scheduling attributes to the OpenCL context and command queue objects that can be leveraged by an intelligent runtime scheduler to automatically perform ideal queuedevice mapping. Our proposed extensions enable the average OpenCL programmer to focus on the algorithm design rather than scheduling and to automatically gain performance without sacrificing programmability. As an example, we design and implement an OpenCL runtime for task-parallel workloads, called MultiCL, which efficiently schedules command queues across devices. Our case studies include the SNU benchmark suite and a real-world seismology simulation. To benefit from our runtime optimizations, users have to apply our proposed scheduler extensions to only four source lines of code, on average, in existing OpenCL applications. We evaluate both single-node and multinode experiments and also compare with SOCL, our closest related work. We show that MultiCL maps command queues to the optimal device set in most cases with negligible runtime overhead.