SUMMARYTo sustain growing multimedia workload, modern digital signal processing (DSP) processors are commonly equipped with subword instructions to accelerate signal processing. Besides subword, functional units of very long instruction word (VLIW) DSP processors can also be employed to process multiple data streams in parallel. However, because of power and area concerns, many embedded VLIW DSP processors adopt distributed register files to reduce read/write ports and wire connection by privatizing register files for clusters and even for functional units. The distributed design presents great challenges to compilers in distributing single instruction, multiple data (SIMD) workload to functional units. In this paper, we address the issue in supporting SIMD parallelism on VLIW DSP processors with subword instructions and distributed register files. Currently, industrial practices have adopted intrinsics that enable developers to utilize hardware resources and compete with hand-coded assembly in performance. However, it is still an open issue to provide such a solution for VLIW DSP processors with distributed register files. In this work, we provide SIMD intrinsics to allow programmers to write highly optimized codes by following given programming guides. In addition, an enhanced register allocation scheme and data replication optimizations are devised to enable efficient code generation. In our experiments, DSPstone benchmark and a set of H.264 kernels are used to evaluate the proposed programming and optimization schemes. The result shows that by combining SIMD intrinsics and compiler optimizations, one is able to obtain remarkable performance improvements, speedups of 2.9 and 3.5 for DSPstone and H.264 kernels, respectively.