Abstract. Many non-linear (NL) interference cancellation (IC) algorithms, such as successive interference cancellation (SIC) and Tomlinson-Harashima precoding (THP) for multi-user multiple-input multiple-output (MIMO) communication systems, are characterized by vector-based rather than matrix operations, and their performance can be further improved by sorted preprocessing of the channel information. However, application-specific architectures are efficient but not flexible in performing NL IC algorithms, while existing single-core flexible architectures aimed at intensive matrix operations are not efficient when these algorithms are mapped onto them. Alternatively, a multiprocessor system-on-chip (MPSoC) architecture can provide better diversity for implementing demanding NL algorithms. This paper proposes an efficient and programmable MPSoC prototype to bridge the gap between efficiency and flexibility, especially for versatile Gram-Schmidt process (GSP) aided NL IC algorithms. The prototype incorporates several slave processors and one master processor, reflecting the division between computation and control in the mapped algorithms. Each slave processor integrates a coarse-grained programmable element (CGPE) with special support for atomic vector operations, while the master processor is responsible for task scheduling and data transport among the slave processors and memories. For demonstration, a tightly coupled sorted QR decomposition (SQRD) aided THP is mapped to the proposed MPSoC, where a speculative dynamic runtime schedule (SDRS) strategy is applied to reduce the computational delay. Synthesis results are presented to show the efficiency and feasibility of the MPSoC prototype.
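
As a rough illustration of why GSP-aided algorithms reduce to atomic vector operations, the following minimal Python sketch performs a sorted QR decomposition with the modified Gram-Schmidt process. It is not the implementation proposed in this paper: the function names `dot` and `sorted_qr`, the minimum-norm column-ordering rule, and the restriction to real-valued matrices with full column rank are all assumptions made for illustration only.

```python
import math


def dot(u, v):
    """Inner product of two equal-length real vectors."""
    return sum(a * b for a, b in zip(u, v))


def sorted_qr(H):
    """Sorted QR decomposition via modified Gram-Schmidt (illustrative sketch).

    H is a list of rows (m x n, real-valued, full column rank assumed).
    Returns (cols, R, perm), where cols holds the orthonormal columns of Q,
    R is upper triangular, and column c of Q*R equals column perm[c] of H.
    """
    m, n = len(H), len(H[0])
    # Work column-wise: every step below is a vector operation.
    cols = [[H[r][c] for r in range(m)] for c in range(n)]
    R = [[0.0] * n for _ in range(n)]
    perm = list(range(n))

    for i in range(n):
        # Sorting step (assumed rule): pick the remaining column
        # with the smallest residual norm.
        k = min(range(i, n), key=lambda j: dot(cols[j], cols[j]))
        cols[i], cols[k] = cols[k], cols[i]
        perm[i], perm[k] = perm[k], perm[i]
        for row in R:
            row[i], row[k] = row[k], row[i]

        # Normalize the selected column.
        R[i][i] = math.sqrt(dot(cols[i], cols[i]))
        cols[i] = [x / R[i][i] for x in cols[i]]

        # Remove its component from every remaining column.
        for j in range(i + 1, n):
            R[i][j] = dot(cols[i], cols[j])
            cols[j] = [x - R[i][j] * y for x, y in zip(cols[j], cols[i])]

    return cols, R, perm
```

Each loop iteration uses only norms, inner products, scalings, and vector subtractions, which is the kind of workload the abstract associates with atomic vector operations. The column updates in the inner loop are independent of one another, so in principle they could be distributed across several processing elements; how the proposed MPSoC actually schedules them is not specified here.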