Endpoint devices for Internet-of-Things not only need to work under extremely tight power envelope of a few milliwatts, but also need to be flexible in their computing capabilities, from a few kOPS to GOPS. Near-threshold (NT) operation can achieve higher energy efficiency, and the performance scalability can be gained through parallelism. In this paper we describe the design of an opensource RISC-V processor core specifically designed for NT operation in tightly coupled multi-core clusters. We introduce instructionextensions and microarchitectural optimizations to increase the computational density and to minimize the pressure towards the shared memory hierarchy. For typical data-intensive sensor processing workloads the proposed core is on average 3.5× faster and 3.2× more energy-efficient, thanks to a smart L0 buffer to reduce cache access contentions and support for compressed instructions. SIMD extensions, such as dot-products, and a built-in L0 storage further reduce the shared memory accesses by 8× reducing contentions by 3.2×. With four NT-optimized cores, the cluster is operational from 0.6 V to 1.2 V achieving a peak efficiency of 67 MOPS/mW in a low-cost 65 nm bulk CMOS technology. In a low power 28 nm FDSOI process a peak efficiency of 193 MOPS/mW (40 MHz, 1 mW) can be achieved.Index Terms-Internet-of-Things, Ultra-low-power, Multi-core, RISC-V, ISA-extensions.