Baseline RISC instruction sets for ultra-low power processors are constantly being tuned to reduce cycle count when executing computation-intensive applications. Performance improvements often come at a non-negligible cost in area and critical-path length, and imply deeper pipelines and more complex memory interfaces. This penalizes control-intensive code execution and significantly increases the cost and complexity of building multi-core clusters. In addition, some extensions are not easily exploited by compilers and may increase code development effort, especially for parallel applications. In this paper, we describe our efforts to enhance a baseline open ISA (OpenRISC) and its LLVM compiler back-end to significantly reduce execution cycles while minimizing the impact on core micro-architecture complexity, number of pipeline stages, area, and power. In addition, we improved the core micro-architecture to streamline its integration in a tightly-coupled cluster that shares instruction cache and data memory, thereby further enhancing parallel execution efficiency. The combined effect of ISA, compiler, and micro-architecture evolution gives an average energy efficiency boost of 59% on vector-intensive code and 41% otherwise, at an area and power increase of 2.3% and 18%, respectively, on a four-core processor cluster.