Providing high-level tools for parallel programming while sustaining a high level of performance is a challenge that techniques such as Domain-Specific Embedded Languages (DSELs) try to address. In previous work, we investigated the design of such a DSEL, NT2, which provides a Matlab-like syntax for parallel numerical computations inside a C++ library. In this paper, we show how NT2 has been redesigned for shared-memory systems in an extensible and portable way. The new NT2 design relies on a tiered Parallel Skeleton system built using asynchronous task management and automatic compile-time taskification of user-level code. We describe how this system can work with various shared-memory runtimes and evaluate the design using several benchmarks that implement linear algebra algorithms.