Abstract: SIMD processors have made their way from supercomputer architectures into embedded real-time signal processing. This trend has been driven by signal processing applications with heavy number-crunching requirements, for example baseband processing on mobile devices. Depending on the data dependencies of the algorithms and on implementation constraints such as real-time operation, power consumption and die size, the SIMD parallelism required for a given application can be put into a piece of silicon. This poses two challenges. On the one hand, the DSP core design flow has to be streamlined so that changes to the architecture can be prototyped very quickly. On the other hand, algorithms have to be designed and developed independently of the level of SIMD parallelism available on the DSP in order to enable software reuse. In this paper we report our HW/SW methodology for designing DSP cores and algorithms that exploit SIMD parallelism. On the hardware side, taking as a starting point a novel hardware architectural template called STA, we explain how our approach automatically generates simulation and hardware models of DSP cores with a scalable level of SIMD parallelism. On the software side, based on an algebraic model that captures the SIMD computational model, we explain how algorithms can be designed independently of the available SIMD parallelism. We also report how this algebraic model can be easily expressed in Matlab syntax, which enables automatic code generation from Matlab programs for our family of DSP cores.
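The idea of designing an algorithm independently of the available SIMD parallelism can be illustrated with a minimal Python sketch. It is not the algebraic model or the Matlab notation used in this work; the names `simd_map` and `scaled_add` and the lane-slicing scheme are assumptions made purely for illustration. The algorithm is written once as an element-wise operation on whole vectors, and the number of lanes is a separate parameter that only changes how many elements are processed per step.

```python
from typing import Callable, List, Sequence


def simd_map(op: Callable[..., float], width: int,
             *operands: Sequence[float]) -> List[float]:
    """Apply an element-wise operation to equal-length vectors,
    processing `width` elements (lanes) per step.

    `width` stands in for the SIMD parallelism of a hypothetical core.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    n = len(operands[0])
    if any(len(v) != n for v in operands):
        raise ValueError("all operand vectors must have the same length")

    result: List[float] = []
    for start in range(0, n, width):
        # One step: each operand contributes up to `width` lanes.
        lanes = [v[start:start + width] for v in operands]
        # The same operation is applied to every lane of this step.
        result.extend(op(*elems) for elems in zip(*lanes))
    return result


def scaled_add(a: float, x: Sequence[float], y: Sequence[float],
               width: int) -> List[float]:
    """Compute a*x + y; the algorithm itself never refers to `width`
    except to pass it through to the execution model."""
    return simd_map(lambda xi, yi: a * xi + yi, width, x, y)


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    y = [10, 20, 30, 40, 50]
    # The result is identical for every assumed SIMD width.
    for w in (1, 2, 4, 8):
        print(w, scaled_add(2, x, y, w))
```

In this sketch, changing `width` affects only the number of steps taken, not the result, which is the property that allows the same algorithm description to be reused across cores with different levels of SIMD parallelism.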