We propose an embedded multiprocessor architecture and its associated thread-based programming model. Using a cycle-true simulation model of this architecture, we estimate the energy savings for a threaded C program. The savings are obtained by voltage and frequency scaling of the individual processors. We port a fingerprint minutiae detection application onto this architecture and report the resulting performance on single-, dual-, and quad-processor configurations. The energy-scaled quad-processor version achieves a 77% energy reduction relative to the single-processor non-scaled implementation, at only a 2.2% degradation in cycle count.