As technology reaches physical limits, reducing power consumption is the key issue on the path to sustained performance growth. In this paper, we study fundamental tradeoffs and limits in efficiency (measured in energy per operation) that can be achieved for an important class of kernels, namely the level-3 Basic Linear Algebra Subroutines (BLAS). It is well accepted that specialization is the key to efficiency. This paper establishes a baseline by studying general matrix-matrix multiplication (GEMM) on a variety of custom and general-purpose CPU and GPU architectures. Our analysis shows that orders-of-magnitude improvements in efficiency are possible with relatively simple customizations and fine-tuning of memory hierarchy configurations. We argue that these customizations can be generalized to support other representative linear algebra subroutines. In addition to identifying the sources of inefficiency in current CPUs and GPUs, our results show that our prototype linear algebra processor (LAP) can achieve 600 GFLOPS for double-precision GEMM (DGEMM) while consuming less than 25 Watts in standard 45nm technology, which is up to 50× better than CPUs in terms of energy efficiency.
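For context, the reported peak figures correspond to an energy efficiency of roughly 24 GFLOPS/W, or about 42 pJ per double-precision floating-point operation. This is simply a back-of-the-envelope conversion of the numbers stated above (600 GFLOPS at under 25 W), not an additional measurement:
\[
\frac{600\ \text{GFLOPS}}{25\ \text{W}} = 24\ \frac{\text{GFLOPS}}{\text{W}},
\qquad
\frac{25\ \text{W}}{600 \times 10^{9}\ \text{flop/s}} \approx 41.7\ \text{pJ per flop}.
\]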