Gradient descent, conjugate gradient, and other iterative algorithms are a powerful class of algorithms; however, they can take a long time to converge. Existing accelerator designs cover an insufficient range of operations and do not perform well on the problems we target. In this thesis we present a novel hardware architecture for accelerating gradient descent and similar algorithms. To support this architecture, we also present a sparse matrix-vector storage format, along with software support for using it, so that sparse problems can be mapped efficiently onto hardware that is also well suited to dense operations. We show that the accelerator design outperforms similar designs that target only the most dominant operation of a given algorithm, providing substantial energy and performance benefits. We further show that the accelerator can reasonably be implemented on a general-purpose CPU with small area overhead.