Abstract-As programmers are asked to manage more complicated parallel machines, it is likely that they will become increasingly dependent on tools such as multi-threaded data race detectors, memory bounds checkers, dynamic dataflow trackers, and various performance profilers to understand and maintain their software. As these tools continue to grow in importance, it is worth exploring the potential for special purpose accelerators for these tasks, especially since commodity multi-cores can only provide limited speedups. Rather than performing all the instrumentation and analysis on the main processor, we explore the idea of using the increasingly highthroughput board level interconnect available on many systems to offload analysis to a parallel off-chip accelerator. There are many non-trivial technical issues in taking such an approach that may not appear in simulation, and to flush them out we have developed a prototype system that maps a DMA based analysis engine, sitting on a PCI-mounted FPGA, into the Valgrind instrumentation framework. Using this prototype we characterize the potential of such a system to both accelerate existing software development tools and enable a new class of heavyweight dynamic analysis. While many issues still remain with such an approach, we demonstrate that program analysis speedups of 29% to 440% could be achieved today with strictly off-the-shelf components on some of the state-of-the-art tools, and we carefully quantify the bottlenecks to illuminate several new opportunities for further architectural innovation.