Comparing different (accelerated) cluster architectures with a single application is challenging, because the application has to be optimized with respect to platform-dependent features. In this work, we demonstrate such an optimization for a data mining algorithm that solves regression and classification problems on vast data sets. Our technique is based on least squares regression, and its major component is the iterative, matrix-free solution of a linear system of equations. By processing data sets ranging from several hundred thousand instances to multi-million data points in strong-scaling and weak-scaling settings, we are able to estimate the amount of parallelism needed to unleash the performance of classic CPU-based machines and of clusters employing Intel Xeon Phi coprocessors and NVIDIA Kepler GPUs. Only in strong-scaling experiments do GPUs and coprocessors suffer from the tremendous degree of parallelism they require, being outperformed by dual-socket Intel Sandy Bridge nodes at large scale (more than 64 nodes/accelerators). In weak-scaling scenarios, however, a single accelerator achieves a speed-up of more than 2X over an entire CPU node.

... the NASA Advanced Supercomputing (NAS) Division parallel benchmark suite [9], which requires several application kernels to be run (including iterative solvers and fast Fourier transforms). For accelerated clusters, the Scalable Heterogeneous Computing benchmark suite [10] is a good candidate: it implements nearly all NAS benchmarks in OpenCL and CUDA and can easily be executed on accelerators and GPUs. However, research and procurements performed in recent years have demonstrated that running (just) application kernels might not be sufficient: Sandia Labs highlighted how mini-applications or proxy applications can be used to understand the performance of a supercomputer and even to influence its future development [11]. There, the benchmarks are not limited to kernels; they are simplified versions of real simulation codes stemming from several application domains. A similar approach was chosen for the procurement of the latest peta-scale system in Germany, 'SuperMUC' at the Leibniz Supercomputing Centre: according to Brehm [12], 45% of the benchmarks required during this process were full applications. Finally, it was recently proposed [14] to rank the machines of the Top500 list additionally (analogous to the Green500 list [13], which ranks by power consumption) based on a high-performance conjugate gradient (HPCG) implementation.

In this work, we apply the idea of an HPC benchmark to a full and relevant application: classification and regression of vast data sets. It exhibits properties distinct from those of the benchmarks discussed earlier and poses additional challenges to current and future HPC systems; we thus propose it as a further extension of an application benchmark portfolio. Furthermore, we demonstrate its use to benchmark different clusters and supercomputers. A fair application-driven comparison is ensured by optimizing our dat...
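As an illustration of the matrix-free solver named in the abstract, the following sketch solves a regularized least squares problem with the conjugate gradient (CG) method, applying the system operator only implicitly. This is a generic illustration, not the paper's implementation: the form of the regularization, the problem dimensions, and all names (cg_matrix_free, apply_A, Phi, lam) are assumptions chosen for the example.

    import numpy as np

    def cg_matrix_free(apply_A, b, tol=1e-8, max_iter=500):
        # Conjugate gradient for a symmetric positive definite operator.
        # apply_A is a callback computing A @ v; A itself is never stored.
        x = np.zeros_like(b)
        r = b - apply_A(x)          # initial residual
        p = r.copy()                # initial search direction
        rs_old = r @ r
        for _ in range(max_iter):
            Ap = apply_A(p)
            alpha = rs_old / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

    rng = np.random.default_rng(0)
    Phi = rng.standard_normal((100000, 64))   # instances x basis functions (made-up sizes)
    y = rng.standard_normal(100000)           # target values
    lam = 1e-3                                # regularization strength (assumed)

    # Normal equations (Phi^T Phi + lam*I) w = Phi^T y, applied on the fly:
    # two matrix-vector products per iteration, no assembled system matrix.
    w = cg_matrix_free(lambda v: Phi.T @ (Phi @ v) + lam * v, Phi.T @ y)

Each CG iteration thus reduces to operator applications over the full data set, and it is these matrix-vector products that must be parallelized and tuned for each target platform.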