In recent years, the progress in both hardware and software enabled user-space applications to capture packets at 10 Gbit/s line rate. However, processing packets at such rates with software running on Commercial Off-The-Shelf (COTS) hardware is still far from being trivial. In the literature, this challenge has been extensively studied for Network Intrusion Detection Systems (NIDS), where operations are per-packet and easier to parallelize also thanks to hardware acceleration. Conversely, the scalability of Statistical Traffic Analyzers (STA) is intrinsically more complex as it implies tracking per-flow state to collect statistics. This challenge received less attention so far, and it is the focus of this work.We discuss the design choices to enable a STA to collects hundreds of per-flow metrics at a multi 10 Gbit/s line rate. We leverage a handful of hardware advancements proposed over the last years (e.g., RSS queues, NUMA architecture), and we provide insights on the trade-offs they imply when combined with state of the art packet capture libraries and multi-process paradigm. We outline the principles to achieve an optimized STA, and we apply them to engineer DPDKStat, a solution combining the Intel DPDK framework with the traffic analyzer Tstat. Using traces collected from real networks, we demonstrate that DPDKStat achieves 40 Gbit/s of aggregated rate with a single COTS PC.