Hand gesture recognition applications based on surface electromyographic (sEMG) signals can benefit from on-device execution to achieve faster and more predictable response times and higher energy efficiency. However, deploying state-of-the-art deep learning (DL) models for this task on memory-constrained and battery-operated edge devices, such as wearables, requires a careful optimization process, both at design time, with an appropriate tuning of the DL models’ architectures, and at execution time, where the execution of large and computationally complex models should be avoided unless strictly needed. In this work, we pursue both optimization targets, proposing a novel gesture recognition system that improves upon state-of-the-art models in terms of both accuracy and efficiency. At the level of DL model architecture, we apply for the first time tiny transformer models (which we call bioformers) to sEMG-based gesture recognition. Through an extensive architecture exploration, we show that our most accurate bioformer achieves a higher classification accuracy on the popular Non-Invasive Adaptive hand Prosthetics Database 6 (Ninapro DB6) dataset compared to the state-of-the-art convolutional neural network (CNN) TEMPONet (+3.1%). When deployed on the RISC-V-based low-power system-on-chip (SoC) GAP8, bioformers that outperform TEMPONet in accuracy consume 7.8×–44.5× less energy per inference. At runtime, we propose a three-level dynamic inference approach that combines a shallow classifier, i.e., a random forest (RF) implementing a simple “rest detector”, with two bioformers of different accuracy and complexity, which are sequentially applied to each new input, stopping the classification early for “easy” data. With this mechanism, we obtain a flexible inference system, capable of working at many different operating points in terms of accuracy and average energy consumption. On GAP8, we obtain a further 1.03×–1.35× energy reduction compared to static bioformers at iso-accuracy.
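The following is a minimal sketch of the three-level dynamic inference cascade described above, not the paper’s actual implementation: it assumes an sklearn-style `predict` interface for the RF rest detector, callables returning softmax probabilities for the two bioformers, and illustrative values for the confidence threshold and the “rest” class index.

```python
# Hypothetical sketch of the three-level early-exit cascade.
# All names, the threshold, and the rest-class index are illustrative
# assumptions, not taken from the paper.

CONF_THRESHOLD = 0.85  # assumed early-exit confidence threshold
REST_CLASS = 0         # assumed label index for the "rest" gesture


def classify(emg_window, rest_detector, small_bioformer, large_bioformer):
    """Classify one sEMG window, stopping early for 'easy' inputs."""
    # Level 1: cheap RF rest detector filters out "rest" windows,
    # which are common and trivial to recognize.
    if rest_detector.predict(emg_window.reshape(1, -1))[0] == REST_CLASS:
        return REST_CLASS

    # Level 2: small, low-energy bioformer; exit early if it is
    # confident enough about its prediction.
    probs = small_bioformer(emg_window)  # softmax class probabilities
    if probs.max() >= CONF_THRESHOLD:
        return int(probs.argmax())

    # Level 3: the larger, more accurate bioformer handles only the
    # remaining "hard" inputs, so its energy cost is paid rarely.
    return int(large_bioformer(emg_window).argmax())
```

Tuning `CONF_THRESHOLD` would trade accuracy against average energy: a lower threshold stops more inputs at the small model, while a higher one forwards more of them to the large model.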