We present an experimental prototype of a microwave speech recognizer, empowered by a programmable metasurface, that can remotely recognize voice commands and speaker identities even in noisy environments and even when the speaker’s mouth is hidden behind a wall or a face mask. This capability enables voice-commanded human-machine interaction in important but to-date inaccessible application scenarios, including smart health care and smart factories. The programmable metasurface is the pivotal hardware ingredient of our system: its large aperture and large number of degrees of freedom allow the system to perform a complex sequence of tasks orchestrated by artificial-intelligence tools. First, the speaker’s mouth is localized by imaging the scene and identifying the region of interest. Second, microwaves are efficiently focused on the speaker’s mouth so that information about the vocalized speech is encoded in the reflected microwave biosignals. This efficient focusing is the origin of our system’s robustness to various types of parasitic motion. Third, a dedicated neural network retrieves the sought-after speech information directly from the measured microwave biosignals. Because it relies solely on microwave data, our system avoids visual privacy infringements. We expect the presented strategy to unlock new possibilities for future smart homes, ambient-assisted health monitoring and care, smart factories, and intelligent surveillance and security.
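The three-stage pipeline summarized above (localize the mouth, focus on it, decode the reflected biosignal) can be sketched in toy form. Every function name, the synthetic scene, and the nearest-template stand-in for the neural network below are hypothetical illustrations, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def localize_mouth(scene_image):
    # Stage 1 (toy): take the region of interest to be the brightest
    # pixel of a coarse microwave image of the scene.
    return np.unravel_index(np.argmax(scene_image), scene_image.shape)

def focus_on(roi, n_elements=64):
    # Stage 2 (toy): pick a metasurface phase configuration intended to
    # steer energy toward the ROI; here just a deterministic 2-bit-style
    # phase profile derived from the ROI coordinates.
    phases = (np.arange(n_elements) * (roi[0] + 1) * (roi[1] + 1)) % 4
    return phases * (np.pi / 2)

def classify_biosignal(signal, templates):
    # Stage 3 (toy): stand-in for the dedicated neural network --
    # nearest-template matching on the measured time series.
    scores = [np.linalg.norm(signal - t) for t in templates]
    return int(np.argmin(scores))

# Synthetic scene: one dominant scatterer plays the role of the mouth.
scene = rng.random((8, 8))
scene[3, 5] = 2.0
roi = localize_mouth(scene)          # -> (3, 5)
config = focus_on(roi)               # 64-element phase profile
templates = [np.zeros(16), np.ones(16)]
signal = np.ones(16) + 0.01 * rng.standard_normal(16)
command = classify_biosignal(signal, templates)  # -> 1
```

The sketch only mirrors the control flow; in the actual system, the focusing step exploits the metasurface's many degrees of freedom, and the decoding is learned end to end rather than template-matched.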