Listeners often operate in complex acoustic environments consisting of many concurrent sounds. Accurately encoding and maintaining such auditory objects in short‐term memory is crucial for communication and scene analysis. Yet, the neural underpinnings of successful auditory short‐term memory (ASTM) performance are currently not well understood. To address this gap, we presented a novel, challenging auditory delayed match‐to‐sample task while recording magnetoencephalography (MEG). Human participants listened to ‘scenes’ comprising three concurrent tone pip streams. The task was to indicate, after a delay, whether a probe stream was present in the just‐heard scene. We report three key findings. First, behavioural performance revealed faster responses in correct versus incorrect trials as well as in ‘probe present’ versus ‘probe absent’ trials, consistent with ASTM search. Second, successful compared with unsuccessful ASTM performance was associated with a significant enhancement of event‐related fields and oscillatory activity in the theta, alpha and beta frequency ranges. This extends previous findings of an overall increase of persistent activity during short‐term memory performance. Third, using distributed source modelling, we found these effects to be confined mostly to sensory areas during encoding, presumably related to ASTM contents per se. Parietal and frontal sources then became relevant during the maintenance stage, indicating that effective short‐term memory operation also relies on ongoing inhibitory processes suppressing task‐irrelevant information. In summary, our results deliver a detailed account of the neural patterns that differentiate successful from unsuccessful ASTM performance in the context of a complex, multi‐object auditory scene.