The present study aimed to disentangle the influence of gesture type, physical involvement level, and individual differences in learner characteristics, i.e., working memory (WM) capacity and musicality, in determining the effectiveness of L2 lexical stress training. To this end, 60 native speakers of Dutch read aloud Spanish phrases containing cognates, which were counterbalanced for lexical stress position relative to their Dutch counterparts (e.g., 'piRÁmides' in Spanish, 'piraMIdes' in Dutch). They did so as a pre-test (T1) before receiving lexical stress training, and as a post-test both directly after training (T2) and approximately one hour later (T3). Subjects received lexical stress training in one of five conditions varying in gesture type and physical involvement level: audio-visual (AV), AV-beat-perception, AV-beat-production, AV-metaphoric-perception, and AV-metaphoric-production. Between T2 and T3, subjects completed a WM capacity task and a musical aptitude task. The results show that, irrespective of training condition, subjects significantly improved their L2 lexical stress production from T1 to T2 and T3. Although differences between training conditions were non-significant, there were several significant three-way interactions involving testing time, training condition, and either WM capacity or musical aptitude. These findings underline the importance of considering task and learner characteristics in determining the gestural benefit in learning L2 prosody.