Multimodal approaches that jointly process audio and language are becoming increasingly important within music understanding and generation, giving rise to a new area of research, which we refer to as music-and-language (M&L). Several recent works have emerged in this domain, proposing methods to automatically generate music descriptions [21,7,9], synthesise music from a text prompt [2,14,30,6], search for music based on language queries [8,22,13], and more [20,17]. However, evaluating M&L models remains a challenge due to a lack of publicly accessible datasets with paired audio and language, resulting in the widespread use of private data [21,22,23,14,2,13] and inconsistent evaluation practices.