In this chapter, we describe a series of studies related to our research on using gestural sonic objects in music analysis. These include a method for annotating the qualities of gestural sonic objects on multimodal recordings; a ranking, obtained with the Random Forests algorithm, of which features in a multimodal dataset best predict the basic qualities of gestural sonic objects; and a supervised learning method for automated spotting designed to assist human annotators.

The subject of our analyses is a performance of Fragmente2, a choreomusical composition based on the Japanese composer Makoto Shinohara’s solo piece for tenor recorder, Fragmente (1968). To obtain the dataset, we made a multimodal recording of a full performance of the piece, capturing synchronised audio, video, motion, and electromyogram (EMG) data that describe the performers’ body movements. We then annotated gestural sonic objects in dedicated qualitative analysis sessions. Annotating gestural sonic objects on the recordings of this performance required a meticulous examination of the related theoretical concepts, with the aim of establishing a method applicable beyond this case study. Like other qualitative approaches that involve manual labelling of data, this annotation process proved very time-consuming, which motivated us to explore data-driven, automated approaches to assist expert annotators.
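
To illustrate the feature-ranking step in general terms, the sketch below fits a Random Forest on toy data and sorts features by their impurity-based importances. It is a minimal example assuming scikit-learn; the feature names (audio_rms, hand_speed, emg_envelope, and so on), the synthetic data, and the binary quality labels are placeholders standing in for the multimodal descriptors and annotations discussed in this chapter, not the actual dataset or pipeline.

```python
# Minimal sketch of ranking multimodal features with a Random Forest.
# All names and data here are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical multimodal features: each row is one annotated segment.
feature_names = ["audio_rms", "spectral_centroid", "hand_speed",
                 "hand_acceleration", "emg_envelope"]
X = rng.normal(size=(200, len(feature_names)))
# Hypothetical binary labels for one basic quality of the gestural
# sonic objects (e.g. impulsive vs. sustained).
y = (X[:, 2] + 0.5 * X[:, 4] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the forest and rank features by mean decrease in impurity.
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)

ranking = sorted(zip(feature_names, forest.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

In a setting like the one described here, permutation importance on a held-out split is a common complement to impurity-based importances, since the latter can favour high-cardinality or highly variable features.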