“…The annotated note events, with their associated timings, were used as time-frequency regions of importance [12] to estimate the following frame-wise continuous performance descriptors: fundamental frequency (f0), power, spectral centroid, spectral flux, spectral slope, and spectral flatness. Summary descriptors were calculated from the continuous data, using the methods described in [13] to generate four pitch-related descriptors (perceived pitch, jitter, vibrato depth, and vibrato rate), two power-related descriptors (average power and shimmer), and four timbre-related descriptors (average spectral centroid, average spectral flux, average spectral slope, and average spectral flatness). The use of transcription as a proxy for score data in AMPACT facilitated the filtering of any bleed-through in the Unmix-isolated vocal tracks.…”
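
To make the relationship between the frame-wise descriptors and their per-note averages concrete, the following is a minimal sketch using common textbook definitions of power, spectral centroid, spectral flatness, and spectral flux. It is not AMPACT's implementation and does not reproduce the methods of [13]; the function names `frame_descriptors` and `summarize`, the magnitude-spectrogram input, and the specific formulas are assumptions chosen for illustration only.

```python
import numpy as np


def frame_descriptors(mag, freqs, eps=1e-12):
    """Frame-wise descriptors from a magnitude spectrogram (illustrative only).

    mag   : array of shape (n_bins, n_frames), non-negative magnitudes
            restricted to one note's time-frequency region (assumed input)
    freqs : array of shape (n_bins,), bin centre frequencies in Hz
    """
    mag = np.asarray(mag, dtype=float)
    freqs = np.asarray(freqs, dtype=float)

    # Power: sum of squared magnitudes in each frame.
    power = np.sum(mag ** 2, axis=0)

    # Spectral centroid: magnitude-weighted mean frequency per frame.
    centroid = np.sum(freqs[:, None] * mag, axis=0) / (np.sum(mag, axis=0) + eps)

    # Spectral flatness: geometric mean over arithmetic mean of the magnitudes.
    geometric = np.exp(np.mean(np.log(mag + eps), axis=0))
    flatness = geometric / (np.mean(mag, axis=0) + eps)

    # Spectral flux: Euclidean distance between successive frames
    # (the first frame has no predecessor, so it is set to 0).
    step = np.diff(mag, axis=1)
    flux = np.concatenate([[0.0], np.sqrt(np.sum(step ** 2, axis=0))])

    return {"power": power, "centroid": centroid,
            "flatness": flatness, "flux": flux}


def summarize(descriptors):
    """Collapse each frame-wise descriptor to its mean over the note."""
    return {f"average_{name}": float(np.mean(values))
            for name, values in descriptors.items()}


if __name__ == "__main__":
    # Toy note region: 4 frequency bins, 3 identical frames with a flat spectrum.
    freqs = np.array([0.0, 100.0, 200.0, 300.0])
    mag = np.ones((4, 3))
    print(summarize(frame_descriptors(mag, freqs)))
```

In this sketch each summary descriptor is a plain mean over the frames of one note. Descriptors such as jitter, shimmer, and vibrato rate and depth require modelling the variation of f0 or power over time and are deliberately left out, since the excerpt defers their definitions to [13].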