“…Early works relied on MFCC inputs to reduce computation for genre classification [62], [63]. Many works have since been introduced based on time-frequency representations: e.g., CQT for chord recognition [45], guitar chord recognition [46], genre classification [116], and transcription [96]; melspectrograms for boundary detection [89], onset detection [90], hit song prediction [118], similarity learning [68], instrument recognition [39], and music tagging [26], [15], [17], [59]; and STFT for boundary detection [36], vocal separation [100], and vocal detection [88]. One-dimensional CNNs on raw audio input have been used for music tagging [26], [60], and for synthesising singing voice [9], polyphonic music [112], and instrument sounds [29].…”
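To make the input representations concrete, the following is a minimal NumPy sketch (not code from any of the cited works) of how an STFT magnitude spectrogram, the kind of 2-D time-frequency input a spectrogram-based CNN consumes, can be computed from a raw waveform; the frame size, hop size, and windowing choices here are illustrative assumptions.

```python
import numpy as np

def stft_magnitude(signal, n_fft=1024, hop=512):
    """Frame the signal, apply a Hann window, and take the FFT magnitude
    of each frame -- yielding the 2-D array a spectrogram CNN sees."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps the non-negative frequency bins: n_fft // 2 + 1 of them.
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq bins, time frames)

# One second of a 440 Hz sine at a 22050 Hz sampling rate.
sr = 22050
t = np.arange(sr) / sr
spec = stft_magnitude(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (513, 42): 513 frequency bins, 42 frames
```

A melspectrogram or CQT applies a further (mel-scaled or logarithmically spaced) filterbank on top of such a representation, trading linear frequency resolution for a spacing closer to musical pitch perception; a 1-D CNN on raw audio skips this step entirely and learns its filters from the waveform samples.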