“…First, a feature-extraction step takes as input the audio signal recorded by one (for the detection-only problem) or multiple (if localization is also addressed) microphones and builds a corresponding representation to be used as input to a neural network. Most methods use time-frequency feature representations, such as spectrograms [13], [14], [16], [19], gammatonegrams [14], [15], Mel-frequency cepstral coefficients (MFCCs) [13], [17], [19], or, less commonly, gammatone-frequency cepstral coefficients (GFCCs), the constant-Q transform (CQT), and chromagrams [17]. Others take the raw waveform of the windowed audio signal as the input feature [18].…”
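As a minimal sketch of what such a feature-extraction step might look like in practice (not taken from the cited works), the snippet below computes several of the time-frequency representations named above with the librosa library; the file name, sampling rate, window, and hop sizes are illustrative assumptions, and gammatone-based features (gammatonegrams, GFCCs) would require a separate package.

```python
import numpy as np
import librosa

# Hypothetical mono recording; sr and all frame parameters are assumptions.
y, sr = librosa.load("recording.wav", sr=16000)

# Log-Mel spectrogram: STFT magnitudes mapped onto a Mel filter bank.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=64
)
log_mel = librosa.power_to_db(mel, ref=np.max)

# MFCCs: DCT of the log-Mel energies, keeping the first coefficients.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

# Constant-Q transform magnitudes and a chromagram.
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512))
chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=512)

# Each result is a 2-D (frequency-bins x time-frames) array that can be
# fed to a neural network, e.g. as a single-channel "image".
print(log_mel.shape, mfcc.shape, cqt.shape, chroma.shape)
```

For a raw-waveform front end [18], the windowed signal `y` itself (a 1-D array per frame) would be passed to the network instead of any of these 2-D representations.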