Keyword spotting is an important research field because it plays a key role in device wake-up and user interaction on smart devices. However, it is challenging to minimize errors while operating efficiently on resource-constrained devices such as mobile phones. We present a broadcasted residual learning method to achieve high accuracy with a small model size and low computational load. Our method configures most of the residual functions as 1D temporal convolutions while still allowing 2D convolution, using a broadcasted residual connection that expands the temporal output to the frequency-temporal dimension. This residual mapping enables the network to represent useful audio features effectively with much less computation than conventional convolutional neural networks. We also propose a novel network architecture, the broadcasting-residual network (BC-ResNet), based on broadcasted residual learning, and describe how to scale up the model according to the target device's resources. BC-ResNets achieve state-of-the-art top-1 accuracies of 98.0% and 98.7% on Google Speech Commands datasets v1 and v2, respectively, and consistently outperform previous approaches while using fewer computations and parameters.
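As a rough illustration of the broadcasted residual idea described above, the sketch below builds a block that applies a 2D frequency-wise convolution, averages the result over the frequency axis, runs a 1D temporal convolution, and broadcasts the 1D output back over frequency before adding the residual. This is a minimal PyTorch sketch under assumed kernel sizes and layer choices, not the exact BC-ResNet block from the paper.

```python
import torch
import torch.nn as nn

class BroadcastedResidualBlock(nn.Module):
    """Minimal sketch of a broadcasted residual block (illustrative, not the paper's exact block).

    Input/output shape: (batch, channels, freq, time).
    """
    def __init__(self, channels: int):
        super().__init__()
        # 2D frequency-depthwise convolution (operates along the frequency axis)
        self.freq_dw = nn.Conv2d(channels, channels, kernel_size=(3, 1),
                                 padding=(1, 0), groups=channels)
        # 1D temporal convolution applied after averaging out the frequency axis
        self.temp_dw = nn.Conv1d(channels, channels, kernel_size=3,
                                 padding=1, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f2 = self.freq_dw(x)                      # (B, C, F, T): 2D frequency-wise feature
        temporal = f2.mean(dim=2)                 # (B, C, T): average over frequency
        temporal = self.pointwise(self.act(self.temp_dw(temporal)))
        # Broadcast the 1D temporal output back over the frequency dimension
        y = x + f2 + temporal.unsqueeze(2)        # (B, C, 1, T) broadcasts to (B, C, F, T)
        return self.act(y)

# Example: a log-Mel-like input with 40 frequency bins and 101 frames
x = torch.randn(8, 16, 40, 101)
block = BroadcastedResidualBlock(16)
print(block(x).shape)  # torch.Size([8, 16, 40, 101])
```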
In recent years, there has been substantial work on transcribing polyphonic music using non-negative spectrogram factorization. However, most of it focuses on transcribing the audio signal into note events, i.e., the onsets and pitches of notes. In this paper, a concept for automatic transcription of frequency-modulated musical expressions, such as vibrato and glissando, is proposed. To transcribe these musical expressions from a polyphonic music signal, hidden Markov model-constrained shift-invariant probabilistic latent component analysis is used. From an impulse distribution that reveals the frequency variation of each note, each expression can be modelled in accordance with designed rules. Experiments showed that the impulse distribution can be used to transcribe expressions from polyphonic music signals.
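The abstract does not spell out the designed rules, but the idea of reading expressions off a per-note impulse distribution can be illustrated with a toy sketch: convert each frame's impulse distribution into an expected pitch deviation, then label the trajectory as glissando (large net drift) or vibrato (periodic oscillation). The thresholds, the function name classify_expression, and the rules themselves are illustrative assumptions, not those used in the paper.

```python
import numpy as np

def classify_expression(impulse_dist: np.ndarray, cents_per_bin: float = 10.0) -> str:
    """Toy rule-based labelling over an impulse distribution.

    impulse_dist: array of shape (n_frames, n_shift_bins); each row is the
    impulse distribution of one note at one frame.
    Returns "vibrato", "glissando", or "steady".
    """
    n_frames, n_bins = impulse_dist.shape
    center = n_bins // 2
    # Per-frame expected pitch shift (in cents) relative to the nominal pitch
    bins = (np.arange(n_bins) - center) * cents_per_bin
    traj = impulse_dist @ bins / np.clip(impulse_dist.sum(axis=1), 1e-9, None)

    drift = traj[-1] - traj[0]                      # overall pitch drift
    detrended = traj - np.linspace(traj[0], traj[-1], n_frames)
    crossings = np.sum(np.diff(np.sign(detrended)) != 0)

    if abs(drift) > 100:                            # more than a semitone of net drift
        return "glissando"
    if crossings >= 4 and np.ptp(detrended) > 20:   # periodic oscillation of the pitch
        return "vibrato"
    return "steady"

# Example: a synthetic vibrato trajectory expressed as a sharp impulse per frame
frames, bins = 100, 41
dist = np.zeros((frames, bins))
dev = np.round(5 * np.sin(2 * np.pi * 6 * np.arange(frames) / frames)).astype(int)
dist[np.arange(frames), bins // 2 + dev] = 1.0
print(classify_expression(dist))  # "vibrato"
```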
This paper presents a novel onset detection algorithm based on cepstral analysis. Instead of using the unnecessary mel scale or attending to non-harmonic components, we selectively focus on the changes in particular cepstral coefficients that represent the harmonic structure of an input signal. Compared with a conventional time-frequency analysis, the advantage of using cepstral coefficients is that they reveal the harmonic structure more clearly and yield a robust detection function even when the waveform envelope fluctuates or increases slowly. As a detection function, harmonic cepstrum regularity (HCR) is derived by summing several harmonic cepstral coefficients, whose quefrency indices are defined from the previous frame so as to reflect the temporal changes in the harmonic structure. Experiments show that the proposed algorithm achieves a significant performance improvement over other algorithms, particularly for pitched instruments with soft onsets, such as violin and singing voice.
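As a hedged sketch of how such a detection function might be computed, the code below takes the real cepstrum of each frame, estimates the fundamental quefrency from the previous frame, sums the cepstral coefficients at the corresponding harmonic quefrencies, and reports drops in that sum as onset evidence. The function name hcr_detection_function, the harmonic count, and the search range are assumptions; the paper's exact HCR definition and peak-picking may differ.

```python
import numpy as np

def hcr_detection_function(frames: np.ndarray, sr: int, n_harmonics: int = 5,
                           fmin: float = 80.0, fmax: float = 1000.0) -> np.ndarray:
    """Toy harmonic-cepstrum onset detection function.

    frames: (n_frames, frame_len) windowed audio frames.
    Returns one detection value per frame (the first frame is 0).
    """
    n_frames, frame_len = frames.shape
    # Real cepstrum of each frame
    spec = np.abs(np.fft.rfft(frames, axis=1)) + 1e-10
    cep = np.fft.irfft(np.log(spec), n=frame_len, axis=1)

    qmin, qmax = int(sr / fmax), int(sr / fmin)   # quefrency search range (samples)
    det = np.zeros(n_frames)
    for t in range(1, n_frames):
        # Fundamental quefrency estimated from the PREVIOUS frame
        q0 = qmin + np.argmax(cep[t - 1, qmin:qmax])
        idx = np.arange(1, n_harmonics + 1) * q0
        idx = idx[idx < frame_len // 2]
        # Regularity: sum of cepstral coefficients at the harmonic quefrencies
        prev = cep[t - 1, idx].sum()
        curr = cep[t, idx].sum()
        # A drop in regularity at the previous frame's quefrencies suggests
        # the harmonic structure has changed, i.e., a possible onset
        det[t] = max(prev - curr, 0.0)
    return det
```

In practice the detection values would be smoothed and peak-picked with a threshold to obtain onset times, as is common for onset detection functions.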