High quality time-scale modification of speech using a peak alignment overlap-add algorithm (PAOLA)

Dorran, David; Lawlor, Robert; Coyle, Eugene

doi:10.1109/icassp.2003.1198877

Cited by 9 publications

(6 citation statements)

References 12 publications


“…Pairs of real-speech and Jabberwocky stories were matched in terms of silence-to-signal ratio by increasing silences (the portions of signal with amplitude between 0.001 and −0.001 of the maximum amplitude and longer than 50 ms) with the adequate time constant. The length of each story was also matched by slightly changing the sound tempo with a MATLAB (Mathworks Inc.) implementation of the VSOLA (variable parameter synchronized overlap add) algorithm [47]. The volume of acoustic stimuli was set between 45 and 50 dB following participants' preferences and in line with our previous studies [15,27,28].…”

Section: Methods (mentioning)

confidence: 99%

Sleepers track informative speech in a multitalker environment

et al. 2019
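
The silence criterion quoted in the statement above (amplitude within ±0.001 of the maximum amplitude, lasting longer than 50 ms) can be illustrated with a minimal sketch. This is not the citing authors' MATLAB code: the function name `find_silences`, its parameters, and the use of NumPy are assumptions made for illustration only.

```python
import numpy as np

def find_silences(signal, sample_rate, rel_threshold=0.001, min_duration_s=0.050):
    """Return (start, end) sample indices of silent runs (end is exclusive).

    A sample counts as quiet when |x| <= rel_threshold * max|x|; a run of
    quiet samples counts as a silence only if it lasts longer than
    min_duration_s. All names and defaults here are illustrative assumptions.
    """
    x = np.asarray(signal, dtype=float)
    peak = np.max(np.abs(x)) if x.size else 0.0
    quiet = np.abs(x) <= rel_threshold * peak
    # Pad with False so every quiet run has a rising and a falling edge.
    padded = np.concatenate(([False], quiet, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    min_len = min_duration_s * sample_rate
    return [(int(s), int(e)) for s, e in zip(starts, ends) if (e - s) > min_len]
```

Lengthening each returned run would raise the silence-to-signal ratio; how much to lengthen is not specified in the excerpt and is left out here.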


“…Once the peak has been determined, the lowest energy point between the two peaks is configured as the syllabic boundary (Jarman et al., 2003; Kwon & Kim, 2011; O'Haver, 2001). The time scale is modified by the Synchronized Overlap-Add Algorithm (Covell, Withgott, & Slaney, 1998; Dorran et al., 2003; Hejna & Musicus, 2003; Ninness & Henriksen, 2008).…”

Section: Methods (mentioning)

confidence: 99%
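
The boundary rule in the statement above (the lowest-energy point between two peaks) can be sketched as follows. This is a minimal illustration, not the cited method: it assumes an energy envelope and the peak positions have already been computed, and the function name `boundaries_between_peaks` is hypothetical.

```python
import numpy as np

def boundaries_between_peaks(energy, peak_frames):
    """For each pair of adjacent peaks, return the index of minimum energy
    lying between them (inclusive of both peaks).

    `energy` is a per-frame energy envelope and `peak_frames` holds the
    frame indices of already-detected peaks; both are assumed inputs.
    """
    energy = np.asarray(energy, dtype=float)
    peaks = sorted(int(p) for p in peak_frames)
    boundaries = []
    for left, right in zip(peaks[:-1], peaks[1:]):
        # argmin is relative to the slice, so shift it back by `left`.
        boundaries.append(left + int(np.argmin(energy[left:right + 1])))
    return boundaries
```

For example, with the envelope `[1, 5, 3, 0.5, 2, 6, 4, 1, 7]` and peaks at frames 1, 5 and 8, the function returns `[3, 7]`.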

“…But editing sound effects is yet another field that demands knowledge and expertise that most users do not possess. If there were intuitive tools that could allow an individual to create character motions through simple finger strokes and match sound effects to the specific situations of a scene, then content creation could become a much easier endeavor (Dorran, Lawlor, & Coyle, 2003; Gillet & Richard, 2005; Ishihara, Nakatani, Ogata, & Okuno, 2004; Ishihara et al., 2003; Jarman, Daly, Anderson, & Wahl, 2003; Kwon & Kim, 2011).…”

Section: Introduction (mentioning)

confidence: 98%

Voice-Driven Sound Effect Manipulation

Kwon

2012

International Journal of Human-Computer Interaction


Authoring tools for sketching the motion of characters to be animated have been studied for content such as computer animations, games, and user-created content. However, a natural interface for sound editing has not been sufficiently studied. This article proposes an intuitive interface method in which a sound sample is selected and edited by speaking sound-imitation words (onomatopoeia). An experiment with the method, based on the statistical models generally used for pattern recognition, showed recognition accuracy of up to 99%. In a second experiment on sound editing, syllable segmentation was executed first, and the syllabic time scale of the sound samples was then modified by the Synchronized Overlap-Add algorithm. The energy of each syllable was then modified according to the utterances of sound-imitation words. The experiment showed that the proposed method, compared with modifying the sample as a whole, achieved about 20.4% and 65.6% relative improvement in the time displacement of peaks and syllabic boundaries between modified sound samples and sound-imitation utterances.


“…This section introduces the theoretical basis of the real-time iterative inversion (RTISI) and its implementation. In recent years, several TSM algorithms [4] have been proposed [5,6,7,8]. This paper adopts the successful RTISI algorithm [9,10], which processes according to Fig.…”

Section: Speech Time-scale Modification (mentioning)

confidence: 99%

LSTM-TDNN with convolutional front-end for Dialect Identification in the 2019 Multi-Genre Broadcast Challenge

Miao, McLoughlin

2019

Preprint


This paper presents a novel Dialect Identification (DID) system developed for the Fifth Edition of the Multi-Genre Broadcast challenge, the task of Fine-grained Arabic Dialect Identification (MGB-5 ADI Challenge). The system improves upon traditional DNN x-vector performance by employing a Convolutional and Long Short Term Memory-Recurrent (CLSTM) architecture to combine the benefits of a convolutional neural network front-end for feature extraction and a back-end recurrent neural network to capture longer temporal dependencies. Furthermore, we investigate intensive augmentation of one low-resource dialect in the highly unbalanced training set using time-scale modification (TSM). This converts an utterance to several time-stretched or time-compressed versions, subsequently used to train the CLSTM system without using any other corpus. In this paper, we also investigate speech augmentation using MUSAN and the RIR datasets to increase the quantity and diversity of the existing training data in the normal way. Results show, first, that the CLSTM architecture outperforms a traditional DNN x-vector implementation; second, that adopting TSM-based speed perturbation yields a small performance improvement for the unbalanced data; and finally, that traditional data augmentation techniques yield further benefit, in line with evidence from related speaker and language recognition tasks. Our system achieved 2nd place ranking out of 15 entries in the MGB-5 ADI challenge, presented at ASRU 2019.

