2020
DOI: 10.48550/arxiv.2010.05171
Preprint
fairseq S2T: Fast Speech-to-Text Modeling with fairseq

Cited by 39 publications (19 citation statements)
References 30 publications
“…We compare our systems with the Speech-to-Text Transformer model available in Fairseq [20], to evaluate the performance of our systems with respect to a baseline. In particular, we use the small architecture, which is the one with reported results 1 .…”
Section: Methods
confidence: 99%
“…These kinds of sequences are about an order of magnitude longer than text inputs; the computational cost of training the model can therefore rise critically. Hence, a common approach in ST systems is to add convolutional layers before the Transformer encoder that reduce the input sequence length [20]. Other systems also include 2D self-attention layers and a distance penalty in the attention, to bias it towards the local context [12].…”
Section: Related Work
confidence: 99%
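The length-reduction trick described in the statement above can be sketched as a small stack of strided 1-D convolutions in front of the encoder. This is a minimal illustrative sketch, not fairseq S2T's actual implementation; the module name `ConvSubsampler` and all dimensions are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class ConvSubsampler(nn.Module):
    """Hypothetical sketch: strided convolutions that shrink the time axis
    before a Transformer encoder, as commonly done in ST systems."""

    def __init__(self, in_dim=80, hidden_dim=256, kernel=5, stride=2):
        super().__init__()
        # Two stride-2 convolutions reduce the sequence length by ~4x overall.
        self.convs = nn.Sequential(
            nn.Conv1d(in_dim, hidden_dim, kernel, stride=stride, padding=kernel // 2),
            nn.GELU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel, stride=stride, padding=kernel // 2),
            nn.GELU(),
        )

    def forward(self, x):
        # x: (batch, time, features), e.g. log-mel filterbank frames
        y = self.convs(x.transpose(1, 2))  # Conv1d expects (batch, channels, time)
        return y.transpose(1, 2)           # back to (batch, time', hidden_dim)

feats = torch.randn(2, 1000, 80)  # 1000 frames of 80-dim filterbanks
out = ConvSubsampler()(feats)
print(out.shape)  # torch.Size([2, 250, 256]): 4x shorter sequence
```

Feeding the 250-step output (instead of 1000 raw frames) to the encoder cuts the quadratic self-attention cost by roughly 16x, which is why this subsampling step is near-universal in speech-to-text Transformers.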