2021
DOI: 10.48550/arxiv.2107.03069
Preprint

Efficient Transformer for Direct Speech Translation

Abstract: The advent of Transformer-based models has surpassed the barriers of text. When working with speech, we must face a problem: the sequence length of an audio input is not suitable for the Transformer. To bypass this problem, a usual approach is adding strided convolutional layers to reduce the sequence length before using the Transformer. In this paper, we propose a new approach for direct Speech Translation, where thanks to an efficient Transformer we can work with a spectrogram without having to use convolutional…
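As a rough illustration of the downsampling step the abstract refers to, the sketch below stacks two strided 1-D convolutions that shrink a spectrogram's time axis (about 4x here) before a standard Transformer encoder. It is a minimal PyTorch sketch under assumed hyperparameters (80 mel bins, kernel size 5, stride 2, d_model 256), not the paper's implementation.

```python
# Minimal sketch (assumed hyperparameters, not the paper's code) of the
# "usual approach": strided convolutions compress the frame sequence
# before it is fed to a standard Transformer encoder.
import torch
import torch.nn as nn


class ConvSubsampler(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        # Each stride-2 Conv1d roughly halves the number of frames,
        # so the Transformer sees a sequence about 4x shorter.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
        )

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, frames, n_mels) -> (batch, ~frames // 4, d_model)
        x = self.conv(spectrogram.transpose(1, 2))
        return x.transpose(1, 2)


if __name__ == "__main__":
    frames = 3000  # ~30 s of audio at 10 ms per frame
    subsampler = ConvSubsampler()
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
        num_layers=2,
    )
    feats = subsampler(torch.randn(1, frames, 80))
    print(feats.shape)           # torch.Size([1, 750, 256])
    print(encoder(feats).shape)  # torch.Size([1, 750, 256])
```

The paper's own proposal goes the other way: keep the full spectrogram and make the Transformer itself efficient, rather than relying on this convolutional compression.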

Cited by 1 publication (5 citation statements)
References 16 publications

“…First, it can be observed from Table 1, that the efficient architecture based only on Local Attention (local_attention) already obtains the same results as the baseline, suggesting the presence of unnecessary computations in self-attention. Unlike previous works (Alastruey et al., 2021), this architecture maintains the convolutional layers, so the amount of global content within the attention mechanism is higher using the same window size. On the other hand, while the architecture based exclusively on ConvAttention (conv_attention), man-…” [Footnote 3: More details in Table 2 in the Appendices.]
Section: Results (mentioning; confidence: 99%)
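For readers unfamiliar with the local_attention configuration discussed in the statement above, the following is a minimal sketch of windowed self-attention: each position attends only to neighbours within a fixed window, which is what removes the "unnecessary computations" of full self-attention. Single-head, with an assumed window size; it is not the cited implementation.

```python
# Minimal sketch of windowed (local) self-attention: positions farther than
# `window` steps apart are masked out of the attention softmax.
import torch


def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    # mask[i, j] is True when j lies OUTSIDE the window around i,
    # i.e. that score is excluded from the softmax.
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()
    return distance > window


def local_self_attention(x: torch.Tensor, window: int) -> torch.Tensor:
    # x: (batch, seq_len, d_model); single-head attention for brevity.
    d_model = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d_model**0.5
    scores = scores.masked_fill(local_attention_mask(x.size(1), window), float("-inf"))
    return scores.softmax(dim=-1) @ x


if __name__ == "__main__":
    x = torch.randn(2, 100, 64)
    out = local_self_attention(x, window=16)  # window size is an illustrative assumption
    print(out.shape)  # torch.Size([2, 100, 64])
```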
“…However, after this compression, the resulting sequences are still considerably longer and more redundant than their text counterparts. Alastruey et al (2021) proposed the use of efficient Transformers for ST, but, as observed in different tasks by Tay et al (2021), they suffer from a drop in performance quality. The main reason for this deterioration is that most efficient Transformers propose strategies that deprive the model of the ability to learn all types of content from the input stream.…”
Section: Multi-head Multi-attention (mentioning; confidence: 99%)