ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682674

Bytes Are All You Need: End-to-end Multilingual Speech Recognition and Synthesis with Bytes

Abstract: We present two end-to-end models: Audio-to-Byte (A2B) and Byte-to-Audio (B2A), for multilingual speech recognition and synthesis. Prior work has predominantly used characters, sub-words or words as the unit of choice to model text. These units are difficult to scale to languages with large vocabularies, particularly in the case of multilingual processing. In this work, we model text via a sequence of Unicode bytes, specifically, the UTF-8 variable length byte sequence for each character. Bytes allow us to avoid…
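The byte-level idea in the abstract is easy to see concretely. Below is a minimal Python sketch (ours, not the paper's code) showing how UTF-8 turns each character into a variable-length sequence of one to four bytes, so text in any language maps to tokens drawn from a fixed set of 256 values:

```python
# Minimal sketch (ours, not the paper's code): UTF-8 assigns each character a
# variable-length sequence of 1-4 bytes, so any text in any language maps to
# tokens drawn from a fixed set of 256 values.
for ch in ["a", "é", "中", "🎤"]:
    byte_seq = list(ch.encode("utf-8"))
    print(f"{ch!r} -> {byte_seq}")
# 'a'  -> [97]                   (1 byte: ASCII)
# 'é'  -> [195, 169]             (2 bytes)
# '中' -> [228, 184, 173]        (3 bytes)
# '🎤' -> [240, 159, 142, 164]   (4 bytes)
```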

Cited by 124 publications (75 citation statements). References 23 publications.
“…For example, deep neural networks trained on the ImageNet dataset can be adapted to other classification problems using small amounts of task-specific data by retraining the last layers or fine-tuning the weights with a small learning rate. In speech recognition, a model can be pretrained on languages with more transcribed data and then adapted to a low-resource language [75] or to new domains [76].…”
Section: Data
confidence: 99%
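The two adaptation strategies named in this statement (retraining the last layers, or fine-tuning all weights with a small learning rate) can be sketched in a few lines. The following PyTorch snippet is an illustrative assumption, not code from either cited paper; the layer sizes and learning rates are placeholders:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a pretrained network: "encoder" layers plus a
# task-specific output layer. Sizes are placeholders, not from any paper.
model = nn.Sequential(
    nn.Linear(80, 256), nn.ReLU(),  # pretrained feature layers
    nn.Linear(256, 100),            # last layer, retrained for the new task
)

# Strategy 1: retrain only the last layer (freeze everything else).
for p in model[:-1].parameters():
    p.requires_grad = False
last_layer_opt = torch.optim.Adam(model[-1].parameters(), lr=1e-3)

# Strategy 2 (alternative): fine-tune all weights with a small learning rate.
for p in model.parameters():
    p.requires_grad = True
full_finetune_opt = torch.optim.Adam(model.parameters(), lr=1e-5)
```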
“…With the increasing adoption of speech-based applications, extending speech support to more speakers and languages has become more important. Transfer learning has been used to boost the performance of ASR systems on low-resource languages with data from resource-rich languages [75]. With the success of deep learning models in ASR, other speech-related tasks have also embraced deep learning techniques, such as voice activity detection [95], speaker recognition [96], language recognition [97] and speech translation [98].…”
Section: Applications
confidence: 99%
“…End-to-end TTS models have typically used character [2] or phoneme [8,23] input representations, or hybrids between them [24,25]. Recently, [19] proposed using inputs derived from the UTF-8 byte encoding in multilingual settings. We evaluate the effects of using these representations for multilingual TTS.…”
Section: Input Representations
confidence: 99%
“…Following [19] we experiment with an input representation based on the UTF-8 text encoding, which uses 256 possible values for each input token; the mapping from graphemes to bytes is language-dependent. For languages with single-byte characters (e.g., English), this representation is equivalent to the grapheme representation.…”
Section: UTF-8 Encoded Bytes
confidence: 99%
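To make the equivalence claim concrete, here is a short Python sketch (our illustration, not the cited system's implementation) of a byte-level tokenizer: every token is one of 256 values, and for single-byte scripts such as English the byte sequence coincides with the grapheme sequence:

```python
# Assumed illustration (ours, not the cited system): a byte-level tokenizer
# whose vocabulary is exactly the 256 possible byte values.
def to_byte_tokens(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def from_byte_tokens(tokens: list[int]) -> str:
    return bytes(tokens).decode("utf-8")

print(to_byte_tokens("hello"))  # [104, 101, 108, 108, 111]
# One byte per character: identical to a grapheme representation for English.

print(to_byte_tokens("你好"))    # [228, 189, 160, 229, 165, 189]
# Three bytes per character for Chinese: the grapheme-to-byte mapping is
# language-dependent, but the token inventory stays fixed at 256.

assert from_byte_tokens(to_byte_tokens("你好")) == "你好"  # lossless round trip
```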
“…Recently, the encoder-decoder framework has been successfully applied to TTS systems. In [12], Li et al. present two end-to-end models, Audio-to-Byte (A2B) and Byte-to-Audio (B2A), for multilingual speech recognition and synthesis, modeling text as a sequence of Unicode bytes, specifically the UTF-8 variable-length byte sequence for each character. The B2A model is able to synthesize code-switching text and the speech is fluent, but the speaker's voice changes across languages.…”
Section: Related Work
confidence: 99%