One of the main challenges in building code-mixed ASR systems is the lack of annotated speech data. Monolingual speech corpora, however, are often available in abundance for the languages that make up the code-mixed speech. In this paper, we explore different techniques that use monolingual speech to create synthetic code-mixed speech and examine their effect on training models for code-mixed ASR. We assume access to a small amount of real code-mixed text, from which we extract probability distributions that govern the transition of phones across languages at code-switch boundaries and the lengths of spans in each language. We extract segments from monolingual data and concatenate them to form code-mixed utterances such that these probability distributions are preserved. Using this synthetic speech, we show significant improvements in Hindi-English code-mixed ASR performance compared to using synthetic speech naively constructed from complete utterances in different languages. We also present language modelling experiments that use synthetically constructed code-mixed text and discuss their benefits.
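The span-length part of this idea can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the code-mixed text is available as lists of (token, language tag) pairs, that exactly two languages are present, and that spans simply alternate between them. The function names `span_length_distributions` and `sample_span_plan` are hypothetical, and the phone-transition distribution and the audio segment selection are left out entirely.

```python
import random
from collections import Counter, defaultdict
from itertools import groupby


def span_length_distributions(tagged_sentences):
    """Estimate P(span length | language) from language-tagged sentences.

    Each sentence is a list of (token, lang) pairs (an assumed input format).
    """
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        # groupby merges consecutive tokens sharing a language tag into one span.
        for lang, group in groupby(sentence, key=lambda tok: tok[1]):
            counts[lang][len(list(group))] += 1
    return {
        lang: {length: n / sum(c.values()) for length, n in c.items()}
        for lang, c in counts.items()
    }


def sample_span_plan(dists, start_lang, n_spans, rng):
    """Sample alternating (language, span length) pairs for a synthetic utterance.

    Assumes exactly two languages; each span length is drawn from the
    estimated distribution for that span's language.
    """
    langs = sorted(dists)
    if len(langs) != 2:
        raise ValueError("this sketch assumes exactly two languages")
    other = dict(zip(langs, reversed(langs)))
    plan, lang = [], start_lang
    for _ in range(n_spans):
        lengths, probs = zip(*dists[lang].items())
        plan.append((lang, rng.choices(lengths, weights=probs, k=1)[0]))
        lang = other[lang]
    return plan
```

A plan such as `[("hi", 2), ("en", 1), ("hi", 1)]` would then tell a downstream step how many words to take from each monolingual corpus before concatenating the corresponding audio segments. How those segments are chosen and joined is outside this sketch.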