Disfluency correction (DC) is the process of removing disfluent elements such as fillers, repetitions and corrections from spoken utterances to create readable and interpretable text. DC is a vital post-processing step applied to Automatic Speech Recognition (ASR) outputs before they are passed to downstream language understanding tasks. Existing DC research has primarily focused on English due to the unavailability of large-scale open-source datasets in other languages. Towards the goal of multilingual disfluency correction, we present a high-quality human-annotated DC corpus covering four important Indo-European languages: English, Hindi, German and French. We provide an extensive analysis of state-of-the-art DC models across all four languages, which obtain F1 scores of 97.55 (English), 94.29 (Hindi), 95.89 (German) and 92.97 (French). To demonstrate the benefits of DC on downstream tasks, we show that DC leads to an average increase of 5.65 BLEU points when used in conjunction with a state-of-the-art Machine Translation (MT) system. We release code to run our experiments along with our annotated dataset here.¹
| Disfluency Type | Description | Example |
|---|---|---|
| Filler | Words like uhh, err, uhmm that are often uttered to retain turn of speaking. Each language has a different set of filler words commonly uttered. | EN: Write a message to um Sarah. DE: Fortsetzen ähm meines Lauftrainings. FR: Montre euh mes applications. HI: मे रा उम्म पल्सरे ट फट बत में चे क करो |
| Repetition | Consists of words or phrases that are repeated in conversational speech | EN: Add this number to my to my contacts. DE: ein Instagram-Foto machen machen. FR: Enregistre mes 400 calories enregistre. HI: क्या तु म हॉ टल का हॉ टल का एक नोट बना सकते हो? |
| Correction | Disfluencies that consist of words incorrectly spoken and immediately corrected with a fluent phrase | EN: Get me the order my order status on the desk chair I ordered from Overstock. DE: HD Video auf aufnehmen. FR: Reprendre l'exercice d'étirem d'étirement HI: रा राहु राहुल का मै से ज पढ़ो |
| False Start | Examples where the speaker changes their chain-of-thought mid sentence to utter a completely different fluent phrase | EN: In an email let's email Tom Hardy about Saturday's video shoot. DE: Facebook uh Jahr Facebook bitte. FR: Envoi de le envoi du SMS à maman. HI: कल उम्म आज क ब्लड प्रे शर री ड ग बताओ |
| Fluent | Examples which do not contain any disfluent words or phrases | EN: Can you make a note for Johnny that says dinner at eight on my laptop? DE: Nummer zu Kontakten hinzufügen. FR: Je veux j'aimerais ouvrir TikTok. HI: क्या आप योसे माइट ने शनल पाकर् को ईमे ल कर सकते हैं ? |

Table 2: Types of sentences observed in the DISCO corpus. All disfluencies are marked in red; EN-English, DE-German, FR-French, HI-Hindi. Examples in languages other than English, with their corresponding gloss and transliteration, can be found in Appendix E.
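To make the filler and repetition rows of Table 2 concrete, the following is a minimal rule-based sketch in Python. It is not the DC models evaluated in this work; it is a toy baseline written only for illustration. The filler lists are assumptions drawn from the examples in Table 2 (Hindi is left out of the sketch), and the function names `remove_fillers`, `remove_adjacent_repetitions` and `correct` are hypothetical. Corrections, false starts and non-adjacent repetitions are deliberately not handled, which is one reason learned models are needed.

```python
# Toy rule-based disfluency removal (illustrative only, not the paper's models).

# Assumed per-language filler inventories, taken from the Table 2 examples.
FILLERS = {
    "en": {"um", "uhh", "err", "uhmm"},
    "de": {"ähm"},
    "fr": {"euh"},
}


def _norm(token):
    """Lowercase a token and strip surrounding punctuation for comparison."""
    return token.lower().strip(".,?!")


def remove_fillers(tokens, lang):
    """Drop tokens that match the filler list of the given language."""
    fillers = FILLERS.get(lang, set())
    return [t for t in tokens if _norm(t) not in fillers]


def remove_adjacent_repetitions(tokens, max_len=3):
    """Collapse immediately repeated n-grams (n <= max_len).

    The first copy is deleted so that punctuation attached to the
    second copy (e.g. a sentence-final period) is preserved.
    """
    out = list(tokens)
    for n in range(max_len, 0, -1):
        i = 0
        while i + 2 * n <= len(out):
            first = [_norm(t) for t in out[i:i + n]]
            second = [_norm(t) for t in out[i + n:i + 2 * n]]
            if first == second:
                del out[i:i + n]  # re-check the same position afterwards
            else:
                i += 1
    return out


def correct(text, lang):
    """Apply filler removal, then adjacent-repetition removal."""
    tokens = text.split()
    tokens = remove_fillers(tokens, lang)
    tokens = remove_adjacent_repetitions(tokens)
    return " ".join(tokens)


print(correct("Write a message to um Sarah.", "en"))
print(correct("Add this number to my to my contacts.", "en"))
print(correct("ein Instagram-Foto machen machen.", "de"))
```

Keeping the filler lists per language reflects the observation in Table 2 that each language has its own set of fillers; for instance, "um" is a filler in English but an ordinary preposition in German, so a single shared list would produce false positives.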