Research efforts in syntactic parsing have focused on written texts. As a result, speech parsing is usually performed on transcriptions, either in unrealistic settings (gold transcriptions) or on predicted transcriptions. Parsing speech from transcriptions, though straightforward to implement using out-of-the-box tools for Automatic Speech Recognition (ASR) and dependency parsing, has two important limitations. First, relying on transcriptions leads to error propagation due to recognition mistakes. Second, many acoustic cues that are important for parsing (prosody, pauses, ...) are no longer available in transcriptions. To address these limitations, we introduce wav2tree, an end-to-end dependency parsing model whose only input is the raw signal. Our model builds on a pretrained wav2vec2 encoder with a CTC loss to perform ASR. We extract token segmentation from the CTC layer to construct vector representations for each predicted token. Then, we use these token representations as input to a generic parsing algorithm. The whole model is trained end-to-end with a multitask objective (ASR, parsing) to reduce error propagation. Our experiments on the Orféo treebank of spoken French show that direct parsing from speech is feasible: wav2tree outperforms a pipeline approach based on wav2vec (for ASR) and FlauBERT (for parsing).
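The token segmentation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how frame vectors are combined, so mean pooling is an assumption here, as are the label ids `BLANK` and `DELIM` and the helper names `ctc_token_spans` and `pool_tokens`. The idea is to collapse frame-level CTC labels greedily, remember which frames each word covers, and average the encoder vectors over that span.

```python
from typing import List, Tuple

BLANK = 0  # assumed id of the CTC blank label
DELIM = 1  # assumed id of the word-delimiter label

Span = Tuple[List[int], int, int]  # (character ids, first frame, last frame + 1)


def ctc_token_spans(frame_ids: List[int], blank: int = BLANK,
                    delim: int = DELIM) -> List[Span]:
    """Greedy CTC collapse that also records the frame span of each word."""
    spans: List[Span] = []
    chars: List[int] = []
    start, end = None, None
    prev = blank
    for t, label in enumerate(frame_ids):
        if label == delim:
            # A delimiter closes the current word, if there is one.
            if chars:
                spans.append((chars, start, end + 1))
                chars, start = [], None
        elif label != blank:
            # CTC merges repeated labels unless a blank separates them.
            if label != prev:
                chars.append(label)
            if start is None:
                start = t
            end = t
        prev = label
    if chars:
        spans.append((chars, start, end + 1))
    return spans


def pool_tokens(frames: List[List[float]], spans: List[Span]) -> List[List[float]]:
    """Mean-pool frame vectors over each word span (pooling choice is an assumption)."""
    pooled = []
    for _, start, end in spans:
        segment = frames[start:end]
        dim = len(segment[0])
        pooled.append([sum(vec[d] for vec in segment) / len(segment)
                       for d in range(dim)])
    return pooled


# Toy input: 11 frames, two words separated by delimiter frames.
frame_ids = [0, 2, 2, 0, 3, 1, 1, 4, 0, 4, 0]
frames = [[float(t), 1.0] for t in range(len(frame_ids))]
spans = ctc_token_spans(frame_ids)
token_vectors = pool_tokens(frames, spans)
```

In this sketch, each pooled vector would then be passed to the parser as the representation of one predicted token. Other choices, such as taking the last frame of a span or running a recurrent layer over it, would fit the same interface.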
This paper presents a corpus study of word order variations (WOVs) in spontaneous spoken French. We studied several corpora of task-oriented spoken dialogue, each corresponding to a different application task, to assess the influence of discourse context on WOVs. We first argue that pilot corpus studies are valuable for guiding research in Natural Language Processing. We then present our analysis methodology and the main results of the study. In particular, we observe that the task and the speaker's role in the interaction have no significant influence on the realization of dislocations, whereas the frequency of WOVs is highly influenced by the degree of interactivity of the dialogues. These WOVs nevertheless respect strong structural regularities imposed by the ordering constraints of French. We therefore conclude that spontaneous spoken French should still be considered a language with rigid SVO order. Keywords: inversion; dislocation; word order variation; spontaneous speech; corpus linguistics.