“…Although achieving state-of-the-art (SOTA) results on written benchmarks (Wang et al., 2018), these models are not tailored to spoken dialog (SD). Indeed, Tran et al. (2019) have suggested that training a parser on conversational speech data can improve results, owing to the discrepancy between spoken and written language (e.g., disfluencies (Stolcke and Shriberg, 1996), fillers (Shriberg, 1999; Dinkar et al., 2020), and a different data distribution). Furthermore, capturing discourse-level features that distinguish dialog from other types of text (Thornbury and Slade, 2006), e.g., multi-utterance dependencies, is key to embedding dialog; such features are not explicitly present in pre-training objectives (Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019), which often treat sentences as a simple stream of tokens.…”