Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1374

A Large-Scale Corpus for Conversation Disentanglement

Abstract: Disentangling conversations mixed together in a single stream of messages is a difficult task, made harder by the lack of large manually annotated datasets. We created a new dataset of 77,563 messages manually annotated with reply-structure graphs that both disentangle conversations and define internal conversation structure. Our dataset is 16 times larger than all previously released datasets combined, the first to include adjudication of annotation disagreements, and the first to include context. We use our …

Cited by 64 publications (91 citation statements). References 29 publications.
“…Several researchers have defined tasks related to discourse structure, including sentence ordering (Chen et al., 2016; Logeswaran et al., 2016; Cui et al., 2018), sentence clustering (Wang et al., 2018b), and disentangling textual threads (Elsner and Charniak, 2008, 2010; Lowe et al., 2015; Mehri and Carenini, 2017; Jiang et al., 2018; Kummerfeld et al., 2019).…”
Section: Related Work
confidence: 99%
“…However, these methods rely heavily on hand-engineered features that are often too specific to the particular datasets (or domains) on which the model is trained and evaluated. For example, many of the features used by Kummerfeld et al. (2019) are only applicable to the Ubuntu IRC dataset. This hinders the model's generalization and adaptability to other domains.…”
Section: Introduction
confidence: 99%
“…LRL Corpora – Social Media: Today, social media is a rich source for developing text corpora for different NLP tools [71]–[75]. Leveraging the content of these platforms, the Cross-Lingual Arabic Blog Alerts (COLBA) [76] project has focused on collecting Arabic content from different social media sources, such as blogs, discussion forums, and chats, to develop NLP tools.…”
Section: Related Work
confidence: 99%