Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.217

GenWiki: A Dataset of 1.3 Million Content-Sharing Text and Graphs for Unsupervised Graph-to-Text Generation

Abstract: Data collection for knowledge graph-to-text generation is expensive. As a result, research on unsupervised models has recently emerged as an active field. However, most unsupervised models have to use non-parallel versions of existing small supervised datasets, which largely constrains their potential. In this paper, we propose a large-scale, general-domain dataset, GenWiki. Our unsupervised dataset has 1.3M text examples and 1.3M graph examples. With a human-annotated test set, we provide this new benchmark…
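As a rough illustration of what a content-sharing text and graph pair looks like, here is a minimal Python sketch. The field names ("text", "graph"), the entity and relation labels, and the linearization format are assumptions made for illustration; they are not GenWiki's released schema.

```python
# Hypothetical sketch of one GenWiki-style example: a text passage paired
# with knowledge-graph triples that share its content. Field names and
# labels are illustrative assumptions, not the dataset's actual schema.
example = {
    "text": "Alan Turing was a British mathematician born in London.",
    "graph": [
        ("Alan_Turing", "occupation", "Mathematician"),
        ("Alan_Turing", "birthPlace", "London"),
        ("Alan_Turing", "nationality", "British"),
    ],
}

def linearize(graph):
    """Flatten triples into one string, a common input format for
    sequence-to-sequence graph-to-text models."""
    return " ".join(f"<S> {s} <P> {p} <O> {o}" for s, p, o in graph)

if __name__ == "__main__":
    print(linearize(example["graph"]))
    # <S> Alan_Turing <P> occupation <O> Mathematician ...
```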

Cited by 20 publications (33 citation statements). References 25 publications.
“…It also used in-domain unlabeled documents during training, which we do not use. Jin et al. (2020) demonstrated that the choice of seed keywords has a significant impact on the model's accuracy. STM, S_label is the result of STM using only unigrams in the category name as seed keywords.…”
Section: Results of Coarse-Grained Contextual Classification (mentioning; confidence: 99%)
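A minimal sketch of the seed-keyword idea this snippet describes: unigrams from a category name serve as seeds, and a document is weakly labeled by the category whose seeds it mentions most. The function names and the matching rule are illustrative assumptions, not STM's actual procedure.

```python
# Illustrative sketch (not the STM implementation) of seed-keyword
# weak labeling: category-name unigrams act as seed keywords.
from collections import Counter

def seeds_from_category(category_name):
    """Lowercased unigrams of the category name,
    e.g. 'Natural Disasters' -> {'natural', 'disasters'}."""
    return set(category_name.lower().split())

def weak_label(document, categories):
    """Assign the category whose seed unigrams occur most often in the
    document; return None when no seed matches at all."""
    tokens = Counter(document.lower().split())
    scores = {
        cat: sum(tokens[s] for s in seeds_from_category(cat))
        for cat in categories
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

if __name__ == "__main__":
    cats = ["Sports", "Natural Disasters"]
    doc = "The earthquake was the worst natural disaster in a decade"
    print(weak_label(doc, cats))  # -> 'Natural Disasters'
```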
“…To investigate the contribution of the in-domain unlabeled documents to STM's superior performance, we trained an STM model with the manually curated keywords from Jin et al. (2020) and the Wikipedia dataset we used to train wiki2cat (denoted as STM, D_wiki). There is a noticeable decrease in performance for STM, D_wiki without in-domain unlabeled documents.…”
Section: Results of Coarse-Grained Contextual Classification (mentioning; confidence: 99%)