Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.572

NYTWIT: A Dataset of Novel Words in the New York Times

Abstract: We present the New York Times Word Innovation Types dataset, or NYTWIT, a collection of over 2,500 novel English words published in the New York Times between November 2017 and March 2019, manually annotated for their class of novelty (such as lexical derivation, dialectal variation, blending, or compounding). We present baseline results for both uncontextual and contextual prediction of novelty class, showing that there is room for improvement even for state-of-the-art NLP systems. We hope this resource will …
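The abstract describes a non-contextual baseline for predicting a novel word's novelty class from its surface form alone. As a minimal sketch of that task setup, the toy classifier below matches character n-grams of a query word against a handful of hypothetical labeled examples; the words, labels, and nearest-neighbour method here are illustrative assumptions, not the paper's actual data or model.

```python
from collections import Counter

def char_ngrams(word, n=3):
    """Bag of character n-grams, with word boundaries marked."""
    padded = f"^{word}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

# Toy training set with hypothetical labels; the real NYTWIT classes
# include derivation, blending, and compounding, among others.
TRAIN = [
    ("gazillionaire", "blend"),
    ("mansplaining", "blend"),
    ("hangry", "blend"),
    ("deplatform", "derivation"),
    ("unfollow", "derivation"),
    ("retweet", "derivation"),
]

def predict(word):
    """Nearest neighbour by n-gram overlap: a deliberately simple
    non-contextual baseline, not the paper's reported system."""
    feats = char_ngrams(word)

    def overlap(example):
        train_word, _label = example
        # Counter intersection keeps the minimum count per shared n-gram.
        return sum((feats & char_ngrams(train_word)).values())

    _best_word, best_label = max(TRAIN, key=overlap)
    return best_label
```

A contextual variant would additionally encode the sentence the word appeared in, which is where the abstract notes current NLP systems still leave room for improvement.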


Cited by 5 publications (4 citation statements); References 12 publications
“…Tasks include predicting the degree of compositionality, i.e., the semantic relatedness of the constituents to the overall meaning; and predicting the meaning of the compound and evaluating it by detecting synonyms, paraphrases, or semantic relations. These tasks rely on a wide range of datasets (Biemann and Giesbrecht, 2011; Reddy et al., 2011; Hendrickx et al., 2013; Juhasz et al., 2015; Levin et al., 2019; Cordeiro et al., 2019; Pinter et al., 2020a); for a recent analysis, see Schulte im Walde (forthc.).…”
Section: Summary of Surveyed Approaches (mentioning)
confidence: 99%
“…Empirical analyses such as the one in Pinter et al. (2017) show that indeed, the overwhelming majority of downstream datasets contain words not present in the pre-training corpora. Pinter et al. (2020a) present a diachronic dataset showcasing the volume of novel terms entering a large, steady daily publication in English over time; but even a snapshot of a language at a given moment contains unlimited domain-specific terms, morphological derivations, named entities, potential loanwords, typographical errors, and other sources of OOVs which would appear very reasonably in text analysis tasks and which the downstream model should be given the faculty to handle. In fact, according to Kornai (2002), statistical reasoning leads us to conclude that languages have an infinite vocabulary.…”
Section: Out-of-Vocabulary Words (mentioning)
confidence: 99%
“…This is not a comprehensive list. More types of novel words are identified in Pinter et al. (2020a), and not all suggestions in the taxonomy above correspond to actual existing work. Limiting this discussion to a strict interpretation of written-form uniqueness also prevents us from considering as OOVs concepts which are spelled in the same way as other words, either by chance (homography, for example row as a line or a fight), by naming (e.g., Space Force), or by processes such as zero-derivation (the verb smoke, derived from the noun).…”
Section: Out-of-Vocabulary Words (mentioning)
confidence: 99%
“…Dataset and annotation. We collected a dataset of novel blends from the New York Times (NYT), starting from the output of a Twitter bot extracting all novel words with their originating contexts, a process described in Pinter et al. (2020). For each blend, we annotated the bases and the semantic relation between them, following the taxonomy defined by Tratz and Hovy (2010).…”
(mentioning)
confidence: 99%