2018
DOI: 10.1007/978-3-319-99133-7_17

Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora

Abstract: Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenient because their creation is fast and inexpensive. Several studies have been carried out to assess the representativeness of general-purpose web corpora by comparing them to traditional corpora. Less attention has been paid to assess the representativeness of specialized or domain-specific web corpora. In this paper, we focus on the assessment of domain representativeness of web corpora and we claim that it is p…

Citations: cited by 2 publications (3 citation statements)
References: 14 publications
“…Essentially, they allow for an evaluation of the quality of a domain-specific web corpus and can also be used to pre-assess the portability of NLP tools from one domain-specific corpus to a different corpus belonging to another domain. Similar experiments have also been carried out on Swedish corpora with much the same results (Santini et al, 2018), showing that our approach may become a language-independent standardized step in corpus evaluation practice (intrinsic evaluation metrics).…”
Section: Discussion
Citation type: mentioning
confidence: 58%
“…In this experiment, we evaluate how good the performance of the eCare term extractor is to bootstrap a web corpus based on the domain of the use cases. We measure the domainhood (or domain-specificity) against a reference corpus representing general language (see also Santini et al, 2018).…”
Section: Extrinsic Evaluation: Assessing Domainhood
Citation type: mentioning
confidence: 99%
“…Since each corpus varies in email counts and email lengths, relative term frequencies are used. While relative term frequencies control for corpus size, the scaling can introduce distortions, which complicate statistical tests. For a robustness measure, we sample emails from the larger corpus (Org-2) until the total term count equals the Org-1 and use absolute term frequency.…”
Section: Experiments Datasets and Their Characterization
Citation type: mentioning
confidence: 99%
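
The last citation statement describes two ways of comparing corpora of unequal size: relative term frequencies, and downsampling the larger corpus until its term count matches the smaller one so that absolute frequencies can be compared. A minimal Python sketch of both ideas follows. The function names, the toy data, and the choice to truncate the final sampled email so the totals match exactly are all assumptions made for illustration; the quoted passage does not specify these details, and this is not the cited authors' implementation.

```python
import random
from collections import Counter


def relative_frequencies(docs):
    """Return each term's share of all tokens in a corpus (a list of token lists)."""
    counts = Counter(token for doc in docs for token in doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def downsample_to_token_count(docs, target_tokens, seed=0):
    """Sample documents without replacement until the token total reaches the target.

    Assumption: the last sampled document is truncated so the totals match
    exactly; the quoted passage does not say how this boundary case is handled.
    """
    rng = random.Random(seed)
    shuffled = docs[:]
    rng.shuffle(shuffled)
    sample, total = [], 0
    for doc in shuffled:
        if total >= target_tokens:
            break
        kept = doc[: target_tokens - total]
        sample.append(kept)
        total += len(kept)
    return sample


# Hypothetical toy corpora standing in for Org-1 (smaller) and Org-2 (larger).
org1 = [["budget", "meeting"], ["meeting", "notes", "attached"]]
org2 = [
    ["budget", "review", "today"],
    ["lunch", "plans"],
    ["meeting", "moved", "again", "sorry"],
    ["notes"],
]

org1_token_count = sum(len(doc) for doc in org1)
org2_sample = downsample_to_token_count(org2, org1_token_count)

# Size-controlled view: relative frequencies on the full corpora.
org1_relative = relative_frequencies(org1)
org2_relative = relative_frequencies(org2)

# Robustness view: absolute frequencies on size-matched corpora.
org1_absolute = Counter(token for doc in org1 for token in doc)
org2_absolute = Counter(token for doc in org2_sample for token in doc)
```

After downsampling, both absolute-frequency counters are computed over the same number of tokens, so raw counts can be compared without the rescaling step that relative frequencies require.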