2020
DOI: 10.48550/arxiv.2003.06389
Preprint

Know thy corpus! Robust methods for digital curation of Web corpora

Abstract: This paper proposes a novel framework for digital curation of Web corpora in order to provide robust estimation of their parameters, such as their composition and the lexicon. In recent years language models pre-trained on large corpora emerged as clear winners in numerous NLP tasks, but no proper analysis of the corpora which led to their success has been conducted. The paper presents a procedure for robust frequency estimation, which helps in establishing the core lexicon for a given corpus, as well as a pro…

Cited by 2 publications (4 citation statements)
References 22 publications
“…Additionally, we calculated the frequency for each word in our only including English Telegram messages. Then, we used the frequency list of Open-WebTex (Sharoff, 2020) as a reference corpus for computing the log-likelihood (LL). This metric helps to identify the most indicative words in our corpus when compared against the reference.…”
Section: Results (mentioning)
confidence: 99%
“…A popular choice that meets the aforementioned requirement is NVIDIA A100 GPU. According to Google Cloud Service Provider, 4 the hourly price of NVIDIA A100 GPU is around 3.93$. Now the information provided in [27,35] the training cost can be estimated as 21 days × 24 h × 3.93 USD × 2048 GPUs would yield 4.057 million dollars.…”
Section: Estimated Training Cost of LLM (mentioning)
confidence: 99%
“…1.5 billion and 175 billion, and the pre-training data, i.e. WebText [4] and CommonCrawl [5], respectively. CommonCrawl is 15 × larger than WebText.…”
Section: Introduction (mentioning)
confidence: 99%