2023
DOI: 10.1017/s0332586523000021

Analyzing the unrestricted web: The Finnish corpus of online registers

Abstract: This article introduces the Finnish Corpus of Online Registers (FinCORE), which represents the full range of registers – situationally defined text varieties such as news and blogs – on the Finnish Internet. The extreme range of language use found online has challenged the study of registers. It has been unclear what registers the entire Internet includes, and whether they can be sufficiently defined to allow for their analysis or classification, as previous studies have focused on restricted sets of registers and on English. Fin…

Cited by 5 publications (7 citation statements)
References: 63 publications
“…Removal of machine-translated and machine-generated content is a typical step in construction of web-based corpora. To this end, we trained a dedicated classifier using the FinCORE dataset (Skantsi & Laippala, 2023) discussed in greater detail in Section 3.4.4. This step removed the most material out of all the filtering steps, resulting in the removal of more than a billion tokens, as many low-quality noisy documents were identified in this step, in addition to genuine machine translated content.…”
Section: Removal Of Machine-translated Content (mentioning)
confidence: 99%
“…Other linguistic studies which make use of the dependency syntax structures include a study on discourse connectives (Laippala, Kyröläinen, Kanerva, & Ginter, 2018) and on emoticons. The corpus also served as the source data for the work of Skantsi and Laippala (2023) on Finnish text register classification, used to provide the text register metadata described in Section 3.4.4.…”
Section: Linguistic Research (mentioning)
confidence: 99%
“…To this end, we apply a register identification model based on the FinCORE corpus, trained using XLM-R (Conneau et al., 2020). The model and corpus were both presented by Skantsi and Laippala (2022). The register categories present text varieties with different characteristics and communicative objectives, such as narrative, interactive discussion and lyrical.…”
Section: Register Analysis (mentioning)
confidence: 99%