2023
DOI: 10.21203/rs.3.rs-3138153/v1
|View full text |Cite
Preprint
|
Sign up to set email alerts
|

Finnish Internet Parsebank

Abstract: We present a Finnish web corpus with multiple text sources and rich additional annotations. The corpus is based in large parts on a dedicated Internet crawl, supplementing data from the Common Crawl initiative and the Finnish Wikipedia. The size of the corpus is 6.2 billion tokens from 9.5 million source documents. The text is enriched with morphological analyses, word lemmas, dependency trees, named entities and text register (genre) identification. Paragraph-level scores of an n-gram language model, as well … Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...

Citation Types

0
0
0

Year Published

2023
2023
2023
2023

Publication Types

Select...
1

Relationship

0
1

Authors

Journals

citations
Cited by 1 publication
references
References 15 publications
0
0
0
Order By: Relevance