2021
DOI: 10.1075/rs.19015.sha

Genre annotation for the Web

Abstract: This paper describes a digital curation study aimed at comparing the composition of large Web corpora, such as enTenTen, ukWac or ruWac, by means of automatic text classification. First, the paper presents a Deep Learning model suitable for classifying texts from large Web corpora using a small number of communicative functions, such as Argumentation or Reporting. Second, it describes the results of applying the automatic classification model to these corpora and compares their composition. Finally, the paper …

Cited by 11 publications (5 citation statements)
References 31 publications
“…Multiple studies searched for the most informative features in this task. They experimented with lexical features (words, word or character n-grams), grammatical features (part-of-speech tags) [31,38], text statistics [8], visual features of HTML web pages such as HTML tags and images [43][44][45], and URLs of web documents [10,46,47]. However, the results for the discriminative features varied across studies and datasets.…”
Section: Machine Learning Methods For Automatic Genre Identification (mentioning)
confidence: 99%
“…These studies addressed the difficulties with this task, which impact both manual and automatic genre identification. The main challenges identified were (1) varying levels of genre prototypicality in web texts, (2) the presence of features of multiple genres in one text, and (3) the existence of texts that might not have any discernible purpose or features [1,31].…”
Section: Challenges In Automatic Genre Identification (mentioning)
confidence: 99%
“…The models need to find a higher pattern in texts, often based on textual or syntactic characteristics, unrelated to the topic of the document. In addition, classification of genres was shown to be a hard task because texts can be more or less prototypical examples of their genre classes, can show signals of multiple classes or lack signals of any genre (Sharoff, 2021;Zu Eissen and Stein, 2004). That is why this text categorization task is very challenging for non-neural methods which were shown to be too dataset-dependent and were not capable of generalizing to unseen datasets (Sharoff et al, 2010).…”
Section: Automatic Genre Identification (mentioning)
confidence: 99%
“…To this end, genre researchers devised sets of genre categories which aim to cover all of the diversity of texts found on the web, and provided manually annotated datasets (see Egbert et al (2015); Sharoff (2018); Kuzman et al (2022b)). Classification of genres was shown to be a hard task as texts can display characteristics of multiple genres (Sharoff, 2021), and most genre classification models were not able to generalize outside of the dataset on which they were trained (Sharoff et al, 2010). However, recent advances in deep neural technologies led to a breakthrough in this field, and Transformer-based language models (Vaswani et al, 2017), fine-tuned on manually-annotated genre datasets, showed the ability to identify genres in various web corpora and languages (see Rönnqvist et al (2021); Kuzman et al (2022a)).…”
Section: Related Work (mentioning)
confidence: 99%