Problems in the Use-Centered Development of a Taxonomy of Web Genres

Crowston, Kevin; Kwaśnik, Barbara H.; Rubleske, Joseph

doi:10.1007/978-90-481-9178-9_4

Cited by 55 publications

(38 citation statements)

References 15 publications

“…However, given the aforementioned problems that "experts" have identifying web genre/register categories, it is not surprising that nonexpert web users also vary in their understanding of genre/register labels (see Crowston, Kwasnik, & Rubleske, 2010), and previous research has shown that reliability among end users is often unacceptably low (Rosso & Haas, 2010). To address this concern, some studies adopt an alternative approach to the manual coding of web documents, relying on actual Internet users rather than "experts."…”

Section: Automatic Genre Identification (mentioning)

confidence: 99%

Developing a bottom‐up, user‐based method of web register classification

Egbert, Biber, Davies (2015), Asso for Info Science & Tech

This paper introduces a project to develop a reliable, cost-effective method for classifying Internet texts into register categories, and apply that approach to the analysis of a large corpus of web documents. To date, the project has proceeded in 2 key phases. First, we developed a bottom-up method for web register classification, asking end users of the web to utilize a decision-tree survey to code relevant situational characteristics of web documents, resulting in a bottom-up identification of register and subregister categories. We present details regarding the development and testing of this method through a series of 10 pilot studies. Then, in the second phase of our project we applied this procedure to a corpus of 53,000 web documents. An analysis of the results demonstrates the effectiveness of these methods for web register classification and provides a preliminary description of the types and distribution of registers on the web.

Literature Review: Registers and Genres

Over the past 3 decades, register has emerged as one of the most important predictors of linguistic variation, and a wide range of registers have been described and compared

“…The hierarchical genre collection (HGC) (Stubbe and Ringlstetter 2007), the Syracuse corpus (Crowston et al 2011), KRYS I (Berninger et al 2008) and the corpus constructed in Egbert and Biber (2013), Egbert et al (2015) use a relatively large number of genre labels (between 32 and 292 labels), leading to high granularity. Their focus is therefore on high coverage and the construction of a detailed taxonomy.…”

Section: Existing Genre-annotated Web Corpora (mentioning)

confidence: 99%

“…Therefore, genres that do not normally use this format, such as homepage and shop, are not included. The Syracuse (Crowston et al 2011) collection consists of 3027 web pages annotated based on 292 very specific genres. The genre palette in this collection was developed bottom-up by asking three groups of people (teachers, journalists, engineers) to produce web genre terms themselves.…”

Section: Existing Genre-annotated Web Corpora (mentioning)

confidence: 99%

“…Table 17 shows that 45.34 % of pages in LWGC-R did not belong to any of our 15 predefined genre categories, indicating a somewhat more than 50 % coverage for our 15 genres. Researchers in genre classification have come up with long lists of genre classes, e.g., 292 genre labels in the Syracuse corpus (Crowston et al 2011) or 500 genre labels listed in Dimter (1981). Therefore, the web pages categorized as other in this experiment could belong to any genre class in these taxonomies.…”

Section: LWGC-R: Source and Topic Diversity (mentioning)

confidence: 99%
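The coverage figure in the statement above follows directly from the share of pages labelled "other". A minimal sketch of that arithmetic, assuming coverage is simply the complement of the uncovered share (the function name `coverage_percent` is hypothetical and not taken from the cited paper):

```python
def coverage_percent(uncovered_percent: float) -> float:
    """Return the share of pages covered by the predefined genres.

    Assumes coverage is the complement of the uncovered share,
    with both values expressed as percentages.
    """
    if not 0.0 <= uncovered_percent <= 100.0:
        raise ValueError("percentage must lie between 0 and 100")
    return round(100.0 - uncovered_percent, 2)


# 45.34 % of LWGC-R pages fell outside the 15 predefined genres.
lwgc_r_coverage = coverage_percent(45.34)
print(f"Coverage of the 15 genres: {lwgc_r_coverage} %")
```

Under this reading, "somewhat more than 50 %" corresponds to roughly 54.66 % of pages.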

Crowdsourcing for web genre annotation

Asheghi, Sharoff, Markert (2016), Lang Resources & Evaluation

Recently, genre collection and automatic genre identification for the web have attracted much attention. However, currently there is no genre-annotated corpus of web pages where inter-annotator reliability has been established, i.e. the corpora are either not tested for inter-annotator reliability or exhibit low inter-coder agreement. Annotation has also mostly been carried out by a small number of experts, leading to concerns with regard to scalability of these annotation efforts and transferability of the schemes to annotators outside these small expert groups. In this paper, we tackle these problems by using crowd-sourcing for genre annotation, leading to the Leeds Web Genre Corpus, the first web corpus that is demonstrably reliably annotated for genre and that can be easily and cost-effectively expanded using naive annotators. We also show that the corpus is source and topic diverse.

“…Also each user usually concentrates on a small number of types of texts relevant to their everyday life, providing situation-specific labels, such as 'uncontrolled resource page' or ambiguous ones, such as 'article', thus necessitating more research into linking the genre labels to the way they are actually used. For more information on the problems with the user-based genre taxonomies see (Crowston et al, 2010).…”

Section: Introduction (mentioning)

confidence: 99%

Functional Text Dimensions for the annotation of web corpora

Sharoff (2018), Corpora

This paper presents an approach to classify large Web corpora into genres by means of Functional Text Dimensions (FTDs). This offers a topological approach to text typology in which the texts are described in terms of their similarity to prototype genres. The suggested set of categories is designed to be applicable to any text on the Web and to be reliable in annotation practice. Interannotator agreement results show that the suggested categories produce Krippendorff's α above 0.76. In addition to the functional space of 18 dimensions, similarity between annotated documents can be described visually within a space of reduced dimensions obtained through t-distributed Stochastic Neighbour Embedding. Reliably annotated texts also provide the basis for automatic genre classification, which can be done in each FTD, as well as within the space of reduced dimensions. An example comparing texts from the Brown Corpus, the BNC and ukWac, a large Web corpus, is provided.
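For readers unfamiliar with the agreement statistic mentioned in this abstract, the following is a minimal sketch of Krippendorff's α for nominal labels, computed from a coincidence matrix. It is an illustration only: the function name `krippendorff_alpha_nominal` and the sample labels are hypothetical, and the abstract does not say which level of measurement or which software the author used, so the nominal variant here is an assumption.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists; each inner list holds the labels that
    different annotators gave to one document. None marks a missing label.
    """
    coincidences = Counter()
    for ratings in units:
        labels = [r for r in ratings if r is not None]
        m = len(labels)
        if m < 2:
            continue  # a unit with fewer than two labels cannot be paired
        # Each ordered pair of labels from different annotators counts 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("only one label in use; alpha is undefined")
    return 1.0 - (n - 1) * observed / expected


# Hypothetical toy data: four documents, two annotators each.
sample = [["news", "news"], ["news", "blog"], ["blog", "blog"], ["blog", "blog"]]
print(round(krippendorff_alpha_nominal(sample), 3))
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which gives a sense of scale for the reported threshold of 0.76.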
