Proceedings of the 24th ACM International on Conference on Information and Knowledge Management 2015
DOI: 10.1145/2806416.2806613

The Influence of Pre-processing on the Estimation of Readability of Web Documents

Abstract: This paper investigates the effect that text pre-processing approaches have on the estimation of the readability of web pages. Readability has been highlighted as an important aspect of web search result personalisation in previous work. The most widely used text readability measures rely on surface level characteristics of text, such as the length of words and sentences. We demonstrate that different tools for extracting text from web pages lead to very different estimations of readability. This has an import…
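
To make the abstract's point concrete, the sketch below computes the Coleman-Liau Index (CLI), one well-known surface-level readability measure, on the same extracted text blocks joined in two ways. The helper names (`coleman_liau`, `join_blocks`), the `force_period` flag, and the sample blocks are hypothetical and are not taken from the paper; the flag is only one plausible reading of a sentence-ending heuristic such as the ForcePeriod setting named in the citation statements, and the tokenisation is deliberately simplistic.

```python
import re


def coleman_liau(text: str) -> float:
    """Coleman-Liau Index: 0.0588 * L - 0.296 * S - 15.8,
    where L is letters per 100 words and S is sentences per 100 words."""
    # Simplistic tokenisation (an assumption): words are runs of ASCII letters,
    # sentences are non-empty spans between ., ! or ? characters.
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    letters = sum(len(w) for w in words)
    letters_per_100 = letters / len(words) * 100
    sentences_per_100 = len(sentences) / len(words) * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


def join_blocks(blocks: list[str], force_period: bool) -> str:
    """Join text blocks extracted from a page (headings, menu items, paragraphs).

    With force_period=True, a period is appended to any block that does not
    already end in sentence punctuation (hypothetical heuristic).
    """
    cleaned = []
    for block in blocks:
        block = block.strip()
        if not block:
            continue
        if force_period and block[-1] not in ".!?":
            block += "."
        cleaned.append(block)
    return " ".join(cleaned)


# Hypothetical output of an HTML text extractor: menu items plus one paragraph.
blocks = ["Home", "About us", "Readability matters for search results.", "Contact"]

unforced = coleman_liau(join_blocks(blocks, force_period=False))
forced = coleman_liau(join_blocks(blocks, force_period=True))
print(f"CLI without forced periods: {unforced:.2f}")
print(f"CLI with forced periods:    {forced:.2f}")
```

The words and letters are identical in both runs; only the sentence count changes, which is enough to shift the score. This is the kind of sensitivity to extraction and sentence-boundary choices that the abstract describes.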

Cited by 19 publications (15 citation statements)
References 9 publications
“…Using Justext or Boilerplate resulted in higher correlations with human understandability assessments, and the ForcePeriod heuristic was shown to be better than DoNotForcePeriod. These results confirm the speculations of Palotti et al [13]: they found these settings to produce lower variances in understandability estimations and thus hypothesised that they were better suited to the task.…”
Section: Evaluation Of Preprocessing Pipelines and Heuristics
Citation type: supporting
confidence: 91%
“…Although no single setting outperformed the others in both collections, we found that the use of CLI and FRE with Justext provided the most stable results across the collections, with correlations as high as the best ones in both collections. These results confirmed the advice put forward by Palotti et al [13], i.e. in general, if using readability measures, then CLI is to be preferred, along with an appropriate HTML extraction pipeline, regardless of the heuristic for sentence ending.…”
Section: Evaluation Of Preprocessing Pipelines and Heuristics
Citation type: supporting
confidence: 87%