Box clustering segmentation: A new method for vision-based web page preprocessing

Zelený, Jan; Burget, Radek; Zendulka, Jaroslav

doi:10.1016/j.ipm.2017.02.002

Cited by 29 publications

(26 citation statements)

References 15 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Most algorithms use the DOM tree structure in some way, for example to identify headings [26], block nodes [1,6], or regularities [16], and to compute the tree depth [24] or the tree distance [18] of nodes. Other algorithms use the text density [22] or visual appearance of DOM nodes when rendered (e.g., their size or color; Baluja [4], Zeleny et al [37]). Few algorithms exclusively exploit visual cues, e.g., using edge detection on screenshots [8,12].…”

Section: Related Workmentioning

confidence: 99%

“…Few algorithms exclusively exploit visual cues, e.g., using edge detection on screenshots [8,12]. Indeed, recent publications argue that only visual features provide for the necessary robustness for a generalizable algorithm [12,37], but this claim has not been verified. Our dataset provides the resources required by all the various approaches, enabling a fair and comprehensive comparison.…”

Section: Related Workmentioning

confidence: 99%

“…Others evaluate based on the web page's text only, which allows for using existing evaluation measures for this task (e.g., Kohlschütter and Nejdl [22], Manabe and Tajima [28]), but restricts the evaluation to text-only segments. Yet others measure the overlap between an automatic segmentation and a ground truth [37], or count matching cases (one-to-one, one-to-many, zero-to-one, etc. ; Sanoja and Gançarski [32]).…”

Section: Related Workmentioning

confidence: 99%

“…; Sanoja and Gançarski [32]). Such matching measures, however, unfairly handle cases of over-and undersegmentation: The measure proposed by Zeleny et al [37] penalizes splitting a ground truth segment into several small ones more than returning just one of the small segments. The measure of Sanoja and Gançarski does not penalize splitting a ground truth segment at all, making it trivial to achieve the maximum score.…”

Section: Related Workmentioning

confidence: 99%

See 3 more Smart Citations

Web Page Segmentation Revisited

Kiesel

Kneist

Meyer

et al. 2020

Proceedings of the 29th ACM International Conference on Information &Amp; Knowledge Management

View full text Add to dashboard Cite

Each web page can be segmented into semantically coherent units that fulfill specific purposes. Though the task of automatic web page segmentation was introduced two decades ago, along with several applications in web content analysis, its foundations are still lacking. Specifically, the developed evaluation methods and datasets presume a certain downstream task, which led to a variety of incompatible datasets and evaluation methods. To address this shortcoming, we contribute two resources: (1) An evaluation framework which can be adjusted to downstream tasks by measuring the segmentation similarity regarding visual, structural, and textual elements, and which includes measures for annotator agreement, segmentation quality, and an algorithm for segmentation fusion. (2) The Webis-WebSeg-20 dataset, comprising 42,450 crowdsourced segmentations for 8,490 web pages, outranging existing sources by an order of magnitude. Our results help to better understand the "mental segmentation model" of human annotators: Among other things we find that annotators mostly agree on segmentations for all kinds of web page elements (visual, structural, and textual). Disagreement exists mostly regarding the right level of granularity, indicating a general agreement on the visual structure of web pages.

show abstract

Section: Related Workmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

See 2 more Smart Citations

Web Page Segmentation Revisited

Kiesel

Kneist

Meyer

et al. 2020

Proceedings of the 29th ACM International Conference on Information &Amp; Knowledge Management

View full text Add to dashboard Cite

show abstract

“…[24] proposes a method based on analyzing the Document Object Model (DOM) of a web page. [4] and [31] uses a visual approach by analyzing the page rendition in a browser to extract areas. Other techniques may be based on image processing [5], semantic structures or graph resolution [13].…”

Section: Software Architecture and Evaluationsmentioning

confidence: 99%

Concurrent Speech Synthesis to Improve Document First Glance for the Blind

Maurel

Dias

Ferrari

et al. 2019

2019 International Conference on Document Analysis and Recognition Workshops (ICDARW)

View full text Add to dashboard Cite

Skimming and scanning are two well-known reading processes, which are combined to access the document content as quickly and efficiently as possible. While both are available in visual reading mode, it is rather difficult to use them in non visual environments because they mainly rely on typographical and layout properties. In this article, we introduce the concept of tag thunder as a way (1) to achieve the oral transposition of the web 2.0 concept of tag cloud and (2) to produce an innovative interactive stimulus to observe the emergence of self-adapted strategies for non-visual skimming of written texts. We first present our general and theoretical approach to the problem of both fast, global and non-visual access to web browsing; then we detail the progress of development and evaluation of the various components that make up our software architecture. We start from the hypothesis that the semantics of the visual architecture of web pages can be transposed into new sensory modalities thanks to three main steps (web page segmentation, keywords extraction and sound spatialization). We note the difficulty of simultaneously (1) evaluating a modular system as a whole at the end of the processing chain and (2) identifying at the level of each software brick the exact origin of its limits; despite this issue, the results of the first evaluation campaign seem promising.

show abstract