2006
DOI: 10.1080/08839510600903858

Textual Article Clustering in Newspaper Pages

Abstract: In the analysis of a newspaper page, an important step is the clustering of various text blocks into logical units, i.e., into articles. We propose three algorithms based on text processing techniques to cluster articles in newspaper pages. Based on the complexity of the three algorithms and on experiments with actual pages from the Italian newspaper L'Adige, we select one of the algorithms as the preferred choice to solve the textual clustering problem.

Cited by 11 publications (3 citation statements)
References 11 publications
“…On the left side is represented the basic entity-relationship (ER) model used to design the database and, on the right side, the schematic diagram shows how the data was organized after the conclusion of the content analysis research. It should be pointed out that no text processing technique, such as text-clustering algorithms (Aiello and Pegoretti 2006), was employed during the analysis of the data. The outcomes of the content research were expressed by means of production rules.…”
Section: Methods Implementation
Citation type: mentioning
confidence: 99%
“…Few authors [3], [10] have also worked on article extraction based on similarity of text from the text blocks generated from segmenting newspaper images. Furmaniak [10] identified paragraphs in the newspaper page and then measured similarity between neighboring OCR'ed paragraphs.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…A set of rules on layout understanding were created based on visual information such as distance, size, color, etc. Opposite to the aforementioned techniques relying merely on the visual information, Aiello et al [14] introduced a semantic information based method to determine the reading order. A lexical analysis technique was adopted to rank candidate reading orders based on part-of-speech (POS) probability.…”
Section: Research Gap
Citation type: mentioning
confidence: 99%