2008
DOI: 10.1007/978-3-540-89533-6_49

Automated Processing of Digitized Historical Newspapers: Identification of Segments and Genres

Abstract: Many historical newspapers are being digitized. We aim to support access to them via text analysis of the OCR'd content. However, the OCR output includes many errors, so extracting meaningful content from it is difficult. A pipeline of processing steps is proposed. Here, we describe the first two steps: segmentation and genre identification. The segmentation procedure based on headings was quite successful. Genre identification worked well for easily defined genre categories such as weather reports. We also propose ad…

Cited by 7 publications
(3 citation statements)
References 8 publications
“…A detailed report of the successes and failures of the task per class is presented. A pipeline for the automatic processing of historical newspapers is specified in [21]. The pipeline begins with OCR'd text in XML format, and includes article segmentation, genre and subject recognition, and finally event extraction.…”
Section: A Related Work (mentioning)
confidence: 99%
“…The pipeline begins with OCR'd text in XML format, and includes article segmentation, genre and subject recognition, and finally event extraction. In [21] the results are reported for the first two steps, namely article segmentation and genre recognition. Their future plans include topic and event categorization from the material.…”
Section: A Related Work (mentioning)
confidence: 99%