2007
DOI: 10.1007/978-3-540-73354-6_26

A Framework for Text Processing and Supporting Access to Collections of Digitized Historical Newspapers

Abstract: Large quantities of historical newspapers are being digitized and OCRd. We describe a framework for processing the OCRd text to identify articles and extract metadata for them. We describe the article schema and provide examples of features that facilitate automatic indexing of them. For this processing, we employ lexical semantics, structural models, and community content. Furthermore, we describe visualization and summarization techniques that can be used to present the extracted events. Keywords:W…

citations
Cited by 8 publications
(3 citation statements)
references
References 22 publications
“…The pipeline was deployed in the GATE, a popular and mature open source JAVA-based suite of tools with over 20 years of continuous development from the Natural Language Processing (NLP) research group at the University of Sheffield (Cunningham et al, 2013). The GATE platform has been employed for the construction of NER pipelines in the broader digital humanities domain, including processing of 18th century court proceedings (Bontcheva et al, 2002), historical newspapers (Allen et al, 2007), and archaeological grey literature report (Vlachidis and Tudhope, 2016).…”
Section: The NER Methods (mentioning)
confidence: 99%
“…Of course, there should probably be an ongoing community review panel to ensure that appropriate material is saved, that local records are well managed, and that there is coordination and consistency across the records. Allen, et al (2007) describe the utility of such a model for processing digitized collections of historical newspapers.…”
Section: Comprehensive Community Information Repository and Semantic (mentioning)
confidence: 99%
“…However, it does not seem feasible to do that by entering specific facts; there are just too many. Generative models such as cyclic models for the seasons may be better (see [2]). These could be simple such as listing the months of the baseball season, the years in which there are presidential elections, the years during which the Wright Brothers were working, or the locations of major buildings in the city.…”
Section: The Big Picture (mentioning)
confidence: 99%