Comparison of distance measures for historical spelling variants

Kempken, Sebastian; Luther, Wolfram; Pilz, Thomas

doi:10.1007/978-0-387-34747-9_31

Cited by 14 publications

(13 citation statements)

References 8 publications

(6 reference statements)


“…Studies on HDR have generally focused on the differences between historical and modern languages. OCR errors have been omitted from the experimental settings by using manually created or manually corrected test data (e.g., Braun et al., ; Gotscharek, Reffle, Ringsletter, Schulz, & Neumann, ; Hauser, Heller, Leiss, Schulz, & Wanzeck, ; Kempken et al., , Koolen et al., ; O'Rourke et al., ). An exception is Pilz, Luther, Fuhr, and Ammon (), who created rules for handling OCR errors both manually and automatically based on edit costs between character replacements.…”

Section: Related Research (mentioning)

confidence: 99%

“…Kempken et al. () used an edit distance variant where the edit costs were automatically learned from the German historical document collection. They concluded that algorithms that are adapted to the specific historical phenomena of the collection can reach a better translation recall and precision than standard edit distance and n-grams (Kempken et al., ).…”

Section: Related Research (mentioning)

confidence: 99%
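
The idea in the statement above, an edit distance whose costs depend on the characters involved, can be illustrated with a minimal sketch. This is not the method of Kempken et al.: the function name `weighted_edit_distance`, the `sub_costs` table, and the cost value 0.2 are all hypothetical, and the costs here are set by hand rather than learned from a collection.

```python
def weighted_edit_distance(source, target, sub_costs=None, default_cost=1.0):
    """Levenshtein-style distance with per-pair substitution costs.

    sub_costs maps (source_char, target_char) to a cost; any pair that is
    missing from the table falls back to default_cost. Insertions and
    deletions always cost default_cost in this simplified sketch.
    """
    sub_costs = sub_costs or {}
    # prev[j] is the distance between the processed prefix of source
    # and the first j characters of target.
    prev = [j * default_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        curr = [i * default_cost]
        for j, t in enumerate(target, start=1):
            if s == t:
                sub = 0.0
            else:
                sub = sub_costs.get((s, t), default_cost)
            curr.append(min(
                prev[j] + default_cost,      # delete s
                curr[j - 1] + default_cost,  # insert t
                prev[j - 1] + sub,           # substitute s -> t (or match)
            ))
        prev = curr
    return prev[-1]


# Hypothetical cost table: treat historical "y" for modern "i" as cheap.
HYPOTHETICAL_COSTS = {("y", "i"): 0.2}

standard = weighted_edit_distance("seyn", "sein")
adapted = weighted_edit_distance("seyn", "sein", HYPOTHETICAL_COSTS)
print(standard, adapted)
```

With an empty cost table the function reduces to standard edit distance, so the historical spelling "seyn" is as far from "sein" as any other one-letter change. With the hand-set table the pair moves closer together, which is the effect a collection-adapted cost model is meant to achieve.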

“…Moreover, documents from different time periods, regions, and sources can be treated differently, to adjust for the temporal and regional differences in spelling and differences in typography and layout in different publications. Most studies on HDR, however, have focused on query translation, that is, generating query word variants at retrieval time (Braun, Wiesman, & Sprinkhuizen‐Kuyper, ; Ernst‐Gerlach & Fuhr, ; Kempken, Luther, & Pilz, ; O'Rourke, Robertson, & Willett, ; Robertson & Willett, ). Query translation has similar benefits in HDR as in CLIR.…”

Section: Introduction (mentioning)

confidence: 99%

“…Most HDR studies (Braun et al., ; Ernst‐Gerlach & Fuhr, ; Kempken et al., ; Koolen et al., ; O'Rourke et al., ; Robertson & Willett, ) have focused on handling the graphical variants of modern words occurring in historical documents. The focus has been on recognizing correct pairs of modern and historical spelling variants, even if the need to handle OCR errors has been acknowledged.…”

Section: Introduction (mentioning)

confidence: 99%


Information retrieval from historical newspaper collections in highly inflectional languages: A query expansion approach

Järvelin, Keskustalo, Sormunen, et al. 2015

Asso for Info Science & Tech


The aim of the study was to test whether query expansion by approximate string matching methods is beneficial in retrieval from historical newspaper collections in a language rich with compounds and inflectional forms (Finnish). First, approximate string matching methods were used to generate lists of index words most similar to contemporary query terms in a digitized newspaper collection from the 1800s. Top index word variants were categorized to estimate the appropriate query expansion ranges in the retrieval test. Second, the effectiveness of approximate string matching methods, automatically generated inflectional forms, and their combinations were measured in a Cranfield-style test. Finally, a detailed topic-level analysis of test results was conducted. In the index of historical newspaper collection the occurrences of a word typically spread to many linguistic and historical variants along with optical character recognition (OCR) errors. All query expansion methods improved the baseline results. Extensive expansion of around 30 variants for each query word was required to achieve the highest performance improvement. Query expansion based on approximate string matching was superior to using the inflectional forms of the query words, showing that coverage of the different types of variation is more important than precision in handling one type of variation.

Introduction

Digitization is a good way to preserve cultural heritage documents and make them widely accessible for researchers and the general public. Cultural institutions are aware of this potential and often consider digitization of their cultural heritage collections as an obligation. Consequently, the quantity of digitized historical documents available is constantly growing.
Transforming print cultural heritage collections into digital resources accessible and searchable through modern information and communication technologies requires that the digitized document images are transformed into digital text through optical character recognition (OCR). While OCR can currently reach over 99% accuracy in recognition of characters from high-quality images of original documents with a simple book layout, the accuracy for historical newspapers is lower than that. OCR quality is dependent on the environment and the condition of the original documents: print and paper quality, typefaces, and layout complexity affect the accuracy of the result. Generally, the older the newspaper is, the lower the accuracy rate is likely to be. Holley (2009) reported raw character recognition accuracy rates varying from 71% to 98% in a sample of digitized newspapers from 1803-1954, the lowest rate indicating almost every third character being erroneously recognized and virtually all words containing errors. Even a 98% accuracy rate results in an error in, on average, every sixth word in Finnish text (with an average word length of around eight characters), if the errors are evenly distributed. Such error rates may lead to a quadrupling of the number of unique index words and sign...
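
The word-level figures quoted in the excerpt above can be checked with a short calculation. This is a rough sketch under the same simplifying assumption the text states (errors evenly and independently distributed over characters); the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(char_accuracy, word_length):
    """Probability that a word contains at least one misrecognized character,
    assuming each character is recognized independently with char_accuracy."""
    return 1.0 - char_accuracy ** word_length


# Character accuracy rates reported by Holley (2009), with the average
# Finnish word length of around eight characters mentioned in the text.
for accuracy in (0.98, 0.71):
    rate = word_error_rate(accuracy, 8)
    print(f"character accuracy {accuracy:.0%}: "
          f"about {rate:.1%} of words contain an error")
```

At 98% character accuracy roughly 15% of eight-character words contain an error, which is in line with the "every sixth word" estimate, and at 71% more than nine words in ten are affected.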

“…An alternative is to employ string distance measures, which do not require the VSM. Many of these measures have been defined for different purposes and applications (Cohen et al., 2003; Gravano et al., 2001; Huang and Madey, 2004; Kempken et al., 2006 …”

unclassified