The aim of the study was to test whether query expansion by approximate string matching methods is beneficial in retrieval from historical newspaper collections in a language rich in compounds and inflectional forms (Finnish). First, approximate string matching methods were used to generate lists of the index words most similar to contemporary query terms in a digitized newspaper collection from the 1800s. The top index word variants were categorized to estimate appropriate query expansion ranges for the retrieval test. Second, the effectiveness of approximate string matching methods, automatically generated inflectional forms, and their combinations was measured in a Cranfield-style test. Finally, a detailed topic-level analysis of the test results was conducted. In the index of a historical newspaper collection, the occurrences of a word are typically spread across many linguistic and historical variants as well as optical character recognition (OCR) errors. All query expansion methods improved on the baseline results. Extensive expansion of around 30 variants for each query word was required to achieve the highest performance improvement. Query expansion based on approximate string matching was superior to using the inflectional forms of the query words, showing that coverage of the different types of variation is more important than precision in handling one type of variation.
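The expansion step described above can be illustrated with a minimal sketch. It assumes a character-bigram Dice coefficient as a stand-in similarity measure; the specific approximate string matching methods used in the study are not named in this passage, so the measure, the function names (`bigrams`, `dice`, `expand_query`), and the toy index words are all hypothetical. The default of 30 variants per query word follows the expansion range reported above.

```python
from collections import Counter


def bigrams(word: str) -> Counter:
    """Multiset of character bigrams, padded so word boundaries also count."""
    padded = f" {word.lower()} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def dice(a: str, b: str) -> float:
    """Dice coefficient over character bigrams (assumed stand-in measure)."""
    ga, gb = bigrams(a), bigrams(b)
    overlap = sum((ga & gb).values())  # multiset intersection
    total = sum(ga.values()) + sum(gb.values())
    return 2 * overlap / total if total else 0.0


def expand_query(term: str, index_words: list[str], k: int = 30) -> list[str]:
    """Return the k index words most similar to the query term."""
    ranked = sorted(index_words, key=lambda w: dice(term, w), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical index: an inflected form, an OCR-style misspelling,
    # and two unrelated words.
    index = ["kirja", "sanomalehden", "hevonen", "sanomalchti", "sanomalehti"]
    print(expand_query("sanomalehti", index, k=3))
```

In this toy index the inflected form and the OCR-damaged form both rank above the unrelated words, which is the behaviour that lets a single similarity-based expansion cover several types of variation at once.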
Introduction
Digitization is a good way to preserve cultural heritage documents and make them widely accessible to researchers and the general public. Cultural institutions are aware of this potential and often regard digitization of their cultural heritage collections as an obligation. Consequently, the quantity of digitized historical documents available is constantly growing. Making print cultural heritage collections accessible and searchable through modern information and communication technologies requires that the digitized document images be converted into digital text through optical character recognition (OCR).
While OCR can currently reach over 99% accuracy in recognizing characters from high-quality images of original documents with a simple book layout, the accuracy for historical newspapers is lower. OCR quality depends on the environment and the condition of the original documents: print and paper quality, typefaces, and layout complexity all affect the accuracy of the result. Generally, the older the newspaper is, the lower the accuracy rate is likely to be. Holley (2009) reported raw character recognition accuracy rates varying from 71% to 98% in a sample of digitized newspapers from 1803-1954, the lowest rate indicating that almost every third character was erroneously recognized and that virtually all words contained errors. Even a 98% accuracy rate results in an error in, on average, every sixth word in Finnish text (with an average word length of around eight characters), if the errors are evenly distributed. Such error rates may lead to a quadrupling of the number of unique index words and sign...