In recent years, many scientific institutions have digitized their collections, which often include a large variety of topographic raster maps. These raster maps provide accurate (historical) geographical information but cannot be integrated directly into a geographical information system (GIS) due to a lack of metadata. Additionally, the text labels on the maps are usually not annotated, making it inefficient to query for specific toponyms. Manually georeferencing and annotating the text labels on these maps is not cost-effective for large collections. This work presents a fully automated georeferencing approach based on a text recognition and geocoding pipeline. After recognizing the text on the maps, publicly available geocoders were used to determine a region of interest. The approach was validated on a collection of historical and contemporary topographic maps. We show that this approach can geolocate the topographic maps fairly accurately, resulting in an average georeferencing error of only 316 m (1.67%) for 16 historical maps and 287 m (0.90%) for 9 contemporary maps, spanning 19 km and 32 km, respectively (scales 1:25,000 and 1:50,000). Furthermore, this approach allows the maps to be queried based on the recognized visible text and the toponyms found, which further improves the accessibility and quality of the collection.
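As a rough illustration of how such a georeferencing error can be quantified, the sketch below computes the haversine distance between predicted and reference control points and expresses the mean error as a percentage of the map's ground extent. The function names and example coordinates are hypothetical and not taken from the paper; this is only one plausible way to obtain figures of the form "316 m (1.67%)".

```python
# Minimal sketch (assumed metric): georeferencing error as the mean haversine
# distance, in metres, between predicted and reference control points, also
# expressed as a percentage of the map's ground extent.
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def georeferencing_error(predicted, reference, map_extent_m):
    """Mean error over matching control points, absolute and relative."""
    errors = [haversine_m(p[0], p[1], r[0], r[1])
              for p, r in zip(predicted, reference)]
    mean_error = sum(errors) / len(errors)
    return mean_error, 100 * mean_error / map_extent_m


# Hypothetical corner points of a 1:25,000 sheet spanning roughly 19 km.
pred = [(50.850, 4.350), (50.850, 4.620)]
ref = [(50.852, 4.351), (50.849, 4.618)]
err_m, err_pct = georeferencing_error(pred, ref, map_extent_m=19_000)
print(f"mean error: {err_m:.0f} m ({err_pct:.2f}% of map extent)")
```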
Scanned historical topographic maps contain valuable geographical information. Often, these maps are the only reliable source of data for a given period. Many scientific institutions have large collections of digitized historical maps, typically annotated only with a title (general location), a date, and a short description. This allows researchers to search for maps of certain locations, but it gives almost no information about what is depicted on the map itself. To extract useful information from these maps, they still need to be analyzed manually, which can be very tedious. Current commercial and open-source text recognition tools underperform when applied to maps, especially in densely annotated regions, and require additional processing to provide accurate results. Therefore, this work presents an automatic map processing approach that focuses mainly on detecting the mentioned toponyms and georeferencing the map. Commercial and open-source tools were used as building blocks to provide scalability and accessibility. As lower-quality scans generally decrease the performance of text recognition tools, the impact of both scan and compression quality was studied. Moreover, because most maps were too large to process effectively as a whole with state-of-the-art commercial recognition tools, a tiling approach was used. The tile size affects recognition performance; therefore, a study was conducted to determine the optimal parameters. First, the map boundaries were detected with computer vision techniques. Afterward, the coordinates surrounding the map were extracted using a commercial OCR system. After projecting these coordinates to the WGS84 coordinate system, the maps were georeferenced. Next, the map was split into overlapping tiles, and text recognition was performed. A small region of interest was determined for each detected text label, based on its relative position. This region limited the potential toponym matches returned by publicly available gazetteers. Multiple gazetteers were combined to find additional candidates for each text label. Optimal toponym matches were selected with string similarity metrics. Furthermore, the relative positions of the detected text and the actual locations of the matched toponyms were used to filter out additional false positives. Finally, the approach was validated on a selection of 1:25,000 topographic maps of Belgium from 1975 to 1992. By automatically georeferencing the map and recognizing the mentioned place names, the content and location of each map can now be queried.
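The toponym-matching step of this pipeline can be pictured with the short sketch below: each recognized label is compared against gazetteer candidates that fall inside its region of interest, and the best candidate is selected by string similarity. The in-memory gazetteer, the ROI format, and the 0.8 threshold are illustrative assumptions; the paper combines several public gazetteers and additionally filters matches by relative position.

```python
# Minimal sketch (assumed data layout): pick the gazetteer candidate inside
# a label's region of interest with the highest string similarity.
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def in_roi(lat, lon, roi):
    """roi = (min_lat, min_lon, max_lat, max_lon) around the label's estimated position."""
    min_lat, min_lon, max_lat, max_lon = roi
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def match_toponym(label_text, roi, gazetteer, threshold=0.8):
    """Return the best (name, lat, lon, score) candidate inside the ROI, or None."""
    best = None
    for name, lat, lon in gazetteer:
        if not in_roi(lat, lon, roi):
            continue
        score = similarity(label_text, name)
        if score >= threshold and (best is None or score > best[3]):
            best = (name, lat, lon, score)
    return best


# Hypothetical example: a noisy OCR label and a tiny in-memory gazetteer.
gazetteer = [("Leuven", 50.879, 4.701), ("Heverlee", 50.861, 4.694)]
print(match_toponym("Heverle", (50.80, 4.60, 50.95, 4.80), gazetteer))
```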
As herbarium specimens are increasingly digitised and made accessible in online repositories, an important need has emerged for automated tools to process and enrich these collections and facilitate better access to the preserved archives. In particular, the automatic enrichment of multi-specimen herbarium sheets poses unique challenges and problems that have not been adequately addressed. The complexity of localizing specimens on a page increases considerably when multiple specimens are present on the same page, which already challenges the performance of models that work accurately on single specimens. In this work, we therefore performed experiments to identify the models that perform well for the plant specimen localization problem. The major bottleneck for performing such experiments was the lack of labelled data. We also address this problem by proposing tools and algorithms to semi-automatically generate annotations for herbarium images. Based on our experiments, segmentation models perform much better than detection models for the task of plant localization. Our binary segmentation model accurately extracts specimens from the background and achieves an F1 score of 0.977. The ablation experiments for multi-specimen instance segmentation show that our proposed augmentation method provides a 38% increase in performance (0.51 mAP@0.9 versus 0.37) on a dataset of 1500 plant instances.
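For context, the pixel-level F1 score reported for the binary segmentation model can be computed as in the following sketch. The toy masks are placeholders for the model output and the (semi-automatically generated) ground-truth annotations; the exact evaluation protocol used in the paper may differ.

```python
# Minimal sketch: pixel-level precision, recall and F1 for a binary
# segmentation mask (True = plant pixel, False = background).
import numpy as np


def binary_segmentation_f1(pred_mask: np.ndarray, true_mask: np.ndarray):
    """Both masks are boolean arrays of identical shape."""
    tp = np.logical_and(pred_mask, true_mask).sum()
    fp = np.logical_and(pred_mask, ~true_mask).sum()
    fn = np.logical_and(~pred_mask, true_mask).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example on a 4x4 image.
pred = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 1]], dtype=bool)
true = np.array([[1, 1, 0, 0],
                 [1, 0, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]], dtype=bool)
print(binary_segmentation_f1(pred, true))
```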
Purpose: Historical newspaper collections provide a wealth of information about the past. Although the digitization of these collections significantly improves their accessibility, a large portion of digitized historical newspaper collections, such as those of KBR, the Royal Library of Belgium, are not yet searchable at article level. However, recent developments in AI-based research methods, such as document layout analysis, have the potential to further enrich the metadata and improve the searchability of these historical newspaper collections. This paper aims to discuss the aforementioned issue.
Design/methodology/approach: In this paper, the authors explore how existing computer vision and machine learning approaches can be used to improve access to digitized historical newspapers. To do this, the authors propose a workflow that uses computer vision and machine learning approaches to (1) provide article-level access to digitized historical newspaper collections using document layout analysis, (2) extract specific types of articles (e.g. feuilletons, literary supplements, from Le Peuple from 1938), (3) conduct image similarity analysis using (un)supervised classification methods and (4) perform named entity recognition (NER) to link the extracted information to open data.
Findings: The results show that the proposed workflow improves the accessibility and searchability of digitized historical newspapers and also contributes to the building of corpora for digital humanities research. The AI-based methods enable automatic extraction of feuilletons, clustering of similar images and dynamic linking of related articles.
Originality/value: The proposed workflow enables automatic extraction of articles, including detection of a specific type of article, such as a feuilleton or literary supplement. This is particularly valuable for humanities researchers, as it improves the searchability of these collections and enables corpora to be built around specific themes. Article-level access to, and improved searchability of, KBR's digitized newspapers are demonstrated through the online tool (https://tw06v072.ugent.be/kbr/).
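Step (4) of the workflow, named entity recognition with linking to open data, could look roughly like the sketch below. The spaCy model name, the example sentence, and the use of Wikidata's public search API are illustrative assumptions rather than the authors' exact setup.

```python
# Minimal sketch (assumed setup): run NER on extracted article text and look
# up each entity on Wikidata via the wbsearchentities API.
import requests
import spacy

nlp = spacy.load("fr_core_news_sm")  # assumes a French spaCy pipeline is installed


def link_to_wikidata(text: str, language: str = "fr"):
    """Return (entity text, entity label, top Wikidata match or None) triples."""
    results = []
    for ent in nlp(text).ents:
        resp = requests.get(
            "https://www.wikidata.org/w/api.php",
            params={"action": "wbsearchentities", "search": ent.text,
                    "language": language, "format": "json", "limit": 1},
            timeout=10,
        )
        hits = resp.json().get("search", [])
        results.append((ent.text, ent.label_, hits[0]["id"] if hits else None))
    return results


# Hypothetical snippet from a digitized newspaper article.
print(link_to_wikidata("Le feuilleton parut dans Le Peuple à Bruxelles en 1938."))
```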