There is a growing need to access historical Ottoman documents stored in large archives, and tools for automatically searching, indexing, and transcribing these documents are therefore required. In this paper, we present a method for retrieving Ottoman documents based on word matching. The method first segments the documents into word images and then uses a hierarchical matching technique to find similar instances of each word image. The experiments show that promising results can be achieved even with simple features.
Large archives of Ottoman documents are of interest to many historians all over the world. However, these archives remain inaccessible because manually transcribing such a huge volume is difficult. Automatic transcription is required, but because of the characteristics of Ottoman documents, systems based on character recognition may not yield satisfactory results. It is also desirable to store the documents in image form, since they may contain important drawings, especially signatures. For these reasons, in this study we treat the problem as an image retrieval problem, viewing Ottoman words as images, and propose a solution based on image matching techniques. The bag-of-visterms approach, which has been shown to be successful in classifying objects and scenes, is adapted for matching word images. Each word image is represented by a set of visual terms obtained by vector quantization of SIFT descriptors extracted from salient points. Similar words are then matched based on the similarity of the distributions of their visual terms. The experiments are carried out on printed and handwritten documents containing over 10,000 words. The results show that the proposed system is able to retrieve words with high accuracy and to capture semantic similarities between words.
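The pipeline described above (quantize local descriptors into visual terms, then compare word images by their visterm distributions) can be illustrated with a minimal sketch. This is not the authors' implementation: the synthetic 128-dimensional vectors stand in for real SIFT descriptors, and the plain k-means codebook, the codebook size `k=3`, and histogram intersection as the similarity measure are all assumptions chosen for illustration, since the abstract does not specify them. The function names are hypothetical.

```python
import numpy as np

def quantize(descriptors, centers):
    """Assign each descriptor to the index of its nearest center (its visterm)."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def build_codebook(descriptors, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm); an assumed choice of vector quantizer."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(descriptors), size=k, replace=False)
    centers = descriptors[picks].astype(float)
    for _ in range(iters):
        labels = quantize(descriptors, centers)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def visterm_histogram(descriptors, centers):
    """Normalized distribution of visterms for one word image (needs >= 1 descriptor)."""
    counts = np.bincount(quantize(descriptors, centers), minlength=len(centers))
    return counts / counts.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; an assumed measure, not stated in the abstract."""
    return float(np.minimum(h1, h2).sum())

# Synthetic stand-in for SIFT: three well-separated "stroke" prototypes.
rng = np.random.default_rng(42)
prototypes = rng.normal(0.0, 10.0, size=(3, 128))

def fake_word(proto_ids, n=200):
    """Hypothetical helper: n noisy descriptors drawn around the given prototypes."""
    ids = rng.choice(proto_ids, size=n)
    return prototypes[ids] + rng.normal(0.0, 1.0, size=(n, 128))

word_a1 = fake_word([0, 1])   # two instances of the same "word"
word_a2 = fake_word([0, 1])
word_b = fake_word([2])       # a different "word"

codebook = build_codebook(np.vstack([word_a1, word_a2, word_b]), k=3)
h_a1, h_a2, h_b = (visterm_histogram(w, codebook) for w in (word_a1, word_a2, word_b))

sim_same = histogram_intersection(h_a1, h_a2)
sim_diff = histogram_intersection(h_a1, h_b)
print("same word:", sim_same, "different word:", sim_diff)
```

In this toy setting the two instances of the same word score higher than the mismatched pair. With real documents, the descriptors would come from a SIFT extractor run on each segmented word image, and the codebook would be much larger.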
Figure 1: Letters of the Ottoman alphabet. In addition to the 28 letters of Arabic, Ottoman contains 5 more letters, which are framed in the figure.