Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries 2015
DOI: 10.1145/2756406.2756926

Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Abstract: We propose an approach to indexing raster images of dictionary pages that requires very little manual effort and enables direct access to the appropriate page of the dictionary for lookup. Accessibility is further improved by feedback and crowdsourcing, which enable highlighting of the specific location on the page where the lookup word is found, as well as annotation, digitization, and fielded searching. The approach applies equally to simple scripts and to complex writing systems. Using our proposed…
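The abstract does not spell out the index structure. One way to get page-level access with little manual effort is a sparse index that records only the first headword of each scanned page and then binary-searches it. The sketch below illustrates that idea under stated assumptions and is not the authors' implementation; `build_page_index`, `lookup_page`, and the `sort_key` parameter are hypothetical names introduced here.

```python
from bisect import bisect_right

def build_page_index(first_words, sort_key=lambda w: w):
    """Build a sparse index from (first_headword, page_number) pairs.

    Assumption: one manually entered headword per scanned page.
    `sort_key` maps a word to a comparable key; the default compares
    by code point, which is NOT dictionary order for many complex
    scripts, so a script-aware collation key would be needed there.
    """
    entries = sorted(first_words, key=lambda e: sort_key(e[0]))
    keys = [sort_key(word) for word, _ in entries]
    pages = [page for _, page in entries]
    return keys, pages, sort_key

def lookup_page(index, word):
    """Return the page whose first headword is the last one <= word."""
    keys, pages, sort_key = index
    pos = bisect_right(keys, sort_key(word)) - 1
    # Words that sort before the first indexed headword map to the first page.
    return pages[max(pos, 0)]

# Hypothetical toy data: first headword on each of three scanned pages.
index = build_page_index([("aardvark", 1), ("apple", 2), ("banana", 3)])
print(lookup_page(index, "apricot"))
```

Here "apricot" sorts after "apple" and before "banana", so the lookup lands on page 2. The reader is sent to the page image directly, and no OCR of the page content is required.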

citations
Cited by 3 publications
(4 citation statements)
references
References 33 publications
“…Urdu belongs to the Indo-Aryan family, widely spoken in Pakistan and the northern parts of India (Alam, Mehmood, & Nelson, 2015).…”
Section: Introduction (mentioning)
confidence: 99%
“…This study aims to develop a publicly available large‐scale benchmark corpus that contains real examples of cross‐language text reuse at sentence/passage level for the English‐Urdu language pair. Urdu belongs to the Indo‐Aryan family, widely spoken in Pakistan and the northern parts of India (Alam, Mehmood, & Nelson, ). Moreover, it has a strong Perso‐Arabic influence in its vocabulary and is written in a Perso‐Arabic script from right to left.…”
Section: Introduction (mentioning)
confidence: 99%
“…Our CNNs comprise of mainly two layers (as shown in Figure 1): a convolutional layer followed by max pooling and a fully connected layer for the classification. For the CNN input, we consider a document (partial) as a sequence of words and use pre-trained word embeddings for each word. These pre-trained word embeddings are trained on the Google News dataset using the Word2Vec [35] algorithm.…”
Section: Methods (mentioning)
confidence: 99%
“…Nwala et al [36] studied bootstrapping of the web archive collections from the social media and showed that sources such as Reddit, Twitter, and Wikipedia can produce collections that are similar to expert generated collections (i.e., Archive-It collections). Alam et al [1] proposed an approach to index raster images of dictionary pages and built a Web application that supports word indexes in various languages with multiple dictionaries. Alam et al [2] used CDX summarization for web archive profiling, whereas AlNoamany et al [3] proposed the Dark and Stormy Archive (DSA) framework for summarizing the holdings of these collections and arranging them into a chronological order.…”
Section: Related Work (mentioning)
confidence: 99%