Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Alam, Sawood; Mehmood, Fateh ud din B.; Nelson, Michael L.

doi:10.1145/2756406.2756926

Cited by 3 publications

(4 citation statements)

References 33 publications

“…Urdu belongs to the Indo-Aryan family, widely spoken in Pakistan and the northern parts of India (Alam, Mehmood, & Nelson, 2015).…”

Section: Introduction (mentioning)

confidence: 99%

“…This study aims to develop a publicly available large‐scale benchmark corpus that contains real examples of cross‐language text reuse at sentence/passage level for the English‐Urdu language pair. Urdu belongs to the Indo‐Aryan family, widely spoken in Pakistan and the northern parts of India (Alam, Mehmood, & Nelson, ). Moreover, it has a strong Perso‐Arabic influence in its vocabulary and is written in a Perso‐Arabic script from right to left.…”

Section: Introduction (mentioning)

confidence: 99%

CLEU ‐ A Cross‐Language English‐Urdu Corpus and Benchmark for Text Reuse Experiments

Muneer, Sharjeel, Iqbal, et al. 2018

Association for Information Science and Technology

Text reuse is becoming a serious issue in many fields, and research shows that it is much harder to detect when it occurs across languages. The recent rise in multilingual content on the Web has increased cross-language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large-scale gold standard evaluation resources built on real cases. To overcome this problem, we propose a cross-language sentence/passage level text reuse corpus for the English-Urdu language pair. The Cross-Language English-Urdu Corpus (CLEU) has source text in English, whereas the derived text is in Urdu. It contains a total of 3,235 sentence/passage pairs manually tagged into three categories: near copy, paraphrased copy, and independently written. As a second contribution, we evaluate the Translation plus Mono-lingual Analysis method using three sets of experiments on the proposed dataset to highlight its usefulness. Evaluation results (F1 = 0.732 for binary and F1 = 0.552 for ternary classification) indicate that it is harder to detect real cases of cross-language text reuse, especially when the language pairs have unrelated scripts. The corpus is a useful benchmark resource for the future development and assessment of cross-language text reuse detection systems for the English-Urdu language pair.

“…Our CNNs comprise of mainly two layers (as shown in Figure 1): a convolutional layer followed by max pooling and a fully connected layer for the classification. For the CNN input, we consider a document (partial) as a sequence of words and use pre-trained word embeddings for each word. These pre-trained word embeddings are trained on the Google News dataset using the Word2Vec [35] algorithm.…”

Section: Methods (mentioning)

confidence: 99%

“…Nwala et al [36] studied bootstrapping of the web archive collections from the social media and showed that sources such as Reddit, Twitter, and Wikipedia can produce collections that are similar to expert generated collections (i.e., Archive-It collections). Alam et al [1] proposed an approach to index raster images of dictionary pages and built a Web application that supports word indexes in various languages with multiple dictionaries. Alam et al [2] used CDX summarization for web archive profiling, whereas AlNoamany et al [3] proposed the Dark and Stormy Archive (DSA) framework for summarizing the holdings of these collections and arranging them into a chronological order.…”

Section: Related Work (mentioning)

confidence: 99%

Identifying Documents In-Scope of a Collection from Web Archives

Patel, Caragea, Phillips, et al. 2020

Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020

Web archive data usually contains high-quality documents that are very useful for creating specialized collections of documents, e.g., scientific digital libraries and repositories of technical reports. In doing so, there is a substantial need for automatic approaches that can distinguish the documents of interest for a collection out of the huge number of documents collected by web archiving institutions. In this paper, we explore different learning models and feature representations to determine the best performing ones for identifying the documents of interest from the web archived data. Specifically, we study both machine learning and deep learning models and "bag of words" (BoW) features extracted from the entire document or from specific portions of the document, as well as structural features that capture the structure of documents. We focus our evaluation on three datasets that we created from three different Web archives. Our experimental results show that the BoW classifiers that focus only on specific portions of the documents (rather than the full text) outperform all compared methods on all three datasets.

Urdu Short Paraphrase Detection at Sentence Level

Hafeez, Muneer, Sharjeel, et al. 2023

ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Paraphrase detection systems uncover the relationship between two text fragments and classify them as paraphrased when they convey the same idea, and as non-paraphrased otherwise. Previously, researchers have mainly focused on developing paraphrase detection resources for the English language. There have been very few efforts for paraphrase detection in South Asian languages, and no research has been conducted on sentence-level paraphrase detection in Urdu, a low-resourced language. This is mainly due to the unavailability of corpora that focus on the sentence level. The available related studies on the Urdu language only focus on text reuse detection tasks at the passage and document levels. Therefore, this study aims to develop a large-scale manually annotated benchmark Urdu paraphrase detection corpus at the sentence level, based on real cases from journalism. The proposed Urdu Sentential Paraphrases (USP) corpus contains 4,900 sentences (2,941 paraphrased and 1,959 non-paraphrased), manually collected from Urdu newspapers. Moreover, several techniques were proposed, developed, and compared as a secondary contribution, including Word Embedding (WE), Sentence Transformers (ST), and feature-fusion techniques. N-gram is treated as the baseline technique for our research. The experimental results indicate that our proposed feature-fusion technique is the most suitable for the Urdu paraphrase detection task. Furthermore, the performance increases when features of the proposed (ST) and baseline (N-gram) techniques are combined for the classification task. In addition, the proposed techniques have also been applied to the UPPC corpus to check their performance at the document level. The best result was obtained using the feature-fusion technique (F1 = 0.855). Our corpus is free to download for research purposes.
