Cross-language information retrieval (CLIR) deals with retrieving relevant documents in one language using queries expressed in another language. Because CLIR tools rely on translation techniques, they are challenged by the properties of highly derivational and inflectional languages such as Arabic. Much work has been done on CLIR for different languages, including Arabic. In this article, we introduce the reader to the motivations for solving some problems related to Arabic CLIR approaches. The evaluation of these approaches is discussed starting from the 2001 and 2002 TREC Arabic CLIR tracks, which aim to evaluate CLIR systems objectively. We also study many other research works to highlight the problems that remain unresolved or require further investigation. These works are discussed in light of an in-depth study of the specific features and tasks of Arabic information retrieval (IR). Particular attention is given to translation techniques and CLIR resources, which are key challenges for Arabic CLIR. To advance research in this field, we discuss how a new standard collection can improve Arabic IR and CLIR tracks.
Automatic text summarization is the process of generating or extracting a brief representation of an input text. The literature offers several algorithms for extractive summarization, tested on datasets in English and other languages; however, only a few extractive Arabic summarizers exist, owing to the lack of large Arabic collections. This paper proposes and assesses new extractive single-document summarization approaches based on analogical proportions, which are statements of the form "a is to b as c is to d". The goal is to study the capability of analogical proportions to represent the relationship between documents and their corresponding summaries. For this purpose, we suggest two algorithms that quantify the relevance or irrelevance of a keyword extracted from the input text in order to build its summary. In the first algorithm, the analogical proportion representing this relationship only checks, in a binary way, whether the keyword exists in a document or summary, without considering keyword frequency in the text, whereas the analogical proportion of the second algorithm takes this frequency into account. We have assessed and compared these two algorithms with some language-independent summarizers (LexRank, TextRank, Luhn and LSA (Latent Semantic Analysis)) using our large corpus ANT (Arabic News Texts) and a small test collection EASC (Essex Arabic Summaries Corpus) by computing ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (BiLingual Evaluation Understudy) metrics. The best results achieved are ROUGE-1 = 0.96 and BLEU-1 = 0.65, obtained on educational documents from the EASC collection, which outperform those of the best LexRank algorithm. The proposed algorithms are also compared with three other Arabic extractive summarizers on the EASC collection and achieve better results, with ROUGE-1 = 0.75 and BLEU-1 = 0.47 for the first algorithm, and ROUGE-1 = 0.74 and BLEU-1 = 0.49 for the second one.
Experimental results demonstrate the value of analogical proportions for text summarization. In particular, the analogical summarizers significantly outperform three of the four language-independent summarizers in terms of BLEU-1 on the ANT collection, and they are not significantly outperformed by any other summarizer on the EASC collection.
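To make the notion of an analogical proportion concrete, the following is a minimal Python sketch of the standard Boolean reading of "a is to b as c is to d", applied to the presence or absence of a keyword. It is not the algorithm of the paper, whose details the abstract does not give. The helper names (`analogy_holds`, `solve_analogy`, `keyword_votes`), the voting scheme, and the toy data are all assumptions introduced for illustration only.

```python
from itertools import product
from typing import Iterable, Optional, Set, Tuple


def analogy_holds(a: bool, b: bool, c: bool, d: bool) -> bool:
    """Boolean analogical proportion a : b :: c : d.

    It holds when a differs from b exactly as c differs from d.
    """
    return ((a and not b) == (c and not d)) and ((not a and b) == (not c and d))


def solve_analogy(a: bool, b: bool, c: bool) -> Optional[bool]:
    """Return x such that a : b :: c : x holds, or None if no such x exists."""
    if a == b:
        return c
    if a == c:
        return b
    return None


def keyword_votes(
    keyword: str,
    training_pairs: Iterable[Tuple[Set[str], Set[str]]],
) -> Tuple[int, int]:
    """Hypothetical voting scheme (an assumption, not the paper's method).

    For each (document words, summary words) pair, solve
    "keyword in doc : keyword in summary :: keyword in new doc : x"
    with the keyword assumed present in the new document, and count
    the votes for keeping (x is True) or dropping (x is False) it.
    """
    keep = drop = 0
    for doc_words, summary_words in training_pairs:
        x = solve_analogy(keyword in doc_words, keyword in summary_words, True)
        if x is True:
            keep += 1
        elif x is False:
            drop += 1
    return keep, drop


# Toy data, invented for this sketch.
pairs = [
    ({"economy", "bank"}, {"economy"}),
    ({"economy", "rain"}, {"rain"}),
    ({"sport"}, {"sport"}),
]
valid_patterns = sum(analogy_holds(*v) for v in product([False, True], repeat=4))
votes = keyword_votes("economy", pairs)
```

This sketch uses only binary presence, in the spirit of the first algorithm described above; a frequency-aware variant like the second algorithm would need a graded form of the proportion, which is not attempted here.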
Purpose - The purpose of this paper is to make a scientific contribution to web information retrieval (IR). Design/methodology/approach - A multiagent system for web IR is proposed based on new technologies: Hierarchical Small-Worlds (HSW) and Possibilistic Networks (PN). This system is based on a possibilistic qualitative approach which extends the quantitative one. Findings - The paper finds that the relevance ordering of documents changes when passing from one profile to another. Even if the selected terms tend to select the relevant document, they are not the most frequent terms in the document. This criterion shows the advantage of the qualitative approach of the SARIPOD system in the selection of relevant documents. Inserting preference factors between query terms into the calculation of possibility and necessity increases the possibilistic relevance scores of the documents containing these terms, with the aim of penalizing the relevance scores of the documents that do not contain them. The penalization and the increase in the scores are proportional to the capacity of the terms to discriminate between the documents of the collection. Research limitations/implications - It is planned to extend the tests of the SARIPOD system to other grammatical categories, for example by refining the approach for substantives to consider verbal occurrences in name definitions. It is also planned to carry out finer measurements of the performance of the SARIPOD system by extending the tests to other types of web documents. Practical implications - The system can help research students find scientific papers relevant to them. It must be located on the document server of a research laboratory. Originality/value - The paper presents SARIPOD, a new qualitative possibilistic model for web IR using a multiagent system.