2AIRTC: The Amharic Adhoc Information Retrieval Test Collection

Yeshambel, Tilahun; Mothe, Josiane; Assabie, Yaregal

doi:10.1007/978-3-030-58219-7_5

Cited by 6 publications

(9 citation statements)

References 16 publications


“…Many IR and NLP applications need stem or root extraction prior to other processes. We conducted a preliminary analysis on the usefulness of stem-based and root-based retrieval using the corpora we built, the 2AIRTC [16] and the Amharic stopword list [33] which are all available at https://www.irit.fr/AmharicResources/. We found that root-based approach is better for retrieving more number of relevant documents.…”

Section: Discussion (mentioning)

confidence: 99%

“…In this paper, we present a collection which consists in two lexicons of 170,000 morphologically annotated Amharic terms where both stems and roots are annotated, as well as corpora of texts where documents have been re-written using these lexicons. These texts are part of the 2AIRTC, the Amharic Adhoc Information Retrieval Test Collection where documents, queries and query relevance are provided [16].…”

Section: Introduction (mentioning)

confidence: 99%


Morphologically Annotated Amharic Text Corpora

Yeshambel, Mothe, Assabie (2021)

Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

Self Cite


In information retrieval (IR), documents that match the query are retrieved. Search engines usually conflate word variants into a common stem when indexing documents because queries and documents do not need to use exactly the same word variant for the documents to be relevant. Stemmers are known to be effective in many languages for IR. However, there are still languages where stemmers or morphological analyzers are missing; this is the case for Amharic which is the working language of Ethiopia. Morphological analysis is the key to derive stems, roots (primary lexical units) and grammatical markers of words such as person, tense and negation markers. This paper presents morphologically annotated Amharic lexicons as well as stem-based and root-based morphologically annotated corpora which could be used by the research community as benchmark collections either to evaluate morphological analyzers or information retrieval for Amharic. Such resources are believed to foster research in Amharic IR.


“…We can quote a few such studies. Demeke and Getachew [23] created Walta Information Center news corpus; Yeshambel et al [24] built 2AIRTC; and Yeshambel et al [10] created stem-based and root-based morphologically annotated Amharic corpora semiautomatically. The sizes of corpora created by Demeke and Getachew [23], Yeshambel et al [24] and Yeshambel et al [10] are 1,065, 12,586, and 6,069 documents, respectively.…”

Section: Evaluation of Amharic IR Corpora Resources and NLP Tools (mentioning)

confidence: 99%

“…Demeke and Getachew [23] created Walta Information Center news corpus; Yeshambel et al [24] built 2AIRTC; and Yeshambel et al [10] created stem-based and root-based morphologically annotated Amharic corpora semiautomatically. The sizes of corpora created by Demeke and Getachew [23], Yeshambel et al [24] and Yeshambel et al [10] are 1,065, 12,586, and 6,069 documents, respectively. Mindaye et al [13] and Samuel and Bjorn [25] created Amharic word-based stopword list whereas Alemayehu and Willett [26] built stem-based stopwords list.…”

Section: Evaluation of Amharic IR Corpora Resources and NLP Tools (mentioning)

confidence: 99%

“…Consequently, they would not be used for accurately testing the performance of IR techniques. In our previous work [24], we developed an Amharic IR test collection that consists in a corpus, topic set and the associated relevance judgment. It allows researchers to evaluate retrieval system automatically though the size is still small relative to standard test collections.…”

Section: Evaluation of Amharic IR Corpora Resources and NLP Tools (mentioning)

confidence: 99%


Amharic Semantic Information Retrieval System

Yeshambel, Mothe, Assabie (2022)

Communications in Computer and Information Science

Self Cite


Amharic is the official language of Ethiopia, currently having a population of over 118 million. Developing effective information retrieval (IR) system for Amharic has been a challenging task due to limited resources coupled with complex morphology of the language. This paper presents the development of Amharic semantic IR system using query expansion based on deep neural learning model and WordNet. In order to optimize the retrieval result, we propose Amharic text representation using root forms of words applied for stopword identification, indexing, term matching and query expansion. Comparisons are made with the conventional stem-based text representation for information retrieval, and we show that using the root forms of words is better for both resource construction and system development. The effectiveness of the proposed Amharic semantic IR system is evaluated on Amharic Adhoc Information Retrieval Test Collection (2AIRTC).


Amharic Question Answering for Biography, Definition, and Description Questions

Abedissa, Libsie (2019)

Communications in Computer and Information Science


Question Answering (QA) returns concise answers or answer lists from natural language text given a context document. To advance robust models' development, large amounts of resources go into curating QA datasets. There is a surge of QA datasets for languages like English, however, this is not the case for Amharic. Amharic, the official language of Ethiopia, is the second most spoken Semitic language in the world. There is no published or publicly available Amharic QA dataset. Hence, to foster the research in Amharic QA, we present the first Amharic QA (AmQA) dataset. We crowdsourced 2628 question-answer pairs over 378 Wikipedia articles. Additionally, we run an XLMR Large-based baseline model to spark open-domain QA research interest. The best-performing baseline achieves an F-score of 69.58 and 71.74 in reader-retriever QA and reading comprehension settings respectively.

