Morphologically Annotated Amharic Text Corpora

Yeshambel, Tilahun; Mothe, Josiane; Assabie, Yaregal

doi:10.1145/3404835.3463237

Cited by 6 publications

(11 citation statements)

References 22 publications

“…We can quote a few such studies. Demeke and Getachew [23] created Walta Information Center news corpus; Yeshambel et al [24] built 2AIRTC; and Yeshambel et al [10] created stem-based and root-based morphologically annotated Amharic corpora semiautomatically. The sizes of corpora created by Demeke and Getachew [23], Yeshambel et al [24] and Yeshambel et al [10] are 1,065, 12,586, and 6,069 documents, respectively.…”

Section: Evaluation of Amharic IR Corpora Resources and NLP Tools (mentioning)

confidence: 99%

“…Demeke and Getachew [23] created Walta Information Center news corpus; Yeshambel et al [24] built 2AIRTC; and Yeshambel et al [10] created stem-based and root-based morphologically annotated Amharic corpora semiautomatically. The sizes of corpora created by Demeke and Getachew [23], Yeshambel et al [24] and Yeshambel et al [10] are 1,065, 12,586, and 6,069 documents, respectively. Mindaye et al [13] and Samuel and Bjorn [25] created Amharic word-based stopword list whereas Alemayehu and Willett [26] built stem-based stopwords list.…”

Section: Evaluation of Amharic IR Corpora Resources and NLP Tools (mentioning)

confidence: 99%

“…stem and root) could be extracted easily and quickly. The two morphological analyses performed in this work are stem-based and root-based morphological analysis using lexicons created by Yeshambel et al [10]. The lexicons are constructed from a corpus.…”

Section: Morphological Analysis (mentioning)

confidence: 99%

“…In this work, Amharic surface words are segmented into their morphemes by analyzing the internal structure of words and their contexts. The annotation is made semi-automatically using Amharic lexicons built by Yeshambel et al [10]. For comparison of stem-based and root-based text representations, we created the stem-based and root-based corpora from the same document collection.…”

Section: Context-based Morphologically Annotated Corpora (mentioning)

confidence: 99%

“…The existing Amharic IR systems face challenges in searching relevant documents because of the morphological complexity and semantic richness of the language. Amharic exhibits complex morphology that poses challenges in NLP and IR [9,10]. The base of Amharic word can be stem or root.…”

Section: Introduction (mentioning)

confidence: 99%

Amharic Semantic Information Retrieval System

Yeshambel

Mothe

Assabie

2022

Communications in Computer and Information Science

Self Cite

Amharic is the official language of Ethiopia, a country that currently has a population of over 118 million. Developing an effective information retrieval (IR) system for Amharic has been a challenging task because of limited resources coupled with the complex morphology of the language. This paper presents the development of an Amharic semantic IR system using query expansion based on a deep neural learning model and WordNet. To optimize the retrieval results, we propose an Amharic text representation using the root forms of words, applied to stopword identification, indexing, term matching, and query expansion. Comparisons are made with the conventional stem-based text representation for information retrieval, and we show that using the root forms of words is better for both resource construction and system development. The effectiveness of the proposed Amharic semantic IR system is evaluated on the Amharic Adhoc Information Retrieval Test Collection (2AIRTC).

Much Ado About Gender

Pinney

Raj

Hanna

et al. 2023

Proceedings of the 2023 Conference on Human Information Interaction and Retrieval

Shaping the Future of Endangered and Low-Resource Languages---Our Role in the Age of LLMs: A Keynote at ECIR 2024

Mothe

2024

SIGIR Forum

Isidore of Seville is credited with the adage that it is language that gives birth to a people, and not the other way around, underlining the profound role played by language in the formation of cultural and social identity. Today, of the more than 7100 languages listed, a significant number are endangered. Since the 1970s, linguists, information seekers and enthusiasts have helped develop digital resources and automatic tools to support a wide range of languages, including endangered ones. The advent of Large Language Model (LLM) technologies holds both promise and peril. They offer unprecedented possibilities for the translation and generation of content and resources, key elements in the preservation and revitalisation of languages. They also present the threat of homogenisation, cultural oversimplification and the further marginalisation of already vulnerable languages. The talk on which this paper is based proposed an initiatory journey, exploring the potential paths and partnerships between technology and tradition, with a particular focus on the Occitan language. Occitan is a language from Southern France and parts of Spain and Italy that played a major cultural and economic role, particularly in the Middle Ages. It is now endangered according to UNESCO. The talk critically examined how human expertise and artificial intelligence can work together to offer hope for preserving the linguistic diversity that forms the foundation of our global and especially our European heritage, while addressing some of the ethical and practical challenges that accompany the use of these powerful technologies. This paper is based on the keynote I gave at the 46th European Conference on Information Retrieval (ECIR 2024). As an alternative to reading this paper, a video of the talk is available online. Date: 26 March 2024.
