This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer. The stemming model is based on statistical machine translation and it uses an English stemmer and a small (10K sentences) parallel corpus as its sole training resources. No parallel text is needed after the training phase. Monolingual, unannotated text can be used to further improve the stemmer by allowing it to adapt to a desired domain or genre. Examples and results will be given for Arabic , but the approach is applicable to any language that needs affix removal. Our resource-frugal approach results in 87.5% agreement with a state of the art, proprietary Arabic stemmer built using rules, affix lists, and human annotated text, in addition to an unsupervised component. Task-based evaluation using Arabic information retrieval indicates an improvement of 22-38% in average precision over unstemmed text, and 96% of the performance of the proprietary stemmer above.
Existing datasets that contain boolean questions, such as BoolQ and TYDI QA , provide the user with a YES/NO response to the question. However, a one word response is not sufficient for an explainable system. We promote explainability by releasing a new set of annotations marking the evidence in existing TYDI QA and BoolQ datasets. We show that our annotations can be used to train a model that extracts improved evidence spans compared to models that rely on existing resources. We confirm our findings with a user study which shows that our extracted evidence spans enhance the user experience. We also provide further insight into the challenges of answering boolean questions, such as passages containing conflicting YES and NO answers, and varying degrees of relevance of the predicted evidence.
Searching online information is increasingly a daily activity for many people. The multilinguality of online content is also increasing (e.g. the proportion of English web users, which has been decreasing as a fraction the increasing population of web users, dipped below 50% in the summer of 2001). To improve the ability of an English speaker to search mutlilingual content, we built a system that supports cross-lingual search of an Arabic newswire collection and provides on demand translation of Arabic web pages into English. The cross-lingual search engine supports a fast search capability (sub-second response for typical queries) and achieves state-of-the-art performance in the high precision region of the result list. The on demand statistical machine translation uses the Direct Translation model along with a novel statistical Arabic Morphological Analyzer to yield state-of-the-art translation quality. The on demand SMT uses an efficient dynamic programming decoder that achieves reasonable speed for translating web documents.
Recent machine reading comprehension datasets include extractive and boolean questions but current approaches do not offer integrated support for answering both question types. We present a multilingual machine reading comprehension system and front-end demo that handles boolean questions by providing both a YES/NO answer and highlighting supporting evidence, and handles extractive questions by highlighting the answer in the passage. Our system, GAAMA 2.0, is ranked first on the TYDI QA leaderboard at the time of this writing. We contrast two different implementations of our approach. The first includes several independent stacks of transformers allowing easy deployment of each component. The second is a single stack of transformers utilizing adapters to reduce GPU memory footprint in a resource-constrained environment.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.