A process that attempts to solve abbreviation ambiguity is presented. Various context-related and statistical features have been explored. Almost all features are domain-independent and language-independent. The application domain is Jewish Law documents written in Hebrew; such documents are known to be rich in ambiguous abbreviations. Various implementations of the one sense per discourse hypothesis are used, improving the features with new variants. An accuracy of 96.09% was achieved using SVM.
In many languages, abbreviations are widely used in both writing and speech. However, abbreviations are likely to be ambiguous, so there is a need for disambiguation; that is, abbreviations should be expanded correctly. Disambiguation of abbreviations is critical to the correct understanding not only of the abbreviations themselves but of the whole text. Little research has been done on disambiguating abbreviations in English and Latin documents, and none for the Hebrew language. In this ongoing work, we investigate a basic model that expands abbreviations contained in Jewish Law documents written in Hebrew. This model has been implemented in a prototype system. Currently, experimental results show that abbreviations are expanded correctly at a rate of almost 60%.
In many languages, abbreviations are very common and widely used in both written and spoken language. However, they are not always explicitly defined and are in many cases ambiguous. This research presents a process that attempts to solve the problem of abbreviation ambiguity using modern machine learning (ML) techniques. Various baseline features are explored, including context-related and statistical methods. The application domain is Jewish Law documents written in Hebrew and Aramaic, which are known to be rich in ambiguous abbreviations. Two research approaches were implemented and tested: general and individual. Our system applied four common ML methods to find a successful integration of the various baseline features. The best result, 98.07% accuracy, was achieved by the SVM method in the individual approach.

Introduction

In the field of natural language processing (NLP), one attractive research subject is the word sense disambiguation (WSD) problem: the task of assigning to each occurrence of an ambiguous word in a text one of its possible senses. Many research systems have been developed to address this widespread problem for a variety of languages, e.g., a WSD system for Thai that disambiguates both verbs and nouns (whose results were not reported). In this research project, the goal is to solve a subproblem of WSD, the abbreviation disambiguation problem, in Jewish Law documents, which are written in the Hebrew script but mix the Hebrew and Aramaic languages. This problem has been addressed by only a handful of previously developed systems, none of which handled these languages. It is important to note that previous research on this subproblem did not focus on defining a generic model or a generic model-creation process.
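The supervised setup the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's actual pipeline: the training contexts, labels, and bag-of-words features here are invented English stand-ins for the Hebrew/Aramaic abbreviations and the paper's richer feature set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy training data: context windows around the ambiguous token "dr.",
# each paired with its correct expansion (the classification label).
contexts = [
    "the patient saw dr. smith at the clinic",
    "turn left on main dr. near the park",
    "dr. jones examined the patient today",
    "the house on elm dr. was sold",
]
expansions = ["doctor", "drive", "doctor", "drive"]

# Bag-of-words context features feeding a linear SVM, one of the
# common ML methods the abstract reports testing.
model = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
model.fit(contexts, expansions)

# Expanding a new ambiguous occurrence from its context.
print(model.predict(["dr. lee examined the patient"])[0])  # prints "doctor"
```

In the paper's terms, each ambiguous abbreviation occurrence becomes one training instance, its surrounding context supplies the features, and the expansion chosen by a human annotator supplies the label.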
Previous studies attempted to create human-like computational and decision processes for specific contexts, such as medical articles or Latin literature. Each study relies on a set of context-specific assumptions, which helped improve system performance or limited the system to specific types of abbreviation instances, but in turn lessened the generality of the developed system or solution method. In addition to its uniqueness in handling Jewish texts, specifically law documents, this research aspires to find a generic model-creation process. The developed process accommodates other languages and defines no pre-execution assumptions, although additional languages were not tested with it. The only limitation is the input itself: the languages of the text documents and the man-made solution database supplied during the learning process limit the context of documents that the resulting disambiguation system can solve. This claim is supported by the fact that the researched domain contains a mixture of the Hebrew and Aramaic languages, thus demonstrating the generic nature of the learning process. In addition to the generic model, ...
Disambiguation of ambiguous initialisms and acronyms is critical to the proper understanding of various types of texts. A model that attempts to solve this has previously been presented. This model contained various baseline features, including contextual relationship features, statistical features, and language-specific features. The domain of Jewish law documents written in Hebrew and Aramaic is known to be rich in ambiguous abbreviations, and the model was therefore implemented and applied to two separate corpora within this domain. Several common machine-learning (ML) methods were tested with the intent of finding a successful integration of the baseline feature variants. When the features were evaluated individually, the best averaged results were achieved by a library for support vector machines (LIBSVM): 98.07% of the ambiguous abbreviations researched in the domain were disambiguated correctly. When all the features were evaluated together, the J48 ML method achieved the best result, with 96.95% accuracy. In this paper, we examine the system's degree of success and its level of expertise by comparing the system's results with those achieved by 39 participants who are highly fluent in the research domain. Despite the fact that all the participants had backgrounds in religious scriptures and continue to study these texts, the system's accuracy rate, 98.07%, was significantly higher than the participants' average accuracy of 91.65%. Further analysis of the results for each corpus suggests that participants overcomplicate the required task and exclude vital information needed to properly examine the context of a given initialism.
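The abstract's comparison of LIBSVM against J48 can be approximated as below. This is a hedged sketch using scikit-learn stand-ins: `SVC` wraps LIBSVM internally, and `DecisionTreeClassifier` is a rough analogue of Weka's C4.5-based J48; the Iris dataset is only a placeholder for the paper's per-abbreviation feature vectors.

```python
from sklearn.datasets import load_iris  # placeholder for the paper's feature vectors
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in data; the real input is abbreviation features

# Cross-validated accuracy for each classifier, mirroring the abstract's
# side-by-side evaluation of the two ML methods.
for name, clf in [
    ("LIBSVM (SVC)", SVC()),
    ("J48-like tree", DecisionTreeClassifier(random_state=0)),
]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {acc:.4f}")
```

The design point is that both methods consume the same feature vectors, so the comparison isolates the learning algorithm rather than the feature engineering.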