In many languages, abbreviations are very common and are widely used in both written and spoken language. However, they are not always explicitly defined, and in many cases they are ambiguous. This research presents a process that attempts to solve the problem of abbreviation ambiguity using modern machine learning (ML) techniques. Various baseline features are explored, including context-related and statistical methods. The application domain is Jewish Law documents written in Hebrew and Aramaic, which are known to be rich in ambiguous abbreviations. Two research approaches were implemented and tested: general and individual. Our system applied four common ML methods to find a successful integration of the various baseline features. The best result, 98.07% accuracy, was achieved by the support vector machine (SVM) method under the individual approach.
Introduction
In the field of natural language processing (NLP), one attractive research subject is the word sense disambiguation (WSD) problem. WSD is the task of assigning to each occurrence of an ambiguous word in a text one of its possible senses. To solve this widespread problem, many research systems have been developed and run for a variety of languages, e.g.: (1) a WSD system in Thai, disambiguating both verbs and nouns (the system result was not reported).
In this research project, the goal is to solve a subproblem of WSD: the abbreviation disambiguation problem in Jewish Law documents, which are written in the Hebrew script but mix the Hebrew and Aramaic languages. This problem has been addressed by only a handful of previously developed systems, none of which handled these languages.
It is important to note that previous research on this subproblem did not focus on defining a generic model or a generic model creation process. The various studies attempted to create human-like computational and decision processes for specific contexts, such as medical articles or Latin literature. Each study relies on a set of context-specific assumptions, which helped improve system performance or limited the system to specific types of abbreviation instances, but in turn lessened the generality of the developed system or solution method.
This research, in addition to its uniqueness in handling Jewish texts, specifically law documents, aspires to find a generic model creation process. The developed process takes other languages into consideration and does not define pre-execution assumptions, although additional languages were not tested using this process.
The only limitation to this process is the input itself: the languages of the text documents and the manually built solution database supplied during the learning process limit the context of documents that the resulting disambiguation system can handle. This claim is supported by the fact that the researched domain contains a mixture of the Hebrew and Aramaic languages, thus exemplifying the generic nature of the learning process. In addition to the generic model, ...