This paper describes the creation of the new Bangor Arabic Annotated Corpus (BAAC), a Modern Standard Arabic (MSA) corpus comprising 50K words manually annotated with part-of-speech tags. To evaluate the quality of the corpus, the Kappa coefficient and a direct percent agreement for each tag were calculated; a Kappa value of 0.956 was obtained, with an average observed agreement of 94.25%. The corpus was used to evaluate the widely used Madamira Arabic part-of-speech tagger and to further investigate compression models for text compressed using part-of-speech tags. A new annotation tool was also developed and employed for the annotation of BAAC.

Keywords: Arabic language; corpus; annotated corpora; analysis results

I. BACKGROUND AND MOTIVATION

The Arabic language "العربية" is acknowledged to be one of the most widely used languages, with 330 million people using it as their first language, as shown in Table I, plus 1.4 billion more using it as a secondary language [1]. The majority of speakers are located across twenty-two nations, primarily in the Middle East, North Africa and Asia, and the United Nations considers Arabic one of its five official languages. Arabic belongs to the Semitic language family, which includes Tigrinya, Amharic, Hebrew, etc., and shares almost the same structure as those languages. It has 28 letters, two genders (feminine and masculine), and singular, dual and plural forms. Arabic is written from right to left; its basic grammatical structure is verb-subject-object (VSO), with other structures such as VOS, VO and SVO also in use [2]-[4].

TABLE I. THE MOST UNIVERSALLY USED LANGUAGES
Rank Language Users (millions)
This research is part of an attempt to discover significant factors of readability for connected expository prose. Keeping the content of two paragraphs identical, I varied their forms using the Functional Sentence Perspectivists' rule for relating old and new information within the sentences of discourse. The rule-governed form contains a chain of old and new information; in the variant this chain is disrupted. In two tests involving subjective readability decisions, a significant number of the 272 high-school subjects found the rule-governed paragraph more readable than the variant. This is additional evidence that we should follow the Functional Sentence Perspectivists' rule in writing discourse.