SUMMARY: The Malay language has no conjugations or declensions; instead, affixes carry important grammatical functions. The same word may function as a noun, an adjective, an adverb, or a verb, depending on its position in the sentence. Although simple root words are used extensively in informal conversation, precise words are essential in formal speech and written texts. To make sentences clear, Malay uses derivative words, and derivation is achieved mainly through affixes. In the written language of educated Malay speakers, a root word has approximately a hundred possible derivative forms, so the composition of Malay words can be complicated. Stemming is the process of reducing word variants to their root forms in order to improve the effectiveness of text processing in information systems, and it is essential to avoid both over-stemming and under-stemming errors. Although several types of stemming algorithms are available for text processing in English and some other languages, they cannot overcome the difficulties of Malay word stemming. We have developed a new Malay stemmer (stemming algorithm) that removes inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The affix rules are intended to reduce under-stemming errors, while the dictionaries are intended to reduce over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software. To demonstrate the effectiveness of our Malay stemming algorithm, the text data used in the experiment were actual web pages collected from the World Wide Web. The experimental results showed that our stemmer can effectively increase the precision of the extracted Boolean expressions for text categorization.
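The combination of affix rules with a root-word dictionary and a derivative-word dictionary can be illustrated with a minimal Python sketch. This is not the authors' stemmer: the function name `stem`, the tiny dictionaries, and the short affix lists are hypothetical and chosen only for illustration, and treating the derivative-word dictionary as a direct lookup from derivative to root is an assumption. The sketch shows how a dictionary check before stripping can guard against over-stemming (for example, cutting "makan" down to "mak"), while the rules cover derivatives that are not listed explicitly.

```python
# Hypothetical, deliberately tiny resources (assumptions for illustration only).
ROOT_WORDS = {"makan", "ajar", "jalan", "baca", "sapu", "main"}
# Assumed role: derivatives that simple stripping cannot resolve, mapped to roots.
DERIVATIVE_WORDS = {"menyapu": "sapu"}
PREFIXES = ("mem", "ber", "me", "di", "pe")
SUFFIXES = ("kan", "an", "i")


def stem(word: str) -> str:
    """Return the root of `word`, or the word itself if no root is found."""
    word = word.lower()

    # 1. A known root is returned untouched, so "makan" is never cut to "mak".
    if word in ROOT_WORDS:
        return word

    # 2. A listed derivative resolves directly to its root.
    if word in DERIVATIVE_WORDS:
        return DERIVATIVE_WORDS[word]

    # 3. Try every prefix/suffix combination that matches the word and accept
    #    a candidate only if the root-word dictionary confirms it.
    prefix_options = [""] + [p for p in PREFIXES if word.startswith(p)]
    suffix_options = [""] + [s for s in SUFFIXES if word.endswith(s)]
    for prefix in prefix_options:
        for suffix in suffix_options:
            if not prefix and not suffix:
                continue  # nothing would be stripped
            candidate = word[len(prefix):len(word) - len(suffix)]
            if candidate in ROOT_WORDS:
                return candidate

    # 4. No confirmed root: leave the word unchanged rather than guess.
    return word


if __name__ == "__main__":
    for w in ["makan", "makanan", "membaca", "berjalan", "dimainkan", "menyapu"]:
        print(w, "->", stem(w))
```

Returning the word unchanged in step 4 is a conservative choice: an unconfirmed strip would trade an under-stemming error for a possible over-stemming error.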
Writing an effective syllabus is critically important for instructors to provide effective education at universities. However, little is known about how to create a well-written syllabus, so it is necessary to elucidate what kind of information a syllabus must include. To this end, we focus on the searchable information in syllabi and analyze an actual syllabus collection that includes 6,493 syllabus documents from a national university in Japan. First, we investigate syllabus classification and syllabus search by using established text mining methods and an information retrieval method. The results of our experiments demonstrate that (i) knowledge discovery from syllabus documents is a challenging and non-trivial task, and (ii) adding just one particular word can increase searchability in syllabus search. Next, we investigate methods that provide word suggestions using deep learning approaches and large text corpora. In this experiment, we used a bibliographic database of university libraries in Japan, which contains 3,990,646 bibliographic entries, and a version of Japanese Wikipedia, which contains 2,351,545 articles. The results indicate that (iii) a vocabulary from a bibliographic database of university libraries is effective in improving performance as measured by the mean reciprocal rank, and (iv) a wide range of vocabulary is essential for improving recall in word suggestions.
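The two measures named in results (iii) and (iv) can be made concrete with a short Python sketch. The function names and the toy suggestion lists below are hypothetical and are not taken from the study; the definitions are the standard ones, where the reciprocal rank of a query is 1 divided by the position of the first relevant suggestion (0 if none appears), and recall is the fraction of relevant words that were suggested.

```python
def reciprocal_rank(suggestions, relevant):
    """Return 1/rank of the first relevant suggestion, or 0.0 if none is relevant."""
    for rank, word in enumerate(suggestions, start=1):
        if word in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (suggestions, relevant_set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(s, r) for s, r in runs) / len(runs)


def recall(suggestions, relevant):
    """Return the fraction of relevant words that appear among the suggestions."""
    if not relevant:
        return 0.0
    return len(set(suggestions) & set(relevant)) / len(relevant)


if __name__ == "__main__":
    # Hypothetical word suggestions for three queries (illustrative data only).
    runs = [
        (["algebra", "matrix", "vector"], {"algebra"}),     # first hit at rank 1
        (["history", "culture", "society"], {"culture"}),   # first hit at rank 2
        (["physics", "energy"], {"optics"}),                # no hit
    ]
    print(mean_reciprocal_rank(runs))  # (1 + 1/2 + 0) / 3 = 0.5
    print(recall(["algebra", "matrix"], {"algebra", "vector"}))  # 1 of 2 found = 0.5
```

The mean reciprocal rank rewards placing a relevant word near the top of the list, whereas recall rewards covering more of the relevant words, which is consistent with a wider vocabulary mattering more for the latter.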