Optical Character Recognition (OCR) is a technology that recognizes printed characters using photoelectric devices and computer software. It converts previously scanned images or text into machine-readable, editable text. Numerous OCR tools are available for commercial and research use, some free and others requiring purchase. The accuracy of an OCR tool's output also depends on its pre-processing and segmentation algorithms. This study investigates the performance of OCR tools in converting the Parliamentary Reports of Hansard Malaysia for the development of the Malaysian Hansard Corpus (MHC). Comparing four OCR tools, the study converted ten Parliamentary Reports, totalling 62 pages, to measure the conversion accuracy and error rate of each tool. All of the tools were used to convert Adobe Portable Document Format (PDF) files into plain text (txt) files. The objective is to give an overview, based on accuracy and error rate, of how each OCR tool works and how it can be used to support corpus building. The results indicate that the tools differ in accuracy and error rate when converting whole documents from PDF into plain text. The study proposes that corpus building becomes easier and more manageable when a researcher understands how an OCR tool works and can therefore choose the best tool before corpus development begins.
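The abstract above does not say how accuracy and error rate were computed. One common choice, shown here purely as an assumption, is the character error rate: the Levenshtein edit distance between a hand-corrected reference and the OCR output, divided by the length of the reference. The function names and sample strings below are hypothetical and are not taken from the study.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by reference length (assumed metric)."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: a manually verified line and an OCR rendering of it.
reference = "Dewan Rakyat"
ocr_output = "Dewan Rakvat"
cer = char_error_rate(reference, ocr_output)
print(f"error rate: {cer:.3f}, accuracy: {1 - cer:.3f}")
```

Because insertions count as errors, this rate can exceed 1 when a tool emits far more text than the reference contains, so an accuracy defined as one minus the rate should be read with that in mind.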
Air, or its English equivalent 'water', is so important in everyday life that when the taps run dry it becomes a topic of debate among politicians. This paper examines the issues surrounding air/water in Malaysian Parliamentary debates, focusing specifically on its relation to the state of Selangor. The air/water-related issues were examined based on the collocates of air and Selangor in the Malaysian Hansard Corpus (MHC) from Parliament 1 (P1) to Parliament 13 (P13). The findings show that air is consistently present as a collocate of Selangor from Parliament 4 (P4) to P13, with an upward trend that begins in Parliament 7 (P7) and continues to P13. The recurring issues during those periods are the persistent water-related problems and the steps taken by the government to overcome them. In P7 and P8, the focus is on the source of water, as air collocates with pembersihan logi air (water treatment plant) and kawasan tadahan air (water catchment area). In Parliaments 10 to 13, the recurring issue involving air and Selangor is penyaluran air mentah (the transfer of raw water) from the neighbouring state of Pahang to Selangor. Another issue is penstrukturan air (the restructuring of water supply and services), which is first observed in Parliament 12 and continues in Parliament 13. By focusing on the collocates of air, this corpus-driven account shows the trend of the parliamentary debates in relation to air and Selangor. Parliamentary debates, in which various issues of national interest are often raised, therefore offer opportunities for more critical analysis of issues that are important to the public.
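The collocate analysis described above can be illustrated with a minimal sketch that counts the words occurring within a fixed window around a node word. The window size, the use of raw co-occurrence counts rather than an association measure, and the sample sentence are all assumptions made for illustration. The abstract does not state which settings or software were used.

```python
from collections import Counter


def collocates(tokens, node, window=4):
    """Count words occurring within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the node occurrence itself
                counts[tokens[j]] += 1
    return counts


# Hypothetical tokenised text, lowercased; not taken from the MHC.
text = "masalah bekalan air di selangor dan penyaluran air mentah ke selangor"
tokens = text.split()
counts = collocates(tokens, "selangor", window=4)
print(counts.most_common(3))
```

Running the same count separately on the text of each parliament, and normalising by the number of node occurrences, would give a simple way to compare how strongly air is associated with Selangor over time.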
In this study, we propose an alternative approach to analyzing a domain-specific time-series corpus to detect word evolution. The method trains a temporal word embedding (TWE) model on a target time-series corpus. The advantage of a TWE model is that it shows how the meaning of a word changes over time. We chose the TWEC approach to model a Malay domain-specific time-series corpus, the Malaysian Hansard Corpus (MHC), as a TWE model, which we call MHC-TWEC. Two primary analyses, self-similarity analysis and user-defined method analysis, were performed to validate the effectiveness of the MHC-TWEC model in visually quantifying semantic shift in the MHC. These analyses show visually that the TWE model can capture semantic shift in the temporal corpus (the MHC).
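As a rough illustration of self-similarity analysis, the sketch below compares a word's vector in the first time slice with its vector in each later slice using cosine similarity. A falling similarity suggests semantic shift. It assumes the slice embeddings are already aligned in a shared space, which is what a TWEC-style model is meant to provide. The choice of the first slice as the baseline, the function names, and the toy vectors are hypothetical and are not taken from the MHC-TWEC model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("zero-length vector has no direction")
    return dot / norm


def self_similarity(vectors_by_period):
    """Similarity of a word's first-period vector to each later period.

    `vectors_by_period` maps period labels, in chronological order,
    to that word's embedding in the corresponding time slice.
    """
    periods = list(vectors_by_period)
    base = vectors_by_period[periods[0]]
    return {p: cosine(base, vectors_by_period[p]) for p in periods[1:]}


# Toy two-dimensional vectors for one word across three slices (made up).
toy = {"P1": [1.0, 0.0], "P2": [1.0, 0.0], "P3": [0.0, 1.0]}
sims = self_similarity(toy)
for period, sim in sims.items():
    print(period, round(sim, 3))
```

Plotting these values against the period labels gives the kind of visual trace of semantic shift that the abstract refers to.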