Quantifying Semantic Shift Visually on a Malay Domain Specific Corpus Using Temporal Word Embedding Approach

Tiun, Sabrina; Saad, Saidah; Noor, Nor Fariza Mohd; Jalaludin, Azhar; Rahman, Anis Nadiah Che Abdul

doi:10.17576/apjitm-2020-0902-01

Cited by 1 publication

(1 citation statement)

References 0 publications


“…Even though the data used is from parliament, it is also relevant to general use of the Malay language. In addition, many previous studies have utilised the MHC (Nor Fariza, Anis Nadiah, Azhar, Imran & Sabrina, 2019; Norsimah, Azhar, Anis Nadiah & Imran, 2019; Sabrina, Nor Fariza, Azhar & Anis Nadiah, 2020; Sabrina, Saidah et al., 2020). It is expected that the production of Malay stop words relating to the corpus will assist future research in terms of stop word removal and text processing in general.…”

Section: Introduction (mentioning)

confidence: 99%

Domain-specific Stop Words in Malaysian Parliamentary Debates 1959 – 2018

Rahman, Abdullah, Zainudin et al. 2021

gema

Self Cite


Removal of stop words is essential in Natural Language Processing and text-related analysis. Existing works on Malay stop words are based on standard Malay and Quranic/Arabic translations into Malay. Thus, there is no domain-specific stop word list, which makes the existing lists ill-suited to processing Malay parliamentary discourse. In this paper, we propose a semantic approach towards identifying and removing Malay, conventional Malay spelling and English functional words in analysing a time-series corpus, namely the Malaysian Hansard Corpus (MHC), to extract a Malay domain-specific stop word list. The study utilised a combination of the Z-method of most frequently occurring words, words that appear once, and the classic method. The dataset of the corpus evaluated comprised Parliament 1 (year 1959) to Parliament 13 (year 2018). The study then categorised the stop word list according to domain-specific related words. The resulting list comprised 587 stop words. New stop words that emerged from the MHC include parliamentary-related words like 'Berhormat' (salutation to the members of the Parliament), 'Pertua' (salutation to the Speaker of the House), 'ketawa' (laugh) and 'tepuk' (clap). Other than typical English stop words like 'and' and 'the', there are also words like 'hon'ble' (short for 'Honourable') and 'honourable'. The list also includes stop words in conventional Malay spelling like 'untok' (for), 'lebeh' (more), and 'kapada' (to). The proposed set of stop words can be further utilised to assist natural language processing and text analysis.
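The three sources of candidates named in the abstract (most frequent words, words that appear once, and a classic list) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `candidate_stop_words`, the `top_n` cutoff, the simple tokenizer, and the tiny sample sentences are all assumptions made for illustration, and the abstract does not specify how the three sets were combined or filtered.

```python
from collections import Counter
import re


def candidate_stop_words(documents, top_n=2, classic=frozenset()):
    """Collect stop word candidates from three sources (illustrative only).

    documents: iterable of raw text strings
    top_n:     hypothetical cutoff for the "most frequent" set
    classic:   a pre-existing stop word list (assumed to be supplied by the caller)
    """
    tokens = []
    for doc in documents:
        # Crude tokenizer (an assumption): lowercase letters and apostrophes,
        # so 'di-pertua' splits into 'di' and 'pertua' while "hon'ble" stays whole.
        tokens.extend(re.findall(r"[a-z']+", doc.lower()))

    counts = Counter(tokens)
    most_frequent = {word for word, _ in counts.most_common(top_n)}
    appear_once = {word for word, n in counts.items() if n == 1}
    classic_hits = {word for word in classic if word in counts}
    return {
        "most_frequent": most_frequent,
        "appear_once": appear_once,
        "classic": classic_hits,
    }


# Made-up sample text, not taken from the MHC.
sample = [
    "Yang Berhormat tuan yang di-pertua",
    "yang berhormat ketawa",
]
result = candidate_stop_words(sample, top_n=2, classic={"yang", "dan"})
```

In practice the candidate sets would still need manual review, since the study categorised its final list by domain rather than relying on frequency alone.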
