A Lazy Man’s Way to Part-of-Speech Tagging

Zamin, Norshuhani; Oxley, Alan; Bakar, Zainab Abu; Farhan, Syed Ahmad

doi:10.1007/978-3-642-32541-0_9

Cited by 7 publications

(6 citation statements)

References 11 publications

“…The tagging accuracy for using a word's starting information is 39.02% on the third iteration as compared to using a word's ending information, which is 38.36% on the fourth iteration. The difference of 0.66% reflects about 105 tokens (out of 15,882). This finding strengthens the argument to use words' starting information for character-based prediction of unknown words' POS.…”

Section: Results (mentioning)

confidence: 99%
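
As a quick check of the arithmetic in the statement above, the following minimal Python sketch converts the 0.66 percentage-point gap into a token count. The variable names are illustrative; only the three figures come from the quoted text.

```python
# Figures taken from the quoted citation statement.
total_tokens = 15_882          # size of the evaluated token set
accuracy_start = 39.02         # % accuracy using a word's starting characters
accuracy_end = 38.36           # % accuracy using a word's ending characters

# Gap in percentage points, then the number of tokens that gap represents.
gap_points = accuracy_start - accuracy_end
gap_tokens = round(total_tokens * gap_points / 100)

print(f"gap: {gap_points:.2f} points, about {gap_tokens} tokens")
```

The product is roughly 104.8, which rounds to the "about 105 tokens" reported in the statement.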

“…This estimation is recursively calculated by considering the marginal distribution of tags ( ) produced by HMM training, formulated in Equation (2) and the standard division in Equation (15) to every successive character.…”

Section: Predicting POS Through a Word's Starting (mentioning)

confidence: 99%

“…On the other hand, [14] uses a morphological analyser and applies a machine learning technique. The other related work is [15], which applies a statistical unsupervised method using N-grams and the Dice Coefficient for similarity measurement. The other proposed methods for Malay POS tagging are based on supervised methods [16][17] and syntactic drift with a data-driven approach [18][19].…”

Section: Introduction (mentioning)

confidence: 99%

Penalizing unknown words’ emissions in hmm pos tagger based on Malay affix morphemes

Mohamed

Omar

Aziz

2018

J. Fundam and Appl Sci.

The challenge in unsupervised Hidden Markov Model (HMM) training for a POS tagger is that the training depends on an untagged corpus; the only supervised data limiting the possible tagging of words is a dictionary. Therefore, training cannot properly map possible tags. The exact morphemes of prefixes, suffixes and circumfixes in the agglutinative Malay language are examined to assign unknown words' probable tags based on linguistically meaningful affixes, using a morpheme-based POS guessing algorithm. The algorithm has been integrated into the Viterbi algorithm, which uses HMM-trained parameters for tagging new sentences. In the experiment, the tagger first uses character-based prediction to handle unknown words; next, it uses the morpheme-based POS guessing algorithm; lastly, it uses a combination of the first and second.
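
To make the idea of affix-based POS guessing more concrete, here is a minimal Python sketch. It is not the algorithm from the paper: the affix table, the tag names (`NOUN`, `VERB`, `ADJ`), the matching order, and the function `guess_tags` are all assumptions chosen for illustration, using a handful of well-known Malay affixes. Candidate tags of this kind could, in principle, restrict the emissions considered for an unknown word during Viterbi decoding.

```python
# Hypothetical affix table for illustration only; the paper's actual
# morpheme inventory and tagset are not reproduced here.
CIRCUMFIXES = [("ke", "an", "NOUN"), ("pe", "an", "NOUN")]
PREFIXES = [("ber", "VERB"), ("me", "VERB")]
SUFFIXES = [("an", "NOUN")]
OPEN_CLASS = ["NOUN", "VERB", "ADJ"]  # fallback when no affix matches


def guess_tags(word):
    """Return candidate POS tags for an unknown word from its affixes."""
    w = word.lower()
    # Circumfixes are checked first because they are the most specific cue.
    for pre, suf, tag in CIRCUMFIXES:
        if w.startswith(pre) and w.endswith(suf) and len(w) > len(pre) + len(suf):
            return [tag]
    tags = []
    for pre, tag in PREFIXES:
        if w.startswith(pre) and len(w) > len(pre) and tag not in tags:
            tags.append(tag)
    for suf, tag in SUFFIXES:
        if w.endswith(suf) and len(w) > len(suf) and tag not in tags:
            tags.append(tag)
    # With no affix evidence, fall back to all open-class tags.
    return tags or list(OPEN_CLASS)


print(guess_tags("keadilan"))  # circumfix ke-...-an
print(guess_tags("bermain"))   # prefix ber-
```

A surface match like this will over-generate (any word that merely ends in "an" is treated as suffixed), which is one reason a real system would combine it with other evidence such as the character-based prediction mentioned in the abstract.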

“…A module to filter unwanted names, those with low confidence, so as to increase the tagger's performance rate has been integrated to the framework. How the hybrid approach handles the projection of linguistic tags for both POS tagging (Zamin et al, 2012a; Zamin et al, 2012b) and NER tagging at a fairly accurate rate have been successfully demonstrated.…”

Section: mentioning

confidence: 96%

“…Figure 3 shows an example of bigram pair-wise matching for the word 'the' against the lexemes 'unbelievable' and 'unreliable.' The technical details of this algorithm are given by Zamin, Oxley, Abu Bakar & Farhan (2012b) with worked examples. All the proper names appearing in the corpus are also stored as they are in our lexicon.…”

Section: Word Aligner (mentioning)

confidence: 99%

Projecting Named Entity Tags From a Resource Rich Language to a Resource Poor Language

Zamin

Oxley

Bakar

2013

Journal of Information and Communication Technology

Self Cite

Named Entities (NE) are the prominent entities appearing in textual documents. Automatic classification of NE in a textual corpus is a vital process in Information Extraction and Information Retrieval research. Named Entity Recognition (NER) is the identification of words in text that correspond to a pre-defined taxonomy such as person, organization, location, date, time, etc. This article focuses on the person (PER), organization (ORG) and location (LOC) entities for a Malay journalistic corpus of terrorism. A projection algorithm, using the Dice Coefficient function and bigram scoring method with domain-specific rules, is suggested to map the NE information from the English corpus to the Malay corpus of terrorism. The English corpus is the translated version of the Malay corpus. Hence, these two corpora are treated as parallel corpora. The method computes the string similarity between the English words and the list of available lexemes in a pre-built lexicon that approximates the best NE mapping. The algorithm has been effectively evaluated using our own terrorism tagged corpus; it achieved satisfactory results in terms of precision, recall, and F-measure. An evaluation of the selected open source NER tool for English is also presented.
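
The string-similarity step described in this abstract can be sketched with the standard Dice coefficient over character bigrams, 2|X ∩ Y| / (|X| + |Y|). The Python below is a minimal illustration under stated assumptions: bigrams are treated as sets, the function names and the tiny lexicon are hypothetical, and the article's domain-specific rules and exact scoring are not reproduced.

```python
def char_bigrams(word):
    """Set of adjacent character pairs in a lower-cased word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def dice(a, b):
    """Dice coefficient between the bigram sets of two strings."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x or not y:
        return 0.0  # strings shorter than two characters have no bigrams
    return 2 * len(x & y) / (len(x) + len(y))


def best_match(word, lexicon):
    """Return the (lexeme, score) pair with the highest Dice score."""
    return max(((lex, dice(word, lex)) for lex in lexicon), key=lambda p: p[1])


# Hypothetical lexicon entries, for illustration only.
lexicon = ["polis", "organisasi", "lokasi"]
print(best_match("organization", lexicon))  # ('organisasi', 0.5)
```

Here "organization" and "organisasi" share five bigrams (or, rg, ga, an, ni) out of 11 and 9 respectively, giving 2 × 5 / 20 = 0.5, while the other two entries share none.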
