In this paper we present some experiments on the use of a probabilistic model to tag English text, i.e. to assign to each word the correct tag (part of speech) in the context of the sentence. The main novelty of these experiments is the use of untagged text in the training of the model. We have used a simple triclass Markov model and are looking for the best way to estimate the parameters of this model, depending on the kind and amount of training data provided. Two approaches in particular are compared and combined:

• using text that has been tagged by hand and computing relative frequency counts,

• using text without tags and training the model as a hidden Markov process, according to a Maximum Likelihood principle.

Experiments show that the best training is obtained by using as much tagged text as possible. They also show that Maximum Likelihood training, the procedure that is routinely used to estimate hidden Markov model parameters from training data, will not necessarily improve the tagging accuracy. In fact, it will generally degrade this accuracy, except when only a limited amount of hand-tagged text is available.
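To make the first estimation strategy concrete, here is a minimal Python sketch of relative-frequency estimation of a triclass model from hand-tagged text. It is an illustration under our own assumptions, not the authors' implementation: the function and variable names are hypothetical, and the start-symbol padding convention is ours.

```python
# Minimal sketch (assumed, not the paper's code): relative-frequency
# estimation of a triclass tagging model from hand-tagged sentences.
from collections import Counter

def estimate_triclass(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (word, tag) pairs."""
    trigram = Counter()    # counts of (t_{i-2}, t_{i-1}, t_i)
    bigram = Counter()     # counts of the conditioning pair (t_{i-2}, t_{i-1})
    emit = Counter()       # counts of (tag, word)
    tag_count = Counter()  # counts of each tag

    for sent in tagged_sentences:
        # Pad with start symbols so the first real tags have a full history.
        tags = ["<s>", "<s>"] + [t for _, t in sent]
        for i in range(2, len(tags)):
            trigram[(tags[i - 2], tags[i - 1], tags[i])] += 1
            bigram[(tags[i - 2], tags[i - 1])] += 1
        for w, t in sent:
            emit[(t, w)] += 1
            tag_count[t] += 1

    # Relative-frequency estimates:
    #   P(t_i | t_{i-2}, t_{i-1}) and P(w_i | t_i)
    trans = {k: v / bigram[k[:2]] for k, v in trigram.items()}
    emis = {k: v / tag_count[k[0]] for k, v in emit.items()}
    return trans, emis
```

The second approach described in the abstract would instead start from some initial parameters and replace these observed counts with expected counts computed over untagged text, iterating in the Forward-Backward (Baum-Welch) fashion that Maximum Likelihood training of a hidden Markov model implies.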
Introduction

A lot of effort has been devoted in the past to the problem of tagging text, i.e. assigning to each word the correct tag (part of speech) in the context of the sentence. Two main approaches have generally been considered: