1995 International Conference on Acoustics, Speech, and Signal Processing
DOI: 10.1109/icassp.1995.479391

Language modeling by variable length sequences: theoretical formulation and evaluation of multigrams


Cited by 86 publications (58 citation statements)
References 8 publications
“…These strings should rather work as individual vocabulary items in the model. It has been shown that increased performance of n-gram models can be obtained by adding larger units consisting of common word sequences to the vocabulary; see e.g., (Deligne and Bimbot, 1995). Nevertheless, in the near future we wish to explore possibilities of using complementary and more standard evaluation measures, such as precision, recall, and F-measure of the discovered morph boundaries.…”
Section: Discussion
confidence: 99%
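The vocabulary-augmentation idea mentioned in the statement above can be illustrated with a short sketch. This is not code from the cited papers; the corpus format, the frequency threshold, and the underscore-joining convention are assumptions made here for illustration only.

```python
# Minimal sketch (assumed details, not the cited authors' method):
# add frequent word pairs to the vocabulary as single units.
from collections import Counter

def merge_frequent_bigrams(sentences, min_count=5):
    """Replace word pairs occurring at least `min_count` times with a
    single vocabulary item (joined by '_'). `sentences` is a list of
    word lists; `min_count` is an arbitrary illustrative threshold."""
    pair_counts = Counter()
    for words in sentences:
        pair_counts.update(zip(words, words[1:]))
    frequent = {p for p, c in pair_counts.items() if c >= min_count}

    merged_corpus = []
    for words in sentences:
        out, i = [], 0
        while i < len(words):
            # Greedy left-to-right merge of a frequent pair into one token.
            if i + 1 < len(words) and (words[i], words[i + 1]) in frequent:
                out.append(words[i] + "_" + words[i + 1])
                i += 2
            else:
                out.append(words[i])
                i += 1
        merged_corpus.append(out)
    return merged_corpus
```

An n-gram model trained on the merged corpus then treats common word sequences (e.g. "new_york") as ordinary vocabulary items.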
“…Many unsupervised methods have been proposed for segmenting raw character sequences with no boundary information into words [1,2,4,5,8,14,15]. Brent [1] gives a good survey of these methods.…”
Section: Of Tokens (T H E M O S T F A V O U R I T E M U S I C O F A L)
confidence: 99%
“…Therefore, we partition our data into meaningful ngrams first. Based on the work of Deligne and Bimbot [35], we compute multigram models for the documents in our corpus the following way: Each sentence is considered as a sequence of n-grams with variable length. The likelihood of a sentence is computed by summing up the individual likelihoods of the n-grams corresponding to each possible segmentation of the sentence.…”
Section: B. Preprocessing
confidence: 99%
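The likelihood computation described in that statement (summing, over every possible segmentation of a sentence into variable-length units, the product of the unit probabilities) admits a simple forward dynamic program. The sketch below is a hedged illustration of that idea, not the cited implementation; `unit_probs` and `max_len` are hypothetical names introduced here.

```python
# Sketch of the multigram sentence likelihood described above.
# `unit_probs` maps a tuple of words to its multigram probability
# (assumed given); unseen units contribute nothing.

def multigram_likelihood(words, unit_probs, max_len=3):
    """Forward dynamic program:
    like[t] = sum over lengths k of P(unit ending at position t) * like[t-k],
    which totals the product-of-unit probabilities over all segmentations."""
    n = len(words)
    like = [0.0] * (n + 1)
    like[0] = 1.0  # empty prefix has likelihood 1
    for t in range(1, n + 1):
        for k in range(1, min(max_len, t) + 1):
            unit = tuple(words[t - k:t])
            p = unit_probs.get(unit, 0.0)
            if p > 0.0:
                like[t] += p * like[t - k]
    return like[n]
```

For example, with `unit_probs = {('a',): 0.5, ('b',): 0.5, ('a', 'b'): 0.25}`, the sentence `['a', 'b']` gets likelihood 0.25 + 0.25 = 0.5, summing its two segmentations {a}{b} and {a b}.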