Learning string-edit distance

Ristad, Eric Sven; Yianilos, P.N.

doi:10.1109/34.682181

Cited by 641 publications

(363 citation statements)

References 23 publications

Supporting

Mentioning

360

Contrasting

Unclassified


“…We also identified the diameter of variation sets, which we quantified in terms of normalized Levenshtein distance (Ristad & Yianilos, 1998). To allow comparison of edit distance across sentences of different lengths, we normalized it by dividing the raw edit distance by the length of the longest of the two sequences, which brings it into the range between 0 and 1.…”

Section: Variation Sets in CHILDES (mentioning)

confidence: 99%
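The normalization described in the statement above can be sketched as follows. This is a minimal illustration assuming unit-cost insertions, deletions, and substitutions; the function names `levenshtein` and `normalized_levenshtein` are hypothetical and are not taken from the cited papers. Dividing by the length of the longer sequence keeps the result between 0 and 1.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between two sequences (two-row dynamic program)."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def normalized_levenshtein(a, b):
    """Raw edit distance divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    # Two empty sequences are identical; define their distance as 0.
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


# Works on strings (character level) or on token lists (word level).
print(levenshtein("kitten", "sitting"))  # 3
print(normalized_levenshtein("you want the ball".split(),
                             "want the red ball".split()))  # 0.5
```

Whether the sequences are compared at the character level or the word level is a modeling choice; the sketch accepts either.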

An empirical generative framework for computational modeling of language acquisition

et al. 2010


Abstract. This paper reports progress in developing a computer model of language acquisition in the form of (1) a generative grammar that is (2) algorithmically learnable from realistic corpus data, (3) viable in its large-scale quantitative performance and (4) psychologically real. First, we describe new algorithmic methods for unsupervised learning of generative grammars from raw CHILDES data and give an account of the generative performance of the acquired grammars. Next, we summarize findings from recent longitudinal and experimental work that suggests how certain statistically prominent structural properties of child-directed speech may facilitate language acquisition. We then present a series of new analyses of CHILDES data indicating that the desired properties…


“…Some recent work tried to overcome the previously mentioned drawbacks by automatically learning the primitive edit costs, rather than hand-tuning them for each domain. Several probabilistic models have been proposed to learn a stochastic ED in the form of stochastic transducers [9,1,8], conditional random fields (CRF) [7], or pair-Hidden Markov Models (pair-HMM) [5]. These models provide a probability distribution over the edit operations and thus over the string pairs.…”

Section: Introduction (mentioning)

confidence: 99%

“…The motivations that justify the learning of such a transducer are the following. First, we think that an efficient way to model a stochastic ED actually consists in viewing it as a stochastic transduction between the input X and output Y alphabets [8,9]. In other words, it means that the relation constituted by a set of (input,output ) strings can be compiled in the form of a 2-tape automaton, called a stochastic finite-state transducer.…”

Section: Introduction (mentioning)

confidence: 99%

“…In other words, it means that the relation constituted by a set of (input,output ) strings can be compiled in the form of a 2-tape automaton, called a stochastic finite-state transducer. The interpretation of the ED as a stochastic transduction naturally leads to two possible string distances [9]: the first one describes the most likely transduction between the two strings, while the second is defined by aggregating all transductions between them. In this paper, we focus on the first stochastic distance, a so-called Viterbi Edit Distance [9].…”

Section: Introduction (mentioning)

confidence: 99%

“…The interpretation of the ED as a stochastic transduction naturally leads to two possible string distances [9]: the first one describes the most likely transduction between the two strings, while the second is defined by aggregating all transductions between them. In this paper, we focus on the first stochastic distance, a so-called Viterbi Edit Distance [9]. We motivate this choice by the fact that we will use an adaptation of the well-known Viterbi algorithm for learning the structure and the parameters of the conditional edit transducer.…”

Section: Introduction (mentioning)

confidence: 99%
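The two stochastic distances contrasted in the statements above can be sketched with a single dynamic program. This is an illustrative sketch only, assuming a memoryless joint edit model in which substitution, deletion, insertion, and termination probabilities sum to 1; the toy probabilities over the alphabet {a, b} and all function names are made up for the example and are not taken from the cited papers. Taking the maximum over edit sequences gives the Viterbi distance, and summing over all of them gives the aggregated distance.

```python
import math

# Toy model over the alphabet {a, b}; the values are invented and sum to 1.
SUB = {("a", "a"): 0.3, ("b", "b"): 0.3, ("a", "b"): 0.05, ("b", "a"): 0.05}
DEL = {"a": 0.05, "b": 0.05}   # probability of deleting an input symbol
INS = {"a": 0.05, "b": 0.05}   # probability of inserting an output symbol
STOP = 0.1                     # probability of the termination symbol


def edit_probability(x, y, sub, dele, ins, stop, combine):
    """Probability of the pair (x, y) under a memoryless edit model.

    combine=sum aggregates every edit sequence that turns x into y;
    combine=max keeps only the single most likely edit sequence.
    """
    table = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    table[0][0] = 1.0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0:  # delete x[i-1]
                options.append(dele.get(x[i - 1], 0.0) * table[i - 1][j])
            if j > 0:  # insert y[j-1]
                options.append(ins.get(y[j - 1], 0.0) * table[i][j - 1])
            if i > 0 and j > 0:  # substitute x[i-1] by y[j-1] (or copy it)
                options.append(sub.get((x[i - 1], y[j - 1]), 0.0)
                               * table[i - 1][j - 1])
            table[i][j] = combine(options)
    return table[-1][-1] * stop


def neg_log(p):
    """Turn a probability into a distance; zero probability maps to infinity."""
    return math.inf if p == 0.0 else -math.log(p)


def viterbi_distance(x, y):
    return neg_log(edit_probability(x, y, SUB, DEL, INS, STOP, max))


def stochastic_distance(x, y):
    return neg_log(edit_probability(x, y, SUB, DEL, INS, STOP, sum))


print(viterbi_distance("ab", "ab"), stochastic_distance("ab", "ab"))
```

Because the sum over all edit sequences is at least as large as its largest term, the aggregated distance is never greater than the Viterbi distance for the same pair.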


A Discriminative Model of Stochastic Edit Distance in the Form of a Conditional Transducer

Marc, Janodet, Sebban 2006

Grammatical Inference: Algorithms and Applications


Abstract. Many real-world applications such as spell-checking or DNA analysis use the Levenshtein edit-distance to compute similarities between strings. In practice, the costs of the primitive edit operations (insertion, deletion and substitution of symbols) are generally hand-tuned. In this paper, we propose an algorithm to learn these costs. The underlying model is a probabilistic transducer, computed by using grammatical inference techniques, that allows us to learn both the structure and the probabilities of the model. Beyond the fact that the learned transducers are neither deterministic nor stochastic in the standard terminology, they are conditional, thus independent from the distributions of the input strings. Finally, we show through experiments that our method allows us to design cost functions that depend on the string context where the edit operations are used. In other words, we get kinds of context-sensitive edit distances.
