Proceedings of the Workshop on Linguistic Distances - LD '06 2006
DOI: 10.3115/1641976.1641984

Evaluation of string distance algorithms for dialectology

Abstract: We examine various string distance measures for suitability in modeling dialect distance, especially its perception. We find measures superior which do not normalize for word length, but which are sensitive to order. We likewise find evidence for the superiority of measures which incorporate a sensitivity to phonological context, realized in the form of n-grams, although we cannot identify which form of context (bigram, trigram, etc.) is best. However, we find no clear benefit in using gradual as opposed to …
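To make the distinctions in the abstract concrete, the following is a minimal Python sketch of three of the contrasts it names: plain (unnormalized) Levenshtein distance, a length-normalized variant, and a bigram-based variant that adds sensitivity to neighbouring context. It is not the paper's implementation. The function names, the `#` padding symbol, unit edit costs, and normalization by the longer word's length are all assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                          # deleting the first i items of a
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1       # binary (non-gradual) segment comparison
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + cost))    # substitution or match
        prev = curr
    return prev[-1]


def normalized(a, b):
    """Assumed normalization: divide by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ngrams(word, n=2, pad="#"):
    """Split a word into overlapping n-grams, padding both edges (assumed scheme)."""
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_distance(a, b, n=2):
    """Edit distance over n-gram sequences, so each segment is compared in context."""
    return levenshtein(ngrams(a, n), ngrams(b, n))
```

For example, `levenshtein("kitten", "sitting")` is 3 on the raw count, while `normalized` scales that count by the longer word's length. `ngram_distance("ab", "ac")` counts two differing bigrams where the unigram distance counts one differing segment.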

Citations: cited by 56 publications (46 citation statements)
References: 11 publications
“…They report on applications in more than a dozen languages, and note that Gooskens and Heeringa (2004) show that Levenshtein distances correlate well (r ≈ 0.7) with naïve speakers' judgments of the degree of dialect differences among Norwegian dialects. Although there have been many dialect studies comparing results from LD-based analyses with others, unfortunately no other direct validation experiments have been conducted with dialectal data from other languages (but see Heeringa et al., 2006). This means that the current paper will importantly supplement the Norwegian research as a validation study.…”
Section: Related Work (mentioning)
confidence: 99%
“…As illustrated in Figure 3, cognate words can be very similar or quite different with respect to their targets, if measured using string similarity algorithms such as the Levenshtein distances (cf. Heeringa, Kleiweg, Gooskens, and Nerbonne 2006). At least from a psycholinguistic point of view it seems reasonable to construe the category of cognate as a radial category with fuzzy boundaries rather than a clearcut category based on genealogical relations across languages.…”
Section: Example 2: Interlingual Inferencing Of Cognate Words (mentioning)
confidence: 99%
“…Some frameworks use training data to semiautomatically find an entity matching strategy to solve a match problem. The quality of the computer string matching process is found to be higher than the manually linked record (done by humans) [17]. TAILOR [18] is a flexible record matching toolbox which allows the users to apply different duplicate detection methods on the datasets.…”
Section:  Soft Tf-idf: This Technique Is Based On Jaromentioning
confidence: 99%