Automatic syllabification of words is challenging, not least because the syllable is not easy to define precisely. Consequently, no accepted standard algorithm for automatic syllabification exists. There are two broad approaches: rule-based and data-driven. The rule-based method effectively embodies some theoretical position regarding the syllable, whereas the data-driven paradigm tries to infer “new” syllabifications from examples assumed to be correctly syllabified already. This article compares the performance of several variants of the two basic approaches. Given the problems of definition, it is difficult to determine a correct syllabification in all cases. It is therefore also difficult to establish the quality of the “gold standard” corpus, which serves both to evaluate the output of an automatic algorithm quantitatively and as the example set on which data-driven methods crucially depend. Thus, we look for consensus among the entries in multiple lexical databases of pre-syllabified words. In this work, we used two independent lexicons and extracted from them the same 18,016 words with their corresponding (possibly different) syllabifications. We also created a third lexicon comprising the 13,594 words that share the same syllabification in both sources. In addition to two rule-based approaches (Hammond's, and Fisher's implementation of Kahn's), three data-driven techniques are evaluated: a look-up procedure, an exemplar-based generalization technique, and syllabification by analogy (SbA). The results on the three databases show consistent and robust patterns. First, the data-driven techniques outperform the rule-based systems in word and juncture accuracy by a very significant margin, but they require training data and are slower. Second, syllabification in the pronunciation domain is easier than in the spelling domain. Finally, the best results are consistently obtained with SbA.
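The consensus lexicon and the two evaluation measures can be illustrated with a minimal Python sketch. It is not the article's implementation. The function names, the hyphen as syllable separator, and the toy entries are all hypothetical. Juncture accuracy is assumed here to mean the proportion of positions between adjacent symbols that are correctly labelled as boundary or non-boundary.

```python
def boundaries(syllabified: str, sep: str = "-") -> set[int]:
    """Return the juncture indices (counted in symbols) that carry a boundary."""
    positions = set()
    index = 0
    for ch in syllabified:
        if ch == sep:
            positions.add(index)  # boundary falls after `index` symbols
        else:
            index += 1
    return positions


def consensus(lex_a: dict[str, str], lex_b: dict[str, str]) -> dict[str, str]:
    """Keep only the words whose syllabification is identical in both lexicons."""
    return {word: syl for word, syl in lex_a.items() if lex_b.get(word) == syl}


def word_accuracy(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of words whose whole syllabification matches the gold entry."""
    return sum(predicted[w] == gold[w] for w in gold) / len(gold)


def juncture_accuracy(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of junctures labelled correctly (boundary vs. no boundary)."""
    correct = total = 0
    for word, gold_syl in gold.items():
        n_junctures = len(word) - 1  # assumes the key itself contains no separator
        # Symmetric difference = junctures where prediction and gold disagree.
        wrong = len(boundaries(gold_syl) ^ boundaries(predicted[word]))
        correct += n_junctures - wrong
        total += n_junctures
    return correct / total


# Toy data for illustration only; not taken from the lexicons used in the article.
lex_a = {"window": "win-dow", "syllable": "syl-la-ble", "happy": "hap-py"}
lex_b = {"window": "win-dow", "syllable": "syl-lab-le", "happy": "hap-py"}
predicted = {"window": "win-dow", "syllable": "syl-lab-le", "happy": "ha-ppy"}

shared = consensus(lex_a, lex_b)
word_acc = word_accuracy(predicted, lex_a)
junct_acc = juncture_accuracy(predicted, lex_a)
```

Under this assumed definition, a single misplaced boundary counts as two juncture errors (one missed, one spurious), so juncture accuracy penalizes a shifted boundary more than a simple omission.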