2019
DOI: 10.3758/s13428-019-01223-3

WordSeg: Standardizing unsupervised word form segmentation from text

Abstract: A basic task in first language acquisition likely involves discovering the boundaries between words or morphemes in input where these basic units are not overtly segmented. A number of unsupervised learning algorithms have been proposed in the last 20 years for these purposes, some of which have been implemented computationally, but whose results remain difficult to compare across papers. We created a tool that is open source, enables reproducible results, and encourages cumulative science in this domain. Word…

Cited by 23 publications (22 citation statements)
References 30 publications
“…First, we used the automatic utterance boundaries provided by the LENA software (“A,” short for “automatic boundaries”), as well as combined together the text from segments labeled as continuations of each other by coders (“H” for “human boundaries”). Second, since performance is dependent on corpus size (see Bernard et al, 2018), we had three versions of each CDS corpus: the full one, a shortened CDS corpus to match the ADS corpus in number of words, and a shortened CDS corpus to match the ADS corpus in number of utterances. After crossing these two factors, performance could be compared between, on the one hand, ADS-A/H (ADS with automatic or human utterance boundaries), and, on the other hand, one of (1) CDS-A/H-full (corresponding full CDS corpus), (2) CDS-A/H-WM (cut at the same number of word tokens found in the corresponding ADS), or (3) CDS-A/H-UM (cut at the same number of utterances).…”
Section: Methods
Citation type: mentioning
confidence: 99%
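The corpus matching described in the statement above can be sketched in a few lines of Python. This is a minimal illustration, not the citing authors' code: it assumes a corpus is a list of utterances with words separated by spaces, and that the word-matched version truncates the final utterance so the token count is exact. The function names (`count_words`, `match_utterances`, `match_words`) are hypothetical.

```python
def count_words(corpus):
    """Total number of word tokens in a list of space-separated utterances."""
    return sum(len(utt.split()) for utt in corpus)


def match_utterances(cds, ads):
    """Keep as many CDS utterances as there are ADS utterances (the "UM" version)."""
    return cds[:len(ads)]


def match_words(cds, ads):
    """Keep CDS words until the ADS token count is reached (the "WM" version).

    Assumption: the last kept utterance is cut mid-utterance if needed,
    so that the token counts match exactly.
    """
    budget = count_words(ads)
    matched = []
    for utt in cds:
        if budget <= 0:
            break
        words = utt.split()[:budget]
        if words:
            matched.append(" ".join(words))
        budget -= len(words)
    return matched


ads = ["i think so", "well yes"]
cds = ["where is the doggy", "there he is", "good"]
cds_um = match_utterances(cds, ads)
cds_wm = match_words(cds, ads)
```

If the CDS corpus is shorter than the ADS corpus on the chosen measure, both functions simply return the CDS corpus unchanged.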
“…Each algorithm (with default parameters, except as noted below) was run using the WordSeg package (Bernard et al, 2018), which also performs the evaluation. Due to space restrictions, we cannot provide fuller descriptions here, but we refer readers to Bernard et al (2018), where the algorithms and the evaluation are explained. In a nutshell, both training and evaluation are done over the whole corpus because these algorithms are unsupervised, and thus there is no risk of overfitting.…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…For lack of space, we will only briefly describe the algorithms drawn from WordSeg (see Johnson and Goldwater 2009; Monaghan and Christiansen 2010; Lignos 2012; Daland and Zuraw 2013; Saksida et al 2017; Bernard et al 2018). All algorithms were used with their default parameters.…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…Most previous computational research has used as input texts representing phonologized language, that is, sequences of phonemes with no overt word boundaries, and the task is to retrieve these. Several algorithms inspired by laboratory research on infant word segmentation are currently represented in WordSeg, an open source package (Bernard et al, 2018). Are such algorithms as robust to cross-linguistic variation as human infants are?…”
Section: Unsupervised Bottom-up Segmentation Across Languages
Citation type: mentioning
confidence: 99%
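The task described in the statement above (recovering word boundaries from unsegmented input) is usually scored by comparing predicted words against gold words. The sketch below is an illustration under stated assumptions, not the WordSeg package's API: each character stands for one phoneme, spaces mark word boundaries, and a predicted word counts as correct only if both of its edges match the gold segmentation. The names `word_spans` and `token_scores` are hypothetical.

```python
def word_spans(utterance):
    """Return the (start, end) span of each word, counting positions without spaces."""
    spans, pos = set(), 0
    for word in utterance.split():
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def token_scores(gold, predicted):
    """Token precision, recall, and F-score over paired utterances."""
    hits = n_gold = n_pred = 0
    for gold_utt, pred_utt in zip(gold, predicted):
        gold_spans = word_spans(gold_utt)
        pred_spans = word_spans(pred_utt)
        hits += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = hits / n_pred if n_pred else 0.0
    recall = hits / n_gold if n_gold else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


gold = ["the dog", "a cat"]
# Unsegmented input that a model would receive: boundaries removed.
unsegmented = [utt.replace(" ", "") for utt in gold]
# A hypothetical model output that misses one boundary.
predicted = ["thedog", "a cat"]
precision, recall, fscore = token_scores(gold, predicted)
```

Because training and evaluation use the same corpus in this setting, the scores describe how well the boundaries of that corpus were recovered, not generalization to held-out text.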