Abstract: This paper presents an algorithm for learning the construction grammar of a language from a large corpus. This grammar induction algorithm has two goals: first, to show that construction grammars are learnable without highly specified innate structure; second, to develop a model of which units do or do not constitute constructions in a given dataset. The basic task of construction grammar induction is to identify the minimum set of constructions that represents the language in question with maximum descriptive adequacy. These constructions must (1) generalize across an unspecified number of units, (2) contain mixed levels of representation internally (e.g., both item-specific and schematized representations), and (3) allow for unfilled and partially filled slots. Additionally, these constructions may (4) contain recursive structure within a given slot that needs to be reduced in order to produce a sufficiently schematic representation. In other words, these constructions are multi-length, multi-level, possibly discontinuous co-occurrences which generalize across internal recursive structures. These co-occurrences are modeled using frequency and the ΔP measure of association, expanded in novel ways to cover multi-unit sequences. This work provides important new evidence for the learnability of construction grammars as well as a tool for the automated corpus analysis of constructions.
Keywords: construction grammar, grammar induction, multi-unit association measures, poverty of the stimulus.
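The ΔP measure referenced above is, in its standard pairwise form, a directional association statistic over a 2×2 contingency table: the probability of the outcome given the cue, minus its probability in the cue's absence. The multi-unit expansion described in the abstract is not reproduced here; this is a minimal sketch of the pairwise case with hypothetical counts:

```python
def delta_p(a, b, c, d):
    """Directional Delta-P for a 2x2 contingency table.

    a: freq(cue & outcome)      b: freq(cue & ~outcome)
    c: freq(~cue & outcome)     d: freq(~cue & ~outcome)

    Returns P(outcome | cue) - P(outcome | ~cue), in [-1, 1].
    """
    return a / (a + b) - c / (c + d)

# Hypothetical bigram counts: the cue word is followed by the outcome
# word 30 times, by something else 10 times; the outcome also occurs
# 20 times after other words, out of 60 such contexts.
score = delta_p(30, 10, 20, 40)  # positive: cue attracts outcome
```

Because ΔP is directional, ΔP(outcome | cue) and ΔP(cue | outcome) generally differ, which is what makes it useful for distinguishing left-to-right from right-to-left attraction in sequences.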
This paper develops a construction-based dialectometry capable of identifying previously unknown constructions and measuring the degree to which a given construction is subject to regional variation. The central idea is to learn a grammar of constructions (a CxG) using construction grammar induction and then to use these constructions as features for dialectometry. This offers a method for measuring the aggregate similarity between regional CxGs without limiting in advance the set of constructions subject to variation. The learned CxG is evaluated on how well it describes held-out test corpora, while dialectometry is evaluated on how well it can model regional varieties of English. The method is tested using two distinct datasets: first, the International Corpus of English, representing eight outer-circle varieties; second, a web-crawled corpus representing five inner-circle varieties. Results show that the method (1) produces a grammar with stable quality across subsets of a single corpus that is (2) capable of distinguishing between regional varieties of English with a high degree of accuracy, thus (3) supporting dialectometric methods for measuring the similarity between varieties of English and (4) measuring the degree to which each construction is subject to regional variation. This is important for cognitive sociolinguistics because it operationalizes the idea that competition between constructions is organized at the functional level, so that dialectometry needs to represent as much of the available functional space as possible.
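Using constructions as features for dialectometry amounts to representing each regional variety as a vector of construction frequencies and comparing those vectors. A minimal sketch, with hypothetical per-million frequencies for three constructions in two varieties (the paper's actual feature set and similarity measure are not reproduced here):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical per-million frequencies of three learned constructions
# in two regional varieties.
variety_a = [12.0, 3.5, 0.8]
variety_b = [11.0, 4.0, 1.2]

similarity = cosine(variety_a, variety_b)  # aggregate similarity in (0, 1]
```

The key point from the abstract is that the feature set is the full learned grammar rather than a hand-picked list of variables, so the aggregate similarity reflects as much of the functional space as the induced CxG covers.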
A usage-based Construction Grammar (CxG) posits that slot-constraints generalize from common exemplar constructions. But what is the best model of constraint generalization? This paper evaluates competing frequency-based and association-based models across eight languages using a metric derived from the Minimum Description Length paradigm. The experiments show that association-based models produce better generalizations across all languages by a significant margin.