Proceedings of the 2001 Workshop on Computational Natural Language Learning (ConLL '01)
DOI: 10.3115/1117822.1117831
Unsupervised induction of stochastic context-free grammars using distributional clustering

Abstract: An algorithm is presented for learning a phrase-structure grammar from tagged text. It clusters sequences of tags together based on local distributional information, and selects clusters that satisfy a novel mutual information criterion. This criterion is shown to be related to the entropy of a random variable associated with the tree structures, and it is demonstrated that it selects linguistically plausible constituents. This is incorporated in a Minimum Description Length algorithm. The evaluation of unsupe…
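The criterion described in the abstract can be read, roughly, as the mutual information between the tag immediately to the left and the tag immediately to the right of a candidate constituent. The sketch below is a minimal illustration of that reading, not the paper's implementation; the input format (a list of (left_tag, right_tag) context pairs) and the function name are assumptions made here for the example.

```python
from collections import Counter
import math

def context_mutual_information(occurrences):
    """Estimate I(L; R): the mutual information between the tag immediately
    before and the tag immediately after a candidate tag sequence.

    `occurrences` is a list of (left_tag, right_tag) pairs observed around
    the candidate sequence in a tagged corpus (hypothetical input format).
    """
    n = len(occurrences)
    joint = Counter(occurrences)
    left = Counter(l for l, _ in occurrences)
    right = Counter(r for _, r in occurrences)
    mi = 0.0
    for (l, r), c in joint.items():
        p_lr = c / n
        mi += p_lr * math.log2(p_lr / ((left[l] / n) * (right[r] / n)))
    return mi

# A sequence whose left and right contexts are strongly coupled scores high;
# one whose contexts are statistically independent scores near zero.
coupled = [("DT", "VBZ"), ("DT", "VBZ"), ("IN", "MD"), ("IN", "MD")]
independent = [("DT", "VBZ"), ("DT", "MD"), ("IN", "VBZ"), ("IN", "MD")]
print(context_mutual_information(coupled))      # 1.0 bit
print(context_mutual_information(independent))  # 0.0 bits
```

Under this reading, sequences whose flanking contexts carry information about each other are favoured, which is the property the abstract appeals to when it says the criterion selects linguistically plausible constituents.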

Cited by 106 publications (64 citation statements); references 11 publications.
“…Bod 2009 and references therein) and does thus not constrain the computational realization of the statistical induction processes underlying language learning (cf. Clark, 2001; Klein and Manning, 2002; Zuidema, 2006; Bod & Smets, 2012; inter alia). In this paper, we were interested to what extent advanced L2 learners have succeeded in identifying generalizations pertaining to variables that figure in psycholinguistic accounts of sentence-level processing (e.g.…”
Section: Discussion (mentioning)
confidence: 99%
“…Grammar induction (Clark, 2001; Klein and Manning, 2002; Klein and Manning, 2004; Haghighi and Klein, 2006; Smith and Eisner, 2006; Snyder et al., 2009, inter alios) involves the learning of grammars from unlabeled sentences. Here, unlabeled means that the sentences are often POS tagged, but no syntactic structures for the sentences are available.…”
Section: Related Work (mentioning)
confidence: 99%
“…They subsequently cluster syntactic units until the grammar has been constructed. For example, EMILE [1] clusters expressions that occur in the same context, while CDC [10] creates sets of sequences within a context before selecting clusters that satisfy the MDL principle (see above).…”
Section: Grammar Inference (mentioning)
confidence: 99%
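The clustering step referred to in the excerpt above (grouping tag sequences that occur in similar local contexts) can be sketched as follows. This is a simplified illustration under assumed data structures, not the CDC or EMILE implementation: the corpus format, the '#' boundary marker, and the smoothed, symmetrised KL divergence are all choices made here for the example.

```python
from collections import Counter
import math

def context_distribution(corpus, sequence):
    """Count (left_tag, right_tag) contexts of a tag sequence in a tagged corpus.
    `corpus` is a list of tag lists; '#' marks a sentence boundary (assumed format)."""
    dist = Counter()
    k = len(sequence)
    for tags in corpus:
        for i in range(len(tags) - k + 1):
            if tuple(tags[i:i + k]) == tuple(sequence):
                left = tags[i - 1] if i > 0 else "#"
                right = tags[i + k] if i + k < len(tags) else "#"
                dist[(left, right)] += 1
    return dist

def divergence(p, q, eps=1e-9):
    """Symmetrised, smoothed KL divergence between two context distributions;
    one of several reasonable dissimilarity measures for distributional clustering."""
    keys = set(p) | set(q)
    n_p, n_q = sum(p.values()), sum(q.values())

    def kl(a, n_a, b, n_b):
        return sum((a[key] / n_a) * math.log2((a[key] / n_a + eps) / (b[key] / n_b + eps))
                   for key in keys if a[key] > 0)

    return kl(p, n_p, q, n_q) + kl(q, n_q, p, n_p)

# Sequences with similar context distributions (low divergence) are candidates
# for the same cluster, i.e. for rewriting from the same nonterminal.
corpus = [["DT", "NN", "VBZ", "DT", "JJ", "NN"], ["DT", "NN", "MD", "VB"]]
np_like = context_distribution(corpus, ("DT", "NN"))
np_mod = context_distribution(corpus, ("DT", "JJ", "NN"))
print(divergence(np_like, np_mod))
```

Any standard clustering routine over such divergences would then group the candidate sequences; the quoted description of CDC additionally filters the resulting clusters against the MDL principle.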
“…The principle finds its primary application in data reduction, where "any regularity in a given set of data can be used to compress the data" [20]. Examples include CDC [10] and e-GRIDS [38]. Greedy search algorithms make decisions based on their internal logic which may lead to the creation, removal or fusion of rules.…”
Section: Grammar Inference (mentioning)
confidence: 99%
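The compression reading of MDL quoted above translates directly into a two-part score: the bits needed to encode the grammar plus the bits needed to encode the corpus given the grammar. The sketch below is schematic, using an assumed fixed per-symbol code and a hypothetical function name; it is not the encoding used by CDC or e-GRIDS.

```python
def description_length(grammar_rules, corpus_log2_likelihood, bits_per_symbol=4.0):
    """Two-part MDL score: grammar bits plus data bits.

    `grammar_rules` is a list of (lhs, rhs) productions with rhs a tuple of symbols;
    each symbol is charged `bits_per_symbol`, a stand-in for a real code over the
    symbol alphabet (assumed). `corpus_log2_likelihood` is the log2-probability of
    the corpus under the grammar, so its negation is the data cost in bits.
    """
    grammar_bits = sum((1 + len(rhs)) * bits_per_symbol for _, rhs in grammar_rules)
    return grammar_bits - corpus_log2_likelihood

# A new rule is kept only if the bits it adds to the grammar are repaid by a
# shorter encoding of the data (i.e. a higher corpus likelihood).
before = description_length([("S", ("NP", "VP"))], corpus_log2_likelihood=-5000.0)
after = description_length([("S", ("NP", "VP")), ("NP", ("DT", "NN"))],
                           corpus_log2_likelihood=-4900.0)
print("accept new rule" if after < before else "reject new rule")  # accept new rule
```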