Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics 2014
DOI: 10.3115/v1/e14-1014

Generalizing a Strongly Lexicalized Parser using Unlabeled Data

Abstract: Statistical parsers trained on labeled data suffer from sparsity, both grammatical and lexical. For parsers based on strongly lexicalized grammar formalisms (such as CCG, which has complex lexical categories but simple combinatory rules), the problem of sparsity can be isolated to the lexicon. In this paper, we show that semi-supervised Viterbi-EM can be used to extend the lexicon of a generative CCG parser. By learning complex lexical entries for low-frequency and unseen words from unlabeled data, we obtain i…
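The semi-supervised Viterbi-EM loop the abstract describes can be pictured as follows. This is a minimal Python sketch, not the authors' implementation: the parser interface (viterbi_parse, lexical_entries, estimate) and the smoothing remark are hypothetical stand-ins for whatever the generative CCG parser actually exposes.

```python
from collections import Counter

def viterbi_em_extend_lexicon(parser, unlabeled_sentences, iterations=3):
    """Grow a strongly lexicalized parser's lexicon from unlabeled text."""
    for _ in range(iterations):
        counts = Counter()
        for sentence in unlabeled_sentences:
            # E-step (Viterbi variant): take only the single best parse
            # under the current model, rather than summing over all parses.
            parse = parser.viterbi_parse(sentence)
            if parse is None:  # sentence not covered by the current lexicon
                continue
            # Harvest (word, lexical category) pairs from the best parse;
            # this is where complex categories for low-frequency and
            # unseen words enter the lexicon.
            for word, category in parse.lexical_entries():
                counts[(word, category)] += 1
        # M-step: re-estimate the lexical parameters from the Viterbi
        # counts (in practice, smoothed with the supervised counts).
        parser.estimate(counts)
    return parser
```

The distinguishing choice relative to standard EM is the E-step: only the single best parse contributes counts, which keeps the extended lexicon compact and the re-estimation cheap.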

Cited by 5 publications (6 citation statements; citing years 2014 and 2021) | References 19 publications

“…Reducing the requirements for training data eases the task for human annotators. It may also make the model more amenable to semi-supervised approaches to CCG parsing, which have typically focused on extending the lexicon (Thomforde and Steedman, 2011; Deoskar et al., 2014). Finally, it may make it easier to convert other annotated resources, such as UCCA (Abend and Rappoport, 2013) or AMR (Banarescu et al., 2013), to CCG training data, as only specific words need to be converted rather than full sentences.…”
Section: Future Work
confidence: 99%
“…Baldridge (2008) and Ravi et al. (2010) were particularly concerned with high lexical ambiguity and counteracted this, respectively, by improving lexicon initialization using linguistic principles, and by explicitly minimizing model sizes. Deoskar et al. (2013), working with lexico-syntactic dependencies similar to supertags, addressed difficulties arising from the long tail of rare and unseen words; and Deoskar et al. (2014) addressed a similar issue specifically for generalizing a CCG parser. The problem of out-of-vocabulary words has gotten much less severe with the advent of deep contextualized sentence encoders operating on subword units.…”
Section: Discussion and Related Work
confidence: 99%
“…This introduction of parameters for new word types into the lexicon was the only modification made to the parsers, with the remainder of the models being left unchanged. When combined with methods that could adapt the existing model parameters to the statistics of the new domain, such as self-training (e.g., Deoskar et al., 2014), we expect further improvements to be achievable. Nonetheless, there were substantial variations in the strength of the improvement attained, with the weak performance of the Berkeley Parser being a notable disappointment.…”
Section: Discussion
confidence: 99%