Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics 2014
DOI: 10.3115/v1/e14-1014

Generalizing a Strongly Lexicalized Parser using Unlabeled Data

Abstract: Statistical parsers trained on labeled data suffer from sparsity, both grammatical and lexical. For parsers based on strongly lexicalized grammar formalisms (such as CCG, which has complex lexical categories but simple combinatory rules), the problem of sparsity can be isolated to the lexicon. In this paper, we show that semi-supervised Viterbi-EM can be used to extend the lexicon of a generative CCG parser. By learning complex lexical entries for low-frequency and unseen words from unlabeled data, we obtain i…
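The semi-supervised Viterbi-EM loop the abstract describes can be pictured as follows. This is a minimal Python sketch, not the authors' implementation: the parser interface (viterbi_parse, lexical_entries, estimate) and the smoothing remark are hypothetical stand-ins for whatever the generative CCG parser actually exposes.

```python
from collections import Counter

def viterbi_em_extend_lexicon(parser, unlabeled_sentences, iterations=3):
    """Grow a strongly lexicalized parser's lexicon from unlabeled text."""
    for _ in range(iterations):
        counts = Counter()
        for sentence in unlabeled_sentences:
            # E-step (Viterbi variant): take only the single best parse
            # under the current model, rather than summing over all parses.
            parse = parser.viterbi_parse(sentence)
            if parse is None:  # sentence not covered by the current lexicon
                continue
            # Harvest (word, lexical category) pairs from the best parse;
            # this is where complex categories for low-frequency and
            # unseen words enter the lexicon.
            for word, category in parse.lexical_entries():
                counts[(word, category)] += 1
        # M-step: re-estimate the lexical parameters from the Viterbi
        # counts (in practice, smoothed with the supervised counts).
        parser.estimate(counts)
    return parser
```

The distinguishing choice relative to standard EM is the E-step: only the single best parse contributes counts, which keeps the extended lexicon compact and the re-estimation cheap.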

Cited by 5 publications (6 citation statements; citing years 2014 and 2021) | References 19 publications

“…Reducing the requirements for training data eases the task for human annotators. It may also make the model more amenable to semi-supervised approaches to CCG parsing, which have typically focused on extending the lexicon (Thomforde and Steedman, 2011; Deoskar et al., 2014). Finally, it may make it easier to convert other annotated resources, such as UCCA (Abend and Rappoport, 2013) or AMR (Banarescu et al., 2013), to CCG training data, as only specific words need to be converted rather than full sentences.…”
Section: Future Work
confidence: 99%
“…Baldridge (2008) and Ravi et al. (2010) were particularly concerned with high lexical ambiguity and counteracted this, respectively, by improving lexicon initialization using linguistic principles, and by explicitly minimizing model sizes. Deoskar et al. (2013), working with lexico-syntactic dependencies similar to supertags, addressed difficulties arising from the long tail of rare and unseen words; and Deoskar et al. (2014) addressed a similar issue specifically for generalizing a CCG parser. The problem of out-of-vocabulary words has gotten much less severe with the advent of deep contextualized sentence encoders operating on subword units.…”
Section: Discussion and Related Work
confidence: 99%
“…This introduction of parameters for new word types into the lexicon was the only modification made to the parsers, with the remainder of the models being left unchanged. When combined with methods that could adapt the existing model parameters to the statistics of the new domain, such as self-training (e.g., Deoskar et al., 2014), we expect further improvements to be achievable. Nonetheless, there were substantial variations in the strength of the improvement attained, with the weak performance of the Berkeley Parser being a notable disappointment.…”
Section: Discussion
confidence: 99%