Deep learning has brought a wealth of state-of-the-art results and new capabilities. Although methods have achieved near human-level performance on many benchmarks, numerous recent studies imply that these benchmarks only weakly test their intended purpose, and that simple examples, produced by either humans or machines, cause systems to fail spectacularly. For example, a recently released textual entailment demo was criticized on social media for predicting that "John killed Mary" entails "Mary killed John" with 92% confidence. Such surprising failures, combined with the inability to interpret state-of-the-art models, have eroded confidence in our systems. While these systems are not perfect, the real flaw lies with our benchmarks, which do not adequately measure a model's ability to generalize and are thus easily gameable.

This workshop provides a venue for exploring new approaches for measuring and enforcing generalization in models. We have solicited work in the following areas:

1. Analysis of existing models and their failings.
2. Creation of new evaluation paradigms, e.g. zero-shot learning, Winograd schemas, and datasets that avoid explicit types of gamification.
3. Modeling advances such as regularization, compositionality, interpretability, inductive bias, multi-task learning, and other methods that promote generalization.

Our goals are similar in spirit to those of the recent "Build it Break it" shared task. However, we propose going beyond identifying areas of weakness (i.e., "breaking" existing systems) to discussing evaluations that rigorously test generalization as well as modeling techniques for enforcing it.

We received eight archival submissions and seven cross submissions, accepting five archival papers and all cross submissions. The papers predominantly covered the first two stated goals of the workshop, with the majority identifying flaws in either methods or data. Of the papers proposing new evaluations, many explored the use of synthetic data. The papers will be presented as posters at the workshop, and we are excited to see what discussions they generate. In addition to the twelve papers being presented, we are equally excited for the invited talks. Finally, we would also like to thank Yejin, Devi, and Dan for their service on the steering committee.

- Yonatan, Omer, Mark

Organizers:
Abstract

In this paper, we investigate the tendency of end-to-end neural Machine Reading Comprehension (MRC) models to match shallow patterns rather than perform inference-oriented reasoning on RC benchmarks. We aim to test the ability of these systems to answer questions which focus on referential inference. We propose ParallelQA, a strategy to formulate such questions using parallel passages. We also demonstrate that existing neural models fail to generalize well to this setting.
Abstract

Commonsense knowledge bases such as ConceptNet represent knowledge in the form of relational triples. Inspired by recent work by Li et al. (2016), we analyse if knowledge base completion models can be used to mine commonsense knowledge from...