We describe the IUCL+ system for the shared task of the First Workshop on Computational Approaches to Code Switching (Solorio et al., 2014), in which participants were challenged to label each word in Twitter texts as a named entity or one of two candidate languages. Our system combines character n-gram probabilities, lexical probabilities, word-label transition probabilities, and existing named entity recognition tools within a Markov model framework that weights these components and assigns a label. Our approach is language-independent, and we submitted results for all data sets (five test sets and three "surprise" sets, covering four language pairs), earning the highest tweet-level accuracy on two language pairs (Mandarin-English, Arabic dialects 1 & 2) and one of the surprise sets (Arabic dialects).
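As a rough illustration of the kind of decoding this abstract describes, the sketch below runs Viterbi over a weighted sum of component log-scores. The emission models, label set, transition table, and interpolation weights here are invented placeholders, not the models or values used by IUCL+.

```python
import math

LABELS = ["lang1", "lang2", "ne"]

# Dummy per-label emission scores: a real system would score each word under
# character n-gram and lexical models trained per label on the task data.
def char_ngram_logprob(word, label):
    return -0.5 * len(word) if label != "ne" else -1.0 * len(word)

def lexical_logprob(word, label):
    return math.log(1.0 / len(LABELS))  # uniform placeholder

# Dummy label-transition log-probabilities P(label_t | label_{t-1}).
TRANS = {(a, b): math.log(0.8 if a == b else 0.1)
         for a in LABELS for b in LABELS}

def decode(words, w_ngram=1.0, w_lex=1.0, w_trans=1.0):
    """Viterbi decoding over a weighted sum of component log-scores."""
    best = [{lab: w_ngram * char_ngram_logprob(words[0], lab)
                  + w_lex * lexical_logprob(words[0], lab)
             for lab in LABELS}]
    back = [{}]
    for t, word in enumerate(words[1:], start=1):
        best.append({})
        back.append({})
        for lab in LABELS:
            emit = (w_ngram * char_ngram_logprob(word, lab)
                    + w_lex * lexical_logprob(word, lab))
            prev, score = max(
                ((p, best[t - 1][p] + w_trans * TRANS[(p, lab)])
                 for p in LABELS),
                key=lambda x: x[1])
            best[t][lab] = score + emit
            back[t][lab] = prev
    # Trace the best label sequence back from the final position.
    lab = max(best[-1], key=best[-1].get)
    labels = [lab]
    for t in range(len(words) - 1, 0, -1):
        lab = back[t][lab]
        labels.append(lab)
    return list(reversed(labels))

print(decode("mi casa is su casa".split()))
```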
We describe the Indiana University system for SemEval Task 5, the L2 writing assistant task, as well as some extensions to the system completed after the main evaluation. Our team submitted translations for all four language pairs in the evaluation, yielding the top scores for English-German. The system combines several information sources to arrive at a final L2 translation for a given L1 text fragment: phrase tables extracted from bitexts, an L2 language model, a multilingual dictionary, and dependency-based collocational models derived from large samples of target-language text.
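To make the combination concrete, here is a minimal sketch of scoring candidate L2 translations for an L1 fragment in its L2 sentence context. The phrase table, dictionary, language model, and weights below are invented stand-ins, not the resources or weights the Indiana University system actually used, and the collocational models are omitted.

```python
import math

PHRASE_TABLE = {"maison": [("house", 0.7), ("home", 0.3)]}  # P(e|f) from bitexts
DICTIONARY = {"maison": ["house", "home", "household"]}     # multilingual dictionary

def lm_logprob(tokens):
    # Placeholder for an L2 language-model score over the completed sentence.
    return -0.1 * len(tokens)

def translate_fragment(l1_fragment, left_ctx, right_ctx, w_pt=1.0, w_lm=1.0):
    """Rank candidates by weighted phrase-table and language-model log-scores."""
    pt = dict(PHRASE_TABLE.get(l1_fragment, []))
    candidates = set(pt) | set(DICTIONARY.get(l1_fragment, []))
    scored = []
    for cand in candidates:
        lm = lm_logprob(left_ctx + [cand] + right_ctx)
        # 1e-6 is a floor probability for dictionary-only candidates.
        score = w_pt * math.log(pt.get(cand, 1e-6)) + w_lm * lm
        scored.append((score, cand))
    return max(scored)[1] if scored else l1_fragment

print(translate_fragment("maison", ["the"], ["is", "old"]))  # -> "house"
```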
We investigate how to reason about learner meaning in cases where the set of correct meanings is never entirely complete, specifically for picture description tasks (PDTs). To operationalize this, we explore different models of representing and scoring non-native speaker (NNS) responses to a picture, including bags of dependencies, and we automatically determine the relevant parts of an image from a set of native speaker (NS) responses. In more exploratory work, we examine the variability in both NS and NNS responses and how different system parameters correlate with that variability. In this way, we hope to provide insight for future system development, data collection, and investigations into learner language.
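As one concrete instance of bag-of-dependencies scoring, the sketch below measures how much of an NNS response's dependency bag is attested anywhere in a pool of NS responses. It assumes responses have already been parsed into (dependent, relation, head) triples; the triples are invented for illustration, and the scoring models in the paper differ.

```python
from collections import Counter

def overlap_score(nns_triples, ns_triples_list):
    """Fraction of NNS dependency tokens attested in any NS response."""
    reference = set().union(*(set(t) for t in ns_triples_list))
    bag = Counter(nns_triples)  # bag: counts repeated dependencies
    matched = sum(c for triple, c in bag.items() if triple in reference)
    return matched / sum(bag.values()) if bag else 0.0

# Hypothetical parsed responses to the same picture.
ns = [[("boy", "nsubj", "kick"), ("ball", "obj", "kick")],
      [("child", "nsubj", "kick"), ("ball", "obj", "kick")]]
nns = [("boy", "nsubj", "kick"), ("ball", "obj", "throw")]
print(overlap_score(nns, ns))  # -> 0.5
```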
Given that all users of a language can be creative in their language usage, the overarching goal of this work is to investigate issues of variability and acceptability in written text, for both non-native speakers (NNSs) and native speakers (NSs). We control for meaning by collecting a dataset of picture description task (PDT) responses from a number of NSs and NNSs, and we define and annotate a handful of features pertaining to form and meaning, to capture the multi-dimensional ways in which responses can vary and can be acceptable. By examining the decisions made in this corpus development, we highlight the questions facing anyone working with learner language properties like variability, acceptability and native-likeness. We find reliable inter-annotator agreement, though disagreements point to difficult areas for establishing a link between form and meaning.
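For readers unfamiliar with how claims of reliable inter-annotator agreement are typically quantified, the following is a small worked example of chance-corrected agreement (Cohen's kappa). The annotation labels are invented, not drawn from the corpus described above.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["acceptable", "acceptable", "unacceptable", "acceptable"]
ann2 = ["acceptable", "unacceptable", "unacceptable", "acceptable"]
print(round(cohens_kappa(ann1, ann2), 2))  # -> 0.5
```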