Introduction

The W-NUT 2017 workshop focuses on a core set of natural language processing tasks on top of noisy user-generated text, such as that found on social media, web forums and online reviews. Recent years have seen a significant increase in interest in these areas. The internet has democratized content creation, leading to an explosion of informal user-generated text that is publicly available in electronic format. This motivates the need for NLP on noisy text to enable new data analytics applications. The workshop is an opportunity to bring together researchers with different backgrounds who are interested in noisy text, and to encourage crossover.

The workshop this year features a shared task on Emerging and Rare entity recognition. The workshop received 27 main track submissions, 17 of which were accepted, in addition to 6 system description papers for the shared task and a task overview paper. There are 3 invited speakers, Bill Dolan, Dirk Hovy and Miles Osborne, with each of their talks covering a different aspect of NLP for user-generated text.

We would like to thank the Program Committee members who reviewed the papers this year. We would also like to thank the workshop participants.
Abstract

This work presents a fine-grained text-chunking algorithm designed for the task of multiword expression (MWE) segmentation. As a lexical class, MWEs include a wide variety of idioms, whose automatic identification is a necessity for the handling of colloquial language. This algorithm's core novelty is its use of non-word tokens, i.e., boundaries, in a bottom-up strategy. Leveraging boundaries refines token-level information, forging high-level performance from relatively basic data. The generality of this model's feature space allows for its application across languages and domains. Experiments spanning 19 different languages exhibit a broadly applicable, state-of-the-art model. Evaluation against recent shared-task data places text partitioning as the best-performing MWE segmentation algorithm overall, covering all MWE classes and multiple English domains (including user-generated text). This performance, coupled with a non-combinatorial, fast-running design, produces an ideal combination for implementations at scale, which are facilitated through the release of open-source software.
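To make the idea of treating non-word tokens as boundaries more concrete, the following is a minimal sketch and not the released software or the model described in the abstract. It assumes a simple regular-expression tokenizer and a caller-supplied decision function; the names `tokenize`, `partition`, `should_bind`, and the toy lookup set `TOY_BINDS` are all hypothetical and introduced only for illustration. Each boundary between two words is either bound (kept inside a segment) or broken (used as a split point), so the text is scanned once with no combinatorial search.

```python
import re


def tokenize(text):
    """Split text into alternating runs of word and non-word characters.

    Assumption: "boundaries" are approximated here as any run of
    non-word characters (spaces, punctuation).
    """
    return re.findall(r"\w+|\W+", text)


def partition(text, should_bind):
    """Group words into segments by deciding, per boundary, bind or break.

    should_bind(left, boundary, right) is a hypothetical, caller-supplied
    predicate; a real system would learn this decision from data.
    Broken boundaries are dropped from the output.
    """
    tokens = tokenize(text)
    segments = []
    current = ""
    for i, tok in enumerate(tokens):
        if re.match(r"\W", tok):
            left = tokens[i - 1] if i > 0 else ""
            right = tokens[i + 1] if i + 1 < len(tokens) else ""
            if left and right and should_bind(left, tok, right):
                # Bound boundary: it stays inside the current segment.
                current += tok
            elif current:
                # Broken boundary: close the current segment.
                segments.append(current)
                current = ""
        else:
            current += tok
    if current:
        segments.append(current)
    return segments


# Toy decision table (hypothetical data, for illustration only).
TOY_BINDS = {("kicked", " ", "the"), ("the", " ", "bucket")}


def toy_should_bind(left, boundary, right):
    return (left, boundary, right) in TOY_BINDS


print(partition("He kicked the bucket yesterday.", toy_should_bind))
```

In this toy setting the idiom is kept together as one segment while the surrounding words are split off individually. The single left-to-right pass is one way such a design can stay non-combinatorial.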