Introduction

This volume contains the papers describing systems submitted to the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, together with an overview paper summarizing the task, its features, the evaluation methodology for the main and additional metrics, and some interesting observations about the submitted systems and the task as a whole.

This shared task (http://universaldependencies.org/conll17/) can be seen as an extension of the CoNLL 2007 Shared Task on dependency parsing, but many important differences make this year's task unique, with several "firsts". Most importantly, the data for this task come from the Universal Dependencies project (http://universaldependencies.org), which provides annotated treebanks for a large number of languages using the same annotation scheme for all of them. In the shared task setting, this allows for more meaningful comparison between systems as well as between languages, since differences are much more likely due to true parser differences rather than to differences in annotation schemes. In addition, the number of languages for which training data were available is unprecedented for a single shared task: a total of 64 treebanks in 45 languages were provided for training the systems. Additional data were provided as well, along with baseline systems for participants who wanted to address only a particular aspect of parsing. Overall, the task can be described as "closed", since only pre-approved data could be used.

For evaluation, there were 81 test datasets: the standard test sets for the treebanks provided for training, further test sets in known languages based on a specially created and annotated parallel corpus, and four surprise-language test sets. Participants had to process all of the test sets. The TIRA platform was used for evaluation, as was already the case for the CoNLL 2015 and 2016 Shared Tasks, meaning that participants had to deploy their code on a designated virtual machine, which the organizers then ran to produce the official results. However, the test data were published after the official evaluation period, and participants could run their systems at home to produce additional results that they were allowed to include in their system description papers. There was one main evaluation metric, the Labeled Attachment Score (LAS), used for the main ranking of dependency parsing performance, plus additional metrics for tokenization, word and sentence segmentation, POS tagging, lemmatization and disambiguation of morphological features, as well as separate metrics computed on interesting subsets of the evaluation data.

A total of 32 systems ran successfully and were ranked (http://universaldependencies.org/conll17/results.html). While there are clear overall winners, we would like to thank all participants for working hard on their submissions and for adapting their systems not only to the available datasets but also to the evaluation platform. We thank all of them for their effort, since it is the participants who are the core of ...
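As a concrete illustration of the main metric: the Labeled Attachment Score is the proportion of syntactic words whose head and dependency label are both predicted correctly. The following Python sketch is illustrative only, not the official evaluation script; it assumes that the gold and system CoNLL-U files share identical tokenization, whereas the official evaluation additionally aligns words when segmentation differs, and the function names are hypothetical.

    def read_words(path):
        """Yield (HEAD, DEPREL) pairs for syntactic words in a CoNLL-U file."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue  # skip blank lines and sentence-level comments
                cols = line.split("\t")
                # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
                if "-" in cols[0] or "." in cols[0]:
                    continue
                yield cols[6], cols[7]  # HEAD and DEPREL columns

    def las(gold_path, system_path):
        """Labeled Attachment Score, assuming identical word segmentation."""
        gold = list(read_words(gold_path))
        system = list(read_words(system_path))
        assert len(gold) == len(system), "sketch assumes identical tokenization"
        correct = sum(1 for g, s in zip(gold, system) if g == s)
        return correct / len(gold) if gold else 0.0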