Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations 2020
DOI: 10.18653/v1/2020.acl-demos.14
Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

Abstract: We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora…
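The pipeline the abstract describes can be exercised in a few lines. A minimal sketch, assuming the stanza package is installed and models can be downloaded; the input sentence is a placeholder:

import stanza

# One-off model download, then the default English pipeline, which covers
# tokenization, MWT expansion (where applicable), lemmatization,
# POS/morphological tagging, dependency parsing, and NER.
stanza.download("en")
nlp = stanza.Pipeline("en")

doc = nlp("Stanza was trained on 112 datasets covering 66 languages.")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.upos, word.feats, word.deprel)
print([(ent.text, ent.type) for ent in doc.ents])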

Cited by 1,001 publications (735 citation statements). References 9 publications.
“…The current versions of MuST-C and MuST-Cinema do not include Japanese as a source language; however, we will still perform the analysis on JESC and OpenSubtitles. For the Chink-Chunk algorithm we preprocess the data using the Stanza toolkit (Qi et al., 2020). We first tokenise and perform Multi-Word Token (MWT) expansion to split the words into syntactic units.…”
Section: Methods
confidence: 99%
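The tokenisation and MWT-expansion step mentioned in this excerpt corresponds to a two-processor Stanza pipeline. A minimal sketch; the French sentence is only an illustration, since MWT expansion is a no-op for languages without multi-word tokens:

import stanza

# Tokenise and expand multi-word tokens into their syntactic words.
stanza.download("fr")
nlp = stanza.Pipeline("fr", processors="tokenize,mwt")

doc = nlp("Je parle du projet.")  # "du" expands to "de" + "le"
for sentence in doc.sentences:
    for token in sentence.tokens:
        print(token.text, "->", [word.text for word in token.words])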
“…AMR parsers in the literature rely on several pre- and postprocessing rules. We extend these rules for the cross-lingual AMR parsing task based on several multilingual resources such as Wikipedia, BabelNet 4.0 (Navigli and Ponzetto, 2010), the DBpedia Spotlight API (Daiber et al., 2013) for wikification in all languages but Chinese, for which we use Babelfy (Moro et al., 2014) instead, Stanford CoreNLP for the English preprocessing pipeline, the Stanza Toolkit (Qi et al., 2020) for Chinese, German and Spanish sentences, and Tint (Aprosio and Moretti, 2016) for Italian. The preprocessing steps consist of: i) lemmatization, ii) PoS tagging, iii) NER, iv) re-categorization of entities and senses, v) removal of wiki links and polarity attributes.…”
Section: Pre- and Postprocessing
confidence: 99%
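Steps i) to iii) of this pipeline map directly onto Stanza processors. A hedged sketch of how the non-English pipelines could be set up; the example sentences are placeholders and the cited work may configure the processors differently:

import stanza

texts = {"zh": "奥巴马出生于夏威夷。",
         "de": "Angela Merkel wurde in Hamburg geboren.",
         "es": "Pablo Neruda nació en Chile."}

for lang, text in texts.items():
    stanza.download(lang)        # one-off model download per language
    nlp = stanza.Pipeline(lang)  # default pipeline, includes NER where available
    doc = nlp(text)
    print(lang,
          [(w.text, w.lemma, w.upos) for s in doc.sentences for w in s.words],
          [(ent.text, ent.type) for ent in doc.ents])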
“…Preprocessing: this step consists of i) lemmatization, ii) PoS-tagging, iii) NER, iv) re-categorization of entities and senses, and v) removal of wiki links and polarity attributes. As NLP pipelines (steps i-iii) we use Stanford CoreNLP for English sentences, the Stanza Toolkit (Qi et al., 2020) for Chinese, German and Spanish sentences, and Tint (Aprosio and Moretti, 2016) for Italian. Re-categorization and anonymization of entities is often used in English AMR parsing to reduce data sparsity (Lyu and Titov, 2018; Peng et al., 2017; Konstas et al., 2017).…”
Section: A Cross-lingual AMR Pre- and Postprocessing
confidence: 99%
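Steps iv) and v), entity re-categorization and anonymization, amount to replacing recognized entity spans with category placeholders. A rough sketch built on Stanza's NER output; the placeholder scheme is illustrative only and is not the cited paper's actual re-categorization rules:

def anonymize_entities(doc):
    """Replace each recognized entity span with a <TYPE> placeholder."""
    text = doc.text
    # Work right to left so earlier character offsets stay valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        text = text[:ent.start_char] + f"<{ent.type}>" + text[ent.end_char:]
    return text

# e.g. "Angela Merkel wurde in Hamburg geboren." -> "<PER> wurde in <LOC> geboren."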
“…For the rule-based model, we used the "GUM" model of Stanford's Stanza toolkit [16] for tokenisation and the "GENIA+PubMed" model of the BLLIP parser [4] for parsing. We converted the resulting trees into Universal Dependencies using the Stanford Dependencies Converter [18].…”
Section: Training and Evaluation
confidence: 99%
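Only the tokenisation half of that setup involves Stanza; the BLLIP parsing and the dependency conversion are separate tools. A minimal sketch, assuming "gum" is available as an English package name in the installed Stanza version; the input sentence is a placeholder:

import stanza

# Load only the GUM-trained English tokeniser.
stanza.download("en", package="gum")
nlp = stanza.Pipeline("en", package="gum", processors="tokenize")

doc = nlp("The protein binds the receptor. It is then degraded.")
for i, sentence in enumerate(doc.sentences):
    print(i, [token.text for token in sentence.tokens])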