Proceedings of the First International Conference on Human Language Technology Research - HLT '01 2001
DOI: 10.3115/1072133.1072187

Inducing multilingual text analysis tools via robust projection across aligned corpora

Abstract: This paper describes a system and set of algorithms for automatically inducing stand-alone monolingual part-of-speech taggers, base noun-phrase bracketers, named-entity taggers and morphological analyzers for an arbitrary foreign language. Case studies include French, Chinese, Czech and Spanish. Existing text analysis tools for English are applied to bilingual text corpora and their output projected onto the second language via statistically derived word alignments. Simple direct annotation projection is quite …
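
The direct projection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the word alignment is already given as (source index, target index) pairs and that each target word takes the most frequent tag among the source words linked to it. The function name `project_tags`, the tag labels, and the toy sentence pair are hypothetical and are not taken from the paper, which goes on to describe more robust procedures than this naive version.

```python
from collections import Counter, defaultdict


def project_tags(source_tags, alignment, target_len, unaligned_tag=None):
    """Project source-side tags onto target positions via alignment links.

    source_tags   -- one tag per source word
    alignment     -- iterable of (source_index, target_index) pairs
    target_len    -- number of words in the target sentence
    unaligned_tag -- tag given to target words with no alignment link
    """
    # Collect, for every target position, the tags of all linked source words.
    votes = defaultdict(Counter)
    for s, t in alignment:
        votes[t][source_tags[s]] += 1

    projected = []
    for t in range(target_len):
        if t in votes:
            # Keep the most frequent projected tag for this target word.
            projected.append(votes[t].most_common(1)[0][0])
        else:
            projected.append(unaligned_tag)
    return projected


# Toy example (hypothetical data): English tags projected onto French.
english = ["the", "red", "house"]
english_tags = ["DT", "JJ", "NN"]
french = ["la", "maison", "rouge"]
links = [(0, 0), (1, 2), (2, 1)]  # the-la, red-rouge, house-maison

french_tags = project_tags(english_tags, links, len(french))
print(list(zip(french, french_tags)))
```

Target words without any alignment link receive `unaligned_tag`, which makes the gaps left by naive projection explicit rather than hiding them.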

Cited by 310 publications (332 citation statements)
References: 12 publications

“…We hypothesize that a large component of the error rate in the automatically induced text analysis tools generated by [22] is due to morphosyntactic differences between the source and target languages that are specific to each source-target language pair. Therefore, training POS taggers on additional source languages should result in multiple classifiers which produce independently distributed errors on the target language.…”
Section: Motivation (mentioning)
confidence: 99%
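
The appeal of independently distributed errors can be shown with a short calculation. The sketch below assumes the per-source-language taggers are combined by majority vote and that each errs independently with the same rate `p`; neither the voting scheme nor the error rates come from the quoted statement, and the function name `majority_error` is hypothetical. Because a plurality vote can still be right when fewer than half of the taggers are right, the value is a pessimistic bound rather than an exact error rate.

```python
from math import comb


def majority_error(p, n):
    """Probability that more than half of n independent taggers are wrong.

    p -- error rate of each individual tagger (assumed identical)
    n -- number of taggers (odd, so that no ties occur)
    """
    # Sum the binomial probabilities of k wrong taggers for every k > n/2.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Illustrative values only: a 20% individual error rate.
for n in (1, 3, 5):
    print(n, round(majority_error(0.2, n), 5))
```

Under these assumptions the bound falls as taggers are added, which is the effect the hypothesis relies on; correlated errors would weaken it.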
“…Labelling data by hand is time-consuming; a natural goal is therefore to generate text analysis tools automatically, using minimal resources. Yarowsky et al [22] present methods for automatically inducing various monolingual text analysis tools for an arbitrary target language, using only the corresponding text analysis tool for a source language and a parallel corpus between the source and target languages. Hwa et al [15] induce a parser for Chinese text via projection from English using a similar method to that of [22].…”
Section: Introduction (mentioning)
confidence: 99%
“…A widely-used methodology consists in generating automatic annotations for the resource-poor language by projecting linguistic information through word alignment links (see e.g. (Yarowsky et al, 2001; for PoS tagging, (Hwa et al, 2005; Lacroix et al, 2016a) for dependency parsing, (Ehrmann et al, 2011) for Named Entity Recognition, (Kozhevnikov and Titov, 2013) for Semantic Role Labeling, etc.). Implementing this methodology requires the existence of (a) parallel corpora aligned at the word level, and (b) annotation and/or tools on the resource-rich side.…”
Section: Introduction (mentioning)
confidence: 99%
“…Although phrase-based approaches to SMT tend to be robust to word-alignment errors (Lopez and Resnik, 2006), improving word-alignment is still useful for other NLP research that is more sensitive to alignment quality, e.g., projection of information across parallel corpora (Yarowsky et al, 2001).…”
Section: Introduction (mentioning)
confidence: 99%