Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d15-1150

Part-of-speech Taggers for Low-resource Languages using CCA Features

Abstract: In this paper, we address the challenge of creating accurate and robust part-of-speech taggers for low-resource languages. We propose a method that leverages existing parallel data between the target language and a large set of resource-rich languages without ancillary resources such as tag dictionaries. Crucially, we use CCA to induce latent word representations that incorporate cross-genre distributional cues, as well as projected tags from a full array of resource-rich languages. We develop a probability-base…
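The abstract's core recipe, CCA over two views of each word type, can be sketched compactly. Below is a minimal, illustrative version assuming a context co-occurrence matrix and a projected-tag matrix as the two views; the regularized whitening-plus-SVD recipe and all names are generic stand-ins, not the authors' exact pipeline.

import numpy as np

def cca_embed(X, Y, k, reg=1e-3):
    """Return k-dimensional CCA projections of the rows of view X.

    X: (n_words, d1) view 1, e.g. context co-occurrence counts
    Y: (n_words, d2) view 2, e.g. counts of tags projected from
       resource-rich languages (hypothetical inputs)
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    # Regularized whitening (C^{-1/2}) of each view's covariance.
    def whiten(M):
        C = (M.T @ M) / M.shape[0] + reg * np.eye(M.shape[1])
        evals, evecs = np.linalg.eigh(C)
        return evecs @ np.diag(evals ** -0.5) @ evecs.T

    Wx, Wy = whiten(X), whiten(Y)
    # SVD of the whitened cross-covariance yields the canonical directions.
    Cxy = (X.T @ Y) / X.shape[0]
    U, _, _ = np.linalg.svd(Wx @ Cxy @ Wy, full_matrices=False)
    return X @ (Wx @ U[:, :k])  # one row of latent features per word

Each row of the result is a latent word representation that could then feed a supervised tagger, which is the general pattern the abstract describes.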

Cited by 12 publications (9 citation statements), published 2016–2024; references 19 publications.
“…Instead of projecting tag information via word alignment, the transfer in our model is driven by mapping multilingual embedding spaces. Kim et al. (2015) also use latent word representations for multilingual transfer. However, similarly to prior work, this representation is learned using parallel data.…”
Section: Related Work (mentioning)
confidence: 99%
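For contrast with alignment-based tag projection, here is a minimal sketch of the embedding-space mapping this citation describes, in the style of a least-squares translation matrix; the seed translation pairs and all names are placeholders, not any cited paper's exact code.

import numpy as np

def learn_mapping(src_vecs, tgt_vecs):
    """Least-squares linear map W with src_vecs @ W ~= tgt_vecs.

    src_vecs, tgt_vecs: (n_pairs, dim) embeddings of seed
    translation pairs from the two languages.
    """
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W

# Usage: map a source-language word into the target space, then apply
# a tagger trained on target-language embeddings to the mapped vector:
#   mapped = src_vec @ W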
“…There is an expansive body of research on learning multilingual word embeddings (Gouws et al., 2014; Faruqui and Dyer, 2014; Lu et al., 2015; Lauly et al., 2014; Luong et al., 2015). Previous work has shown their effectiveness across a wide range of multilingual transfer tasks including tagging (Kim et al., 2015), syntactic parsing (Xiao and Guo, 2014; Guo et al., 2015; Durrett et al., 2012), and machine translation (Zou et al., 2013; Mikolov et al., 2013b). However, these approaches commonly require parallel sentences or a bilingual lexicon to learn multilingual embeddings.…”
Section: Multilingual Word Embeddings (mentioning)
confidence: 99%
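One of the lexicon-based approaches this citation lists, CCA over a bilingual lexicon in the style of Faruqui and Dyer (2014), reduces to a few lines with scikit-learn; the lexicon embeddings below are stand-ins, and this is a sketch of the general idea rather than that paper's implementation.

from sklearn.cross_decomposition import CCA

def shared_space(en_vecs, de_vecs, k=40):
    """Project two monolingual embedding spaces into one shared space.

    en_vecs, de_vecs: (n_pairs, dim) embeddings of translation pairs
    drawn from a bilingual lexicon; k must not exceed min(n_pairs, dim).
    Returns the k-dimensional shared-space projections of both sides.
    """
    cca = CCA(n_components=k)
    en_shared, de_shared = cca.fit_transform(en_vecs, de_vecs)
    return en_shared, de_shared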
“…Turning to statistical and machine learning methods for POS tagging, these include various Hidden Markov model-based methods [9,20,73], maximum entropy-based methods [12,56,74,75,77], perceptron algorithm-based approaches [13,66,71], neural network-based approaches [11,14,33,38,59,60,80], Conditional Random Fields [34,35,37,43,44], Support Vector Machines [25,31,63,69], and other approaches including decision trees [61,62] and hybrid methods [19,36]. An overview of the POS tagging task can be found in [26,28].…”
Section: Related Work (mentioning)
confidence: 99%
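Of the classic methods this survey enumerates, the HMM family is the easiest to make concrete: decoding reduces to Viterbi search over log-probability tables. The toy decoder below only shows the decoding step; in a real tagger the tables would be estimated from a tagged corpus.

import numpy as np

def viterbi(obs, start, trans, emit):
    """Most probable tag sequence under an HMM, in log space.

    obs: list of word indices; start: (T,) log start probabilities;
    trans: (T, T) log transition probabilities trans[prev, cur];
    emit: (T, V) log emission probabilities.
    """
    T, n = len(start), len(obs)
    score = np.full((n, T), -np.inf)
    back = np.zeros((n, T), dtype=int)
    score[0] = start + emit[:, obs[0]]
    for i in range(1, n):
        cand = score[i - 1][:, None] + trans   # (prev_tag, cur_tag)
        back[i] = cand.argmax(axis=0)
        score[i] = cand.max(axis=0) + emit[:, obs[i]]
    # Backtrace from the best final tag.
    tags = [int(score[-1].argmax())]
    for i in range(n - 1, 0, -1):
        tags.append(int(back[i][tags[-1]]))
    return tags[::-1]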
“…To alleviate the problem of word sparsity, we also use task-specific latent continuous word representations, induced on 65 million unlabeled tweets with 1.3 billion tokens. We create three sets of word representations: CCA (Dhillon et al., 2012; Kim et al., 2015a), which is based on matrix factorization, and word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), which are gradient-based. All word representation algorithms produce 50-dimensional word vectors for all words occurring at least 40 times in the data.…”
Section: Basic Features (mentioning)
confidence: 99%
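The reported setup (50-dimensional vectors, minimum frequency 40) maps directly onto, for example, gensim's word2vec API; the corpus file below is hypothetical, and this sketch is one plausible way to reproduce those cutoffs, not the cited paper's exact tooling.

from gensim.models import Word2Vec

# Each line of the (hypothetical) corpus file is one pre-tokenized tweet.
tweets = [line.split() for line in open("tweets.txt", encoding="utf-8")]

model = Word2Vec(
    sentences=tweets,
    vector_size=50,   # 50-dimensional vectors, as the citation reports
    min_count=40,     # keep only words occurring at least 40 times
    sg=1,             # skip-gram variant
    workers=4,
)
vec = model.wv["hello"]  # 50-dim vector for any sufficiently frequent word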
“…An obvious solution to the problem is to develop methods of utilizing a large amount of unlabeled data. One way is to induce word embeddings in a real-valued vector space from a large number of tweets (Kim et al., 2015a; Mikolov et al., 2013; Pennington et al., 2014). Task-specific embeddings induced on tweets have been shown to be more powerful than those created from out-of-domain texts (Owoputi et al., 2012; Anastasakos et al., 2014).…”
Section: Introduction (mentioning)
confidence: 99%