Lack of labeled data is one of the most severe problems facing word sense disambiguation (WSD). We address this problem by proposing a method that combines automatic labeled data expansion (Step 1) and semisupervised learning (Step 2). Each step is effective on its own, and their combination yields a synergistic effect.

In this article, in Step 1, we automatically extract reliable labeled data from raw corpora using dictionary example sentences, even for infrequent and unseen senses (which are unlikely to appear in labeled data). Next, in Step 2, we apply a semisupervised classifier and achieve an improvement using easy-to-get unlabeled data. In this step, we also show that we can predict even unseen senses.

We target the SemEval-2010 Japanese WSD task, which is a lexical sample task. Both the Step 1 and Step 2 methods performed better than the best published result (76.4%). Furthermore, the combined method achieved much higher accuracy (84.2%). In this experiment, up to 50% of unseen senses are classified correctly. Because the number of unseen senses is small, however, we additionally delete one sense per word and apply our proposed method; the results show that the method is effective and robust even for unseen senses.
ACM Reference Format: Fujita, S. and Fujino, A. 2013. Word sense disambiguation by combining labeled data expansion and semisupervised learning method.