The authors propose a method for automatically generating Japanese-English bilingual thesauri based on bilingual corpora. The term bilingual thesaurus refers to a set of bilingual equivalent words and their synonyms. Most of the methods proposed so far for extracting bilingual equivalent word clusters from bilingual corpora depend heavily on word frequency and are not effective for dealing with low-frequency clusters. These low-frequency bilingual clusters are worth extracting because they contain many newly coined terms that are in demand but are not listed in existing bilingual thesauri. Assuming that single language-pair-independent methods such as frequency-based ones have reached their limitations and that a language-pair-dependent method used in combination with other methods shows promise, the authors propose the following approach: (a) Extract translation pairs based on transliteration patterns; (b) remove the pairs from among the candidate words; (c) extract translation pairs based on word frequency from the remaining candidate words; and (d) generate bilingual clusters based on the extracted pairs using a graph-theoretic method. The proposed method has been found to be significantly more effective than other methods.
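The four steps (a) to (d) can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the input format (records of aligned Japanese and English keyword lists), the toy `KANA` table with string similarity standing in for the transliteration patterns, the Dice coefficient standing in for the frequency-based measure, connected components standing in for the graph-theoretic clustering, and both thresholds.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from itertools import product

# Hypothetical toy katakana-to-romaji table; it covers only the demo words.
KANA = {"デ": "de", "ー": "", "タ": "ta", "ベ": "be", "ス": "su",
        "グ": "gu", "ラ": "ra", "フ": "fu"}

def romanize(word):
    """Return a rough romanization, or None if a character is not in KANA."""
    if not all(ch in KANA for ch in word):
        return None
    return "".join(KANA[ch] for ch in word)

def transliteration_pairs(records, threshold=0.5):
    """Step (a): pair words whose romanization resembles the English word."""
    pairs = set()
    for ja_words, en_words in records:
        for ja, en in product(ja_words, en_words):
            roman = romanize(ja)
            if roman and SequenceMatcher(None, roman, en.lower()).ratio() >= threshold:
                pairs.add((ja, en))
    return pairs

def frequency_pairs(records, threshold=0.5):
    """Step (c): pair words by the Dice coefficient of record co-occurrence."""
    ja_freq, en_freq, co_freq = Counter(), Counter(), Counter()
    for ja_words, en_words in records:
        ja_freq.update(set(ja_words))
        en_freq.update(set(en_words))
        co_freq.update(product(set(ja_words), set(en_words)))
    return {(ja, en) for (ja, en), c in co_freq.items()
            if 2 * c / (ja_freq[ja] + en_freq[en]) >= threshold}

def clusters(pairs):
    """Step (d): connected components of the graph whose edges are the pairs."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    for ja, en in pairs:
        # Tag nodes by language so identical strings cannot collide.
        parent[find(("ja", ja))] = find(("en", en))
    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node[1])
    return sorted(sorted(group) for group in groups.values())

def build_thesaurus(records):
    translit = transliteration_pairs(records)
    used_ja = {ja for ja, _ in translit}
    used_en = {en for _, en in translit}
    # Step (b): remove words already paired by transliteration.
    remaining = [([w for w in ja if w not in used_ja],
                  [w for w in en if w not in used_en]) for ja, en in records]
    return clusters(translit | frequency_pairs(remaining))

# Made-up demo records (aligned keyword lists), not data from the paper.
records = [
    (["データベース", "検索"], ["database", "retrieval"]),
    (["グラフ", "検索"], ["graph", "search"]),
    (["検索"], ["retrieval"]),
    (["検索"], ["search"]),
]

for cluster in build_thesaurus(records):
    print(cluster)
```

In this toy run the two katakana loanwords are paired in step (a) even though each occurs only once, which is the case frequency-based measures handle poorly. Removing them in step (b) leaves 検索 to be paired by co-occurrence with both "retrieval" and "search", and step (d) merges those two pairs into a single cluster.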
Introduction
We propose a method for automatically generating Japanese-English bilingual thesauri based on bilingual corpora. The thesaurus is a set of bilingual equivalent words and their synonyms (bilingual clusters) that properly accounts for newly coined terms. While such thesauri are useful in many fields, e.g., in cross-language information retrieval, query expansion, and translation in general, few are in existence. This is because existing thesauri are manually updated, a time-consuming process, and many of the automatic methods proposed so far are ineffective for this type of work.
Recently, the number of bilingual corpora has been increasing. Some of these are frequently updated, introducing newly coined bilingual clusters. Extracting these clusters automatically would be very useful. While some methods have been proposed for automatically extracting bilingual clusters from bilingual corpora (Kageura, Tsuji, & Aizawa, 2000) and many methods have been proposed for automatically extracting translation pairs from bilingual corpora (Ahrenberg, Andersson, & Merkel, 1998; Boutsis, Piperidi, & Demiros, 1999; Brown et al., 1993; Chen, Kishida, Jiang, Liang, & Gey, 1999; Collier, Hirakawa, & Kumano, 1997; Daille, Gaussier, & Lange, 1994; Fung, 1995; Fung & McKeown, 1994; Gale & Church, 1991; Gaussier, 1998; Hiemstra, 1998; Hull, 1998; Imamura, 2002; Jeong, Myaeng, Lee, & Choi, 1999; Ker & Chang, 1997; Kitamura & Matsumoto, 1996; Kumar & Byrne, 2002; Kupiec, 1993; Lopez, Nossal, Hwa, & Resnick, 2002; Melamed, 2000; Meyers, Yangarber, & Grishman, 1996; Shen & Dorr, 1997; Smadja, McKeown, & Hatsivsilloglou, 1996; Sun, Jin, Du, & Sun, 2000; Toutanova, Ilhan, & Manning, 2002; van der Eijk, 1993; Vogel, Ney, & Tillmann, 1996; Watanabe, Kurohashi, & Aramaki, 2000; Wu, 1995; Wu & Xia, 1994; Yamamoto & Matsum...