Yun-Hsuan Sung scite author profile

We present easy-to-use retrieval-focused multilingual sentence embedding models, made available on TensorFlow Hub. The models embed text from 16 languages into a shared semantic space using a multi-task trained dual-encoder that learns tied cross-lingual representations via translation bridge tasks (Chidambaram et al., 2018). The models achieve a new state of the art on monolingual and cross-lingual semantic retrieval (SR). Competitive performance is obtained on the related tasks of translation pair bitext retrieval (BR) and retrieval question answering (ReQA). On transfer learning tasks, our multilingual embeddings approach, and in some cases exceed, the performance of English-only sentence embeddings.


Effective Parallel Corpus Mining using Bilingual Sentence Embeddings

Guo, Shen, Yang et al. 2018

117


This paper presents an effective approach for parallel corpus mining using bilingual sentence embeddings. Our embedding models are trained to produce similar representations exclusively for bilingual sentence pairs that are translations of each other. This is achieved using a novel training method that introduces hard negatives consisting of sentences that are not translations but that have some degree of semantic similarity. The quality of the resulting embeddings is evaluated on parallel corpus reconstruction and by assessing machine translation systems trained on gold vs. mined sentence pairs. We find that the sentence embeddings can be used to reconstruct the United Nations Parallel Corpus (Ziemski et al., 2016) at the sentence level with a precision of 48.9% for en-fr and 54.9% for en-es. When adapted to document-level matching, we achieve a parallel document matching accuracy that is comparable to the significantly more computationally intensive approach of Uszkoreit et al. (2010). Using reconstructed parallel data, we are able to train NMT models that perform nearly as well as models trained on the original data (within 1-2 BLEU).
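The sentence-level reconstruction described above can be read as nearest-neighbour retrieval scored by precision at 1. The following is a minimal sketch under stated assumptions, not the paper's evaluation code: the function name `precision_at_1` is hypothetical, cosine similarity is assumed as the scoring function, and row i of the source matrix is assumed to be the gold translation of row i of the target matrix.

```python
import numpy as np

def precision_at_1(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Fraction of source sentences whose top-scoring target is the gold translation.

    Assumption: src_emb[i] and tgt_emb[i] embed a gold translation pair.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    scores = src @ tgt.T                  # scores[i, j]: similarity of source i to target j
    predicted = scores.argmax(axis=1)     # best-scoring target index for each source
    gold = np.arange(src.shape[0])        # gold target index equals the row index
    return float((predicted == gold).mean())

# Toy data: only the first of three pairs is aligned with its gold target.
toy_src = np.eye(3)
toy_tgt = np.eye(3)[[0, 2, 1]]
score = precision_at_1(toy_src, toy_tgt)
```

An exhaustive dot product like this is only practical for small candidate sets; mining a corpus the size of the UN data would normally rely on an approximate nearest-neighbour index, which this sketch leaves out.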


Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model

Chidambaram, Yang, Cer et al. 2019


The scarcity of labeled training data across many languages is a significant roadblock for multilingual neural language processing. We approach the lack of in-language training data using sentence embeddings that map text written in different languages, but with similar meanings, to nearby embedding space representations. The representations are produced using a dual-encoder based model trained to maximize the representational similarity between sentence pairs drawn from parallel data. The representations are enhanced using multitask training and unsupervised monolingual corpora. The effectiveness of our multilingual sentence embeddings is assessed on a comprehensive collection of monolingual, cross-lingual, and zero-shot/few-shot learning tasks. Models based on this work are available at https://tfhub.dev/ as: universal-sentence-encoder-xling/en-de, universal-sentence-encoder-xling/en-fr, and universal-sentence-encoder-xling/en-es. A large multilingual model is available as universal-sentence-encoder-xling/many.


Improving Multilingual Sentence Embedding using Bi-directional Dual Encoder with Additive Margin Softmax

Yang, Ábrego, Yuan et al. 2019


In this paper, we present an approach to learn multilingual sentence embeddings using a bi-directional dual-encoder with additive margin softmax. The embeddings are able to achieve state-of-the-art results on the United Nations (UN) parallel corpus retrieval task. In all the languages tested, the system achieves P@1 of 86% or higher. We use pairs retrieved by our approach to train NMT models that achieve similar performance to models trained on gold pairs. We explore simple document-level embeddings constructed by averaging our sentence embeddings. On the UN document-level retrieval task, document embeddings achieve around 97% on P@1 for all experimented language pairs. Lastly, we evaluate the proposed model on the BUCC mining task. The learned embeddings with raw cosine similarity scores achieve competitive results compared to current state-of-the-art models, and with a second-stage scorer we achieve a new state-of-the-art level on this task.
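As an illustration of the two ideas in this abstract, the sketch below shows one common way to write a bi-directional in-batch softmax loss with an additive margin, plus document embeddings formed by averaging sentence embeddings. It is a sketch under stated assumptions rather than the authors' implementation: the function names, the margin value of 0.3, the use of L2-normalised embeddings, and the absence of any similarity scaling factor are all assumptions made for the example.

```python
import numpy as np

def _log_softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def additive_margin_softmax_loss(src: np.ndarray, tgt: np.ndarray,
                                 margin: float = 0.3) -> float:
    """Bi-directional in-batch ranking loss with an additive margin.

    Assumptions: src[i] and tgt[i] embed a translation pair, every other row
    in the batch serves as a negative, and margin=0.3 is illustrative only.
    """
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T
    # Subtract the margin from the true-pair scores only (the diagonal),
    # so positives must beat negatives by at least that margin.
    sim = sim - margin * np.eye(sim.shape[0])
    forward = -np.mean(np.diag(_log_softmax(sim)))     # source -> target ranking
    backward = -np.mean(np.diag(_log_softmax(sim.T)))  # target -> source ranking
    return float(forward + backward)

def document_embedding(sentence_embeddings: np.ndarray) -> np.ndarray:
    """Document-level embedding as the plain average of its sentence embeddings."""
    return sentence_embeddings.mean(axis=0)

# Toy batch of four perfectly separated pairs.
batch = np.eye(4)
loss_without_margin = additive_margin_softmax_loss(batch, batch, margin=0.0)
loss_with_margin = additive_margin_softmax_loss(batch, batch, margin=0.3)
```

On this toy batch the margin lowers the true-pair scores before the softmax, so `loss_with_margin` is larger than `loss_without_margin`; during training, that pressure pushes translation pairs further from the in-batch negatives.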


Multilingual Universal Sentence Encoder for Semantic Retrieval

Yang, Cer, Amin et al. 2019

Preprint

