We present an efficient training approach to text retrieval with dense representations that applies knowledge distillation using the ColBERT late-interaction ranking model. Specifically, we propose to transfer knowledge from a bi-encoder teacher to a student by distilling ColBERT's expressive MaxSim operator into a simple dot product. The advantage of the bi-encoder teacher-student setup is that we can efficiently add in-batch negatives during knowledge distillation, enabling richer interactions between the teacher and student models. In addition, using ColBERT as the teacher reduces training cost compared to a full cross-encoder. Experiments on the MS MARCO passage and document ranking tasks and on data from the TREC 2019 Deep Learning Track demonstrate that our approach helps models learn robust representations for dense retrieval effectively and efficiently.

The standard reranker architecture, while effective, exhibits high query latency, on the order of seconds per query (Hofstätter and Hanbury, 2019; Khattab and Zaharia, 2020), because expensive neural inference must be applied at query time to query-passage pairs. This design is known as a cross-encoder (Humeau et al., 2020), which exploits query-passage attention interactions across all transformer layers. As an alternative, a bi-encoder design provides an approach to ranking with dense representations that is far more efficient than cross-encoders (