2020
DOI: 10.1609/aaai.v34i04.5722
Scalable Attentive Sentence Pair Modeling via Distilled Sentence Embedding

Abstract: Recent state-of-the-art natural language understanding models, such as BERT and XLNet, score a pair of sentences (A and B) using multiple cross-attention operations – a process in which each word in sentence A attends to all words in sentence B and vice versa. As a result, computing the similarity between a query sentence and a set of candidate sentences requires the propagation of all query-candidate sentence pairs throughout a stack of cross-attention layers. This exhaustive process becomes computationally …
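To make the cost asymmetry concrete, here is a minimal sketch of the two scoring regimes the abstract contrasts: pairwise cross-attention scoring versus DSE-style independent encoding with cached candidate embeddings. The toy encoder, sequence lengths, and the .norm() stand-in score are illustrative assumptions, not the paper's code.

```python
# Minimal sketch (not the paper's released code): the toy encoder, sequence
# lengths, and the .norm() "score" below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.set_grad_enabled(False)  # inference-only illustration

d_model = 128
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)

def embed(tokens: torch.Tensor) -> torch.Tensor:
    """Run the stack and mean-pool into a single sentence vector."""
    return encoder(tokens).mean(dim=1)

query = torch.randn(1, 16, d_model)                      # one tokenized query
candidates = [torch.randn(1, 16, d_model) for _ in range(100)]

# Cross-attention scoring (BERT/XLNet style): every query-candidate pair is
# concatenated and pushed through the whole stack, i.e. one full forward pass
# per candidate at query time. A real cross-encoder would feed the [CLS] state
# into a classification head; .norm() is just a stand-in score here.
cross_scores = [embed(torch.cat([query, c], dim=1)).norm().item() for c in candidates]

# DSE-style scoring: candidate embeddings are computed once (offline) and
# cached; at query time only the query is encoded, then a cheap similarity.
cached = torch.cat([embed(c) for c in candidates])       # (100, d) offline cache
q_vec = embed(query)                                     # one online forward pass
fast_scores = F.cosine_similarity(q_vec, cached)         # (100,) similarities
```

The point of the sketch is the shape of the work: the cross-attention path pays one full forward pass per candidate for every query, while the DSE-style path pays the candidate encoding cost once offline and answers queries with a single encoding plus vector similarities.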

Cited by 32 publications (17 citation statements) · References 11 publications
“…Notably, all these computations are applied once, for a given catalog, and can be executed in an offline manner and cached for later use. To further accelerate the computation time of the two CTDM scores applied through RecoBERT inference, one can adopt knowledge distillation techniques, such as (Barkan et al., 2019; Jiao et al., 2019; Lioutas et al., 2019), which are beyond the scope of this work.…”
Section: Computational Costs
Citation type: mentioning, confidence: 99%
“…By introducing additional teacher-predicted unlabeled documents as teacher guidance, RD shows its effectiveness in recommendation problems, but as mentioned in the introduction, there still remain problems in applying this method to query-based retrieval of documents. Distilled Sentence Embedding (DSE), introduced by Barkan et al. (2020), is a method for sentence embedding distillation that has been shown effective on the GLUE benchmark, but encoding query and documents independently disregards the interaction between query and documents, which is important for ranking problems. In this paper, we will discuss our student training method for ranking problems and propose a novel ranking student model, showing the merit of these ideas.…”
Section: Knowledge Distilling
Citation type: mentioning, confidence: 99%
“…Most related to our study is Hofstätter et al. (2020), who demonstrate that KD using a cross-encoder teacher significantly improves the effectiveness of bi-encoders for dense retrieval. Similarly, Barkan et al. (2020) investigate the effectiveness of distilling a trained cross-encoder into a bi-encoder for sentence similarity tasks. Gao et al. (2020a) explore KD combinations of different objectives such as language modeling and ranking.…”
Section: Introduction
Citation type: mentioning, confidence: 99%
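The distillation recipe referenced in these statements (a cross-attention teacher supervising a bi-encoder student) can be sketched roughly as below. The linear student, dimensions, and random teacher scores are placeholders standing in for a pretrained sentence encoder and cached cross-encoder outputs; none of this is taken from the cited papers' code.

```python
# Schematic cross-encoder -> bi-encoder score distillation; placeholders only.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy bi-encoder student; a real system would use a pretrained sentence encoder.
student = nn.Linear(300, 128)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def distillation_step(queries, docs, teacher_scores):
    """One KD step: pull the student's cosine scores toward the teacher's
    (precomputed) cross-attention scores for the same query-document pairs."""
    q_vecs = student(queries)                             # (B, d) query embeddings
    d_vecs = student(docs)                                # (B, d) document embeddings
    student_scores = F.cosine_similarity(q_vecs, d_vecs)  # (B,) bi-encoder scores
    loss = F.mse_loss(student_scores, teacher_scores)     # match the teacher's soft scores
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Teacher scores would come from an offline pass of a cross-encoder (full
# query-document cross-attention) over the training pairs; random stand-ins here.
queries = torch.randn(32, 300)
docs = torch.randn(32, 300)
teacher_scores = torch.rand(32)
print(distillation_step(queries, docs, teacher_scores))
```

Variants of this recipe differ mainly in the student's scoring function and the loss (MSE on scores, margin losses, or KL over softmaxed candidate lists), but the structure above is the common core.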