Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019) 2019
DOI: 10.18653/v1/w19-4308
Pitfalls in the Evaluation of Sentence Embeddings

Abstract: Deep learning models continuously break new records across different NLP tasks. At the same time, their success exposes weaknesses of model evaluation. Here, we compile several key pitfalls of evaluation of sentence embeddings, a currently very popular NLP paradigm. These pitfalls include the comparison of embeddings of different sizes, normalization of embeddings, and the low (and diverging) correlations between transfer and probing tasks. Our motivation is to challenge the current evaluation of sentence embeddings…
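A minimal sketch (not from the paper) of one pitfall the abstract names, normalization of embeddings: whether sentence vectors are L2-normalized changes dot-product-based scores, while cosine similarity is unaffected. The vectors below are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=768)        # hypothetical sentence embedding
b = rng.normal(size=768) * 5.0  # same dimensionality, larger norm

def l2_normalize(v):
    return v / np.linalg.norm(v)

dot_raw = float(a @ b)
dot_norm = float(l2_normalize(a) @ l2_normalize(b))
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"dot (raw):        {dot_raw:.3f}")
print(f"dot (normalized): {dot_norm:.3f}")  # identical to cosine
print(f"cosine:           {cosine:.3f}")
```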

Cited by 19 publications (13 citation statements)
References 16 publications
“…The relation of probing to downstream tasks is also unclear, as our multilingual results show. This is supported by recent findings giving contradictory claims regarding, e.g., the importance of the Word Content probing task for downstream performances (Eger et al., 2019; Wang and Kuo, 2020; Perone et al., 2018). Our findings further add to contemporaneous work by Ravichander et al. (2020) and Elazar et al. (2020), who showed that probes do not necessarily identify linguistic properties required for solving an actual task, thus questioning a common interpretation of probing itself.…”
Section: Downstream Tasks (supporting)
confidence: 55%
“…It is worthwhile to emphasize that we use only 768-dimensional vectors for sentence embeddings, while InferSent uses 4096-dimensional vectors. As explained in [14], [30], [52], an increase in the embedding dimension leads to increased performance for almost all models. This may explain why SBERT-WK is slightly inferior to InferSent on the SICK-R dataset.…”
Section: A. Semantic Textual Similarity (mentioning)
confidence: 68%
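A hedged sketch of the comparison this citing statement describes: an STS-style evaluation (Spearman correlation between cosine similarities and gold scores) run on encoders of different output sizes. The encoders here are random stand-ins, not SBERT-WK or InferSent, and the scores are meaningless; the point is only the evaluation setup in which dimensionality varies across compared models.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
n_pairs = 200
gold = rng.uniform(0.0, 5.0, size=n_pairs)  # hypothetical gold similarity scores

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate(dim):
    # Stand-in "encoder": random sentence vectors of the given size.
    left = rng.normal(size=(n_pairs, dim))
    right = rng.normal(size=(n_pairs, dim))
    preds = [cosine(l, r) for l, r in zip(left, right)]
    return spearmanr(preds, gold)[0]

for dim in (768, 4096):
    print(f"dim={dim}: Spearman rho = {evaluate(dim):.3f}")
```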
“…Hewitt and Liang (2019) pointed out that a simple linear probe is not enough to evaluate a representation. Recently, we have also seen non-linear probes (Eger et al., 2019). There are also efforts to inspect the representations from a geometric perspective (e.g.…”
Section: Related Work (mentioning)
confidence: 99%
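An illustrative sketch (my own, not from any of the cited papers) of the linear vs. non-linear probe distinction raised above: the same frozen "sentence embeddings" are probed with a linear classifier and with a small MLP. The data is synthetic; in practice the labels would come from a probing task such as sentence length or word content.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 768))                  # frozen, hypothetical embeddings
y = (np.sin(X[:, 0]) * X[:, 1] > 0).astype(int)   # non-linearly separable labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Linear probe: logistic regression on the frozen representations.
linear_probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Non-linear probe: a one-hidden-layer MLP on the same representations.
mlp_probe = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300,
                          random_state=0).fit(X_tr, y_tr)

print("linear probe accuracy:", round(linear_probe.score(X_te, y_te), 3))
print("MLP probe accuracy:   ", round(mlp_probe.score(X_te, y_te), 3))
```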