Finding BERT’s Idiomatic Key

Nedumpozhimana, Vasudevan; Kelleher, John D.

doi:10.18653/v1/2021.mwe-1.7

Cited by 10 publications

(9 citation statements)

References 11 publications


“…Probing assumes that the accuracy of the classification model (i.e., a probe) on the task indicates whether the embeddings encode information relevant to task target. There is a growing body of work using probing to examine what types of information are encoded in the embeddings created by Transformer models (Hewitt and Manning, 2019;Liu et al, 2019;Tenney et al, 2019;Nedumpozhimana and Kelleher, 2021), and also exploring what layer in the Transformer architecture different types of information are encoded in (Jawahar et al, 2019). In this work, we adapt the probing methodology to speech embeddings, and use it to understand and compare the phonetic information encoded in different layers of a Transformer model.…”

Section: Related Work (mentioning)

confidence: 99%

Domain-Informed Probing of wav2vec 2.0 Embeddings for Phonetic Features

Kelleher, Carson-Berndsen

2022

Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology


In recent years large transformer model architectures have become available which provide a novel means of generating high-quality vector representations of speech audio. These transformers make use of an attention mechanism to generate representations enhanced with contextual and positional information from the input sequence. Previous works have explored the capabilities of these models with regard to performance in tasks such as speech recognition and speaker verification, but there has not been a significant inquiry as to the manner in which the contextual information provided by the transformer architecture impacts the representation of phonetic information within these models. In this paper, we report the results of a number of probing experiments on the representations generated by the wav2vec 2.0 model's transformer component, with regard to the encoding of phonetic categorization information within the generated embeddings. We find that the contextual information generated by the transformer's operation results in enhanced capture of phonetic detail by the model, and allows for distinctions to emerge in acoustic data that are otherwise difficult to separate.
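The probing idea in the citation statement above (train a simple classifier on frozen embeddings and read its accuracy as evidence of what the embeddings encode) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the embeddings are synthetic stand-ins rather than wav2vec 2.0 or BERT outputs, and the probe is a plain logistic regression, which may differ from the probes used in the cited papers.

```python
import math
import random


def make_synthetic_embeddings(n, dim, informative, rng):
    """Toy stand-in for frozen model embeddings (hypothetical data).

    If `informative` is True, dimension 0 is shifted by the binary label,
    so the label is linearly recoverable; otherwise it is pure noise.
    """
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if informative:
            x[0] += 3.0 if label else -3.0
        X.append(x)
        y.append(label)
    return X, y


def train_probe(X, y, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, X, y):
    """Fraction of examples the probe classifies correctly."""
    correct = 0
    for x, label in zip(X, y):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == bool(label))
    return correct / len(X)


if __name__ == "__main__":
    rng = random.Random(0)
    for informative in (True, False):
        X_train, y_train = make_synthetic_embeddings(200, 8, informative, rng)
        X_test, y_test = make_synthetic_embeddings(200, 8, informative, rng)
        w, b = train_probe(X_train, y_train)
        print(informative, probe_accuracy(w, b, X_test, y_test))
```

In this toy setting, held-out accuracy well above chance suggests the label is recoverable from the embeddings, while accuracy near chance suggests it is not. Real probing studies typically add control tasks or baselines before drawing that conclusion.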


“…These latter methods have more recently been superseded by approaches making use of distributional similarity in the form of both static and contextualized word embeddings (Gharbieh et al, 2016;Ehren, 2017;Senaldi et al, 2019;Hashempour and Villavicencio, 2020;Fakharian, 2021;Garcia et al, 2021;Nedumpozhimana and Kelleher, 2021), while keeping the underlying assumption unchanged: the vector representation of the component words should be distant from the vector representation of the context or of the expression as a whole.…”

Section: Related Work (mentioning)

confidence: 99%

ID10M: Idiom Identification in 10 Languages

Tedeschi, Martelli, Navigli

2022

Findings of the Association for Computational Linguistics: NAACL 2022


Idioms are phrases which present a figurative meaning that cannot be (completely) derived by looking at the meaning of their individual components. Identifying and understanding idioms in context is a crucial goal and a key challenge in a wide range of Natural Language Understanding tasks. Although efforts have been undertaken in this direction, the automatic identification and understanding of idioms is still a largely underinvestigated area, especially when operating in a multilingual scenario. In this paper, we address such limitations and put forward several new contributions: we propose a novel multilingual Transformer-based system for the identification of idioms; we produce a high-quality automatically-created training dataset in 10 languages, along with a novel manually-curated evaluation benchmark; finally, we carry out a thorough performance analysis and release our evaluation suite at https://github.com/Babelscape/ID10M.


“…Finally, these latter methods have been superseded by approaches making use of distributional similarity in the form of both static and contextualized word embeddings (Gharbieh et al, 2016;Ehren, 2017;Senaldi et al, 2019;Liu and Hwa, 2019;Hashempour and Villavicencio, 2020;Kurfalı and Östling, 2020;Fakharian, 2021;Garcia et al, 2021;Nedumpozhimana and Kelleher, 2021), while keeping the underlying assumption unchanged, that is, the vector representation of the component words should be distant from the vector representation of the context, or of the expression as a whole.…”

Section: Related Work (mentioning)

confidence: 99%

NER4ID at SemEval-2022 Task 2: Named Entity Recognition for Idiomaticity Detection

Tedeschi, Navigli

2022

Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)


Idioms are lexically-complex phrases whose meaning cannot be derived by compositionally interpreting their components. Although the automatic identification and understanding of idioms is essential for a wide range of Natural Language Understanding tasks, they are still largely under-investigated. This motivated the organization of the SemEval-2022 Task 2, which is divided into two multilingual subtasks: one about idiomaticity detection, and the other about sentence embeddings. In this work, we focus on the first subtask and propose a Transformer-based dual-encoder architecture to compute the semantic similarity between a potentially-idiomatic expression and its context and, based on this, predict idiomaticity. Then, we show how and to what extent Named Entity Recognition can be exploited to reduce the degree of confusion of idiom identification systems and, therefore, improve performance. Our model achieves 92.1 F1 in the one-shot setting and shows strong robustness towards unseen idioms achieving 77.4 F1 in the zero-shot setting. We release our code at https://github.com/Babelscape/ner4id.
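The assumption quoted in the citation statement above, that the vector representation of an idiomatic expression's component words should be distant from the representation of its context, can be sketched as a cosine-similarity check. This is only a rough illustration and not the dual-encoder model described in the abstract: the mean pooling, the `looks_idiomatic` helper, and the 0.5 threshold are all hypothetical choices.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors (simple mean pooling)."""
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def looks_idiomatic(expression_vecs, context_vecs, threshold=0.5):
    """Flag an expression as idiomatic when it is dissimilar to its context.

    `threshold` is an assumed cut-off; in practice it would be tuned on
    labelled data or replaced by a trained classifier.
    """
    sim = cosine_similarity(mean_vector(expression_vecs),
                            mean_vector(context_vecs))
    return sim < threshold, sim


if __name__ == "__main__":
    # Toy 2-d vectors standing in for word embeddings (hypothetical values).
    print(looks_idiomatic([[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]))
    print(looks_idiomatic([[1.0, 0.0]], [[1.0, 0.1]]))
```

A fixed threshold on pooled vectors is the simplest form of this idea. The systems in the cited papers learn the comparison rather than hard-coding it.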
