Life after BERT: What do Other Muppets Understand about Language?

Vladislav, Lialin,; Zhao, K. X.; Namrata, Shivagunde,; Rumshisky, Anna

doi:10.18653/v1/2022.acl-long.227

Cited by 2 publications

(7 citation statements)

References 36 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Han et al (2021) and Liu et al (2022) prompted GPT3 to generate synthetic translation and NLI datasets, respectively. Lialin et al (2022) and Ettinger (2019) evaluated language models on smaller datasets for negation and role reversal. We extend these datasets to around 1500 data points and evaluate 22 models, including GPT3.…”

Section: Related Workmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

“…The field of analysis of pre-trained models has grown rapidly in recent years (Zagoury et al, 2021;Liu et al, 2021;Lialin et al, 2022;bench authors, 2023;Rogers et al, 2020). Methods such as attention pattern analysis (Kovaleva et al, 2019;Kobayashi et al, 2020), linear probing (Tenney et al, 2019), and zero-shot probing (Belinkov et al, 2020;Talmor et al, 2019;Ettinger, 2019;Lialin et al, 2022) allow us to evaluate specific capabilities of pre-trained models. Zero-shot methods give us arguably the most clear picture, as they directly probe what the model learned through the upstream task and allow the researcher to target very specific skills such as understanding of negation or role.…”

Section: Introductionmentioning

confidence: 99%

“…Psycholinguistic datasets used in a study by Ettinger (2019) have been particularly interesting in that they enabled a comparison between model behavior and human response, including both N400 effects and well-reasoned cloze judgments by human speakers. Despite being used in multiple studies since (Lialin et al, 2022;Rogers et al, 2020;Zhang et al, 2020), these datasets are quite small, ranging in size from 18 sentence pairs in negation (NEG-136-SIMP) to a maximum of 44 sentence pairs in the role-reversal dataset .…”

Section: Introductionmentioning

confidence: 99%

“…To understand how well language models perform on these extended datasets, we evaluated 22 models, including GPT3, following the methodology from Lialin et al (2022). Compared to the original test sets, we see a significant drop (up to 57%) in accuracy for both Role and Negation tasks.…”

Section: Introductionmentioning

confidence: 99%

See 4 more Smart Citations

Larger Probes Tell a Different Story: Extending Psycholinguistic Datasets Via In-Context Learning

Shivagunde,

Lialin,

Rumshisky

2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

View full text Add to dashboard Cite

Language model probing is often used to test specific capabilities of models. However, conclusions from such studies may be limited when the probing benchmarks are small and lack statistical power. In this work, we introduce new, larger datasets for negation (NEG-1500-SIMP) and role reversal (ROLE-1500) inspired by psycholinguistic studies. We dramatically extend existing NEG-136 and ROLE-88 benchmarks using GPT3, increasing their size from 18 and 44 sentence pairs to 750 each. We also create another version of extended negation dataset (NEG-1500-SIMP-TEMP), created using template-based generation. It consists of 770 sentence pairs. We evaluate 22 models on the extended datasets, seeing model performance dip 20-57% compared to the original smaller benchmarks. We observe high levels of negation sensitivity in models like BERT and ALBERT demonstrating that previous findings might have been skewed due to smaller test sets. Finally, we observe that while GPT3 has generated all the examples in ROLE-1500 is only able to solve 24.6% of them during probing. The datasets and code are available on Github 1 .

show abstract

Section: Related Workmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Larger Probes Tell a Different Story: Extending Psycholinguistic Datasets Via In-Context Learning

Shivagunde,

Lialin,

Rumshisky

2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

View full text Add to dashboard Cite

show abstract

Brain-Inspired Sparse Training in MLP and Transformers with Network Science Modeling via Cannistraci-Hebb Soft Rule

Zhang,

Zhao,

Liao

et al. 2024

Preprint

View full text Add to dashboard Cite

Dynamic sparse training is an effective strategy to alleviate the training and inference demands of artificial neural networks. However, current sparse training methods face a challenge in achieving high levels of sparsity while maintaining performance comparable to that of their fully connected counterparts. The Cannistraci-Hebb training (CHT) method produces an ultra-sparse advantage compared to fully connected training in various tasks by using a gradient-free link regrowth method, which relies solely on the network topology. However, its rigid selection based on link prediction scores may lead to epitopological local minima, especially at the beginning of the training process when the network topology might be noisy and unreliable. In this article, we introduce the Cannistraci-Hebb training soft rule (CHTs), which applies a flexible approach to both the removal and regrowth of links during training, fostering a balance between exploring and exploiting network topology. Additionally, we investigate the network topology initialization using several approaches, including the bipartite scale-free and bipartite small-world network models. Empirical results show that CHTs can surpass the performance of fully connected networks with MLP architecture by using only 1% of the connections (99% sparsity) on the MNIST, EMNIST, and Fashion MNIST datasets and can provide remarkable results with only 0.1% of the links (99.9% sparsity). In some MLPs for image classification tasks, CHTs can reduce the active neuron network size to 20% of the original nodes (neurons), demonstrating a remarkable ability to generalize better than fully connected architectures, reducing the entire model size. This represents a relevant result for dynamic sparse training. Finally, we present evidence from larger network models such as Transformers, with 10% of the connections (90% sparsity), where CHTs outperform other prevalent dynamic sparse training methods in machine translation tasks.

show abstract

Life after BERT: What do Other Muppets Understand about Language?

Cited by 2 publications

References 36 publications

Larger Probes Tell a Different Story: Extending Psycholinguistic Datasets Via In-Context Learning

Larger Probes Tell a Different Story: Extending Psycholinguistic Datasets Via In-Context Learning

Brain-Inspired Sparse Training in MLP and Transformers with Network Science Modeling via Cannistraci-Hebb Soft Rule

Contact Info

Product

Resources

About