NL-Augmenter 🦎 → 🐍 A Framework for Task-Sensitive Natural Language Augmentation

Dhole, Kaustubh; Gangal, Varun; Gehrmann, Sebastian; Gupta, Aadesh; Li, Zhenhao; Mahamood, Saad; Mahadiran, Abinaya; Mille, Simon; Shrivastava, Ashish; Tan, Samson; Wu, Tongshang; Sohl-Dickstein, Jascha; Choi, Jin‐Ho; Hovy, Eduard; Dušek, Ondřej; Ruder, Sebastian; Anand, Sajant; Aneja, Nagender; Banjade, Rabin; Barthe, Lisa; Behnke, Hanna; Berlot-Attwell, Ian; Boyle, Connor; Brun, Catherine; Cabezudo, Marco Antonio Sobrevilla; Cahyawijaya, Samuel; Chapuis, Emile; Che, Wanxiang; Choudhary, Mukund; Clauss, Christian; Colombo, Pierre; Cornell, Filip; Dagan, Gautier; Das, Mayukh; Tanay, Dixit,; Dopierre, Thomas; Dray, Paul-Alexis; Dubey, Suchitra; Tatiana, Ekeinhor,; Giovanni, Marco Di; Goyal, Tanya; Gupta, Rajiv; Hamla, Louanes; Han, Sang Wook; Harel-Canada, Fabrice; Honoré, Antoine; Jindal, Ishan; Joniak, Przemysław; Kleyko, Denis; Kovatchev, Venelin; Krishna, Kalpesh; Kumar, Ashok; Langer, Stefan; Ryan, Lee, Seungjae; Levinson, Corey James; Liang, Hualou; Liang, Kaizhao; Liu, Zhexiong; Lukyanenko, Andrey; Marivate, Vukosi; Melo, Gerard de; Méoni, Simon; Meyer, Maxine; Mir, Afnan; Moosavi, Nafise Sadat; Meunnighoff, Niklas; Hon, Mun, Timothy Sum; Murray, Kenton; Namysł, Marcin; Obedkova, Maria; Oli, Priti; Pasricha, Nivranshu; Pfister, Jan; Plant, Richard E.; Prabhu, Vinay; Pais, V.; Qin, Libo; Raji, Shahab; Kumar, Rajpoot, Pawan; Raunak, Vikas; Rinberg, Roy; Roberts, Nicholas J.; Rodríguez, Juan Diego; Roux, Claude; Samus, Vasconcellos; Sai, Ananya B.; Schmidt, Robin; Scialom, Thomas; Sefara, Tshephisho Joseph; Shamsi, Saqib; Shen, Xudong; Shi, Yiwen; Shi, Haoyue; Shvets, Anna; Siegel, Nick; Sileo, Damien; Simon, Jamie; Singh, Chandan Deep; Sitelew, Roman; Soni, Praveen; Sorensen, Taylor; Soto, William; Srivastava, Aman; Srivatsa, Aditya; Sun, Tony; Varma, Mukund; A, Tabassum,; Tan, Fiona; Teehan, Ryan; Tiwari, Mo; Tolkiehn, Marie; Athena, Wang,; Wang, Zijian; Wang, Zijie; Wang, Gloria; Wei, Fuxuan; Wilie, Bryan; Winata, Genta Indra; Wang, Xinyu; Wydmanski, Witold; Xie, Tianbao; Yaseen, Usama; Yee, Michael; Zhang, Jing; Zhang, Yue

doi:10.3384/nejlt.2000-1533.2023.4725

Cited by 6 publications

(4 citation statements)

References 71 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Much of their success has been extended beyond language related tasks -essentially and arguably, any type of data with sequential properties like speech, music, etc. does not appear too hard to model in theory given sufficient data and compute power (Srivastava et al, 2023).…”

Section: Discussionmentioning

confidence: 99%

“…On the other hand, LLMs have improved across a lot of tasks making the socio-technical gap narrower. As there is more exposure to data, LLMs have improved in parameters of cognition and meaning as estimates across language benchmarks are improving Nguyen et al, 2016;Sakaguchi et al, 2021;Srivastava et al, 2023;Wang et al, 2018;Gehrmann et al, 2022.…”

Section: The Framing Trapmentioning

confidence: 99%

See 1 more Smart Citation

Proceedings of the Big Picture Workshop

2023

View full text Add to dashboard Cite

A key contribution to being a successful researcher in natural language processing, as in any area, is having a clear overarching vision of what your body of research is trying to accomplish. Using my own 40-year career as an example, I will attempt to provide general advice on formulating and pursuing a coherent research vision. In particular, I will focus on formulating a unique, personal objective that exploits your specific talents, knowledge, and passions, and that is distinct from the current popular trends in the field. I will also focus on formulating a vision that bridges existing fields of study to produce an overarching agenda that unifies previously disparate ideas.

show abstract

Section: Discussionmentioning

confidence: 99%

Section: The Framing Trapmentioning

confidence: 99%

Proceedings of the Big Picture Workshop

2023

View full text Add to dashboard Cite

show abstract

“…Text perturbations are divided into symbol-, word-, and sentence-level perturbations. Our selection of text perturbation levels draws upon the methodology designed in NL Augmenter [84]. We use NLPAug 2 , NL Augmenter 3 and back-translation from an EasyNMT 4 to craft perturbations.…”

Section: Questions Perturbationsmentioning

confidence: 99%

Analyzing the Robustness of Vision & Language Models

Shirnin,

Andreev,

Potapova

et al. 2024

IEEE/ACM Trans. Audio Speech Lang. Process.

View full text Add to dashboard Cite

We present an approach to evaluate the robustness of pre-trained vision and language (V&L) models to noise in input data. Given a source image/text, we perturb it using standard computer vision (CV) / natural language processing (NLP) techniques and feed it to a V&L model. To track performance changes, we explore the problem of visual questions answering (VQA). Overall, we utilize 5 image and 9 text perturbation techniques and probe three Transformer-based V&L models followed by a broad analysis of their behavior and a detailed comparison. We discovered several key findings regarding the performance of the models in relation to the impact of various perturbations. These discrepancies in performance can be attributed to differences in their architectures and learning objectives.Last, but not least, we perform an empirical study to assess whether the attention mechanism of V&L Transformers learns to align modalities. We hypothesize, that attention weights for related objects and words, should be on average higher than for random object/word pairs. However, our study shows that, unlike is believed for machine translation models, V&L models do not learn alignment at all or exhibit less evidence to do so. This may support the intuition that V&L Transformers overfit to either of the modalities.

show abstract

“…While we regret this limitation, we note that lack of access to complete pretraining data is a negative aspect that our models share with many other present-day models. Future work may consider increasing the available data via augmentation techniques (Dhole et al, 2021) or mixing with data from a different modality such as code (Muennighoff et al, 2023b,a;. The mC4-Fi and CC-Fi datasets are both derived from Common Crawl data, but cover different sets of crawls and apply different selection criteria and text extraction and filtering pipelines.…”

Section: Limitationsmentioning

confidence: 99%

FinGPT: Large Generative Models for a Small Language

Luukkonen,

Komulainen,

Luoma

et al. 2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

View full text Add to dashboard Cite

Large language models (LLMs) excel in many tasks in NLP and beyond, but most open models have very limited coverage of smaller languages and LLM work tends to focus on languages where nearly unlimited data is available for pretraining. In this work, we study the challenges of creating LLMs for Finnish, a language spoken by less than 0.1% of the world population. We compile an extensive dataset of Finnish combining web crawls, news, social media and eBooks. We pursue two approaches to pretrain models: 1) we train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT, 2) we continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI. For model evaluation, we introduce FIN-bench, a version of BIG-bench with Finnish tasks. We also assess other model qualities such as toxicity and bias. Our models and tools are openly available at https://turkunlp.org/gpt3-finnish.

show abstract

NL-Augmenter 🦎 → 🐍 A Framework for Task-Sensitive Natural Language Augmentation

Cited by 6 publications

References 71 publications

Proceedings of the Big Picture Workshop

Proceedings of the Big Picture Workshop

Analyzing the Robustness of Vision & Language Models

FinGPT: Large Generative Models for a Small Language

Contact Info

Product

Resources

About