“…In order to test the efficacy of VT, we consider two generation tasks, question answering (QA) and question generation (QG), and two classification tasks, sentiment analysis and natural language inference (NLI). As QA datasets, we use SQuAD (Rajpurkar et al., 2016) (English), Spanish SQuAD (Casimiro Pio et al., 2019) (Spanish), FQuAD (d'Hoffschmidt et al., 2020) (French), Italian SQuAD (Croce et al., 2018) (Italian), JAQuAD (So et al., 2022) (Japanese), KorQuAD (Lim et al., 2019) (Korean), and SberQuAD (Efimov et al., 2020) (Russian). For QG, we use the same datasets adapted for QG via QG-Bench (Ushio et al., 2022).…”