“…To combat the issue of data starvation, many researchers aim to utilize monolingual data to train NMT systems (Lample et al., 2018a; Artetxe et al., 2018; Conneau and Lample, 2019) and to find ways to generate more training data, either comparable or synthetic. Comparable data are extracted using various bitext retrieval methods (Zhao and Vogel, 2002; Fan et al., 2021; Kocyigit et al., 2022), multimodal signals (Hewitt et al., 2018; Rasooli et al., 2021), or dictionary- or knowledge-based approaches (Wijaya and Mitchell, 2016; Wijaya et al., 2017; Tang and Wijaya, 2022). Synthetic data, in turn, are created through innovative training data augmentation (Kuwanto et al., 2021), automatic backtranslation (Sennrich et al., 2016a; Wang et al., 2019), or outright generation with generative models (Lu et al., 2023), an approach that has gained increasing attention from the community due to recent advances in large language models (LLMs).…”