“…This presentation, alongside the anthropomorphic bias of deep learning models (Watson, 2019), can perpetuate these opinions, including harmful stereotypes. This is a general limitation of NLG models that we are unable to capture using standardized benchmarks alongside intrinsic evaluations, and others have thus called for more work to evaluate models in the physical and cultural context in which they are applied (Liebling et al., 2022; Bhatt et al., 2022). We also note that few, if any, benchmarks currently report the environmental side-effects of training and serving NLG models (Strubell, Ganesh, & McCallum, 2019).…”