“…Apart from some familiar topics such as disease detection (Oh et al, 2020; Luo et al, 2020; Lu et al, 2020b; Rajpurkar et al, 2017; Lu et al, 2020a; Ranjan et al, 2018) and lung segmentation (Eslami et al, 2020), the most closely related computer vision task is the emerging topic of image-based captioning, which aims at generating realistic sentences or topic-related paragraphs to summarize the visual content of images or videos (Vinyals et al, 2015; Xu et al, 2015; Goyal et al, 2017; Rennie et al, 2017; Huang et al, 2019; Feng et al, 2019; Pei et al, 2019; Tran et al, 2020). Not surprisingly, recent progress in medical report generation (Jing et al, 2018, 2019; Li et al, 2018; Xue et al, 2018; Yuan et al, 2019; Wang et al, 2018; Lovelace and Mortazavi, 2020; Srinivasan et al, 2020; Zhang et al, 2020; Huang et al, 2021; Gasimova et al, 2020; Singh et al, 2019; Nishino et al, 2020) has been particularly influenced by the successes of image-based captioning. The works of Vinyals et al (2015) and Xu et al (2015) are among the early approaches in medical report generation, where visual features are extracted by convolutional neural networks (CNNs) and subsequently fed into recurrent neural networks (RNNs) to generate textual descriptions.…”