“…
Input | Method
caption | [16], [33], [40], [42], [41], [48], [35], [54], [43], [55], [58], [61], [65], [67], [68], [69], [70], [75], [80], [86], [34], [87], [128]
caption + dialogue | [93], [95], [99]
caption + layout | [104], [97], [108], [103]
caption + semantic masks | [109], [110], [113], [114], [115], [38]
scene graphs | [116], [121], [122], [124],…”