“…According to the requirements of parallel training samples, existing solutions can be divided into two types: those that use parallel stylized image-caption data [41,11,54,1] and those that do not [22,42]. Subsequently, the community gradually shifted its emphasis to controlling the described contents [16,77,27,10,78,48,35] or structures [20,19,75,76,18,60,37,36,64], aiming to generate discriminative and unique captions for individual images. Unfortunately, due to the subjective nature of diverse and distinctive captions, effective evaluation remains an open problem, and several new metrics have been proposed, such as SPICE-U [67], CIDErBtw [64], self-CIDEr [66], word recall [58], and mBLEU [52].…”