“…However, none of these studies proposes predictive models for cross-modal correlation categories. Recently, researchers have paid more attention to predicting cross-modal correlation categories and have extended the existing classification systems based on image specificity [13], emotion [14], interrelation metrics [15,16], parallel and non-parallel relations [17], contextual and semiotic relations [18], visual content contribution [19], etc. They annotate data and train models to predict cross-modal correlation categories, framing the problem as either multimodal regression [13] or multimodal classification [14,15,16,17,18,19].…”