Noticing the rich visual information contained in VRDs, several methods [6,16,26,32] exploit 2D layout information to complement textual content. For further improvement, mainstream studies [2,21,24,30,38,48,50] typically employ a shallow fusion of text, image, and layout features to capture contextual dependencies. Recently, several pre-training models [28,45,46] have been proposed to jointly learn a deep cross-modal fusion on large-scale data, outperforming their counterparts on document understanding.