“…In [1,5,8,2], the authors propose joint approaches that model the interaction between textual, visual, and layout information within a unified multimodal pre-training network. In addition, [3] exploits cross-modal learning during the pre-training stage to build a task-agnostic framework that models textual, visual, and layout information without requiring annotated document data. In [4], the authors encourage multimodal interaction through a multimodal transformer architecture for visual document understanding.…”