“…Yuan et al. [20] introduced Multi-layer Cross-fusion with a Reconstructor (MCR) to generate a textual summary from a multimodal video collection. MCR performs cross-fusion through stacked layers of cross-modal transformer blocks, which yields a cross-modal representation.…”
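
The cross-fusion idea described in the excerpt can be illustrated with a minimal NumPy sketch. This is not the MCR architecture from [20]: it assumes single-head attention, random untrained weights, and residual connections only (no layer normalization, feed-forward sublayers, or reconstructor). The names `cross_attention`, `cross_fusion`, `make_layer`, and the dictionary keys are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    # Single-head cross-attention: tokens in `queries` attend over `context`.
    q = queries @ w_q                          # (n_q, d)
    k = context @ w_k                          # (n_c, d)
    v = context @ w_v                          # (n_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_q, n_c)
    return softmax(scores, axis=-1) @ v        # (n_q, d)

def cross_fusion(text, video, layers):
    # Hypothetical stack of cross-fusion layers: in each layer, text attends
    # to video and video attends to text, each with a residual connection.
    for params in layers:
        new_text = text + cross_attention(text, video, *params["text_from_video"])
        new_video = video + cross_attention(video, text, *params["video_from_text"])
        text, video = new_text, new_video
    # Concatenate both streams along the token axis as the fused representation.
    return np.concatenate([text, video], axis=0)

rng = np.random.default_rng(0)
d = 8  # assumed shared feature dimension for both modalities

def make_layer():
    # Three projection matrices (query, key, value) per attention direction.
    return {
        "text_from_video": [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)],
        "video_from_text": [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)],
    }

layers = [make_layer() for _ in range(2)]
text = rng.normal(size=(5, d))    # 5 text tokens (toy data)
video = rng.normal(size=(7, d))   # 7 video-frame features (toy data)

fused = cross_fusion(text, video, layers)
print(fused.shape)  # (12, 8)
```

Updating both streams from the previous layer's outputs, rather than sequentially, keeps the two attention directions symmetric. A real model would typically add multi-head attention, normalization, and a decoder that generates the summary from the fused representation.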