Multimodal abstractive summarisation (MAS) aims to generate a textual summary from a collection of multimodal data, such as video-text pairs. Despite the success of recent work, existing methods lack a thorough analysis of consistency across modalities. Moreover, previous work relies on fusion methods to extract multimodal semantics, neglecting constraints on the complementary semantics of each modality. To address these issues, a multilayer cross-fusion model with a reconstructor is proposed for the MAS task. The model thoroughly conducts cross-fusion between modalities via layers of cross-modal transformer blocks, yielding cross-modal fusion representations that are consistent across modalities. A reconstructor is then employed to reproduce the source modalities from the cross-modal fusion representations; this reconstruction process constrains the fusion representations to retain the complementary semantics of each modality. Comprehensive comparison and ablation experiments are conducted on the open-domain multimodal dataset How2. The results empirically verify the effectiveness of the multilayer cross-fusion with the reconstructor structure in the proposed model.
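
To make the described architecture concrete, below is a minimal PyTorch sketch of one plausible reading of the multilayer cross-fusion with a reconstructor: stacked blocks in which each modality attends to the other, followed by per-modality reconstructors whose loss constrains the fused representations. All module names, dimensions, the use of standard nn.MultiheadAttention, and the MSE reconstruction loss are illustrative assumptions; the paper's exact block design and objective may differ.

import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One cross-fusion layer: each modality attends to the other (assumed design)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.txt_attends_vid = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vid_attends_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, txt, vid):
        # Text queries attend over video keys/values and vice versa,
        # so each stream absorbs semantics from the other modality.
        t, _ = self.txt_attends_vid(txt, vid, vid)
        v, _ = self.vid_attends_txt(vid, txt, txt)
        return self.norm_t(txt + t), self.norm_v(vid + v)

class CrossFusionWithReconstructor(nn.Module):
    """Stack of cross-modal blocks plus per-modality reconstructors (hypothetical)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            CrossModalBlock(d_model, n_heads) for _ in range(n_layers)
        )
        # Reconstructors map fused representations back toward the original
        # modality features, constraining the fusion to retain each
        # modality's complementary semantics.
        self.txt_reconstructor = nn.Linear(d_model, d_model)
        self.vid_reconstructor = nn.Linear(d_model, d_model)

    def forward(self, txt, vid):
        fused_t, fused_v = txt, vid
        for layer in self.layers:
            fused_t, fused_v = layer(fused_t, fused_v)
        # Reconstruction loss over both modalities (MSE as a stand-in).
        rec_loss = (
            nn.functional.mse_loss(self.txt_reconstructor(fused_t), txt)
            + nn.functional.mse_loss(self.vid_reconstructor(fused_v), vid)
        )
        return fused_t, fused_v, rec_loss

# Toy usage: batch of 2, 10 text tokens, 16 video frames, d_model = 512.
model = CrossFusionWithReconstructor()
txt = torch.randn(2, 10, 512)
vid = torch.randn(2, 16, 512)
fused_t, fused_v, rec_loss = model(txt, vid)
print(fused_t.shape, fused_v.shape, rec_loss.item())

In training, rec_loss would be added to the summary-generation loss so that the fused representations feeding the summariser stay faithful to both source modalities.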