2021
DOI: 10.1155/2021/2662064

RDMMFET: Representation of Dense Multimodality Fusion Encoder Based on Transformer

Abstract: Visual question answering (VQA) is the task of answering natural-language questions about visual images. A VQA model must produce answers to specific questions based on its understanding of an image, and the most important requirement is understanding the relationship between images and language. Therefore, this paper proposes a new model, the Representation of Dense Multimodality Fusion Encoder Based on Transformer (RDMMFET for short), which can learn the knowledge relating vision and language. The RDMMF…
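The abstract is cut off here, but the architecture it names, a Transformer encoder that fuses dense visual and language features, follows a well-known pattern. The sketch below is a hypothetical PyTorch illustration of such a fusion encoder, not the authors' implementation; the class name DenseFusionEncoder and all dimensions are assumptions.

```python
# Minimal sketch (assumed, not the authors' code): a transformer encoder
# that fuses dense image-region features with language tokens.
import torch
import torch.nn as nn

class DenseFusionEncoder(nn.Module):  # hypothetical name
    def __init__(self, vocab_size=30522, region_dim=2048,
                 d_model=768, nhead=12, num_layers=6):
        super().__init__()
        # Project detector region features into the shared embedding space.
        self.region_proj = nn.Linear(region_dim, d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids, region_feats):
        # token_ids:    (batch, seq_len)              question word ids
        # region_feats: (batch, n_regions, region_dim) image-region features
        text = self.token_emb(token_ids)
        vision = self.region_proj(region_feats)
        # Concatenate both modalities and let self-attention mix them, so
        # every word can attend to every image region and vice versa.
        fused = torch.cat([text, vision], dim=1)
        return self.encoder(fused)

# Usage: fuse a 20-token question with 36 detected regions.
enc = DenseFusionEncoder()
out = enc(torch.randint(0, 30522, (2, 20)), torch.randn(2, 36, 2048))
print(out.shape)  # torch.Size([2, 56, 768])
```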

Cited by 4 publications (4 citation statements) | References 23 publications
“…As mentioned above, we investigate how to utilize a deep learning model for English translation in order to better support the construction of a machine translation system [27]. We design an English translation method based on an encoder-decoder structure and an attention mechanism.…”
Section: Ⅲ. Methods
Mentioning confidence: 99%
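The encoder-decoder-with-attention design this citing paper describes is a standard construction. Below is a minimal PyTorch sketch of one such design, assuming a GRU encoder and decoder with dot-product attention; the name Seq2SeqAttn and all sizes are illustrative, not taken from the paper.

```python
# Minimal sketch (assumed): a GRU encoder-decoder with dot-product
# attention, the kind of structure described for translation.
import torch
import torch.nn as nn

class Seq2SeqAttn(nn.Module):  # hypothetical name
    def __init__(self, src_vocab, tgt_vocab, d=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d)
        self.tgt_emb = nn.Embedding(tgt_vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(2 * d, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        enc_out, h = self.encoder(self.src_emb(src_ids))
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)
        # Dot-product attention: each decoder state scores all encoder states.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))
        context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)
        # Predict each target token from its state plus the attended context.
        return self.out(torch.cat([dec_out, context], dim=-1))

model = Seq2SeqAttn(src_vocab=8000, tgt_vocab=8000)
logits = model(torch.randint(0, 8000, (2, 15)),   # source sentence ids
               torch.randint(0, 8000, (2, 12)))   # shifted target ids
print(logits.shape)  # torch.Size([2, 12, 8000])
```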
“…In addition, as mentioned before, the model should adequately study both local and global features. Given these concerns, we constructed a model composed of tandem Gated Recurrent Units (GRU) [25][26] and a Transformer encoder [27][28] for sequence modelling (Figure 2a). The data sequence was split into short fragments and fed into GRU cells, whose outputs were concatenated and processed by the Transformer encoder (Figure 2a).…”
Section: Standalone Neural Network for Pattern Extraction
Mentioning confidence: 99%
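The tandem arrangement described here, short fragments encoded by a GRU and their summaries fed to a Transformer encoder, can be sketched as follows. This is a hypothetical PyTorch illustration under assumed fragment length and dimensions, not the citing paper's code.

```python
# Minimal sketch (assumed): split a long sequence into short fragments,
# encode each fragment with a GRU (local features), then model context
# across fragment summaries with a Transformer encoder (global features).
import torch
import torch.nn as nn

class GRUTransformer(nn.Module):  # hypothetical name
    def __init__(self, in_dim=16, d=128, frag_len=25, nhead=8, layers=2):
        super().__init__()
        self.frag_len = frag_len
        self.gru = nn.GRU(in_dim, d, batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=nhead,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, x):
        # x: (batch, seq_len, in_dim); seq_len divisible by frag_len here.
        b, t, f = x.shape
        frags = x.reshape(b * (t // self.frag_len), self.frag_len, f)
        # The last GRU hidden state summarizes each short fragment.
        _, h = self.gru(frags)
        frag_vecs = h[-1].reshape(b, t // self.frag_len, -1)
        # Self-attention over fragment summaries captures global structure.
        return self.transformer(frag_vecs)

model = GRUTransformer()
out = model(torch.randn(4, 200, 16))  # 200 steps -> 8 fragments of 25
print(out.shape)  # torch.Size([4, 8, 128])
```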
“…y_i is its corresponding ground-truth intent label. An RNN-based encoder-decoder captures temporal correlation through model parameters (like memories), while transformer-based models (Vaswani et al. 2017a; Han et al. 2021; Xu et al. 2021; Yi and Qu 2022; Wu et al. 2023) design attention modules to capture all possible relationships. In other words, for a trained model, the attention mechanism relies explicitly on the data itself to capture temporal correlation, while an LSTM/RNN memorizes temporal information implicitly through model parameters.…”
Section: Framework Overview
Mentioning confidence: 99%
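The contrast drawn here, explicit data-dependent attention versus implicit parametric memory, is visible in the attention computation itself. Below is a minimal sketch of scaled dot-product self-attention, with all shapes assumed for illustration.

```python
# Minimal sketch (assumed): scaled dot-product self-attention computes
# pairwise scores directly from the data, making temporal relationships
# explicit, unlike an RNN whose memory lives in its trained parameters.
import math
import torch

def self_attention(x):
    # x: (batch, seq_len, d); here queries, keys, and values are x itself.
    d = x.size(-1)
    scores = torch.bmm(x, x.transpose(1, 2)) / math.sqrt(d)  # (b, t, t)
    weights = torch.softmax(scores, dim=-1)  # explicit pairwise relations
    return torch.bmm(weights, x), weights

out, attn = self_attention(torch.randn(2, 10, 64))
print(out.shape, attn.shape)  # torch.Size([2, 10, 64]) torch.Size([2, 10, 10])
```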