2020
DOI: 10.1016/j.patcog.2020.107279

Three-stream fusion network for first-person interaction recognition

Abstract: First-person interaction recognition is a challenging task because of unstable video conditions resulting from the camera wearer's movement. For human interaction recognition from a first-person viewpoint, this paper proposes a three-stream fusion network with two main parts: three-stream architecture and three-stream correlation fusion. The three-stream architecture captures the characteristics of the target appearance, target motion, and camera egomotion. Meanwhile, the three-stream correlation fusion combines…

Cited by 8 publications (6 citation statements)
References 45 publications
“…TSCF, TSDF, and KRP represent Three-stream Correlation Fusion, Three-stream Deep Fusion, and Kernelized Ranked Pooling, respectively. The presented results of [12,16] are reported from [17]. It should be noted that all of the compared methods utilize raw RGB frames.…”
Section: Results (mentioning)
confidence: 99%
“…After extracting feature maps, maximum and average values are considered for the fusion step to obtain a unique feature map. In [17], the same architecture with a new correlation-based fusion approach is utilized. In these two articles, for the classification step, an LSTM network has been exploited.…”
Section: Raw Frame Features (mentioning)
confidence: 99%
“…Over the past few years, visual question answering (VQA) has attracted substantial attention from both the computer vision and natural language processing communities [1][2][3][4][5][6][7][8]. Compared to the traditional tasks of computer vision or natural language processing, such as object detection [9], image captioning [10][11][12][13][14], tracking [15,16], face recognition [17,18], action recognition [19][20][21],…”
Section: Introduction (mentioning)
confidence: 99%