2013
DOI: 10.1007/s00138-013-0521-1

Human interaction categorization by using audio-visual cues

Abstract: Human Interaction Recognition in uncontrolled TV video material is a very challenging problem because of the huge intra-class variability of the classes (due to large differences in the way actions are performed, lighting conditions and camera viewpoints, amongst others) as well as the small existing inter-class variability (e.g. the visual difference between hug and kiss is very subtle). Most previous works have focused only on visual information (i.e. the image signal), thus missing an important source of info…

Cited by 26 publications (16 citation statements)
References 26 publications
“…Social networking methods model human behavior from gestures, body motion and speech (Patron-Perez, A., Marszalek, M., Reid, I., and Zisserman, A. (2012) [18]; Marín-Jiménez, M. J., Muñoz-Salinas, R., Yeguas-Bolivar, E., and de la Blanca, N. P. (2014) [19]). By estimating the orientation and location of each person's face and computing a line of sight for each face, we can model social interactions and obtain the locations of the individuals (Fathi, A., Hodgins, J. K., and Rehg, J. M. (2012) [20]), and by modeling relationships between interacting persons we can predict joint social interaction Park, H. S., and Shi, J.…”
Section: Social Networking Methods
confidence: 99%
“…Shao et al. [22] mixed appearance and motion features using multi-task deep learning for recognizing group activities in crowded scenes collected from the web. Marín-Jiménez et al. [23] used a bag of visual-audio words scheme along with late fusion for recognizing human interactions in TV shows. Even though their method performs well in recognizing human interactions, the lack of an intrinsic audio-visual relationship estimation limits the recognition problem.…”
Section: Related Work
confidence: 99%
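The late-fusion scheme mentioned in the snippet above (independent visual and audio classifiers whose per-class scores are combined after classification) can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the class names, score values, and fusion weight are made up for the example.

```python
import numpy as np

def late_fusion(visual_scores, audio_scores, w_visual=0.5):
    """Combine per-class scores from two independently trained
    classifiers by a convex weighted sum (simple late fusion).

    Returns the fused score vector and the index of the winning class.
    """
    visual_scores = np.asarray(visual_scores, dtype=float)
    audio_scores = np.asarray(audio_scores, dtype=float)
    fused = w_visual * visual_scores + (1.0 - w_visual) * audio_scores
    return fused, int(np.argmax(fused))

# Hypothetical per-class scores over {handshake, hug, kiss}:
visual = [0.5, 0.3, 0.2]   # e.g. from a visual bag-of-words classifier
audio = [0.1, 0.2, 0.7]    # e.g. from an audio bag-of-words classifier

fused, label = late_fusion(visual, audio, w_visual=0.5)
# With equal weights, the audio evidence tips the decision to class 2.
```

Because fusion happens only at the score level, no joint audio-visual relationship is modeled, which is exactly the limitation the citing authors point out.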
“…Gaidon et al. [16] propose to use cluster-trees of tracklets. Marin-Jimenez et al. [27] propose to use audio features, as well as visual features, for interaction recognition. While audio features can be useful, in this work we focus solely on visual features, and show that, without the need for complex models, our simple framework of two-person based visual features coupled with spatial multiple instance embedding proves to be an effective way for two-person interaction recognition.…”
Section: Related Work
confidence: 99%