Annotating Objects and Relations in User-Generated Videos

Shang, Xindi; Di, Donglin; Xiao, Junbin; Cao, Yu; Yang, Xun; Chua, Tat-Seng

doi:10.1145/3323873.3325056

Cited by 125 publications

(87 citation statements)

References 29 publications


“…A performance evaluation experiment of VSGG-Net is performed using two benchmark datasets, VidOR [10] (https://xdshang.github.io/docs/vidor.html) and VidVRD [3] (https://xdshang.github.io/docs/imagenet-vidvrd.html). The VidOR video dataset includes 80 object types and 50 relationship types.…”

Section: Experiments 4.1 Dataset and Model Training (mentioning)

confidence: 99%

Tracklet Pair Proposal and Context Reasoning for Video Scene Graph Generation

Jung, Lee, Kim. 2021. Sensors


Video scene graph generation (ViDSGG), the creation of video scene graphs that helps in deeper and better visual scene understanding, is a challenging task. Segment-based and sliding-window based methods have been proposed to perform this task. However, they all have certain limitations. This study proposes a novel deep neural network model called VSGG-Net for video scene graph generation. The model uses a sliding window scheme to detect object tracklets of various lengths throughout the entire video. In particular, the proposed model presents a new tracklet pair proposal method that evaluates the relatedness of object tracklet pairs using a pretrained neural network and statistical information. To effectively utilize the spatio-temporal context, low-level visual context reasoning is performed using a spatio-temporal context graph and a graph neural network as well as high-level semantic context reasoning. To improve the detection performance for sparse relationships, the proposed model applies a class weighting technique that adjusts the weight of sparse relationships to a higher level. This study demonstrates the positive effect and high performance of the proposed model through experiments using the benchmark dataset VidOR and VidVRD.



“…Besides, they proposed an online association method with a siamese network and obtained the state-of-the-art results by combining these two parts. [18] contributed a large-scale VidOR dataset for VidVRD. On this dataset, [23] utilized language context feature along with spatial-temporal feature for predicate prediction and won the first place at VRU'19 (Video Relation Understanding 2019) grand challenge.…”

Section: Related Work (mentioning)

confidence: 99%

“…We evaluate our method on two datasets: the benchmark ImageNet-VidVRD dataset [19] and the newly released VidOR dataset [18]. ImageNet-VidVRD is the first dataset for VidVRD, which consists of 1,000 videos collected from ILSVRC2016-VID and is split into 800 training videos and 200 test videos.…”

Section: Experiments 4.1 Datasets (mentioning)

confidence: 99%

“…However, such methods unavoidably produces inaccurate prediction and missing detection because of their heavy reliance on the performance of the prediction models. Though these models can be improved over short video segments by considering spatio-temporal context [16,25], they may still suffer from the bias and noise in learning and modeling long-tail data distribution, which is quite common in visual relations [9,18]. Alternatively, we take a different perspective by studying a more robust inference algorithm through multiple hypotheses.…”

Section: Introduction (mentioning)

confidence: 99%


Video Relation Detection via Multiple Hypothesis Association

Shang, Chen, et al. 2020. Proceedings of the 28th ACM International Conference on Multimedia

Self Cite


Video visual relation detection (VidVRD) aims at obtaining not only the trajectories of objects but also the dynamic visual relations between them. It provides abundant information for video understanding and can serve as a bridge between vision and language. Compared with visual relation detection on image, VidVRD requires one more step at last called visual relation association which associates relation segments across time dimension into video relations. This step plays an important role in the task but is less studied. Nevertheless, visual relation association is a difficult task as the association process is easily affected by inaccurate tracklet detection and relation prediction in the former steps. In this paper, we propose a novel relation association method called Multiple Hypothesis Association (MHA). It maintains multiple possible relation hypothesis during the association process in order to tolerate and handle the inaccurate or missing problem in the former steps and generate more accurate video relations. Our experiments on the benchmark datasets (Imagenet-VidVRD and VidOR) show that our method outperforms the state-of-the-art methods.


“…While human perception typically involves inferring the physical attributes about the humans (detection [5,35,43,50], poses [3,4,8,25,28,41], shape [13,20,29,30], gaze [44] etc.), interpreting humans involves reasoning about the finer details relating to human activity [6,24,27,48,49], behaviour [26,34], human-object visual relationship detection [23,33,36,37,39,40], and human-object interactions [23,32,33,36,37,39,40,42]. In this work, we investigate the problem of identifying Human-Object Interactions in videos.…”

Section: Introduction (mentioning)

confidence: 99%

Lighten

Sunkesula, Dabral, Ramakrishnan. 2020. Proceedings of the 28th ACM International Conference on Multimedia


Analyzing the interactions between humans and objects from a video includes identification of the relationships between humans and the objects present in the video. It can be thought of as a specialized version of Visual Relationship Detection, wherein one of the objects must be a human. While traditional methods formulate the problem as inference on a sequence of video segments, we present a hierarchical approach, LIGHTEN, to learn visual features to effectively capture spatio-temporal cues at multiple granularities in a video. Unlike current approaches, LIGHTEN avoids using ground truth data like depth maps or 3D human pose, thus increasing generalization across non-RGBD datasets as well. Furthermore, we achieve the same using only the visual features, instead of the

