Visual Person Understanding Through Multi-task and Multi-dataset Learning

Pfeiffer, Kilian; Hermans, Alexander; Sárándi, István; Weber, Mark; Leibe, Bastian

doi:10.1007/978-3-030-33676-9_39

Cited by 7 publications

(5 citation statements)

References 44 publications

“…Multi-dataset learning, which aims to learn a universal model from multiple datasets, has received increasing attention in various computer vision tasks, including depth estimation [24]-[26], stereo matching [27], [28], pedestrian detection [29], [30], semantic segmentation [31], [32], and object detection [33]-[37]. In this subsection, we mainly review multi-dataset object detection.…”

Section: B Multi-dataset Object Detection (mentioning)

confidence: 99%

Dual-Mode Learning for Multi-Dataset X-Ray Security Image Detection

Yang, Jiang, Yan et al. 2024

IEEE Trans. Inform. Forensic Secur.

“…Additionally, the gradient updates for the main block may not be representative of all the tasks in each training step, affecting the statistics in the batch normalization layers [89]. To alleviate this issue, [90] proposed training on interleaved minibatches per dataset and the use of group normalization [91] to facilitate network convergence. The main difference in our approach is that we create mixed batches that enable the network to grasp information across datasets on every training iteration.…”

Section: Multi-dataset Training (mentioning)

confidence: 99%

Multi-Dataset, Multitask Learning of Egocentric Vision Tasks

Kapidis, Poppe, Veltkamp 2023

IEEE Trans. Pattern Anal. Mach. Intell.

For egocentric vision tasks such as action recognition, there is a relative scarcity of labeled data. This increases the risk of overfitting during training. In this paper, we address this issue by introducing a multitask learning scheme that employs related tasks as well as related datasets in the training process. Related tasks are indicative of the performed action, such as the presence of objects and the position of the hands. By including related tasks as additional outputs to be optimized, action recognition performance typically increases because the network focuses on relevant aspects in the video. Still, the training data is limited to a single dataset because the set of action labels usually differs across datasets. To mitigate this issue, we extend the multitask paradigm to include datasets with different label sets. During training, we effectively mix batches with samples from multiple datasets. Our experiments on egocentric action recognition in the EPIC-Kitchens, EGTEA Gaze+, ADL and Charades-EGO datasets demonstrate the improvements of our approach over single-dataset baselines. On EGTEA we surpass the current state-of-the-art by 2.47%. We further illustrate the cross-dataset task correlations that emerge automatically with our novel training scheme.

“…Multi-task Models. Multi-task learning has a long history [11] with several architectures and training strategies [24,36,38,52,60,77]. Earlier approaches mostly consist of a shared backbone with fixed task-specific heads, whereas we design a more general architecture for video segmentation with task-specific targets to specify what to segment.…”

Section: Related Work (mentioning)

confidence: 99%

TarViS: A Unified Approach for Target-based Video Segmentation

Athar, Hermans, Luiten et al. 2023

Preprint

The general domain of video segmentation is currently fragmented into different tasks spanning multiple benchmarks. Despite rapid progress in the state-of-the-art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks. Inspired by recent approaches with multi-task capability, we propose TarViS: a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined 'targets' in video. Our approach is flexible with respect to how tasks define these targets, since it models the latter as abstract 'queries' which are then used to predict pixel-precise target masks. A single TarViS model can be trained jointly on a collection of datasets spanning different tasks, and can hot-swap between tasks during inference without any task-specific retraining. To demonstrate its effectiveness, we apply TarViS to four different tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS) and Point Exemplar-guided Tracking (PET). Our unified, jointly trained model achieves state-of-the-art performance on 5/7 benchmarks spanning these four tasks, and competitive performance on the remaining two. Code will be made public upon acceptance.
