End-to-End Referring Video Object Segmentation with Multimodal Transformers

Botach, Adam; Zheltonozhskii, Evgenii; Baskin, Chaim

doi:10.48550/arxiv.2111.14821

Cited by 2 publications

(11 citation statements)

References 30 publications

“…ClawCraneNet [16] leverages cross-modal attention to bridge the semantic correlation between textual and visual modalities. ReferFormer [41] and MTTR [1] are two latest works that utilize transformers to decode or fuse multimodal features. ReferFormer [41] employs a linguistic prior to the transformer decoder to focus on the referred object.…”

Section: Related Work (mentioning)

confidence: 99%

“…ReferFormer [41] employs a linguistic prior to the transformer decoder to focus on the referred object. MTTR [1] leverages a multimodal transformer encoder to fuse linguistic and visual features. Different from other vision-language tasks, e.g., image-text retrieval [20,22,29] and video question answering [12,37], R-VOS needs to construct object-level multimodal semantic consensus in a dense visual representation.…”

Section: Related Work (mentioning)

confidence: 99%

“…Visual encoder. Following previous methods [1,41,40], we build the visual encoder with a visual backbone and a deformable transformer encoder [46] on top of it. The extracted features from the backbone are flattened, projected to a lower dimension, added with positional encoding [9], and then fed into a deformable transformer encoder [46] similar to the previous method [41].…”

Section: Single-modal Feature Extraction (mentioning)

confidence: 99%

“…Previous works [1,41] tackle the R-VOS problem with a strong assumption that the referred object exists in the video, i.e., there is an object-level semantic consensus between the expression and the video. However, this assumption does not always hold in practice.…”

Section: Introduction (mentioning)

confidence: 99%

“…Even when semantic consensus exists in the given video-language pairs, it is still challenging to locate the correct object in the video due to the multimodal nature of the R-VOS task. Recently, MTTR [1] employs a multimodal transformer encoder to learn a joint representation of the linguistic expression and video, and then obtains the referred object by ranking all objects in the video. ReferFormer [41] follows the image-level method, ReTR [14], to adopt the linguistic expression as a query to the transformer decoder to avoid redundant ranking of all objects.…”

Section: Introduction (mentioning)

confidence: 99%

R^2VOS: Robust Referring Video Object Segmentation via Relational Multimodal Cycle Consistency

Li, Wang, Xu, et al. 2022

Preprint

Referring video object segmentation (R-VOS) aims to segment the object masks in a video given a referring linguistic expression to the object. It is a recently introduced task attracting growing research attention. However, all existing works make a strong assumption: The object depicted by the expression must exist in the video, namely, the expression and video must have an object-level semantic consensus. This is often violated in real-world applications where an expression can be queried to false videos, and existing methods always fail in such false queries due to abusing the assumption. In this work, we emphasize that studying semantic consensus is necessary to improve the robustness of R-VOS. Accordingly, we pose an extended task from R-VOS without the semantic consensus assumption, named Robust R-VOS (R^2-VOS). The R^2-VOS task is essentially related to the joint modeling of the primary R-VOS task and its dual problem (text reconstruction). We embrace the observation that the embedding spaces have relational consistency through the cycle of text-video-text transformation, which connects the primary and dual problems. We leverage the cycle consistency to discriminate the semantic consensus, thus advancing the primary task. Parallel optimization of the primary and dual problems is enabled by introducing an early grounding medium. A new evaluation dataset, R^2-Youtube-VOS, is collected to measure the robustness of R-VOS models against unpaired videos and expressions. Extensive experiments demonstrate that our method not only identifies negative pairs of unrelated expressions and videos, but also improves the segmentation accuracy for positive pairs with a superior disambiguating ability. Our model achieves the state-of-the-art performance on Ref-DAVIS17, Ref-Youtube-VOS, and the novel R^2-Youtube-VOS dataset.

CRFormer: A Cross-Region Transformer for Shadow Removal

Jin, Yin, Wu, et al. 2022

Preprint

Aiming to restore the original intensity of shadow regions in an image and make them compatible with the remaining non-shadow regions without a trace, shadow removal is a very challenging problem that benefits many downstream image/video-related tasks. Recently, transformers have shown their strong capability in various applications by capturing global pixel interactions and this capability is highly desirable in shadow removal. However, applying transformers to promote shadow removal is nontrivial for the following two reasons: 1) The patchify operation is not suitable for shadow removal due to irregular shadow shapes; 2) shadow removal only needs one-way interaction from the non-shadow region to the shadow region instead of the common two-way interactions among all pixels in the image. In this paper, we propose a novel cross-region transformer, namely CRFormer, for shadow removal which differs from existing transformers by only considering the pixel interactions from the non-shadow region to the shadow region without splitting images into patches. This is achieved by a carefully designed region-aware cross-attention operation that can aggregate the recovered shadow region features conditioned on the non-shadow region features. Extensive experiments on ISTD, AISTD, SRD, and Video Shadow Removal datasets demonstrate the superiority of our method compared to other state-of-the-art methods.
