MeT: A graph transformer for semantic segmentation of 3D meshes

Vecchio, Giuseppe; Prezzavento, Luca; Pino, Carmelo; Rundo, Francesco; Palazzo, Simone; Spampinato, Concetto

doi:10.1016/j.cviu.2023.103773

Cited by 4 publications

(2 citation statements)

References 60 publications

“…Transformers, originally stemming from NLP tasks [8], utilize a self-attention (SA) mechanism to capture dependencies among elements. Due to the strong capability of sequence modeling, hybrid transformers have been successfully applied to many vision tasks [9,10,13], including image classification, object detection, and semantic segmentation. However, few works [7,14] have applied action segmentation, as it is limited by its huge computational costs.…”

Section: Study On Efficient Transformers (mentioning)

confidence: 99%

“…Transformers, originally stemming from natural language processing (NLP) tasks [8], have obtained various state-of-the-art performances for many vision tasks, including image classification [9], object detection [10][11][12], and semantic segmentation [13]. ASFormer [7] is the first transformer architecture for action segmentation and explicitly introduced local connectivity inductive and hierarchical representation to rebuild the transformer, obtaining impressive improvement, and whose self-attention mechanism plays a big role in hugely improving performance.…”

Section: Introduction (mentioning)

confidence: 99%

LASFormer: Light Transformer for Action Segmentation with Receptive Field-Guided Distillation and Action Relation Encoding

Ma,

2023

Mathematics

Transformer-based models for action segmentation have achieved high frame-wise accuracy against challenging benchmarks. However, they rely on multiple decoders and self-attention blocks for informative representations, whose huge computing and memory costs remain an obstacle to handling long video sequences and practical deployment. To address these issues, we design a light transformer model for the action segmentation task, named LASFormer, with a novel encoder–decoder structure based on three key designs. First, we propose a receptive field-guided distillation to realize mode reduction, which can overcome more generally the gap in semantic feature structure between the intermediate features by aggregated temporal dilation convolution (ATDC). Second, we propose a simplified implicit attention to replace self-attention to avoid its quadratic complexity. Third, we design an efficient action relation encoding module embedded after the decoder, where the temporal graph reasoning introduces an inductive bias that adjacent frames are more likely to belong to the same class of model global temporal relations, and the cross-model fusion structure integrates frame-level and segment-level temporal clues, which can avoid over-segmentation independent of multiple decoders, thus reducing further computational complexity. Extensive experiments have verified the effectiveness and efficiency of the framework. Against the challenging 50Salads, GTEA, and Breakfast benchmarks, LASFormer significantly outperforms the current state-of-the-art methods in accuracy, edit score, and F1 score.

Section: Study On Efficient Transformers (mentioning)

confidence: 99%