RGB-T image analysis technology and application: A survey

Song, Kechen; Zhao, Ying; Huang, Liming; Yan, Yunhui; Meng, Qingcai

doi:10.1016/j.engappai.2023.105919

Cited by 26 publications

(2 citation statements)

References 341 publications


“…In addition to using different sensors such as visible cameras and thermal cameras, special components like beam splitters for spatial alignment and synchronization timers for temporal alignment are required during data acquisition [1]. In recent years, researchers have proposed RGB-T object detection datasets that employ specially designed hardware and preprocessing methods to achieve pixel-level alignment and provide annotations shared between modalities [2][3][4]. Most existing RGB-T image object detectors are built upon this inter-modal alignment [5][6].…”

Section: Introduction (mentioning)

confidence: 99%

Attention guided cross-modal multispectral object detection

Xing, Tao, 2023

Third International Conference on Computer Graphics, Image, and Virtualization (ICCGIV 2023)


The spatial offset between visible and thermal images makes it difficult to match objects accurately across the two modalities, which degrades the performance of object detection algorithms. To address this, we propose a feature alignment-based algorithm for unaligned RGB-T image object detection. Our approach adopts a coarse-to-fine feature alignment strategy that incorporates an attention-guided feature offset prediction module. A multi-head self-attention mechanism is introduced to predict and correct the feature map offsets for visible images during the feature extraction stage. To further correct the offset between RGB-T features, a region of interest alignment module performs quadratic regression for each candidate frame in the pooling stage. Furthermore, our algorithm introduces a light-aware weighting module that adaptively adjusts the contributions of the different modalities by reweighting the region of interest features. Experimental results on the FLIR ADAS dataset demonstrate that the proposed method achieves high accuracy and stability.
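
The light-aware reweighting idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' module: the function names, the use of mean image brightness as the illumination cue, and the fixed logistic mapping (with hypothetical `midpoint` and `sharpness` parameters) are all assumptions made for illustration. In a real detector the weight would most likely be predicted by a learned sub-network.

```python
import math


def illumination_weight(brightness, midpoint=0.5, sharpness=10.0):
    """Map a mean brightness in [0, 1] to a weight for the visible modality.

    Hypothetical logistic mapping: bright scenes give a weight near 1,
    dark scenes give a weight near 0.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (brightness - midpoint)))


def reweight_roi_features(rgb_feat, thermal_feat, brightness):
    """Blend per-ROI feature vectors using the illumination weight.

    rgb_feat and thermal_feat are equal-length lists of floats standing in
    for pooled region-of-interest features from each modality.
    """
    w = illumination_weight(brightness)
    return [w * r + (1.0 - w) * t for r, t in zip(rgb_feat, thermal_feat)]


# In a dark scene the fused features lean toward the thermal branch.
fused_dark = reweight_roi_features([1.0, 2.0], [3.0, 4.0], brightness=0.0)
```

A convex combination keeps the fused features on the same scale as the inputs, which is one simple way to let the thermal branch dominate at night without changing downstream layers.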


“…Fully supervised semantic segmentation works [13, 14, 15, 16, 17] based on deep learning have achieved satisfactory results, even if they only use single-modal image as their model’s input. Despite the structure of deep networks being effective, it still has some limitations that the model needs a large number of annotated examples.…”

Section: Introduction (mentioning)

confidence: 99%

Self-Enhanced Mixed Attention Network for Three-Modal Images Few-Shot Semantic Segmentation

Song, Zhang, Bao et al. 2023

Sensors

Self Cite


As an important computer vision technique, image segmentation has been widely used in various tasks. However, in some extreme cases, insufficient illumination can severely degrade model performance, so more and more fully supervised methods use multi-modal images as input. Large, densely annotated datasets are difficult to obtain, but few-shot methods can still achieve satisfactory results with only a few pixel-annotated samples. Therefore, we propose a Visible-Depth-Thermal (three-modal) few-shot semantic segmentation method. It exploits both the homogeneous information shared by the three modalities and the complementary information of the different modalities, which can improve the performance of few-shot segmentation tasks. We constructed a novel indoor dataset, VDT-2048-5i, for the three-modal few-shot semantic segmentation task. We also propose a Self-Enhanced Mixed Attention Network (SEMANet), which consists of a Self-Enhanced (SE) module and a Mixed Attention (MA) module. The SE module amplifies the difference between the different kinds of features and strengthens the weak connection for the foreground features. The MA module fuses the three-modal features to obtain a better feature representation. Compared with the previous most advanced methods, our model improves mIoU by 3.8% and 3.3% in the 1-shot and 5-shot settings, respectively, achieving state-of-the-art performance. In future work, we will address failure cases by obtaining more discriminative and robust feature representations, and explore achieving high performance with fewer parameters and lower computational cost.
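
For readers unfamiliar with the mIoU metric quoted above, the following is a minimal sketch of the general definition: per-class intersection over union, averaged over classes. The function name `mean_iou` and the flat label-list input are assumptions for illustration. The abstract does not state the exact evaluation protocol used in the few-shot setting, so this should not be read as the paper's procedure.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes.

    pred and target are equal-length sequences of integer class labels,
    one per pixel. Classes absent from both pred and target are skipped
    so they do not distort the average.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Four pixels, two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```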
