Geometry-aware Relational Exemplar Attention for Dense Captioning

Wang, Tzu-Jui Julius; Tavakoli, Hamed R.; Sjöberg, Mats; Laaksonen, Jorma

doi:10.1145/3347450.3357656

Cited by 2 publications

(1 citation statement)

References 18 publications

“…At present, with the development of computer vision and natural language processing technology, as well as the continually improving computer technology, it is possible to obtain useful information about multiple objects from images automatically. In particular, dense captioning, a technology based on computer vision, is gaining traction in this field [3][4][5][6][7][8]. Dense captioning is a subset of image captioning technology [9] that understands the characteristics of objects, their activities, and relationships and expresses them in natural language.…”

Section: Introduction

mentioning

confidence: 99%

Automatic Defect Description of Railway Track Line Image Based on Dense Captioning

Wei

Jia

2022

Sensors

State monitoring of the railway track line is one of the important tasks for ensuring the safety of the railway transportation system, and the defect recognition result, that is, the inspection report, is the main basis for maintenance decisions. Most previous work has proposed intelligent detection methods for rapid and accurate inspection of the safety state of the railway track line, but few studies have investigated the automatic generation of inspection reports. Inspired by recent advances in dense captioning, such technologies can be used to generate textual information on the type, position, status, and interrelationship of key components from field images. To this end, building on DenseCap, a railway track line image captioning model (RTLCap for short) is proposed, which replaces VGG16 with ResNet-50-FPN as the backbone to extract more powerful image features. In addition, to address object occlusion and category imbalance in the field images, Soft-NMS and Focal Loss are applied in RTLCap to improve defect description performance. To improve the image processing speed of RTLCap and reduce model complexity, a reconstructed RTLCap model named Faster RTLCap is then presented with the help of YOLOv3. In the encoder, a multi-level regional feature localization, mapping, and fusion module (MFLMF) is proposed to extract regional features, and an SPP (Spatial Pyramid Pooling) layer is employed after MFLMF to reduce model parameters. In the decoder, a stacked LSTM is adopted as the language model for better language representation learning. Both quantitative and qualitative experimental results demonstrate the effectiveness of the proposed methods.