Traffic Accident Detection Using Background Subtraction and CNN Encoder–Transformer Decoder in Video Frames

Zhang, Yihang; Sung, Yunsick

doi:10.3390/math11132884

Cited by 5 publications (2 citation statements)

References 32 publications

“…In this context, the probability of death occurring on the road is higher in secondary traffic accidents than in primary ones, making it crucial to quickly identify accidents on the road to prevent subsequent secondary accidents. Consequently, in the field of artificial intelligence, technologies are being actively developed to quickly detect traffic accidents or accurately classify types of accidents [2][3][4][5][6].…”

Section: Introduction (mentioning)

confidence: 99%

Cross-Modality Interaction-Based Traffic Accident Classification

Oh, Ban. 2024. Applied Sciences.

Traffic accidents on the road lead to serious personal and material damage. Furthermore, preventing secondary accidents caused by traffic accidents is crucial. As various technologies for detecting traffic accidents in videos using deep learning are being researched, this paper proposes a method to classify accident videos based on a video highlight detection network. To utilize video highlight detection for traffic accident classification, we generate information using the existing traffic accident videos. Moreover, we introduce the Car Crash Highlights Dataset (CCHD). This dataset contains a variety of weather conditions, such as snow, rain, and clear skies, as well as multiple types of traffic accidents. We compare and analyze the performance of various video highlight detection networks in traffic accident detection, thereby presenting an efficient video feature extraction method according to the accident and the optimal video highlight detection network. For the first time, we have applied video highlight detection networks to the task of traffic accident classification. In the task, the most superior video highlight detection network achieves a classification performance of up to 79.26% when using video, audio, and text as inputs, compared to using video and text alone. Moreover, we elaborated the analysis of our approach in the aspects of cross-modality interaction, self-attention and cross-attention, feature extraction, and negative loss.

“…In this paper, we propose a new arbitrary timestep video frame interpolation (ATVFI) neural network model with interpolation time-decoding. Generally, our method is built on an encoder-decoder framework [13]. The decoder part of our model takes the interpolation timestep t as an extra input, indicating the relative time coordinate of the desired output frame with regard to input frames.…”

Section: Introduction (mentioning)

confidence: 99%

Arbitrary Timestep Video Frame Interpolation with Time-Dependent Decoding

Zhang, Ren, Yan et al. 2024. Mathematics.

Given an observed low frame rate video, video frame interpolation (VFI) aims to generate a high frame rate video, which has smooth video frames with higher frames per second (FPS). Most existing VFI methods often focus on generating one frame at a specific timestep, e.g., 0.5, between every two frames, thus lacking the flexibility to increase the video’s FPS by an arbitrary scale, e.g., 3. To better address this issue, in this paper, we propose an arbitrary timestep video frame interpolation (ATVFI) network with time-dependent decoding. Generally, the proposed ATVFI is an encoder–decoder architecture, where the interpolation timestep is an extra input added to the decoder network; this enables ATVFI to interpolate frames at arbitrary timesteps between input frames and to increase the video’s FPS at any given scale. Moreover, we propose a data augmentation method, i.e., multi-width window sampling, where video frames can be split into training samples with multiple window widths, to better leverage training frames for arbitrary timestep interpolation. Extensive experiments were conducted to demonstrate the superiority of our model over existing baseline models on several testing datasets. Specifically, our model trained on the GoPro training set achieved 32.50 on the PSNR metric on the commonly used Vimeo90k testing set.
