2021
DOI: 10.48550/arxiv.2107.08031
Preprint

Is attention to bounding boxes all you need for pedestrian action prediction?

Abstract: The human driver is no longer the only one concerned with the complexity of the driving scenarios. Autonomous vehicles (AV) are similarly becoming involved in the process. Nowadays, the development of AV in urban places underpins essential safety concerns for vulnerable road users (VRUs) such as pedestrians. Therefore, to make the roads safer, it is critical to classify and predict their future behavior. In this paper, we present a framework based on multiple variations of the Transformer models to reason atte…

citations
Cited by 2 publications
(3 citation statements)
references
References 27 publications
“…More successful approaches were designed to take into account temporal coherence in short-term motions of visual features of the pedestrians by using ConvLSTMs [10], [11], 3D Convolutions [12], [13], [14], or Spatio-Temporal DenseNet [15]. Approaches trying to minimize the inference time of their models by avoiding the usage of RGB images were explored: [16] proposes a transformer using only spatial positioning of the pedestrian based on 2D bounding box locations. Crossing prediction based on kinematics only was also explored with various available learning architectures to monitor temporal evolution of skeletal joints such as convolutions [17], [18], [19], recurrent cells [20], [21] or graph-based models [22].…”
Section: A Pedestrian Crossing Prediction
mentioning
confidence: 99%
“…In its first year of existence, proposed approaches evaluated on the benchmarks [1] constantly report higher classification scores [19], [16], [26], [29], [30], [31], giving the impression of clear improvements in pedestrian intention prediction. Usually, a new algorithm is proposed and the implicit hypothesis towards the proposed contribution is made such that it yields an improved performance over the existing state-of-the-art.…”
Section: B Cross-dataset Evaluation
mentioning
confidence: 99%
“…Saleh et al employed a spatio-temporal densenet for classification based on sequences of pedestrian bounding boxes (14). Achaji et al incorporated a transformer for classifying pedestrian intentions based on bounding box features (15).…”
mentioning
confidence: 99%