Explaining Autonomous Driving by Learning End-to-End Visual Attention

Cultrera, Luca; Seidenari, Lorenzo; Becattini, Federico; Pala, Pietro; Bimbo, Alberto Del

doi:10.1109/cvprw50498.2020.00178

Cited by 48 publications

(30 citation statements)

References 37 publications

“…Therefore, an attention mechanism can be useful in paying more attention to important vehicles and critical parts of the map in the decision-making problem. Attention mechanism can be used with ConvNets to improve explainability and interpretability of end-to-end deep neural networks [10], [11]. A multi-task attention-aware network with a ConvNet backbone was proposed by Ishihara et al [12] to learn a driving policy via conditional imitation learning.…”

Section: Related Work (mentioning)

confidence: 99%

Vision Transformer for Learning Driving Policies in Complex Multi-Agent Environments

Kargar, Kyrki

2021

Preprint

Driving in a complex urban environment is a difficult task that requires a complex decision policy. In order to make informed decisions, one needs to gain an understanding of the long-range context and the importance of other vehicles. In this work, we propose to use Vision Transformer (ViT) to learn a driving policy in urban settings with bird's-eye-view (BEV) input images. The ViT network learns the global context of the scene more effectively than with earlier proposed Convolutional Neural Networks (ConvNets). Furthermore, ViT's attention mechanism helps to learn an attention map for the scene which allows the ego car to determine which surrounding cars are important to its next decision. We demonstrate that a DQN agent with a ViT backbone outperforms baseline algorithms with ConvNet backbones pre-trained in various ways. In particular, the proposed method helps reinforcement learning algorithms to learn faster, with increased performance and less data than baselines.

“…Kim et al [22] adopted an attention-based method to filter out non-salient image regions to display only regions that causally affect the steering control of a stand-alone vehicle. Similarly, [23] also used an attention model to visualize the perception of deep networks for autonomous driving. Saliency has also been employed to explain AI models for navigation [24], lane change detection [25], and driving behavior reasoning (e.g.…”

Section: XAI for Autonomous Driving (mentioning)

confidence: 99%

“…XAI has been receiving growing attention in autonomous driving. Attempts have been made to explain the functions of various AI models for autonomous driving [21,22,23,24,25,26]. Yet, studies on XAI for AI-powered accident anticipation do not catch the accelerating pace of accident anticipation research.…”

Section: Introduction (mentioning)

confidence: 99%

Towards explainable artificial intelligence (XAI) for early anticipation of traffic accidents

Karim, Qin

2021

Preprint

Traffic accident anticipation is a vital function of Automated Driving Systems (ADSs) for providing a safety-guaranteed driving experience. An accident anticipation model aims to predict accidents promptly and accurately before they occur. Existing Artificial Intelligence (AI) models of accident anticipation lack a human-interpretable explanation of their decision-making. Although these models perform well, they remain a black box to ADS users, which makes it difficult to earn their trust. To this end, this paper presents a Gated Recurrent Unit (GRU) network that learns spatio-temporal relational features for the early anticipation of traffic accidents from dashcam video data. A post-hoc attention mechanism named Grad-CAM is integrated into the network to generate saliency maps as the visual explanation of the accident anticipation decision. An eye tracker captures human eye fixation points for generating human attention maps. The explainability of network-generated saliency maps is evaluated in comparison to human attention maps. Qualitative and quantitative results on a public crash dataset confirm that the proposed explainable network can anticipate an accident on average 4.57 seconds before it occurs, with 94.02% average precision. Furthermore, various post-hoc attention-based XAI methods are evaluated and compared. The comparison confirms that the Grad-CAM chosen by this study can generate high-quality, human-interpretable saliency maps (with 1.42 Normalized Scanpath Saliency) for explaining the crash anticipation decision. Importantly, results confirm that the proposed AI model, with a human-inspired design, can outperform humans in accident anticipation.

“…In a study [7] conducted by using an open-source driving simulator CARLA [10], it was reported that the driving performance of the imitation learning agent considerably drops under those conditions such as untrained urban area, weather conditions, and traffic congestion. Secondly, it is important to know how well a network perceives visual inputs for such a safety-critical application of autonomous driving, but only a few studies addressed this issue [8,21,24].…”

Section: Introduction (mentioning)

confidence: 99%

Multi-task Learning with Attention for End-to-end Autonomous Driving

Ishihara, Kanervisto, Miura et al.

2021

Preprint

Autonomous driving systems need to handle complex scenarios such as lane following, avoiding collisions, taking turns, and responding to traffic signals. In recent years, approaches based on end-to-end behavioral cloning have demonstrated remarkable performance in point-to-point navigational scenarios, using a realistic simulator and standard benchmarks. Offline imitation learning is readily available, as it does not require expensive hand annotation or interaction with the target environment, but it is difficult to obtain a reliable system. In addition, existing methods have not specifically addressed the learning of reaction for traffic lights, which are a rare occurrence in the training datasets. Inspired by the previous work on multi-task learning and attention modeling, we propose a novel multi-task attention-aware network in the conditional imitation learning (CIL) framework. This does not only improve the success rate of standard benchmarks, but also the ability to react to traffic lights, which we show with standard benchmarks.
