“…In addition to the presented taxonomy, studies on end-to-end navigation also address input representation and model design. These aspects include the number of cameras (e.g., single- or multi-camera setups) [15,16,17]; methods for 3D data representation (e.g., point clouds or bird's-eye-view images) [15,16,18,20]; sensor fusion and multimodality (e.g., different sensors and feature fusion methods) [19,20,21]; interaction with traffic agents (e.g., interaction graphs or grid maps) [15,20]; deep learning technologies (e.g., transformers, graph neural networks, deep reinforcement learning, attention mechanisms, and generative models) [15,16,20,21]; decision-making within the network (e.g., high-level command input or inference) [17,18]; and the accuracy or feasibility of the output (e.g., using standard controllers to estimate final outputs or filtering the output of the deep learning model) [15,18,21].…”