Does End-to-End Autonomous Driving Really Need Perception Tasks?

Li, Peidong; Cui, Dixiao

Abstract:End-to-End Autonomous Driving (E2EAD) methods typically rely on supervised perception tasks to extract explicit scene information (e.g., objects, maps). This reliance necessitates expensive annotations and constrains deployment and data scalability in real-time applications. In this paper, we introduce SSR, a novel framework that utilizes only 16 navigation-guided tokens as Sparse Scene Representation, efficiently extracting crucial scene information for E2EAD. Our method eliminates the need for supervised sub-tasks, allowing computational resources to concentrate on essential elements directly related to navigation intent. We further introduce a temporal enhancement module that employs a Bird's-Eye View (BEV) world model, aligning predicted future scenes with actual future scenes through self-supervision. SSR achieves state-of-the-art planning performance on the nuScenes dataset, demonstrating a 27.2\% relative reduction in L2 error and a 51.6\% decrease in collision rate to the leading E2EAD method, UniAD. Moreover, SSR offers a 10.9$\times$ faster inference speed and 13$\times$ faster training time. This framework represents a significant leap in real-time autonomous driving systems and paves the way for future scalable deployment. Code will be released at \url{this https URL}.

Comments:	Technical Report
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2409.18341 [cs.CV]
	(or arXiv:2409.18341v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2409.18341

Computer Science > Computer Vision and Pattern Recognition

Title:Does End-to-End Autonomous Driving Really Need Perception Tasks?

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators