Pei, Dashuai and Wu, Yiwen and He, Jianhua and Liu, Kezhong and Chen, Mozi and Xiao, Xuedou and Zhang, Shengkai and Zheng, Jiawei (2025) Methodology and Benchmark for Automated Driving Theory Test of Large Language Models. IEEE Transactions on Intelligent Transportation Systems. (In Press)
Abstract
Large Language Models (LLMs), with their strong generalization and inference capabilities, have been increasingly leveraged to address the challenges of handling corner cases in autonomous driving (AD). However, a critical issue remains unresolved: the lack of a comprehensive understanding and formal assessment of LLMs' driving theory knowledge and practical skills. To address this issue, we propose the first dedicated driving theory test framework and benchmark for LLMs. This is a crucial yet unexplored area in the literature, particularly for safety-critical applications in autonomous driving and driver assistance. Our framework systematically evaluates LLMs' competence in driving theory and hazard perception, akin to the official UK driving theory test, ensuring their qualification for critical driving-related tasks. To facilitate rigorous benchmarking, we construct a comprehensive dataset comprising over 700 multiple-choice questions (MCQs) and 54 hazard perception video tests sourced from the official UK driving theory examination. Additionally, we incorporate two standardized MCQ sets from the UK's Driver and Vehicle Standards Agency (DVSA). For these two types of theory test items, we design tailored assessment methodologies and evaluation metrics, including accuracy, recall, precision, F1-score, real-time performance, and computational efficiency. The experimental results reveal that among all LLMs tested, only GPT-4o achieved an accuracy of 88.21% on the MCQ test, successfully passing this component. However, in the hazard perception test, none of the evaluated models met the passing criteria under the given settings, highlighting the substantial improvements required before these models can be practically deployed for real-world driving applications. Our key insight is that the specific test questions LLMs fail to answer correctly directly reflect their deficiencies in understanding and flexibly applying traffic regulations, as well as in analyzing and responding to complex driving scenarios. This provides clear directions for future improvement.
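As a minimal illustration of the MCQ evaluation metrics named in the abstract (accuracy and macro-averaged precision, recall, and F1-score), the sketch below scores a list of model answers against a key. This is not the authors' code; the function name `mcq_metrics`, the option set, and the per-option macro-averaging are assumptions made for illustration.

```python
# Hypothetical sketch of MCQ grading with the metrics named in the abstract.
# Not the paper's implementation; option labels and averaging are assumed.

def mcq_metrics(predicted, correct, options=("A", "B", "C", "D")):
    """Overall accuracy plus macro-averaged precision/recall/F1 over options."""
    assert len(predicted) == len(correct)
    accuracy = sum(p == c for p, c in zip(predicted, correct)) / len(correct)

    precisions, recalls, f1s = [], [], []
    for opt in options:
        # Treat each option as a one-vs-rest classification problem.
        tp = sum(p == opt and c == opt for p, c in zip(predicted, correct))
        fp = sum(p == opt and c != opt for p, c in zip(predicted, correct))
        fn = sum(p != opt and c == opt for p, c in zip(predicted, correct))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(options)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# A model would pass the MCQ component if its accuracy clears the official
# pass mark; the paper reports GPT-4o at 88.21%.
print(mcq_metrics(["A", "B", "C", "A"], ["A", "B", "D", "A"]))
```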
Item Type: Article
Uncontrolled Keywords: Autonomous driving, large language model, driving theory test, hazard perception test, remote driving, mobile computing
Subjects: Z Bibliography. Library Science. Information Resources > ZR Rights Retention
Divisions: Faculty of Science and Health; Faculty of Science and Health > Computer Science and Electronic Engineering, School of
SWORD Depositor: Unnamed user with email elements@essex.ac.uk
Depositing User: Unnamed user with email elements@essex.ac.uk
Date Deposited: 23 May 2025 14:08
Last Modified: 23 May 2025 14:09
URI: http://repository.essex.ac.uk/id/eprint/40958
Available files
Filename: ITITS2024-LLM driving test-FINAL VERSION.pdf
Licence: Creative Commons: Attribution 4.0