Progression Cognition Reinforcement Learning With Prioritized Experience for Multi-Vehicle Pursuit

Multi-vehicle pursuit (MVP) such as autonomous police vehicles pursuing suspects is important but very challenging due to its mission and safety-critical nature. While multi-agent reinforcement learning (MARL) algorithms have been proposed for MVP in structured grid-pattern roads, the existing algorithms use random training samples in centralized learning, which leads to homogeneous agents showing low collaboration performance. For the more challenging problem of pursuing multiple evaders, these algorithms typically select a fixed target evader for pursuers without considering dynamic traffic situation, which significantly reduces pursuing success rate. To address the above problems, this paper proposes a Progression Cognition Reinforcement Learning with Prioritized Experience for MVP (PEPCRL-MVP) in urban multi-intersection dynamic traffic scenes. PEPCRL-MVP uses a prioritization network to assess the transitions in the global experience replay buffer according to each MARL agent’s parameters. With the personalized and prioritized experience set selected via the prioritization network, diversity is introduced to the MARL learning process, which can improve collaboration and task-related performance. Furthermore, PEPCRL-MVP employs an attention module to extract critical features from dynamic urban traffic environments. These features are used to develop a progression cognition method to adaptively group pursuing vehicles. Each group efficiently targets one evading vehicle. Extensive experiments conducted with a simulator over unstructured roads of an urban area show that PEPCRL-MVP is superior to other state-of-the-art methods. Specifically, PEPCRL-MVP improves pursuing efficiency by 3.95% over Twin Delayed Deep Deterministic policy gradient-Decentralized Multi-Agent Pursuit and its success rate is 34.78% higher than that of Multi-Agent Deep Deterministic Policy Gradient. Codes are open-sourced.


Xinhang Li, Graduate Student Member, IEEE, Yiying Yang, Zheng Yuan, Zhe Wang, Graduate Student Member, IEEE, Qinwen Wang, Chen Xu, Member, IEEE, Lei Li, Jianhua He, Senior Member, IEEE, and Lin Zhang*, Member, IEEE
Index Terms-autonomous driving, multi-agent reinforcement learning, multi-vehicle pursuit, prioritized experience

I. INTRODUCTION
EMPOWERED by the self-learning ability of reinforcement learning (RL) and significantly improved environment perception, autonomous driving (AD) [1]-[3] is advancing at a fast pace, with great potential to improve driving safety and traffic efficiency [4]-[6]. Multi-vehicle pursuit (MVP) is a specific application of AD technology, in which multiple autonomous pursuing vehicles chase one or more moving vehicles. MVP problems have been attracting extensive research attention due to their growing range of applications, including collision avoidance designs in intelligent transportation systems, sport/game strategies, the balance and game between generators and loads in smart grid dispatch, disaster relief strategies, autonomous police vehicles pursuing suspects, and similar confrontation scenarios [7], [8]. MVP tasks are usually mission- and safety-critical. Efficient multi-vehicle collaboration and comprehensive perception of complex and dynamic traffic environments are essential to completing MVP tasks successfully [9].
Cooperative multi-agent reinforcement learning (MARL) has been widely studied for multi-agent collaboration and connected-automated vehicles (CAVs), and could be applied to MVP applications [10]-[12]. Many MARL-based cooperative control schemes for CAVs have been proposed. Guan et al. presented a centralized coordination framework [13] for autonomous vehicles at intersections without traffic signals, which significantly improved road efficiency. [14] and [15] studied distributed cooperation methods to realize conflict-free control of CAVs. [16] and [17] implemented a multi-agent systems-based hierarchical controller to improve vertical and horizontal cooperation among automated vehicles. It is noted that all the above MARL algorithms for CAVs were designed to improve driving safety. However, MVP tasks have additional mission-critical requirements and demand strong collaboration and adaptation to dynamic environments, which present significant new challenges to the design of MARL algorithms.
In the literature, a few game theory-based and other classical methods for MVP have been proposed [18]. Huang et al. presented a decentralized control scheme [19] based on the Voronoi partition of the game domain. Pan et al. designed a region-based relay pursuit scheme [7] for the pursuers to capture a single evader. A policy iteration method based on a continuous-time Markov decision process [20] was proposed to optimize the pursuer strategy. [21] employed a graph-theoretic approach to study the interactions of the agents and obtain distributed control policies for pursuers. [9] and [22] introduced curriculum RL to train pursuers to approach the evader. To improve pursuing efficiency, [23] weighted different evaders to encourage the pursuers to capture the closest evader. In addition, [24] designed cooperative multi-agent schemes with a target prediction network. Yang et al. designed a graded-Q RL framework [25] to enhance the coordination capacity of pursuing vehicles. [26] and [27] adopted MARL to accomplish collaborative pursuit tasks in simplified traffic scenes with structured grid-pattern roads. However, these methods did not consider the dynamic urban pursuit-evasion environment, and their fixed allocation of pursuit tasks greatly reduces pursuit efficiency.
In the existing MARL algorithms for MVP and AD, deep neural network parameters are shared among agents via centralized training with decentralized execution (CTDE), which significantly improves learning efficiency and experience utilization [28]-[30]. Many CTDE-based deep MARL methods achieve state-of-the-art performance on tasks such as group matching games and path finding [31]-[34]. Although CTDE can accelerate training [35], it performs poorly in complex and difficult tasks, such as Google Research Football [36] and MVP. These complex tasks typically require substantial exploration, diversified strategies, and efficient collaboration among agents [37]. But homogeneous agents tend to behave similarly because of parameter sharing, limiting efficient exploration and collaboration of MARL agents. Besides, prioritized experience replay has been the focus of several studies. Previous studies adopted the temporal-difference error as the priority of experience [38]-[41] without considering its variability among multiple agents, and they did not address the problem of multi-agent homogenization.
According to the above analyses, low adaptation to dynamic traffic environments and homogeneous agents severely limit collaborative pursuing performance. To address these problems, this paper proposes progression cognition reinforcement learning with prioritized experience (PEPCRL-MVP) for MVP in urban traffic scenes. A framework of PEPCRL-MVP is shown in Fig. 1. There are two distinct modules in the new PEPCRL-MVP architecture. The first is a prioritization network, which selects a prioritized training set for each agent in MARL to adjust its deep neural network parameters. Optimizing agents with personalized training sets enables each agent to distinguish itself from the others, thereby encouraging efficient collaboration. The proposed prioritization network can also be applied to a wide range of multi-agent systems to improve collaboration. In addition, an attention-based progression cognition module is designed to adaptively group multiple pursuing vehicles based on dynamic traffic awareness. With the above designs, PEPCRL-MVP addresses the problems of low adaptation and homogeneous agents in existing MVP approaches and is expected to greatly improve pursuing performance.
The contributions of this paper can be summarized as follows.
• This paper proposes a progression cognition reinforcement learning approach with prioritized experience for collaborative multi-vehicle pursuit. A novel prioritization network is proposed to diversify the optimization and strategies of MARL, encouraging more efficient collaboration and experience exploration.

The rest of this paper is organized as follows. Section II describes multi-vehicle pursuit in an urban pursuit-evasion scene and models the MVP problem as a partially observable stochastic game (POSG). Section III presents MARL with prioritized experience and its training process. Section IV presents the reinforcement learning-based path planning algorithm with progression cognition. Section V evaluates the performance of the proposed method. Section VI draws conclusions.

II. MULTI-VEHICLE PURSUIT IN DYNAMIC URBAN TRAFFIC
This section first illustrates the complex urban pursuit-evasion environment and its constraints, bridging the 'sim-to-real' gap. Section II-B then formulates the MVP problem based on POSG and introduces the MARL-based solution to POSGs.

A. MVP in Large-Scale Urban Traffic
This paper focuses on the problem of multi-vehicle pursuit in complex urban traffic. We consider a closed large-scale urban traffic scene with a multi-intersection road structure [42]. The considered scene retains the basic settings of urban traffic, such as traffic lights and speed limits. Without loss of generality, we assume there are N pursuing vehicles, M evading vehicles (N > M), B background vehicles, and L lanes. The background vehicles and evading vehicles follow randomly selected routes. For the MVP task, an evading vehicle is deemed captured if any pursuing vehicle is less than a pre-configured distance d_min from its target evading vehicle. If all evading vehicles are captured within a given number of time steps, the pursuit task is successfully completed (Done).
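The capture and termination conditions above can be sketched as a simple check. A minimal sketch, assuming Euclidean distance between 2-D positions and an illustrative d_min of 20 m (the paper measures distance along the road network, so this is a simplification):

```python
import math

D_MIN = 20.0  # pre-configured capture distance d_min (illustrative value, metres)

def is_captured(pursuer_pos, evader_pos, d_min=D_MIN):
    """An evader is deemed captured once some pursuer is within d_min of it."""
    dx = pursuer_pos[0] - evader_pos[0]
    dy = pursuer_pos[1] - evader_pos[1]
    return math.hypot(dx, dy) < d_min

def episode_done(pursuers, evaders, d_min=D_MIN):
    """The MVP task is Done when every evader has at least one pursuer nearby."""
    return all(any(is_captured(p, e, d_min) for p in pursuers) for e in evaders)
```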
In this paper, we have a few constraints for the MVP task.
• All vehicles in the scene (pursuing, evading, and background vehicles) follow traffic rules, such as obeying traffic lights, and driving on the right lanes without collisions.
• All pursuing vehicles and evading vehicles are initialized at different diagonal points of the map with a speed of 0 m/s.
• The maximum speed v_max, maximum acceleration ac_max, and maximum deceleration de_max of all pursuing vehicles and evading vehicles are set to be the same.

B. POSG-Based MVP Problem Formulation
In MVP, the decision-making process of a finite set of agents I deployed in pursuing vehicles with partial observability can be formalized as a POSG, defined as a tuple MG := (I, S, [A^n], [O^n], Tr, [R^n]) for n = 1, ..., N. At time step t, pursuing vehicle n receives a local observation o^n_t : S → O^n that is correlated with the underlying environment state s_t ∈ S. o^n_t is further processed into s^n_t, the state of pursuing vehicle n, which takes an action a^n_t ∈ A^n according to s^n_t. Consequently, the environment evolves to a new state s_{t+1} with transition probability Tr = P(s_{t+1} | s_t, a_t) : S × A^1 × ... × A^N → S, and the agent receives a decentralized reward r^n_t : S × A^n → R. [R^n] denotes the rewards of all agents. The probability distribution of actions at a given state is determined by the stochastic policy π_n. The goal of an optimal policy π*_n is to generate a distribution that maximizes the discounted sum of future rewards over an infinite time horizon, which can be expressed as

π*_n = arg max_{π_n} E[ Σ_{t=0}^{∞} γ^t r^n_t ], (1)

in which γ ∈ [0, 1) is the discount factor, indicating the impact of future earnings on the current expected value. The optimal policy maximizes the state-action value function, i.e., π*_n(s^n_t) = arg max_a Q*_{π_n}(s^n_t, a). According to the Bellman optimality equation, the optimal state-action value function can then be derived as

Q*_{π_n}(s^n_t, a^n_t) = E[ r^n_t + γ max_{a'} Q*_{π_n}(s^n_{t+1}, a') ]. (2)

As an emerging AI algorithm, MARL enables agents in POSGs to learn optimal strategies without the exact state transition probability Tr and reward function R.
MARL provides an excellent solution to MVP in a dynamic and complex environment. Agent n in MARL updates its Q value function according to the temporal-difference error δ by off-policy learning,

δ = r^n_t + γ max_a Q_{π_n}(s^n_{t+1}, a) − Q_{π_n}(s^n_t, a^n_t),
Q_{π_n}(s^n_t, a^n_t) ← Q_{π_n}(s^n_t, a^n_t) + α δ, (3)

where α is the learning rate. For pursuing vehicle n, the function Q_{π_n}(s^n_t, a^n_t) evaluates the expected values of turning left, turning right, and going straight according to the current state s^n_t, helping the vehicle select the optimal route to pursue the evading vehicle.
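The off-policy TD update described above can be illustrated with a minimal tabular sketch. The state labels, reward, and hyperparameter values are illustrative assumptions; the paper actually approximates Q with a DQN:

```python
# Three actions mirroring the paper's route choices.
ACTIONS = ["left", "right", "straight"]

def td_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One off-policy Q-learning step for agent n.

    Q maps (state, action) -> value; missing entries default to 0.
    delta = r + gamma * max_a' Q(s', a') - Q(s, a); Q(s, a) += alpha * delta.
    """
    best_next = max(Q.get((s_next, b), 0.0) for b in ACTIONS)
    delta = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * delta
    return delta
```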

III. MARL WITH PRIORITIZED EXPERIENCE
To introduce diversity among collaborative agents, we design a prioritized experience boosting MARL equipped with a prioritization network. Subsection III-A describes the overall framework and Subsection III-B presents the prioritization network in detail. Finally, the training process is introduced.

A. Prioritized Experience Boosting MARL Framework
Emerging MARL algorithms, such as Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [28] and QMIX [43], adopt centralized training with randomly sampled experience to improve experience utilization and deploy the same trained model to all agents. Homogeneous learning may lead agents to behave similarly. For example, two pursuing vehicles may both choose to chase behind one evading vehicle simultaneously, rather than one chasing and the other intercepting. As a result, a homogeneous learning policy hinders collaboration among agents [44]. Moreover, randomly sampling experience also degrades training efficiency. Therefore, this paper proposes a prioritized experience boosting MARL framework, as shown in Fig. 2, to introduce diversity among agents. It employs a prioritization network PN on a central server to select a personalized training set E^n_per ⊂ G for each agent from the global experience replay buffer G = {E_1, E_2, ..., E_{max_cap}}, where max_cap is the maximum capacity of the global buffer and E_{n′} is the replay experience collected by agent n′ in one epoch. In the prioritized experience boosting MARL framework, all agents upload exploration experience to the central server. Every agent samples prioritized experience for training from the global experience buffer via the prioritization network and updates its parameters. The prioritization network is trained to model the relationship between the training set features and the reward change ∆r after updating parameters, thus assisting the agents in selecting appropriate experience for efficient training and improving decision performance.
The prioritized experience boosting MARL essentially optimizes the gradient descent and parameter update process of RL-based agents. It differs from the conventional approach of simply replaying experiences at the same frequency regardless of their importance. By prioritizing experiences based on their significance, the framework enables agents to train more efficiently with optimal replay transitions and fosters better collaboration among the agents.

B. Prioritization Network and Annealing Priority
The prioritization network assesses the importance of each experience replay transition E_i for every agent. For a given agent n with parameters θ_n, the prioritization network determines the priority and the sampling probability P_n(i) of transition E_i. Thus, the prioritization network is designed to estimate the performance gain of agent n after training with E_i. For RL, rewards directly reflect the performance of an agent. Therefore, we use the prioritization network to fit the reward change ∆r^i_n after training agent n with transition E_i,

∆r̂^i_n = PN(E_i, θ_n; ϑ),

where ϑ denotes the parameters of the prioritization network PN. Meanwhile, to improve the stability of the algorithm, we use the average reward over k historical epochs as the base reward when calculating the reward change ∆r^i_n. We adopt gradient back-propagation to update ϑ. The loss of PN is defined as

L(ϑ) = (1/K) Σ_{i=1}^{K} (∆r^i_n − ∆r̂^i_n)², (7)

where K is the batch size. The prioritization network output ∆r̂^i_n is used for stochastic sampling. The prioritization network computes the gain of each replay transition in the global experience replay buffer according to the current parameters of agent n, giving the sequence {∆r̂^i_n | i = 1, ..., max_cap}. Maximum-minimum normalization is then performed on this sequence,

g^i_n = (∆r̂^i_n − min_j ∆r̂^j_n) / (max_j ∆r̂^j_n − min_j ∆r̂^j_n) + ζ,

where ζ is a small positive constant that prevents the sampling probability from becoming zero. However, reward-prioritized sampling may focus on a small subset of the experience, making the agent prone to over-fitting. Therefore, an annealing priority is proposed to calculate the sampling probabilities,

P_n(i) = (g^i_n)^β / Σ_j (g^j_n)^β,

where the exponent β ∈ [0, 1] determines how much prioritization is introduced, with β = 0 corresponding to uniform sampling. In the early stage of training, uniform sampling is expected to facilitate agent learning and the convergence of the prioritization network. As training proceeds, PN can gradually and reliably compute the value of an experience replay transition and guide the agents' gradient descent. In practice, we linearly anneal β from β_0 to 1 to ensure stable MARL updates and continuous performance improvement.
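The max-min normalization and annealing priority described above can be sketched as follows. The function names and the initial value β_0 = 0.4 are illustrative assumptions; the reward gains would come from the prioritization network:

```python
def sampling_probabilities(reward_gains, beta, zeta=1e-6):
    """Annealed sampling distribution over replay transitions.

    reward_gains: predicted gains for one agent, one value per transition.
    beta = 0 recovers uniform sampling; beta = 1 is full prioritization.
    zeta keeps every probability strictly positive (max-min normalization
    would otherwise assign zero to the worst transition).
    """
    lo, hi = min(reward_gains), max(reward_gains)
    span = (hi - lo) or 1.0          # guard against all-equal gains
    norm = [(g - lo) / span + zeta for g in reward_gains]
    powered = [p ** beta for p in norm]
    total = sum(powered)
    return [p / total for p in powered]

def annealed_beta(epoch, total_epochs, beta0=0.4):
    """Linearly anneal beta from beta0 to 1 over training (beta0 assumed)."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return beta0 + (1.0 - beta0) * frac
```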

C. Training Process of MARL with Prioritized Experience
Typical reinforcement learning utilizes random experience replay to estimate the distribution of policy and states via Monte Carlo sampling. However, prioritized replay inevitably changes this distribution and introduces bias [38], which affects the optimal solution that the estimates converge to. Therefore, this paper adopts importance sampling (IS) to abate the impact of the bias and fully compensate for the non-uniform probabilities,

ω^i_n = (1 / (max_cap · P_n(i)))^λ,

and the weights are normalized by 1/max_j ω^j_n for stability to get ω̄^i_n. It is worth noting that the choice of the hyperparameter λ interacts with β in the annealing priority: increasing both simultaneously encourages more aggressive priority sampling. Considering IS, Eq. (3) can be reformulated as

Q_{π_n}(s^n_t, a^n_t) ← Q_{π_n}(s^n_t, a^n_t) + α ω̄^i_n δ. (11)

Introducing IS into the training process has another benefit. IS can reduce the step size of non-linear function approximation, e.g., deep neural networks. In prioritized experience boosting MARL, experience replay transitions favored by the prioritization network may be revisited many times, and the IS correction reduces the gradient magnitude to ensure that the agents converge to the globally optimal policy.
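The IS correction can be sketched in the same style. Here `probs` stands for the annealed sampling distribution P_n(i) over the buffer and λ is the IS exponent; the normalization by the maximum weight ensures the correction only ever scales gradients down:

```python
def importance_weights(probs, lam):
    """Importance-sampling corrections for non-uniform replay sampling.

    w_i = (1 / (max_cap * P(i)))**lam, then divided by max_j w_j so the
    largest weight is exactly 1 (for stability). At lam = 1 the sampling
    bias is fully compensated; lam = 0 disables the correction.
    """
    max_cap = len(probs)
    raw = [(1.0 / (max_cap * p)) ** lam for p in probs]
    w_max = max(raw)
    return [w / w_max for w in raw]
```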
Distributed training is used in the proposed prioritized experience boosting MARL. The overall training process is shown in Algorithm 1. For every RL-based agent, the prioritization network first evaluates the priority of each experience transition according to the agent's parameters. The annealing priority is then used to select the personalized training set E^n_per, and the agent's parameters are updated with E^n_per via Eq. (11). After the parameters of all the agents have been updated, the prioritized experience boosting MARL is tested, and the gain of each agent's reward is calculated. Finally, the prioritization network is trained via Eq. (7). With this training process, the prioritization network can accurately compute the value of experience replay transitions for each agent. Moreover, by training with personalized and optimal experience, diverse agents can largely improve collaboration and convergence efficiency.

IV. PROGRESSION COGNITION DQN-BASED COOPERATIVE PATH PLANNING
This section introduces progression cognition DQN-based cooperative path planning for pursuing vehicles. Subsection IV-A first presents the attention-based progression cognition module, Subsection IV-B details the multi-vehicle pursuit path planning, and Subsection IV-C describes the decision-making and training process.

A. Attention-Based Progression Cognition Module
In complex urban traffic environments, pursuing vehicles need real-time and accurate sensing of the driving environment and the status of evading vehicles. We propose an attention-based progression cognition module to extract critical traffic features and assist each pursuing vehicle in selecting a suitable evading vehicle as its target. It helps each pursuing vehicle focus on only one evading vehicle and work with other pursuing vehicles in a group to improve pursuit performance. Moreover, allocating pursuing tasks with progression cognition enhances collaboration among pursuing vehicles.
The locations of the pursuing and evading vehicles are very important for the collaboration and decision making of the pursuing vehicles. In this paper, the location of vehicle i is denoted by loc^i_t = (C_l, pos^{i,l}_t), where C_l denotes the binary code of lane l in which vehicle i is located, and pos^{i,l}_t denotes the distance between vehicle i and the start of lane l at time t. The length of loc^i_t is denoted by len_loc. The positions of pursuing and evading vehicles are represented as LOC^P_t = {loc^1_t, loc^2_t, ..., loc^N_t} and LOC^E_t = {loc^1_t, loc^2_t, ..., loc^M_t}, respectively. Moreover, an adjacency matrix RT is used to represent the road topology. Assuming there are L lanes in the pursuit-evasion environment, RT has size L × L, and the element e_{i,j} in row i and column j indicates whether vehicles can drive directly from lane i to lane j.
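As an illustration of the road-topology matrix, here is a toy RT for L = 3 lanes; the specific topology is made up for the example:

```python
# RT[i][j] = 1 iff vehicles can drive directly from lane i to lane j.
RT = [
    [0, 1, 0],  # lane 0 feeds lane 1
    [0, 0, 1],  # lane 1 feeds lane 2
    [1, 0, 0],  # lane 2 loops back to lane 0
]

def reachable_lanes(rt, lane):
    """Lanes directly reachable from `lane` according to RT."""
    return [j for j, e in enumerate(rt[lane]) if e]
```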
To choose an optimal pursuit route, the number of background vehicles in each lane is also utilized by the progression cognition module. These counts form a vector of size 1 × L, defined as BV_t. We use convolutional neural networks (CNNs) in the module to extract key traffic features from RT and BV_t. Specifically, RT is fed into the convolutional layers; their output, combined with BV_t, is input to the fully connected layers, and finally the urban traffic feature F is obtained.
As the core of the progression cognition module, multi-head attention is used to help pursuing vehicles focus on evading vehicles while simultaneously taking the urban traffic feature F into account. All pursuing vehicles share their locations, and F is attached to each vehicle position vector loc^i_t as a word embedding. Therefore, the query of pursuing vehicle n is q^n = [loc^n_t, F], and the keys K are built analogously from the evading vehicles' embeddings [loc^m_t, F]. The group attention weights W_g can be derived as

W_g = (1/h) Σ_{head=1}^{h} softmax(q K^T / √d_K),

where h is the number of heads and d_K is the dimension of K's features. The resulting group attention weight matrix W_g has size N × M. The nth row of W_g represents the attention weights of pursuing vehicle n over all evading vehicles, where a larger value means more attention. Every pursuing vehicle selects the evading vehicle with the maximum attention weight as its target. Due to the high dynamics of urban traffic, the target evading vehicle selected by pursuing vehicle n may vary between time steps. Therefore, progression cognition adaptively divides the pursuing vehicles into collaborative groups according to the traffic situation.
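The grouping step can be sketched with a single-head, projection-free simplification. The real module uses h heads and learned query/key projections; here the query and key are the raw feature vectors, so this is only an illustrative sketch:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def group_attention(pursuer_feats, evader_feats):
    """One row of scaled dot-product attention per pursuer over all evaders.

    Returns an N x M matrix W_g whose rows are attention distributions.
    """
    d_k = len(evader_feats[0])
    W_g = []
    for q in pursuer_feats:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in evader_feats]
        W_g.append(softmax(scores))
    return W_g

def assign_targets(W_g):
    """Each pursuer chases the evader with its maximum attention weight."""
    return [row.index(max(row)) for row in W_g]
```

Pursuers sharing the same argmax column naturally form one collaborative group, and because W_g is recomputed each time step, the grouping adapts to the traffic situation.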

B. Multi-Vehicle Pursuit Path Planning
Deep Q-Network (DQN) is a popular reinforcement learning algorithm that has been applied to decision-making in various scenarios. In DQN, artificial neural networks (ANNs) are used to approximate Q_{π_n}(s^n_t, a^n_t). DQN adopts a dual-network framework consisting of an online network and a target network with the same structure, parameterized by θ_n and θ_n′, respectively. In MVP path planning, DQN evaluates the value Q of each action of the pursuing vehicle according to its real-time state s^n_t, denoted Q_{π_n}(s^n_t | θ_n), which serves as the policy for agent n to select the appropriate action. The architecture of the multi-vehicle pursuit path planning algorithm is shown in Fig. 3.
In this paper, the action space of each pursuing vehicle includes three actions: turning left, turning right, and going straight. For agent n, the state s^n_t consists of four parts: its own position loc^n_t, the position loc^m_t of the target evading vehicle it focuses on, the critical traffic feature F, and its attention weight vector W^n_g. Here, W^n_g, the nth row of W_g, represents the attention weights of agent n over all evading vehicles obtained via the attention-based progression cognition module. To motivate capture and incentivize efficient training, a carefully designed reward r^n_t consists of a sparse reward and a dense reward. Sparse Reward: Only when a collaborative group successfully captures its target do all members of the group obtain a positive reward V. Dense Reward: The sparsity of the reward hinders the exploration of optimal policies by agents, so a dense reward is set for each pursuing vehicle. It contains two components: 1) pursuing vehicles that have not captured their target vehicle are given a negative reward c at each time step; 2) a distance-sensitive reward is set to improve pursuing efficiency. When a pursuing vehicle reduces the distance to its current target compared with the last time step, it obtains a positive reward; conversely, it is punished with a negative reward.
Therefore, r^n_t is formulated as

r^n_t = V, if the group of pursuing vehicle n captures its target evader;
r^n_t = c + σ (d^{n,m}_t − d^{n,m}_{t−1}), otherwise,

where σ is a negative reward factor, and d^{n,m}_t denotes the distance of pursuing vehicle n from its target evading vehicle m at time step t.
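The sparse-plus-dense reward can be sketched directly from the description above; the numeric values of V, c, and σ are illustrative assumptions:

```python
def pursuit_reward(captured, d_now, d_prev, V=10.0, c=-0.1, sigma=-0.05):
    """Reward for pursuer n at one time step (V, c, sigma assumed values).

    captured: whether the agent's collaborative group captured its target.
    d_now / d_prev: distance to the target evader at t and t-1.
    """
    if captured:
        return V                      # sparse group reward on capture
    # dense reward: per-step penalty c plus distance-sensitive shaping;
    # since sigma < 0, closing the distance (d_now < d_prev) yields
    # sigma * (d_now - d_prev) > 0, i.e. a positive shaping term.
    return c + sigma * (d_now - d_prev)
```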
Adopting prioritized experience boosting MARL, we compute the following gradient by differentiating the loss function with respect to the weights,

∇_{θ_n} L(θ_n) = ω̄^i_n δ^i_n ∇_{θ_n} Q_{π_n}(s^n_t, a^n_t | θ_n), (15)

in which δ^i_n is the temporal-difference error computed with the target network Q_{π_n}(· | θ_n′). θ_n is updated via stochastic gradient descent and Eq. (15), and the target network performs a soft update at each training step,

θ_n′ ← τ θ_n + (1 − τ) θ_n′,

where τ is the soft update coefficient.
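The soft target update can be sketched as follows; representing the network parameters as flat lists and the value τ = 0.005 are illustrative assumptions:

```python
def soft_update(theta_target, theta_online, tau=0.005):
    """Polyak soft update of the target network after each training step:
    theta' <- tau * theta + (1 - tau) * theta'.

    Small tau keeps the target network slowly tracking the online
    network, stabilizing the bootstrapped TD targets.
    """
    return [tau * o + (1.0 - tau) * t
            for t, o in zip(theta_target, theta_online)]
```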

C. PEPCRL-MVP Decision-Making and Training Process
The decision-making and training process of the proposed PEPCRL-MVP is shown in Algorithm 2. N DQN-based path planning agents and the prioritization network are initialized first. At the beginning of each epoch, the N agents are trained in a distributed manner with personalized and prioritized experience via Algorithm 1, if the number of transitions in the global experience replay buffer G has reached max_cap. Then the pursuit-evasion environment is initialized to test PEPCRL-MVP. In each time step of the pursuit, attention-based progression cognition processing is invoked to obtain F and W_g. Then each agent uses its partial observation to obtain optimal path planning, and the strategies of all agents are executed. At the end of the epoch, the experience of all pursuing vehicles is stored in G. Finally, the change of rewards is calculated and used to update the prioritization network PN via Eq. (7).
In the decision-making and training process of PEPCRL-MVP, collaboration among agents is realized in three aspects. 1) Information Sharing: During the pursuit process, the pursuing vehicles upload their positions and observation information to the central server for traffic feature extraction and task allocation. 2) Task Allocation: The proposed attention-based progression cognition module dynamically calculates group attention to adaptively group the pursuing vehicles. 3) Experience Sharing: To increase the utilization of experience, all agents upload their experience to the global experience buffer. During training, every agent selects prioritized experience from the global experience buffer via the prioritization network and updates its parameters.

A. The Simulator and Settings
As a MARL algorithm, PEPCRL-MVP collects training data and updates parameters by interacting with the simulated urban traffic environment. To comprehensively evaluate the proposed PEPCRL-MVP, we build three urban traffic road scenes with bidirectional two-lane roads based on SUMO [42], including 3 × 3 and 4 × 5 grid-pattern urban roads, and real map-based urban roads. The real map-based urban roads simulate those in an area inside the second ring road of Beijing, bridging the 'sim-to-real' gap; the real urban road map is obtained from an open-sourced map website. During the simulation, the number of background vehicles remains constant, and the background vehicles follow randomly selected routes. Moreover, to evaluate the robustness of PEPCRL-MVP, we design three difficulty levels of MVP tasks with different numbers of pursuing vehicles N and evading vehicles M: 6 pursuing vehicles chasing 3 evading vehicles (denoted by P6-E3), 7 pursuing vehicles chasing 4 evading vehicles (denoted by P7-E4), and 8 pursuing vehicles chasing 5 evading vehicles (denoted by P8-E5). All evading vehicles randomly select escape routes. The simulation parameters are shown in

B. Ablation Experiments
We conduct 100 tests on every model and measure pursuit performance in terms of five metrics: average reward (AR), standard deviation of reward (SDR), average time steps (ATS), standard deviation of time steps (SDTS), and pursuing success rate (SR). Ablation experiments are designed to investigate the effect of the proposed prioritized experience selection and progression cognition modules in PEPCRL-MVP. The ablation experiment results are shown in columns 4 to 6 of TABLE IV. The results of A-MVP correspond to DQN-based path planning with progression cognition but without prioritized experience selection, and the results of B-MVP correspond to DQN-based path planning with prioritized experience selection but without attention-based progression cognition.
PEPCRL-MVP shows the best performance in scenes with different difficulty levels under the same urban traffic road structure. On the real map-based urban roads, compared with A-MVP, the AR of PEPCRL-MVP increases by 52.53%, 47.46%, and 20.96%, and the ATS of PEPCRL-MVP decreases by 2.82%, 3.31%, and 3.24% in the P6-E3, P7-E4, and P8-E5 difficulty levels, respectively. Furthermore, the SDR and SDTS of PEPCRL-MVP are comparable to those of A-MVP. These results reveal that prioritized experience selection can effectively promote cooperation among pursuing vehicles and improve pursuing performance.
Given the pursuing difficulty level, PEPCRL-MVP also substantially outperforms the other methods under different urban traffic road scenes. Taking P6-E3 as an example, the SDTS of PEPCRL-MVP is 7.55%, 2.80%, and 1.35% lower than that of B-MVP under the 3 × 3, 4 × 5, and real map-based scenes, respectively. These results show the proposed method has excellent robustness. The SDR is 10.72% lower than that of B-MVP on average at the P7-E4 difficulty level. These results can be explained by the fact that attention-based progression cognition considerably enhances the stability of the pursuing vehicles' performance. Moreover, the proposed PEPCRL-MVP has better generalization and pursuing validity than A-MVP and B-MVP. In different scenes, whether the urban road structure or the pursuing difficulty level varies, PEPCRL-MVP achieves the highest SR. Concretely, the SR of PEPCRL-MVP is 19.67% and 11.92% higher than that of A-MVP and B-MVP on average, respectively.
In addition, to investigate the impact of prioritized experience selection on PEPCRL-MVP's training convergence, we present the average reward under the P6-E3 setting in Fig. 4. The results of the ablation experiments demonstrate that the prioritization network can select an appropriate training set for each agent, which overcomes the lack of agent differentiation in existing MVP approaches. It effectively promotes the convergence of agent learning and enhances cooperation among agents. Furthermore, the progression cognition module can decide appropriate targets for each pursuing vehicle according to the real-time traffic situation and the MVP task, consequently improving pursuing efficiency and system stability.
To verify the necessity of feeding the group attention to the DQN, we conducted an ablation experiment, C-MVP, in which the DQN inputs are the ego pursuing vehicle position, the position of its target vehicle, and the urban traffic features, without group attention. TABLE IV shows the results. Across the three urban traffic road scenes, the ATS of PEPCRL-MVP is 4.38 time steps less than that of C-MVP on average. With increasing difficulty, the advantage of PEPCRL-MVP in pursuit efficiency over C-MVP becomes more apparent. For example, on the real map urban traffic road, the ATS of PEPCRL-MVP decreases by 0.84%, 0.98%, and 1.2% in the P6-E3, P7-E4, and P8-E5 difficulty levels, respectively. Moreover, over all tested scenes, the SR of PEPCRL-MVP improves by 7.5% over that of C-MVP. The superiority of PEPCRL-MVP over C-MVP illustrates that adding group attention to the DQN assists the decision-making of pursuing vehicles and improves pursuit efficiency.
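The difference between the C-MVP and full PEPCRL-MVP inputs can be sketched as a simple state-vector construction. All dimensions and argument names here are illustrative assumptions, not the paper's exact network interface.

```python
import numpy as np

def build_dqn_input(ego_pos, target_pos, traffic_feat, group_attn=None):
    """Concatenate the DQN state vector.

    C-MVP uses only the ego position, target position, and traffic
    features; the full PEPCRL-MVP variant additionally appends the group
    attention weights. Shapes below are invented for illustration.
    """
    parts = [np.asarray(ego_pos, float),
             np.asarray(target_pos, float),
             np.asarray(traffic_feat, float)]
    if group_attn is not None:  # full PEPCRL-MVP input
        parts.append(np.asarray(group_attn, float))
    return np.concatenate(parts)

# 2-D positions, 8 traffic features, 2 group attention weights (all toy sizes).
s_c = build_dqn_input([1.0, 2.0], [5.0, 3.0], [0.1] * 8)               # C-MVP
s_full = build_dqn_input([1.0, 2.0], [5.0, 3.0], [0.1] * 8, [0.6, 0.4])  # PEPCRL-MVP
```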

C. Comparison with Other Methods
We compare PEPCRL-MVP to other state-of-the-art RL approaches for MVP, including DQN, DDPG, MADDPG, Twin Delayed Deep Deterministic policy gradient-Decentralized Multi-Agent Pursuit (TD3-DMAP) [9], Proximal Policy Optimization (PPO), and Transformer-based Time and Team RL for Observation-constrained MVP (T³OMVP) [26]. The comparison results are presented in the columns beginning from column 7 in TABLE IV.
According to the comparison results, for any given pursuing difficulty level, PEPCRL-MVP shows remarkable performance improvement under all the traffic road scenes. For the P8-E5 setting, the AR of PEPCRL-MVP is improved by 43.41% and 42.5% on average compared with DQN and PPO, respectively, under the 3 × 3, 4 × 5, and real map-based road structures; these two are, in general, the top performers among all comparison methods. The ATS of PEPCRL-MVP decreases by 4.18% on average compared with the other methods at the P8-E5 difficulty level on the real map-based urban road. The results show that the proposed PEPCRL-MVP approach substantially improves pursuing efficiency and adapts well to different road scenes and traffic situations.
Under the same urban traffic road scenes, PEPCRL-MVP shows competitive robustness and pursuing effectiveness at different pursuing difficulty levels. Under the 3 × 3 grid-pattern urban roads, as the pursuing difficulty level increases, the ATS of PEPCRL-MVP is 6.46%, 1.73%, and 1.32% lower than that of PPO, which has the second-best performance, in the P6-E3, P7-E4, and P8-E5 difficulty levels, respectively. We further evaluate the SR metric for all methods to compare PEPCRL-MVP with the others comprehensively. The SR of PEPCRL-MVP is 51.09% higher than that of the other methods on average under the 4 × 5 scene, and 47.53% higher on average over all scenes. These results indicate that PEPCRL-MVP greatly improves pursuing efficiency and is more robust.
It is noted that PEPCRL-MVP brings no significant improvement in SDR and SDTS. This can be explained by the fact that we set the maximum number of time steps to 800, which leads to a higher standard deviation for the better-performing methods over the 100 tests. Although the SDR and SDTS are not outstanding, there is a considerable increase in AR, ATS, and SR for the PEPCRL-MVP approach. Specifically, at the P7-E4 difficulty level in the real map-based simulation environment, even though the SDR and SDTS of PEPCRL-MVP are 11.16% and 7.78% higher than those of DQN, the AR of PEPCRL-MVP is greatly improved, increasing by 51.66%. The SDR and SDTS of PEPCRL-MVP are respectively 11.54% and 3.37% lower than those of PPO, which has the second-best performance, while the AR increases by 40.81% and the ATS decreases by 6.46% at the P6-E3 difficulty level under the 3 × 3 road scene. These results demonstrate that PEPCRL-MVP achieves great improvements in both algorithm stability and pursuing efficiency in simple scenes, and sacrifices some stability for higher pursuing efficiency and better average performance in some complex scenes. As shown in Fig. 4 (c), PEPCRL-MVP has a better convergence trend and a higher reward than the other methods. In conclusion, Fig. 4 illustrates that, compared with TD3-DMAP and PPO, which are among the best of all comparison methods, PEPCRL-MVP achieves a competitive convergence trend and stability, demonstrating its superiority and effectiveness.

D. Case Study
In this section, we analyze the PEPCRL-MVP pursuing process in detail in a case study with a real map-based scene. Representative results are shown in Fig. 5. Fig. 5 (a) presents the distribution of background vehicles and the pursuing routes, where the number marked next to a lane represents the average number of background vehicles per time step in that lane during the pursuit. A lane averaging more than 10 vehicles is considered congested. Fig. 5 (a) also shows the routes of pursuing vehicles p2 and p5. Following the group attention weights, p2 and p5 form a group to capture e1 from A. From the routes of p2 and p5, it can be seen that they predict the trajectory of the target evading vehicle and collaboratively pursue and intercept e1. It is worth noting that p5 plans its path to C while avoiding the congested lanes l1 and l2. Finally, p5 catches e1 at D. This pursuit process shows the efficient cooperation of p2 and p5. To investigate the impact of the prioritization network, we show its training loss in the real map-based scene in Fig. 6. The prioritization networks with different numbers of pursuing and evading vehicles all converge within 100 training epochs. The training of the prioritization network with fewer agents converges more easily, but the mean square error between the network output and the reward gain is larger. It is clear from Fig. 6 that as the number of agents increases, the loss of the prioritization network decreases. This suggests that extensive experience collection greatly contributes to prioritization network performance. It also demonstrates that the prioritization network can effectively evaluate the global experience pool, thus facilitating the learning and collaboration of multiple agents.
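The congestion rule used in Fig. 5 (a), a lane averaging more than 10 background vehicles per time step being congested, can be sketched as follows. The per-step lane-count log format is a hypothetical stand-in for the simulator's output.

```python
from collections import defaultdict

CONGESTION_THRESHOLD = 10  # avg. vehicles/step above which a lane is congested

def congested_lanes(lane_counts_per_step, threshold=CONGESTION_THRESHOLD):
    """Return the set of lanes whose mean background-vehicle count
    per time step exceeds the threshold.

    lane_counts_per_step: list of {lane_id: vehicle_count} snapshots,
    one per time step (an assumed log format, not the simulator's API).
    """
    totals = defaultdict(float)
    for snapshot in lane_counts_per_step:
        for lane, count in snapshot.items():
            totals[lane] += count
    n = len(lane_counts_per_step)
    return {lane for lane, total in totals.items() if total / n > threshold}

# Toy two-step log: l1 averages 13, l2 averages 11, l3 averages 4.
log = [{"l1": 14, "l2": 12, "l3": 3},
       {"l1": 12, "l2": 10, "l3": 5}]
```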

VI. CONCLUSION
The emerging MARL technology is promising for multi-vehicle pursuit applications. However, the mission- and safety-critical MVP tasks present great challenges, especially the pursuit of multiple target vehicles. While there are existing MARL algorithms proposed for MVP, they usually apply centralized training with randomly selected experience samples and do not adapt well to dynamically changing traffic situations. To address these problems, in this paper we proposed a novel MVP approach, called PEPCRL-MVP, to improve MARL learning, collaboration, and MVP performance in dynamic urban traffic scenes. PEPCRL-MVP includes two major new components: a prioritization network and an attention-based progression cognition module. The prioritization network was introduced to effectively select training experience samples and increase diversity in the optimization and behavior of MARL, which improved agent collaboration and the exploration of experience. The progression cognition module was introduced to extract key traffic features from the sensor data and support the pursuing vehicles in adaptively adjusting their target evading vehicles and path planning according to the real-time traffic situation. A simulator was developed for the evaluation of the proposed PEPCRL-MVP approach and comparison with existing ones. Extensive experiments were conducted on PEPCRL-MVP and several other approaches over urban roads in an area inside the second ring road of Beijing. The experiment results demonstrate that PEPCRL-MVP significantly outperforms the other methods in all the investigated road scenes in terms of performance metrics including pursuing success rate and average reward. The results also demonstrate the effectiveness of the two proposed components; jointly they largely improve collaboration and traffic awareness, leading to improved MVP performance. In the future, we will investigate the impact of additional factors in MVP, such as pedestrians, social activities, and communication delay, on the design and analysis of MVP approaches. We will also design smarter MVP methods for more realistic scenes, such as evading vehicles not following traffic rules.

Fig. 1. Architecture of PEPCRL-MVP. The urban traffic environment for MVP (a) provides complex pursuit-evasion scenes and an interactive environment for MARL. The attention-based progression cognition module (b) provides accurate urban traffic information with critical features and group attention. The critical features and group attention are used to improve DQN-based path planning (c). The prioritization network and prioritized experience selection (d) are used to improve the diversity and personalization of MARL.

Algorithm 1: Training Process of Prioritized Experience Boosting MARL
Input: N agents, prioritization network P_N, average reward of each agent in k historical test epochs [r_1, r_2, ..., r_N], and global experience replay buffer G
1  for n = 1:N do
2      for i = 1:max_cap do
3          Get the priority of E_i in G by P_N considering parameters θ_n of agent n;
4      end
5      Sample personalized training set E_n^per for agent n according to P_n;
6      Update parameters θ_n of agent n via Eq. (11);
7  end
8  Test the updated MARL and get distributed rewards [r_1, r_2, ..., r_N];
9  Calculate the change of rewards [Δr_1, Δr_2, ..., Δr_N];
10 Calculate the loss of P_N via Eq. (7) and update P_N;

The attention-based progression cognition module is presented in Section IV-A. DQN-based path planning, as the core decision-making algorithm, is then described in Section IV-B. Finally, Section IV-C introduces the decision-making and training process of the proposed PEPCRL-MVP.
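The inner loop of Algorithm 1, scoring every transition in the global buffer with the prioritization network and then sampling a personalized training set for one agent, can be sketched as below. The callable `priority_net` is a stand-in for P_N, and the toy priority function keyed on a TD error is purely illustrative.

```python
import numpy as np

def prioritized_sample(buffer, agent_params, priority_net, set_size, rng):
    """Select a personalized training set for one agent.

    priority_net(transition, agent_params) -> positive scalar is an
    assumed interface for the prioritization network; transitions are
    then drawn with probability proportional to their priority.
    """
    priorities = np.array([priority_net(t, agent_params) for t in buffer], float)
    probs = priorities / priorities.sum()
    idx = rng.choice(len(buffer), size=set_size, replace=False, p=probs)
    return [buffer[i] for i in idx]

# Toy stand-in: priority = |TD error| scaled by an agent-specific factor,
# so different agent parameters yield different priority rankings.
buffer = [{"td_error": e} for e in (0.1, 2.0, 0.5, 3.0)]
net = lambda t, params: abs(t["td_error"]) * params["scale"]
rng = np.random.default_rng(0)
batch = prioritized_sample(buffer, {"scale": 1.0}, net, set_size=2, rng=rng)
```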

Algorithm 2: PEPCRL-MVP Decision-making and Online Training Algorithm
1  Initialize N DQN-based path planning agents and the prioritization network P_N;
2  Initialize global experience replay buffer G;
3  for e = 1:max_epoch do
4      if len(G) = max_cap then
5          Train N agents via Algorithm 1, Eq. (15);
6      end
7      Initialize an urban pursuit-evasion environment and obtain S_1 = {LOC_P^1, LOC_E

Fig. 4 (a) shows the average reward over training epochs under the 3 × 3 grid pattern. It can be noticed that, compared with A-MVP, PEPCRL-MVP has smaller fluctuations and a more stable convergence in the late stage of training. For the 4 × 5 grid pattern, as shown in Fig. 4 (b), although the average reward of A-MVP grows fast in the early training stage, PEPCRL-MVP has a faster convergence speed and a higher converged reward in the late training stage. In Fig. 4 (c), under the real map-based urban road, PEPCRL-MVP shows obviously stronger convergence. According to the results in Fig. 4, it can be observed that prioritized experience selection greatly facilitates the convergence of agent learning and helps improve the pursuing performance.

Fig. 4 describes the convergence curves of average reward over training epochs for P6-E3 under different road structures. In Fig. 4 (a), both TD3-DMAP and PPO fluctuate heavily in the late stage of training under the 3 × 3 road structure scene, while TD3-DMAP shows a superior growth trend and stable convergence. Fig. 4 (b) depicts the average reward under the 4 × 5 urban road scene, showing that PEPCRL-MVP outperforms TD3-DMAP and PPO in both convergence rate and convergence stability. For the real map-based urban road scene, as shown in Fig. 4 (c), PEPCRL-MVP again achieves a better convergence trend and a higher reward than the comparison methods.

Fig. 5 (b) shows the group attention weights over 260 time steps during the pursuit. It can be observed that p2 and p5 both focus their attention on e1. Moreover, each pursuing vehicle has its own target evading vehicle, and each evading vehicle may be pursued by one or more pursuing vehicles. This indicates that the progression cognition module can select suitable targets for the pursuing vehicles according to the traffic situation and the locations of the evading vehicles. PEPCRL-MVP can thus achieve efficient and effective collaborative multi-vehicle pursuit.
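The target-selection behavior read from Fig. 5 (b), where each pursuer commits to the evader carrying its largest group attention weight so that several pursuers may group on one evader, can be sketched as a row-wise argmax. The weight matrix below is invented for illustration and is not taken from the paper's experiments.

```python
import numpy as np

def assign_targets(attn_weights):
    """Assign each pursuer the evader with its highest attention weight.

    attn_weights: (n_pursuers, n_evaders) matrix, e.g. softmax outputs of
    an attention module (an assumed representation of the group attention).
    Several rows may share an argmax, so pursuers can group on one evader.
    """
    return np.argmax(attn_weights, axis=1)

# Rows 0 and 1 (think p2 and p5) both attend most to evader 0 (e1),
# so they form a group; row 2 targets evader 1.
w = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.1, 0.3],
              [0.2, 0.7, 0.1]])
targets = assign_targets(w)  # → array([0, 0, 1])
```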

Fig. 6. Training loss of the prioritization network in the real map-based scene.
TABLE I and the PEPCRL-MVP parameters are shown in TABLE II. The internal structure of DQN is shown in TABLE III.