Fairbank, Michael and Prokhorov, Danil and Barragan-Alcantar, David and Samothrakis, Spyridon and Li, Shuhui (2025) Neurocontrol for Fixed-Length Trajectories in Environments with Soft Barriers. Neural Networks, 184. p. 107034. DOI https://doi.org/10.1016/j.neunet.2024.107034
Abstract
In this paper we present three neurocontrol problems in which the analytic policy gradient, computed via back-propagation through time, is used to train a simulated agent to maximise a polynomial reward function in a simulated environment. If the environment includes terminal barriers (e.g. solid walls) which terminate the episode whenever the agent touches them, then we show that learning can get stuck in oscillating limit cycles or local minima. Hence we propose using fixed-length trajectories and converting these barriers into soft barriers, which the agent may pass through while incurring a significant penalty cost. We demonstrate that soft barriers have a drawback: they can cause exploding learning gradients. Furthermore, the strongest learning gradients often appear at inappropriate parts of the trajectory, where control of the system has already been lost. When combined with modern adaptive optimisers, this combination of exploding gradients and inappropriate learning often causes training to grind to a halt. We propose ways to avoid these difficulties, either by careful gradient clipping or by smoothly truncating the gradients of the soft barriers’ polynomial cost functions. We argue that this enables the learning algorithm to avoid exploding gradients and to concentrate on the most important parts of the trajectory, rather than on parts where control has already been irreversibly lost.
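The recipe the abstract describes — fixed-length rollouts, a polynomial soft-barrier penalty, and smooth truncation of that penalty's gradient — can be illustrated with a minimal JAX sketch. This is not the authors' implementation: the quadratic penalty, the toy point-mass dynamics, the wall at x = 1, the penalty weight, and the gradient cap `G_MAX` are all assumptions made here for the sake of the example.

```python
import jax
import jax.numpy as jnp

G_MAX = 10.0  # assumed cap on the soft-barrier gradient magnitude

@jax.custom_vjp
def soft_barrier(depth):
    # Quadratic (polynomial) penalty for penetrating a barrier by depth >= 0.
    return depth ** 2

def _soft_barrier_fwd(depth):
    return depth ** 2, depth

def _soft_barrier_bwd(depth, g):
    # Smoothly truncate the penalty's gradient: approximately d/dx(x^2) = 2x
    # for shallow penetrations, saturating at G_MAX instead of exploding
    # for deep ones.
    truncated = G_MAX * jnp.tanh(2.0 * depth / G_MAX)
    return (g * truncated,)

soft_barrier.defvjp(_soft_barrier_fwd, _soft_barrier_bwd)

def rollout_cost(params, x0, horizon=50):
    # Fixed-length trajectory: barrier contact adds a penalty cost but
    # never terminates the episode early.
    def step(x, _):
        u = jnp.tanh(params["w"] @ x + params["b"])      # tiny linear policy
        x_next = x + 0.1 * jnp.concatenate([x[1:], u])   # toy point-mass dynamics
        depth = jnp.maximum(0.0, x_next[0] - 1.0)        # penetration past a wall at x = 1
        cost = jnp.sum(x_next ** 2) + 100.0 * soft_barrier(depth)
        return x_next, cost
    _, costs = jax.lax.scan(step, x0, None, length=horizon)
    return jnp.sum(costs)

# Analytic policy gradient via back-propagation through time:
params = {"w": jnp.zeros((1, 2)), "b": jnp.zeros(1)}
policy_grad = jax.grad(rollout_cost)(params, jnp.array([0.5, 0.0]))
```

The `tanh` saturation corresponds to the abstract's "smoothly truncating" option: unlike a hard clip, the surrogate gradient stays differentiable and bounded, so deep barrier penetrations late in the trajectory (where control is already lost) cannot dominate the learning signal. The paper's alternative, "careful gradient clipping", would replace the `tanh` with a hard clamp.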
| Item Type: | Article |
|---|---|
| Uncontrolled Keywords: | Adaptive dynamic programming; Analytic policy gradient; Back-propagation through time; Exploding gradients; Neurocontrol; Reinforcement learning; Soft barriers |
| Divisions: | Faculty of Science and Health; Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
| SWORD Depositor: | Unnamed user with email elements@essex.ac.uk |
| Depositing User: | Unnamed user with email elements@essex.ac.uk |
| Date Deposited: | 03 Jan 2025 11:49 |
| Last Modified: | 03 Jan 2025 11:49 |
| URI: | http://repository.essex.ac.uk/id/eprint/39819 |
Available files
Filename: 1-s2.0-S0893608024009638-main.pdf
Licence: Creative Commons: Attribution 4.0