Yilmaz, Ahmet and Poli, Riccardo (2022) Successfully and Efficiently Training Deep Multi-layer Perceptrons with Logistic Activation Function Simply Requires Initializing the Weights with an Appropriate Negative Mean. Neural Networks, 153. pp. 87-103. DOI https://doi.org/10.1016/j.neunet.2022.05.030
Abstract
The vanishing gradient problem (i.e., gradients prematurely becoming extremely small during training, thereby effectively preventing a network from learning) is a long-standing obstacle to training deep neural networks with sigmoid activation functions via the standard back-propagation algorithm. In this paper, we found that an important contributor to the problem is weight initialization. We started by developing a simple theoretical model showing how the expected value of gradients is affected by the mean of the initial weights. We then developed a second theoretical model that allowed us to identify a sufficient condition for the vanishing gradient problem to occur. Using these theories, we found that initial back-propagation gradients do not vanish if the mean of the initial weights is negative and inversely proportional to the number of neurons in a layer. Numerous experiments on networks with 10 and 15 hidden layers corroborated the theoretical predictions: if we initialized weights as indicated by the theory, the standard back-propagation algorithm was both highly successful and efficient at training deep neural networks using sigmoid activation functions.
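The sketch below illustrates the kind of initialization the abstract describes: Gaussian weights whose mean is negative and inversely proportional to the number of neurons feeding into a layer. The abstract does not state the exact proportionality constant, variance, or distribution used in the paper, so the constant `a`, the standard deviation, and the use of a normal distribution here are assumptions for illustration only.

```python
import numpy as np

def init_negative_mean(n_in, n_out, a=1.0, std=0.1, rng=None):
    """Illustrative initializer: weights drawn from a Gaussian whose mean
    is negative and inversely proportional to the layer's fan-in (-a / n_in).
    The constant `a` and the standard deviation are assumed values, not the
    ones derived in the paper."""
    rng = np.random.default_rng() if rng is None else rng
    mean = -a / n_in
    return rng.normal(loc=mean, scale=std, size=(n_in, n_out))

# Example: weight matrices for an MLP with 10 hidden layers of 100 units,
# as in the kind of architecture the experiments report on.
layer_sizes = [784] + [100] * 10 + [10]
weights = [init_negative_mean(m, n)
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
```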
| Item Type: | Article |
| --- | --- |
| Uncontrolled Keywords: | Deep neural networks; Vanishing gradient; Weights initialization; Logistic activation function; Supervised learning |
| Divisions: | Faculty of Science and Health; Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
| SWORD Depositor: | Unnamed user with email elements@essex.ac.uk |
| Depositing User: | Unnamed user with email elements@essex.ac.uk |
| Date Deposited: | 23 Dec 2022 14:21 |
| Last Modified: | 30 Oct 2024 20:47 |
| URI: | http://repository.essex.ac.uk/id/eprint/32958 |
Available files
Filename: manuscript.pdf
Licence: Creative Commons: Attribution-Noncommercial-No Derivative Works 3.0