Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)
Abstract: We introduce the "exponential linear unit" (ELU) which speeds up learning in deep neural networks and leads to higher classification accuracies. Like rectified linear units (ReLUs), leaky ReLUs (LReLUs) and parametrized ReLUs (PReLUs), ELUs alleviate the vanishing gradient problem via the identity for positive values. However, ELUs have improved learning characteristics compared to units with other activation functions. In contrast to ReLUs, ELUs have negative values, which allows them to push mean unit activations closer to zero like batch normalization but with lower computational complexity. Mean shifts toward zero speed up learning by bringing the normal gradient closer to the unit natural gradient because of a reduced bias shift effect. While LReLUs and PReLUs have negative values, too, they do not ensure a noise-robust deactivation state. ELUs saturate to a negative value with smaller inputs and thereby decrease the forward propagated variation and information. Therefore, ELUs code the degree of presence of particular phenomena in the input, while they do not quantitatively model the degree of their absence. In experiments, ELUs lead not only to faster learning, but also to significantly better generalization performance than ReLUs and LReLUs on networks with more than 5 layers. On CIFAR-100, ELU networks significantly outperform ReLU networks with batch normalization, while batch normalization does not improve ELU networks. ELU networks are among the top 10 reported CIFAR-10 results and yield the best published result on CIFAR-100, without resorting to multi-view evaluation or model averaging. On ImageNet, ELU networks considerably speed up learning compared to a ReLU network with the same architecture, obtaining less than 10% classification error for a single crop, single model network.
Synopsis
Overview
- Keywords: Deep Learning, Activation Functions, Exponential Linear Units, Neural Networks, CIFAR-10, CIFAR-100, ImageNet
- Objective: Introduce the Exponential Linear Unit (ELU) as an activation function that enhances learning speed and accuracy in deep neural networks.
- Hypothesis: ELUs will outperform traditional activation functions like ReLU and its variants in terms of learning speed and classification accuracy.
- Innovation: ELUs provide negative values that help center activations around zero, improving gradient flow and reducing bias shift, leading to faster convergence and better generalization.
Background
Preliminary Theories:
- Vanishing Gradient Problem: A challenge in training deep networks where gradients become too small for effective learning, particularly in layers far from the output.
- Batch Normalization: A technique to normalize activations in a network, reducing internal covariate shift and improving training speed and stability.
- Rectified Linear Units (ReLU): A widely used activation function that outputs zero for negative inputs and the identity for positive inputs, alleviating the vanishing gradient issue but prone to "dead" neurons that stay permanently inactive.
- Leaky ReLU and Parametric ReLU: Variants of ReLU that allow a small, non-zero slope (and hence gradient) for negative inputs, aimed at addressing the dead-neuron problem; minimal sketches of these functions follow this list.
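For reference, minimal NumPy sketches of the activation functions described above (the 0.01 negative slope for Leaky ReLU is a common default, not a value prescribed by the paper):

```python
import numpy as np

def relu(x):
    # Zero for negative inputs, identity for positive inputs.
    return np.maximum(x, 0.0)

def leaky_relu(x, slope=0.01):
    # Fixed small slope for negative inputs keeps a non-zero gradient there.
    return np.where(x > 0, x, slope * x)

def prelu(x, slope):
    # Same form as Leaky ReLU, but `slope` is a parameter learned during training.
    return np.where(x > 0, x, slope * x)
```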
Prior Research:
- 2010: Introduction of ReLU, which significantly improved training in deep networks.
- 2013: Introduction of Leaky ReLU, which allows a small non-zero slope for negative inputs to mitigate the dead-neuron problem of standard ReLU.
- 2015: Development of Batch Normalization, which became a standard technique for improving training speed and stability.
- 2015: Introduction of Parametric ReLU (PReLU), which learns the negative slope during training.
Methodology
Key Ideas:
- ELU Definition: ELUs are defined as f(x) = x for x > 0 and f(x) = α(exp(x) − 1) for x ≤ 0, where the hyperparameter α controls the value to which the ELU saturates for negative inputs (see the sketch after this list).
- Mean Activation Control: ELUs push mean activations closer to zero, which is theorized to speed up learning by aligning the normal gradient with the natural gradient.
- Noise Robustness: The saturation of ELUs at negative values reduces the variance of activations, making the network more robust to noise.
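A minimal NumPy sketch of the definition above, together with its derivative and a quick check of the mean-activation claim (α = 1 matches the value used in the paper's experiments; the random-input check is illustrative, not from the paper):

```python
import numpy as np

def elu(x, alpha=1.0):
    # Identity for x > 0; saturates smoothly to -alpha for large negative x.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    # Derivative: 1 for x > 0, alpha * exp(x) = elu(x) + alpha for x <= 0.
    return np.where(x > 0, 1.0, alpha * np.exp(x))

# Illustration of the mean-activation argument: on zero-mean inputs,
# ELU outputs have a mean much closer to zero than ReLU outputs.
x = np.random.randn(100_000)
print("ReLU mean:", np.maximum(x, 0.0).mean())  # ~0.40
print("ELU  mean:", elu(x).mean())              # ~0.16, closer to zero
```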
Experiments:
- Datasets: Evaluated on MNIST, CIFAR-10, CIFAR-100, and ImageNet.
- Network Architectures: Various convolutional neural networks (CNNs) were employed, with ELUs compared against ReLU, Leaky ReLU, and Batch Normalization setups (a minimal illustrative setup follows this list).
- Metrics: Training loss and test error rates were tracked across iterations to assess performance.
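A minimal PyTorch sketch of how such comparison setups can be built so that only the activation function differs; this is an illustrative small CNN, not the paper's exact architectures or training protocol:

```python
import torch.nn as nn

def make_cnn(act: nn.Module, num_classes: int = 10) -> nn.Sequential:
    # Small CIFAR-style CNN (3x32x32 input); everything except `act` is shared.
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, padding=1), act,
        nn.Conv2d(64, 64, kernel_size=3, padding=1), act,
        nn.MaxPool2d(2),                                # 32x32 -> 16x16
        nn.Conv2d(64, 128, kernel_size=3, padding=1), act,
        nn.MaxPool2d(2),                                # 16x16 -> 8x8
        nn.Flatten(),
        nn.Linear(128 * 8 * 8, num_classes),
    )

# Identical architecture, different activation, as in the compared setups.
elu_net   = make_cnn(nn.ELU(alpha=1.0))
relu_net  = make_cnn(nn.ReLU())
lrelu_net = make_cnn(nn.LeakyReLU(0.1))
```

For a ReLU + Batch Normalization baseline one would additionally insert nn.BatchNorm2d layers after each convolution; per the paper's findings, ELU networks are trained without batch normalization.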
Implications: The design of ELUs suggests that activation functions can significantly influence learning dynamics, particularly in deep networks.
Findings
Outcomes:
- ELUs demonstrated faster convergence and lower training loss compared to ReLU and its variants across all datasets.
- On CIFAR-10, ELUs achieved a test error of 6.55%, while on CIFAR-100, they reached 24.28%, outperforming other architectures without requiring model averaging or multi-view evaluation.
- ELUs maintained lower variance in activation distributions, indicating more stable learning dynamics.
Significance: ELUs challenge the dominance of ReLU by providing a more effective alternative that addresses both speed and accuracy in deep learning tasks.
Future Work: Exploration of optimal values for the hyperparameter α and further testing on diverse datasets and architectures to validate generalizability.
Potential Impact: Adoption of ELUs could lead to more efficient training processes in deep learning, particularly in applications requiring rapid convergence and high accuracy, such as image classification and natural language processing.