Layer Normalization

Abstract: Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
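
For reference, the computation described in the abstract can be summarized compactly (a restatement in the paper's usual notation, where a_i^l is the summed input to hidden unit i in layer l, H is the number of hidden units in that layer, g and b are the adaptive gain and bias, and f is the non-linearity):

```latex
\mu^{l} = \frac{1}{H}\sum_{i=1}^{H} a_i^{l},
\qquad
\sigma^{l} = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_i^{l}-\mu^{l}\right)^{2}},
\qquad
h_i^{l} = f\!\left(\frac{g_i^{l}}{\sigma^{l}}\left(a_i^{l}-\mu^{l}\right) + b_i^{l}\right)
```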

Synopsis

Overview

  • Keywords: Layer Normalization, Neural Networks, Batch Normalization, Recurrent Neural Networks, Training Speed
  • Objective: Introduce layer normalization as a method to improve the training speed and stability of neural networks, particularly recurrent neural networks.
  • Hypothesis: Layer normalization will outperform batch normalization in terms of training speed and stability, especially in recurrent neural networks with small mini-batch sizes.
  • Innovation: Layer normalization computes normalization statistics from all summed inputs to neurons in a layer on a single training case, making it invariant to mini-batch size and applicable to recurrent networks.

Background

  • Preliminary Theories:

    • Batch Normalization: Normalizes the summed inputs to each neuron using a mean and variance computed over the mini-batch, reducing internal covariate shift but making the result dependent on the mini-batch size (contrasted with layer normalization in the sketch after this list).
    • Recurrent Neural Networks (RNNs): Networks whose hidden state from previous time steps is fed back as input, often suffering from unstable (exploding or vanishing) gradients.
    • Weight Normalization: A method that normalizes each neuron's incoming weights (by their L2 norm) rather than its summed inputs, aiming to stabilize training.
    • Covariate Shift: The change in the distribution of a layer's inputs as the parameters of preceding layers are updated during training, which complicates optimization.
  • Prior Research:

    • 2015: Introduction of batch normalization, which significantly improved training speed and model performance across various tasks.
    • 2016: Research on recurrent batch normalization, highlighting the challenges of applying batch normalization to RNNs.
    • 2016: Weight normalization proposed as an alternative to batch normalization, focusing on stabilizing the training process.
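
To make the mini-batch dependence concrete, here is a minimal NumPy sketch (illustrative only, not the authors' code) contrasting the axes over which the two methods compute their normalization statistics for activations of shape (batch, features):

```python
import numpy as np

def batch_norm_stats(a, eps=1e-5):
    """Batch normalization: statistics per feature, computed across the mini-batch axis."""
    mu = a.mean(axis=0, keepdims=True)    # shape (1, features); changes with the mini-batch
    var = a.var(axis=0, keepdims=True)
    return (a - mu) / np.sqrt(var + eps)

def layer_norm_stats(a, eps=1e-5):
    """Layer normalization: statistics per training case, computed across the feature axis."""
    mu = a.mean(axis=1, keepdims=True)    # shape (batch, 1); independent of the mini-batch
    var = a.var(axis=1, keepdims=True)
    return (a - mu) / np.sqrt(var + eps)

a = np.random.randn(4, 8)                 # 4 training cases, 8 summed inputs each
print(batch_norm_stats(a).shape, layer_norm_stats(a).shape)  # both (4, 8)
```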

Methodology

  • Key Ideas:

    • Layer normalization computes mean and variance across all neurons in a layer for a single training case, allowing for consistent normalization regardless of mini-batch size.
    • Each neuron has its own adaptive bias and gain applied after normalization, similar to batch normalization.
    • The method is invariant to per training-case feature shifting and scaling, addressing limitations of batch normalization.
  • Experiments:

    • Evaluated layer normalization on various tasks including image-sentence ranking, question-answering, and handwriting generation.
    • Compared performance against batch normalization using metrics such as training speed and model accuracy across different mini-batch sizes.
    • Datasets included MNIST, the IAM Online Handwriting Database, and corpora for the image-sentence ranking and question-answering tasks.
  • Implications: The design of layer normalization allows for more stable training dynamics in RNNs, particularly beneficial for long sequences and small mini-batches; a minimal sketch of the per-time-step computation follows below.
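
Below is a minimal NumPy sketch of how layer normalization might be applied inside a vanilla RNN step, with the statistics recomputed separately at each time step. Parameter names such as W_h, W_x, g, and b are illustrative assumptions, not the authors' released code:

```python
import numpy as np

def layer_norm(a, g, b, eps=1e-5):
    """Normalize the summed inputs `a` of one training case across hidden units,
    then apply the per-unit adaptive gain `g` and bias `b`."""
    mu = a.mean()
    sigma = a.std()
    return g * (a - mu) / (sigma + eps) + b

def ln_rnn_step(h_prev, x_t, W_h, W_x, g, b):
    """One recurrent step: normalize the summed inputs before the tanh non-linearity."""
    a_t = W_h @ h_prev + W_x @ x_t        # summed inputs at time step t
    return np.tanh(layer_norm(a_t, g, b))

# Tiny usage example with hidden size 16 and input size 8.
H, D = 16, 8
rng = np.random.default_rng(0)
W_h, W_x = rng.normal(size=(H, H)) * 0.1, rng.normal(size=(H, D)) * 0.1
g, b = np.ones(H), np.zeros(H)
h = np.zeros(H)
for t in range(5):                        # statistics are recomputed at every time step
    h = ln_rnn_step(h, rng.normal(size=D), W_h, W_x, g, b)
print(h.shape)
```

Because the statistics depend only on the current summed inputs of a single case, the same computation runs unchanged at training and test time and for any mini-batch size.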

Findings

  • Outcomes:

    • Layer normalization demonstrated faster convergence and improved generalization performance compared to baseline models and batch normalization.
    • Effective in stabilizing hidden state dynamics in RNNs, leading to reduced training times.
    • Invariance to mini-batch size allows for more flexible application in various training scenarios.
  • Significance: Layer normalization provides a robust alternative to batch normalization, particularly in recurrent architectures where batch normalization struggles due to varying sequence lengths and mini-batch sizes.

  • Future Work: Further exploration of layer normalization in convolutional networks and its integration with other normalization techniques could enhance performance across different architectures.

  • Potential Impact: If further developed, layer normalization could lead to significant advancements in training efficiency and model performance in deep learning, particularly for applications requiring real-time processing or limited computational resources.

Notes

Meta

Published: 2016-07-21

Updated: 2025-08-27

URL: https://arxiv.org/abs/1607.06450v1

Authors: Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

Citations: 8796

H Index: 213

Categories: stat.ML, cs.LG

Model: gpt-4o-mini