Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Abstract: Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. Our findings enable training visual recognition models on internet-scale data with high efficiency.

Synopsis

Overview

  • Keywords: Large Minibatch SGD, ImageNet, ResNet-50, Distributed Training, Learning Rate Scaling
  • Objective: Demonstrate the feasibility of training large-scale deep learning models efficiently using large minibatch sizes without sacrificing accuracy.
  • Hypothesis: Large minibatch sizes can be used effectively in training without loss of generalization accuracy if proper techniques are applied.
  • Innovation: Introduction of a hyper-parameter-free linear scaling rule for learning rates and a new warmup strategy to address optimization challenges with large minibatches.

Background

  • Preliminary Theories:

    • Stochastic Gradient Descent (SGD): A method for optimizing an objective function by iteratively updating parameters based on a subset of data (minibatch).
    • Batch Normalization: A technique to stabilize and accelerate training by normalizing the inputs of each layer.
    • Learning Rate Scaling: The principle that the learning rate should be adjusted in proportion to the minibatch size so that training dynamics remain comparable as the batch grows (see the update-rule sketch after this list).
    • Warmup Strategies: Techniques to gradually increase the learning rate at the beginning of training to avoid optimization difficulties.
  • Prior Research:

    • Krizhevsky (2014): Early multi-GPU training work that proposed adjusting the learning rate with minibatch size but reported accuracy degradation at larger sizes.
    • Keskar et al. (2016): Reported a generalization gap for large-batch training, attributing it to convergence to sharp minima of the loss.
    • Li et al. (2017): Demonstrated distributed training with large minibatches but did not provide a comprehensive scaling rule.
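
The linear scaling principle can be motivated by comparing k small-minibatch SGD steps with a single large-minibatch step. The sketch below follows the paper's informal argument, where η is the base learning rate, n the per-minibatch size, and B_j the j-th minibatch.

```latex
% k SGD steps, each on a minibatch B_j of size n, with learning rate \eta:
w_{t+k} = w_t - \eta \, \frac{1}{n} \sum_{j<k} \sum_{x \in B_j} \nabla l(x, w_{t+j})

% one SGD step on the union of the B_j (size kn) with learning rate \hat{\eta}:
\hat{w}_{t+1} = w_t - \hat{\eta} \, \frac{1}{kn} \sum_{j<k} \sum_{x \in B_j} \nabla l(x, w_t)

% assuming \nabla l(x, w_{t+j}) \approx \nabla l(x, w_t) over these k steps,
% the two updates coincide when \hat{\eta} = k\eta (the linear scaling rule).
```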

Methodology

  • Key Ideas:

    • Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k while keeping other hyper-parameters fixed, allowing large minibatches to match small-minibatch accuracy.
    • Warmup Strategy: Ramp the learning rate from the small-minibatch value up to the scaled rate over the first few epochs (five in the paper) to avoid instability early in training (see the schedule sketch after this list).
    • Implementation in Caffe2: Utilization of the Caffe2 framework for efficient distributed training across multiple GPUs.
  • Experiments:

    • Dataset: ImageNet with approximately 1.28 million training images.
    • Model: ResNet-50 trained with varying minibatch sizes (up to 8192).
    • Metrics: Top-1 validation error and training error curves were used to evaluate model performance.
  • Implications: The methodology allows for efficient scaling of training processes, enabling the use of commodity hardware for large-scale deep learning tasks.
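
As a concrete illustration of the two key ideas above, here is a minimal learning-rate schedule sketch in plain Python. The baseline values (base rate 0.1 for a 256-image minibatch, 5 warmup epochs, step decay at epochs 30, 60, and 80) follow the paper's ResNet-50 recipe; the function names are illustrative and not taken from the paper's Caffe2 code.

```python
def lr_at_epoch(epoch, batch_size, base_lr=0.1, base_batch=256,
                warmup_epochs=5, decay_epochs=(30, 60, 80)):
    """Learning rate at a (possibly fractional) epoch for a given minibatch size.

    - Linear scaling rule: the target rate is base_lr * (batch_size / base_batch).
    - Gradual warmup: ramp linearly from base_lr to the target rate over the
      first `warmup_epochs` epochs.
    - Step decay: divide by 10 at each milestone epoch thereafter.
    """
    k = batch_size / base_batch
    target = base_lr * k
    if epoch < warmup_epochs:
        # Linear interpolation from the small-minibatch rate up to the scaled rate.
        return base_lr + (target - base_lr) * (epoch / warmup_epochs)
    lr = target
    for milestone in decay_epochs:
        if epoch >= milestone:
            lr *= 0.1
    return lr


if __name__ == "__main__":
    # With an 8192-image minibatch, k = 32, so the post-warmup rate is 3.2.
    for e in (0, 2.5, 5, 30, 60, 80):
        print(e, lr_at_epoch(e, batch_size=8192))
```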

Findings

  • Outcomes:

    • Successful training of ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, with accuracy comparable to the small-minibatch baseline.
    • Validation error matched the small-minibatch baseline across minibatch sizes up to 8192; accuracy degraded only for minibatch sizes beyond 8192.
    • The gradual warmup strategy effectively addressed optimization difficulties, allowing large minibatch training to match small minibatch performance.
  • Significance: This research challenges previous beliefs that larger minibatches inherently lead to poorer generalization, demonstrating that with proper techniques, large-scale training can be both efficient and effective.

  • Future Work: Exploration of the linear scaling rule and warmup strategy in other domains and with different architectures, as well as further optimization of communication (allreduce) algorithms for distributed training; a gradient-averaging sketch appears below.

  • Potential Impact: If pursued, these strategies could revolutionize the training of deep learning models, enabling faster iterations and broader accessibility to large-scale data processing in both research and industry settings.
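
To show where the communication cost in distributed synchronous SGD arises, below is a minimal sketch of one training step with gradients averaged across workers. It uses PyTorch's torch.distributed allreduce as a stand-in for the paper's Caffe2/Gloo implementation and omits momentum and weight decay for brevity; it is an illustrative sketch, not the authors' code.

```python
# Minimal sketch of one synchronous-SGD step with per-worker gradients averaged
# by allreduce. Assumes one process per GPU, with torch.distributed already
# initialized (e.g. via torchrun); stands in for the paper's Caffe2/Gloo setup.
import torch
import torch.distributed as dist


def synchronous_sgd_step(model, loss_fn, images, labels, lr):
    """One SGD step on a local n-image shard of the global kn-image minibatch."""
    world_size = dist.get_world_size()

    # Per-worker forward/backward pass; the loss is a mean over the local n images.
    model.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()

    # Allreduce sums gradients over the k workers; dividing by k gives the
    # gradient of the loss averaged over the full kn-image minibatch.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    # Plain SGD update with the linearly scaled learning rate.
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.add_(p.grad, alpha=-lr)

    return loss.detach()
```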

Notes

Meta

Published: 2017-06-08

Updated: 2025-08-27

URL: https://arxiv.org/abs/1706.02677v2

Authors: Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He

Citations: 3317

H Index: 279

Categories: cs.CV, cs.DC, cs.LG

Model: gpt-4o-mini