Building high-level features using large scale unsupervised learning

Abstract: We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200x200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 20,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art.

Synopsis

Overview

  • Keywords: Unsupervised learning, feature detection, deep learning, autoencoders, object recognition
  • Objective: To investigate the feasibility of building high-level, class-specific feature detectors from unlabeled data.
  • Hypothesis: It is possible to learn high-level features, such as face detectors, using only unlabeled images without requiring labeled data.
  • Innovation: The research introduces a scalable approach using a deep autoencoder with local receptive fields and pooling, enabling the learning of complex invariances from large datasets of unlabeled images.

Background

  • Preliminary Theories:

    • Grandmother Neurons: The concept that certain neurons in the brain are highly selective for specific categories, such as faces, suggesting a biological basis for class-specific feature detection.
    • Unsupervised Feature Learning: Techniques that allow the extraction of features from unlabeled data, which have traditionally struggled to capture high-level concepts.
    • Deep Learning Architectures: The use of deep neural networks to learn hierarchical representations of data, which has shown promise in various machine learning tasks.
  • Prior Research:

    • 2006: Hinton and Salakhutdinov showed that deep autoencoders could be trained effectively with layer-wise pretraining, demonstrating the potential of unsupervised learning for feature extraction.
    • 2007: Bengio et al. proposed greedy layer-wise training of deep networks, laying groundwork for scalable learning algorithms.
    • 2009: Lee et al. showed that convolutional deep belief networks could learn face-selective features, but only from datasets of aligned face images, an implicit form of supervision.

Methodology

  • Key Ideas:

    • Locally Connected Sparse Autoencoder: A nine-layered architecture built from three stacked modules, each with a filtering, a pooling, and a normalization sublayer, using local receptive fields with untied weights to scale to large images and reduce communication between machines (a minimal sketch of one module appears after this list).
    • Pooling and Local Contrast Normalization: Techniques employed to achieve invariance to local deformations and lighting changes, enhancing the robustness of learned features.
    • Model Parallelism and Asynchronous SGD: Leveraging a distributed computing environment with 1,000 machines (16,000 cores) to train the billion-connection model efficiently (a toy illustration of the asynchronous updates also follows the list).
  • Experiments:

    • Face Detection: The network was trained on frames sampled from 10 million YouTube videos and then evaluated on a held-out test set of face and non-face images drawn from Labeled Faces in the Wild and ImageNet; the single best neuron classified faces with 81.7% accuracy (see the evaluation sketch after this list).
    • Invariance Testing: Assessed the robustness of the learned face feature against transformations such as translation, scaling, and out-of-plane rotation, confirming the network's ability to generalize across these variations.
    • Object Recognition on ImageNet: Used the learned features to classify images into 22,000 object categories, reaching 15.8% accuracy, a significant improvement over previous methods.
  • Implications: The methodology demonstrates that high-level features can be effectively learned from unlabeled data, paving the way for advancements in areas where labeled data is scarce.
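
The nine-layer network stacks three modules, each consisting of a filtering sublayer with untied local receptive fields, an L2 pooling sublayer, and a local contrast normalization sublayer. The following is a minimal numpy sketch of the forward pass through one such module; the sizes, strides, map counts, and function names are illustrative assumptions, and the training objective (a reconstruction term with a sparsity penalty on the pooled units, in the style of reconstruction ICA) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

IMG = 32        # toy image side length (the paper uses 200x200 inputs)
RF = 6          # local receptive-field size (illustrative; the paper uses 18x18)
STRIDE = 2      # spacing between receptive-field centers (assumed)
MAPS = 4        # feature maps per location (illustrative assumption)
GRID = (IMG - RF) // STRIDE + 1   # feature locations per side

# Untied ("locally connected") filters: a separate weight matrix per location,
# unlike a convolutional layer, which shares one filter bank everywhere.
W = rng.normal(0.0, 0.1, size=(GRID, GRID, MAPS, RF * RF))

def filtering(image):
    """Sublayer 1: linear local filters with no weight sharing."""
    h = np.zeros((GRID, GRID, MAPS))
    for i in range(GRID):
        for j in range(GRID):
            patch = image[i*STRIDE:i*STRIDE+RF, j*STRIDE:j*STRIDE+RF].ravel()
            h[i, j] = W[i, j] @ patch
    return h

def l2_pool(h, pool=2):
    """Sublayer 2: L2 pooling over small spatial neighborhoods, which gives
    some invariance to local translations."""
    out = np.zeros((h.shape[0] // pool, h.shape[1] // pool, h.shape[2]))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            block = h[i*pool:(i+1)*pool, j*pool:(j+1)*pool]
            out[i, j] = np.sqrt((block ** 2).sum(axis=(0, 1)) + 1e-8)
    return out

def local_contrast_norm(p, eps=1e-8):
    """Sublayer 3: subtractive and divisive normalization; a real LCN uses a
    local spatial neighborhood, here the whole map for brevity."""
    centered = p - p.mean(axis=(0, 1), keepdims=True)
    return centered / (centered.std(axis=(0, 1), keepdims=True) + eps)

image = rng.random((IMG, IMG))
features = local_contrast_norm(l2_pool(filtering(image)))
print(features.shape)   # (7, 7, 4) with the toy sizes above
```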
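
Training combined model parallelism (the network partitioned across machines) with asynchronous SGD, in which many model replicas fetch parameters and push gradient updates without synchronizing with one another. The toy sketch below only illustrates that asynchronous-update pattern: a hypothetical quadratic loss stands in for a minibatch gradient and a plain Python lock stands in for the parameter-server machinery; it is not the distributed system used in the paper.

```python
import threading
import numpy as np

TARGET = np.random.default_rng(42).normal(size=8)  # toy "true" parameters
params = np.zeros(8)                               # shared parameter vector ("server" state)
lock = threading.Lock()                            # stands in for server-side serialization

def worker(seed, steps=200, lr=0.01):
    """One model replica: repeatedly fetch (possibly stale) parameters,
    compute a noisy gradient of a toy quadratic loss, and push an update."""
    global params
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        with lock:
            theta = params.copy()                  # fetch current parameters
        grad = 2.0 * (theta - TARGET) + rng.normal(scale=0.1, size=theta.shape)
        with lock:
            params -= lr * grad                    # asynchronous push of the update

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("distance to target after async training:", np.linalg.norm(params - TARGET))
```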
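
To quantify selectivity after unsupervised training, the paper's face-detection evaluation looks for the single neuron whose thresholded activation best separates a labeled test set of faces and non-faces; the same protocol is then repeated on transformed test images (scaled, translated, rotated out of plane) to measure invariance. The sketch below reproduces only the thresholding step, with synthetic activations standing in for the top-layer features; the helper name and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_neuron_accuracy(features, labels, n_thresholds=50):
    """For each neuron, sweep thresholds over its activation range and keep
    the best face / non-face classification accuracy; return the best one."""
    best_acc, best_idx = 0.0, -1
    for j in range(features.shape[1]):
        act = features[:, j]
        for t in np.linspace(act.min(), act.max(), n_thresholds):
            # Allow either polarity: the neuron may fire high or low for faces.
            acc = max(np.mean((act > t) == labels), np.mean((act <= t) == labels))
            if acc > best_acc:
                best_acc, best_idx = acc, j
    return best_idx, best_acc

# Synthetic stand-in for top-layer activations on a labeled test set:
# 200 "images", 64 neurons, with neuron 7 responding more strongly to faces.
labels = rng.integers(0, 2, size=200).astype(bool)
features = rng.normal(size=(200, 64))
features[:, 7] += 2.0 * labels

idx, acc = best_neuron_accuracy(features, labels)
print(f"best neuron: {idx}, single-threshold accuracy: {acc:.3f}")
```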

Findings

  • Outcomes:

    • The network learned individual neurons selective for faces, human bodies, and cat faces from unlabeled data alone, each detectable well above chance from a single neuron's thresholded activation.
    • The learned features exhibited robustness to various transformations, indicating the network's capability to generalize well.
    • The approach resulted in a 70% relative improvement in object recognition accuracy on ImageNet compared to prior state-of-the-art methods.
  • Significance: This research challenges the prevailing belief that labeled data is essential for training effective feature detectors, showcasing the potential of unsupervised learning techniques.

  • Future Work: Suggested avenues include exploring additional high-level concepts, improving the scalability of the model, and applying the methodology to other domains beyond image recognition.

  • Potential Impact: If further developed, these techniques could revolutionize fields such as computer vision, enabling the use of vast amounts of unlabeled data to train sophisticated models without the need for extensive labeling efforts.

Notes

Meta

Published: 2011-12-29

Updated: 2025-08-27

URL: https://arxiv.org/abs/1112.6209v5

Authors: Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, Andrew Y. Ng

Citations: 2230

H Index: 385

Categories: cs.LG

Model: gpt-4o-mini