You Only Look Once: Unified, Real-Time Object Detection

Abstract: We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is far less likely to predict false detections where nothing exists. Finally, YOLO learns very general representations of objects. It outperforms all other detection methods, including DPM and R-CNN, by a wide margin when generalizing from natural images to artwork on both the Picasso Dataset and the People-Art Dataset.

Synopsis

Overview

  • Keywords: Object Detection, Real-Time Processing, Convolutional Neural Networks, Regression Models, YOLO
  • Objective: Introduce a unified approach to object detection that operates in real-time by framing the problem as a regression task.
  • Hypothesis: The YOLO model can achieve high accuracy and speed in object detection by treating the detection process as a single regression problem rather than a series of classification tasks.

Background

  • Preliminary Theories:

    • Object Detection: Traditionally involves identifying objects in images using classifiers that evaluate various locations and scales, often leading to complex and slow pipelines.
    • Convolutional Neural Networks (CNNs): Deep learning models that have revolutionized image processing by automatically learning features from data.
    • Regression Analysis: A statistical process for estimating relationships among variables, applied here to predict bounding boxes and class probabilities directly from images.
  • Prior Research:

    • DPM (Deformable Parts Models): Early object detection models using sliding window techniques, focusing on local features and requiring multiple stages for detection.
    • R-CNN (Regions with CNN features): Uses region proposal methods to generate candidate boxes and then classifies each one with a CNN, significantly improving accuracy but at the cost of speed, often taking seconds per image.
    • Fast R-CNN: Aimed to speed up R-CNN by sharing computation, yet still not achieving real-time performance.

Methodology

  • Key Ideas:

    • Unified Architecture: YOLO integrates the entire detection pipeline into a single neural network, enabling end-to-end training and real-time processing.
    • Grid Division: The input image is divided into an S × S grid, where each grid cell predicts bounding boxes and class probabilities.
    • Bounding Box Predictions: Each grid cell predicts multiple bounding boxes and confidence scores, optimizing for both localization and classification simultaneously.
  • Experiments:

    • Datasets: Evaluated on the PASCAL VOC 2007 and 2012 datasets using mean Average Precision (mAP), with generalization from natural images to artwork tested on the Picasso Dataset and the People-Art Dataset.
    • Ablation Studies: Analyzed the impact of various network configurations and the effectiveness of the grid-based approach in predicting bounding boxes.
  • Implications: The design allows for faster inference times and improved generalization across different domains, making YOLO suitable for real-time applications.
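
The grid encoding described under Key Ideas can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the values S = 7, B = 2, C = 20 are the PASCAL VOC settings reported in the YOLO paper rather than anything stated in this synopsis. Each cell is assumed to predict B boxes of five values (x, y, w, h, confidence) plus one set of C class probabilities.

```python
def output_size(S, B, C):
    """Number of values predicted per image: each of the S*S cells emits
    B boxes (x, y, w, h, confidence) plus C class probabilities."""
    return S * S * (B * 5 + C)


def encode_box(cx, cy, w, h, img_w, img_h, S):
    """Map an absolute box (center cx, cy and size w, h, all in pixels)
    to the grid cell containing its center, plus normalized coordinates.

    Returns (row, col, (x_off, y_off, w_norm, h_norm)) where x_off and
    y_off are offsets within the cell in [0, 1], and w_norm and h_norm
    are fractions of the full image size.
    """
    gx = cx / img_w * S  # center position in grid units
    gy = cy / img_h * S
    col = min(int(gx), S - 1)  # clamp so a center on the far edge stays in range
    row = min(int(gy), S - 1)
    return row, col, (gx - col, gy - row, w / img_w, h / img_h)


# Assumed PASCAL VOC settings: 7 x 7 grid, 2 boxes per cell, 20 classes.
print(output_size(7, 2, 20))  # 7 * 7 * 30 = 1470
print(encode_box(224, 112, 100, 50, 448, 448, 7))
```

Under this encoding, only the cell that contains an object's center is responsible for predicting that object, which is why a single forward pass over the whole image is enough to produce all detections.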

Findings

  • Outcomes:

    • YOLO achieves 63.4% mAP on PASCAL VOC 2007 at 45 frames per second, significantly outperforming other real-time detectors.
    • The smaller Fast YOLO variant reaches 155 frames per second with 52.7% mAP, still about double the mAP of other real-time detectors.
    • Compared to state-of-the-art detection systems, YOLO makes more localization errors but is far less likely to produce false positives in background regions where no object exists.
  • Significance: YOLO's approach contrasts sharply with traditional detection systems, offering a faster, more efficient alternative that maintains competitive accuracy.

  • Future Work: Further improvements could focus on enhancing localization accuracy, particularly for small objects, and adapting the model to handle diverse aspect ratios and configurations.

  • Potential Impact: Advancements in YOLO could lead to broader applications in autonomous systems, robotics, and real-time surveillance, enhancing the capability of machines to interpret visual data quickly and accurately.
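
To put the reported frame rates in perspective, the time available per frame is the reciprocal of the frame rate. The short sketch below does that arithmetic; the helper name is hypothetical, and the figures are a back-of-the-envelope conversion rather than measured latencies.

```python
def frame_budget_ms(fps):
    """Average time per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


for name, fps in [("YOLO", 45), ("Fast YOLO", 155)]:
    print(f"{name}: {frame_budget_ms(fps):.1f} ms per frame")
# YOLO: 22.2 ms per frame
# Fast YOLO: 6.5 ms per frame
```

By comparison, a detector that needs seconds per image has a per-frame cost two or more orders of magnitude higher.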
