Segment Anything

Abstract: We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy-respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive, often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.

Synopsis

Overview

  • Keywords: Image Segmentation, Foundation Models, Zero-Shot Learning, Dataset, Promptable Segmentation
  • Objective: Develop a foundation model for image segmentation that generalizes zero-shot to new tasks and image distributions via prompt engineering.
  • Hypothesis: The Segment Anything Model (SAM) can achieve competitive zero-shot performance on segmentation tasks compared to fully supervised models.

Background

  • Preliminary Theories:

    • Foundation Models: Large models trained on extensive datasets that can generalize to various tasks with minimal fine-tuning.
    • Prompt Engineering: A technique used to guide models to perform specific tasks by providing structured input prompts.
    • Image Segmentation: The process of partitioning an image into multiple segments to simplify its representation and make it more meaningful for analysis.
    • Zero-Shot Learning: The ability of a model to generalize to unseen tasks without additional training.
  • Prior Research:

    • CLIP (2021): Aligned images and text in a shared embedding space, enabling zero-shot generalization to novel visual concepts (a minimal prompting sketch follows this list).
    • DALL·E (2021): A model that generates images from textual descriptions, showcasing the potential of combining vision and language.
    • ViT (2020): The Vision Transformer, which applies the transformer architecture directly to sequences of image patches, driving major advances across computer vision tasks.
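
CLIP-style prompting is the pattern SA carries over to segmentation: a model trained once is steered to new tasks purely through its inputs. As a concrete illustration of zero-shot transfer via prompts, the sketch below classifies an image with the Hugging Face port of CLIP; the checkpoint name and image path are illustrative choices, not details from the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a public CLIP checkpoint (illustrative choice).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder path
# The text prompts define the label set at inference time; no fine-tuning.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores
probs = logits.softmax(dim=-1)  # zero-shot class probabilities over the prompts
```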

Methodology

  • Key Ideas:

    • Promptable Segmentation Task: A pre-training objective and general interface: given any prompt (points, a box, a rough mask, or free-form text), the model must return a valid segmentation mask, even when the prompt is ambiguous.
    • Segment Anything Model (SAM): A heavyweight image encoder computes an image embedding once per image; a lightweight prompt encoder and mask decoder then turn each prompt into masks in roughly 50 ms, enabling interactive use (see the inference sketch after this list).
    • Data Engine: A three-stage annotation pipeline (assisted-manual, semi-automatic, fully automatic) in which SAM helps annotate images and the new masks in turn retrain SAM, bootstrapping the SA-1B dataset (the fully automatic stage is also sketched below).
  • Experiments:

    • Evaluated SAM on 23 diverse segmentation datasets to assess zero-shot transfer capabilities.
    • Conducted human studies to compare SAM's output quality against existing interactive segmentation models.
    • Ran ablation studies to isolate the contribution of individual model and dataset design choices.
  • Implications: The design of SAM allows for real-time interactive segmentation and can adapt to various tasks through prompt engineering, enhancing usability in practical applications.
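
The encoder/decoder split described under Key Ideas is visible in the API of the public code release. Below is a minimal inference sketch, assuming the released segment_anything package and a downloaded ViT-H checkpoint; file paths and coordinates are placeholders.

```python
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

# Load a released SAM checkpoint (filename is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The heavy image encoder runs once per image...
image = np.array(Image.open("example.jpg").convert("RGB"))  # HxWx3 uint8
predictor.set_image(image)

# ...then the lightweight prompt encoder + mask decoder answer each prompt.
# A single foreground click is the simplest prompt.
masks, scores, low_res = predictor.predict(
    point_coords=np.array([[500, 375]]),  # (x, y) pixel coordinates
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,                # 3 candidates for ambiguous prompts
)

# Box prompts use the same call (XYXY pixel coordinates).
masks, scores, _ = predictor.predict(box=np.array([100, 100, 400, 400]))
```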
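The data engine's fully automatic stage (prompting with a regular grid of points, then filtering masks by predicted quality and stability) ships in the same package; a short sketch, reusing the sam object and image above:

```python
from segment_anything import SamAutomaticMaskGenerator

generator = SamAutomaticMaskGenerator(sam)  # defaults to a 32x32 point grid
records = generator.generate(image)  # dicts with 'segmentation', 'predicted_iou',
                                     # 'stability_score', 'bbox', ...
```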

Findings

  • Outcomes:

    • SAM demonstrated high-quality mask generation from a single point prompt, often matching or exceeding the performance of fully supervised models.
    • The model handled ambiguity effectively by predicting multiple candidate masks (scored by a predicted-IoU head) for a single prompt, improving robustness; a selection sketch follows this section.
    • The SA-1B dataset, with 1.1 billion masks on 11M images, was shown to significantly enhance model training and performance.
  • Significance: SAM's performance indicates a shift towards foundation models in image segmentation, providing a versatile tool for various applications without extensive retraining.

  • Future Work: Further exploration of SAM's capabilities in specific domains, refinement of prompt engineering techniques, and addressing identified biases in segmentation outputs.

  • Potential Impact: Advancements in segmentation models like SAM could lead to improved applications in fields such as autonomous driving, medical imaging, and augmented reality, fostering innovation in computer vision technologies.
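
The ambiguity handling noted under Outcomes is directly usable downstream: SAM returns its own IoU estimate for each candidate mask, so a caller can simply keep the best-scoring one. The snippet below continues the earlier inference sketch and adds the standard mask-IoU metric used in segmentation evaluation; ground_truth is an assumed boolean mask, not data from the paper.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0

# masks: (3, H, W) boolean candidates and scores: (3,) predicted IoUs,
# as returned by predictor.predict(..., multimask_output=True) above.
best = masks[scores.argmax()]       # resolve ambiguity via the predicted-IoU head
iou = mask_iou(best, ground_truth)  # ground_truth: boolean HxW mask (assumed)
```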

Meta

Published: 2023-04-05

Updated: 2025-08-27

URL: https://arxiv.org/abs/2304.02643v1

Authors: Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick

Citations: 2892

H Index: 263

Categories: cs.CV, cs.AI, cs.LG

Model: gpt-4o-mini