Takeaways:
We’re unveiling the next generation of the Segment Anything collection of models, advancing image and video understanding. Segment Anything Model 3 (SAM 3) introduces some of our most highly requested features, like text and exemplar prompts, enabling detection, segmentation, and tracking of any visual concept across images and video. We also want to make it easier for more people to use our models. As part of this release, we’re debuting the Segment Anything Playground, the simplest way for anyone to experiment with applying our state-of-the-art models to media modification.
Today, we’re releasing the SAM 3 model weights, a demo on Segment Anything Playground, and a research paper that details how we built SAM 3. Additionally, we’re sharing the Segment Anything with Concepts (SA-Co) evaluation dataset to serve as a new benchmark for the community. Separately, we’re sharing SAM 3D, which includes a model for object and scene reconstruction and another for human pose and shape estimation. More information about this release can be found in our SAM 3D blog post.
At Meta, we’re using these advancements to help build the next generation of creative media tools. SAM 3 and SAM 3D are being used to enable the new View in Room feature on Facebook Marketplace, helping people visualize the style and fit of home decor items, like a lamp or a table, in their spaces before purchasing. New creation experiences enabled by SAM 3 will be coming to Vibes on the Meta AI app and meta.ai on the web, where people can use AI visual creation tools and remix existing AI-generated videos. We’ll also soon be introducing new effects on our Edits app that use SAM 3. Creators can apply dynamic effects to people or objects in their videos — simplifying a complex editing workflow to just one tap.
Introducing Meta Segment Anything Model 3
Linking language to specific visual elements in images or videos is a major challenge in computer vision. Traditional models often focus on object segmentation with a fixed set of text labels, restricting their ability to address the full spectrum of user requests, which frequently involve segmenting concepts not present in predefined lists. This means that existing models can segment frequent concepts like “person,” but struggle with more nuanced concepts like “the striped red umbrella”.
SAM 3 overcomes these limitations by introducing the promptable concept segmentation capability: finding and segmenting all instances of a concept defined by a text or exemplar prompt. SAM 3 accepts text prompts — open-vocabulary short noun phrases — and image exemplar prompts, eliminating the constraints of fixed label sets. To assess large-vocabulary detection and segmentation performance, we created the Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation in images and videos that challenges models to recognize a much larger vocabulary of concepts compared to prior benchmarks. As part of this release, we’re making SA-Co publicly available to support reproducibility and further innovation in open-ended visual segmentation.
SAM 3 supports a variety of prompt modalities, including both concept prompts such as simple noun phrases and image exemplars, as well as visual prompts, such as masks, boxes, and points, which were introduced in SAM 1 and SAM 2. This increases the flexibility and usability of segmentation, particularly for concepts that are rare or hard to describe with text alone.
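To make these prompt types concrete, here is a minimal Python sketch of how concept and visual prompts might be represented and chosen. The class and function names are hypothetical placeholders for illustration, not the released SAM 3 interface.

```python
# Hypothetical sketch of the prompt types described above. The class and
# function names are illustrative placeholders, not the released SAM 3 API.
from dataclasses import dataclass, field

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class ConceptPrompt:
    """Concept prompt: a short noun phrase and/or image exemplars.
    Targets *all* instances of the concept in the image or video."""
    noun_phrase: str | None = None            # e.g. "striped red umbrella"
    exemplar_boxes: list[Box] = field(default_factory=list)


@dataclass
class VisualPrompt:
    """SAM 1/2-style visual prompt (points or a box) that selects and
    refines a single object instance."""
    points: list[tuple[float, float]] = field(default_factory=list)
    box: Box | None = None


def build_prompt(text=None, exemplars=None, points=None, box=None):
    """Choose the prompt type: concept prompts for open-vocabulary queries,
    visual prompts for interactive single-object segmentation."""
    if text is not None or exemplars:
        return ConceptPrompt(noun_phrase=text, exemplar_boxes=exemplars or [])
    return VisualPrompt(points=points or [], box=box)


# Example: an open-vocabulary concept prompt vs. an interactive click.
concept = build_prompt(text="striped red umbrella")
click = build_prompt(points=[(412.0, 230.0)])
```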
SAM 3 excels at segmenting objects described by short noun phrases, reflecting common user intent in interactive and natural settings. Our model can also be used as a perception tool for multimodal large language models to segment objects described by more complex prompts, such as: “people sitting down, but not holding a gift box in their hands.”
Overall, SAM 3 delivers a 2x gain over existing systems in both image and video on our promptable concept segmentation benchmark, SA-Co, and improves upon previous SAM capabilities in interactive visual segmentation tasks.
Building a Novel Data Engine Using AI and Human Annotators
Obtaining high-quality annotated images with segmentation masks and text labels across a broad range of categories and visual domains is a significant challenge. This type of data doesn’t exist at scale on the web. Exhaustively masking every occurrence of an object category — particularly in video — is a time-intensive and complex task for human annotators. Additionally, building comprehensive coverage for a large and diverse vocabulary across multiple visual domains requires considerable time and resources. Overall, the process is both time-consuming and expensive.
We address this challenge by creating a scalable data engine that leverages SAM 3, human annotators, and AI models in the loop, which allows dramatic speed-ups in annotation — approximately 5x faster than humans on negative prompts (concepts not present in the image/video) and 36% faster for positive prompts even in challenging fine-grained domains. This hybrid human and AI system enabled us to create a large and diverse training set with over 4 million unique concepts.

A pipeline of AI models, including SAM 3 and systems such as a Llama-based captioner, automatically mine images and videos, generate captions, parse the captions into text labels, and create initial segmentation masks, which are shown as “candidates” in the above figure.
Human and AI annotators then verify and correct these proposals, yielding a feedback loop that rapidly scales dataset coverage while continuously improving data quality. The AI annotators are based on Llama 3.2v models trained specifically to match or surpass human accuracy on annotation tasks, such as verifying whether a mask is high quality or whether all instances of a concept are exhaustively masked in an image.
By delegating some human annotation tasks to AI annotators, we more than double the throughput compared to a human-only annotation pipeline. AI annotators also automatically filter out easy examples, focusing valuable human annotation effort on the most challenging cases where the current version of SAM 3 fails. We also leverage a concept ontology — a dictionary of concepts and their relationships based on Wikipedia — to map text labels into a shared concept space and increase the coverage of less frequent concepts in the data.
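The loop below sketches this human-in-the-loop data engine as described above. The captioner, AI verifier, and routing logic are assumed interfaces for illustration, not the production pipeline.

```python
# Simplified sketch of the data engine loop described above. Function and
# attribute names are illustrative assumptions, not the production system.

def data_engine_pass(media_items, model, captioner, ai_verifier, human_queue):
    """One pass: propose candidate masks, let AI annotators verify the easy
    cases, and route only the hard failures to human annotators."""
    accepted = []
    for item in media_items:
        caption = captioner.describe(item)                  # Llama-based captioner
        phrases = captioner.parse_noun_phrases(caption)     # text labels
        for phrase in phrases:
            candidates = model.propose_masks(item, phrase)  # initial masks from SAM 3
            verdict = ai_verifier.check(item, phrase, candidates)
            if verdict.mask_quality_ok and verdict.exhaustive:
                accepted.append((item, phrase, candidates))     # AI-verified example
            else:
                human_queue.append((item, phrase, candidates))  # hardest cases only
    return accepted
```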
We validate this approach through ablation studies, demonstrating that integrating AI- and human-annotated labels results in measurable improvements in model performance. We further validate that an entirely automated data engine can be used to generate data to automatically expand coverage to new visual and text domains.
Model Architecture
Building a model that excels at promptable concept segmentation requires maintaining strong performance on all tasks relative to individual, task-specific models. This presents significant challenges in model design and in the development of a training recipe, due to potential task conflicts. For example, re-detecting and tracking instances requires visual features that distinguish each instance from other instances of the same concept, which conflicts with the concept detection task, where the features for all instances of a concept should be similar. Finding the right architecture is an important step toward solving all tasks in a unified model. Additionally, designing strong data recipes is essential to prevent issues like catastrophic forgetting as new tasks and data are introduced.
The SAM 3 model architecture also builds on many previous AI advancements from Meta. The text and image encoders in SAM 3 come from the Meta Perception Encoder, an open source model we shared in April that enables building more advanced computer vision systems for everyday tasks such as image recognition and object detection. Using the Meta Perception Encoder enabled a significant leap in performance compared to previous encoder choices. The detector component is based on the DETR model, which was the first to use transformers for object detection. The memory bank and memory encoder used in SAM 2 are the basis for the Tracker component. We also used several open source components, including datasets, benchmarks, and model improvements, to advance our work.
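At a high level, the components named above compose roughly as sketched below. This is a schematic based only on the description in this post; module names and call signatures are placeholders, not the actual implementation.

```python
# Schematic composition of the components named above: Perception Encoder
# backbones, a DETR-style detector, and a SAM 2-style tracker. Names and
# signatures are placeholders, not the released implementation.
import torch.nn as nn


class SAM3Sketch(nn.Module):
    def __init__(self, image_encoder, text_encoder, detector, tracker):
        super().__init__()
        self.image_encoder = image_encoder   # Meta Perception Encoder (vision)
        self.text_encoder = text_encoder     # Meta Perception Encoder (text)
        self.detector = detector             # DETR-based open-vocabulary detector
        self.tracker = tracker               # built on SAM 2's memory bank/encoder

    def detect(self, image, noun_phrase):
        """Promptable concept segmentation on a single image."""
        img_feats = self.image_encoder(image)
        txt_feats = self.text_encoder(noun_phrase)
        return self.detector(img_feats, txt_feats)    # masks, boxes, scores

    def track(self, frames, noun_phrase):
        """Detect per frame, then link detections into masklets over time."""
        masklets = []
        for frame in frames:
            detections = self.detect(frame, noun_phrase)
            masklets = self.tracker.update(frame, detections, masklets)
        return masklets
```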
Results
We achieve a step change in concept segmentation performance in images (measured on the SA-Co Gold subset) and videos (on SA-Co Video), with SAM 3 doubling cgF1 scores (a measure of how well the model can recognize and localize concepts) relative to existing models. SAM 3 consistently outperforms both foundation models like Gemini 2.5 Pro and strong specialist baselines such as GLEE, OWLv2, and LLMDet. In user studies, people prefer SAM 3 outputs over the strongest baseline, OWLv2, approximately three to one. We also achieve state-of-the-art results on the SAM 2 visual segmentation tasks (mask-to-masklet, point-to-mask), matching or exceeding previous models like SAM 2. Furthermore, we see notable gains on challenging benchmarks like zero-shot LVIS (not shown) and object counting (shown on CountBench).

This excellent performance comes with fast inference — SAM 3 runs in 30 milliseconds for a single image with more than 100 detected objects on an H200 GPU. In video, the inference latency scales with the number of objects, sustaining near real-time performance for approximately five concurrent objects.
We also show that a multimodal large language model (MLLM) that uses SAM 3 as a tool, called SAM 3 Agent, can segment more complex text queries such as, “What object in the picture is used for controlling and guiding a horse?” The MLLM proposes noun phrase queries to prompt SAM 3 and analyzes the returned masks, iterating until the masks are satisfactory. Without training on any referring expression segmentation or reasoning segmentation data, SAM 3 Agent surpasses prior work on challenging free-text segmentation benchmarks that require reasoning, such as ReasonSeg (shown above) and OmniLabel.
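The agent behavior described here amounts to a propose-segment-judge loop, sketched below. The MLLM and SAM 3 interfaces are assumed placeholders, not a published API.

```python
# Minimal sketch of the SAM 3 Agent loop described above: the MLLM proposes
# noun-phrase queries, SAM 3 returns masks, and the MLLM judges the result.
# Both interfaces (mllm, sam3) are assumed placeholders.

def sam3_agent(image, complex_query, mllm, sam3, max_rounds=5):
    masks = []
    history = []
    for _ in range(max_rounds):
        # 1. MLLM turns the free-form query (plus feedback) into a short noun phrase.
        phrase = mllm.propose_phrase(complex_query, history)
        # 2. SAM 3 segments every instance matching that phrase.
        masks = sam3.segment_concept(image, phrase)
        # 3. MLLM inspects the masks and either accepts or explains what is wrong.
        feedback = mllm.judge(image, complex_query, phrase, masks)
        if feedback.satisfactory:
            return masks
        history.append((phrase, feedback))
    return masks  # best effort after max_rounds
```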
Applications to Science
SAM 3 is already being applied for use cases in scientific fields. For example, Meta collaborated with Conservation X Labs and Osa Conservation to combine on-the-ground wildlife monitoring with SAM 3 to build an open dataset of research-ready, raw video footage. The publicly available SA-FARI dataset includes over 10,000 camera trap videos of more than 100 species, annotated with bounding boxes and segmentation masks for every animal in each frame. FathomNet is a unique research collaboration led by MBARI that is working to advance AI tools for ocean exploration. Segmentation masks and a new instance segmentation benchmark tailored for underwater imagery are now available to the marine research community via the FathomNet Database. SA-FARI and FathomNet can be used by the broader AI community to develop innovative new ways to discover, monitor, and conserve wildlife on land and in the ocean.
Future Areas of Exploration for the Open Source Community
While SAM 3 demonstrates strong performance for segmenting objects in images and short videos with simple text phrases, the model performance can be further improved, especially in challenging scenarios.

When applied to video, SAM 3 tracks every object with a SAM 2-style masklet, which means the cost of SAM 3 inference scales linearly with the number of objects being tracked. Each object is processed separately, utilizing only shared per-frame embeddings, without inter-object communication. Incorporating shared object-level contextual information could aid in improving efficiency and model performance in complex scenes with many visually similar objects.
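To illustrate why cost scales linearly, the sketch below separates the shared per-frame embedding (computed once) from masklet propagation (run once per tracked object). The function names are hypothetical stand-ins, not the actual tracker code.

```python
# Illustration of the per-object cost described above: frame embeddings are
# shared, but masklet propagation runs independently for each tracked object,
# so video inference scales linearly with the number of objects.
# `embed_frame` and `propagate_masklet` are hypothetical stand-ins.

def track_video(frames, masklets, embed_frame, propagate_masklet):
    for frame in frames:
        frame_embedding = embed_frame(frame)          # computed once, shared
        masklets = [
            propagate_masklet(m, frame_embedding)     # one pass per object,
            for m in masklets                         # no inter-object communication
        ]
    return masklets
```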
There’s plenty more work to be done to propel research in this field even further. We hope the AI community will join us by building with SAM 3, adopting the SA-Co benchmark, and leveraging these new resources to help push these capabilities further. Together, we can accelerate open science to build impactful new experiences and use cases that benefit people and society.
Explore SAM 3 on the Segment Anything Playground
We’re bringing all of this work together in the Segment Anything Playground, our new platform that enables anyone to try our latest models, no technical expertise needed. You can start from scratch by uploading an image or video, or jump right in with one of the available templates. These include practical options like pixelating faces, license plates, and screens, as well as fun video edits such as adding a spotlight effect, motion trails, or magnifying specific objects. The templates also help with annotating visual data and provide a way to stress test SAM 3. We’ve designed the Segment Anything Playground to be the simplest way to experiment with our models for media modification, and we can’t wait to see how people use it to enhance their creativity.
SAM 3 also performs well on first-person footage captured by wearable devices like Meta’s Aria Gen 2 research glasses. This enables robust segmentation and tracking of objects from a first-person perspective, handling the dynamic challenges of wearable-captured scenes. Select recordings from the Aria Gen 2 Pilot Dataset are now featured on the Segment Anything Playground. This integration demonstrates SAM 3’s value for research and applications in areas like machine perception, contextual AI, and robotics, where understanding the world from the human perspective is crucial.
We want to continue empowering creators, developers, and researchers to experiment, build, and push the boundaries of what’s possible with Meta Segment Anything Model 3. Looking ahead, we’re optimistic about the transformative potential of SAM 3 to unlock new use cases and create positive impact across diverse fields. As always, we welcome continued iteration and feedback from the community to help us evolve and advance the field together.