RF-DETR: How Small Models Can Beat Big Ones in Real-Time Computer Vision

Brijesh Madhavan, PhD

Co-founder @Neuralcraft | @Curvelogics | @Data Science Academy | AI Accelerator

A must-read paper for the future of real-time computer vision. While most of the world is scaling Vision Language Models upward, the team behind RF-DETR shows something remarkable: 👉 small, NAS-optimized specialist models can beat heavyweight detectors, including YOLO variants, in real-time settings.

RF-DETR combines:
🔹 Recurrent Fusion for multi-scale features
🔹 A carefully designed DETR search space
🔹 Weight-sharing NAS to discover architectures that sit on a new accuracy-latency Pareto frontier (see the sketch after this post)

This is a powerful reminder that innovation is not only about bigger models. It is about better architectures. Brilliant work by the authors.

📄 RF-DETR: NAS for Real-Time Detection Transformers

#AI #MachineLearning #DeepLearning #ComputerVision #GenAI #NeuralArchitectureSearch #NAS #Transformers #DETR #YOLO #VisionAI #MLResearch #AITech #ModelOptimization #EdgeAI #RealTimeAI #AIEfficiency #TechInnovation #MLEngineering #DataScience Data Science Academy Pvt. Ltd. Curvelogics Advanced Technology Solutions Pvt Ltd
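For readers new to the idea, here is a minimal sketch (not taken from the paper) of how an accuracy-latency Pareto frontier can be extracted from a set of evaluated candidate architectures. Apart from RF-DETR N's reported 2.3 ms / 48.0 AP, the (latency, AP) tuples are hypothetical placeholders.

```python
# Minimal illustrative sketch: select the accuracy-vs-latency Pareto frontier
# from a set of evaluated candidate architectures. Not the paper's code.

def pareto_frontier(candidates):
    """Keep candidates that are not dominated: no other candidate is both
    faster (lower latency) and at least as accurate (higher AP)."""
    frontier = []
    for lat, ap in sorted(candidates):            # sort by latency, ascending
        if not frontier or ap > frontier[-1][1]:  # strictly better accuracy
            frontier.append((lat, ap))
        # otherwise: dominated by an equally fast or faster, more accurate model
    return frontier

if __name__ == "__main__":
    # Hypothetical (latency_ms, AP) pairs; only (2.3, 48.0) comes from the post.
    evaluated = [(2.3, 48.0), (3.1, 49.5), (2.8, 47.0), (4.0, 51.2), (3.6, 49.0)]
    print(pareto_frontier(evaluated))  # -> [(2.3, 48.0), (3.1, 49.5), (4.0, 51.2)]
```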

Piotr Skalski

Open Source Lead @ Roboflow | Computer Vision | Vision Language Models

RF-DETR paper is out! 🔥 🔥 🔥 TL;DR: RF-DETR is a real time detection transformer built on top of DINOv2 and weight sharing NAS. One training run explores thousands of architectures and produces a full accuracy latency curve for both detection and segmentation. - DINOv2 backbone: DINOv2 brings strong visual priors, improves results on small or unusual datasets, and provides a solid foundation for the NAS search space. - NAS over ~6000 configs: Training samples a new architecture every step. Resolution, patch size, decoder depth, queries, and window layout shift dynamically while all subnets share one set of weights. - Detection: RF-DETR N hits 48.0 AP at 2.3 ms, matching YOLOv8 M and YOLOv11 M at about 2x their speed. - Segmentation: RF-DETR-Seg N reaches 40.3 mask AP at 3.4 ms, outperforming the largest YOLOv8 and YOLOv11 models. ⮑ 🔗 paper: https://lnkd.in/dNgSV4FH Huge congratulations to Peter Robicheaux, Isaac Robinson, and Matvei Popov for making it happen! #computervision #opensource #paper #transformers
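To make the per-step sampling concrete, here is a minimal sketch of weight-sharing NAS sampling. The dimension names mirror the ones listed above, but the specific value ranges and the function names are hypothetical, not the paper's actual search space or code.

```python
import random

# Illustrative sketch of per-step subnet sampling in weight-sharing NAS.
# Each training step draws one architecture configuration; every sampled
# subnet reuses the same shared supernet weights, so a single training run
# effectively covers the whole search space.

SEARCH_SPACE = {
    "resolution":    [448, 512, 576, 640],          # hypothetical values
    "patch_size":    [14, 16],
    "decoder_depth": [2, 3, 4, 5, 6],
    "num_queries":   [100, 200, 300],
    "window_layout": ["global", "windowed", "mixed"],
}

def sample_subnet(rng: random.Random) -> dict:
    """Draw one architecture configuration for the current training step."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}

rng = random.Random(0)
for step in range(3):
    config = sample_subnet(rng)
    # In a real supernet, the forward/backward pass would run only the parts
    # of the shared network that this configuration activates.
    print(f"step {step}: {config}")
```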
