
A
Okay, we're here in the remote studio with the grand return of the Roboflow, Latent Space, and SAM combo. Welcome, Joseph, my sort of vision co-host, I guess.
B
Thanks. Great to be here.
A
Welcome back. We also have, welcome back, Nikhila Ravi, who's the lead on SAM 2, or I guess just SAM in general, right? And joining us we have Pengchuan, who's also a researcher on SAM.
C
Yeah. Nice to meet you guys.
A
So congrats on SAM3's launch. I mean, the demo, each time you step it up really amazingly. My general impression or takeaway when I tell people about SAM is that every time you have a new release, it's like once a year you show up, you drop a banger, and then you just drop the mic and go for next year. And you also add a dimension. So I was weirdly not surprised when SAM3 had the 3D thing, because I'm like, well, yeah, which is the next dimension to go? It's 3D.
D
Yeah. Actually, maybe just on that: I think that's a common misconception. We actually launched three separate models this time. It was SAM3, SAM 3D Objects, and SAM 3D Body.
B
Yes.
D
Those were two completely separate models, and SAM3 is just the image and video understanding model.
A
Which is on a DETR backbone and is sped up. Yeah, sorry, I didn't mean to preface all this, but just to remind our audience, or for people new to the SAM series of podcasts that we've done so far, maybe each of you can go around and introduce your entry into computer vision or your relationship with SAM. Go ahead, Nikhila.
D
Okay, cool. Hi, everyone. I'm Nikhila. I'm a researcher at Meta. I've been at Meta for eight and a half years, so I've really been through the evolution of the field. In that time, I've worked on a range of different problems in computer vision. I worked briefly on 3D, where we built this library called PyTorch3D, but I really started on Segment Anything as a project in around late 2021. So it's actually been almost four years that I've been working in this Segment Anything space. We started with SAM1 in 2023, SAM2 last year in July 2024, and now SAM3. So it's been the culmination of a lot of work by a lot of people over the years. So yeah, really, really excited to be at this point and get to share it with all of you. I'll hand it over to Pengchuan.
C
Yeah. Hello everyone. I'm Pengchuan, a researcher on the SAM team. I have been working in computer vision for nearly nine years, starting from 2017, which I think is a long time. I worked at MSR for five years and then moved to Meta Reality Labs to work on egocentric foundation models for AI glasses for a while. Then in 2023 I moved to the SAM team, and that was exactly when SAM3 started. I think working on SAM3 has been a once-in-a-lifetime experience, and I'm glad that SAM3 is out and that I achieved my original grand goal in computer vision: reaching human performance in detection, segmentation, and tracking in images and videos.
B
I'm Joseph, co-founder and CEO at Roboflow, where our mission is to make the world programmable. We think software should have the sense of sight, and models like SAM and others are critical to unlocking that capability. Now millions of developers and half the Fortune 100 build with Roboflow's tools and infrastructure to create and deploy models to production. We've been big believers in the Meta family of open source models all the way back to Mask R-CNN and Detectron2, all the way to the present with SAM1, SAM2 and SAM3. The work that the Meta team does to advance the state of the art in open source computer vision has been bedrock to enabling developers and enterprises globally to adopt AI. So we've been big fans of the work, and I'm pleased to be joining you today, swyx, to co-host the episode on SAM3.
A
And you guys shipped your own DETR model too?
B
Yeah, we've been doing some work to advance machine learning research too. For example, our work on DETR, detection transformers, was born out of NeurIPS last year. I think, swyx, you actually challenged us. You were like, hey, what are some of the advancements that are happening in computer vision and in visual AI? And we had this observation that transformers had surpassed a lot of CNNs in vision tasks, but they hadn't been made to run in real time, as in over 30 frames per second on a small edge device and hundreds of frames per second on something like a T4. We did some research and published RF-DETR, Roboflow Detection Transformer, which is, we kind of joke, the greatest-of-all-time model for doing real-time segmentation and object detection on the edge. Now, with RF-DETR you have to have a fixed class list and need to know the objects that you want to segment ahead of time. But for anyone that's running on constrained compute and on an edge device and wants an Apache 2.0 model to do that, RF-DETR and its family of models are key to fulfilling that mission and that goal.
A
Yeah, amazing. Okay, I think we are going to just go into a SAM3 demo. I think, Nikhila, you've prepped some stuff to show us, and this is great because obviously there's nothing better than the creator of the tool showing off the tool.
D
So just to start with, what is SAM3? SAM3 is a model that can detect, segment and track objects in images and videos using what we call concept prompts. I'm going to start with a simple image example and then we'll show you a video example. A concept can be anything that is a short text phrase. So here, for example, we can use something like watering can. And you can see the model predicts a mask for the watering can. You can also then refine the prompts using clicks or additional visual exemplars, which I'll show you in a different image. But essentially, the idea of a concept prompt opens up the ability to find all instances of an object category without having to manually click on every single instance, as you would have had to do if you were using SAM2 or SAM1. Now, if the model misses any of the instances, you can add visual exemplars. A visual exemplar is also a way to describe a concept to the model. So here I can add a positive box and show the model that this is also an instance of a flower that we want to detect. So this is just on images. But what's really cool is you can now also do this in video. And so here I'll show you an example. Maybe this is a football match and you want to track all the players in white, for example. So red jersey or white jersey, you can provide a concept prompt and the model will find the objects in the first frame and then track them and detect the new instances that appear later on in the video. So it's not just detecting on the first frame, but both tracking those detections and finding new instances that appear throughout the video. And one of the things we love to do in our demos is also show some real-world applications of this. One idea here is that you can use this for video editing or adding effects. So here was a really simple mask effect. But you can imagine, for example, you might want to add a trail around the players, so you can follow them around.
Maybe you want to clone them so you've got multiple players running around. You can also do background effects, for example, spotlighting players. And so these are just fun things you can do on top of the SAM3 outputs. And this is just like a way to show people like what you can do. There's also some templates which basically are pre populated with a text prompt and an effect. Um, and these are just some fun ways you can use the outputs. But really, you know, the crux of it is in this like create from scratch where you can upload any image or video and try SAM3 on that and we'll share the link so you can try it out as well.
B
One of the other demos that I have is like a busy, a busy scene for like doing labeling, which we can do later on. But just to give you a preview, it's like if you wanted to find tablecloth and maybe like back there, there's like airplane. So I'll do airplane and you kind of get the ability to start to.
A
You find the confidence thresholds.
B
They do. I don't know why tablecloth wasn't as good. I've used that one in the past. Table maybe.
C
Yeah. Cool.
A
Wow, look at that. Yeah, I think the other impressive thing that you guys emphasize in your launch is also the latency. I don't know where this particular inference is running, but it says something like SAM3 runs in 30 milliseconds on a single image with 100 detected objects on an H200. Obviously that's an H200, but it's also just impressively fast, and basically you can be real time if you want.
D
Yeah, definitely on images. On images it's really fast. And then on video it kind of scales with the number of objects, but for a limited number of objects it's still real time.
C
Yeah. I'd also add that even for video, if you can afford the GPUs and implement a very good parallel inference algorithm, then even if you have a lot of objects to track, you can still get real-time tracking performance as long as you scale up the GPUs.
A
So I'm reading in the paper it's 10 objects on two H100s, 28 on four H100s, and 64 on eight H200s, something like that. I don't think there's an architecture. I don't know if this is the parallelism demonstration that we're talking about.
C
Yeah. In fact, when you try the demo, the video uses the parallel implementation of video grounding, so it's already in that fast mode.
D
Yeah. You try it with a video with like lots of objects and then you can notice that it's actually not very slow and you get the sense that we are doing the multi GPU inference. Yeah. Everyone should try out and see if it.
C
So. Okay.
A
Amazing. So this, this thing about concept segmentation, I feel like you had a prototypical version of this and in your paper you really talk about like sort of generalizing it, I guess. Like what was the Planning like in SAM3? Like at the start of this? Did you, you know, is. Is what we have today exactly what you planned for or did you kind of. Did it emerge as you discovered capabilities?
D
Maybe I could quickly talk about that. Yeah, in SAM 1 we did have a proof of concept of text prompting, but that was just a very early exploration. It wasn't really built out, and it became the most highly requested feature since then. And so in SAM3 we really wanted to do it properly and actually do this in a way that works in all different scenarios. So we had to really think about how to formulate the problem. It could have been that we took open-ended text input and made it work for all open-ended text, or we could have been more focused, which is what we chose to do, and really focus on these atomic visual concepts like yellow school bus or purple umbrella, and on nailing the problem for these atomic visual concepts. But Pengchuan, maybe you want to talk a little bit about the benchmarks that existed previously and how we had to actually fully redefine the task and the benchmark that we wanted to solve. Yeah. And maybe just to add to Pengchuan's point, if you look at the size of these benchmarks, the previous benchmark Pengchuan mentioned, LVIS, that everyone uses, has about 1.2k unique concepts. And the benchmark that we created, which we're calling Segment Anything with Concepts, or SA-Co for short, has more than 200,000 unique concepts. If you think about the natural language that people use, we don't just use a thousand words. We have a very large vocabulary. And we really wanted to build a benchmark that can capture that diversity and size.
A
Yeah, it's really impressive and also very formulaic, I guess, or classic that every great model work starts with a lot of data work, I think basically scaled up version of the same process for SAM2.
D
Yeah. In some ways I think in SAM3 the data engine really was a very novel and critical component. I think, to your point, competitive advantage in AI is not just about the models, but really about the data. And maybe even more so, it's actually about the data engine that generates that data. And we put a lot of effort in SAM3 specifically into trying to automate that process a lot.
B
One of the things that we're really impressed by is the diversity and depth, as well as breadth, of uses that we see with models like SAM in production. When you think about computer vision, folks kind of always classically think about dogs and cats and simple sorts of things. And the reality is computer vision is where AI meets the real world. So for any sort of thing that needs to be seen and understood, you need to have an understanding of that thing. So a model like SAM expanding the concepts from, you know, a few thousand closed-form concepts max in a single model to tens of thousands of concepts means that you're going to see a huge acceleration in the number of fields and applications of the model. So this is SAM3, right? We've already seen and measured some of the impact of the SAM family of models, and we pulled some updated stats on how impactful SAM has been across the Roboflow community. I think Roboflow might maintain one of, if not the, largest hosted instances of SAM. And we've seen basically 106 million Smart Poly-created examples that are SAM 1, 2 or 3 powered. We estimate that that has saved humanity collectively a hundred, maybe 130 years of time just curating data, depending on exactly how you want to do the calculation. And each of those use cases, right, isn't dogs and cats on the Internet. It's things like, I don't know, we see medical labs across the world that are accelerating cancer research by doing things like counting and identifying neutrophils after a given experiment. Or we see folks that are using aerial imagery for things like helping a drone navigate through the world, or counting and seeing solar panels from above, or maybe even doing insurance estimates. We see folks that are building underwater trash-cleanup robots.
So you can imagine an autonomous underwater bot that's navigating through the Pacific Ocean, identifying and grabbing plastics, and cleaning up the world's ecosystem. Relatedly, we've seen some work with aquariums across the US, like MBARI, who are keeping track of species and identifying whether given steps that are taken are increasing the populations of given fish, with underwater fish cameras. We see folks in industrial settings doing work to produce electric vehicles or get products from point A to point B. At the time of recording this, it's near Christmas time, and it's high time for holidays for folks that are doing gift giving. And that ends up being really, really high time for making sure goods and services show up where they're supposed to be at the given point in time. One of the statistics that we track is the frequency with which folks cite works like SAM or Roboflow or blogs that we publish. And there are now a little over two research papers published every day citing some of the work across the Roboflow community. And that's folks that are publishing in Nature and ScienceDirect and a number of fairly prestigious journals. And you've got to think about it: each one of those publications is someone's seminal work, often 6, 12, 24 months of effort that's been accelerated by models like SAM. So it's not an exaggeration to say models like SAM are speeding up the rate at which we solve global hunger or find cures to cancer or make sure critical medical products make their way to people all across the planet. And at the infrastructure level, we're thrilled and surprised constantly by the breadth and depth of adoption that we see from the community. I mean, in the first five days of SAM3 there were like 8 million inferences from folks running it across all sorts of diverse fields.
And that's actually only increased, because it was released and then there was Thanksgiving, and now it's back and folks are hitting it pretty hard. So it's been incredibly encouraging to see both the depth of adoption and how much the community takes and uses and relies on models like SAM in prod.
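The time-saved estimate quoted in this turn (about 106 million SAM-assisted polygons and roughly 100 to 130 years saved) can be sanity-checked with simple arithmetic. The sketch below is only an illustration: the per-polygon savings of 30 and 39 seconds are assumed values chosen to reproduce the quoted range, not figures given in the conversation.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # 31,557,600 seconds in a Julian year


def years_saved(num_polygons: int, seconds_saved_each: float) -> float:
    """Total labeling time saved, expressed in years."""
    return num_polygons * seconds_saved_each / SECONDS_PER_YEAR


NUM_POLYGONS = 106_000_000  # count quoted in the conversation

# Assumed seconds saved per polygon (hypothetical, not from the source).
low = years_saved(NUM_POLYGONS, 30)
high = years_saved(NUM_POLYGONS, 39)
print(f"roughly {low:.0f} to {high:.0f} years")
```

Read the other way, the quoted range implies that each assisted polygon saves somewhere around half a minute of manual tracing.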
D
Yeah, and maybe just to add to that from the Meta side: we don't usually get as much visibility into all of these real-world use cases. So being able to hear that from Roboflow, and having these models available on the platform, is so valuable for us, because we also get to know how these models actually work in the real world, which is ultimately the best eval for a model. So I think it's definitely awesome to hear about all these things that we're empowering.
B
Nikhila, you had this comment that the best eval for a model is not necessarily benchmarks. What was it, like, if it works on real-world things? I think it's a really good sound bite.
D
Probably something like: the best eval is if it works in the real world.
A
Yeah, true.
D
And that's the ultimate goal for all of our models, SAM1, SAM2, SAM3. We want people to use them out of the box as much as possible. And I think with language in SAM3 specifically, there does need to be some domain adaptation in some cases. But we have tried to make that easy. Pengchuan, I know you want to talk a little bit about that, the fine-tuning aspect.
A
I wanted to also endorse the real-world thing. I was so happily surprised, when I was visiting the CZI Imaging Institute in preparation for our pod with Mark, that they were using SAM in imaging the human cell. They showed us how, in reality, all these masses are actually really undifferentiated, and it's really hard for the human eye to track. This is actually a simpler one; this is pretty clean here. In reality, a lot of it is just gray mush and you have to segment individual lysosomes out of it. And they showed us how they were using SAM and fine-tuning SAM to do it. Really complicated, and also very meaningful for basic science research. And maybe I'll also mention this in the paper: the data distribution, where you can actually see what SA-Co covers. A lot of animals, and then very surprisingly few maps. I'm like, maybe there should be more maps. I'll say Hugging Face has been doing a lot here, and other companies.
D
Yeah. This is actually something we get asked a lot: what's the minimum amount of data I need to fine-tune? And being able to do that with just around 10 data points will hopefully unlock a lot more than we can do ourselves.
A
Yeah, I mean, the more the merrier. Obviously this is where ablations are really helpful. You probably didn't have any fine-tuning ablations in here. I think this is all data and model training oriented. But yeah, very, very clear. I just have a cheeky, curious point: what is the ratio of negative examples to positive examples?
C
Right.
A
So in Nikhila's example, when you were demoing just now, you only selected positive examples. Obviously there are going to be a lot more negative examples of "not class" than positive examples. So there should be some exchange ratio where a negative example contributes less than a positive example. Or is that not the case for.
B
Positive and negative examples? I don't know that I have seen a golden ratio that works well or doesn't work well. But I can offer anecdotally that a single negative example goes a long way. A common place where fine-tuning is really helpful is data that's out of distribution but might plausibly have been in distribution. One of my favorite fine-tuning examples is counting Waymos. There's not that much data that has Waymos labeled throughout the streets of San Francisco. But SAM does a really good job of identifying a Waymo as a vehicle. If you prompt with Waymo, it doesn't find anything. If you prompt with vehicle, it labels a Waymo as a vehicle, which is valid. But a Waymo is a specific type of vehicle, right? Usually, from even just a 10-second video clip, you can actually start to have SAM3 learn what should have been seen as a Waymo versus what should have been seen as a vehicle. And even on a single image example we see that SAM3 starts to adapt, because it takes the text and image prompt into account when it makes a subsequent inference. From three to five negative examples alongside positive examples, you start to see the model update its priors, if you will, for where it would predict things based on what the user provided. All this comes with caveats, right? Because when you talk about the visual world, the negative examples and the positive examples could be from a very different perspective or of a very different type of object. Maybe you're labeling dog breeds and suddenly a new dog breed appears. Or maybe you have a perspective where it's overhead and then suddenly you have a side-by-side view. So usually the best way is to have these things meet the real-world data and try. But I'll offer the note that a small number of negative examples goes a really long way. Small like three to five, not like hundreds.
D
Yeah. The other place where negatives play a big role is just: is it in the image or not? One of the things that we did was really separate the problem into a recognition problem and a localization problem. So first, can you answer the question, is this object or this concept in the image? And then, if it's in the image, where is it in the image? To really build in that capability we had to annotate a lot of negative phrases in images, so basically a lot of phrases that don't exist in the image, in addition to the concepts that exist in the image with the corresponding mask pair. If you look at one of the tables in the paper which shows the training dataset distribution, I think it's table 24, more than 70% of the annotations are these negative phrases that are not present in the image. So we have to really train the model to not detect stuff that is not in the image.
A
Yeah, I think that the separation of recognition and localization, it's basically precision and recall, right? But in the vision domain.
D
We basically add this presence token to the model, which explicitly separates the tasks of recognition and localization. It simplifies the task, so the model doesn't have to try to do everything with just the proposals in the detector. It's able to have this global learned token just for the recognition part.
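The split described here, one global "is the concept in the image at all?" signal plus per-proposal localization, can be illustrated with a small scoring sketch. This is a minimal illustration under stated assumptions, not SAM3's implementation: the `Proposal` type, the `score_proposals` helper, the threshold, and the choice to multiply the two probabilities are all hypothetical and introduced only for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


@dataclass
class Proposal:
    box: Box
    match_score: float  # assumed: P(proposal matches the concept | concept is present)


def score_proposals(
    presence_prob: float, proposals: List[Proposal], threshold: float = 0.5
) -> List[Tuple[Box, float]]:
    """Combine a global presence probability with per-proposal scores.

    Assumed combination rule for illustration:
    combined = P(concept present) * P(proposal matches | present).
    """
    kept = []
    for p in proposals:
        combined = presence_prob * p.match_score
        if combined >= threshold:
            kept.append((p.box, combined))
    return kept


proposals = [Proposal((10, 10, 50, 50), 0.9), Proposal((60, 20, 90, 70), 0.6)]

# Concept clearly present: both proposals survive the threshold.
present = score_proposals(0.95, proposals)

# Negative phrase (concept absent): a low presence probability
# suppresses every proposal, however confident each one looks locally.
absent = score_proposals(0.05, proposals)
```

The point of the sketch is that a single global signal can veto all detections for a phrase that is not in the image, so individual proposals do not each have to learn that decision.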
A
Yeah. In general I find that you guys did a lot of extra net new work. You had a really nice chart in here about the yellow boxes being the new stuff. I forget where.
D
Yeah, the architecture diagram.
A
Yeah, I'm like, holy crap. Last time it was like, you know, there was like the memory stuff. This is Sam 2 and here there's all this. Obviously it's hard to cover it all, but I wonder if there's any other interesting stories or tricks like the presence token that you might want to focus on.
D
Yeah, this diagram is nice. I'm glad you brought it up, because SAM3 isn't just a version bump, it's an entirely new approach to doing segmentation. It's this new interface for segmentation, and it combines so many different tasks. Previously you would have needed a task-specific model for each of these tasks: interactive segmentation, text prompting, open-vocabulary detection, tracking. For all of these tasks you would have needed a separate model, and so we really had to do a lot of work to bring it together. I think one of the things we did was really decouple the detection component and the tracking component. You can see we still preserve the tracking components from SAM 2, but the detector is separate. And the reason we do this is, if you think about what a detector has to do and what the tracker has to do, the detector needs to be identity agnostic. So if you have a concept dog, it needs to be able to find all instances of dog, and it needs to have a representation of dog that is the same for all dogs. But when you're tracking those dogs through the video, each dog needs to have a separate representation such that we're able to preserve the identities. And so there is this kind of task conflict that emerges between the detector and the tracker. We experimented a lot. We really tried to build a unified approach to do things. But what we found was that having the separate detector and tracker really worked. We do use the Perception Encoder as a shared visual backbone. It's a text- and image-aligned encoder. You can see the green boxes there where it says "from PE". That's Perception Encoder, which was also from our group. It was released earlier this year, in April. And so this really is bringing together components from the entire FAIR and Meta ecosystem. We have Perception Encoder, we have a detector, we use SAM2.
We also use Llama in our data engine. So we really like using all the components from.
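The detector/tracker split described in this turn can be pictured with a generic tracking-by-detection sketch: a detector returns identity-agnostic boxes for a concept on each frame, and a separate tracker is the only component that assigns and preserves per-object identities. This is not SAM3's mechanism (the conversation says its tracker reuses SAM 2 components); the greedy IoU matching, the `IdentityTracker` class, and the `min_iou` parameter are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class IdentityTracker:
    """Keeps one state entry per object, so cost grows with object count."""

    def __init__(self, min_iou: float = 0.3) -> None:
        self.tracks: Dict[int, Box] = {}
        self._next_id = 0
        self.min_iou = min_iou

    def update(self, detections: List[Box]) -> Dict[int, Box]:
        """Match identity-agnostic detections to existing identities."""
        unmatched = list(detections)
        for track_id, prev_box in list(self.tracks.items()):
            if not unmatched:
                break
            best = max(unmatched, key=lambda d: iou(prev_box, d))
            if iou(prev_box, best) >= self.min_iou:
                self.tracks[track_id] = best  # same object, new position
                unmatched.remove(best)
        # Detections that match no existing track are new instances.
        for det in unmatched:
            self.tracks[self._next_id] = det
            self._next_id += 1
        return dict(self.tracks)


tracker = IdentityTracker()
# Frame 1: the detector reports two "dog" boxes with no identities attached.
first = tracker.update([(0, 0, 10, 10), (50, 50, 60, 60)])
# Frame 2: both dogs moved slightly and a third dog entered the scene.
second = tracker.update([(51, 50, 61, 60), (1, 0, 11, 10), (100, 100, 110, 110)])
```

For brevity the sketch never retires lost tracks. It does show why the two components want different representations: the detector treats every dog alike, while the tracker must tell them apart.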
A
Yeah, it's like any third film in a trilogy. You always see the previous recurring characters come back.
D
Yeah, well, if it works, you've got to continue using it.
B
And to connect to something we discussed earlier: you mentioned that in the video component each object needs to be tracked independently. That's why the compute scales linearly with the number of classes.
C
Right.
B
Cause each of those instance types needs to be maintained.
D
It scales with the number of detected objects.
B
Yeah. So for example, each dog that appears in the video, each one of those needs to be tracked independently. There was something else that you started to allude to in the paper that I was hoping we would spend some time discussing, and it's the interaction of SAM3 and LLMs, Llama and others. So using SAM3 almost as a tool call for LLMs, to give them better grounding and better visual understanding. There's a table in the paper where you describe the increase in performance. It's alluding, I think, to where things are going for using SAM3 as a component of multimodal architectures. Do you want to describe a bit about what the introduction of that work was meant to showcase and how the interaction of SAM3 and LLMs is envisioned to be important?
D
Yeah, maybe I can just do a quick intro and I'll hand it over to Pengchuan to do the deep dive. Essentially, as I mentioned, in SAM3 we constrain the text input to these atomic visual concepts like yellow school bus or yellow watering can. But obviously people want to interact with the model with natural language, and we want to enable that as well. And so that really segues into being able to use SAM3 as a visual agent for an MLLM. So I'll hand over to Pengchuan. Maybe you can explain the SAM3 Agent setup and then talk through some of the results that we got there.
C
Yeah, yeah. So as Nikhila mentioned, the big picture is that SAM3 is focused on these atomic concepts. But people definitely want to try much more complex phrases like, okay, locate the bigger character for me. Or, for example, this kind of example: what is the feature that distinguishes male and female in this picture? These are more complex language. This is exactly what SAM3 cannot do, but what the SAM3 Agent targets. In this case, you can see that it needs much more advanced language understanding and reasoning. SAM3 currently does not have this capability because of its small language encoder. But we know that large language models were definitely trained on a lot of this data and have this kind of world knowledge and reasoning capability. The SAM3 Agent is exactly using SAM3 as a tool for the large language model to solve these complex visual grounding tasks.
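The agent setup described in this turn, a language model handling the reasoning and a concept segmenter acting as its tool, can be sketched as a small loop. Everything below is hypothetical scaffolding for illustration: `run_agent`, the `fake_*` stand-ins, the phrase list, and the mask areas are invented here and do not reflect the actual SAM3 Agent prompts or API.

```python
from typing import Callable, List, Optional, Tuple

Mask = Tuple[str, int]  # (mask name, pixel area), a stand-in for a real mask


def run_agent(
    query: str,
    propose: Callable[[str, List[str]], Optional[str]],
    segment: Callable[[str], List[Mask]],
    verify: Callable[[str, str, List[Mask]], List[Mask]],
    max_rounds: int = 3,
) -> Optional[Tuple[str, List[Mask]]]:
    """Try simple noun phrases until the verifier accepts some masks."""
    tried: List[str] = []
    for _ in range(max_rounds):
        phrase = propose(query, tried)  # language model picks an atomic phrase
        if phrase is None:
            break
        masks = segment(phrase)  # tool call to the segmenter
        selected = verify(query, phrase, masks)  # language model checks results
        if selected:
            return phrase, selected
        tried.append(phrase)
    return None


# Hypothetical stand-ins so the sketch runs without any model.
def fake_propose(query: str, tried: List[str]) -> Optional[str]:
    for candidate in ["biggest character", "character"]:
        if candidate not in tried:
            return candidate
    return None


def fake_segment(phrase: str) -> List[Mask]:
    # The stand-in segmenter only understands the atomic phrase "character".
    scene = {"character": [("mask_a", 120), ("mask_b", 480)]}
    return scene.get(phrase, [])


def fake_verify(query: str, phrase: str, masks: List[Mask]) -> List[Mask]:
    # "Reasoning" step: the query asks for the bigger one, so keep the largest.
    return [max(masks, key=lambda m: m[1])] if masks else []


result = run_agent(
    "locate the bigger character", fake_propose, fake_segment, fake_verify
)
print(result)
```

In this toy run the first, more complex phrase returns nothing, so the loop falls back to an atomic phrase and lets the verification step apply the "bigger" constraint. That division of labor is the one the conversation attributes to the agent setup.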
A
Are there any insights or surprises that you have, other than, I guess, that SAM3 is a very good tool? Is that the main conclusion?
B
If you go to go to Table 8 in the paper as you. As you describe this, if you don't mind.
A
Yeah, table eight.
C
Okay.
B
Yeah, yeah, There we go.
C
Yeah. Maybe to quickly reply to swyx's question: I would say that first, besides SAM3 being a really good tool that provides the eye for the large language model, the other thing we definitely found is that SAM3 is not perfect. It's not as robust as the human eye. The language model also helps to correct SAM3's errors. They have a synergy with each other, instead of just, okay, the large language model provides the brain and SAM3 provides the eye.
A
Interestingly, you use Llama 4. I saw there's a mix of Llama 3 and Llama 4 here. But it looks like it does best with Gemini 2.5, which makes sense given this comparable set of MLLMs. I think the baseline question is also just: what extra does this add on top of just the MLLM? I would maybe want to see that comparison. Maybe you've already done it somewhere.
C
What do you mean by additional thing?
A
So basically, without the tool call, there's some native capability inside the MLLM itself.
C
Wow, that's a really good question. In fact, our reviewers even asked that question. You can imagine that without the large language model, with SAM3 only, on ReasonSeg it only achieves, on the validation set, if I remember correctly, about 30. And it's very intuitive. You can see that ReasonSeg has different subsets, short and long. The short one is very close to SAM3's training data: atomic phrases, short phrases. The long one is this very complex reasoning. You will see that for short, SAM3 only is very close to the SAM3 Agent. But for long the gap is so large, which indicates that that is exactly the capability that the large language model brings in. Got it.
B
I can show an example here that might be insightful too.
A
Go for it.
B
So even comparing SAM3 and Gemini, let's say that we just want to have them do an object detection task. Here we're going to prompt with a speedometer and RPMs, and we're going to ask for things like indicator light, number and needle. We run SAM3 head to head with Gemini 3, and with Florence 2 almost as a baseline of where things have been, and we see each of the results. First things first, you'll note that the speed of inference of SAM3 is quite quick. This is just calling the Gemini 3 Pro API, so whatever is provided from hosted compute is sort of what you get on the response time. And then the second thing you'll note, in addition to speed, is some of the accuracy of results. We might have a timeout error.
C
Let's see.
A
Do you have ELO scores?
B
I have what scores?
A
ELO scores.
B
Like elo?
A
Yeah, you had the arena. Okay. I was wondering what the ELO was, because you said you were blind testing this.
B
Yeah, that's actually interesting, because we had blind tested SAM3 before it was released. Not as SAM3, just for people to try and compare. I think we called that something like a potential SEG or SEGPEDIA. And we allowed users to vote, and they kind of unanimously voted for what they didn't know at the time was SAM3. We actually got emails from people being like, hey, where can I use that? And we just sort of ignored them until the model came out. But so here with the responses, you see that the grounding capabilities of SAM3 compared to even Gemini are out ahead currently. So not only is it doing grounding, but if you look closely, you can actually see it's making segmentation masks too, which Gemini 3 struggles to do. It just does detection by comparison. And then the other thing is just the richness of detections: the recall is high as well as the precision. And if we compare Gemini here, it does almost as well, right? But you see that it misses some of the numbers and has some of these erroneous boxes that it's predicted. And then it also doesn't do segmentation, so it just does the detection part of the task. So you can envision, the same way the SAM3 paper introduces the idea of using SAM3 in tandem with MLLMs, I would expect that to be the case pretty soon, and maybe the Google team taking some notes to improve Gemini and other series of models based on what SAM3 demonstrates here. So in other words, not only is it faster, but it seems to be more comprehensive for concept segmentation.
D
And I think the speed actually is a huge factor for many use cases. I think even at Meta, we're using SAM3 for various different product use cases, and fast inference speed is very critical to enable that. So I think in many cases you don't even need an MLLM. It's kind of overkill to use an MLLM for some applications.
B
The other interesting thing is the Florence 2 results. And you know, Florence 2 is a little bit older of a model now, so maybe it's not fair to put it head to head with the state of the art. But it is useful as a way to just see how far we've come. Because Florence 2 by comparison labels the entire region as a single class, without individual detections of numbers and indicator lights and needle. And not only that, but it actually takes about three times as long as SAM3. So SAM3 again is faster, doing a task that the other models are not doing in segmentation, and more accurate both in recall and precision of the things that it's intended to find, which I think really showcases the capabilities of the model.
C
In fact, I even got a little surprise from this, because this domain is more OCR-like; recognizing numbers is nearly OCR. We do not prioritize this domain in data collection. We know that it roughly works. But I'm surprised that it works so well.
B
That's encouraging. Even a task that wasn't expressly prioritized, it still does a great job on.
C
Yeah. In fact, during our data engine, we intentionally do not sample OCR heavy images. Wow.
B
On an easier one, glass mug: SAM3, Gemini 3, Florence 2. SAM3 loaded first and, really impressively, it sees even this glass mug in the corner, which I think is something SAM3 does a great job of: occlusion and partial objects. Gemini 3 struggles a bit with this one, I think maybe because of the opacity of the objects by comparison. And then Florence 2 does a good job at finding one of the glass mugs. So again, another type of task that shows the power and veracity of the model.
D
Yeah, I mean exhaustivity, like finding every instance, is something we heavily prioritized and is really built into the data engine design. Maybe, Peng Chuan, you want to talk about how we designed the data engine to really scale exhaustivity? Because, you know, if a human was to sit and annotate every single instance, it would take a really long time to annotate and verify. But we put a lot of effort into trying to automate and speed up that process such that we could get to the data scale and diversity needed to get to a step change.
C
Yeah, yeah. I would definitely say the data engine is the critical component that lets us achieve the SAM3 performance we have now. So maybe we can go to the data engine picture. I think we have an illustration there.
D
Yeah. Page five.
C
Yeah. Here you can see that this is our annotation pipeline. So we first source the images and generate the noun phrases. This is the input of the task: we source images and generate noun phrases, for example from a Llama-generated caption, and we parse the caption to get the noun phrases. This is the input distribution. Then we use the SAM3 model in the loop to generate candidate masks. They should be the candidates, but they're not perfect, especially in the beginning. Then the next step is verification. So SAM3 gives you these masks. Then we need to first do mask verification, to verify each mask, whether it's good or not. After that we can filter out all the bad masks. There are some good masks left, and we verify whether these good masks are exhaustive or not. Like your mug example: if a bad model does not predict that partial mug, then the exhaustivity check will fail there. If the exhaustivity check fails, then we go to the next step. You can see in the pipeline that we go to this so-called human manual correction, where a human manually annotates all the missing masks to make this data point exhaustive. So you can see that exhaustivity is a very big factor there, and we put it at the center of this data engine. But you can see that if we ask a human annotator to annotate every mask from scratch, it will take a lot of time. I remember each data point in the beginning took more than two minutes to finish. But if you use the model in the loop, it's reduced to about 45 seconds: you use the model to propose masks, and then the human just takes a few moments to annotate the missing masks, and then it's 45 seconds.
Another very key innovation in this data engine is that we really found that these verification steps, to verify whether a mask is good or not, or to verify whether the good masks are exhaustive or not, can be done by AI, can be done by a multimodal model. That is a breakthrough. We can fine-tune, for example, Llama 3.2 with our human-annotated verification data, and we get superhuman performance on these two verification tasks. And then we do not need humans on these two tasks. This further brings our per-data-point annotation time to about 25 seconds. So you can see that from the original all-human process at about two minutes, to finally 25 seconds for one data point, this is the journey of our data engine to make it super efficient.
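The flow described above (propose masks, verify each mask, check exhaustivity, fall back to a human only when the check fails) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's implementation: `DataPoint`, `annotate`, and the four callables are hypothetical names, masks are stand-in strings, and the timing table simply restates the rough per-data-point figures quoted in the conversation (with "more than two minutes" taken as 120 seconds).

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DataPoint:
    image_id: str
    noun_phrase: str
    masks: List[str] = field(default_factory=list)  # stand-in for real masks
    needs_human: bool = False  # kept as metadata: was a human needed?


def annotate(
    point: DataPoint,
    propose: Callable[[str, str], List[str]],       # model in the loop (hypothetical)
    mask_ok: Callable[[DataPoint, str], bool],      # mask verifier (human or AI)
    is_exhaustive: Callable[[DataPoint], bool],     # exhaustivity verifier
    human_fix: Callable[[DataPoint], List[str]],    # manual correction step
) -> DataPoint:
    """Run one data point through the propose/verify/correct loop."""
    candidates = propose(point.image_id, point.noun_phrase)
    # Step 1: mask verification filters out the bad candidate masks.
    point.masks = [m for m in candidates if mask_ok(point, m)]
    # Step 2: exhaustivity check; only failures go to a human.
    if not is_exhaustive(point):
        point.needs_human = True
        point.masks = human_fix(point)
    return point


# Rough per-data-point annotation times quoted in the conversation, in seconds.
# 120 is an assumption standing in for "more than two minutes".
SECONDS_PER_POINT = {"all_human": 120, "model_in_loop": 45, "ai_verifiers": 25}


def speedup(stage: str) -> float:
    """Speedup of a stage relative to the all-human baseline."""
    return SECONDS_PER_POINT["all_human"] / SECONDS_PER_POINT[stage]
```

The `needs_human` flag mirrors the point made in the next answer: recording which data points required human correction gives a per-example difficulty label for free.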
B
Did you maintain statistics on how many images were specifically hard? For example, we had N many objects that were very difficult or occluded, or we had some number of images where the comprehensiveness test was really hard. Or did you just bet that by having a large scale you would encompass occlusion and exhaustive cases?
C
In fact, we do maintain this information about exhaustivity: which one is hard, which one is easy. First, in our data engine, when humans annotate, we know exactly which data points were made exhaustive by the model and which needed a human to intervene. In fact, we have that metadata in our dataset. The second one, the more beautiful part, is that we have this exhaustivity AI annotator. Then, given a new data point, we can automatically decide whether this is a difficult data point or an easy data point with this AI annotator.
A
Yeah, I think that the sort of bootstrapping and annotation story was very strong last time around and it's even stronger this time. What are you going to do when you run out of humans, like next year? You're going to have superhuman level of everything, right? Like PCS and PVS. What then?
C
I'm not so optimistic about this. First, indeed, our current plan for the next project is this fully automated data engine without humans. That's our dream. I think that is the perfect thing. But still we need some useful information. There's no free lunch. There's something no model can do well, and we need a human to inject that useful information. I would say that what we can practically do is really minimal human intervention: humans only do the tasks that the model cannot do, the most difficult tasks. So that's the first one, the data engine. The second one is about human performance on this PCS task. My feeling is that when we get to human performance, we will enter the RLHF domain of computer vision. You can see that with language models, before, in the BERT age, the language models were not at human performance, and SFT, really imitation learning, did its job to get to very good performance. But if you only do SFT and the SFT data is annotated by humans, then your performance is bounded by humans. You cannot get superhuman performance just by this data engine approach of using human-annotated data. And they found that you need to go to this RLHF domain, where the human really just tells which of two data points is better. This is exactly the philosophy: to tell which one is better is easier than to construct the data point from scratch, so you can get better performance than from humans doing the job from scratch. I would say that I hope that after SAM3 we can see new research emerge in computer vision on how we go beyond human performance. SAM3 is close to that.
But I would say that a new learning paradigm is needed to go beyond human performance for SAM3 tasks and for computer vision.
D
Yeah. Just to add to that, Peng Chuan is only talking about images. I think video is a whole nother challenging beast. Getting to that fully automated data engine is something that we tried to do in SAM2, and we actually didn't get to that fully automated approach. In SAM1 we did: the SA-1B dataset that we released was fully annotated automatically. We didn't really get to that in SAM2 for video or in SAM3 for video. I think there's still a lot of room to push on this sort of pseudo-labeling for video and really be able to get to the same step change we had on images.
B
What are the biggest changes to see the same step change in video that you've seen in images for automated data pipeline?
C
Yeah, yeah. I would say learning a good video large multimodal model. When we did SAM3, earlier this year or last year, you can see that image large multimodal models were very good. But video large multimodal models, I think, really became good or practical later this year, like Qwen 3. This kind of model gets roughly okay at that stage. So now we have a good base model to fine-tune on our data to get human performance for this recognition or verification task. I would say that we definitely need the SAM3 effort on the perception side, but we also need this multimodal large language model effort, a good foundation model on the vision-language side. I think it's ready. It's ready now.
D
Yeah. I would say video annotation is just so much more time intensive when you want to annotate enough data to train a verifier. Video mask annotation, we just found, was very time intensive. So maybe there are more efficient video annotation strategies. I think there's a lot of exploration that can be done there too.
C
Yeah.
A
Spending a bit of time on video. I wanted to also talk about, obviously last time we were focused a lot on memory attention. I think this time there was this sort of masklet thing that I wanted to just get more ideas of, or just to share the idea generally. What was it called?
D
The masklet detection score.
A
Masklet detection score. Exactly. And how it's basically smoothing within a temporal window, which I think a lot of computer vision models don't have, and they could simply add it and it'll be a lot more stable when it comes to video. And I don't know why they don't do it.
C
Maybe I can comment first on why they didn't do that. I think one big reason is this streaming requirement. When you want to gather information across the entire masklet, you need to wait until the masklet ends and then get the score. So that will sacrifice some streaming capability. The streaming requirement somehow limits traditional methods from doing this. But I would say that this is definitely beneficial. The reason is that I think even humans do this. You can imagine that when something just appears at the corner of the video, like a hand appears at the corner of the window, you just do not know whether it's a man or a woman. So you might even make mistakes, and SAM3 will also make this mistake. But when you get more and more information, when the person really enters the video fully, then you get to know whether this is a man or a woman. So gathering more information to really know whether this concept is the concept you queried is the idea here. So there is a trade-off between latency and accuracy. If you care more about accuracy, then you can use the overall information across the masklet to get a more robust signal about the concept. But if you care about latency, then you need to make a decision at the very beginning, and then you will sacrifice some accuracy.
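The latency versus accuracy trade-off described here can be made concrete with a toy example. The sketch below is not SAM3's masklet detection score; it is a simplified illustration that assumes each frame of a tracked object already has a concept-match score in [0, 1]. All function names and the 0.5 threshold are hypothetical.

```python
from typing import List


def streaming_decisions(scores: List[float], threshold: float = 0.5) -> List[bool]:
    """Decide on every frame the moment it arrives (lowest latency)."""
    return [s >= threshold for s in scores]


def windowed_decisions(
    scores: List[float], window: int = 3, threshold: float = 0.5
) -> List[bool]:
    """Decide per frame from the mean of the last `window` scores.

    Still causal (no future frames needed), but each decision is smoothed
    over recent evidence, so it reacts more slowly.
    """
    decisions = []
    for i in range(len(scores)):
        recent = scores[max(0, i - window + 1): i + 1]
        decisions.append(sum(recent) / len(recent) >= threshold)
    return decisions


def masklet_decision(scores: List[float], threshold: float = 0.5) -> bool:
    """One decision for the whole track from its average score.

    Most robust, but it can only be computed once the track has ended,
    which is what conflicts with a streaming requirement.
    """
    return sum(scores) / len(scores) >= threshold


# A hand enters at the corner of the frame: early scores are ambiguous,
# later scores are confident once the whole person is visible.
example_scores = [0.2, 0.3, 0.7, 0.8, 0.9]
per_frame = streaming_decisions(example_scores)
smoothed = windowed_decisions(example_scores, window=3)
whole_track = masklet_decision(example_scores)
```

In this toy track the per-frame rule rejects the first two frames, while the whole-track average accepts the object everywhere, at the cost of waiting for the track to finish.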
D
I think also in many video use cases, like the ones, Joseph, you were showing on Roboflow, users care more about detecting the objects rather than having unique identities. So in some cases it isn't required to preserve the identities throughout the video, and you just want to essentially do detection per frame. Like the Roboflow Rapid examples you were sharing.
B
Yeah, there's cases where being able to count and you know the objects are all going to be the same. So you don't care as much about unique classes. You just want to know the full presence. Things like that matter. But then there's other cases like you mentioned where I don't know, like in sport you care about individual players versus just knowing that there's 11 players on the pitch. One thing that might be useful actually to discuss with some of our time is we talked a little bit about how SAM3 and MLLMs will play nicely together. But there's probably like a greater discussion about how SAM3 fits into the broader AI ecosystem and like what bigger picture trends it might fit into. Do you have some thoughts on what this represents about where things are headed?
D
Yeah, maybe I could say one point and then, Peng Chuan, feel free to add one. You know, as we mentioned before, SAM3 isn't just a version bump. We really have a unified model that can do many different tasks in the same unified architecture. And so, in the same way that LLMs can do many different tasks without needing a task-specific model, with SAM3 we're able to do image promptable concept segmentation and video promptable concept segmentation. We don't need a specialist model for counting. We can do interactivity. These really are multi-capability visual models that are on par with or better than the single-task state-of-the-art models. So that's really one place in which SAM3 fits into the AI ecosystem. In terms of MLLMs, I don't know if, Peng Chuan, you want to talk about the agent approach?
C
Yeah, yeah, definitely. I would say that SAM3 now really gets a big step change in vision. How it helps, how it fits into the general AGI or frontier model landscape is very, very exciting for me. We always have this example: give this picture of a hand with six fingers held up, ask how many fingers there are in this picture, and the frontier models say five. And you can imagine that with SAM3 we can just first detect how many fingers we have, very robustly: six fingers. And then the multimodal model should know that, okay, this is a six-finger hand instead of a five-finger hand. You can see that the errors made by frontier models can be solved if we use SAM3 as a tool. But then, is running SAM3 as a tool the end of the picture, or should SAM3 somehow be naturally embedded into these frontier models, so the frontier models have the SAM3 capability by themselves? I would say that there are a lot of possibilities there. My picture is that now we have a very good brain with these frontier models and we have a very good eye with SAM3. Now let's see whether the eye is really working natively together with the brain, or the eye is really a different organ that needs to work with the brain somehow like a tool. I think this is a very exciting research area.
B
And so in your analogy, if you think about the visual cortex compared to the human brain, you know, we have rods and cones in our eyes that do very fast, we joke, lizard-brain-level detection, simple stuff. And then you have your brain that reasons about some of the visual information that your eyes see. In your example of SAM3 as a tool call or SAM3 as natively a part of the multimodal models, which future do you think is more likely?
C
I think at least I want to bet on the future where they work natively together. I would say for simple or even intermediate-difficulty vision tasks, for example counting with fewer than 20 objects, this is like system one visual reasoning with our brain. Our brain should do it by itself. But with very, very difficult tasks, say if we are counting maybe thousands of objects in a picture that is so crowded, then we even need to draw something there. I would say that at that time maybe we need some extra model for the difficult tasks. You can see that this is a hybrid approach. But I'm more excited about native; I think most of the cases should be native. The reason is that perception or grounding, really knowing where it is and how many there are, is like a fundamental capability of our brain. I'm just not happy that the frontier model cannot count how many fingers immediately and instead needs to call a tool to do that. I think this should be a system one thing, and this should be natively in our brain. And also, if our brain cannot do this task, it means that it's definitely missing some very critical visual capability by itself. So the intuition just feels that it's not correct to not have this capability by itself.
B
So for very simple system one questions, things like how many fingers on a hand, that should be native, but for maybe more complex things that are maybe long-running tasks and long-running reasoning, then maybe there's a bit more of a tool call approach.
C
Yeah, yeah, exactly. For example, in our SAM3 Agent or in our AI annotator, we already demonstrate this approach. For simple cases, the model can do it by itself: okay, I can detect, for example, 10 people here. And then the large language model, even the AI annotator, can know that, okay, these 10 people are not exhaustive, there are more people there. So if you want to do well, then maybe you need to do more steps, for example to call an expert model. So you can see that this is a very, very native reasoning process for more advanced or complicated vision questions.
B
I have a related but maybe slightly different question. SAM3 is an incredibly powerful piece of work, and it's open source, as a part of, now, MSL. Is open source critical to achieving AGI?
D
Maybe I can comment on SAM specifically. In SAM3 we did leverage many of the open source contributions people have made on top of SAM2: there were new datasets, new benchmarks, new inference-time optimizations. We adopted a lot of the things that the community built on top of the models and on top of the datasets. And so all those contributions helped make SAM3. For the SAM series, we've really benefited a lot from being very generous with what we open source and then leveraging what the community builds on top of that. But that's just from the SAM perspective.
A
I think it's clear what the community brings and offers. And I think, you know, every time we do this we always shout to the community to try it on their use cases and report weird findings. And if it doesn't do what you are trying to make it do, well, let's talk about it and maybe implement it in the next version, like you already said. Peng Chuan, you already hinted at what might be coming for SAM4, which is at least a little bit more of the document and OCR work. Any other directions that are interesting? I guess obviously a lot more video work as well. What is the talk of the town in the CV community, where people say, oh, it'd be really great, or super obvious? Next year is going to be the year of what?
C
Yeah, maybe I can first say something and then Nikhila can add. First, I think it's not even SAM4, it's SAM3-point-something: small models. SAM3 currently only has really one model, one model size. So more efficient models that fit edge use cases, and also a more efficient model for video. I think currently the video model is not efficient. You can achieve very good throughput, but you need GPUs to do that. So first, small and efficient models, that's one big thing. The second big thing is definitely video.
A
Roboflow can do that for you.
C
Yeah, the second thing is video. I would say that video still has a big gap from human performance right now. There's still a lot of research that needs to be done there on how to do end-to-end training with video. We have this decoupled approach, but we do not train this model end to end, and we expect it will definitely benefit from end-to-end training. And also on the video side, how to scale up the data engine: we definitely need AI annotators for video. We tried that. But yeah, I think that's something definitely worthwhile to do. The third one, which we also discussed, is how SAM3, how perception, fits into AGI, this big landscape. Now we have the eye. How does the eye work with the brain to solve real reasoning tasks? Not only output segmentation, but really answer how many kids are here, or, I have an example from biology labs: the robot needs to decide whether the liquid in the test tube is at the correct level or not. You can see that this involves perception but also involves reasoning. How to solve these more visual reasoning tasks with SAM is a very big direction.
D
On the robotics topic, it was exciting to hear from several friends that work at, you know, different robotics companies on how they're immediately starting to use SAM3. Especially for the video use case, I think robotics is probably one of the domains where improving video performance will have a lot of impact. And so, yeah, that's definitely an area that we could improve on further. But to Peng Chuan's point, I think there's still another step change to be achieved on video PCS.
A
So yeah, just a quick comment on the robotics things. We're interviewing a bunch of robotics folks here, as well as Fei-Fei Li, who obviously started ImageNet. A lot of people are betting on explicit world models, and SAM is not, for better or worse. And I wonder when that crossover might happen. That's an open question. If you guys want to take any world models discussions re where things are going.
B
Based on community questions, similar to how Nikhila mentioned, after SAM1 the almost obvious thing that people wanted was open concept prompting, because people are like, great, this model can see things, but I want to tell it what I want it to see. And now with the introduction of SAM3, you have this stepwise component, which feels like a key component of, you know, the ChatGPT era for vision arriving. As a result, what's going to happen is now you've provided people with an open text box and media, and so you're going to get all sorts of queries from people that maybe the model isn't primed to perform particularly well on yet. For example, earlier we were talking about document understanding and document reasoning being a place where there are known improvements to be made. And so you'll have people that will probably prompt to try to OCR things, or you'll have people that want to do work with spatial reasoning, like give me the object to the left of this other object, or give me a sense of where things are in relation to one another, which is critical for robotics like we're discussing, because that's how you navigate throughout the real world. I think people will also want action recognition and vision-language-action models, VLAs, the same kind of thing, where people are used to providing open text prompts and getting: here's the part of the scene where the player kicked the ball or the tennis player made the serve. Those are interesting for the purposes of how to understand and synthesize visual inputs. And so now that you've given this open text box for media, there's going to be a flood of the types of things users are going to want to try to do, some of which SAM is already going to be really well adapted to do, some of which not. And I think the types of things that are obvious are going to reveal themselves.
One of the things that we wanted to discuss was where to use SAM and how to discover how to build with SAM. So in addition to the Meta team building a tremendous playground for being able to interact with images and video and apply effects, with a video emphasis, I think one of the things that we're pretty excited about with SAM3 is how much it positively impacts each part of building a system for visual understanding. So for example, the very first step, historically aggregating and collecting a dataset because you think that there's not a model that understands the slice of the world that you want to understand, is where automating away lots of labeling can exist. Basically, if you collected a bunch of data of something that is already in SAM3's knowledge, then you can prompt SAM3 to automatically label all that data for you. And so we've actually made a bet on SAM3 being a core part of auto label at Roboflow, giving users a first pass of saying, hey, if you have a new image or a new video, start by providing just a text prompt and allow SAM3 to find and automatically label those regions of interest for you. Downstream, I think there are areas for fine-tuning. Like, you know, within a week of releasing SAM3, MedSAM3 came out for adapting SAM into medical contexts. And I think that's a harbinger of what's to come. There'll be lots of domain-specific adaptations of SAM in places where maybe there's a specific ontology that someone wants to understand, or maybe there's a place where the model just doesn't have great awareness yet. And I think we're already beginning to see that with hundreds of fine-tunes that users are creating for various domains. And then the last area is, okay, I've got my model, now I want to use it. And so one of the things that we are really proud of is to be ready on launch day to showcase the infrastructure we've built to burst and scale infinitely large.
As folks have models that they want to deploy and make readily available, having an endpoint that serves either a fine-tuned model or a model as is, or even a model that might be able to run on edge hardware as smaller models come out or maybe distillation comes to rise, is I think also an awesome place where we're seeing SAM3 being impactful: each part of the computer vision lifecycle and pipeline.
D
That's awesome. Yeah, I think especially the impact on speeding up annotation. I think we've seen that consistently on Roboflow. And I'm really curious to see how the introduction of SAM3 really helps speed up that process even further. I mean, just from playing around with it, it's so much faster than having to manually annotate every single object. So yeah, really curious to see how that improves the experience.
B
One of the things that we were pretty excited about is we were able to build an entirely new product in the world of SAM3. We called it Rapid, and basically the idea is there's probably a model that already understands the objects in the world that you want to see. So here I'm screen sharing an example: these are vehicles next to our office in San Francisco that go by, and you can see here's a Waymo and here's other vehicles. If I just have this 10-second clip, and let's say the first thing I want to do is just count cars and get a sense of each of the vehicles, what's really awesome is I can just, of course, text prompt and say I want vehicle. And as I toggle through different frames in my video, SAM3 already recognizes and understands those objects. Now one thing that I think is really interesting: there was a conversation earlier about how much you want to rely on a model versus a human's judgment of the model's output for what you care about. So for example, let's pretend in this scene maybe the only cars that we care about are the ones that are before the crosswalk and maybe not far in the distance. Then you would get people that would say, hey, you know what, I actually want the objects that are most confident, and I would move my slider down to getting a fewer number of objects. Whereas maybe others might say, hey, I want every single presence of a potential object in the scene, which even gets reflections of objects on the building. As computer vision approaches this world where we increasingly have models that can understand and improve themselves, and we rely on what human preference for the models' output is, we're going to get these funny scenarios where it isn't immediately determined what a human cares about. And I think that's where tooling fills a big gap.
But it also is going to be a place where it'll be really interesting to see where users kind of start to use and apply the models and why you need. So there's this last mile work to put the model in context in the domain that someone is trying to solve and tackle.
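The slider described here can be read as a confidence threshold over the model's detections. The following is only a minimal sketch of that idea in Python. The `Detection` class, the scores, and the threshold values are hypothetical and are not taken from SAM3 or from Roboflow's product.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    score: float  # hypothetical model confidence in [0, 1]
    note: str     # human-readable description, for illustration only


def filter_by_confidence(detections, threshold):
    """Keep only detections whose score is at or above the threshold."""
    return [d for d in detections if d.score >= threshold]


# Made-up detections for the street scene discussed above.
detections = [
    Detection("vehicle", 0.95, "car before the crosswalk"),
    Detection("vehicle", 0.80, "car far in the distance"),
    Detection("vehicle", 0.30, "reflection of a car on a building"),
]

strict = filter_by_confidence(detections, 0.9)      # fewer, most confident objects
permissive = filter_by_confidence(detections, 0.2)  # every potential object

print([d.note for d in strict])
print([d.note for d in permissive])
```

With a strict threshold only the nearby car survives, while the permissive one also keeps the reflection, which is the trade-off each user has to settle for their own task.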
A
So, since you're here, this is one of those things where I'm not sure the concept of labeling concepts can scale, only because I don't know if this slider between less and more is the way, if ultimately I need to tell you whether or not to include reflections. Right? Because sometimes the reflections are great, that's exactly what I want. Most of the time it's not going to be what I want. I don't know if some RLHF thing is going to solve any of that, because you just need more prompting. Just saying "vehicle" is not going to do it. Yeah, I don't know. Feel free to disagree.
C
You can imagine such a pipeline coming. For example, as swyx said, maybe the reflection is exactly what I want. Then you need some iterations with the interface or the model to finally get what you need. So you need to specify the concept more clearly through multiple iterations. Can the human not be involved in this iteration, and the models just do it automatically? I think that's definitely going to come. I would say it's quite interesting: you can imagine this workflow where I want reflections, and with the default threshold the model gives an output. Then we ask another very strong perception model, like Gemini, whether there are reflections there, and it says yes. Then we can automatically move the threshold lower and ask Gemini again to see whether the reflection is included or not. So conceptually this process should be doable completely with AI.
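The loop described in this turn (run with a default threshold, ask a second model whether the result matches the intent, lower the threshold, ask again) can be sketched as below. This is an illustration under stated assumptions, not the team's pipeline. `tune_threshold`, the detection tuples, and the `has_reflection` check are all hypothetical, and the check is a plain Python function standing in for a call to a strong vision-language model.

```python
from typing import Callable, List, Optional, Tuple

Detection = Tuple[str, float]  # (description, confidence score), both made up


def tune_threshold(
    detections: List[Detection],
    intent_satisfied: Callable[[List[Detection]], bool],
    start: float = 0.5,
    step: float = 0.1,
    floor: float = 0.0,
) -> Optional[float]:
    """Lower the threshold step by step until the verifier is satisfied.

    Returns the first threshold that satisfies the verifier, or None if
    none does before reaching the floor.
    """
    threshold = start
    while threshold >= floor:
        kept = [d for d in detections if d[1] >= threshold]
        if intent_satisfied(kept):
            return threshold
        # Round to avoid floating-point drift across repeated subtraction.
        threshold = round(threshold - step, 10)
    return None


# Hypothetical stand-in for asking a perception model "are reflections included?".
def has_reflection(kept: List[Detection]) -> bool:
    return any("reflection" in description for description, _ in kept)


scene = [("car", 0.9), ("car", 0.7), ("reflection of car", 0.25)]
chosen = tune_threshold(scene, has_reflection)
print(chosen)
```

In a real system the verifier would be an actual model call and could be wrong itself, so a cap on iterations (here, the floor) keeps the loop from running indefinitely.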
A
I see.
C
Yeah. Yeah, exactly.
A
So for now the answer is "imagine", and we can sort of tie it closer. I think Joseph is showing us the Waymo annotation. Yeah, it's nice. Now you have a Waymo model.
B
Yeah, I was just doing an example where maybe we want to find an object that's not already represented in the training data. I think prompting could solve the problem of reflections, because maybe you could say "vehicles on the street". But to your point, you would have to see that that's a failure case.
C
Right.
B
Like if I was just setting up a camera and saying "count cars", I wouldn't anticipate that reflections could be a problem. And so I think this is why, in some ways, you need a human in the loop. Because identifying human intention, not necessarily human knowledge, is what's going to be important for a lot of last-mile use. But yeah, I'm pretty excited about it.
C
Maybe I want to echo what Joseph said. This is also my experience: different people have quite different definitions of even a visual concept. For example, in some datasets, even for "hand", some people annotate just the palm as the hand, and some people also include the arm as the hand. When we first tested on some very customized datasets, we found the performance was not that good. And when we finally looked into it, we found that the user just has a different definition or interpretation of the concept, and both interpretations are okay. In this case, you can see that you really need a human in the loop to do few-shot fine-tuning or to adapt to the user's definition of this concept.
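To see why two valid definitions of "hand" can make a model look worse than it is, here is a small worked example using intersection over union (IoU), a standard overlap score for segmentation masks. The masks are toy rectangles invented for illustration and are not drawn from any dataset mentioned in the conversation.

```python
def box_mask(top, left, bottom, right):
    """Return the set of (row, col) pixels inside a half-open rectangle."""
    return {(r, c) for r in range(top, bottom) for c in range(left, right)}


def iou(a, b):
    """Intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Two annotators label the same image under different definitions of "hand".
palm_only = box_mask(0, 0, 4, 4)      # 16 pixels: just the palm
palm_and_arm = box_mask(0, 0, 8, 4)   # 32 pixels: palm plus forearm

# Suppose the model follows the palm-only definition exactly.
prediction = palm_only

iou_vs_palm = iou(prediction, palm_only)
iou_vs_arm = iou(prediction, palm_and_arm)
print(iou_vs_palm, iou_vs_arm)
```

The same prediction scores 1.0 against one annotator and 0.5 against the other, so the measured drop reflects a difference in definitions rather than a model error.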
B
That's exactly right. It's not always deterministic what someone really wants. Which is why I think, even if you have a fully comprehensive, omniscient model, putting the model into the context of what the user's trying to do is where a lot of tooling and infrastructure becomes really, really helpful. Anyway, I found our Waymos.
A
You continue to build excellent tooling for vision, and I think the world is very grateful for that. Let's get to calls to action. I think we've given it a good overview, and people obviously should read the paper and try out the playground, and try out Roboflow if they're interested in diving deeper. What are the calls to action from each of you?
D
I mean, try the demo, try the code. We've got a lot of resources in the GitHub repo and, you know, it's a.
A
Very well managed launch by the way, like kudos. I don't know. This probably takes a lot of effort just on the launch itself even after the model's done.
D
Yeah. And actually, just on that, a shout out to the whole team. I think SAM3 was our biggest and most ambitious project to date. And it really took a huge team of scientists, engineers, interns, and software engineers across the company. So a really huge shout out to the entire team that made not just the model successful but also the demo and the launch and everything. It was a huge team effort. We would definitely love to hear from people on what you're using the models for and where they're failing. Raise GitHub issues. Message us on Twitter. We'd love to hear from you on where we should go next as well.
C
Yeah, and on top of that, definitely try out our benchmark as well, the SA-Co benchmark. I would say it's likely that the benchmark will last longer than our SAM3 model. Maybe next year there will be a stronger model. But the benchmark is the one that I hope will guide the community to get better and better models. We measured human performance on the benchmark. I think maybe we are the first to do that for this kind of segmentation and video grounding task. It's very difficult to measure human performance on this task. Hopefully this benchmark guides the community to achieve human performance on this task and even surpass it.
B
We set out to be one of the best places, if not the best place, to build with SAM3 and the SAM family of models. So we're eager to see what people build with SAM and computer vision models to move the whole field forward. We have infrastructure for everything from deploying SAM3 zero-shot to making your own fine-tunes to automating labeling of data with SAM. And with each subsequent release we continue to see the impact expand: more use cases, more usage, and a faster time to value. So excited to see what folks can build on Roboflow with SAM.
A
Thank you all so much. This is a really great company. It's great work and just obviously always expands my mind as to what is possible with machine learning. Yeah, I mean, you know, we're not at SI yet or AGI yet, but every day we're getting closer.
D
Awesome. Thank you so much.
A
Thank you.
C
Thank you.
Episode Title: SAM 3: The Eyes for AI — Nikhila & Pengchuan (Meta Superintelligence), ft. Joseph Nelson (Roboflow)
Date: December 18, 2025
Host: Latent Space ("A", also called swyx)
Guests: Nikhila Ravi ("D") and Pengchuan ("C") from Meta; Joseph Nelson ("B"), CEO of Roboflow
This episode delves into the launch of SAM3 (Segment Anything Model 3) by Meta – a significant leap in computer vision enabling open-ended, text-prompted object segmentation, detection, and tracking in images and videos. Guests from Meta discuss the research journey, technical breakthroughs, and emerging applications. Joseph Nelson, CEO of Roboflow, adds the industry and developer perspective, highlighting the real-world impact and deployment of SAM3. The conversation touches on model architecture, fine-tuning, open vocabulary segmentation, data engines, community impact, integration with LLMs, and projections for the future of computer vision.
“SAM3 isn’t just a version bump, it’s an entirely new approach to segmentation. ... Where previously you needed a task-specific model for each task, you now have a single model for all.”
— Nikhila Ravi (24:39)
“Every time you have a new release...you just drop the mic and go for next year. And you also add a dimension.”
— Host/A (00:29)
“SAM3 is a model that can detect, segment, and track objects in images and videos using what we call concept prompts...Now if the model misses any of the instances, you can add visual exemplars.”
— Nikhila Ravi/D (05:39)
“A single negative example goes a long way.”
— Joseph Nelson/B (20:30)
“Competitive advantage in AI is not just about the models, but really about the data. And maybe even more so is actually the data engine to generate that data.”
— Nikhila/D (13:05)
“The best eval is if it works in the real world.”
— Nikhila/D (17:58)
“Counting fingers should be system one...if a brain can’t do this natively, it’s missing a critical visual capability by itself.”
— Pengchuan/C (54:11)
“We hope the benchmark will last longer than our SAM3 model. Next year there will be a stronger model, but the benchmark can guide the community.”
— Pengchuan/C (73:17)
Latent Space continues to showcase the collaborative march of open source and research advances powering the next generation of AI systems—SAM3 exemplifies unified, open, scalable vision for AI engineers.
For show notes, links, and more resources:
latent.space