
Segment Anything! Meta's Amazing New AI


Introduction

Segmentation is the ability to take an image and identify the objects, people, or anything else of interest in it. It works by determining which image pixels belong to which object, and it's extremely useful for applications that need to understand a scene, like a self-driving car identifying other cars and pedestrians on the road.
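To make that concrete, here is a toy illustration (my own, not from Meta's paper) of what a segmentation mask actually is: a per-pixel boolean array the same size as the image.

```python
import numpy as np

# Toy 6x6 "image" where an object occupies some of the centre pixels.
# A segmentation mask is just a boolean array with the same height and
# width as the image: True where a pixel belongs to the object.
mask = np.zeros((6, 6), dtype=bool)
mask[2:4, 1:5] = True

print(mask.astype(int))                      # visualise the mask as 0s and 1s
print("object covers", mask.sum(), "pixels") # counting True pixels gives its area
```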

We also know that prompting is a new skill for communicating with AIs, but what about promptable segmentation? Promptable segmentation is a new task that was just introduced alongside an amazing new AI model by Meta: the Segment Anything Model (SAM). As the name says, SAM can segment anything by following a prompt. How cool is that? In one click, you can segment any object in any photo or video. It's the first foundation model for this task, trained to generate masks for almost any existing object. It's a bit like ChatGPT for segmenting images: a very general model trained on a huge variety of images, with a good understanding of almost every kind of object.

It also adapts to more complicated or unusual objects, like a very specific tool or machine. You can guide it to segment objects it has never encountered through prompts, without retraining the model, which is called zero-shot transfer (zero-shot as in it has never seen that object during training). SAM is super exciting for all segmentation-related tasks. It has incredible capabilities, it's open source, and it's very promising for the research community, myself included, with tons of applications. You've seen the results, and you can explore even more with the demo linked below if you'd like.

We've had a quick overview of what it is, but how does it work, and why is it so good? To answer the second question, we must go back to the root of all current AI systems: data. It's that good because Meta trained it on a new dataset they describe as the largest segmentation dataset ever released. Indeed, the dataset, called Segment Anything 1 Billion (SA-1B), was built specifically for this task and is composed of 1.1 billion high-quality segmentation masks from 11 million images, roughly 100 masks per image on average. That is approximately 400 times more masks than any existing segmentation dataset to date. It's enormous, the images are high resolution, and the masks are carefully curated, and that's the recipe for success: always more data with good curation.

Other than data, which every model needs anyway, let's see how the model works and how it brings prompting into segmentation, because the two are closely related. Indeed, the dataset was built using the model itself, iteratively: they used the model to annotate data, further trained the model on those annotations, and repeated the process. This is because you cannot simply find images on the internet that already come with masks around every object. Instead, they start by training the model with human annotators correcting its predicted masks, then repeat with less and less human involvement, keeping the remaining human effort primarily for the objects the model hasn't seen before.
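Here is a highly simplified sketch of that model-in-the-loop "data engine" idea. It is not Meta's actual pipeline; the helper functions below are placeholders that only show the shape of the loop (propose masks, have humans correct a shrinking share of them, retrain, repeat).

```python
def propose_masks(model, images):
    """Placeholder: run the current model to predict masks for a batch of images."""
    return [model(img) for img in images]

def human_correct(proposals, fraction):
    """Placeholder: pretend annotators review and fix a fraction of the proposals."""
    n = max(1, int(len(proposals) * fraction))
    return proposals[:n]

def retrain(model, labelled):
    """Placeholder: a real pipeline would fine-tune the model on the new labels."""
    return model

model = lambda img: {"image": img, "mask": "predicted"}   # stand-in model
images = [f"img_{i:03d}" for i in range(10)]              # stand-in image batch

human_fraction = 1.0
for stage in range(3):  # assisted-manual -> semi-automatic -> fully automatic
    proposals = propose_masks(model, images)
    labelled = human_correct(proposals, human_fraction)
    model = retrain(model, labelled)
    human_fraction /= 2   # less and less human involvement each round
```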

But where does prompting come in? It's used to tell the model what we want to segment from the image. As discussed in my recent podcast episode with Sander Schulhoff, founder of Learn Prompting, which I think you should listen to, a prompt can be anything. In this case, it's either text or spatial information like a rough box or just a point on the image: you basically either ask for what you want or show it. Then we use an image encoder, as with most segmentation models, plus a prompt encoder. The image encoder is similar to most I've already covered on the channel: it takes the image and extracts the most valuable information from it using a neural network.
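To give a sense of what those spatial prompts look like in practice, here is how they are typically passed to Meta's released segment-anything code: a point is just an (x, y) pixel coordinate with a foreground/background label, and a box is four coordinates. The exact values below are made up for illustration.

```python
import numpy as np

# A point prompt: an (x, y) pixel coordinate plus a label
# (1 = "this pixel is on the object", 0 = "this pixel is background").
point_coords = np.array([[320, 240]])
point_labels = np.array([1])

# A box prompt: (x_min, y_min, x_max, y_max) in pixel coordinates,
# roughly outlining the object we want segmented.
box = np.array([100, 150, 400, 380])
```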

Here, the novelty is the prompt encoder. Keeping this prompt encoder separate from the image encoder is what makes the approach so fast and responsive, since we can process the image once and then iterate over prompts to segment multiple objects, as you can see for yourself in their online demo. The image encoder is another Vision Transformer (ViT), which you can learn more about in my Vision Transformer video if you'd like. It produces our image embeddings, which are the extracted information. We then use this information along with our prompts to generate a segmentation.
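That "embed once, prompt many times" design shows up directly in Meta's open-source segment-anything package. Here is a minimal sketch, assuming you have installed the package (plus OpenCV for image loading) and downloaded one of the released checkpoints; the file names below are just the ones commonly distributed.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a released SAM checkpoint (ViT-H weights assumed here).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The expensive step: encode the image once with the ViT image encoder.
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Cheap, interactive steps: reuse the same embedding for different prompts.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # a single foreground click
    point_labels=np.array([1]),
    multimask_output=True,                # SAM proposes a few candidate masks
)

masks, scores, _ = predictor.predict(
    box=np.array([100, 150, 400, 380]),   # a rough box around another object
    multimask_output=False,
)
```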

But how can we combine our text and spatial prompts with this image embedding? Spatial prompts are represented with positional encodings, essentially feeding in the location information directly. For the text, it's simple: we use CLIP, as is often the case, a model able to encode text in a way comparable to how images are encoded. CLIP is great for this application since it was trained on tons of image-caption pairs to encode both similarly, so it acts as a bridge between text and images.
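As a rough illustration of what a positional encoding of a point prompt can look like, here is a simplified sketch in the spirit of random Fourier features (not SAM's exact implementation): the normalised (x, y) coordinate is projected onto random frequencies and passed through sines and cosines, producing a dense vector that still encodes where the click happened.

```python
import numpy as np

rng = np.random.default_rng(0)
num_freqs = 64
freqs = rng.normal(size=(2, num_freqs))   # random frequencies for (x, y)

def encode_point(x, y, img_w, img_h):
    """Map a pixel coordinate to a 128-dim positional embedding."""
    xy = np.array([x / img_w, y / img_h])            # normalise to [0, 1]
    proj = 2 * np.pi * xy @ freqs                    # shape: (num_freqs,)
    return np.concatenate([np.sin(proj), np.cos(proj)])

embedding = encode_point(320, 240, img_w=640, img_h=480)
print(embedding.shape)   # (128,)
```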

And finally, we need to produce a good segmentation from all that information. This is done with a decoder, which is, simply put, the reverse of the image encoder: it takes the condensed information and turns it back into an image-sized output. Here, though, we only need to produce masks that get laid back over the initial image, which is much easier than generating a completely new image the way DALL-E or Midjourney does. Those models rely on diffusion, but in this case, Meta went with a similar family of architecture as the image encoder: a Transformer-based decoder, and it works really well. And voila, that was a simple overview of how the new SAM model by Meta works.
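To see why predicting a mask is a much lighter job than generating a whole image, here is a toy overlay (stand-in data, not real SAM output): the decoder only has to decide, per pixel, whether it belongs to the object, and that mask is then blended back over the original photo.

```python
import numpy as np

h, w = 480, 640
image = np.random.randint(0, 255, size=(h, w, 3), dtype=np.uint8)  # stand-in photo
mask = np.zeros((h, w), dtype=bool)
mask[100:300, 200:450] = True                                      # stand-in predicted mask

# Blend a colour into the masked pixels so the segmentation is visible on the image.
highlight = np.array([30, 144, 255], dtype=np.uint8)
overlay = image.copy()
overlay[mask] = (0.5 * overlay[mask] + 0.5 * highlight).astype(np.uint8)
```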

Of course, it's not perfect and has limitations, like missing fine structures or sometimes hallucinating small disconnected components. Still, it's extremely powerful and a huge step forward, introducing an interesting new and highly applicable task. I invite you to read Meta's great blog post and paper to learn more about the model, or try it directly with their code or demo, all linked below. I hope you've enjoyed this overview, and I will see you next time with another amazing paper.

