A NEW AI Model out for Text to 3D?! MVDream Explained...

I'm super excited to share this new AI model with you. We've seen so many new approaches to generating text, then generating images, and they are only getting better. After that, we've seen other amazing initial works for generating videos and even 3D models out of text. Just imagine the complexity of such a task when all you have is a sentence and you need to generate something that could look like a real object in our real world with all its details. Well, here's a new model that is not merely an initial step; it's a huge step forward in 3D model generation from just text: MVDream.

As you can see, it seems like MVDream is able to understand physics compared to previous approaches. Instead of hallucinating extra features, it knows that the views should stay realistic, with only two ears and not four, no matter the viewpoint. It ends up creating a very high-quality 3D model out of just this simple line of text. How cool is this? But what's even cooler is how it works. So let's dive into it. But before doing so, let me introduce a super cool company sponsoring this video with another application of artificial intelligence: voice synthesis.

Introducing Kits.AI

Kits.AI: a platform for artists, producers, and fans to create AI voice models with ease and even create monetizable work with licensed AI voice models of your favorite artists. Kits.AI offers a library of licensed artist voices, a royalty-free library, and a community library with voice models of characters and celebrities created by the users. You can even train your own voice with one click. Simply provide audio files of the voice you want to replicate, and Kits.AI will create an AI voice model for you to use with no back-end knowledge required.

Generate a voice conversion by providing an a cappella file, recording audio manually, or even inputting a YouTube link for easy vocal separation. It's pretty cool and really easy to do. Get started with Kits.AI using the first link in the description right now.

Now, let's get back to the 3D world. The biggest challenge with a 3D model is that you need to generate realistic, high-quality images for every viewpoint from which you might look at it, and those views have to be spatially coherent with each other, not like the four-eared Yoda we just saw, or the multi-faced subjects we often see, since image datasets rarely show people from the back, so the model tends to put a face everywhere. This happens because one of the main approaches to generating 3D models is to simulate a camera at a given view angle and then generate what it should be seeing from that viewpoint, repeating this for all possible views around the object. This is called 2D lifting, since we generate regular 2D images and combine them into a full 3D scene.
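To make the "2D lifting" idea concrete, here is a minimal sketch (my own illustration, not the authors' code) of placing cameras around an object so that a 2D generator can be asked for one view at a time; the `text_to_image_model` call is hypothetical.

```python
import numpy as np

def camera_position(azimuth_deg, elevation_deg, radius=2.5):
    """Place a camera on a sphere around the object (object at the origin)."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    x = radius * np.cos(el) * np.cos(az)
    y = radius * np.cos(el) * np.sin(az)
    z = radius * np.sin(el)
    return np.array([x, y, z])

# "2D lifting": ask a 2D image generator for one rendering per viewpoint,
# then fuse the generated images into a 3D representation afterwards.
views = []
for azimuth in range(0, 360, 45):                    # 8 viewpoints around the object
    cam = camera_position(azimuth, elevation_deg=15.0)
    # image = text_to_image_model(prompt, camera=cam)  # hypothetical call
    views.append(cam)

print(np.round(np.stack(views), 2))
```

The key point is that each viewpoint is generated independently, which is exactly why the views can end up inconsistent with each other.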

That is why we are used to seeing weird artifacts like these: the model is just trying to generate one view at a time and doesn't understand the overall object well enough in 3D space. Well, MVDream made a huge step in this direction. They tackled what we call the 3D consistency problem and even claim to have solved it, building on a technique called score distillation sampling, introduced by DreamFusion, another text-to-3D method published in late 2022, which I covered on the channel.

MVDream's Architecture

Before getting into the score distillation sampling technique, we need to look at the architecture they use. In short, it's yet another 2D image diffusion model, like DALL-E, Midjourney, or Stable Diffusion. More specifically, they started from a pre-trained DreamBooth model, a powerful open-source model based on Stable Diffusion for generating images. The change they made is to render a set of multi-view images directly, instead of only one image, thanks to training on a 3D dataset of various objects. They take multiple views of each 3D object in the dataset and use them to train the model to generate those views back. This is done by swapping the self-attention block, shown here in blue, for a 3D one, which simply means adding a dimension so that the model reconstructs multiple images at a time instead of one.
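As a rough sketch of what "adding a dimension" to self-attention means, the idea is to let the tokens of all views attend to each other jointly instead of one image at a time. This is my own simplified illustration, not the released code, and the sizes are placeholders.

```python
import torch
import torch.nn as nn

class MultiViewSelfAttention(nn.Module):
    """Self-attention over the tokens of ALL views jointly, so each view can
    look at the others and stay 3D-consistent (illustrative sketch)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, views, tokens, dim), e.g. 4 views of a 16x16 latent
        b, v, n, d = x.shape
        x = x.reshape(b, v * n, d)      # fold the view axis into the token axis
        out, _ = self.attn(x, x, x)     # every token attends across every view
        return out.reshape(b, v, n, d)

x = torch.randn(1, 4, 16 * 16, 320)     # 4 views, 256 tokens each, 320 channels
y = MultiViewSelfAttention(320)(x)
print(y.shape)                          # torch.Size([1, 4, 256, 320])
```

In a plain 2D diffusion model, attention would only run within each view; folding the view axis in is what lets the images "share information", as described next.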

Below, you can see the camera parameters and timestep that are also fed into the model for each view, to help it understand which viewpoint each image corresponds to and what kind of view needs to be generated. Now all the images are connected and generated together, so they can share information and better understand the global content. You then feed it your text and train the model to accurately reconstruct the objects from the dataset. This is where they apply the multi-view score distillation sampling process I mentioned.
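As a minimal sketch of that kind of conditioning, one reasonable way to inject the camera alongside the timestep is to embed the per-view camera matrix with a small MLP and add it to the timestep embedding. The layer sizes and the exact way of combining the two are my assumptions for illustration, not necessarily MVDream's exact implementation.

```python
import torch
import torch.nn as nn

class CameraTimeConditioning(nn.Module):
    """Embed the per-view camera matrix and add it to the diffusion timestep
    embedding, so the network knows which viewpoint it is generating
    (illustrative sketch; sizes are placeholders)."""
    def __init__(self, embed_dim=1280):
        super().__init__()
        self.camera_mlp = nn.Sequential(
            nn.Linear(16, embed_dim),    # flattened 4x4 camera extrinsic matrix
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, time_emb, camera_matrix):
        # time_emb: (batch * views, embed_dim), camera_matrix: (batch * views, 4, 4)
        cam_emb = self.camera_mlp(camera_matrix.flatten(1))
        return time_emb + cam_emb        # condition on time AND viewpoint

cond = CameraTimeConditioning()
t_emb = torch.randn(4, 1280)             # one timestep embedding per view
cams = torch.eye(4).repeat(4, 1, 1)      # 4 camera matrices
print(cond(t_emb, cams).shape)           # torch.Size([4, 1280])
```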

They now have a multi-view diffusion model that can generate, well, multiple views of an object, but they still need to reconstruct a consistent 3D model, not just views. This is often done using a NeRF, or neural radiance field, as in DreamFusion, which we mentioned earlier. It basically takes the trained multi-view diffusion model we have and freezes it, meaning the model is just being used and not trained any further.

How MVDream Works

We start by generating an initial version of the images with our multi-view diffusion model, guided by our caption and by the initial rendering with noise added to it. We add noise so that the model knows it needs to generate a different version of the image while still receiving its context. Then we use the model to generate a higher-quality version, subtract the noise we manually added, and use this result to guide and improve our NeRF model for the next step.

We do all that to better understand where in the image the NeRF model should focus its attention to produce better results in the next step, and we repeat this until the 3D model is satisfying enough. And voilà, this is how they took a 2D text-to-image model, adapted it for multi-view synthesis, and finally used it iteratively to create a text-to-3D model. Of course, they added many technical improvements to the approaches they built upon, which I did not go into for simplicity. But if you are curious, I definitely invite you to read their great paper for more information.
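To make the loop described above a bit more tangible, here is a heavily simplified sketch of a score-distillation-style update: render the current 3D model, add noise, let the frozen multi-view diffusion model predict that noise given the text prompt, and nudge the NeRF toward views the diffusion model finds plausible. Every method name here (`nerf.render`, `add_noise`, `predict_noise`) is a placeholder, not MVDream's actual API.

```python
import torch

def score_distillation_step(nerf, diffusion_model, prompt_emb, cameras, optimizer):
    """One SDS-style update: the frozen diffusion model 'teaches' the NeRF
    which rendered views look plausible for the text prompt (simplified sketch)."""
    renders = nerf.render(cameras)                        # (views, C, H, W), differentiable w.r.t. NeRF params
    t = torch.randint(20, 980, (1,)).item()               # random diffusion timestep
    noise = torch.randn_like(renders)
    noisy = diffusion_model.add_noise(renders, noise, t)  # forward diffusion on the renders

    with torch.no_grad():                                 # the diffusion model stays frozen
        noise_pred = diffusion_model.predict_noise(noisy, t, prompt_emb, cameras)

    # The residual pushes the renders toward images the diffusion model considers
    # likely for this prompt; only the NeRF parameters receive gradients.
    grad = noise_pred - noise
    loss = (grad.detach() * renders).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Repeating this step many times, from many camera poses, is what gradually turns the NeRF into a 3D model that matches the text prompt from every angle.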

There are also still some limitations with this new approach, mainly that the generations are only 256 by 256 pixels, which is quite low resolution even though the results look incredible. They also mention that the size of the dataset available for this task limits the generalizability of the approach.

This was an overview of MVDream, and thank you for watching. I will see you next time with another amazing paper.

Thank you.


Keywords

  • MVDream
  • AI model
  • 3D model generation
  • Text-to-3D
  • Score distillation sampling
  • Multi-view images
  • 3D consistency problem
  • DreamBooth Model
  • Stable Diffusion
  • NeRF (neural radiance fields)

FAQs

Q: What is MVDream? A: MVDream is a new AI model designed to generate high-quality 3D models from simple text inputs.

Q: How does MVDream create 3D models? A: It uses a 2D image diffusion model adapted for multi-view synthesis and employs score distillation sampling to ensure consistency in the 3D models it generates.

Q: What is score distillation sampling? A: Score distillation sampling is a technique in which a frozen diffusion model guides the optimization of a 3D representation: rendered views are noised, denoised by the diffusion model, and the difference is used to progressively refine the 3D model so its views stay coherent with the text prompt.

Q: What are some limitations of MVDream? A: The model generates images at a resolution of 256x256 pixels, which is relatively low. Additionally, the size of the dataset used for training is a limitation for the generalizability of the approach.

Q: What is Kits.AI? A: Kits.AI is a platform for artists, producers, and fans to create AI voice models easily, with a library of licensed artist voices and community-created models.