From the course: Introduction to Artificial Intelligence
Image diffusion models
- Imagine that you're in a bakery, standing in front of six beautiful cakes. Two are chocolate, two are vanilla, one looks like a wedding cake, and another is a Brazilian brigadeiro. As you stand there, someone tells you that you're the new baker. You've never baked in your life, but you grab a bowl and start baking. How would you teach yourself to bake cakes? Imagine that there are no recipes and no other chefs. You just need to figure out how to bake a cake on your own. This is a lot like the challenge that generative AI systems run into when creating new images. The system didn't get to see the people, ingredients, or techniques that went into making the images. In a sense, the images are baked cakes. To generate a new image, the system needs to teach itself how to un-bake, and then re-bake, billions of images. So to do this, you put on an apron and smash each one of the cakes into mounds of mush. Now you have six mounds of chocolate, vanilla, eggs, cream, butter, and sugar, each mashed into a multicolored mess. Now your job is to re-bake these mounds back into the same neatly designed cakes. Ideally, someone should be able to walk into the bakery and never know that the cakes have been mushed and re-baked. That might sound strange, but it's very similar to how image diffusion models work in generative AI systems. To learn how to generate new images, the system blurs, or diffuses, each image until it's destroyed. You end up with a cloudy, colorful graphic, like a mushed-up cake. Then the system sharpens the images until they match the originals. Ideally, they should look exactly the same. So why would the gen AI system go through this terrible process of destroying and then recreating images? Remember, generative AI systems create something new. There's no guidebook for generating new images. There's no one set of rules to create an image of an astronaut standing on a brick wall.
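The "blurring until destroyed" step described above is usually implemented as a forward noising process: a little Gaussian noise is mixed into the image at each of many small steps until nothing recognizable remains. Here is a minimal NumPy sketch of that idea; the schedule values (`beta_start`, `beta_end`, 1,000 steps) follow common defaults in the diffusion literature, but this is an illustrative toy, not a production implementation.

```python
import numpy as np

def forward_diffusion(image, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Gradually mix Gaussian noise into an image until it is pure static.

    This is the 'smashing the cake' step: after enough noising steps,
    almost none of the original image survives.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)  # per-step noise amounts
    alphas_cumprod = np.cumprod(1.0 - betas)              # fraction of signal that survives
    noise = np.random.randn(*image.shape)
    # Closed-form result of applying all num_steps of noising at once:
    signal = np.sqrt(alphas_cumprod[-1])
    return signal * image + np.sqrt(1.0 - alphas_cumprod[-1]) * noise

# A tiny toy "image": after 1,000 steps, less than 1% of the signal remains.
img = np.ones((8, 8))
destroyed = forward_diffusion(img)
```

The key detail is the schedule: each step destroys only a little information, which is what makes the reverse (sharpening) direction learnable step by step.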
The system needs to know all the lines and colors that go into a brick wall. Then it needs to know all the different patterns that make up images of astronauts. Then it needs to put the two together, complete with shadows and a realistic perspective. You can only get that kind of knowledge by having a deep understanding of the ingredients. It's like someone going into the bakery and asking for a half-chocolate, half-vanilla wedding cake. You know all the ingredients because you've baked and destroyed many different versions of similar cakes. It's only by fully understanding the ingredients that you can generate something new. Now, image diffusion models, like all foundation models, rely on an enormous amount of data. If the system wants to generate something new, it's going to have to rely on a lot more than just six cakes. These models will look through billions of images and identify patterns using statistics. The system won't see a cake or a wall or an astronaut the same way that you do. Instead, it will see pixelated patterns that it's learned through self-supervised learning. It destroyed and recreated all those images. It's through this terrible process of destruction and recreation that these generative AI systems learn how to create something new.
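The self-supervised part can be sketched too: the training "label" is simply the noise that was just added, so the model learns to undo the destruction without any human annotation. In the toy loop below, the tiny elementwise linear "model" is a stand-in assumption for a real denoising network, and the shapes and step counts are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def noising_step(image, t, betas):
    """Return the noisy image at step t, plus the noise that was added."""
    alphas_cumprod = np.cumprod(1.0 - betas)
    noise = rng.standard_normal(image.shape)
    noisy = np.sqrt(alphas_cumprod[t]) * image + np.sqrt(1.0 - alphas_cumprod[t]) * noise
    return noisy, noise

# Self-supervised training loop: no human labels anywhere --
# the target is the noise the system itself just added.
betas = np.linspace(1e-4, 0.02, 100)
image = rng.standard_normal(16)   # a flattened toy "image"
weights = np.zeros(16)            # stand-in for a denoising network's parameters
lr = 1e-2
for step in range(200):
    t = rng.integers(len(betas))            # pick a random destruction level
    noisy, noise = noising_step(image, t, betas)
    predicted_noise = weights * noisy       # toy model "predicts the noise"
    grad = (predicted_noise - noise) * noisy  # gradient of the squared error
    weights -= lr * grad                    # learn to recognize what was added
```

A real system does the same loop with a large neural network and billions of images; the structure (noise, predict the noise, update) is the same.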