From the course: Everyday AI Concepts
Let the machine learn without you
- Foundation models have an enormous appetite for data. The more data a model consumes, the more powerful the system becomes. That's why some of the most advanced generative AI systems are being developed by companies with huge computing resources; you need that kind of massive computing power to vacuum up and process all that data. But remember that there isn't just one type of data. Sometimes the data is labeled, but most of the data in the world is unlabeled. That means the system will occasionally vacuum up an image labeled "dog image," but most of the time it will just be a photograph with no label at all, whether it shows a dog, a person, or a mountain.

You can use something called unsupervised learning to cluster together photographs that share a similar pattern. You saw that unsupervised learning doesn't need labels, so it can cluster together millions of images that contain a dog-like pattern. It just won't know it's a dog, because the data isn't labeled. Unsupervised learning will cluster the images, and then a human needs to go through and label that pattern as a dog. But having a human go through and label all these clusters would be an impossible task.

So to feed their foundation models, these organizations needed a way to label all those unlabeled clusters. In a sense, they needed to use unsupervised learning to cluster the unlabeled data, and then label those clusters so the data could be used with supervised learning. These new systems harness the power of unsupervised learning to vacuum up the data while using supervised learning to classify it. It's almost like the system is vacuuming and organizing at the same time. This technique is called self-supervised learning. The way it typically works is that the system uses unsupervised learning to guess the labels for new data. So the system might look through billions of images of dogs.
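To make the clustering idea concrete, here's a minimal sketch using a tiny hand-rolled k-means over invented two-number "image features." Real systems cluster learned embeddings with thousands of dimensions; the numbers, names, and parameters below are purely illustrative, not the actual method any particular company uses:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Group unlabeled feature vectors into k clusters of similar patterns."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k random points
    for _ in range(iters):
        # Assign every point to its nearest center -- no labels involved.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Move each center to the mean of the points assigned to it.
        centers = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

# Made-up "image features": dog-like photos land near one point,
# mountain-like photos near another.
dog_like = [(1.0, 1.2), (0.9, 1.1), (1.1, 0.8)]
mountain_like = [(9.0, 8.8), (8.7, 9.2), (9.3, 9.1)]
groups = kmeans(dog_like + mountain_like, k=2)
```

Notice that the algorithm separates the two patterns but cannot name them; a human (or a pseudo label) still has to say which cluster is "dog."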
Then it'll look at some of the text and other context to create a pseudo label. For example, the system can create a pseudo label based on the caption, "Just walking Charlie at the park." If the system generates enough of these pseudo labels, it can assume that these images contain a dog. You may have noticed that this isn't very precise. The system might find a picture of a person holding a cat with the caption, "She's scared of dogs." Then the system might create a pseudo label that misidentifies the cat as a dog. That's why a lot of these AI systems will hallucinate. The system hallucinates when it makes a statement that's verifiably wrong. You can ask for a picture of a person with a dog, but the system might show the earlier image of a person with a cat. That's because it attached a dog pseudo label to the cat image and learned something that was wrong. Now, a key thing to keep in mind is that self-supervised learning can vacuum up all this data with very little human intervention. Remember, that's one of the big downsides of supervised learning: it has a very restrictive diet of human-labeled data. Self-supervised learning allows organizations to blow past this limitation and create their own labels through a lot of computer guesswork.
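As a rough illustration of that guesswork, here's a toy pseudo-labeler that guesses an image's label from its caption alone. The keyword table and the assumption that "Charlie" is a dog's name are invented for this sketch; real systems use far more sophisticated models, but they can fail in exactly the way shown in the second example:

```python
# Toy keyword table mapping labels to caption phrases. Everything here is a
# made-up example, including the guess that "walking charlie" implies a dog.
LABEL_KEYWORDS = {
    "dog": ["dog", "puppy", "walking charlie"],
    "cat": ["cat", "kitten"],
}

def pseudo_label(caption):
    """Return the first label whose keyword appears in the caption, else None."""
    text = caption.lower()
    for label, keywords in LABEL_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return label
    return None

print(pseudo_label("Just walking Charlie at the park"))  # → dog
# The failure mode from the transcript: a cat photo captioned about dogs
# still gets the "dog" pseudo label, and the system learns something wrong.
print(pseudo_label("She's scared of dogs"))  # → dog (but the photo shows a cat)
```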