From the course: Hands-On Introduction to Transformers for Computer Vision
So, what is a transformer? - PyTorch Tutorial
- [Instructor] Hey, everyone. Welcome to Chapter 2, Module 1. We're going to be talking about: so, what is a transformer, after all? We need to ease our way into this course, and we want to make sure that everyone is starting from the same point. If you don't know already, transformers ushered in a new era in AI. It all started back in 2017 with the paper "Attention Is All You Need." That kicked off this new era of AI, with transformers coming into both vision as well as language. In 2020, the Vision Transformer came out, and then in 2022 ChatGPT arrived and really ushered in this new era to stay. But what is a transformer?

First off, there's no rush. We'll be easing into this course and gradually ramping up as we go. That's not to say we'll cut corners, but I want to make sure that no one is left behind. If you ever feel like it's going a little too fast, or I use a term you're unfamiliar with, feel free to pause the course, do a quick Google search or two, or replay the last couple of minutes to make sure you're up to date. I'm going to do my best to pace this course well, but it's totally okay to go at your own pace as well.

So, let's start at a very high level: how do transformers work? Transformers work because they pay attention. As the paper is titled, of course, attention is all you need. This is very important when it comes to transformers. But what is this attention that transformers are actually talking about? Well, let's look at an example. We can see the sentence on the left: "The animal didn't cross the street because it was too tired." When we think about attention, notice that the word "it" is underlined. Attention in transformers is how much the model pays attention not only to one word, but to how all the other words around it relate to that word. On the right is a diagram showing how all the different words in the sentence relate to the word "it." Some words relate much more strongly to "it," while others we can safely ignore. For instance, when I read "it," my eyes go straight to "animal," because "it" is referring to the animal at the beginning of the sentence; that word is very important in this context. However, words like "the" or "was" or "too" don't need as much thought, and I can mostly ignore them, because they're not actually impacting the meaning of the sentence. This is the core of attention, which lets the model bring context into a sentence whenever it interprets it. Hopefully, this makes a little bit of sense, but we can look at another example if it's not clicking so far.
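To make that idea a bit more concrete, here is a minimal sketch of the scaled dot-product attention computation in PyTorch, applied to the example sentence. It is not code from the course: the token embeddings and projection matrices are random placeholders, so the printed weights are not the meaningful ones a trained model would produce. The point is only to show the mechanism by which each word gets one attention weight for every other word in the sentence.

```python
# Minimal sketch of scaled dot-product attention (toy values, not a trained model).
import torch
import torch.nn.functional as F

tokens = ["The", "animal", "didn't", "cross", "the", "street",
          "because", "it", "was", "too", "tired"]
d_model = 16  # small embedding size, just for illustration

torch.manual_seed(0)
x = torch.randn(len(tokens), d_model)   # one (random) embedding per token

# In a real transformer, queries, keys, and values come from learned
# linear projections of the token embeddings.
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Attention weights: how much each token "pays attention" to every other token.
scores = Q @ K.T / d_model ** 0.5       # (11, 11) similarity scores
weights = F.softmax(scores, dim=-1)     # each row sums to 1
output = weights @ V                     # context-aware token representations

# The row for "it": one weight per word in the sentence. In a trained model,
# the weight on "animal" would be high and words like "the" would get very little.
it_idx = tokens.index("it")
for tok, w in zip(tokens, weights[it_idx]):
    print(f"{tok:>8}: {w:.2f}")
```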
In this next example, let's look at two different sentences that use the same word. The first sentence is, "The mouse quickly ran towards the cheese." The second sentence is, "Left-click your mouse in order to open the file." Now, this is the same word, spelled the same way, and, more importantly, it's going to be represented to the model in the same way; we don't encode the word differently just because it's one mouse versus the other. So how would the model know which mouse I'm talking about? Well, I can use the context clues around the sentence to understand exactly what I'm talking about.

So, when I pay attention to the sentence, or the model pays attention to the sentence, I can look at the first sentence and say, "Oh, mouse," but I see words like "ran" and "cheese," so I can probably put together that this is the animal mouse. When I look at the second sentence, I can look at the words around "mouse," things like "left-click" or "file," and I begin to understand that, "Oh yeah, this is probably a computer mouse we're talking about here." This kind of attention allowed transformers to unlock so much potential, in both language and vision, that we just didn't have before.

We'll go into more detail about how revolutionary they are, but transformers quickly outclassed traditional methods because of this attention. Previous methods such as CNNs and RNNs had really intuitive and clever ways of getting to the correct outputs before transformers. But once transformers came along, they really just changed everything. Now we can bring in so much more context, throw so much more at the model, and train it in so many more settings that, really, if you're training large-scale models in pretty much any modality, it's hard to make an argument for why you would not use a transformer.

The most awesome part about all of this is that we figured out very quickly that transformers were really good and had a lot of potential, and that compared to traditional approaches like CNNs or RNNs, we could unlock a lot with them. However, as we kept doing more research and digging deeper, we realized there was way more here than we thought, and that performance just kept going up and up. The more data we threw at them, the more compute we threw at them, the more scale we added, the farther transformers went.

Before we can go forward, though, we first need to take a look back. We need to understand why transformers were so revolutionary, and to understand that, we have to understand what they were iterating on. The last thing we're going to do here is a quick introduction to computer vision and some basics of AI and machine learning, to set the stage for 2017, when that paper finally drops: "Attention Is All You Need." Check out the next module to keep moving forward and understand exactly how transformers revolutionized computer vision and all of AI.