The Evolution of AI: My Journey Understanding How We Got Here
I'll be honest, I never really understood how AI worked until recently. Sure, I used ChatGPT and was amazed by what it could do, but I had no clue what was happening under the hood. That curiosity eventually got the better of me, and I decided to dig deeper.
What I discovered was a fascinating story of how we went from AI systems that could barely remember what happened a few words ago to ones that can write entire articles, generate images, and even have conversations that feel surprisingly human. The key to this transformation? Something called the transformer architecture.
During my master's degree and in the time since, I've spent a good amount of time diving into research papers, articles, YouTube videos, and other resources to understand these concepts. Gradually, I started to piece together the story. And honestly, it's pretty incredible how we got from there to here. Let me share what I learned.
The Early Days: RNNs, LSTMs, and the Bottleneck
Before 2017, the AI landscape was dominated by models like Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs). These models process a sequence one step at a time, carrying a summary of everything seen so far in a hidden state. However, they faced significant challenges, especially with long sequences and effectively capturing context. For instance, try predicting the last word in a sentence like "I grew up in Kolkata, went to school there, lived there for 20 years, and now I speak fluent..." You'd immediately say "Bengali," right? But RNNs and LSTMs often struggled to maintain that kind of context over long distances.
A major limitation surfaced in sequence-to-sequence models, commonly used in machine translation. These models, often built with LSTMs, used an "encoder" to process the input sentence and condense all its information into a single, fixed-length vector. This vector would then be passed to a "decoder" to generate the output. This single vector became an "encoder bottleneck" as it simply couldn't hold enough information for longer or more complex sentences.
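To make that bottleneck concrete, here is a minimal NumPy sketch of a toy RNN encoder. The dimensions and random weights are illustrative assumptions of mine, not any real model: the point is simply that whether the input has 5 tokens or 50, the decoder would only ever receive one fixed-size vector.

```python
# A minimal sketch (toy dimensions, random weights) of the seq2seq "encoder bottleneck":
# no matter how long the input is, the decoder only ever sees one fixed-size vector.
import numpy as np

hidden_size = 8
W_in = np.random.randn(hidden_size, hidden_size) * 0.1   # toy input projection
W_h  = np.random.randn(hidden_size, hidden_size) * 0.1   # toy recurrent weights

def encode(token_vectors):
    """Run a toy RNN over the sequence and return only the final hidden state."""
    h = np.zeros(hidden_size)
    for x in token_vectors:
        h = np.tanh(W_in @ x + W_h @ h)   # each step overwrites part of the old memory
    return h                              # the single fixed-length "context" vector

# A 5-token sentence and a 50-token sentence both collapse to the same 8 numbers:
short_ctx = encode(np.random.randn(5, hidden_size))
long_ctx  = encode(np.random.randn(50, hidden_size))
print(short_ctx.shape, long_ctx.shape)    # (8,) (8,) -- the bottleneck
```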
The Birth of Attention: A Glimmer of a Solution
To overcome this bottleneck, around 2014, researchers came up with a clever solution. Instead of trying to remember everything in one place, what if the AI could look back at the original sentence whenever it needed to? This approach became known as "attention". The paper "Neural Machine Translation by Jointly Learning to Align and Translate" proposed a "soft attention mechanism".
The story behind this breakthrough is quite interesting. Dzmitry Bahdanau, one of the key researchers, was trying to solve the bottleneck problem during his internship. He first experimented with complex approaches involving "cursors" moving through sequences, but they were too complicated to implement within his remaining weeks. So he tried something simpler: allowing the decoder to simultaneously look at all parts of the input sequence.
The inspiration came from his own experience learning English in middle school. He noticed that when translating, "your gaze shifts back and forth between the source and target sequences as you translate." This observation led him to create a mechanism where the AI could do the same thing: search through the source sentence and focus on the most relevant parts when generating each word of the translation.
Instead of squeezing all information into one vector, this new approach allowed the decoder to automatically perform a soft search for parts of the source sentence that were relevant to predicting a target word.
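Here is roughly what that "soft search" looks like in code. This is a minimal NumPy sketch using simple dot-product scoring rather than the additive scorer of the original Bahdanau paper; the sizes and random vectors are illustrative only.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_attention(decoder_state, encoder_states):
    """Score every source position against the current decoder state,
    then blend the encoder states with those weights (a 'soft search')."""
    scores = encoder_states @ decoder_state   # one relevance score per source token
    weights = softmax(scores)                 # how much to "look at" each token
    context = weights @ encoder_states        # weighted average replaces the single bottleneck vector
    return context, weights

encoder_states = np.random.randn(12, 16)      # 12 source tokens, 16-dim states (toy sizes)
decoder_state  = np.random.randn(16)
context, weights = soft_attention(decoder_state, encoder_states)
print(weights.round(2))   # weights sum to 1; the largest ones show where the model "attends"
```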
Interestingly, the term "attention" wasn't even in the original concept; it was added later by Yoshua Bengio during the final review process. The mechanism worked so well from the very first try that they rushed to publish it, knowing that other research teams were working on similar ideas. This was a game-changer, but it was just the beginning.
2017: The Transformer Revolution – Attention Is All You Need
The true "explosion of Transformers" into Natural Language Processing (NLP) began in 2017 with the landmark paper "Attention Is All You Need". What made this paper revolutionary was its bold decision: it completely removed recurrent neural networks (RNNs) and instead relied solely on the attention mechanism. They called their new architecture the "Transformer" and it worked incredibly well.
This was a radical departure, as most papers at the time were incremental. The Transformer paper, however, combined multiple innovations into a highly effective architecture, which has since proven "remarkably resilient".
Key architectural features introduced or adopted in the Transformer include:
- Positional Encoding: Since attention operates over sets of data and lacks an inherent notion of sequence order, positional encodings are added to token embeddings to inform the model about a token's position (a minimal sketch follows this list).
- Residual Network Structure and Layer Norms: These components, borrowed from other deep learning advancements, significantly aid in the model's optimizability, allowing gradients to flow easily during training.
- Multi-Headed Attention: The attention mechanism is applied "multiple times in parallel" with different sets of weights, allowing the model to seek out "different kinds of information" concurrently. These "heads" are like independent "message passing schemes" happening in parallel, while "layers" stack these schemes in series.
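To illustrate the positional encoding item above, here is a small sketch of the sinusoidal encodings in the spirit of the original paper. The sizes are toy values; in practice these vectors are simply added to the token embeddings before the first layer.

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encodings: even dimensions use sine, odd dimensions use cosine,
    at geometrically spaced frequencies, so every position gets a unique pattern."""
    positions = np.arange(num_positions)[:, None]                     # (pos, 1)
    dims = np.arange(d_model)[None, :]                                # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

pe = positional_encoding(num_positions=50, d_model=64)
print(pe.shape)   # (50, 64): one position vector to add to each token embedding
```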
Deconstructing the Transformer: Communication and Computation
Think of a Transformer as working in two main steps:
1. Communication (Multi-Head Attention)
Each word in a sentence sends out three pieces of information:
- Query (Q): What am I looking for?
- Key (K): What do I have to offer?
- Value (V): Here's my actual information
Every word looks at all the other words using Q and K to decide what’s relevant, then uses that to blend together the values (V). This is how words "talk" to each other and share context.
This happens simultaneously for all words in the sentence, and they do it multiple times with different "perspectives" (called attention heads). It's like having several experts all analyzing the same sentence at the same time, each looking for different types of relationships.
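Here is a minimal NumPy sketch of that communication step: scaled dot-product self-attention run across several heads. The sizes and random weights are toy values purely for illustration; a real implementation also uses learned weights and an output projection.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One attention head: every token asks (Q), offers (K), and shares (V)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how relevant is every token to every other token
    weights = softmax(scores, axis=-1)        # rows sum to 1: each token's "attention budget"
    return weights @ V                        # each token becomes a blend of the values it attended to

seq_len, d_model, d_head, n_heads = 6, 32, 8, 4   # toy sizes
X = np.random.randn(seq_len, d_model)
heads = []
for _ in range(n_heads):                          # several "experts" looking for different relationships
    Wq, Wk, Wv = (np.random.randn(d_model, d_head) * 0.1 for _ in range(3))
    heads.append(self_attention(X, Wq, Wk, Wv))
output = np.concatenate(heads, axis=-1)           # heads are concatenated (and projected, in a real model)
print(output.shape)                               # (6, 32)
```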
2. Computation (Feed-Forward Network)
After communication, each word updates itself using a mini neural network, kind of like processing what it just heard. This back-and-forth of communication and computation happens in layers, allowing the model to gradually build up a deep understanding of the sentence.
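And a sketch of the computation step: a per-token feed-forward network wrapped in the residual connection and layer norm mentioned earlier. Again, the dimensions and weights are toy stand-ins, not a real model; a full Transformer layer wraps the attention step the same way and then stacks these layers.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Per-token 'computation': expand, apply a nonlinearity, project back down."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

seq_len, d_model, d_ff = 6, 32, 128
X = np.random.randn(seq_len, d_model)            # stand-in for the attention output from the previous sketch
W1, b1 = np.random.randn(d_model, d_ff) * 0.1, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.1, np.zeros(d_model)

# Residual connection + layer norm around the computation step:
out = layer_norm(X + feed_forward(X, W1, b1, W2, b2))
print(out.shape)   # (6, 32); stacking layers of communicate-then-compute builds deeper understanding
```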
Transformers come in different configurations based on how these attention mechanisms are arranged and masked:
- Decoder-Only Models (like GPT): These are primarily used for language modeling, predicting the next token in a sequence. They employ "causal self-attention" where future tokens are masked out to prevent the model from "cheating" by seeing the answer (a minimal masking sketch follows this list).
- Encoder-Only Models (like BERT): These allow all tokens within an input sequence to communicate fully with each other (no masking), making them suitable for tasks like sentiment classification or question answering.
- Encoder-Decoder Models (like T5): These combine both encoder and decoder components, often used for tasks like machine translation, where an encoder processes the source text and a decoder generates the target text, utilizing cross-attention to link the two.
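The causal masking used by decoder-only models is easy to see in code: a lower-triangular mask forces every position's attention weights to zero for anything that comes after it. The scores here are random toy values, illustrative only.

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may only attend to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over each row, with blocked positions pushed to (near) zero weight."""
    scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.randn(5, 5)                    # toy attention scores for a 5-token sequence
weights = masked_softmax(scores, causal_mask(5))
print(weights.round(2))                           # upper triangle is 0: no token can "see" the future
```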
The Widespread Impact and Effectiveness
The Transformer's design proved incredibly versatile, rapidly expanding beyond NLP to revolutionize various AI fields:
- Computer Vision: Vision Transformers (ViT) process images by chopping them into small squares ("patches") and feeding them into the Transformer, allowing self-attention to learn visual relationships (a patching sketch follows this list).
- Speech Recognition: Models like Whisper convert audio spectrograms into "slices" that are treated like sequences of "tokens" for Transformer processing.
- Reinforcement Learning: Decision Transformers model sequences of states, actions, and rewards as a "language," enabling planning and control.
- Biology: AlphaFold, a groundbreaking model for protein folding, has a Transformer at its computational heart.
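The "patching" idea behind Vision Transformers can be sketched in a few lines: the image is cut into non-overlapping squares, and each square is flattened into one token. The sizes below mirror the common setup of 224x224 images with 16x16 patches, but this is a toy illustration, not the actual ViT pipeline (which also adds a learned projection and positional information).

```python
import numpy as np

def patchify(image, patch_size):
    """Chop an image into non-overlapping square patches and flatten each one into a token."""
    H, W, C = image.shape
    P = patch_size
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)   # (num_patches, patch_dim): a "sentence" of image tokens

image = np.random.rand(224, 224, 3)         # toy RGB image
tokens = patchify(image, patch_size=16)
print(tokens.shape)                         # (196, 768): 14x14 patches, each a 768-dim "word"
```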
A key reason for the Transformer's success is its remarkable flexibility. Unlike older architectures that required data to conform to specific Euclidean spaces (like images for convolutional networks), Transformers treat all inputs as sets of tokens. You can "chop up everything and throw it into the mix", allowing self-attention to "figure out how everything should communicate". This "frees neural nets from the burden of Euclidean space".
Furthermore, Transformers excel due to three core properties:
- Expressiveness: They can implement "very interesting functions," including a form of in-context learning or meta-learning. This means that when scaled large enough, Transformers can learn new tasks directly from examples given in the prompt, without needing traditional gradient-based fine-tuning for each new task. They appear to perform a kind of "gradient-based learning inside the activations".
- Optimality: Their architecture, with features like residual connections and layer norms, makes them "very optimizable" by gradient descent, ensuring smooth and effective training.
- Efficiency: Critically, Transformers are "extremely efficient" on modern hardware like GPUs. Their computational graph is "shallow and wide," which perfectly leverages the parallel processing capabilities of GPUs, enabling the training of much larger models, a crucial factor for achieving state-of-the-art performance in deep learning.
The Present and Future of AI: Evolving Capabilities
Today, we are witnessing the fruits of this evolution with models like ChatGPT, Whisper, Stable Diffusion, and new multimodal foundation models. These systems enable unprecedented applications in audio generation, art, music, storytelling, and exhibit remarkable reasoning capabilities in common sense, logical, and mathematical contexts. They are also increasingly being aligned with human values through techniques like reinforcement learning with human feedback (RLHF).
However, the journey continues, and while many challenges persist, significant progress has been made:
- Vastly Extended Context Windows: While context length limits were a significant hurdle just a few years ago (e.g., 4,000 tokens), we've seen immense progress. Models like GPT-4 Turbo, Claude 3, and Gemini 1.5 Pro now boast context windows of 128K, 200K, and even 1 million tokens respectively. This progress is driven by architectural innovations and optimized attention mechanisms that allow Transformers to process entire books or extensive codebases, greatly enhancing their ability to maintain context over long documents.
- External Knowledge Integration (RAG): The quest for truly long-term memory remains, as models don't possess inherent 'memories' of past interactions. However, significant strides have been made with techniques like Retrieval-Augmented Generation (RAG). RAG allows models to access and synthesize information from external, continually updated knowledge bases, effectively giving them a 'scratchpad' or 'long-term memory' to overcome the 'short-lived' interaction limitation (a minimal retrieval sketch follows this list).
- Optimized Computational Efficiency: The quadratic scaling of standard attention with sequence length remains a theoretical challenge for ultra-long contexts. However, practical optimizations like FlashAttention and Multi-Query / Grouped-Query Attention have dramatically improved memory and speed efficiency, making longer contexts feasible on current hardware. Research into sub-quadratic attention mechanisms continues to push these boundaries further.
- Enhanced Controllability: While inherent stochasticity can lead to creative outputs, greater controllability over model outputs is increasingly sought for critical applications. Advancements in prompt engineering, fine-tuning techniques (including targeted instruction tuning), and aligning models through sophisticated RLHF variants offer more precise steering of model behavior.
- Mixture of Experts (MoE) and Specialization: While current foundation models are trained on vast, general data, the future increasingly includes more sophisticated Mixture of Experts (MoE) architectures. Rather than single monolithic models, MoE allows different 'expert' subnetworks to specialize in different domains or tasks, dynamically activating the relevant experts. This enables training even larger models efficiently and envisions highly specialized 'doctor GPT' or 'law GPT' models, leveraging vast general knowledge while excelling in niche domains (a toy routing sketch follows this list).
- Alignment and Understanding: The profound question of how these powerful models relate to human cognition and intelligence remains a vibrant area of research. Understanding and aligning AI systems with human values, ethics, and cognitive processes is paramount as their capabilities grow.
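To ground two of the items above, first a minimal sketch of the retrieval step in RAG. The embed function here is a hypothetical stand-in that just hashes text into random vectors, so the rankings are meaningless; a real system would call an actual embedding model and a vector database, but the mechanics of "retrieve, then prepend to the prompt" are the same.

```python
import numpy as np

def embed(text):
    """Hypothetical embedding function; a real system would call an embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

documents = [
    "The transformer architecture was introduced in 2017.",
    "RAG retrieves external documents and adds them to the prompt.",
    "Kolkata is the capital of West Bengal.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query, k=2):
    """Rank stored documents by similarity to the query and return the top k."""
    scores = doc_vectors @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

query = "What is retrieval-augmented generation?"
context = "\n".join(retrieve(query))
prompt = f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)   # the assembled prompt (retrieved context + question) is what the language model sees
```

And a toy Mixture-of-Experts layer: a router scores each expert for the incoming token, and only the top-k experts actually run. All names, sizes, and weights here are invented for illustration; production MoE layers (and their load-balancing tricks) are considerably more involved.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class ToyMoELayer:
    """A tiny Mixture-of-Experts layer: a router picks the top-k expert networks per token."""
    def __init__(self, d_model=16, n_experts=4, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.router = rng.standard_normal((d_model, n_experts)) * 0.1
        self.experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
        self.top_k = top_k

    def __call__(self, x):
        gate = softmax(x @ self.router)                 # how well each expert matches this token
        chosen = np.argsort(gate)[::-1][: self.top_k]   # activate only the top-k experts
        out = np.zeros_like(x)
        for i in chosen:
            out += gate[i] * np.tanh(x @ self.experts[i])   # weighted sum of the chosen experts' outputs
        return out

layer = ToyMoELayer()
token = np.random.randn(16)
print(layer(token).shape)   # (16,); most experts stay inactive, so compute grows slower than model size
```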
As I've learned more about this technology, I've come to appreciate not just how clever the solution was, but how it opened up possibilities that even its creators probably didn't fully anticipate. We're living through one of the most significant technological revolutions in human history, and it all started with teaching machines how to pay attention.