
Commit 9b95866

Multi-Head Latent Attention (#876)

* Multi-Head Latent Attention
* update

1 parent bf27ad1 commit 9b95866

15 files changed: +1164 -233 lines

.gitignore

Lines changed: 1 addition & 1 deletion

@@ -12,7 +12,7 @@ appendix-D/01_main-chapter-code/3.pdf
 appendix-E/01_main-chapter-code/loss-plot.pdf
 
 ch04/04_gqa/kv_bytes_vs_context_length.pdf
-ch04/04_gqa/savings_vs_n_kv_groups.pdf
+ch05/05_mla/kv_bytes_vs_context_length.pdf
 
 ch05/01_main-chapter-code/loss-plot.pdf
 ch05/01_main-chapter-code/temperature-plot.pdf

README.md

Lines changed: 1 addition & 0 deletions

@@ -169,6 +169,7 @@ Several folders contain optional materials as a bonus for interested readers:
   - [FLOPS Analysis](ch04/02_performance-analysis/flops-analysis.ipynb)
   - [KV Cache](ch04/03_kv-cache)
   - [Grouped-Query Attention](ch04/04_gqa)
+  - [Multi-Head Latent Attention](ch04/05_mla)
 - **Chapter 5: Pretraining on unlabeled data:**
   - [Alternative Weight Loading Methods](ch05/02_alternative_weight_loading/)
   - [Pretraining GPT on the Project Gutenberg Dataset](ch05/03_bonus_pretraining_on_gutenberg)

ch04/04_gqa/README.md

Lines changed: 5 additions & 20 deletions

@@ -2,12 +2,9 @@
 
 This bonus material illustrates the memory savings when using Grouped-Query Attention (GQA) over regular Multi-Head Attention (MHA).
 
-
-
 &nbsp;
 ## Introduction
 
-
 Grouped-Query Attention (GQA) has become the standard, more compute- and parameter-efficient replacement for Multi-Head Attention (MHA) in recent years. Note that it isn't new; it goes back to the 2023 paper [GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints](https://arxiv.org/abs/2305.13245), and even the larger variants in the good old Llama 2 series used it.
 
 Here's a brief GQA summary. Unlike MHA, where each head has its own set of keys and values, GQA lets multiple query heads share the same key and value projections, which reduces memory usage.
@@ -28,19 +25,17 @@ While GQA is mainly a computational-efficiency workaround for MHA, ablation stud
 
 However, this assumes that the number of key-value groups is chosen carefully. If we reduce it all the way to a single key-value head shared across all query heads (this extreme case is known as multi-query attention), it will negatively affect the modeling performance.
 
-
-
 &nbsp;
 ## GQA Memory Savings
 
 The memory savings are mostly reflected in the KV storage. We can compute the KV storage size with the following formula:
 
 bytes ≈ batch_size × seqlen × (embed_dim / n_heads) × n_layers × 2 (K,V) × bytes_per_elem × n_kv_heads
 
-You can use the [memory_estimator.py](memory_estimator.py) script in this folder to apply this for different model configs to see how much memory you can save by using GQA over MHA:
+You can use the [memory_estimator_gqa.py](memory_estimator_gqa.py) script in this folder to apply this for different model configs to see how much memory you can save by using GQA over MHA:
 
 ```bash
-➜ uv run memory_estimator.py \
+➜ uv run memory_estimator_gqa.py \
     --emb_dim 4096 --n_heads 32 --n_layers 32 \
     --context_length 32768 --n_kv_groups 4 \
     --batch_size 1 --dtype bf16
@@ -62,25 +57,15 @@ Ratio (MHA / GQA) : 4.00x
 Savings (GQA vs MHA): 75.00%
 ```
 
-The savings when using GQA over MHA are further shown in the plot below for different key-value group sizes:
+The savings when using GQA over MHA are further shown in the plot below for different key-value group sizes as a function of the context length:
 
 &nbsp;
 
-<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gqa-memory/2.webp?2" alt="GQA" width="500px" />
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gqa-memory/3.webp?4" alt="GQA" width="500px" />
 
 &nbsp;
 
-And the following plot shows how the KV cache size grows with an increasing context length:
-
-&nbsp;
-
-<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gqa-memory/3.webp?2" alt="GQA" width="500px" />
-
-&nbsp;
-
-You can reproduce these plots via `uv run plot_memory_estimates.py`.
-
-
+You can reproduce the plot via `uv run plot_memory_estimates_gqa.py`.
 
 &nbsp;
 ## GQA Code Examples
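
As a quick sanity check of the KV-storage formula quoted in the README hunk above, here is a minimal Python sketch. It is not the repo's `memory_estimator_gqa.py`; the helper name is made up, and reading the `--n_kv_groups 4` example as 8 KV heads shared by 32 query heads is an assumption inferred from the 4.00x ratio in the sample output. Under those assumptions it reproduces the 4.00x ratio and 75% savings shown above:

```python
# Hypothetical helper (not part of the commit) that evaluates the README formula:
# bytes ≈ batch_size × seqlen × (embed_dim / n_heads) × n_layers × 2 (K,V) × bytes_per_elem × n_kv_heads
def kv_cache_bytes(batch_size, seqlen, emb_dim, n_heads, n_layers,
                   n_kv_heads, bytes_per_elem=2):  # 2 bytes per element for bf16
    head_dim = emb_dim // n_heads
    return batch_size * seqlen * head_dim * n_layers * 2 * bytes_per_elem * n_kv_heads


cfg = dict(batch_size=1, seqlen=32768, emb_dim=4096, n_heads=32, n_layers=32)
mha = kv_cache_bytes(**cfg, n_kv_heads=32)  # MHA: one KV head per query head
gqa = kv_cache_bytes(**cfg, n_kv_heads=8)   # GQA: 8 KV heads shared by 32 query heads (assumed)
print(f"MHA KV cache: {mha / 1024**3:.2f} GiB")      # 16.00 GiB
print(f"GQA KV cache: {gqa / 1024**3:.2f} GiB")      # 4.00 GiB
print(f"Ratio (MHA / GQA): {mha / gqa:.2f}x")        # 4.00x
print(f"Savings (GQA vs MHA): {1 - gqa / mha:.2%}")  # 75.00%
```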

ch04/04_gqa/gpt_with_kv_gqa.py

Lines changed: 11 additions & 38 deletions

@@ -4,7 +4,7 @@
 # Code: https://github.com/rasbt/LLMs-from-scratch
 
 # This file collects all the relevant code that we covered thus far
-# throughout Chapters 3-4.
+# throughout Chapters 3-4, adapted to use Grouped-Query Attention (GQA).
 # This file can be run as a standalone script.
 
 import argparse
@@ -83,7 +83,8 @@ def forward(self, x, use_cache=False):
         # Shape: (b, num_heads, num_tokens, num_tokens)
         attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head
 
-        # Use the mask to fill attention scores
+        ####################################################
+        # causal mask
         num_tokens_Q = queries.shape[-2]
         num_tokens_K = keys.shape[-2]
         device = queries.device
@@ -101,6 +102,7 @@ def forward(self, x, use_cache=False):
         k_positions = torch.arange(num_tokens_K, device=device, dtype=torch.long)
         mask = q_positions.unsqueeze(-1) < k_positions.unsqueeze(0)
 
+        # Use the mask to fill attention scores
         attn_scores = attn_scores.masked_fill(mask, -torch.inf)
 
         attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
@@ -111,7 +113,7 @@ def forward(self, x, use_cache=False):
         context_vec = (attn_weights @ values).transpose(1, 2)
 
         # Combine heads, where self.d_out = self.num_heads * self.head_dim
-        context_vec = context_vec.reshape(b, num_tokens, self.d_out)
+        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
         context_vec = self.out_proj(context_vec)  # optional projection
 
         return context_vec
@@ -184,7 +186,7 @@ def forward(self, x, use_cache=False):
 
         # x = self.att(x)  # Shape [batch_size, num_tokens, emb_size]
         ####################################################
-        # NEW
+        # KV cache-related
         x = self.att(x, use_cache=use_cache)
         ####################################################
 
@@ -211,7 +213,7 @@ def __init__(self, cfg):
         # self.trf_blocks = nn.Sequential(
         #     *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
         ####################################################
-        # NEW
+        # KV cache-related
         self.trf_blocks = nn.ModuleList(
             [TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
 
@@ -228,8 +230,7 @@ def forward(self, in_idx, use_cache=False):
         # pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
 
         ####################################################
-        # NEW
-
+        # KV cache-related
         if use_cache:
             pos_ids = torch.arange(self.current_pos, self.current_pos + seq_len, device=in_idx.device, dtype=torch.long)
             self.current_pos += seq_len
@@ -243,7 +244,7 @@ def forward(self, in_idx, use_cache=False):
 
         # x = self.trf_blocks(x)
         ####################################################
-        # NEW
+        # KV cache-related
         for blk in self.trf_blocks:
             x = blk(x, use_cache=use_cache)
         ####################################################
@@ -253,42 +254,14 @@ def forward(self, in_idx, use_cache=False):
         return logits
 
     ####################################################
-    # NEW
+    # KV cache-related
     def reset_kv_cache(self):
         for blk in self.trf_blocks:
             blk.att.reset_cache()
         self.current_pos = 0
     ####################################################
 
 
-def generate_text_simple(model, idx, max_new_tokens, context_size):
-    # idx is (B, T) array of indices in the current context
-    for _ in range(max_new_tokens):
-
-        # Crop current context if it exceeds the supported context size
-        # E.g., if LLM supports only 5 tokens, and the context size is 10
-        # then only the last 5 tokens are used as context
-        idx_cond = idx[:, -context_size:]
-
-        # Get the predictions
-        with torch.no_grad():
-            logits = model(idx_cond)
-
-        # Focus only on the last time step
-        # (batch, n_token, vocab_size) becomes (batch, vocab_size)
-        logits = logits[:, -1, :]
-
-        # Get the idx of the vocab entry with the highest logits value
-        idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (batch, 1)
-
-        # Append sampled index to the running sequence
-        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)
-
-    return idx
-
-
-####################################################
-# NEW
 def generate_text_simple_cached(model, idx, max_new_tokens,
                                 context_size=None, use_cache=True):
     model.eval()
@@ -314,7 +287,6 @@ def generate_text_simple_cached(model, idx, max_new_tokens,
         idx = torch.cat([idx, next_idx], dim=1)
 
     return idx
-####################################################
 
 
 def main():
@@ -324,6 +296,7 @@ def main():
     parser.add_argument("--n_layers", type=int, default=12, help="Number of transformer blocks.")
     parser.add_argument("--n_kv_groups", type=int, default=2, help="Number of key/value groups.")
     parser.add_argument("--max_new_tokens", type=int, default=200, help="Number of tokens to generate.")
+
     args = parser.parse_args()
 
     start_context = "Hello, I am"
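
As a side note for readers skimming this diff: the core GQA trick in `gpt_with_kv_gqa.py` is that keys and values are projected with fewer heads than the queries and then expanded (for example with `repeat_interleave`) so that every query head has a matching key/value head. The snippet below is a standalone illustration with made-up toy dimensions, not code taken from the file:

```python
import torch

# Toy dimensions, assumed for illustration only
b, num_tokens, num_heads, num_kv_heads, head_dim = 1, 5, 8, 2, 16
group_size = num_heads // num_kv_heads          # 4 query heads share each KV head

queries = torch.randn(b, num_heads, num_tokens, head_dim)
keys = torch.randn(b, num_kv_heads, num_tokens, head_dim)
values = torch.randn(b, num_kv_heads, num_tokens, head_dim)

# Expand K/V from num_kv_heads to num_heads by repeating each KV head group_size times
keys = keys.repeat_interleave(group_size, dim=1)      # (b, num_heads, num_tokens, head_dim)
values = values.repeat_interleave(group_size, dim=1)

attn_scores = queries @ keys.transpose(2, 3)          # (b, num_heads, num_tokens, num_tokens)
print(attn_scores.shape)                              # torch.Size([1, 8, 5, 5])
```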

ch04/04_gqa/gpt_with_kv_mha.py

Lines changed: 9 additions & 41 deletions

@@ -33,7 +33,7 @@ def __init__(self, d_in, d_out, dropout, num_heads, qkv_bias=False):
         self.dropout = nn.Dropout(dropout)
 
         ####################################################
-        # NEW
+        # KV cache-related code
         self.register_buffer("cache_k", None, persistent=False)
         self.register_buffer("cache_v", None, persistent=False)
         self.ptr_current_pos = 0
@@ -53,7 +53,7 @@ def forward(self, x, use_cache=False):
         queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
 
         ####################################################
-        # NEW
+        # KV cache-related
         if use_cache:
             if self.cache_k is None:
                 self.cache_k, self.cache_v = keys_new, values_new
@@ -74,7 +74,7 @@ def forward(self, x, use_cache=False):
         attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head
 
         ####################################################
-        # NEW
+        # causal mask
         num_tokens_Q = queries.shape[-2]
         num_tokens_K = keys.shape[-2]
         device = queries.device
@@ -107,12 +107,9 @@ def forward(self, x, use_cache=False):
 
         return context_vec
 
-    ####################################################
-    # NEW
     def reset_cache(self):
         self.cache_k, self.cache_v = None, None
         self.ptr_current_pos = 0
-    ####################################################
 
 
 #####################################
@@ -177,7 +174,7 @@ def forward(self, x, use_cache=False):
 
         # x = self.att(x)  # Shape [batch_size, num_tokens, emb_size]
         ####################################################
-        # NEW
+        # KV cache-related
         x = self.att(x, use_cache=use_cache)
         ####################################################
 
@@ -204,7 +201,7 @@ def __init__(self, cfg):
         # self.trf_blocks = nn.Sequential(
         #     *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
         ####################################################
-        # NEW
+        # KV cache-related
         self.trf_blocks = nn.ModuleList(
             [TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
 
@@ -221,8 +218,7 @@ def forward(self, in_idx, use_cache=False):
         # pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
 
         ####################################################
-        # NEW
-
+        # KV cache-related
         if use_cache:
             pos_ids = torch.arange(self.current_pos, self.current_pos + seq_len, device=in_idx.device, dtype=torch.long)
             self.current_pos += seq_len
@@ -236,7 +232,7 @@ def forward(self, in_idx, use_cache=False):
 
         # x = self.trf_blocks(x)
         ####################################################
-        # NEW
+        # KV cache-related
         for blk in self.trf_blocks:
             x = blk(x, use_cache=use_cache)
         ####################################################
@@ -246,42 +242,14 @@ def forward(self, in_idx, use_cache=False):
         return logits
 
     ####################################################
-    # NEW
+    # KV cache-related
    def reset_kv_cache(self):
         for blk in self.trf_blocks:
             blk.att.reset_cache()
         self.current_pos = 0
     ####################################################
 
 
-def generate_text_simple(model, idx, max_new_tokens, context_size):
-    # idx is (B, T) array of indices in the current context
-    for _ in range(max_new_tokens):
-
-        # Crop current context if it exceeds the supported context size
-        # E.g., if LLM supports only 5 tokens, and the context size is 10
-        # then only the last 5 tokens are used as context
-        idx_cond = idx[:, -context_size:]
-
-        # Get the predictions
-        with torch.no_grad():
-            logits = model(idx_cond)
-
-        # Focus only on the last time step
-        # (batch, n_token, vocab_size) becomes (batch, vocab_size)
-        logits = logits[:, -1, :]
-
-        # Get the idx of the vocab entry with the highest logits value
-        idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (batch, 1)
-
-        # Append sampled index to the running sequence
-        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)
-
-    return idx
-
-
-####################################################
-# NEW
 def generate_text_simple_cached(model, idx, max_new_tokens,
                                 context_size=None, use_cache=True):
     model.eval()
@@ -307,7 +275,6 @@ def generate_text_simple_cached(model, idx, max_new_tokens,
         idx = torch.cat([idx, next_idx], dim=1)
 
     return idx
-####################################################
 
 
 def main():
@@ -316,6 +283,7 @@ def main():
     parser.add_argument("--n_heads", type=int, default=12, help="Number of attention heads.")
     parser.add_argument("--n_layers", type=int, default=12, help="Number of transformer blocks.")
    parser.add_argument("--max_new_tokens", type=int, default=200, help="Number of tokens to generate.")
+
     args = parser.parse_args()
 
     start_context = "Hello, I am"
File renamed without changes.
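
The hunks in both `gpt_with_kv_gqa.py` and `gpt_with_kv_mha.py` revolve around the same KV-cache pattern: `cache_k`/`cache_v` are registered as non-persistent buffers, filled on the first cached forward pass, and cleared by a reset method. The following is a minimal sketch of that pattern only; the class name and the (batch, seq, heads, head_dim) layout are assumptions for illustration, and the `torch.cat` growth branch is inferred, since it lies outside the visible hunks:

```python
import torch
import torch.nn as nn

class TinyKVCache(nn.Module):
    # Illustrative sketch, not the repo's attention classes.
    def __init__(self):
        super().__init__()
        # Non-persistent buffers: start empty, excluded from state_dict
        self.register_buffer("cache_k", None, persistent=False)
        self.register_buffer("cache_v", None, persistent=False)

    def update(self, keys_new, values_new):
        if self.cache_k is None:
            # First cached forward pass: store the prompt's K/V directly
            self.cache_k, self.cache_v = keys_new, values_new
        else:
            # Later passes: append along the sequence dimension (assumed dim=1)
            self.cache_k = torch.cat([self.cache_k, keys_new], dim=1)
            self.cache_v = torch.cat([self.cache_v, values_new], dim=1)
        return self.cache_k, self.cache_v

    def reset_cache(self):
        self.cache_k, self.cache_v = None, None


cache = TinyKVCache()
k, v = cache.update(torch.randn(1, 4, 2, 8), torch.randn(1, 4, 2, 8))  # 4 prompt tokens
k, v = cache.update(torch.randn(1, 1, 2, 8), torch.randn(1, 1, 2, 8))  # 1 generated token
print(k.shape)  # torch.Size([1, 5, 2, 8])
cache.reset_cache()
```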
