
Commit 488bef7

Image resizing
1 parent c6b8332 commit 488bef7

File tree

1 file changed


ch04/08_deltanet/README.md

Lines changed: 4 additions & 4 deletions
@@ -4,7 +4,7 @@ Recently, [Qwen3-Next](https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9c

Both Qwen3-Next and Kimi Linear use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there’s one block that uses full attention, as shown in the figure below.

-<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/01.webp" alt="Qwen3-Next versus Kimi Linear" style="zoom:47%;" />
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/01.webp" alt="Qwen3-Next versus Kimi Linear">
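To make the 3:1 interleaving concrete, here is a minimal, hypothetical sketch of such a layer schedule (the function and block names are placeholders, not code from this repository): with every fourth block using full attention, a stack of transformer blocks ends up with three Gated DeltaNet blocks per full-attention block.

```python
def block_schedule(n_layers, full_attn_every=4):
    # Every `full_attn_every`-th block uses full attention; the rest use the
    # linear Gated DeltaNet variant. The default of 4 gives the 3:1 ratio
    # described above.
    return [
        "full_attention" if (i + 1) % full_attn_every == 0 else "gated_deltanet"
        for i in range(n_layers)
    ]

print(block_schedule(8))
# ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention',
#  'gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention']
```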

@@ -125,7 +125,7 @@ The delta rule part refers to computing the difference (delta, Δ) between new a

Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of a logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.)

-<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/02.webp" alt="Gated DeltaNet" style="zoom:47%;" />
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/02.webp" alt="Gated DeltaNet" width=500px>

However, as shown in the figure above, the "gated" in the Gated DeltaNet also refers to several additional gates:
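The SiLU output gate mentioned in this excerpt can be sketched in a few lines. This is a simplified illustration, not the repository's implementation; the tensor shapes and the `gate_proj` layer are assumptions. The gate value is computed from the block input with a linear projection and a SiLU activation and is multiplied elementwise with the DeltaNet output; a gated-attention block would use a logistic sigmoid in the same place.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
b, num_tokens, d = 2, 6, 64               # assumed batch size, sequence length, model dim
x = torch.randn(b, num_tokens, d)         # block input
context = torch.randn(b, num_tokens, d)   # output of the (linear) attention part

gate_proj = nn.Linear(d, d)               # assumed gate projection

# SiLU output gate (Gated DeltaNet style); gated attention would typically
# use torch.sigmoid(gate_proj(x)) instead.
gated = F.silu(gate_proj(x)) * context
print(gated.shape)  # torch.Size([2, 6, 64])
```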

@@ -271,7 +271,7 @@ context = context.reshape(b, num_tokens, self.d_out)

-<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/03.webp" alt="Quadratic attention" style="zoom:67%;" />
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/03.webp" alt="Quadratic attention" width=500px />

In Gated DeltaNet, there's no *n*-by-*n* attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is implemented as a recurrent update, where `S` is the state that gets updated for each time step *t*.
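For intuition about the running state `S`, the following is a heavily simplified delta-rule recurrence for a single head (the gates, decay, and normalization discussed in the README are omitted; all names and shapes are assumptions rather than the repository's code). At each step, the state is corrected by the difference between the new value and the value the current memory predicts for the incoming key, and the output is read from the updated state with the query.

```python
import torch

def delta_rule_recurrence(Q, K, V, beta):
    # Q, K, V: (num_tokens, head_dim); beta: (num_tokens,) update strengths.
    # S is the running state ("memory") updated once per time step t;
    # no n-by-n attention matrix is ever materialized.
    num_tokens, head_dim = K.shape
    S = torch.zeros(head_dim, head_dim)
    outputs = []
    for t in range(num_tokens):
        k_t, v_t, q_t, b_t = K[t], V[t], Q[t], beta[t]
        v_pred = S @ k_t                        # what the current memory retrieves for key k_t
        delta = v_t - v_pred                    # delta rule: new value minus predicted value
        S = S + b_t * torch.outer(delta, k_t)   # write the correction into the state
        outputs.append(S @ q_t)                 # read from the updated state with the query
    return torch.stack(outputs)

out = delta_rule_recurrence(
    torch.randn(6, 8), torch.randn(6, 8), torch.randn(6, 8), torch.rand(6)
)
print(out.shape)  # torch.Size([6, 8])
```

Because only `S` (a `head_dim`-by-`head_dim` matrix) is carried across time steps, the memory needed per head stays constant in the sequence length, in contrast to the quadratic attention matrix shown in the figure above.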

@@ -353,4 +353,4 @@ uv run plot_memory_estimates_gated_deltanet.py \

Note that the above computes the `head_dim` as `emb_dim / n_heads`, i.e., 2048 / 16 = 128.

-<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/plot.webp" alt="Gated DeltaNet scaling" style="zoom:47%;" />
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/plot.webp" alt="Gated DeltaNet scaling" width=500px>
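Spelled out, the `head_dim` arithmetic from the note above (using the same 2048-dimensional embedding and 16 heads as in the example):

```python
emb_dim = 2048
n_heads = 16
head_dim = emb_dim // n_heads  # 2048 / 16
print(head_dim)  # 128
```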
