Both Qwen3-Next and Kimi Linear use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there’s one block that uses full attention, as shown in the figure below.
<imgsrc="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/01.webp"alt="Qwen3-Next versus Kimi Linear">
The delta rule part refers to computing the difference (delta, Δ) between new and existing values.
Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU activation instead of a logistic sigmoid, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.)
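As a rough sketch (the variable names and shapes below are placeholders, not the repository's actual code), the difference comes down to which activation is applied to the output gate:

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 8, 64)        # (batch, seq, emb_dim), placeholder input
h = torch.randn(2, 8, 64)        # output of the attention / DeltaNet path
W_gate = torch.nn.Linear(64, 64) # learned gate projection (placeholder)

gated_attention_out = h * torch.sigmoid(W_gate(x))  # logistic-sigmoid gate
gated_deltanet_out = h * F.silu(W_gate(x))           # SiLU gate instead
```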
In Gated DeltaNet, there's no *n*-by-*n* attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is implemented as a recurrent state update, where `S` is the state that gets updated for each time step *t*.
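To make the recurrence concrete, here is a minimal sketch of a gated delta-rule state update for a single head, processed token by token. The function and tensor names (`alpha` for the decay gate, `beta` for the write strength) are mine for illustration; the actual implementation in the repository may differ in details such as batching, multiple heads, and chunked computation.

```python
import torch

def gated_delta_rule_step(S, q, k, v, alpha, beta):
    """One token of a gated delta-rule update (single head, no batching).

    S:     (d_v, d_k) running memory state
    q, k:  (d_k,) query and key for the current token
    v:     (d_v,) value for the current token
    alpha: scalar gate/decay in (0, 1) controlling how much old memory is kept
    beta:  scalar write strength in (0, 1)
    """
    d_k = k.shape[0]
    # Decay the old memory and erase what is currently stored under key k ...
    S = alpha * S @ (torch.eye(d_k) - beta * torch.outer(k, k))
    # ... then write the new key/value association (the "delta" correction).
    S = S + beta * torch.outer(v, k)
    o = S @ q  # read from memory with the query
    return S, o

# Toy usage: process a sequence token by token
d_k = d_v = 4
S = torch.zeros(d_v, d_k)
for _ in range(6):
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    alpha, beta = torch.rand(()), torch.rand(())
    S, o = gated_delta_rule_step(S, q, k, v, alpha, beta)
```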
Note that `plot_memory_estimates_gated_deltanet.py` computes the `head_dim` as `emb_dim / n_heads`, i.e., 2048 / 16 = 128.
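In other words (using the same values as above):

```python
emb_dim, n_heads = 2048, 16
head_dim = emb_dim // n_heads
print(head_dim)  # 128
```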