Mathias Leonhardt

Posted on Jun 4 • Originally published at ki-mathias.de

Transformer Attention Is Hopfield's 1982 Update Rule (And What That Tells Us About LLM Memory)

#machinelearning #ai #neuralnetworks #math

Hopfield's associative-memory equation from 1982 and the scaled dot-product attention from Vaswani 2017 are the same operation. One substitution turns one into the other. The 2024 Nobel Prize in Physics — to Hopfield and Hinton — is the academic acknowledgement that the mathematics behind today's LLMs was already written four decades ago, in a different vocabulary.

This is a condensed write-up of the longer, interactive piece at ki-mathias.de/en/hopfield.html. Seven chapters there, five live MNIST demos. Here I focus on the four steps where the story has interesting empirical edges.

1. The identity

Modern Hopfield (Ramsauer et al., 2020) writes the update rule as

v ← X · softmax(β · Xᵀv)

where X ∈ ℝ^(N×p) is the matrix of stored patterns and β > 0 is an inverse-temperature parameter.

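Scaled dot-product attention (Vaswani et al., 2017), written in the same column convention (the paper itself uses the transposed row form softmax(QKᵀ/√dₖ)·V), reads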

Attention(Q, K, V) = V · softmax(Kᵀ Q / √dₖ)

Set Q = v, K = X, V = X, and β = 1/√dₖ. The two equations become identical. Not analogous. Identical. Same operation, written in two different notations.

In a Transformer, K and V are independent learned projections of the same input rather than the same matrix, and Q is yet another projection. Those are extra learnable transformations around the Hopfield core; the softmax-weighted lookup in the middle is unchanged.
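The substitution can be checked numerically. The following is a minimal NumPy sketch, not code from the original piece: the function names `hopfield_update` and `attention` and the random test data are assumptions made for illustration, and the learned projections of a real Transformer are deliberately left out.

```python
import numpy as np

def softmax(z, axis=0):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hopfield_update(X, v, beta):
    """One Modern Hopfield step: v <- X softmax(beta * X^T v)."""
    return X @ softmax(beta * (X.T @ v))

def attention(Q, K, V):
    """Column-convention attention: V softmax(K^T Q / sqrt(d_k))."""
    d_k = K.shape[0]
    return V @ softmax(K.T @ Q / np.sqrt(d_k), axis=0)

rng = np.random.default_rng(0)
N, p = 16, 5                        # hypothetical sizes: N dimensions, p stored patterns
X = rng.standard_normal((N, p))     # stored patterns as columns
v = rng.standard_normal(N)          # query / current state

out_hopfield = hopfield_update(X, v, beta=1.0 / np.sqrt(N))
out_attention = attention(Q=v, K=X, V=X)
print(np.allclose(out_hopfield, out_attention))
```

With K = V = X and β = 1/√N the two functions compute the same vector up to floating-point rounding.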

Krotov & Hopfield (2016) had already worked out the dense associative memory generalisation that lifts storage capacity far beyond the classical linear limit (polynomial in N in their paper; Demircigil et al., 2017, pushed it to exponential). Vaswani 2017 reached the same equation by iterating on machine-translation benchmarks. Ramsauer 2020 noticed they were the same. The independent rediscovery is itself diagnostic: it suggests the structure is less a design choice than a consequence of the requirements.

2. Why classical Hopfield breaks on MNIST (and why that's not a bug)

The original 1982 recall rule is

vᵢ ← sign(Σⱼ Wᵢⱼ · vⱼ)        # W = (1/N) Σₘ ξₘ ξₘᵀ,  Wᵢᵢ = 0   (index m runs over stored patterns)

This is the Hebb construction. Store ten MNIST digits, query each with 15 % pixel noise, observe what comes back.

Result: all ten queries collapse into the same end-state — an image that isn't visually any of the stored digits. Mean pairwise similarity between the ten "recalls": 0.99.

This is fully explained by the spectrum of W_Hebb. The eigenvalues are roughly

λ₁ ≈ 6.65,   λ₂ ≈ 0.65,   λ₃ ≈ 0.48,   ...

A factor-of-ten gap between λ₁ and the rest. The top eigenvector is essentially ξ̄ = (1/p) Σₘ ξₘ, the per-pixel mean — cosine 0.9999.

The Hebb rule is provably correct only under two conditions:

Pairwise orthogonality of stored patterns.
Zero-mean patterns.

MNIST digits violate both: pairwise inner products are 400–600 out of 784 (a normalised overlap of roughly 0.5–0.77; for ±1 pixels that means any two digits agree on some 75–88 % of their pixels), and mean pixel values are −0.63 to −0.90 (much more "background" than "ink"). The failure is therefore not an implementation bug; it's the construction operating outside its range of validity. Centring the patterns kills the bias sink but reveals the next defect: the v → −v symmetry of E(v) = −½vᵀWv causes recalls to land on negations of stored patterns.
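The collapse can be reproduced qualitatively without MNIST. The sketch below is an illustration under stated assumptions, not the experiment from the post: it replaces the digits with synthetic ±1 patterns in which each pixel is "background" with probability 0.85, and the helper names `recall` and `add_noise` are hypothetical. The exact numbers will differ from the MNIST figures quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 400, 10

# Assumption: synthetic stand-in for MNIST. Each pixel is -1 ("background")
# with probability 0.85, so patterns are neither zero-mean nor orthogonal.
X = np.where(rng.random((N, p)) < 0.85, -1.0, 1.0)   # columns = patterns

# Hebb construction: W = (1/N) sum_m xi_m xi_m^T with zero diagonal.
W = (X @ X.T) / N
np.fill_diagonal(W, 0.0)

# Spectrum: compare the top eigenvector with the per-pixel mean pattern.
eigvals, eigvecs = np.linalg.eigh(W)                 # eigenvalues in ascending order
top = eigvecs[:, -1]                                 # unit-norm top eigenvector
mean_pattern = X.mean(axis=1)
cos_top_mean = abs(top @ mean_pattern) / np.linalg.norm(mean_pattern)

def recall(W, v, sweeps=20):
    """Synchronous sign updates for a fixed number of sweeps."""
    for _ in range(sweeps):
        v = np.where(W @ v >= 0, 1.0, -1.0)
    return v

def add_noise(v, frac, rng):
    """Flip a fraction `frac` of the entries of v."""
    v = v.copy()
    idx = rng.choice(v.size, size=int(frac * v.size), replace=False)
    v[idx] *= -1
    return v

recalls = np.stack([recall(W, add_noise(X[:, m], 0.15, rng)) for m in range(p)])
sim = (recalls @ recalls.T) / N                      # pairwise overlaps in [-1, 1]
mean_pairwise = sim[np.triu_indices(p, k=1)].mean()

print(f"top two eigenvalues: {eigvals[-1]:.2f}, {eigvals[-2]:.2f}")
print(f"cosine(top eigenvector, mean pattern): {cos_top_mean:.4f}")
print(f"mean pairwise overlap of the ten recalls: {mean_pairwise:.2f}")
```

On this toy data the same two symptoms show up: one dominant eigenvalue whose eigenvector is close to the mean pattern, and recalls that are nearly indistinguishable from one another.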

The didactic point: a learning rule is correct or incorrect relative to a data geometry. "Hebb is broken" is not a sentence. "Hebb is broken on MNIST" is.

3. The pseudoinverse fix and its capacity cliff

The Personnaz–Guyon–Dreyfus construction (1985) keeps the same recall machinery but builds W differently:

W_PI = X (XᵀX)⁻¹ Xᵀ

The factor (XᵀX)⁻¹ is exactly what's missing in Hebb — the inverse of the pattern-pattern Gram matrix. It removes correlations between stored patterns before the matrix becomes the energy landscape. For orthogonal patterns the two rules coincide; for correlated ones, only W_PI carries the algebraic guarantee

W_PI · ξₚ = ξₚ              # every stored pattern is a fixed point with eigenvalue 1
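The fixed-point guarantee is easy to verify on correlated data. This is a minimal sketch assuming synthetic biased ±1 patterns rather than MNIST; the variable names are illustrative, and the Hebb matrix keeps its diagonal here only so that the two matrices are directly comparable.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 20

# Assumption: correlated synthetic patterns (85 % of pixels are -1), not MNIST.
X = np.where(rng.random((N, p)) < 0.85, -1.0, 1.0)

# Pseudoinverse rule: W_PI = X (X^T X)^{-1} X^T, solved without forming the inverse.
W_pi = X @ np.linalg.solve(X.T @ X, X.T)

# Hebb rule for comparison (diagonal kept so both matrices act on X the same way).
W_hebb = (X @ X.T) / N

pi_residual = np.abs(W_pi @ X - X).max()      # close to 0: stored patterns are eigenvectors
hebb_residual = np.abs(W_hebb @ X - X).max()  # large when patterns are correlated

print(f"max |W_PI X - X|   = {pi_residual:.2e}")
print(f"max |W_Hebb X - X| = {hebb_residual:.2e}")
```

The pseudoinverse residual sits at rounding-error level, while the Hebb residual is dominated by the cross-talk between correlated patterns.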

Empirical capacity on MNIST, p stored patterns, 10 % pixel noise, fraction of queries that recover the original:

p	Hebb	Pseudoinverse
10	0 %	100 %
100	0 %	100 %
150	0 %	97 %
200	0 %	32 %
250	0 %	1 %
300	0 %	0 %

The table shows a sharp phase transition between p ≈ 150 and p ≈ 250, far below the algebraic ceiling p = N = 784, beyond which the Gram matrix XᵀX is necessarily singular (and at which, for linearly independent patterns, W_PI degenerates to the identity). The identity W_PI ξₚ = ξₚ holds throughout, but the basin of attraction around each fixed point shrinks as the patterns crowd one another, and 10 % noise overshoots the basin once p exceeds ~150.
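The gap between "fixed point exists" and "fixed point is reachable from a noisy query" can be sketched on toy data. The code below is an assumption-laden illustration: it uses unbiased random ±1 patterns with N = 100 instead of MNIST, keeps the diagonal of W_PI, and the helper names are hypothetical. The location of the cliff will therefore not match the table above; only the qualitative shape is expected to carry over.

```python
import numpy as np

def pseudoinverse_weights(X):
    """W_PI = X (X^T X)^{-1} X^T for a full-column-rank pattern matrix X."""
    return X @ np.linalg.solve(X.T @ X, X.T)

def recall(W, v, sweeps=10):
    """Synchronous sign updates for a fixed number of sweeps."""
    for _ in range(sweeps):
        v = np.where(W @ v >= 0, 1.0, -1.0)
    return v

def recovery_fraction(N, p, noise, rng):
    """Fraction of stored patterns recovered exactly from a noisy query."""
    X = rng.choice([-1.0, 1.0], size=(N, p))   # assumption: unbiased random patterns
    W = pseudoinverse_weights(X)
    n_flip = int(noise * N)
    hits = 0
    for m in range(p):
        query = X[:, m].copy()
        flip = rng.choice(N, size=n_flip, replace=False)
        query[flip] *= -1                      # flip `noise` fraction of the pixels
        hits += np.array_equal(recall(W, query), X[:, m])
    return hits / p

rng = np.random.default_rng(0)
N = 100
results = {p: recovery_fraction(N, p, noise=0.10, rng=rng) for p in (5, 20, 40, 60, 90)}
for p, frac in results.items():
    print(f"p = {p:3d}  recovered = {frac:.0%}")
```

Every stored pattern remains an exact fixed point for all p in this sweep; what changes is whether a 10 % corrupted query still falls inside its basin.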

Side note for readers who came in via the Eigenvalues post: the operator X(XᵀX)⁻¹Xᵀ is exactly the hat matrix of ridge regression with λ = 0, i.e. ordinary least squares: the orthogonal projector onto the span of the stored patterns. The Hopfield update with this W is therefore a non-linear filter (the sign) built on top of an ordinary projection. The capacity cliff is what unregularised projection does as the Gram matrix approaches singularity.

4. Modern Hopfield = Attention (the move that fixes capacity)

Stop iterating sign(Wv). Replace it with the soft, input-dependent

v ← X · softmax(β · Xᵀv)

Three structural changes happen at once:

Component	Classical (1982/1985)	Modern (Ramsauer 2020)
Operator	fixed `W ∈ ℝ^(N×N)`	none — direct softmax-lookup on X
Update	linear in v + sign	non-linear (softmax in v)
Energy	quadratic `-½ vᵀWv`	log-sum-exp + `½‖v‖²` (Lyapunov)
Convergence	iterative, many sweeps	one step (for sufficiently large β)
Capacity	dynamically ≪ N	`Ω(exp(N))` — exponential in N
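The Energy row can be sanity-checked numerically. The sketch below assumes the energy in the form E(v) = −(1/β)·log Σₘ exp(β·xₘᵀv) + ½‖v‖² with additive constants dropped, and uses random Gaussian patterns; function names and sizes are illustrative only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def energy(X, v, beta):
    """E(v) = -(1/beta) * logsumexp(beta * X^T v) + 0.5 * ||v||^2 (constants dropped)."""
    z = beta * (X.T @ v)
    lse = z.max() + np.log(np.exp(z - z.max()).sum())   # stable log-sum-exp
    return -lse / beta + 0.5 * (v @ v)

def update(X, v, beta):
    """One Modern Hopfield step: v <- X softmax(beta * X^T v)."""
    return X @ softmax(beta * (X.T @ v))

rng = np.random.default_rng(0)
N, p, beta = 32, 8, 2.0             # hypothetical sizes and inverse temperature
X = rng.standard_normal((N, p))
v = rng.standard_normal(N)

energies = [energy(X, v, beta)]
for _ in range(5):
    v = update(X, v, beta)
    energies.append(energy(X, v, beta))

print(np.round(energies, 4))
```

The printed sequence should be non-increasing, which is what makes this energy a Lyapunov function for the update.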

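The exponential capacity is the practical reason this works for LLMs at all: with N = 768 (a typical embedding dimension), the number of patterns that can be stored and still retrieved grows exponentially in N, so memory capacity stops being the binding constraint on context length. Compare N = 784 (MNIST), where the classical pseudoinverse rule breaks down beyond p ≈ 150 on real data.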

And the parameter β is interpretable. At small β, the softmax is near-uniform and the recall is a soft average of all stored patterns. At large β, it concentrates on the single best match, and Modern Hopfield converges to 1-nearest-neighbour. Ramsauer's analysis of Transformer heads finds early layers operating mostly at low effective β (global averaging) and deeper layers at higher effective β, in metastable states that attend sharply to a small subset of tokens. The classical "attention is mysterious" complaint dissolves into a continuous interpolation between two known operations.
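The two limits of β can be seen in a few lines. This is a minimal sketch on random ±1 patterns; the specific β values, the sizes, and the name `retrieve` are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def retrieve(X, v, beta):
    """One Modern Hopfield step: v <- X softmax(beta * X^T v)."""
    return X @ softmax(beta * (X.T @ v))

rng = np.random.default_rng(0)
N, p = 64, 6
X = rng.choice([-1.0, 1.0], size=(N, p))   # assumption: random +/-1 patterns as columns
query = X[:, 2].copy()
query[:6] *= -1                            # corrupt the first 6 entries of pattern 2

low = retrieve(X, query, beta=1e-4)        # near-uniform softmax weights
high = retrieve(X, query, beta=10.0)       # near one-hot softmax weights

low_is_mean = np.allclose(low, X.mean(axis=1), atol=1e-2)
high_is_pattern = np.allclose(high, X[:, 2], atol=1e-6)
print("low beta  ~ mean of all patterns:", low_is_mean)
print("high beta ~ stored pattern 2:   ", high_is_pattern)
```

Intermediate β values interpolate between these two outputs, which is the continuum the paragraph above describes.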

5. The surprise: same learning rule, different geometry, totally different outcome

The interesting finding from Negri, Tudisco, Lucibello et al. 2024 — Random Features Hopfield Networks generalize retrieval to previously unseen examples — is not "we made Hopfield better." It's the opposite:

The exact same learning rule that scores 65 % accuracy on MNIST (i.e., barely matches 1-NN, no real generalisation) achieves perfect generalisation — magnetisation 1.0 on unseen test patterns — when the data is built as a sparse mixture of a small set of random features.

Setup: let F ∈ {-1,+1}^(N×D) be a random feature matrix. Each pattern is ξ = sign(F · c) with c an L-sparse binary coefficient vector. Three sets share the same F:

Train (p stored patterns)
Features (the D feature columns of F — never stored)
Test (new patterns from the same distribution, never stored)

Sweep α = p/N and measure the magnetisation of each set. Three phases appear in order:

Storage (α small): only train patterns are stable attractors.
Learning (α medium): train magnetisation drops, features magnetisation rises — the network has begun to recognise the components it was implicitly trained on.
Generalisation (α large): test patterns become attractors too — without ever having been stored.

With the pseudoinverse rule this last transition is a hard jump to magnetisation 1.0, and the math explains why: once the trained patterns span enough of the feature mixtures, every feature mixture becomes an eigenvector of W_PI with eigenvalue 1 — by the same identity that made stored patterns fixed points.
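The setup can be sketched as follows. This is a toy illustration, not a reproduction of the paper's phase diagram: the sizes N, D, L are hypothetical, L is kept odd so that F·c is never zero, the weights are built as the orthogonal projector onto the span of the stored patterns (which equals W_PI when the patterns are linearly independent), and magnetisation is measured by starting each pattern from itself without noise. With this rule the train column stays at 1 by construction, since stored patterns are exact fixed points.

```python
import numpy as np

def sample_patterns(F, L, n, rng):
    """Draw n patterns xi = sign(F c) with c an L-sparse binary vector (L odd)."""
    N, D = F.shape
    out = np.empty((N, n))
    for k in range(n):
        active = rng.choice(D, size=L, replace=False)   # support of c
        out[:, k] = np.sign(F[:, active].sum(axis=1))
    return out

def projector(X, tol=1e-8):
    """Orthogonal projector onto span(X); tolerates duplicate or dependent columns."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    U = U[:, s > tol * s.max()]
    return U @ U.T

def magnetisation(W, patterns, sweeps=10):
    """Mean overlap between each pattern and the state reached when starting from it."""
    v = patterns.copy()
    for _ in range(sweeps):
        v = np.where(W @ v >= 0, 1.0, -1.0)
    return (v * patterns).mean()

rng = np.random.default_rng(0)
N, D, L = 200, 10, 3                                    # hypothetical sizes
F = rng.choice([-1.0, 1.0], size=(N, D))                # shared random feature matrix
test_set = sample_patterns(F, L, 50, rng)               # never stored

results = {}
for p in (5, 20, 80, 160):
    train = sample_patterns(F, L, p, rng)
    W = projector(train)
    results[p] = (magnetisation(W, train), magnetisation(W, F), magnetisation(W, test_set))
    print(f"alpha = {p / N:.2f}  train = {results[p][0]:.2f}  "
          f"features = {results[p][1]:.2f}  test = {results[p][2]:.2f}")
```

Sweeping p over a finer grid and averaging over several draws of F would be the natural next step before reading anything quantitative into the test column.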

The takeaway is not subtle: generalisation is a property of the data geometry, not of the learning rule. A textbook claim that "this learning rule generalises better" is well-typed only relative to a class of data. The reason language models generalise so well isn't that the attention mechanism has a special "ability" — it's that natural language already has the sparse compositional structure that makes Hopfield-style retrieval transfer beyond the training set. Words and constructions are a finite set of components; sentences are sparse mixtures. Hopfield-friendly by accident of biology.

6. What runs on this mathematics today

A non-exhaustive list, with the empirical claim each item is making:

Every Transformer attention layer. Modern Hopfield is what's there. Quite plausibly one of the most-executed mathematical operations in global compute, by raw volume.
MHNfs (Klambauer/Hochreiter, Linz) — few-shot drug discovery, 100k+ context molecules as memory, SOTA on FS-Mol.
DeepRC (Widrich et al., NeurIPS 2020) — multi-instance learning over ~10⁶ immune-repertoire sequences, used for SARS-CoV-2 classification.
Memristor Hopfield chips (HP Labs Nature Electronics 2020; Peking U. Nature Comms 2024) — analogue MAX-CUT solvers, ~4 orders of magnitude energy advantage over digital. The Peking paper proves mathematical equivalence to a Hopfield attractor network, not just analogy.
PyTorch drop-in: ml-jku/hopfield-layers — Hopfield, HopfieldPooling, HopfieldLayer modules, swap-in replacements for LSTM / pooling / attention.

7. What this does not say

It does not claim that every modern ML algorithm is "secretly a Hopfield network." The identity is precise between Modern Hopfield and scaled dot-product attention. Diffusion models, state-space models, ConvNets have different mathematical structures.
The generalisation result in §5 is for synthetic feature-mixture data, not for MNIST. To transfer it to real data you first need to extract a feature basis (PCA, dictionary learning, learned embeddings); i.e. you have to engineer the right kind of sparse architecture, because the data won't give it to you for free.
Classical Hopfield is not "back" as a general-purpose ML tool. ConvNets/diffusion/Transformers are better for almost every benchmark task. The Hopfield reading earns its keep when memory is the actual problem — few-shot, multi-instance, episodic recall, combinatorial optimisation on dedicated hardware.

Top comments (2)

Vic Chen • Jun 4

Really enjoyed this. The Hopfield-to-attention equivalence is one of those ideas that instantly upgrades how you think about LLM memory.

The part that landed for me most was your point that the Hebb failure on MNIST is about data geometry, not a broken rule. As someone building AI products, I think that framing matters a lot: many "model failures" are really mismatches between the update rule and the structure of the data.

Also appreciated the pseudoinverse section—it's a clean reminder that capacity claims only mean something with the retrieval assumptions attached.

Mathias Leonhardt • Jun 5

Thanks, Vic. Your translation to product work nails it: "rule problem vs. data-geometry problem" is a diagnostic most monitoring stacks don't separate. Both kinds of degradation get bucketed as "accuracy drop", and the engineering response (more data, different model, different objective) gets routed by reflex rather than diagnosis. Half the time the answer is "your rule is fine, your data shifted into a geometry the rule doesn't fit", and adding data doesn't help.

The capacity-with-retrieval-assumptions point you flagged is the one I find myself repeating most. Once you see that the bound depends on what kind of recall you want (exact fixed point vs. basin radius for a given noise level), a lot of capacity claims in the literature stop looking like contradictions and start looking like apples-and-oranges. The pseudoinverse capacity table in §3 was the cleanest way I found to make that visible: same matrix, same construction, same fixed-point algebra, but the dynamical number where things stop working sits a factor of 5 below the algebraic ceiling, and depends only on how much noise you'll accept at the input.