Predicting the kind, not the word
2026-06-26 · qualified result · experiment U
The question
A transformer predicts the next word. LeCun's bet is that you should not. Hide part of the input and predict the hidden part's representation instead, the abstract class of a word rather than its exact spelling. The promise is that the class is dense where the word is sparse. A rare word you have never counted still belongs to a category you have seen a thousand times. Add sparsity so the representation cannot collapse to a trivial answer.
We do not train with gradient descent. A gradient-trained JEPA (joint-embedding predictive architecture) needs a stop-gradient, a teacher, or a variance penalty to keep its representation from collapsing. So we asked: can the count world get representation-space (rep-space) prediction, and does it pay?
What we tried
We built the representation by counting. Each word gets a signature: a hashed, frequency-discounted count of the words around it. We grouped the signatures into 400 concept clusters in a single streaming pass with leader clustering: a ripe word joins the nearest running prototype, or spawns a new one. No k-means, no iteration.
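A minimal sketch of the two steps above, under stated assumptions. The signature width `DIM`, the context `window`, the `1 / log(2 + freq)` discount, the cosine similarity, the join `threshold`, and the treatment of 400 as a hard cap are all illustrative choices, not details given by the experiment. What the sketch does preserve is the shape of the method: signatures are pure counts, and clustering is one pass in which each word either joins the nearest running prototype or spawns a new one.

```python
import math
import zlib
from collections import Counter

DIM = 64  # hypothetical signature width


def signatures(tokens, window=2, dim=DIM):
    """Hashed, frequency-discounted counts of each word's neighbours."""
    freq = Counter(tokens)
    sigs = {}
    for i, word in enumerate(tokens):
        vec = sigs.setdefault(word, [0.0] * dim)
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            neighbour = tokens[j]
            bucket = zlib.crc32(neighbour.encode("utf-8")) % dim
            # Assumed discount: frequent neighbours contribute less.
            vec[bucket] += 1.0 / math.log(2.0 + freq[neighbour])
    return sigs


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def leader_cluster(sigs, threshold=0.8, max_clusters=400):
    """Single pass: join the nearest prototype, or spawn a new one."""
    prototypes = []  # running mean signature per cluster
    sizes = []       # number of words absorbed per cluster
    assignment = {}
    for word, vec in sigs.items():
        best, best_sim = -1, -1.0
        for c, proto in enumerate(prototypes):
            sim = cosine(vec, proto)
            if sim > best_sim:
                best, best_sim = c, sim
        full = len(prototypes) >= max_clusters
        if best >= 0 and (best_sim >= threshold or full):
            sizes[best] += 1
            n = sizes[best]
            # Incremental mean keeps the prototype "running".
            prototypes[best] = [p + (x - p) / n
                                for p, x in zip(prototypes[best], vec)]
        else:
            prototypes.append(list(vec))
            sizes.append(1)
            best = len(prototypes) - 1
        assignment[word] = best
    return assignment, prototypes
```

Nothing here is trained against a predictor: the prototypes move only because words are absorbed into them, which is why there is no objective that a trivial constant representation could satisfy.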
Then we masked a word and predicted it two ways from the same counted context. The token head predicts the exact word (input space). The cluster head predicts the word's concept cluster (representation space). Same evidence, two targets. We also swept how sparse the context code is, from a single active cluster up to dense.
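To make "same evidence, two targets" concrete, here is a toy sketch. The class name `CountedHeads`, the tuple used as a context key, and the four-word vocabulary are hypothetical; the experiment's real context comes from its offset attention, which this sketch does not model. Both heads read one shared context and differ only in what they count against it.

```python
from collections import Counter, defaultdict


class CountedHeads:
    """Two count tables keyed by the same context: words and clusters."""

    def __init__(self, word_to_cluster):
        self.word_to_cluster = word_to_cluster
        self.token_counts = defaultdict(Counter)
        self.cluster_counts = defaultdict(Counter)

    def observe(self, context, word):
        self.token_counts[context][word] += 1
        self.cluster_counts[context][self.word_to_cluster[word]] += 1

    def predict_token(self, context):
        counts = self.token_counts.get(context)
        return counts.most_common(1)[0][0] if counts else None

    def predict_cluster(self, context):
        counts = self.cluster_counts.get(context)
        return counts.most_common(1)[0][0] if counts else None


# Toy clusters (hypothetical): 0 = numbers, 1 = function words.
word_to_cluster = {"three": 0, "seven": 0, "forty": 0, "the": 1}
heads = CountedHeads(word_to_cluster)
for word in ["three", "three", "seven", "the"]:
    heads.observe(("it", "costs"), word)

masked = "forty"  # a rare word never counted in this context
token_hit = heads.predict_token(("it", "costs")) == masked
cluster_hit = heads.predict_cluster(("it", "costs")) == word_to_cluster[masked]
print(token_hit, cluster_hit)  # False True
```

The token head can only return a word it has already counted, so it misses "forty"; the cluster head has counted the number class three times in this context and gets the class right.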
What happened
The clusters came out real but coarse. Numbers land together, function words land together, topics cling loosely. Because they are counted and never trained against the predictor, they cannot collapse. We ran rep-space prediction with none of JEPA's collapse machinery and the clusters stayed meaningful.
On its own metric the representation head towers over the token head.
| head | accuracy overall | accuracy on rare words |
|---|---|---|
| token (predict the word) | 6.7% | 0.0% |
| cluster (predict the class) | 65.4% | 11.2% |
| always guess the biggest cluster | 63.5% | — |
Read the overall column honestly and most of that 65% is a single trick: one giant function-word cluster is the answer so often that always guessing it scores 63.5%. The real signal from context is +1.9 points.
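The check behind that reading is a majority-class baseline: score the head against always guessing the most frequent cluster, and report the difference. A small sketch follows; the function names and the toy labels are illustrative, and only the two percentages in `reported_lift` come from the table.

```python
from collections import Counter


def majority_baseline(true_labels):
    """Label and accuracy of always guessing the most frequent label."""
    label, count = Counter(true_labels).most_common(1)[0]
    return label, count / len(true_labels)


def accuracy(predictions, true_labels):
    hits = sum(p == t for p, t in zip(predictions, true_labels))
    return hits / len(true_labels)


def lift_over_majority(predictions, true_labels):
    """How much the head adds beyond the majority guess."""
    _, base = majority_baseline(true_labels)
    return accuracy(predictions, true_labels) - base


# Toy data: "F" is a dominant function-word cluster, "N" a content cluster.
truth = ["F"] * 6 + ["N"] * 4
preds = ["F"] * 6 + ["N", "N", "F", "F"]
toy_lift = lift_over_majority(preds, truth)  # 0.8 accuracy vs 0.6 baseline

# Figures from the table above, in percentage points.
reported_lift = round(65.4 - 63.5, 1)  # 1.9
```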
The rare-word column is where the bet pays. The exact rare word was never counted, so the token head scores a flat zero. Its class was counted plenty, so the cluster head reaches 11.2%, about forty-four times chance if chance is taken as a uniform guess over the 400 clusters (0.25%). Predicting the kind succeeds exactly where predicting the word is hopeless. But converting the class back into a word does not help: routing the token choice through the predicted cluster scored 6.3%, below the 6.7% of predicting the token directly. The latent tells you the kind, not the word.
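One way routing can lose to the direct head, sketched under an assumption: the experiment does not say how the routing was done, so the hard two-step choice below (pick the best cluster, then the best word inside it) is a guess at the mechanism, and the counts are invented for illustration. When the most frequent word sits outside the most frequent cluster, routing is forced onto a weaker word.

```python
from collections import Counter


def direct_token(token_counts):
    """Pick the single most counted word."""
    return token_counts.most_common(1)[0][0]


def routed_token(token_counts, word_to_cluster):
    """Pick the most counted cluster first, then the best word inside it."""
    cluster_counts = Counter()
    for word, n in token_counts.items():
        cluster_counts[word_to_cluster[word]] += n
    best_cluster = cluster_counts.most_common(1)[0][0]
    inside = Counter({w: n for w, n in token_counts.items()
                      if word_to_cluster[w] == best_cluster})
    return inside.most_common(1)[0][0]


# Invented counts for one context: 7 observations in total.
counts = Counter({"the": 3, "cat": 2, "dog": 1, "mat": 1})
clusters = {"the": "function", "cat": "noun", "dog": "noun", "mat": "noun"}

direct = direct_token(counts)            # "the": right 3 times out of 7
routed = routed_token(counts, clusters)  # "cat": right 2 times out of 7
print(direct, routed)  # the cat
```

The noun cluster wins the cluster vote with 4 counts against 3, yet no single noun beats "the". The class prediction is correct more often than any word prediction, and still cannot be cashed out as a better word.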
Sparsity was the sharper win. A single active context cluster beat the dense code.
| context code | test accuracy | perplexity |
|---|---|---|
| k = 1 (most sparse) | 66.3% | 6.9 |
| dense | 63.5% | 103.7 |
One active cluster is +2.7 points (2.8 on the rounded table figures) and fifteen times sharper in perplexity, 6.9 against 103.7. More active clusters dilute the prediction. The mechanism is anti-dilution, not anti-overfitting: the gap is sharpness, not variance.
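A toy illustration of dilution, with assumptions flagged: the helper names, the three-cluster code, and the per-cluster "expert" distributions are invented, and the prediction is modelled as a code-weighted average of per-cluster distributions, which may not match the experiment's predictor. Keeping only the top cluster stops the weaker clusters from flattening the output, so the probability on the true target rises and perplexity falls.

```python
import math


def top_k_code(code, k):
    """Keep the k largest activations, zero the rest, renormalise to sum 1.

    Assumes at least one kept activation is positive.
    """
    order = sorted(range(len(code)), key=lambda i: code[i], reverse=True)
    keep = set(order[:k])
    total = sum(code[i] for i in keep)
    return [code[i] / total if i in keep else 0.0 for i in range(len(code))]


def mix(code, experts):
    """Prediction as a code-weighted average of per-cluster distributions."""
    n_targets = len(experts[0])
    return [sum(w * e[t] for w, e in zip(code, experts))
            for t in range(n_targets)]


def perplexity(probs_of_truth):
    """exp of the mean negative log-probability given to the true targets."""
    nll = -sum(math.log(p) for p in probs_of_truth) / len(probs_of_truth)
    return math.exp(nll)


code = [0.5, 0.3, 0.2]          # dense context code over three clusters
experts = [[0.90, 0.05, 0.05],  # each cluster's own sharp prediction
           [0.05, 0.90, 0.05],
           [0.05, 0.05, 0.90]]

dense = mix(code, experts)                  # true target 0 gets 0.475
sparse = mix(top_k_code(code, 1), experts)  # true target 0 gets 0.9

# Ratio of the perplexities in the table above.
reported_ratio = round(103.7 / 6.9, 1)  # 15.0
```

In this toy the top cluster happens to be the right one. When it is wrong, a k = 1 code is confidently wrong, so the sketch shows the sharpness effect only, not the accuracy gain.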
The lesson
Counted representations cannot collapse, so we get rep-space targets with none of JEPA's collapse-prevention machinery. The latent predicts a rare word's class where the token counts predict nothing, but it does not sharpen the word itself. Sparsity is the clearer lever, and the sharpest code is a single cluster.
The honest verdict is modest. Rep-space prediction is real and grounded, and on rare words it does the one thing token counting cannot. The overall edge is small, and the robustness JEPA promises came from the ensemble of context experts, not from the choice of target. The keeper for the cortex is cheap: carry an online, counted, collapse-free concept cluster as an auxiliary class, feed the predictor a single-cluster code, and keep the exact word in input space. The class is for the tail. The word stays counted.
Lineage
Grew from the offset attention it reuses for the masked context, and from the sparsity and representation ideas the source mining surfaced.
Led to the open follow-up, a cleaner online latent clustered over content words only, and to the experiments that reuse these online concept clusters: the event model as the slots it tracks, constructions as the filler categories an open slot routes through, and learning the new without losing the old as the specific contexts ART learns to protect.
Thread: it sits on the source-mining branch (R, S, T, U) that grew out of the flat fourth level, the result that sent us looking past fixed local levels.