Attention, but counted instead of trained

2026-06-26 · clear win · experiment S

The question

Attention is the engine of a transformer. It asks which earlier positions predict the next token, and how much. It learns that with gradient descent over queries, keys, and values.

We do not use gradient descent. So we asked: is there a count-based answer? Can you get attention from nothing but counts keyed by position, with no training loop at all?

What we tried

The idea is simple. Keep one count table per relative position. One table answers "what word tends to follow when the word one step back was X." Another answers the same for two steps back, and so on, out to eight. Each table is an expert predicting the same next word from a different earlier slot.
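A minimal sketch of the per-offset tables described above, assuming whitespace-tokenized text. The function name `build_offset_tables` and the nested-dictionary layout are illustrative choices, not the experiment's actual code.

```python
from collections import Counter, defaultdict

MAX_OFFSET = 8  # the post uses offsets one through eight


def build_offset_tables(tokens, max_offset=MAX_OFFSET):
    """Return tables[k][context_word][next_word] = count for k = 1..max_offset.

    tables[k] answers: "what word tends to follow when the word
    k steps back was X?"
    """
    tables = {k: defaultdict(Counter) for k in range(1, max_offset + 1)}
    for i, target in enumerate(tokens):
        # Only offsets that stay inside the text are counted.
        for k in range(1, min(max_offset, i) + 1):
            tables[k][tokens[i - k]][target] += 1
    return tables


# Toy corpus, purely illustrative.
tokens = "the cat sat on the mat the cat ran".split()
tables = build_offset_tables(tokens, max_offset=3)

print(tables[1]["the"])  # next-word counts when the previous word was "the"
print(tables[2]["the"])  # next-word counts when "the" was two steps back
```

Each `tables[k]` is one expert: all of them predict the same next word, but each sees only the word at its own offset.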

Then pool the experts, weighting each position by how informative it is, measured straight off the counts as information gain. No queries, no keys, no values, no gradients. Attention as a stack of counting experts plus one weight per position.
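One way the pooling could look, as a sketch under stated assumptions. Information gain is taken here to be the plug-in estimate of H(next) − H(next | word at offset k), in bits. The combiner is a weighted product of experts with add-`alpha` smoothing; the post does not spell out its combiner or smoothing, so both are assumptions, and every function name is hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_offset_tables(tokens, max_offset):
    """tables[k][context_word][next_word] = count, for k = 1..max_offset."""
    tables = {k: defaultdict(Counter) for k in range(1, max_offset + 1)}
    for i, target in enumerate(tokens):
        for k in range(1, min(max_offset, i) + 1):
            tables[k][tokens[i - k]][target] += 1
    return tables


def entropy(counter):
    """Shannon entropy in bits of the distribution implied by raw counts."""
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values() if c)


def information_gain(table):
    """H(next) - H(next | context word), read straight off one offset's counts."""
    marginal = Counter()
    for nexts in table.values():
        marginal.update(nexts)
    total = sum(marginal.values())
    if total == 0:
        return 0.0
    conditional = sum(
        sum(nexts.values()) / total * entropy(nexts) for nexts in table.values()
    )
    return entropy(marginal) - conditional


def pooled_distribution(tables, context, vocab, alpha=1.0):
    """Weighted product of experts; context[-k] is the word k steps back.

    Assumption: each expert is add-alpha smoothed, and its log-probability
    is scaled by that offset's information gain before summing.
    """
    gains = {k: information_gain(t) for k, t in tables.items()}
    log_scores = dict.fromkeys(vocab, 0.0)
    for k, table in tables.items():
        if k > len(context):
            continue  # not enough context for this offset
        nexts = table.get(context[-k], Counter())
        total = sum(nexts.values())
        for w in vocab:
            p = (nexts[w] + alpha) / (total + alpha * len(vocab))
            log_scores[w] += gains[k] * math.log(p)
    peak = max(log_scores.values())  # subtract the peak for numerical stability
    unnorm = {w: math.exp(s - peak) for w, s in log_scores.items()}
    z = sum(unnorm.values())
    return {w: v / z for w, v in unnorm.items()}


tokens = "the cat sat on the mat the cat ran".split()
tables = build_offset_tables(tokens, max_offset=2)
vocab = sorted(set(tokens))
dist = pooled_distribution(tables, ["the"], vocab)
best = max(dist, key=dist.get)
print(best, round(dist[best], 3))
```

Note that the plug-in information gain is biased upward when counts are sparse, so a real run would likely want held-out counts or a correction; this sketch ignores that.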

What happened

It beat fixed n-grams on the metric that matters: calibration, measured here as perplexity.

| model | top-1 accuracy | perplexity |
|---|---|---|
| bigram | 16.2% | 13,477 |
| trigram | 17.8% | 40,034 |
| offset-attention | 16.3% | 4,318 |

(Perplexity is how surprised the model is; lower is better.) The offset model's perplexity is about a third of the bigram's (4,318 vs 13,477) at essentially the same accuracy (16.3% vs 16.2%). The trigram had higher accuracy but far worse perplexity: a sparse trigram is confidently wrong when it leaves its training data. The offset model is the only one that stays competitive on accuracy while remaining calibrated, because it pools eight experts instead of betting everything on one rare context.
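To see why a model can win on accuracy and still lose badly on perplexity, here is a small sketch. Perplexity is the exponential of the average negative log-probability assigned to the true next word. The two probability lists are made-up illustrations, not numbers from the experiment.

```python
import math


def perplexity(probs_of_truth):
    """exp of the mean negative log-probability given to the true tokens."""
    return math.exp(-sum(math.log(p) for p in probs_of_truth) / len(probs_of_truth))


# Hypothetical model A: always hedges, giving the truth 5% every time.
hedged = [0.05] * 10

# Hypothetical model B: confident and right 8 times out of 10, but assigns
# almost no probability to the truth on the 2 contexts it has barely seen.
overconfident = [0.5] * 8 + [1e-12] * 2

ppl_hedged = perplexity(hedged)
ppl_overconfident = perplexity(overconfident)
print(ppl_hedged, ppl_overconfident)
```

A handful of near-zero probabilities dominates the average, which is the failure mode attributed to the sparse trigram above.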

Then the headline test. We scrambled the order of the context words and watched.

| model | ordered | scrambled |
|---|---|---|
| offset-attention | 16.3% | 5.8% |
| bag-of-words | 7.3% | 7.3% |

Scrambling collapsed the offset model's accuracy from 16.3% to 5.8%, so it was using word order. A bag-of-words control, built from the same counts but blind to position, did not move under scrambling, by construction. And on real ordered text the offset model more than doubled the bag's accuracy (16.3% vs 7.3%).
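The scramble test can be sketched as follows. This is a simplified stand-in: both predictors use unweighted vote sums rather than the information-gain weighting, the corpus is a toy, and all names are hypothetical. It shows why the bag control cannot move: its prediction depends only on which words are in the window, never on where they sit.

```python
import random
from collections import Counter, defaultdict


def build_offset_tables(tokens, max_offset):
    """tables[k][context_word][next_word] = count, for k = 1..max_offset."""
    tables = {k: defaultdict(Counter) for k in range(1, max_offset + 1)}
    for i, target in enumerate(tokens):
        for k in range(1, min(max_offset, i) + 1):
            tables[k][tokens[i - k]][target] += 1
    return tables


def top_vote(votes):
    """Highest count wins; ties break alphabetically so results are deterministic."""
    return min(votes, key=lambda w: (-votes[w], w)) if votes else None


def predict_offset(tables, context):
    """Position-aware: each context word is looked up in its own offset's table."""
    votes = Counter()
    for k, table in tables.items():
        if k <= len(context):
            votes.update(table.get(context[-k], Counter()))
    return top_vote(votes)


def predict_bag(tables, context):
    """Position-blind control: same counts, merged across all offsets."""
    merged = defaultdict(Counter)
    for table in tables.values():
        for word, nexts in table.items():
            merged[word].update(nexts)
    votes = Counter()
    for word in context:
        votes.update(merged.get(word, Counter()))
    return top_vote(votes)


def accuracy(predict, tables, tokens, max_offset, scramble=False, seed=0):
    """Top-1 accuracy, optionally shuffling each context window first."""
    rng = random.Random(seed)
    hits = total = 0
    for i in range(max_offset, len(tokens)):
        context = tokens[i - max_offset:i]  # slicing already makes a copy
        if scramble:
            rng.shuffle(context)
        hits += predict(tables, context) == tokens[i]
        total += 1
    return hits / total


tokens = ("the cat sat on the mat and the dog sat on the rug " * 20).split()
K = 4
tables = build_offset_tables(tokens, K)

bag_ordered = accuracy(predict_bag, tables, tokens, K)
bag_scrambled = accuracy(predict_bag, tables, tokens, K, scramble=True)
off_ordered = accuracy(predict_offset, tables, tokens, K)
off_scrambled = accuracy(predict_offset, tables, tokens, K, scramble=True)
print(bag_ordered, bag_scrambled, off_ordered, off_scrambled)
```

Here the model is scored on the text it was counted from, which is fine for showing the invariance but would overstate accuracy in a real evaluation.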

The lesson

A pure count model, with no backprop, is not a bag of words. It earns its accuracy from word position, and it crushes fixed n-grams on calibration by pooling many positions instead of one.

The information gain per position is a real, count-derived attention weight. It is sharpest at the nearest word and decays fast, then plateaus: far context never goes uninformative, it just levels off. This is the principled version of the attention we wanted, and it is the load-bearing upgrade to everything else here.

Lineage

Grew from the voting loss, which named product-of-experts as the right combiner, and from the flat fourth level, which called for a level that reaches past local context. The source mining queued it.

Led to predicting the kind, which reuses this masked, offset-keyed context; the word level of the heterogeneous stack; the core of the fair rematch; and the analogy in counts, where the offset form found that relations live in pair patterns.

Thread: the right combiner, pooling many positions instead of betting on one.