Use the map to read, not to walk

2026-06-26 · qualified win · experiment Z

The question

We have a map of meaning. Words that sit near each other mean similar things: numbers cluster, countries cluster, modal phrases cluster. We built it twice, and twice it failed at the one thing we hoped it would do, predict the next word. The map is a map. It is not a road.

So we stopped asking it to walk and asked it to read. A counter learns the next word by tallying what followed what. But a large share of the time the exact pair you are about to see was never tallied, because most word pairs are rare. When the counter has no count, it shrugs and offers the average word. That is the gap. Can the similarity map fill it? When a context is rare or brand new, can it lend that context the statistics of its neighbors, the words it sits beside on the map?

What we tried

We built the map online, by counting, in one pass. Each word gets a signature, a running tally of the words around it. We grouped the signatures into concept clusters with leader clustering: a ripe word joins the nearest running prototype, or spawns a new one. No k-means, no gradients. Then we did it again one level up. We found phrases by surprise, the same branching-entropy signal that finds word boundaries, and gave each phrase the average signature of its words, and clustered those too. A hierarchy of similarity: words, then phrases.
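A minimal sketch of the word level, under stated assumptions. The names (build_signatures, leader_cluster), cosine similarity as the distance, the threshold, and the "ripe" cutoff min_count are all illustrative choices, not the experiment's actual settings. For brevity the sketch tallies signatures first and clusters second; the real thing does both in one pass.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_signatures(tokens, window=1):
    """Each word's signature is a running tally of the words around it."""
    sigs = {}
    for i, word in enumerate(tokens):
        sig = sigs.setdefault(word, Counter())
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                sig[tokens[j]] += 1
    return sigs

def leader_cluster(sigs, threshold=0.5, min_count=2):
    """A ripe word joins the nearest prototype, or spawns a new one."""
    prototypes = []   # running sums of member signatures
    assignment = {}   # word -> prototype index
    for word, sig in sigs.items():
        if sum(sig.values()) < min_count:   # not ripe yet (assumed rule)
            continue
        best, best_sim = None, threshold
        for idx, proto in enumerate(prototypes):
            sim = cosine(sig, proto)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            prototypes.append(Counter(sig))   # spawn a new prototype
            assignment[word] = len(prototypes) - 1
        else:
            prototypes[best].update(sig)      # fold the word into the prototype
            assignment[word] = best
    return assignment, prototypes

tokens = "one cat sat two cat sat one dog ran two dog ran".split()
assignment, prototypes = leader_cluster(build_signatures(tokens))
print(assignment["one"] == assignment["two"])   # the two number words share a cluster
```

The result depends on the order words arrive in, which is the price of doing it in one pass without k-means or gradients.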

Then we projected the map into the counter. For every word cluster we pooled the next-word counts of all its members. Now a rare context can borrow: when a word has thin direct evidence, we blend its own counts with its cluster's pooled counts, leaning harder on the cluster the rarer the word. The phrase cluster rides on top as a coarser hint when even the word is unsure. The blend weights are read straight off the counts. Nothing is trained.
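The borrowing step can be sketched as below. This is one plausible reading, not the exact formula: the weight n / (n + k) and the constant k are assumptions, chosen only because they lean harder on the cluster as a word's own count n shrinks. The names train, next_word_prob, and cluster_of are hypothetical, and the phrase level and any final fallback to the average word are left out.

```python
from collections import Counter, defaultdict

def train(tokens, cluster_of):
    """Tally next-word counts per word, then pool them per cluster."""
    word_next = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        word_next[prev][nxt] += 1
    cluster_next = defaultdict(Counter)
    for word, counts in word_next.items():
        if word in cluster_of:
            cluster_next[cluster_of[word]].update(counts)
    return word_next, cluster_next

def next_word_prob(prev, nxt, word_next, cluster_next, cluster_of, k=5.0):
    """Blend a word's own next-word estimate with its cluster's pooled one."""
    own = word_next.get(prev, Counter())
    n = sum(own.values())
    pooled = cluster_next.get(cluster_of.get(prev), Counter())
    m = sum(pooled.values())
    p_own = own[nxt] / n if n else 0.0
    if m == 0:                 # no cluster evidence: plain counts only
        return p_own
    lam = n / (n + k)          # assumed weight: rarer word, smaller lam
    return lam * p_own + (1 - lam) * pooled[nxt] / m

tokens = "three cats ran . three dogs ran . seven cats ran .".split()
cluster_of = {"three": 0, "seven": 0}   # pretend the map grouped the numbers
word_next, cluster_next = train(tokens, cluster_of)

# "seven dogs" was never tallied, so the plain counter gives it zero.
print(word_next["seven"]["dogs"])                                          # 0
print(next_word_prob("seven", "dogs", word_next, cluster_next, cluster_of))
```

In this toy corpus "seven" borrows "dogs" from "three", its neighbor on the map. Because the weight is a function of the counts alone, there is nothing to fit.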

What happened

The map held up. Numbers landed with numbers, countries with countries, and the phrases came out clean: can be, may be, would be, will be in one group; middle east, middle ages, west coast, east coast in another.

Then we sliced the test set by how much the counter already knew. The honest slice is the one where the counter is blind, where the exact next word was never counted after this context. That is a fifth of all predictions.

slice	model	perplexity
unseen context (a fifth of all predictions)	plain bigram	18,953,086
	+ word-map backoff	965,088
	+ phrase hierarchy	887,388
everything	plain bigram	1,263
	+ word-map backoff	671
	+ phrase hierarchy	662

On the blind slice the bigram is helpless. It has no count, so it offers the average word, and its perplexity is nineteen million. Lending it the cluster's pooled counts cuts that roughly twenty-fold. The phrase hierarchy shaves another 8% on top. Overall perplexity nearly halves, from 1,263 to 662, and almost all of that comes from the tail.
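The ratios can be checked directly against the table. The dictionary below just restates the six perplexities; the variable names are ours.

```python
# Perplexities copied from the table, keyed by (slice, model).
ppl = {
    ("unseen", "plain bigram"): 18_953_086,
    ("unseen", "+ word-map backoff"): 965_088,
    ("unseen", "+ phrase hierarchy"): 887_388,
    ("everything", "plain bigram"): 1_263,
    ("everything", "+ word-map backoff"): 671,
    ("everything", "+ phrase hierarchy"): 662,
}

# How many times smaller the blind-slice perplexity gets with the word map.
word_map_gain = ppl["unseen", "plain bigram"] / ppl["unseen", "+ word-map backoff"]

# Further fractional cut from adding the phrase level on top.
phrase_cut = 1 - ppl["unseen", "+ phrase hierarchy"] / ppl["unseen", "+ word-map backoff"]

# Overall perplexity with the full hierarchy, relative to the plain bigram.
overall_ratio = ppl["everything", "+ phrase hierarchy"] / ppl["everything", "plain bigram"]

print(f"{word_map_gain:.1f}x")    # 19.6x
print(f"{phrase_cut:.1%}")        # 8.1%
print(f"{overall_ratio:.2f}")     # 0.52
```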

But the map does not move the top guess. Accuracy, the single most likely next word, is flat everywhere, dead inside the noise. This is the same lesson the map taught us the first two times, now measured precisely: similarity moves where the probability sits, it does not change which word wins. It prices the tail. The local counts still pick the word.
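A toy illustration of how perplexity can fall while accuracy stays put. The two distributions are made-up numbers for a single context, not measurements from the experiment. They show only that lifting the tail leaves the winner alone.

```python
import math

def perplexity(probs):
    """Perplexity from the probabilities a model gave to the true next words."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

def top1(dist):
    """The single most likely next word."""
    return max(dist, key=dist.get)

# Hypothetical next-word distributions for one context.
counts_only = {"be": 0.90, "have": 0.0999, "go": 0.0001}   # tail nearly starved
with_map = {"be": 0.80, "have": 0.15, "go": 0.05}          # tail lifted by the cluster

# Hypothetical test continuations: two common words and one tail word.
truth = ["be", "be", "go"]

print(top1(counts_only) == top1(with_map))   # True: the top guess is unchanged
print(perplexity([counts_only[w] for w in truth]) >
      perplexity([with_map[w] for w in truth]))   # True: the tail is priced better
```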

The lesson

Use the map to read, not to walk. Projected into a counter, a similarity cluster is a backoff prior: it tells a context the counter has never seen where the probability should go, cutting unseen-context perplexity twenty-fold, without ever changing the single best guess. The hierarchy, phrase concepts above word concepts, deepens that prior a little further. The map prices the tail. The counts pick the word.

The keeper is cheap and it cannot collapse, because it is counted, never trained against the thing it helps. Carry the online word cluster, and the phrase cluster above it, as a rare-context backoff prior wired into the counter's keys. Do not expect it to raise accuracy. Expect it to stop the model from going blind the moment it meets something new. That is the axis the map was always going to win on, and the one place the first two tries never looked.

Lineage

Grew from "the map that could not walk", re-aimed at last onto the rare-context slice it was parked for, with the online concept clusters from "predicting the kind, not the word" as the representations and "counted attention" as the predictor it pours into.

Led to "the analogy in counts", which found that this same leader clustering is the wrong shape for relations: it was right for rare-context backoff, but it blurs the axes an analogy rides on.

Thread: the search for the right combiner. Not a better predictor and not a better map, but the right way to pour one representation into the other, so each does the job it is good at.
