The hierarchy pays off at the right altitude

2026-06-25 · win · experiment E

The question

We had just learned a painful lesson: combining higher-level experts had failed to lower the cost of predicting the next character. The diagnosis was that character prediction saturates, so phrase and topic structure barely move it. If that diagnosis was right, the same experts should pay off handsomely when measured at the level they actually operate on. So we tested the prediction directly: predict the next word, not the next character.

What we tried

We built word-level experts: a word's own frequency, the previous word (a phrase), the previous two words (a longer phrase), and a recency cache for topic. We combined them by multiplying their opinions together, a product of experts, where each expert can quietly abstain.
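To make the combination rule concrete, here is a minimal sketch of a weighted product of experts in which an expert may abstain. It is an illustration under stated assumptions, not the experiment's code: the function name product_of_experts, the toy vocabulary, the probabilities, and the choice to use each weight as an exponent on its expert are all hypothetical. Abstention is modelled as an expert returning None, which leaves the product unchanged.

```python
import math

def product_of_experts(expert_dists, weights, vocab):
    """Combine expert opinions by a weighted product, renormalised over vocab.

    expert_dists: list of dicts mapping word -> probability (assumed > 0,
                  e.g. smoothed), or None when that expert abstains.
    weights:      one non-negative exponent per expert (assumed form).
    """
    log_scores = {w: 0.0 for w in vocab}
    for dist, weight in zip(expert_dists, weights):
        if dist is None:
            continue  # abstaining expert: multiplies every word by 1
        for w in vocab:
            log_scores[w] += weight * math.log(dist[w])
    # Subtract the max before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnorm = {w: math.exp(s - top) for w, s in log_scores.items()}
    z = sum(unnorm.values())
    return {w: v / z for w, v in unnorm.items()}

# Toy example (made-up numbers): a frequency expert, a previous-word
# expert, and a recency cache that has nothing to say and abstains.
vocab = ["cat", "dog", "the"]
frequency = {"cat": 0.2, "dog": 0.2, "the": 0.6}
previous_word = {"cat": 0.7, "dog": 0.2, "the": 0.1}
cache = None

combined = product_of_experts(
    [frequency, previous_word, cache], [1.0, 1.0, 1.0], vocab
)
print(combined)
```

In this toy case the frequency expert alone would pick "the", but the previous-word expert pulls the product toward "cat". The abstaining cache changes nothing, which is the property that lets a sparse expert sit in the product without hurting it.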

What happened

It compounded, cleanly.

measure          before   after
perplexity       476      247
bits per word    8.90     7.95

Perplexity nearly halved. And the learned combination weights told us where the value lived: almost all the mass landed on the phrase experts, the previous-word expert and the previous-two-words expert. The phrase structure that was invisible at the character level was doing real work at the word level.
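The two reported measures are the same quantity in different units: bits per word is the base-2 logarithm of perplexity. The short check below, a sketch with hypothetical helper names, converts the reported perplexities to bits. The results agree with the reported bits to within rounding, presumably because the reported perplexities are themselves rounded.

```python
import math

def bits_per_word(perplexity):
    """Average code length per word, in bits, for a given perplexity."""
    return math.log2(perplexity)

def perplexity_from_bits(bits):
    """Inverse conversion: perplexity implied by an average bit cost."""
    return 2.0 ** bits

before, after = 476, 247  # perplexities reported in this entry

bits_before = bits_per_word(before)
bits_after = bits_per_word(after)

print(f"before: {bits_before:.3f} bits per word")
print(f"after:  {bits_after:.3f} bits per word")
print(f"saved:  {bits_before - bits_after:.3f} bits per word")
print(f"perplexity ratio after/before: {after / before:.3f}")
```

Saving one full bit per word would halve perplexity exactly, so a saving of a little under one bit matches "nearly halved".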

The lesson

Each concept level helps predict the level it operates on. Word concepts help characters; phrases and topics help words. Measure at the right altitude and the hierarchy compounds.

This is the through-line of the whole project, confirmed by the cleanest experiment for it. The architecture, multi-level concepts combined by a product of experts, learned online and inspectably, works. The earlier failure was never the architecture. It was measuring a word-level idea with a character-level ruler.

Lineage

Grew from word concepts, which proved the first rung, and from the voting loss, which said to measure higher ideas at the right altitude.

Led to one part repeated, the Column that folds these experts into one part.

Thread: the hierarchy, each level helping predict the level it operates on.