Words that lower the cost of letters

2026-06-25 · win · experiment C

The question

The heart of the architecture is a concept hierarchy that grows from the bottom: letters become words, words become phrases. The first thing to prove is the first rung. Does giving the model a layer of word concepts actually help it predict the next character, compared to a flat character model? And can it do this with pure counting, no backprop?

What we tried

We took a flat character model and added a layer of word concepts above it, then conditioned the character prediction on which word it was inside. We tried it two ways: with a hand-given English lexicon, and fully unsupervised, where the words were discovered by the surprise signal from our first experiment. Everything was online association, no gradients.

What happened

The word layer helped, a lot.

model	cost (bits per char)
flat character model	2.124
+ word concepts (given lexicon)	1.653 (−22%)
+ word concepts (discovered)	1.773 (−17%)

A 22% improvement with a known lexicon, and 17% even when the words were discovered with no labels at all. And the payoff is inspectable: the discovered concept store is a Prism document full of real English words, the, a, to, and, her, was, she, not, that you can open and read.

One part did not pay off. A second layer of word-to-word context added nothing, because the character context already reached back into the previous word. A recency cache added only a hair. Both findings pointed the same way.

The lesson

A layer of word concepts lowers the cost of predicting letters by a fifth, with pure online counting and no labels, and the concepts are real English words you can read.

The first rung holds. The hierarchy is real and valuable. The two things that did not compound, word-to-word context and the cache, taught the lesson that shaped the next steps: to compound further, you need a level that reaches beyond the character window, a phrase or a topic, not more of what the character model already sees. (Discovery over-segments a little, which is the gap between the 17% and the 22%.)

Lineage

Grew from the start. This is the first rung of the hierarchy.

Led to the word-level payoff, which asked whether the hierarchy compounds higher, and constructions, which made a counted concept productive enough to predict a filler it never saw.

Thread: the hierarchy, each level helping predict the level it operates on.

‹ When voting made it worse Counting beat the neural net ›