Building the dials that bits-per-char hides
2026-06-25 · win · experiments G and H
The question
We kept reporting one number, bits-per-char, the cost of predicting the next character. An earlier experiment had already burned us: higher-level concepts barely moved that number even when they were clearly helping. The problem was the ruler, not the model. So we stopped and built better rulers.
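For reference, here is a minimal sketch of how a bits-per-char figure can be computed, assuming we already have the probability the model assigned to each true next character. The function name `bits_per_char` and the list-of-probabilities input are illustrative assumptions, not the code used in these experiments.

```python
import math


def bits_per_char(probs):
    """Average cost, in bits, of predicting each true next character.

    `probs` is a hypothetical input: the probability the model assigned
    to the character that actually came next, one entry per position.
    """
    if not probs:
        raise ValueError("need at least one probability")
    # Each character costs -log2(p) bits; average over all positions.
    return -sum(math.log2(p) for p in probs) / len(probs)


# A model that gives every true character probability 0.5
# pays exactly one bit per character.
print(bits_per_char([0.5, 0.5, 0.5, 0.5]))  # 1.0
```

The average is taken per character, so it says nothing about whether those characters add up to real words or sensible phrases.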
What we tried
We built three dials. Overfitting: the gap between cost on training text and on held-out text. Real-word rate: what fraction of the model's generated output is actual English words. Phrase coherence: what fraction of generated word-pairs are plausible. Then we scored three models with them: characters only, characters plus discovered words, and characters plus phrases.
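The three dials can be sketched roughly as follows. This is an illustration under stated assumptions, not the scoring code from the experiments: it assumes a "real word" is one found in a reference lexicon, a "plausible" word-pair is one found in a set of attested pairs, text is split on whitespace, and the gap is held-out cost minus training cost. All function and variable names are hypothetical.

```python
def overfit_gap(train_bpc, heldout_bpc):
    """Held-out cost minus training cost, in bits per character (assumed sign)."""
    return heldout_bpc - train_bpc


def real_word_rate(generated_text, lexicon):
    """Fraction of whitespace-separated tokens found in the lexicon."""
    words = generated_text.lower().split()
    if not words:
        return 0.0
    return sum(w in lexicon for w in words) / len(words)


def phrase_coherence(generated_text, plausible_pairs):
    """Fraction of adjacent word-pairs found in the set of plausible pairs."""
    words = generated_text.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    return sum(p in plausible_pairs for p in pairs) / len(pairs)


# Toy data, made up for illustration only.
lexicon = {"the", "cat", "sat", "on", "mat"}
plausible_pairs = {("the", "cat"), ("cat", "sat"), ("sat", "on"), ("on", "the")}
sample = "the cat sat on the mxt"

print(overfit_gap(1.0, 1.5))                               # 0.5
print(round(real_word_rate(sample, lexicon), 3))           # 0.833
print(round(phrase_coherence(sample, plausible_pairs), 3)) # 0.8
```

In the toy sample, one misspelled token lowers the real-word rate and also breaks the pair it belongs to, which is why the two dials are related but not interchangeable.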
What happened
Each level bought exactly what it modeled, and the dials showed it where bits-per-char could not.
| model | overfit gap | real-word % | phrase coherence |
|---|---|---|---|
| characters only | +0.53 | 77% | 57% |
| + word concepts | +0.32 | 89% | 46% |
| + phrases | | 100% | 82% |
Word concepts cut the overfit gap by about 40%, from +0.53 to +0.32. The lexicon acts as a regularizer, a brake on memorizing. They also pushed the real-word rate from 77% to 89%. Phrases then lifted phrase coherence to 82%. The work was being done all along. Bits-per-char simply could not see it.
And the dials drew a hard line none of the earlier numbers had. Every model, even the best, still produced locally plausible word salad. None of them had global coherence: a passage that holds together across sentences.
The lesson
Each level buys what it models: word concepts buy generalization and word-validity, phrases buy phrase-coherence. And none of them yet buys global coherence.
This experiment did two things. It validated the architecture with real metrics instead of one misleading number. And it named the frontier precisely: global, discourse-level coherence, the dial that is still flat. From here on, that is the dial we are trying to move, and now we can detect the moment something moves it.
Lineage
Grew from the voting loss, where one number hid the work the higher levels were doing.
Led to topic ignition, the first direct swing at the global-coherence dial this scorecard named.
Thread: global coherence, the open frontier, made measurable here.