One brain part, or many? We gave each level its own job
2026-06-26 · negative result · experiment X
The question
We already showed that one part wins. Take a single Column, the small unit the whole system is made of, wire it wider and deeper, and it gets better. So the model is one part repeated.
But a brain is not one part repeated. The retina filters, the cortex votes, the thalamus gates, the basal ganglia choose, the hippocampus holds episodes. Fast sensory regions run on a short clock, slow integrative regions on a long one. So we asked the opposite question. Give each level its own job, its own reach, its own timescale, then let a gate decide which level speaks per token. Does specialization beat the uniform stack?
What we tried
We built four specialized levels, each predicting the same next character so the cost is comparable end to end.
- Letters: a dense local n-gram, short window, fast.
- Words: offset-keyed count-attention, a wider reach, a middle clock.
- Phrases: branching-entropy chunks plus a trajectory of change, slower.
- Topic: a slow online topic state, broadcast down, slowest of all.
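To make the phrase level's signal concrete, here is a minimal sketch of branching entropy: the entropy, in bits, of the character that follows a given context. The function name `branching_entropy` and the toy string are assumptions for illustration, not the experiment's code. A jump in this value is the usual hint that a chunk has ended.

```python
import math
from collections import Counter

def branching_entropy(text, context):
    """Entropy in bits of the character that follows `context` in `text`."""
    followers = Counter(
        text[i + len(context)]
        for i in range(len(text) - len(context))
        if text.startswith(context, i)
    )
    total = sum(followers.values())
    if total == 0:
        return 0.0  # context never seen with a following character
    return -sum((n / total) * math.log2(n / total) for n in followers.values())

# Toy text with no spaces given (hypothetical data).
text = "thecatthedogthecow"
inside = branching_entropy(text, "th")   # always followed by "e": no uncertainty
at_edge = branching_entropy(text, "the") # followed by "c", "d", "c": uncertainty rises
```

Inside `the` the next character is fully determined, while after `the` it is not, so a boundary after `the` is a plausible chunk edge.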
Then a gate, a thalamus and basal ganglia in one: a per-token router that picks which level dominates, by each level's recent confidence and the letter level's surprise. It replaces a single fixed pooling rule.
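One way such a gate could look, as a rough sketch under stated assumptions: the letter level's "surprise" is approximated by the entropy of its predicted distribution, "recent confidence" by a running average of the probability each level gave to the character that actually came, and the gate opens when the entropy passes a threshold. The class `ConfidenceRouter`, the threshold, and the decay are all hypothetical, not the experiment's implementation.

```python
import math

def entropy_bits(dist):
    """Shannon entropy of a {char: probability} distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

class ConfidenceRouter:
    """Hypothetical winner-take-all gate over named levels."""

    def __init__(self, levels, local="letters", threshold_bits=1.0, decay=0.9):
        self.local = local                    # the level trusted by default
        self.threshold_bits = threshold_bits  # assumed knob: when the gate opens
        self.decay = decay                    # assumed memory of the running average
        self.confidence = {name: 0.0 for name in levels}

    def route(self, predictions):
        """Return the name of the level that speaks for this token."""
        if entropy_bits(predictions[self.local]) <= self.threshold_bits:
            return self.local  # the local level is sure: keep the gate closed
        # The local level is unsure: the most confident level wins outright.
        return max(self.confidence, key=self.confidence.get)

    def update(self, predictions, actual):
        """After the true character is seen, refresh each level's confidence."""
        for name, dist in predictions.items():
            hit = dist.get(actual, 0.0)
            self.confidence[name] = (
                self.decay * self.confidence[name] + (1 - self.decay) * hit
            )
```

Calling `route` before each character and `update` after it keeps the choice per token, with no fixed mixing weights to tune.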
What happened
On the headline number, specialization lost. In bits per character, lower is better:
| stack | bits per character |
|---|---|
| uniform Column (the part repeated) | 1.985 |
| letters only | 2.010 |
| full specialized stack | 2.369 |
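For readers new to the metric: bits per character is usually computed as the average negative base-2 log of the probability the model gave to each character that actually occurred. The helper below is a minimal sketch of that standard definition. The name `bits_per_character` is an assumption, and the experiment's own evaluation code is not shown here.

```python
import math

def bits_per_character(true_char_probs):
    """Mean of -log2(p) over the probabilities assigned to the true characters."""
    if not true_char_probs:
        raise ValueError("need at least one probability")
    return -sum(math.log2(p) for p in true_char_probs) / len(true_char_probs)

# Toy example: the model gave the true characters probability 0.5 and 0.25,
# which costs 1 bit and 2 bits, so the average is 1.5 bits per character.
cost = bits_per_character([0.5, 0.25])
```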
Every distal level adds cost, because as a character predictor each one is noisy. A hard guess at the next word makes a poor prior on the next letter. So we judged each piece on the axis it was built for, and there each piece won. The word level doubled the accuracy of a bag-of-words baseline and beat it decisively on calibration. The phrase level discovered 8000 chunks with no spaces given, 58% of them multi-word, real collocations. The topic level grew 143 topics online by leader clustering, with no k-means.
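Leader clustering is what lets the topic count grow online instead of being fixed in advance. A minimal sketch, assuming plain numeric vectors, Euclidean distance, and a fixed radius (all illustrative choices, as is the name `leader_cluster`): each point joins the nearest existing leader within the radius, or becomes a new leader.

```python
import math

def leader_cluster(points, radius):
    """One-pass clustering: returns (leaders, label for each point)."""
    leaders = []
    labels = []
    for p in points:
        best, best_d = None, radius
        for idx, leader in enumerate(leaders):
            d = math.dist(p, leader)
            if d <= best_d:  # nearest leader within the radius so far
                best, best_d = idx, d
        if best is None:     # nothing close enough: this point founds a new cluster
            leaders.append(p)
            best = len(leaders) - 1
        labels.append(best)
    return leaders, labels

# Toy stream (hypothetical data): two tight groups far apart.
leaders, labels = leader_cluster([(0, 0), (0.1, 0), (5, 5), (5.1, 5)], radius=1.0)
```

The number of clusters falls out of the data and the radius, which is the contrast with k-means, where k is chosen up front. The classic variant assigns to the first leader within range; this sketch picks the nearest one.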
The one clean architectural win is the gate. When the levels are unequal, how you combine them is everything:
| combiner | bits per character |
|---|---|
| static pool, equal weight | 3.274 |
| dynamic confidence router | 2.369 |
The fixed equal-weight pool lets the three weak distal levels drag down the strong local one. The dynamic router, a winner-take-all by confidence, sends 85% of tokens to the letter level and opens to the others only where the letter level is unsure. Per token, with no hand-tuned weights, it cuts the cost from 3.274 to 2.369 bits per character, a 0.9-bit gap over the static pool, though it still sits above the letter level alone at 2.010.
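The drag is easy to see on a single token. The sketch below assumes the static pool is an arithmetic mean of the levels' probabilities for the true character, which the write-up does not specify. The numbers are made up for illustration and are not measurements from the experiment.

```python
import math

def equal_weight_pool(probs):
    """Static combiner: average the levels' probabilities for the true character."""
    return sum(probs) / len(probs)

def winner_take_all(probs, confidences):
    """Dynamic combiner: use only the level with the highest confidence."""
    best = max(range(len(probs)), key=lambda i: confidences[i])
    return probs[best]

def bits(p):
    """Cost in bits of assigning probability p to the character that occurred."""
    return -math.log2(p)

# Hypothetical token: the letter level is strong, three distal levels are weak.
probs = [0.60, 0.10, 0.10, 0.10]   # probability each level gave the true character
confidences = [0.8, 0.3, 0.2, 0.1] # assumed recent confidence per level

pooled = equal_weight_pool(probs)             # 0.225: pulled toward the weak levels
routed = winner_take_all(probs, confidences)  # 0.60: the strong level speaks alone
pooled_cost, routed_cost = bits(pooled), bits(routed)  # about 2.15 vs about 0.74 bits
```

On this toy token the pool pays roughly three times what the router pays, for the same four predictions.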
The lesson
Specialization did not beat repetition on the headline number, but each piece won on the number it was built for. When you mix unequal parts, the load-bearing piece is the arbiter that decides who speaks. The brain grew a thalamus to choose between its parts.
The uniform Column is still the better character predictor. The heterogeneous stack buys things the flat metric cannot see: order-sensitivity and calibration at the word level, real units at the phrase level, online topics at the top. And it teaches the rule that matters for any mix: build the combiner first.
Lineage
Grew from one part repeated, the uniform Column this set out to beat; topic ignition, the slow broadcast topic at the top; and counted attention, the word level's engine.
Led to when the letters lie, which pours noise on this gated stack and watches the gate hand prediction up to the idea; the calibrated count, which replaces this gate's tuned threshold with a knob-free truth value; and learning the new without losing the old, whose broadcast winner is kin to this gate's routing.
Thread: the right combiner, here as the arbiter that decides which altitude speaks, and the open frontier of global coherence.