Learning the new without losing the old

2026-06-26 · qualified win · experiment AE

The question

The whole bet of this project is a model that never forgets. A counter is additive: it never overwrites, so it cannot lose what it learned. We tried to show that off early, in experiment B, by streaming four registers of English past the model. There was nothing to forget. English at the character level is one distribution, and more of it only helps. The test had no teeth.

The fix that experiment called for was a genuine register shift and bounded memory. So we stream three registers that really differ: Darwin's Victorian science, Shakespeare's Early-Modern verse, and the King James Bible's archaic scripture, in one pass with no replay. And we cap the memory, so a new register's flood of n-grams forces eviction. Now the question is real. When the table overflows, what does the model keep? Forgetting is how much an earlier register's cost, in bits per character, rises after the later ones arrive.

What we tried

Two models, identical except for the eviction policy. Both count orders 1 to 5, both use the same backoff, so their peak quality is comparable. They differ only in what they throw away when the table is full.
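
For readers who want the shared part made concrete, here is a minimal sketch of a character counter over several orders with a simple backoff, plus the bits-per-character cost used throughout. The post does not specify the backoff scheme, so the add-one smoothing, the reading of "order k" as k preceding characters, and the names BackoffCounter, train and bits_per_char are all assumptions for illustration, not the project's code.

```python
import math
from collections import defaultdict

MAX_ORDER = 5  # the post counts orders 1 to 5; "order k" = k preceding chars is an assumption


class BackoffCounter:
    """Hypothetical additive character counter with longest-match backoff."""

    def __init__(self, max_order=MAX_ORDER, alphabet_size=256):
        self.max_order = max_order
        self.alphabet_size = alphabet_size
        # counts[context][next_char] -> how often next_char followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, history, ch):
        # Additive update: every order that fits in the history gets +1.
        for k in range(1, self.max_order + 1):
            if len(history) >= k:
                self.counts[history[-k:]][ch] += 1

    def prob(self, history, ch):
        # Back off from the longest context to shorter ones until one was seen.
        for k in range(min(self.max_order, len(history)), 0, -1):
            table = self.counts.get(history[-k:])
            if table:
                total = sum(table.values())
                # Add-one smoothing (an assumed choice) keeps unseen chars possible.
                return (table.get(ch, 0) + 1) / (total + self.alphabet_size)
        return 1 / self.alphabet_size  # nothing matched: uniform guess


def train(model, text):
    for i, ch in enumerate(text):
        model.update(text[max(0, i - model.max_order):i], ch)


def bits_per_char(model, text):
    total = 0.0
    for i, ch in enumerate(text):
        history = text[max(0, i - model.max_order):i]
        total -= math.log2(model.prob(history, ch))
    return total / len(text)
```

An untrained model of this kind costs 8 bits per character on any text (a uniform guess over 256 symbols); training on a register lowers the cost on that register, which is the quantity the retention matrices below report.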

The flat model is single-timescale recency: one leaky use-score per context, evict the least recently used. A new register's contexts are all fresh and used, so they wash out the old. The dual model is the brain-inspired stack: ECAN's fast salience and slow importance, CLS fast and slow stores, ART vigilance to resonate-or-spawn, LIDA broadcast to pick one winner per step. It evicts the lowest long-term importance, protecting the rare-but-important, and on each step it broadcasts only the most-specific context that recognized the input, giving that one a strong importance write while novelty earns retention slowly.
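
The difference between the two policies can be sketched on a toy table. This is a deliberately reduced illustration, not the experiment's implementation: Entry, pick_victim, observe and run are hypothetical names, the constants are made up, and ECAN, CLS, ART and LIDA are collapsed into a single rule (the longest already-known context gets a strong importance write, everything else gets a small one).

```python
from dataclasses import dataclass


@dataclass
class Entry:
    use: float = 0.0         # fast, leaky recency score
    importance: float = 0.0  # slow, long-term score (read by the dual policy only)


def pick_victim(table, policy):
    # Flat: evict the least recently used. Dual: evict the lowest importance.
    if policy == "flat":
        return min(table, key=lambda k: table[k].use)
    return min(table, key=lambda k: table[k].importance)


def observe(table, matched, capacity, policy, leak=0.9, strong=1.0, slow=0.05):
    """matched: the contexts of every order present at this step (hypothetical input)."""
    for entry in table.values():
        entry.use *= leak  # recency leaks on every step
    # Broadcast winner: the most-specific context that already recognizes the input.
    known = [c for c in matched if c in table]
    winner = max(known, key=len) if known else None
    for ctx in matched:
        if ctx not in table:  # novelty: spawn, evicting if the table is full
            if len(table) >= capacity:
                del table[pick_victim(table, policy)]
            table[ctx] = Entry()
        table[ctx].use = 1.0
        # Strong write for the winner; novelty and bystanders earn importance slowly.
        table[ctx].importance += strong if ctx == winner else slow
    return winner


def run(policy, capacity=4):
    table = {}
    for _ in range(5):            # an "old register": the same contexts recur
        observe(table, ["t", "th", "the"], capacity, policy)
    for i in range(10):           # a "new register": a flood of fresh contexts
        observe(table, [f"x{i}"], capacity, policy)
    return table
```

In this toy, run("flat") ends with only the newest contexts in the table, while run("dual") still holds "the" and its shorter prefixes, because each fresh context arrives with low importance and is the first to be evicted.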

What happened

The brain-inspired model forgets about 21 times less, and predicts better. The retention matrix reads M[after training row][evaluated on column] in bits per character; lower is better.

Flat (recency eviction)

trained through \ evaluated on	darwin	shakespeare	bible
darwin	2.522	3.435	3.254
shakespeare	2.790	2.856	3.048
bible	2.863	2.968	2.188

Dual (importance, broadcast, vigilance)

trained through \ evaluated on	darwin	shakespeare	bible
darwin	2.174	3.306	3.145
shakespeare	2.123	2.526	2.695
bible	2.172	2.550	1.965

Read the Darwin column down. The flat model learns Darwin at 2.522, then degrades it to 2.863 once Shakespeare and the Bible flood in: a third of a bit forgotten. The dual model ends Darwin at 2.172, essentially flat against its 2.174 peak. Total backward forgetting is +0.021 bits for the dual model against +0.454 for the flat one. And it wins on peak too, mean diagonal 2.22 against 2.52. Retention is not bought with quality: protecting the contexts that actually predicted well also sharpens the current register. After the full stream the dual model keeps 1,274 of Darwin's order-4 contexts against the flat model's 1,059, at the same table size. It evicts the right things.
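
The arithmetic behind those totals can be checked from the tables. The sketch below assumes backward forgetting means, for each register except the last, the final-row cost minus the cost right after that register was trained (the diagonal), summed; that definition is inferred from the numbers rather than stated in the post, and the function names are illustrative.

```python
# Retention matrices copied from the tables: row = trained through, column = evaluated on.
FLAT = [
    [2.522, 3.435, 3.254],
    [2.790, 2.856, 3.048],
    [2.863, 2.968, 2.188],
]
DUAL = [
    [2.174, 3.306, 3.145],
    [2.123, 2.526, 2.695],
    [2.172, 2.550, 1.965],
]


def backward_forgetting(m):
    # Assumed definition: sum over earlier registers of (final cost - peak cost).
    last = len(m) - 1
    return sum(m[last][j] - m[j][j] for j in range(last))


def mean_diagonal(m):
    # Peak quality: cost on each register right after training on it.
    return sum(m[j][j] for j in range(len(m))) / len(m)


if __name__ == "__main__":
    for name, m in (("flat", FLAT), ("dual", DUAL)):
        print(name, round(backward_forgetting(m), 3), round(mean_diagonal(m), 2))
```

From the rounded table entries this lands within 0.001 of the reported +0.454 and +0.021, which is consistent with the totals having been computed before rounding.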

The load-bearing piece was almost missed. The first dual model wrote importance to the highest-salience context, which means the short generic ones win, and it retained worse than the flat baseline (peak 3.05, forgetting +0.92). Broadcasting to generic contexts starves the rare specific ones that carry a register's identity. Switching the broadcast winner to the most-specific context that recognized the input, ART resonance, concentrated importance on exactly the predictive high-order n-grams and flipped the result. The idea looked dead at step one. The win came two steps in, on a dial we were not headlining.
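
The two broadcast rules differ by one line, which is easy to see in a sketch. The salience values and context strings below are invented for illustration, and the function names are hypothetical; the point is only where the strong importance write ends up under each rule.

```python
def winner_by_salience(recognized):
    # First attempt: the highest-salience context wins (short generic ones tend to).
    return max(recognized, key=lambda ctx: recognized[ctx])


def winner_by_specificity(recognized):
    # The fix: the most-specific (longest) context that recognized the input wins.
    return max(recognized, key=len)


def accumulate(stream, pick, strong=1.0):
    """Give one strong importance write per step to whichever context `pick` selects."""
    importance = {}
    for recognized in stream:
        winner = pick(recognized)
        importance[winner] = importance.get(winner, 0.0) + strong
    return importance


# Two steps; each maps recognized context -> made-up salience.
STREAM = [
    {"e": 9.0, "he": 4.0, "the": 1.5, " the": 0.4},
    {"e": 9.0, "ee": 2.0, "hee": 0.8, "thee": 0.3},
]

if __name__ == "__main__":
    print(accumulate(STREAM, winner_by_salience))
    print(accumulate(STREAM, winner_by_specificity))
```

Under the salience rule all of the importance piles onto the generic "e", which every register shares; under the specificity rule it goes to " the" and "thee", the kind of high-order contexts that distinguish one register from another.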

The lesson

Under a real register shift and a memory budget, a flat recency cache forgets and the brain-inspired policy does not, by a factor of about 21, while also predicting better at the peak. The differentiator is real once the test has something to forget. The mechanism that mattered was not the leaky timescales alone. It was ART resonance: reinforce the most-specific context that recognized the input, not the most-salient generic one. The specific n-grams carry a register's identity, and protecting them is what defeats forgetting.

The honest scope: this is a bounded-memory phenomenon. With an uncapped table both models are non-forgetting, because counts are additive, and they are identical. But bounded memory is the only regime a lifelong learner ever lives in. The keeper for the cortex is the eviction policy: keep by long-term importance, write importance to the specific context that resonated, and let novelty earn its place over time.

Lineage

Grew from the counter that beat the neural net, whose forgetting test had nothing to forget and which this experiment finally gives teeth; from the gate, whose confidence routing is kin to the broadcast winner; and from predicting the kind, not the word, whose specific-context clusters are what ART learns to protect.

Thread: online learning and non-forgetting. The promise that a counter learns while it lives and does not forget, held up at last against a stream built to break it. The ideas are ECAN's two-timescale importance, complementary learning systems, ART vigilance, and LIDA broadcast.
