The counter beat the neural net

2026-06-25 · win · experiment B

The question

The single most defensible claim in our vision is online learning without forgetting: learning from every sentence as it arrives, without a dense network smearing old knowledge under new. A transformer cannot do this. So we hunted for it first. Does a sparse, online, locally-updated learner avoid the catastrophic forgetting that sinks a dense gradient network, on text, at the same capacity?

What we tried

We streamed four distinct registers of English in sequence, Austen, then Shakespeare, then Darwin, then the King James Bible, one pass each, no going back. After each, we tested how well the model still predicted every register. We ran three learners: a dense network trained by gradient descent, our sparse design, and a plain counting model.

What happened

The forgetting we went looking for did not appear, and the reason was instructive. Four registers of English, at the character level, are nearly one stationary distribution. There was almost nothing to forget. Austen's cost even improved after the later registers, positive transfer, not interference. The test needs real task shift, which English-only-at-the-character-level does not provide.

But a different result fell out, and it was the important one.

learner	cost (bits per char)
dense gradient network	~3.5
sparse gradient network	~3.5
plain online counter	2.3

The plain counting model beat both gradient networks, and it is online, never-forgetting, and needs no backprop by construction.

The lesson

The associative counter is a better predictor than the gradient nets, and it is online and non-forgetting for free. The north star and the better predictor turned out to be the same thing.

This is the result that chose our substrate. We stopped building gradient networks and committed to sparse-associative counting. The forgetting claim itself remains untested, because English-only gives nothing to forget. It waits for a setup with real task shift, and stays one of the live candidates for a genuine breakthrough.

Lineage

Grew from the start. This is where the substrate was chosen.

Led to everything downstream. The counting substrate grounds the online-only rule that every later experiment holds to.

Thread: the substrate. Counts beat gradients, and counts are online and non-forgetting for free.

‹ Words that lower the cost of letters Where one word ends ›