How big a brain the data wants

2026-06-25 · win · experiment F

The question

Before adding any new machinery, we wanted to understand the machine we had. Two knobs control it: how much text it reads, and how much capacity it has. So we asked how prediction scales with each, and how they interact.

What we tried

We took a 100 MB English corpus and ran the counting model across a grid: training data scaled up by factors from one to ninety, and capacity from small (short context) to large (long context). Every cell of the grid was scored on the same held-out test slice.
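
A minimal sketch of what such a sweep could look like, assuming a character-level counting model whose capacity is its context length. The function names, the add-alpha smoothing, and the bits-per-character score are illustrative assumptions, not the code or metric used in the experiment.

```python
import math
from collections import Counter, defaultdict

def train_counts(text, order):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts

def bits_per_char(counts, text, order, vocab_size, alpha=1.0):
    """Average held-out code length in bits per character.

    Add-alpha smoothing (an assumption of this sketch) keeps unseen
    characters and unseen contexts from getting zero probability.
    """
    total_bits = 0.0
    n = 0
    for i in range(order, len(text)):
        ctx = counts.get(text[i - order:i])
        seen = ctx[text[i]] if ctx else 0
        ctx_total = sum(ctx.values()) if ctx else 0
        p = (seen + alpha) / (ctx_total + alpha * vocab_size)
        total_bits -= math.log2(p)
        n += 1
    return total_bits / n

def sweep(train, test, sizes, orders):
    """Score every (data size, context length) cell on one test slice."""
    vocab_size = len(set(train) | set(test))
    grid = {}
    for size in sizes:
        for order in orders:
            counts = train_counts(train[:size], order)
            grid[(size, order)] = bits_per_char(counts, test, order, vocab_size)
    return grid
```

Calling `sweep(train_text, test_text, sizes, orders)` returns a dictionary keyed by `(size, order)`, so one row of the grid shows how a fixed capacity responds to more data and one column shows how a fixed amount of data responds to more capacity.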

What happened

A small model saturates early. A large one keeps learning, but only once it has enough data.

capacity    gain from 90 times more data
small       −0.057 (barely moves)
large       −0.587, still falling on the last step

The small model was capacity-bound: ninety times the data bought almost nothing. The large model kept improving and had not flattened even at the end of the grid.

And the two knobs are coupled. The large model actually overfit at 1 MB, doing worse than a medium one, but won clearly at 3 MB and beyond. The best capacity is not fixed. It grows with the data.
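
One way to read this coupling off a results grid is to pick the winning capacity separately at each data size. The helper below is a hypothetical sketch; the grid layout and the name `best_capacity_by_size` are assumptions, and the numbers in the usage example are made-up placeholders, not measurements from this experiment.

```python
def best_capacity_by_size(grid):
    """Return the lowest-loss capacity for each data size.

    `grid` maps (data_size, capacity) -> held-out loss, lower is better.
    """
    best = {}
    for (size, capacity), loss in grid.items():
        if size not in best or loss < best[size][1]:
            best[size] = (capacity, loss)
    return {size: capacity for size, (capacity, _) in sorted(best.items())}

# Placeholder losses that only mimic the pattern described above:
# the large model loses at the smallest size and wins at the next one.
toy_grid = {
    (1, "medium"): 2.0, (1, "large"): 2.1,
    (3, "medium"): 1.9, (3, "large"): 1.8,
}
winners = best_capacity_by_size(toy_grid)
```

If the best capacity were fixed, every data size would map to the same winner; a winner that changes as the size grows is the coupling described here.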

The lesson

The right size of the model grows with the size of the corpus. With enough data, a bigger model keeps paying; with too little, it overfits.

This was a steering result. It said the cheapest next wins would come from scale (more data and more capacity), not from clever new mechanisms. And it set up a tension we would resolve later: with little data, extra capacity looks useless, which is exactly the trap that made an early "deeper level didn't help" result look like a wall when it was really just starvation. The fix was to run at a gigabyte, where this law predicts the bigger model should win.

Lineage

Grew from the plain question of how the machine we had scales, asked before adding any new part.

Led to one part repeated, a good fast base to scale, and the gigabyte run that tested this law at full size.

Thread: scale. Data was not the problem; capacity and speed were.