More data helps, all the way to a gigabyte

2026-06-26 · win · experiments N and O

The question

Two practical questions, joined at the hip. First: does more data keep helping, or does a simple counting model stop improving once it has seen enough? Second: if more data does help, can we afford it? A pure Python loop over a gigabyte of text is painfully slow.

These matter because the data-hungry questions ahead, like whether deeper levels pay off at scale, are only worth asking if we can run them at all.

What we tried

For scale, we fed one character column the whole of a one-gigabyte Wikipedia dump, 827 million characters, and measured the cost at growing data sizes. For speed, we built the same column three ways: the original sorted-count version, a denser histogram version, and a version that runs on the Mac's GPU through Metal. Same model, same answer, three engines.

What happened

More data helps, all the way to a gigabyte.

data	cost (bits per char)
10 MB	1.997
100 MB	1.821
300 MB	1.773
822 MB	1.744

It keeps falling. The gains shrink near the end, because an order-five character model is nearing its own ceiling and wants a wider one, but the data axis itself was never the problem. An earlier result that "a deeper level didn't pay off" turned out to be a 2 MB data-starvation artifact, not a real wall.

The GPU made the gigabyte cheap.

engine	learn time (100 MB)
sorted counts	11.2 s
dense histogram	5.45 s
GPU through Metal	0.33 s

Two wins stack here. The dense histogram beats the sorted version twofold on the processor, and it is what the GPU wants. The GPU's scatter-add then delivers roughly 25 to 50 times on top. A column that took 156 seconds now takes 2 to 3. A gigabyte learns in well under a minute.

The lesson

The data axis was never the bottleneck, and now neither is compute. Capacity, more context, more voting columns, is effectively free.

This pair clears the runway. The earlier "more levels stopped paying" results all need re-running at scale, because they were measured on a corpus too small to tell. And the GPU's bigger payoff is still ahead of us, in the voting and attention experiments, where many experts get gathered and reduced in parallel. We used MLX, which is Metal under the hood, the low-risk path. A hand-written kernel waits only if MLX ever runs out of room.

Lineage

Grew from the scaling law, which said scale pays, and one part repeated, the Column this run feeds and speeds up.

Led to the depth re-run at scale, now affordable, and the fast substrate every later experiment rides on.

Thread: scale. Data was never the bottleneck, and now neither is compute.

‹ Meaning is a map, not a road Finding phrases by surprise ›