The level that reaches past the last few words

2026-06-25 · clarifying · experiment K

The question

The model is built from levels: characters, then words, then phrases, stacked up. The obvious move is to keep stacking. So we asked the plain question. Does adding a fourth and fifth level keep paying off, once there is enough data to feed them?

What we tried

We built three, four, and five-level versions of the cortex and trained each across growing amounts of text. The fourth level was a longer word context. The fifth was a recency cache that tracks the current topic.

What happened

More data helped a lot. More fixed levels did not.

change	effect
10 MB to 40 MB of data	cost fell from 1.94 to 1.77
add a 4th level (longer word context)	flat: 1.770 to 1.767
add a 5th level (topic cache)	small constant gain, slightly hurt phrase coherence

Data was the lever. The fourth level, a longer stretch of local word context, saturated just as the character level had: once the nearby context is used up, more of the same kind buys nothing. The fifth level, the topic cache, gave a small steady gain to the raw cost and lowered overfitting, but the edge was a constant, it did not grow with data, and it nudged phrase coherence the wrong way. The topic cache joined the graveyard, parked for a later resurrection once it can be scoped to a discovered topic segment instead of a fixed window.

The lesson

The payoff is not more fixed local levels. It is a level that reaches past the local context the lower levels already cover.

This is the result that aimed the rest of the work. Stacking more n-gram levels is a dead end, because each one saturates the moment it overlaps the level below. What is needed is a different kind of level: attention that pulls from far-back positions, and topics that span whole segments. That conclusion sent us straight to the attention and ignition experiments. (A slow per-position evaluation here was the bottleneck, fixed with batched evaluation in the gigabyte work.)

Lineage

Grew from one part repeated, the fast Column that made levels at scale reachable.

Led to the source mining and the experiments it queued: counted attention, a vote that remembers, topic ignition, and predicting the kind. This flat result is the fork that sent the work past fixed local levels.

Threads: scale, and global coherence, the open frontier these later swings aim at.

‹ Finding phrases by surprise One part, wired bigger ›