Finding phrases the way you'd guess them

2026-06-26 · mixed · experiment M

The question

Our very first experiment found word boundaries in raw characters by watching prediction error. The natural next question: does the same trick work one level up? Feed it a stream of words, and does surprise discover phrases? Feed it a whole document, and does surprise find topic boundaries?

If it does, the same one signal carves the whole hierarchy, from letters to words to phrases to topics.

What we tried

We ran the branching-entropy signal over the word stream to find phrase cuts, with no labels. Then we ran a topic-boundary version over a Wikipedia dump and scored it against the real article boundaries in the markup.
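To make the phrase-cutting step concrete, here is a minimal sketch of a branching-entropy cutter. It is an illustration under assumptions, not the experiment's code: the function names, the bigram-only context, and the fixed `threshold` are all hypothetical choices. The idea is that a word whose followers are predictable (low entropy) sits inside a phrase, while a word whose followers vary widely (high entropy) ends one.

```python
import math
from collections import Counter, defaultdict

def branching_entropy(words):
    """Entropy in bits of the next-word distribution after each word type."""
    followers = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        followers[prev][nxt] += 1
    entropy = {}
    for prev, counts in followers.items():
        total = sum(counts.values())
        entropy[prev] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy

def cut_phrases(words, threshold=0.5):
    """Close a phrase after any word whose branching entropy reaches threshold.

    The threshold is an assumed knob for this sketch; a word never seen
    with a follower is treated as maximally uncertain.
    """
    entropy = branching_entropy(words)
    phrases, current = [], []
    for w in words:
        current.append(w)
        if entropy.get(w, float("inf")) >= threshold:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

# Toy stream: "such as" and "see also" recur, the fillers x and y vary.
stream = "such as x see also y such as y see also x such as".split()
print(cut_phrases(stream))
# ['such as', 'x', 'see also', 'y', 'such as', 'y', 'see also', 'x', 'such as']
```

On this toy stream "such" is always followed by "as", so no cut falls between them, while "as" is followed by different fillers, so a cut lands right after it. A real run would need far more data and probably a longer context than one word.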

What happened

Phrases: a clear qualitative win. Cutting the word stream by surprise recovered real units, unsupervised:

united states · such as · see also · list of · according to · to be · part of

These are genuine phrases the model was never told about. (A few markup fragments slipped in too, an artifact of a crude corpus cleaner, not the method.)

Topics: real but weak.

signal	F1 against true article boundaries
random	0.066
topic surprise	0.148

Surprise does carry topic-boundary information: its F1 is about 2.2 times the random baseline's. But it over-segments, proposing roughly twice as many boundaries as the markup contains, and the absolute score is low. The signal is there; the precision is not, yet.
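For readers who want the scoring spelled out, here is a small sketch of boundary F1. It assumes exact-position matching between predicted and true boundaries, which the write-up does not specify; the function name and the toy positions are hypothetical, and only the two F1 values in the ratio come from the table above.

```python
def boundary_f1(predicted, truth):
    """F1 of predicted boundary positions against true ones (exact match)."""
    predicted, truth = set(predicted), set(truth)
    hits = len(predicted & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)

# Toy illustration, not experiment data: every true boundary is found,
# but twice as many cuts are proposed as exist.
truth = [10, 20, 30]
predicted = [10, 15, 20, 25, 30, 35]
print(boundary_f1(predicted, truth))  # precision 0.5, recall 1.0

# The ratio quoted in the text, from the table's two scores.
print(round(0.148 / 0.066, 1))  # 2.2
```

The toy case shows why over-segmentation hurts: even with perfect recall, doubling the number of cuts halves precision and caps F1 at about 0.67.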

The lesson

The same surprise signal that finds word boundaries finds phrases one level up, on its own. Topic boundaries are harder, and need a cleaner corpus and a predictive content signal, not a bag of words.

Phrases are a real win and feed straight into the architecture. Topics are promising but immature. The payoff, when topics get solid, is concrete: a topic boundary resets the attention window, giving every higher level a segment to scope itself to. That is the same boundary that the ignition experiment wants in order to commit a fresh topic.
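As a rough picture of what "resets the attention window" could mean, the sketch below maps each token position to the start of its current topic segment. This is a hypothetical illustration: `window_start` and the idea of representing boundaries as a sorted list of start indices are assumptions, not a description of the architecture.

```python
from bisect import bisect_right

def window_start(position, boundaries):
    """Index where the window opens for `position`.

    boundaries: sorted token indices at which a new topic begins
    (an assumed representation for this sketch).
    """
    i = bisect_right(boundaries, position)
    return boundaries[i - 1] if i else 0

boundaries = [5, 12]
print([window_start(p, boundaries) for p in (3, 5, 11, 12)])
# [0, 5, 5, 12]
```

Anything before the first boundary scopes to the start of the stream; after each boundary the window opens afresh at that boundary.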

Lineage

Grew from the first boundary signal, the same surprise trick carried one level up.

Led to: a memory of change, the topic boundary that ignition wants to reset its window; the event model, where Bayesian surprise beats this entropy signal at topic boundaries; and constructions, the frames an open slot abstracts over.

Thread: surprise, the one signal carving the whole hierarchy from letters to phrases.
