What the model thinks is happening

2026-06-26 · qualified win · experiment AC

The question

A reader knows when one article ends and the next begins. Not from a marker on the page, but from a jolt: the topic shifts, and the whole sense of what comes next is rewritten. We wanted that jolt as a signal the model could read.

Per-token surprise is the obvious candidate, and it is the wrong one. Surprise is −log P(actual word), so it spikes on any hard word, and every long content word inside an article lights it up. The boundary signal drowns in within-topic noise. Zacks calls the real thing event segmentation, and Kumar's 2023 work names the right measure: not how surprising the word was, but how much the belief moved. That is Bayesian surprise, KL(P_t ‖ P_{t−1}): the divergence of the next-word belief after the word, P_t, from the belief before it, P_{t−1}. A rare word that was already expected barely moves the distribution. The first word of a new article reshapes it. We asked whether that signal finds real article boundaries on real text, and whether a persistent sense of "what event we are in" helps predict the next word.
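The contrast between the two measures can be sketched on a toy vocabulary. This is an illustration only: the four-word beliefs below are invented numbers, and the function names are hypothetical, not the experiment's code. Surprisal is read off the belief held before the word; Bayesian surprise compares the belief after the word with the belief before it.

```python
import math

def surprisal_bits(before, word):
    """Per-token surprise: -log2 P(word) under the belief held before the word."""
    return -math.log2(before[word])

def bayesian_surprise_bits(after, before, eps=1e-12):
    """KL(after || before) in bits. eps guards against zero mass in `before`
    (an assumption of this sketch, not a detail from the experiment)."""
    total = 0.0
    for w, p in after.items():
        if p > 0.0:
            q = max(before.get(w, 0.0), eps)
            total += p * math.log2(p / q)
    return total

# Invented next-word beliefs over a four-word vocabulary.
before = {"cell": 0.50, "protein": 0.39, "treaty": 0.10, "mitochondrion": 0.01}
# After a hard but on-topic word, the belief barely moves.
after_hard_word = {"cell": 0.48, "protein": 0.41, "treaty": 0.10, "mitochondrion": 0.01}
# After the first word of a new topic, the belief is rewritten.
after_topic_turn = {"cell": 0.05, "protein": 0.05, "treaty": 0.85, "mitochondrion": 0.05}

hard = (surprisal_bits(before, "mitochondrion"),
        bayesian_surprise_bits(after_hard_word, before))
turn = (surprisal_bits(before, "treaty"),
        bayesian_surprise_bits(after_topic_turn, before))

print("hard word : surprisal %.2f bits, KL %.4f bits" % hard)
print("topic turn: surprisal %.2f bits, KL %.4f bits" % turn)
```

In this toy case the hard word costs more bits than the topic-turning word, yet its KL is far smaller, which is the separation the paragraph above describes.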

What we tried

We streamed 36 MB of enwik9, 5.18 million words, with 4,598 real <page> boundaries as ground truth, about one every 1,126 words. At every step we measured the KL between the top-k next-word belief before and after the token, turned it into a z-score with a leaky running mean and variance, and fired a boundary when it spiked. We compared this signal head to head with per-token surprisal and with the branching-entropy signal that finds word boundaries. We also carried a persistent event slot, a leaky profile over the active concept clusters, archived it on each boundary, and mixed it into the predictor as a topic prior. Everything online: the clusters are leader-clustered, the normalizer is a leaky accumulator, the predictor is a word trigram counter.
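The normalizer and gate described above can be sketched as follows. This is a minimal sketch under stated assumptions: the class name LeakyZGate, the decay of 0.99, the threshold of 4.0, and the 50-step warm-up are all hypothetical values chosen for illustration, since the experiment's actual settings are not given here.

```python
import math

class LeakyZGate:
    """Leaky running mean and variance; fires when the z-score spikes.
    decay, threshold and warmup are illustrative assumptions."""

    def __init__(self, decay=0.99, threshold=4.0, warmup=50):
        self.decay = decay
        self.threshold = threshold
        self.warmup = warmup
        self.mean = 0.0
        self.var = 0.0
        self.n = 0

    def update(self, x):
        """Score x against the statistics seen so far, then absorb x."""
        if self.n == 0:
            # Seed the mean with the first sample; nothing to score yet.
            self.mean = x
            self.n = 1
            return 0.0, False
        z = 0.0
        if self.n >= self.warmup and self.var > 0.0:
            z = (x - self.mean) / math.sqrt(self.var)
        # Exponentially weighted (leaky) mean and variance update.
        delta = x - self.mean
        alpha = 1.0 - self.decay
        self.mean += alpha * delta
        self.var = self.decay * (self.var + alpha * delta * delta)
        self.n += 1
        return z, z > self.threshold

# Synthetic stream: small wobble, then one large jolt at the end.
gate = LeakyZGate()
stream = [1.0 if i % 2 == 0 else 1.2 for i in range(200)] + [10.0]
fired = [gate.update(x)[1] for x in stream]
print("boundaries fired at:", [i for i, f in enumerate(fired) if f])
```

Scoring each value before absorbing it keeps a jolt from inflating its own baseline. A high threshold gives the behavior reported later in this write-up: very few firings, only on the largest jolts.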

What happened

Bayesian surprise finds the article boundaries. The other two do not. All three signals were thresholded to fire at the same rate, so precision, recall, and F1 coincide and each cell below is that single score.

tolerance	KL (Bayesian surprise)	surprisal	branching-entropy
±10 words	0.091	0.003	0.000
±25 words	0.154	0.027	0.001
±50 words	0.156	0.093	0.012

At ±25 words KL beats surprisal by a factor of 5.7 and branching-entropy by a factor of about 120. The lead also grew with scale: on a 3 MB pilot KL scored 0.099, and on the full 36 MB it scored 0.154 while the others stayed on the floor. Branching-entropy fires on syntactic fan-out, which is uniform across articles, so it is useless for topic boundaries. The Kumar-2023 claim reproduces on real multi-article text.
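One way to compute a score of this kind is sketched below. It is an assumption about the scoring, not the experiment's code: it keeps as many top-scoring positions as there are gold boundaries, then matches each prediction to at most one gold boundary within the tolerance window. With equal counts, precision, recall, and F1 are the same number. The function name and the toy data are hypothetical.

```python
import bisect

def matched_rate_f1(scores, gold, tolerance):
    """Keep the len(gold) highest-scoring positions, then match each one
    to at most one unused gold boundary within +/- tolerance words."""
    n = len(gold)
    if n == 0:
        return 0.0
    by_score = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    predicted = sorted(by_score[:n])
    gold_sorted = sorted(gold)
    used = [False] * n
    hits = 0
    for p in predicted:
        # First gold boundary that could fall inside the window of p.
        j = bisect.bisect_left(gold_sorted, p - tolerance)
        while j < n and gold_sorted[j] <= p + tolerance:
            if not used[j]:
                used[j] = True
                hits += 1
                break
            j += 1
    # Predicted count equals gold count, so precision = recall = F1.
    return hits / n

# Toy stream of 100 words with two gold boundaries and two signal peaks.
scores = [0.0] * 100
scores[22] = 5.0   # near the gold boundary at 20
scores[90] = 4.0   # far from the gold boundary at 60
gold = [20, 60]

for tol in (1, 5, 30):
    print("tolerance", tol, "score", matched_rate_f1(scores, gold, tol))
```

A real signal is peaky, so taking raw top-n positions can select neighbors of the same spike; a production scorer would likely pick local peaks first. That step is left out of this sketch.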

The honest negative is what KL is not. As a hard segmenter at our firing threshold it is precision-only: it fired 20 cuts in 5.18 million words, about a tenth of them real, covering well under one percent of the gold boundaries. The z-gate is tuned to fire on the largest belief jolts, so it catches a handful of unmistakable shifts and ignores the rest. The win is KL as a ranked signal, not KL as a set of fired events. Matching the article rate would need a far lower threshold, and that trades away the precision.

The persistent event slot helped prediction, but only where promised.

slice	without slot (bpw)	with slot (bpw)	gain
backoff slice (0.9% of words)	11.723	11.580	+0.143 bpw
overall	10.7828	10.7816	+0.0012 bpw

On the backoff slice, the words where the local trigram and bigram are both blank, the active topic saves 0.143 bits per word. But at 36 MB the local tables cover so much that the backoff slice is only 0.9% of words, so the overall gain is near-invisible. A gentle prior beat an aggressive one: over-trusting the topic erases half the gain. This is the same law experiment T found by a different road. There the global topic helped only on the word-backoff slice. Here a topic discovered from belief jolts rather than a recency histogram lands in the same place with the same shape. Top-down priors pay off only where local context has run out.
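The mixing step and the dilution arithmetic can be sketched as follows. The toy distributions, the function name backoff_prob, and the mixing weight of 0.2 are assumptions for illustration; only the 0.9% slice share and the 0.143 bpw slice gain come from the table above.

```python
import math

def backoff_prob(word, unigram, topic, weight=0.2):
    """P(word) on the backoff slice: the fallback distribution mixed with a
    soft topic prior. weight=0.2 is an illustrative 'gentle' setting."""
    return (1.0 - weight) * unigram.get(word, 0.0) + weight * topic.get(word, 0.0)

# Invented fallback and topic distributions over a three-word vocabulary.
unigram = {"the": 0.6, "treaty": 0.3, "enzyme": 0.1}
biology_topic = {"the": 0.2, "treaty": 0.1, "enzyme": 0.7}

# Cost of an on-topic word without and with the topic prior.
bits_without = -math.log2(unigram["enzyme"])
bits_with = -math.log2(backoff_prob("enzyme", unigram, biology_topic))
slice_gain_toy = bits_without - bits_with
print("toy gain on an on-topic backoff word: %.2f bits" % slice_gain_toy)

# Dilution: if the slot changes predictions only on the backoff slice,
# the overall gain is the slice gain scaled by the slice's share of words.
slice_fraction = 0.009    # 0.9% of words, from the table
slice_gain_bpw = 0.143    # bits per word saved on that slice, from the table
implied_overall_gain = slice_fraction * slice_gain_bpw
print("implied overall gain: %.4f bpw" % implied_overall_gain)
```

The implied overall figure comes out near 0.0013 bpw, in line with the reported +0.0012 once rounding of the slice share is allowed for. That agreement is what one would expect if the slot changes nothing outside the backoff slice.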

The lesson

Measure surprise as belief-update, not token-cost. KL(P_t ‖ P_{t−1}) isolates the moment the whole next-word belief is rewritten, which is what a topic boundary is, and at ±25 words it beats per-token surprisal by a factor of 5.7 and branching-entropy by a factor of about 120 at finding real article boundaries on real text. The lead grows with data. As a hard segmenter it is precision-only, so the win is KL-as-signal. The persistent event slot it drives is a real prior but a narrow one: it helps only on the 0.9% of words local context cannot reach, the altitude law again, confirmed by a second road.

The keeper for the cortex is plain. Carry Bayesian surprise as the topic-boundary signal, not surprisal. Keep the event slot as a soft, low-weight prior fired only on the backoff slice. Do not expect it to sharpen the 99% of tokens the local counts already own.

Lineage

Grew from topic ignition, whose altitude law this re-confirms through a different mechanism; from finding phrases by surprise, the boundary signal it leaves behind for topics; and from predicting the kind, not the word, whose online concept clusters are the slots it tracks.

Thread: global coherence, the frontier the scorecard named. A model that knows when the topic turns is a model with a sense of the whole, not just the last few words. The idea is Zacks on event segmentation and Kumar 2023 on Bayesian surprise.
