A vote that remembers what it just saw

2026-06-26 · qualified result · experiment R

The question

Our model votes fresh at every step. It throws away the last instant's belief and recomputes from scratch. The brain seems to do the opposite: it accumulates evidence for a hypothesis over time, so one bad glimpse cannot tank a decision. This was the single most-recommended idea from re-reading the research corpus.

So we asked two things. Does a vote that remembers hold up better when the input is unreliable? And does the drop in its confidence make a free boundary detector?

What we tried

Instead of recomputing the next-character belief each step, we kept a leaky accumulator. Each step it decays the old evidence a little and adds the new. One noisy step gets averaged in, not handed the wheel. We compared this to the fresh vote, both built from the same character experts, then corrupted a fraction of the context and watched.
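A minimal sketch of that update, to make "decay the old, add the new" concrete. The function names, the dictionary representation of a vote, and the decay value of 0.8 are illustrative assumptions, not the code used in the experiment.

```python
def leaky_update(evidence, vote, decay=0.8):
    """Decay the old evidence a little, then add this step's vote."""
    symbols = set(evidence) | set(vote)
    return {s: decay * evidence.get(s, 0.0) + vote.get(s, 0.0) for s in symbols}


def belief(evidence):
    """Normalize accumulated evidence into a probability distribution."""
    total = sum(evidence.values())
    return {s: v / total for s, v in evidence.items()} if total else {}


def run(votes, decay=0.8):
    """Feed a sequence of per-step votes through the accumulator."""
    evidence = {}
    for vote in votes:
        evidence = leaky_update(evidence, vote, decay)
    return belief(evidence)


# Three steps favouring "a", then one corrupted step favouring "z".
# (Made-up votes for illustration only.)
votes = [{"a": 0.9, "z": 0.1}] * 3 + [{"a": 0.1, "z": 0.9}]
final = run(votes)
print(final["a"] > final["z"])  # the last vote on its own would pick "z"
```

Setting decay to 0 recovers the fresh vote, since nothing from earlier steps survives. Values closer to 1 remember longer and smooth harder.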

What happened

On clean text the fresh vote wins. Under noise, the accumulator wins.

context noise   fresh vote (bits)   accumulated evidence (bits)
0%              3.67                5.61
10%             7.21                6.50
20%             10.20               7.28
30%             12.76               8.00

(Lower is better.) On clean text the fresh vote is sharp and far better. But every corrupted character is a full-strength bad step for it, so its cost explodes. The accumulator degrades gracefully: its cost rose only 2.4 bits across the whole range while the fresh vote's rose 9.1. By 10% noise, remembering already wins. That is exactly the "one bad step can't tank it" prediction.
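The two rises and the crossover can be checked directly from the table. The snippet below only re-does that arithmetic on the reported numbers; the variable names are ours.

```python
# Bits per noise level (%), copied from the table above: (fresh vote, accumulator).
results = {0: (3.67, 5.61), 10: (7.21, 6.50), 20: (10.20, 7.28), 30: (12.76, 8.00)}

# Increase in cost from 0% to 30% noise for each vote.
fresh_rise = round(results[30][0] - results[0][0], 1)
accumulator_rise = round(results[30][1] - results[0][1], 1)

# Lowest tested noise level at which the accumulator costs fewer bits.
first_win = min(noise for noise, (fresh, acc) in results.items() if acc < fresh)

print(fresh_rise, accumulator_rise, first_win)  # 9.1 2.4 10
```

Only four noise levels were tested, so the true crossover lies somewhere between 0% and 10%.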

The second test failed. We read the drop in the model's confidence as a word-boundary signal and put it head to head with the branching-entropy rise from our very first experiment.

signal                   F1
random                   0.42
confidence drop          0.51
branching-entropy rise   0.76

The drop beats random but loses badly to the incumbent. And the reason is the very thing that helped the first test: the accumulator smooths the stream, and boundary detection wants a sharp spike, not a smoothed one.
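A toy sketch of why smoothing hurts here. It assumes a boundary is flagged wherever confidence falls by more than a fixed threshold between consecutive steps, and scores flagged positions with F1. The threshold, the smoothing weight, and the confidence values are invented for illustration and are not the experiment's settings.

```python
def drop_boundaries(confidence, threshold=0.2):
    """Positions where confidence falls by more than `threshold` in one step."""
    return {i for i in range(1, len(confidence))
            if confidence[i - 1] - confidence[i] > threshold}


def smooth(values, new_weight=0.3):
    """Exponential moving average: keep most of the old value, mix in the new."""
    out = [values[0]]
    for v in values[1:]:
        out.append((1 - new_weight) * out[-1] + new_weight * v)
    return out


def f1_score(predicted, actual):
    """Harmonic mean of precision and recall over boundary positions."""
    hits = len(predicted & actual)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(actual)
    return 2 * precision * recall / (precision + recall)


sharp = [0.9, 0.9, 0.3, 0.9]   # one sharp dip at position 2
truth = {2}                    # the (made-up) true boundary

print(f1_score(drop_boundaries(sharp), truth))          # sharp signal finds it
print(f1_score(drop_boundaries(smooth(sharp)), truth))  # smoothed signal misses it
```

The same dip that crosses the threshold in the raw signal is flattened below it after smoothing, which is the trade-off the F1 table reflects.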

The lesson

Accumulating evidence buys robustness to noise, paid for in clean-case sharpness. It is a prediction tool, not a boundary detector. Smoothing and boundary-sharpness pull in opposite directions.

This is a clean example of the rule we hold ourselves to. Read only the clean-text headline and you throw the idea away. Read the noise column and it earns its place. The natural next step is a version that down-weights a step only when the experts disagree, keeping the clean sharpness and the noise immunity at once.
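One way that next step could look, as a hypothetical sketch only: nothing here has been run in the experiment. It assumes each expert casts a single pick, measures agreement as the share of experts behind the most popular pick, and uses that share to blend the fresh vote with the remembered belief.

```python
from collections import Counter


def fresh_vote(expert_picks):
    """Share of experts behind each symbol at this step."""
    counts = Counter(expert_picks)
    return {s: c / len(expert_picks) for s, c in counts.items()}


def agreement(expert_picks):
    """Share of experts behind the most popular pick (1.0 means unanimous)."""
    return max(fresh_vote(expert_picks).values())


def gated_belief(prior, expert_picks):
    """Trust the fresh vote when experts agree, the remembered belief when they split."""
    vote = fresh_vote(expert_picks)
    w = agreement(expert_picks)
    symbols = set(prior) | set(vote)
    return {s: w * vote.get(s, 0.0) + (1 - w) * prior.get(s, 0.0) for s in symbols}


prior = {"a": 1.0}
print(gated_belief(prior, ["b", "b", "b"]))       # unanimous: the fresh vote passes through
print(gated_belief(prior, ["a", "b", "c", "d"]))  # split: memory dominates
```

With unanimous experts the weight is 1 and the belief equals the fresh vote, which keeps the clean-text sharpness. When the experts scatter, the weight falls and the earlier belief carries the step.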

Lineage

Grew from the source mining and from the first boundary signal: the entropy rise that this experiment tried and failed to beat as a boundary detector.

Led to the fused next step (accumulate, but down-weight only on expert disagreement); to the fair rematch, where this accumulated evidence earns its keep on rare contexts inside the full stack; and to the calibrated count, where this leaky accumulator becomes the precision weight that calibrates the model.

Thread: surprise, the one signal behind boundaries, attention, and accumulated evidence.