How sure is a count?

2026-06-26 · qualified win · experiment AB

The question

A count tells you how often a context fired. It never tells you whether it was right. The model stakes the same confidence on a context it has seen a million times and one it saw twice in a moment of noise. So it is wildly overconfident, and it has no way to know.

Pei Wang's NARS has a fix that costs nothing. Split the count. For every context, keep two tallies: hits w+ when its bet pays off, misses w- when it does not, with w = w+ + w- the total evidence. Then a context carries a truth value: a frequency f = w+ / w (how often it is right) and a confidence c = w / (w + 1) (how much evidence stands behind that). Both read straight off the counts, with no knob to tune. We asked three things. Does that truth value make the model honestly calibrated? Does it make a better expert weight? And can a knob-free f·c gate replace the hand-tuned threshold that decides when a long context overrules a short one?
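The two formulas are small enough to write out. This is a minimal sketch rather than the experiment's code: the `Evidence` class and its field names are hypothetical, and returning 0.5 for a context with no evidence is an assumption, since f is undefined when w = 0.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hit/miss tallies for one context (hypothetical container)."""
    hits: int = 0    # w+: the context's bet paid off
    misses: int = 0  # w-: it did not

    @property
    def total(self) -> int:
        # w = w+ + w-
        return self.hits + self.misses

    @property
    def frequency(self) -> float:
        # f = w+ / w. With no evidence f is undefined; 0.5 is an assumed default.
        return self.hits / self.total if self.total else 0.5

    @property
    def confidence(self) -> float:
        # c = w / (w + 1): grows toward 1 as evidence accumulates.
        return self.total / (self.total + 1)


seen_often = Evidence(hits=900_000, misses=100_000)
seen_twice = Evidence(hits=2, misses=0)

print(seen_often.frequency, seen_often.confidence)
print(seen_twice.frequency, seen_twice.confidence)
```

The context seen twice has a perfect frequency of 1.0 but a confidence of only 2/3, while the context seen a million times has a lower frequency and a confidence just under 1. That difference is the information a bare count throws away.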

What we tried

One causal pass over text8. At each position we score the context's running top-1 against the character that actually comes next, before folding that character in, and increment w+ or w-. That gives every context its (f, c). We then used the truth value three ways: to weight each order's vote in the product-of-experts; to pool orders by adding evidence (a thin, unreliable order barely moves the pool); and to drive a gate that opens the long context exactly when its f·c beats the backoff's. We compared all of it to the bare-count pool that gives every order an equal vote.
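The score-then-update order is the part that keeps the tallies honest, so here is a sketch of it for a single order. The function name `causal_pass` and the data layout are hypothetical, and skipping the score when a context has no prediction yet is an assumption; the write-up does not say how first sightings were handled.

```python
from collections import Counter, defaultdict


def causal_pass(text: str, order: int):
    """One left-to-right pass: score each context's top-1, then update it."""
    next_counts = defaultdict(Counter)      # context -> counts of following chars
    evidence = defaultdict(lambda: [0, 0])  # context -> [hits w+, misses w-]

    for i in range(order, len(text)):
        ctx, actual = text[i - order:i], text[i]
        seen = next_counts[ctx]
        if seen:
            # Score the running top-1 BEFORE the true character is folded in.
            top1 = max(seen, key=seen.get)
            evidence[ctx][0 if top1 == actual else 1] += 1
        # Assumption: a context with no prediction yet is not scored at all.
        seen[actual] += 1
    return next_counts, evidence


counts, evidence = causal_pass("abababac", order=1)
print(dict(evidence))
```

On this toy string the context `a` ends with two hits and one miss (the final `c` breaks its streak of `b`), and `b` ends with two hits and no misses.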

What happened

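The calibration win is the headline: expected calibration error (ECE) drops tenfold, from 0.280 to 0.027, though the best-calibrated combiner gives up top-1 accuracy to get there (0.568 to 0.411).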

combiner	ECE (lower better)	accuracy
bare-count pool	0.280	0.568
confidence-weighted pool	0.280	0.565
precision-weighted	0.052	0.545
(f,c)-revision	0.027	0.411

The bare pool says "98.5% sure" on 179,000 predictions and is right 76% of the time. After revision, when the model says "85% sure" it is right 90% of the time, and when it says "94% sure" it is right 95% of the time. Stated confidence now tracks true accuracy almost along the diagonal. The count now knows how sure it is.
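For readers who want the metric spelled out, a common way to compute ECE is to bin predictions by stated confidence and average the gap between confidence and accuracy, weighted by bin size. The sketch below uses ten equal-width bins, which is an assumption; the write-up does not say which binning the experiment used, and the toy input is illustrative, not the experiment's data.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted mean |stated confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, ok in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(mean_conf - accuracy)
    return ece


# Toy input: 100 predictions all stated at 98.5%, of which 76 are right.
toy_conf = [0.985] * 100
toy_correct = [True] * 76 + [False] * 24
toy_ece = expected_calibration_error(toy_conf, toy_correct)
print(toy_ece)  # about 0.225: the gap between 98.5% stated and 76% achieved
```

A perfectly calibrated model scores 0, because every bin's stated confidence matches its accuracy.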

Then prediction, where we did not expect a win. Weighting each expert by its own track record cuts perplexity hard.

combiner	perplexity	bits-per-char
bare-count pool	12.4	3.63
(f,c)-revision	6.6	2.72
precision-weighted	4.3	2.11
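
The two columns are the same measurement in different units: per-character perplexity is 2 raised to the bits-per-char. A quick check, with the values copied from the table above:

```python
def perplexity_from_bpc(bits_per_char: float) -> float:
    """Per-character perplexity is 2 ** (average bits per character)."""
    return 2.0 ** bits_per_char


table = {
    "bare-count pool": (12.4, 3.63),
    "(f,c)-revision": (6.6, 2.72),
    "precision-weighted": (4.3, 2.11),
}
for name, (reported_ppl, bpc) in table.items():
    print(name, reported_ppl, round(perplexity_from_bpc(bpc), 1))
```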

The bare pool hands a thin, unreliable order-5 context an equal vote, and it drags the answer off-target. Down-weighting that order by its own reliability is exactly what a product-of-experts always needed and never had. Perplexity falls from 12.4 to 4.3 while accuracy roughly holds (0.568 to 0.545).
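To see why an equal vote hurts, here is a sketch of a weighted product-of-experts: each expert's distribution is raised to its weight, the results are multiplied, and the product is renormalized. The function name, the probability floor, and the example weights are all assumptions for illustration; the exact weight the precision-weighted combiner used is not spelled out here.

```python
import math


def weighted_product_of_experts(dists, weights, floor=1e-6):
    """p(s) proportional to the product over experts k of p_k(s) ** w_k."""
    symbols = set().union(*dists)
    log_scores = {
        s: sum(w * math.log(max(d.get(s, 0.0), floor))  # floor avoids log(0)
               for d, w in zip(dists, weights))
        for s in symbols
    }
    peak = max(log_scores.values())  # subtract the max for numerical stability
    unnorm = {s: math.exp(v - peak) for s, v in log_scores.items()}
    z = sum(unnorm.values())
    return {s: u / z for s, u in unnorm.items()}


reliable = {"e": 0.7, "a": 0.3}  # an expert with a good track record
thin = {"a": 0.9, "e": 0.1}      # a sharp but unreliable expert

equal_vote = weighted_product_of_experts([reliable, thin], [1.0, 1.0])
by_track_record = weighted_product_of_experts([reliable, thin], [0.9, 0.1])
print(equal_vote)
print(by_track_record)
```

With equal weights the thin expert's sharp distribution wins and the pool picks `a`. With the thin expert down-weighted, the pool follows the reliable expert and picks `e`.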

The gate is the honest qualifier. On clean text, overall, the knob-free f·c gate loses to the best tuned threshold (2.04 vs 1.96 bpc). Order-5 beats order-2 nearly everywhere, so "always open" is hard to beat, and the principled gate is too cautious: it closes whenever order-5's evidence is thin, even though order-5's full distribution usually still wins. But restrict to the slice the gate is for, the rare contexts whose order-5 evidence is genuinely unreliable, and it is the only policy that beats always-open.

policy (rare / unreliable slice)	bits-per-char
always open (order 5)	2.764
best tuned threshold	2.764
principled f·c (no knob)	2.739
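
The gate rule itself is one comparison. A minimal sketch, assuming each context's evidence is available as a (hits, misses) pair; the function names are hypothetical, and the 0.5 default for an unseen context is an assumption.

```python
def truth_value(hits: int, misses: int):
    """Return (f, c) for a context from its hit and miss tallies."""
    w = hits + misses
    f = hits / w if w else 0.5  # assumed default when there is no evidence
    c = w / (w + 1)
    return f, c


def gate_opens(long_evidence, short_evidence) -> bool:
    """Open the long context only when its f*c beats the backoff's f*c."""
    f_long, c_long = truth_value(*long_evidence)
    f_short, c_short = truth_value(*short_evidence)
    return f_long * c_long > f_short * c_short


# A long context seen once (and right once) against a well-worn backoff.
thin_long = gate_opens((1, 0), (60, 40))
# The same backoff against a long context with a long, good track record.
solid_long = gate_opens((90, 10), (60, 40))
print(thin_long, solid_long)
```

The first case shows the caution described above: a long context that has been right every time it was seen still stays closed, because one observation gives it a confidence of only 0.5. That is the behavior that costs the gate on clean text and pays off on the unreliable slice.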

The lesson

Split a count into hits and misses and the model learns how sure it is. The NARS truth value turns an overconfident count into an honestly-calibrated one, ECE 0.280 to 0.027 under (f,c)-revision, with no fitted parameter. As an expert weight it also cuts perplexity nearly threefold (12.4 to 4.3 with precision weighting), because a product-of-experts wants reliable voters, not equal ones. The knob-free gate it drives is too conservative to win on clean text, but it is the single policy that beats always-open on the rare, unreliable slice. So it is parked there, on the axis it can win.

The keeper for the cortex is the truth value itself: carry (f, c) on every context, weight by precision, and read the gate off f·c instead of a swept threshold. The calibration is free and it cannot be tuned away, because it is counted against the very thing it measures.

Lineage

Grew from the vote that remembers, whose leaky accumulator became the precision weight here, and from the gate, whose hand-tuned confidence router this tried to replace with a knob-free f·c.

Thread: the search for the right combiner. The truth value is the missing ingredient, the weight a product-of-experts always wanted, read straight off counting hits against misses. The idea is Pei Wang's, from NARS.
