Is the analogy already in the counts?

2026-06-26 · qualified win · experiment AD

The question

king is to queen as man is to ? The famous answer is that this lives in a vector space: subtract, add, and the arrow from king to queen carries you from man to woman. The space comes from SVD or word2vec. We do not train, so we cannot build that space the usual way. But a published result (PMC11493305) says the parallelogram is already latent in the raw co-occurrence counts, and that SVD only smooths it to human parity. If that is true, a pure counter should reason by analogy with nothing but counts and a cosine.

So we asked three things. Is the analogy in the raw counts? Our only legal smoother is online leader-clustering; can it stand in for the SVD step the literature relies on? And can NARS-style transitive induction, deriving a new link from two observed ones, beat a direct counter on pairs that never co-occur?

What we tried

We streamed 16 MB of text8 and built, for each word, a co-occurrence profile over the top 6,000 context words. We scored profiles two count-native ways: log counts, and positive PMI (a per-cell ratio of counts, which is computing a value, not factorizing a matrix, so it stays on the allowed list). We defined the relation as a profile difference and solved a:b::c:? by 3CosAdd, the cosine to profile[c] + (profile[b] − profile[a]). We built analogy items from four classic families: capital-country, currency, plural, gender. Then we swept leader-cluster smoothing, mixing each word's profile toward its cluster centroid, to see if it could play SVD's role. Separately we induced gap-3 links from gap-1 and gap-2 counts through every bridge word, weighted by the NARS truth product, and tested on pairs that essentially never co-occur directly.
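The scoring and solving steps above can be sketched as follows. This is a minimal illustration on a toy count matrix, not the experiment's code: the function names `ppmi` and `three_cos_add`, the five-word vocabulary, and the choice to exclude the three query words from the candidate pool are all assumptions made for the example.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI per cell: max(0, log(count * total / (row_sum * col_sum)))."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    expected = row * col / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0          # zero counts contribute nothing
    return np.maximum(pmi, 0.0)

def three_cos_add(profiles, vocab, a, b, c, candidates=None):
    """Rank candidates by cosine to profile[c] + (profile[b] - profile[a])."""
    index = {w: i for i, w in enumerate(vocab)}
    target = profiles[index[c]] + (profiles[index[b]] - profiles[index[a]])
    # Assumption: the query words themselves are not allowed as answers.
    pool = [w for w in (candidates or vocab) if w not in (a, b, c)]

    def cos(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom else 0.0

    return sorted(pool, key=lambda w: cos(profiles[index[w]], target), reverse=True)

# Toy data (hypothetical): rows are words, columns are four context words.
vocab = ["king", "queen", "man", "woman", "apple"]
counts = [
    [10, 10, 0, 0],   # king:  royal, male
    [10, 0, 10, 0],   # queen: royal, female
    [0, 10, 0, 0],    # man:   male
    [0, 0, 10, 0],    # woman: female
    [0, 0, 0, 10],    # apple: fruit
]
profiles = ppmi(counts)
ranked = three_cos_add(profiles, vocab, "man", "woman", "king")
print(ranked)
```

Passing a `candidates` list restricts the pool to one category, which is how a category-restricted protocol would differ from the open-vocabulary one in this sketch.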

What happened

The parallelogram is in the raw counts. On the standard category-restricted protocol, raw PPMI profiles solve the analogy at 56% top-1 and 94% top-5, against a 14% random floor and a 10% frequency floor. That is about four times the random baseline, with no SVD, no embedding, and no gradient.

family (restricted, raw PPMI)	top-1 (%)	top-5 (%)
capital-country	64	92
currency	30	100
plural	89	99
gender	43	87
macro	56	94
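For readers who want the scoring spelled out, here is a small sketch of top-k accuracy and the macro average. The helper names are hypothetical, and the rankings passed to `top_k_accuracy` are made up for illustration. Only the four per-family top-1 numbers come from the table above.

```python
def top_k_accuracy(rankings, gold, k):
    """Percentage of items whose gold answer appears in the first k ranked candidates."""
    hits = sum(1 for ranked, answer in zip(rankings, gold) if answer in ranked[:k])
    return 100.0 * hits / len(gold)

def macro_average(per_family):
    """Unweighted mean over families, so a large family cannot dominate."""
    return sum(per_family.values()) / len(per_family)

# Per-family top-1 scores as printed in the table (already rounded).
table_top1 = {"capital-country": 64, "currency": 30, "plural": 89, "gender": 43}
macro_top1 = macro_average(table_top1)

# Made-up rankings, only to show the call shape.
demo = top_k_accuracy([["queen", "son"], ["korea", "china"]], ["queen", "tokyo"], k=1)
print(macro_top1, demo)
```

Averaging the rounded table entries gives 56.5, which agrees with the reported macro of 56 up to rounding of the per-family scores.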

Worked examples, open vocabulary: man:woman :: king:? returns daughter, son; france:paris :: japan:? returns korea, china, singapore; car:cars :: dog:? returns hound, foxes. None of these is the expected answer (queen, tokyo, dogs). The answer is usually near the parallelogram point, but distractors crowd it out when nothing restricts the candidates.

Two honest negatives, both clean. First, leader-clustering does not substitute for SVD. Every smoothing strength was flat-to-worse than raw counts: a gentle mix tied, a strong one collapsed the macro score from 56 to 29.

smoothing strength	PPMI macro top-1 (%)
β = 0 (raw counts)	56
β = 0.3	56
β = 0.6	53
β = 0.9	29

The reason is structural. paris and tokyo share a cluster, so averaging a word toward its centroid blurs the very paris − france direction the parallelogram rides on. SVD denoises while keeping the axes. Leader-cluster averaging throws the axes away. It was the right shape of smoothing for rare-context backoff in experiment Z; it is the wrong shape for relations.
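The blurring can be made concrete with a small sketch. It assumes a one-pass leader rule (join the most similar existing leader if the cosine clears a threshold, otherwise become a new leader) and a linear mix (1 − β)·profile + β·centroid. The threshold, the cosine criterion, and the three toy profiles are assumptions for illustration; the experiment's actual clustering settings are not given here.

```python
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def leader_cluster(profiles, threshold):
    """One pass over the rows: join the closest leader or start a new cluster."""
    leaders, labels = [], []
    for p in profiles:
        sims = [cosine(p, profiles[l]) for l in leaders]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            leaders.append(len(labels))        # this row becomes a leader
            labels.append(len(leaders) - 1)
    return np.array(labels)

def smooth(profiles, labels, beta):
    """Mix each profile toward the centroid of its cluster."""
    out = np.empty_like(profiles, dtype=float)
    for k in np.unique(labels):
        members = labels == k
        centroid = profiles[members].mean(axis=0)
        out[members] = (1 - beta) * profiles[members] + beta * centroid
    return out

# Toy profiles (hypothetical): two capitals that look alike, one country.
words = ["paris", "tokyo", "france"]
profiles = np.array([[1.0, 0.2, 0.0],
                     [1.0, 0.0, 0.2],
                     [0.0, 0.1, 1.0]])
labels = leader_cluster(profiles, threshold=0.9)
smoothed = smooth(profiles, labels, beta=0.9)

gap_before = np.linalg.norm(profiles[0] - profiles[1])
gap_after = np.linalg.norm(smoothed[0] - smoothed[1])
print(labels, gap_before, gap_after)
```

Because both cluster-mates move toward the same centroid, the difference between them shrinks by exactly the factor (1 − β). At β = 0.9 only a tenth of the paris-versus-tokyo contrast survives, and that contrast is what lets the parallelogram pick the right capital.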

Second, NARS transitive induction did not beat a direct counter. On the held-out slice it managed top-1 0.10% against the unigram's 0.07%, but its mean rank was 1366 against the direct counter's 345. Transitive composition spreads mass across everything any bridge ever predicted, so it nudges a rare target into the top-5 while diluting the sharp local signal the direct counter keeps. And the premise that the direct counter is blind on that slice was itself false: even at gap-3 with no co-occurrence, the surrounding collocation gives it a mean rank of 345.
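The mass-spreading effect is easy to see in a simplified sketch of bridge induction. The truth function used in the experiment is not spelled out here, so this stand-in takes the product of the two links' conditional frequencies and sums it over every bridge word. The function names and the toy counts are hypothetical.

```python
import numpy as np

def row_normalize(counts):
    """Turn each row of counts into conditional frequencies (rows with no counts stay zero)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.where(totals == 0, 1.0, totals)

def induce_gap3(gap1_counts, gap2_counts):
    """score[a, c] = sum over bridges b of f(a -> b at gap 1) * f(b -> c at gap 2)."""
    f1 = row_normalize(gap1_counts)
    f2 = row_normalize(gap2_counts)
    return f1 @ f2

# Toy counts over a four-word vocabulary (hypothetical).
gap1 = [[0, 2, 2, 0],    # word 0 is followed at gap 1 by word 1 or word 2
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
gap2 = [[0, 0, 0, 0],
        [0, 0, 0, 4],    # bridge 1 predicts word 3
        [3, 0, 0, 1],    # bridge 2 predicts word 0 and word 3
        [0, 0, 0, 0]]
induced = induce_gap3(gap1, gap2)
print(induced[0])
```

Every bridge contributes everything it ever predicted, so the induced row for a word covers the union of its bridges' targets. In this toy case the right target still wins, but with thousands of bridges the same union is what dilutes the ranking.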

The lesson

Analogy is already in the counts. A per-cell PMI ratio and a cosine recover the parallelogram at four times baseline, 94% top-5 restricted, in one pass with no SVD and no gradient. But the cortex's own smoother, online leader-clustering, is the wrong shape for it: averaging a word toward its cluster centroid blurs the exact relation-axes the parallelogram needs, so smoothing never helps and at high strength hurts badly. SVD is not interchangeable with count-pooling; it denoises while preserving the axes, and we have no count-native operator that does that. NARS transitive induction spreads its mass too broadly to beat a direct counter on held-out pairs. Counting gets you the analogy for free and the induced link not at all.

The keeper is the raw PPMI profile as a relation organ: ask it a:b::c:? and it answers by parallelogram, no training. Do not route it through cluster smoothing. The recurring theme across this experiment, Z, and S is the same: the representation is in the counts, but we still lack a count-native combiner that sharpens without blurring.

Lineage

Grew from the similarity hybrid, whose leader-clustering was the right shape for backoff and turns out wrong for relations, and from counted attention, the offset form that found relations live in pair patterns.

Thread: representations and the right combiner. The map of meaning is real and it holds the analogy; what we lack is the operator that sharpens it without blurring the axes. The ideas are NARS and the parallelogram-in-counts result.
