Grammar is just counting, made productive
2026-06-26 · qualified win · experiment AF
The question
An n-gram knows "such as" because it counted "such as" a thousand times. It does not know "two kilometers" unless it counted exactly that pair, and for most pairs it never will, because most word combinations are rare or new. The model drops them to the smoothing floor. A grammar is supposed to fix this: it abstracts a slot, so a frame it has only seen with a few fillers can accept a filler it has never seen there.
The usage-based linguists, Bybee, Goldberg, Tomasello, say a grammar is not handed down from above. It falls out of counting, if for every frame you count two things. Token frequency, how often this exact frame-and-filler occurred, which entrenches: a frame with one dominant filler freezes into an idiom. And type frequency, how many distinct fillers followed, which makes a slot productive: a frame with many fillers spread across categories becomes an open slot that predicts the category, not any one word, so it can fire for a filler never seen in that frame. We asked whether counting those two things turns a flat n-gram into a compositional one.
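The two counts can be kept in a single nested counter. The following is a minimal sketch, not the experiment's code: the function names and the toy sentence are assumptions made for illustration, and the frame is taken to be the preceding word.

```python
from collections import Counter, defaultdict

def count_frames(tokens):
    """One pass over a token list; the frame is the preceding word."""
    token_freq = defaultdict(Counter)  # frame -> filler -> occurrences
    for frame, filler in zip(tokens, tokens[1:]):
        token_freq[frame][filler] += 1
    return token_freq

def type_frequency(token_freq, frame):
    """Number of distinct fillers seen after this frame."""
    return len(token_freq[frame])

def dominant_share(token_freq, frame):
    """Share of the frame's tokens owned by its most frequent filler."""
    fillers = token_freq[frame]
    total = sum(fillers.values())
    return max(fillers.values()) / total if total else 0.0

# Toy text, invented for illustration.
tokens = "such as a such as b such as c two km two miles two ft".split()
counts = count_frames(tokens)

such_types = type_frequency(counts, "such")   # one filler only: "as"
such_share = dominant_share(counts, "such")   # that filler owns every token
two_types = type_frequency(counts, "two")     # km, miles, ft
two_share = dominant_share(counts, "two")     # no filler dominates
```

In this toy text "such" has one filler that owns all its tokens, the entrenched pattern, while "two" spreads its tokens evenly over three fillers, the pattern that type frequency rewards.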
What we tried
One streaming pass over 14 MB of text8. For each frame (here, the preceding word) we counted both its token frequency and its distinct fillers, and we gave each filler an online category from leader-clustering. We then classified the ripe frames: frozen if one filler owns half the tokens, open-slot if at least a dozen fillers spread over three or more categories with no dominator. The open-slot frame predicts through its category: the frame supplies which categories its slot accepts, the category supplies which words live in it. We held out 30% of the distinct frame-filler pairs for the compositional test, so those fillers were never seen in that frame during training. We also added statistical preemption: when a rival frame commits much harder to a category, the weakly-held competitor link is down-weighted, which acts as count-based inhibition against over-generation.
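The classification rule can be sketched as a short function. The half-the-tokens, dozen-fillers, and three-categories thresholds come from the description above. The `min_tokens` ripeness cutoff, the label names, and the `category_of` mapping (standing in for the online leader-clustering) are assumptions for illustration.

```python
from collections import Counter

def classify_frame(fillers, category_of, min_tokens=20):
    """Label one frame from its filler counts.

    fillers:     Counter mapping filler -> count for this frame
    category_of: dict mapping word -> category id (hypothetical stand-in
                 for the online leader-clustering)
    min_tokens:  assumed ripeness cutoff, not stated in the write-up
    """
    total = sum(fillers.values())
    if total < min_tokens:
        return "unripe"
    top_share = max(fillers.values()) / total
    if top_share >= 0.5:
        return "frozen"  # one filler owns half the tokens
    categories = {category_of[w] for w in fillers if w in category_of}
    if len(fillers) >= 12 and len(categories) >= 3:
        return "open-slot"  # many fillers, several categories, no dominator
    return "other"

# Toy frames, invented for illustration.
words = [f"w{i}" for i in range(12)]
category_of = {w: i % 3 for i, w in enumerate(words)}  # three categories

frozen_label = classify_frame(Counter({"as": 30, "a": 5}), category_of)
open_label = classify_frame(Counter({w: 2 for w in words}), category_of)
unripe_label = classify_frame(Counter({"x": 1}), category_of)
```

Checking the dominator first means a frame with one runaway filler is frozen even if it also has many rare fillers, which matches the "no dominator" condition on open slots.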
What happened
On held-out, never-seen pairs the open-slot construction lowers perplexity by a factor of 4.3 against the n-gram on open-slot frames, and by a factor of 3.0 across all held-out pairs.
| held-out slice | n-gram perplexity | construction perplexity | construction wins |
|---|---|---|---|
| all held-out pairs | 19,610 | 6,471 | 60.3% |
| open-slot frames only | 23,461 | 5,405 | 80.1% |
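
The ratios implied by the table can be checked directly. The sketch below also includes the conventional definition of perplexity, the exponential of the mean negative log-probability. The write-up does not say exactly how perplexity was computed, so that function is an assumption about the metric, not a description of the experiment's code.

```python
import math

def perplexity(probs):
    """Conventional perplexity: exp of the mean negative log-probability."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# (n-gram perplexity, construction perplexity), copied from the table.
table = {
    "all held-out pairs": (19610, 6471),
    "open-slot frames only": (23461, 5405),
}
ratios = {name: ngram / construction
          for name, (ngram, construction) in table.items()}

# Sanity check of the definition: a uniform guess over 4 words.
uniform_ppl = perplexity([0.25] * 4)
```

The 4.3 figure is the ratio for the open-slot slice; across all held-out pairs the ratio is about 3.0.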
The n-gram has literally never counted this filler after this frame, so it falls to the smoothing floor. The construction recognizes that the filler's category is one the frame's slot accepts and lends it the category's mass. Compositional generalization, produced by counting two things instead of one.
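A minimal sketch of "lending the category's mass", assuming each word belongs to exactly one category, so that the probability of a filler is the frame's preference for the filler's category times the filler's share inside that category. All names and numbers here are hypothetical.

```python
def construction_prob(filler, slot_categories, category_lexicon, category_of):
    """P(filler | frame) routed through the filler's category.

    slot_categories:  {category: P(category | frame)} for one open-slot frame
    category_lexicon: {category: {word: P(word | category)}}
    category_of:      {word: category}, one hard category per word
    """
    category = category_of.get(filler)
    if category is None:
        return 0.0  # no category yet, so the construction cannot help
    p_category = slot_categories.get(category, 0.0)
    p_word = category_lexicon.get(category, {}).get(filler, 0.0)
    return p_category * p_word

# Toy numbers, invented for illustration only.
two_slot = {"UNIT": 0.8, "OTHER": 0.2}
lexicon = {
    "UNIT": {"km": 0.4, "miles": 0.3, "ft": 0.2, "kilometers": 0.1},
    "OTHER": {"banana": 1.0},
}
category_of = {w: c for c, words in lexicon.items() for w in words}

# "kilometers" was never counted after "two", but its category was.
p_unseen = construction_prob("kilometers", two_slot, lexicon, category_of)
p_unknown = construction_prob("zebra", two_slot, lexicon, category_of)
```

With these toy numbers the unseen pair receives roughly 0.08 instead of a smoothing floor, because the frame has seen the category even though it has not seen the word.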
And the labels behaved as the theory predicts. The high-token frames froze into real idioms: "such as", "known as", "based on", "part of", "number of". The high-type frames abstracted to a category, the cleanest being the number frames, which all converged on a measurement-unit slot. Three examples, the first a number frame:
"two ___" → {km, per, miles, ft, square, approximately} a NUMBER + UNIT construction
"to ___" → {be, have, him, them, due, according} post-"to" / infinitive
"a ___" → {single, program, strong, larger, longer} determiner + noun/adjective
No grammar was given. The frame discovered that a number is followed by a unit, by counting. Preemption then cut over-generation: the weakly-held competitor links lost 39.5% of their mass while the attested links kept 102% of theirs, because the suppressed probability was renormalized back onto the real forms. The unobserved form is inhibited; the conventional one is not.
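One way preemption of this kind could work is sketched below. The write-up gives the idea (a rival frame's much stronger commitment suppresses a weak link, then the mass is renormalized) but not the rule, so the `ratio` and `damp` parameters and the exact comparison are assumptions.

```python
def preempt(slot, ratio=4.0, damp=0.5):
    """Down-weight weakly-held frame->category links, then renormalize.

    slot:  {frame: {category: P(category | frame)}}
    ratio: assumed factor by which a rival must out-commit this frame
    damp:  assumed multiplier applied to a preempted link
    """
    # Strongest commitment to each category across all frames.
    strongest = {}
    for cats in slot.values():
        for c, p in cats.items():
            strongest[c] = max(strongest.get(c, 0.0), p)

    out = {}
    for frame, cats in slot.items():
        adjusted = {
            c: (p * damp if strongest[c] >= ratio * p else p)
            for c, p in cats.items()
        }
        z = sum(adjusted.values()) or 1.0
        # Renormalizing moves the suppressed mass onto the surviving links.
        out[frame] = {c: p / z for c, p in adjusted.items()}
    return out

# Toy slots, invented for illustration.
before = {
    "two": {"UNIT": 0.9, "VERB": 0.1},
    "to": {"VERB": 0.8, "UNIT": 0.2},
}
after = preempt(before)
```

In this toy case the weak "two" to VERB link is roughly halved while the strong "two" to UNIT link ends slightly above its original mass, the same qualitative pattern as the 39.5% loss and 102% retention reported above.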
The honest limit: this is a generalization axis, not an in-distribution one. On ordinary next-word prediction, where the exact pair was seen, a well-counted n-gram is sharper, because a specific count beats a category average. The construction is a backoff for the unseen, not a replacement. It should sit behind the specific count and fire when the n-gram floors. Killing it on the in-distribution metric would have been the wrong call.
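The ordering described here, specific count first and construction only when the n-gram floors, can be sketched as follows. The function name, the `floor` value, and the toy counts are assumptions, and the sketch does no discounting, so its outputs are scores rather than a properly normalized distribution.

```python
def backoff_prob(frame, filler, pair_counts, frame_totals, construction,
                 floor=1e-6):
    """Use the specific count when the pair was seen, else the construction.

    pair_counts:  {(frame, filler): count}
    frame_totals: {frame: total tokens seen after the frame}
    construction: callable (frame, filler) -> probability via the category
    floor:        assumed smoothing floor for pairs nothing can explain
    """
    seen = pair_counts.get((frame, filler), 0)
    if seen > 0:
        return seen / frame_totals[frame]  # the specific count is sharper
    p = construction(frame, filler)
    return p if p > 0 else floor

# Toy data, invented for illustration.
pair_counts = {("two", "km"): 3, ("two", "miles"): 1}
frame_totals = {"two": 4}

def toy_construction(frame, filler):
    return 0.08 if (frame, filler) == ("two", "kilometers") else 0.0

p_seen = backoff_prob("two", "km", pair_counts, frame_totals, toy_construction)
p_backoff = backoff_prob("two", "kilometers", pair_counts, frame_totals,
                         toy_construction)
p_floor = backoff_prob("two", "banana", pair_counts, frame_totals,
                       toy_construction)
```

The construction is consulted only on the path where the n-gram has nothing to say, so it cannot blur a pair that was counted directly.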
The lesson
Count two things per frame, token and type, and a flat n-gram becomes compositional. Token frequency entrenches a frame into a frozen idiom; type frequency abstracts it into an open-slot construction that predicts the category and so can fire for a filler it never saw there. On held-out, never-seen frame-filler pairs in open-slot frames the construction lowers perplexity by a factor of 4.3 against the n-gram and wins on 80% of pairs (3.0 times and 60% across all held-out pairs), and statistical preemption strips 39.5% of the mass from weakly-held competitor links with no loss to attested forms. It is a backoff for the unseen, not a replacement: when the exact pair was seen, the specific count is sharper.
The keeper for the cortex is two counts per frame and an open-slot head that predicts through an online category, serving as the compositional backoff the specific count cannot provide, with count-based preemption to keep it from over-generating. Grammar here is genuinely just counting, made productive by counting the right two things.
Lineage
Grew from "words that lower the cost of letters", the first proof that a counted concept earns its keep; from "finding phrases by surprise", the frames an open slot abstracts over; and from "predicting the kind, not the word", whose online categories the slot routes through.
Thread: representations and the right combiner. A construction is a representation that generalizes, and the open-slot head is the combiner that pours the frame's category preference into the category's lexicon. The ideas are Bybee, Goldberg, and Tomasello on usage-based grammar.