When combining the experts made it worse

2026-06-25 · negative result · experiment D

The question

We had several experts that each helped: a character model, a word lexicon, a phrase model, a topic cache. The obvious move was to combine them all into one voting cortex and watch the cost drop further. So we built it, and asked whether the full ensemble beats the simple two-expert mix we already had.

What we tried

We combined five experts two ways. The first was a gated linear vote, which blends their opinions. The second was a product of experts, which multiplies them, so constraints compound and an unsure expert abstains. We measured both against the simple tuned mix of just the character model plus the word lexicon.
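To make the two combiners concrete, here is a minimal sketch. It assumes each expert outputs a strictly positive probability for every symbol, and that the product is a weighted geometric mean renormalized to sum to one. The function names, the weights, and the toy distributions are illustrative assumptions, not the project's actual code, and the gating network that would set the weights is left out.

```python
import math

def linear_vote(experts, weights):
    """Weighted arithmetic mean of the experts' distributions."""
    total = sum(weights)
    symbols = experts[0].keys()
    return {
        s: sum(w * e[s] for e, w in zip(experts, weights)) / total
        for s in symbols
    }

def product_of_experts(experts, weights):
    """Weighted geometric mean of the distributions, renormalized.

    Works in log space; every probability must be > 0.
    """
    symbols = experts[0].keys()
    log_scores = {
        s: sum(w * math.log(e[s]) for e, w in zip(experts, weights))
        for s in symbols
    }
    peak = max(log_scores.values())  # subtract the max for numerical stability
    unnorm = {s: math.exp(v - peak) for s, v in log_scores.items()}
    z = sum(unnorm.values())
    return {s: u / z for s, u in unnorm.items()}

# Hypothetical toy distributions over a three-symbol alphabet.
char_expert = {"a": 0.7, "b": 0.2, "c": 0.1}
lexicon_expert = {"a": 0.5, "b": 0.4, "c": 0.1}
unsure_expert = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}  # carries no information

experts = [char_expert, lexicon_expert, unsure_expert]
weights = [1.0, 1.0, 1.0]  # fixed here; a gate would normally choose these

mixed = linear_vote(experts, weights)
multiplied = product_of_experts(experts, weights)
without_unsure = product_of_experts([char_expert, lexicon_expert], [1.0, 1.0])
```

In this toy case the uniform expert leaves the product unchanged (`multiplied` matches `without_unsure`), while the same expert drags the linear vote toward uniform unless its weight is driven to zero.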

What happened

Both lost to the simpler model.

model                             cost (bits per char)
simple char + lexicon mix         1.653
gated linear vote (5 experts)     1.707
product of experts (5 experts)    1.767
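For readers unfamiliar with the metric, the sketch below shows how a bits-per-char cost is typically computed and how far each ensemble sits above the baseline. It assumes the cost is the average negative base-2 log probability the model assigned to the characters that actually occurred. The `bits_per_char` helper and the `assigned` probabilities are hypothetical; only the three costs come from the table.

```python
import math

def bits_per_char(probs):
    """Mean of -log2(p) over the probabilities given to the true characters."""
    return -sum(math.log2(p) for p in probs) / len(probs)

# Hypothetical probabilities a model assigned to four observed characters.
assigned = [0.5, 0.25, 0.5, 0.125]
cost = bits_per_char(assigned)  # (1 + 2 + 1 + 3) / 4 bits

# Costs reported in the table above.
baseline = 1.653
ensembles = {
    "gated linear vote": 1.707,
    "product of experts": 1.767,
}

# Absolute and relative gap of each ensemble over the simple mix.
gaps = {
    name: (value - baseline, (value - baseline) / baseline)
    for name, value in ensembles.items()
}
```

On the reported numbers, the linear vote is 0.054 bits per char worse than the simple mix (about 3.3%), and the product of experts is 0.114 bits worse (about 6.9%).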

Adding the phrase and topic experts made the next-character prediction worse, not better. The reason is the one that shaped everything after: character prediction saturates. The local character context already determines most of the next character, so phrase and topic structure have almost nothing left to contribute at that level.

We also learned how the two combiners differ. The linear vote degenerates to winner-take-all: it just picks the single best expert rather than blending. The product of experts is the right combiner, because constraints multiply and an uninformative expert auto-abstains, but only where the experts carry information the base level lacks.
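The abstention property can be checked with a short derivation. This is a sketch under assumptions: the product combiner is taken to be a weighted geometric mean with weights $w_i$, the linear vote a convex combination, and $V$ is the alphabet size. The notation is ours, not the experiment's.

```latex
% Product of experts: weighted geometric mean, renormalized.
p_{\mathrm{prod}}(x) = \frac{\prod_i p_i(x)^{w_i}}{\sum_{x'} \prod_i p_i(x')^{w_i}}

% If expert k is uniform, p_k(x) = 1/V for all x, its factor is a constant
% that appears in the numerator and in every term of the denominator:
p_{\mathrm{prod}}(x)
  = \frac{V^{-w_k} \prod_{i \neq k} p_i(x)^{w_i}}
         {V^{-w_k} \sum_{x'} \prod_{i \neq k} p_i(x')^{w_i}}
  = \frac{\prod_{i \neq k} p_i(x)^{w_i}}{\sum_{x'} \prod_{i \neq k} p_i(x')^{w_i}}

% Linear vote with \sum_i w_i = 1: the same uniform expert adds w_k / V
% to every symbol, flattening the result unless w_k = 0.
p_{\mathrm{lin}}(x) = \sum_{i \neq k} w_i \, p_i(x) + \frac{w_k}{V}
```

So a uniform expert drops out of the product regardless of its weight, whereas the linear vote can only silence it by pushing its weight to zero, which is consistent with the winner-take-all behavior described above.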

The lesson

Bits-per-char is the wrong metric for higher-level concepts. Character prediction saturates, so phrase and topic structure barely move it. And the product of experts is the right combiner, not linear voting.

This negative result was one of the most productive in the whole project. It diagnosed the saturation, named the right combiner, and sent us to measure higher-level ideas at the word level instead, where the very next experiment cut perplexity in half.

Lineage

Grew from the start. We built the first full ensemble here.

Led to one part repeated, which pooled experts the right way, and counted attention, which pooled positions.

Thread: the right combiner. Product of experts, not linear voting, runs from here through the Column to counted attention.