Earlier this week we removed a silent format coefficient that was penalizing multi-cam shows by 15–25%. A data-science audit had flagged it as statistically unidentifiable with only three scored shows. We agreed and pulled it.
Then we went further. We fit a hierarchical Bayesian model to the entire dataset to answer the deeper question: when you properly control for joke type, character, and episode, how much of a comedy show’s ranking is real signal, and how much is noise?
The answer is more humbling than we expected.
## The Model
We sampled 15,000 jokes across The Office, Seinfeld, and Friends (5,000 per show) and fit a model predicting each joke’s impact score (the LLM’s 0–10 audience-reaction estimate) as:
```
impact_j = grand_mean
         + format_effect[format(j)]     # fixed effect
         + show_effect[show(j)]         # partially-pooled random effect
         + joke_type_effect[type(j)]    # partially-pooled random effect
         + episode_effect[episode(j)]   # random intercept
         + character_effect[char(j)]    # random intercept
         + residual_noise
```
Everything was sampled with PyMC using NUTS (2 chains, 500 post-warmup draws, 0 divergences). This is a textbook hierarchical-effects model — the kind of setup you’d use for player effects in a sports analytics paper.
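To make the model structure concrete, here is a minimal NumPy sketch of the generative process the equation above describes. All standard deviations, group counts, and the seed are illustrative stand-ins, not the fitted posteriors:

```python
import numpy as np

rng = np.random.default_rng(0)

n_shows, n_episodes, n_types, n_chars = 3, 60, 8, 40
n_jokes = 15_000

# Illustrative effect scales -- NOT the fitted values.
grand_mean = 5.0
show_fx = rng.normal(0, 0.3, n_shows)        # partially-pooled show deflections
episode_fx = rng.normal(0, 0.5, n_episodes)  # random intercepts
type_fx = rng.normal(0, 0.4, n_types)
char_fx = rng.normal(0, 0.4, n_chars)
format_fx = np.array([0.0, -0.05])           # multi-cam baseline, single-cam

# Assign each joke to a show, episode, joke type, character, and format.
show = rng.integers(0, n_shows, n_jokes)
episode = rng.integers(0, n_episodes, n_jokes)
jtype = rng.integers(0, n_types, n_jokes)
char = rng.integers(0, n_chars, n_jokes)
fmt = rng.integers(0, 2, n_jokes)

# Sum of effects plus within-joke residual noise.
impact = (grand_mean + format_fx[fmt] + show_fx[show]
          + type_fx[jtype] + episode_fx[episode] + char_fx[char]
          + rng.normal(0, 1.2, n_jokes))

print(impact.mean(), impact.std())
```

The real model inverts this process: given the 15,000 observed scores, it infers posteriors over the effect arrays and the residual scale.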
## Finding 1: The format effect is statistically zero
Here’s the posterior for the format coefficient (single-cam vs. multi-cam baseline):
| | Posterior median | 95% CrI | P(effect > 0) |
|---|---|---|---|
| Single-cam (vs. multi-cam baseline) | −0.052 | [−0.590, +0.442] | 0.40 |
Translation: the posterior distribution puts a 60% chance that the single-cam effect on impact is negative, 40% it’s positive. The credible interval straddles zero. After controlling for everything else, we cannot distinguish single-cam from multi-cam on impact.
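All three numbers in the table are simple summaries of the posterior draws. A sketch of how they would be computed, using simulated stand-in draws (the real ones live in `format_posteriors.json`):

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in posterior draws for the single-cam coefficient (simulated;
# chosen only to roughly resemble the published summary).
draws = rng.normal(-0.05, 0.26, size=1000)

median = np.median(draws)                     # posterior median
lo, hi = np.percentile(draws, [2.5, 97.5])    # 95% credible interval
p_positive = (draws > 0).mean()               # P(effect > 0)

print(f"median={median:+.3f}  95% CrI=[{lo:+.3f}, {hi:+.3f}]  P(>0)={p_positive:.2f}")
```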
This is vindication. The old 15–25% coefficient wasn’t just poorly calibrated — it was applying a correction to an effect the data doesn’t support.
## Finding 2: The three shows are statistically indistinguishable
Show random-effect deflections (on the 0–10 impact scale):
| Show | Median deflection | 95% CrI |
|---|---|---|
| Seinfeld | +0.154 | [−0.224, +0.530] |
| The Office | −0.007 | [−0.505, +0.456] |
| Friends | −0.131 | [−0.498, +0.235] |
All three intervals overlap. The posterior median orders them Seinfeld > Office > Friends, which matches our raw Humor Index rankings. But the statistical story is that this ordering is within noise. The probability that Seinfeld’s show-effect really is higher than Friends’ is around 82%. That’s meaningfully better than a coin flip, but it’s not the 99%+ certainty you’d want to publish a ranking claim with.
If we get three more scored shows into the dataset, these intervals will narrow. But as of today, with 3 shows and 15K sampled jokes, the shows’ impact-quality differences don’t clear the statistical bar.
## Finding 3: 64% of variance is unexplained joke-level noise
The model’s variance decomposition:
- Within-joke residual (unexplained): 63.9%
- Between-episode within show: 11.8%
- Between-joke-type: 8.9%
- Between-show: 7.9%
- Between-character: 7.5%
Shows explain only 7.9% of total joke-level variance. That is almost identical to the variance explained by joke type (8.9%) or individual character (7.5%), and less than variance between episodes within a show (11.8%).
Two-thirds of the variance is within-joke residual — the LLM gives similar jokes meaningfully different scores. Some of this is real (the same joke type can be executed well or badly), and some is LLM noise. Without an inter-rater reliability study we can’t distinguish the two.
## What This Actually Means for the Rankings
The Humor Index, Comedy WAR, and every leaderboard on this site are computed from aggregates of joke-level scores. When the joke-level model can’t distinguish shows, the aggregates rank them — but those ranks sit on a foundation of overlap.
In practice: if you’re reading "Seinfeld has a Humor Index of 83.9 vs. The Office’s 80.2," you should read that as "Seinfeld scores higher on our current sample, but the difference is within the range of how much rescoring noise would move these numbers." A 3-point Humor Index gap is bigger than the typical inter-episode bootstrap CI but smaller than the show-level credible interval.
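For intuition on the inter-episode bootstrap mentioned above, here is a minimal sketch: resample episodes with replacement and recompute the show-level mean each time. The episode scores are hypothetical stand-in data, not our actual per-episode values:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical per-episode mean impact scores for one show (stand-in data).
episode_means = rng.normal(8.0, 0.6, size=24)

# Bootstrap the show-level mean by resampling episodes with replacement.
boot = np.array([
    rng.choice(episode_means, size=episode_means.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"show mean = {episode_means.mean():.2f}, 95% bootstrap CI = [{lo:.2f}, {hi:.2f}]")
```

With ~24 episodes, this CI is typically much narrower than the show-level posterior credible interval, which is why a 3-point Humor Index gap can clear one bar but not the other.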
This doesn’t mean the rankings are wrong. It means they’re not statistically distinguishable given current data. That’s a feature of being honest about our sample size and model, not a bug in the analysis.
## What We’re Changing on the Site
1. Credible interval badges on show pages. Next to each show’s Humor Index, we’re surfacing the 95% credible interval from this model. A reader can see that Friends and Office have overlapping intervals and draw their own conclusion.
2. Variance decomposition on the methodology page. The 64% within-joke noise figure is going in the Known Limitations section. Readers should know that two-thirds of what our model sees in joke-level scores is unexplained.
3. The format filter stays. Since format doesn’t have an identifiable effect on impact, the filter is just a convenience for users who want to compare multi-cam to multi-cam. It’s no longer a silent correction.
## The Big Picture
This result aligns with what a lot of comedy writers will tell you: there is no universally correct answer to "which show is funnier." Our data suggests the answer is somewhere between "they’re essentially the same" and "the differences we measure are small enough that the model can’t confidently order them."
We’re publishing the full model artifacts — posterior samples, variance components, and credible intervals — in the site’s `public/data/` directory, so anyone who wants to reanalyze is welcome to.
Model outputs: [format_posteriors.json](/data/format_posteriors.json) • [show_credible_intervals.json](/data/show_credible_intervals.json) • [variance_decomposition.json](/data/variance_decomposition.json)