Earlier this week we removed a silent format coefficient that was penalizing multi-cam shows by 15–25%. A data-science audit had flagged it as statistically unidentifiable with only three scored shows. We agreed and pulled it.
Then we went further. We fit a hierarchical Bayesian model to the entire dataset to answer the deeper question: when you properly control for joke type, character, and episode, how much of a comedy show’s ranking is real signal, and how much is noise?
The answer is more humbling than we expected.
## The Model
We sampled 15,000 jokes across The Office, Seinfeld, and Friends (5,000 per show) and fit a model predicting each joke’s impact score (the LLM’s 0–10 audience-reaction estimate) as:
```
impact_j = grand_mean
         + format_effect[format(j)]     # fixed effect
         + show_effect[show(j)]         # partially-pooled random effect
         + joke_type_effect[type(j)]    # partially-pooled random effect
         + episode_effect[episode(j)]   # random intercept
         + character_effect[char(j)]    # random intercept
         + residual_noise
```
Everything was sampled with PyMC using NUTS (2 chains, 500 post-warmup draws, 0 divergences). This is a textbook hierarchical-effects model — the kind of setup you’d use for player effects in a sports analytics paper.
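To make the model structure concrete, here is a minimal NumPy sketch of the generative process the equation above describes. All standard deviations, group counts, and the seed are illustrative stand-ins, not the fitted posteriors:

```python
import numpy as np

rng = np.random.default_rng(0)

n_shows, n_episodes, n_types, n_chars = 3, 60, 8, 40
n_jokes = 15_000

# Illustrative effect scales -- NOT the fitted values.
grand_mean = 5.0
show_fx = rng.normal(0, 0.3, n_shows)        # partially-pooled show deflections
episode_fx = rng.normal(0, 0.5, n_episodes)  # random intercepts
type_fx = rng.normal(0, 0.4, n_types)
char_fx = rng.normal(0, 0.4, n_chars)
format_fx = np.array([0.0, -0.05])           # multi-cam baseline, single-cam

# Assign each joke to a show, episode, joke type, character, and format.
show = rng.integers(0, n_shows, n_jokes)
episode = rng.integers(0, n_episodes, n_jokes)
jtype = rng.integers(0, n_types, n_jokes)
char = rng.integers(0, n_chars, n_jokes)
fmt = rng.integers(0, 2, n_jokes)

# Sum of effects plus within-joke residual noise.
impact = (grand_mean + format_fx[fmt] + show_fx[show]
          + type_fx[jtype] + episode_fx[episode] + char_fx[char]
          + rng.normal(0, 1.2, n_jokes))

print(impact.mean(), impact.std())
```

The real model inverts this process: given the 15,000 observed scores, it infers posteriors over the effect arrays and the residual scale.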
## Finding 1: The format effect is statistically zero
Here’s the posterior for the format coefficient (single-cam vs. multi-cam baseline):
| | Posterior median | 95% CrI | P(effect > 0) |
|---|---|---|---|
| Single-cam (vs. multi-cam baseline) | −0.052 | [−0.590, +0.442] | 0.40 |
Translation: the posterior distribution puts a 60% chance that the single-cam effect on impact is negative, 40% it’s positive. The credible interval straddles zero. After controlling for everything else, we cannot distinguish single-cam from multi-cam on impact.
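All three numbers in the table are simple summaries of the posterior draws. A sketch of how they would be computed, using simulated stand-in draws (the real ones live in `format_posteriors.json`):

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in posterior draws for the single-cam coefficient (simulated;
# chosen only to roughly resemble the published summary).
draws = rng.normal(-0.05, 0.26, size=1000)

median = np.median(draws)                     # posterior median
lo, hi = np.percentile(draws, [2.5, 97.5])    # 95% credible interval
p_positive = (draws > 0).mean()               # P(effect > 0)

print(f"median={median:+.3f}  95% CrI=[{lo:+.3f}, {hi:+.3f}]  P(>0)={p_positive:.2f}")
```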
This is vindication. The old 15–25% coefficient wasn’t just poorly calibrated — it was applying a correction to an effect the data doesn’t support.
## Finding 2: The three shows are statistically indistinguishable
Show random-effect deflections (on the 0–10 impact scale):
| Show | Median deflection | 95% CrI |
|---|---|---|
| Seinfeld | +0.154 | [−0.224, +0.530] |
| The Office | −0.007 | [−0.505, +0.456] |
| Friends | −0.131 | [−0.498, +0.235] |
All three intervals overlap. The posterior median orders them Seinfeld > Office > Friends, which matches our raw Humor Index rankings. But the statistical story is that this ordering is within noise. The probability that Seinfeld’s show-effect really is higher than Friends’ is around 82%. That’s meaningfully better than a coin flip, but it’s not the 99%+ certainty you’d want to publish a ranking claim with.
If we get three more scored shows into the dataset, these intervals will narrow. But as of today, with 3 shows and 15K sampled jokes, the shows’ impact-quality differences don’t clear the statistical bar.
## Finding 3: 64% of variance is unexplained joke-level noise
The model’s variance decomposition:
- Within-joke residual (unexplained): 63.9%
- Between-episode within show: 11.8%
- Between-joke-type: 8.9%
- Between-show: 7.9%
- Between-character: 7.5%
Shows explain only 7.9% of total joke-level variance. That is almost identical to the variance explained by joke type (8.9%) or individual character (7.5%), and less than variance between episodes within a show (11.8%).
Two-thirds of the variance is within-joke residual — the LLM gives similar jokes meaningfully different scores. Some of this is real (the same joke type can be executed well or badly), and some is LLM noise. Without an inter-rater reliability study we can’t distinguish the two.
## What This Actually Means for the Rankings
The Humor Index, Comedy WAR, and every leaderboard on this site are computed from aggregates of joke-level scores. When the joke-level model can’t distinguish shows, the aggregates rank them — but those ranks sit on a foundation of overlap.
In practice: if you’re reading "Seinfeld has a Humor Index of 83.9 vs. The Office’s 80.2," you should read that as "Seinfeld scores higher on our current sample, but the difference is within the range of how much rescoring noise would move these numbers." A 3-point Humor Index gap is bigger than the typical inter-episode bootstrap CI but smaller than the show-level credible interval.
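For intuition on the inter-episode bootstrap mentioned above, here is a minimal sketch: resample episodes with replacement and recompute the show-level mean each time. The episode scores are hypothetical stand-in data, not our actual per-episode values:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical per-episode mean impact scores for one show (stand-in data).
episode_means = rng.normal(8.0, 0.6, size=24)

# Bootstrap the show-level mean by resampling episodes with replacement.
boot = np.array([
    rng.choice(episode_means, size=episode_means.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"show mean = {episode_means.mean():.2f}, 95% bootstrap CI = [{lo:.2f}, {hi:.2f}]")
```

With ~24 episodes, this CI is typically much narrower than the show-level posterior credible interval, which is why a 3-point Humor Index gap can clear one bar but not the other.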
This doesn’t mean the rankings are wrong. It means they’re not statistically distinguishable given current data. That’s a feature of being honest about our sample size and model, not a bug in the analysis.
## What We’re Changing on the Site
1. Credible interval badges on show pages. Next to each show’s Humor Index, we’re surfacing the 95% credible interval from this model. A reader can see that Friends and Office have overlapping intervals and draw their own conclusion.
2. Variance decomposition on the methodology page. The 64% within-joke noise figure is going in the Known Limitations section. Readers should know that two-thirds of what our model sees in joke-level scores is unexplained.
3. The format filter stays. Since format doesn’t have an identifiable effect on impact, the filter is just a convenience for users who want to compare multi-cam to multi-cam. It’s no longer a silent correction.
## The Big Picture
This result aligns with what a lot of comedy writers will tell you: there is no universally correct answer to "which show is funnier." Our data suggests the answer is somewhere between "they’re essentially the same" and "the differences we measure are small enough that the model can’t confidently order them."
We’re publishing the full model artifacts — posterior samples, variance components, and credible intervals — in the site’s `public/data/` directory, so anyone who wants to reanalyze is welcome to.
Model outputs: [format_posteriors.json](/data/format_posteriors.json) • [show_credible_intervals.json](/data/show_credible_intervals.json) • [variance_decomposition.json](/data/variance_decomposition.json)