Transparency
How We Score
Every number on this site is derived from the same pipeline. Here's exactly how it works.
The Formula
Three Pillars
The Humor Index is a weighted composite of three independently scored dimensions. Each episode is analyzed at the joke level, then aggregated to season and show level.
Raw comedic density across the episode plus peak density across its best runs. We count every distinct joke that lands — setups with identifiable punchlines, physical gags, callbacks, and running gags — then divide by the episode's net comedy runtime. The peak-density component (15% of the final score) rewards episodes that hit elite stretches; the weighted-JPM component (10%) rewards sustained density over the full episode.
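A minimal sketch of the two density measures described above. It assumes each joke carries a timestamp in minutes, and it treats "peak density" as the highest joke rate in any fixed-length window. The window length and the sliding-window definition are illustrative assumptions; the source does not specify how peak runs are identified.

```python
from bisect import bisect_left


def jokes_per_minute(joke_times_min, net_runtime_min):
    """Overall density: distinct jokes divided by net comedy runtime (minutes)."""
    if net_runtime_min <= 0:
        raise ValueError("net_runtime_min must be positive")
    return len(joke_times_min) / net_runtime_min


def peak_density(joke_times_min, window_min=2.0):
    """Highest jokes-per-minute rate in any window of `window_min` minutes.

    ASSUMPTION: the window length and this sliding-window definition are
    hypothetical; they are not taken from the actual pipeline.
    """
    times = sorted(joke_times_min)
    best = 0
    for i, start in enumerate(times):
        # Count jokes in the half-open window [start, start + window_min).
        end_idx = bisect_left(times, start + window_min)
        best = max(best, end_idx - i)
    return best / window_min


# Toy timestamps (minutes into the episode), for illustration only.
toy_times = [0.5, 1.0, 1.2, 1.9, 2.4, 7.0, 15.0, 20.5]
toy_jpm = jokes_per_minute(toy_times, net_runtime_min=21.0)  # 8 jokes / 21 min
toy_peak = peak_density(toy_times, window_min=2.0)           # 5 jokes / 2 min = 2.5
```

In the toy episode the overall rate is low, but the opening two minutes hold five jokes, which is the kind of elite stretch the peak component is meant to reward.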
Quality over quantity. Each joke is scored across five sub-dimensions (see below). Craft rewards structural sophistication: misdirection, subverted expectations, perfect character fit, and timing precision. A show can have low JPM and high Craft (slow-burn prestige comedy) or vice versa.
Resonance and staying power. Quotability (does this line get repeated?), rewatch value (is it funnier the second time?), cultural footprint, and callback payoff. Impact captures what pure craft analysis misses — a technically average joke that becomes a catchphrase scores higher here. A small memorability bonus is added on top for episodes packed with iconic moments.
Craft Breakdown
The Craft Rubric
Craft is scored 1–10 across five sub-dimensions, equally weighted. Each is assessed per joke, then averaged across the episode.
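As a minimal sketch of that aggregation (the field names are hypothetical, not the pipeline's actual schema), each joke's Craft is the plain mean of its five sub-scores, and the episode's Craft is the mean over its jokes:

```python
from statistics import mean

# Hypothetical field names for the five rubric sub-dimensions.
SUB_DIMENSIONS = (
    "setup_quality",
    "misdirection",
    "subversion",
    "character_fit",
    "timing_precision",
)


def joke_craft(scores):
    """Equal-weight mean of the five sub-dimension scores (each 1-10)."""
    for name in SUB_DIMENSIONS:
        if not 1 <= scores[name] <= 10:
            raise ValueError(f"{name} must be between 1 and 10")
    return mean(scores[name] for name in SUB_DIMENSIONS)


def episode_craft(jokes):
    """Average the per-joke Craft scores across the episode."""
    return mean(joke_craft(joke) for joke in jokes)


# Toy data for illustration only.
toy_jokes = [
    {"setup_quality": 8, "misdirection": 6, "subversion": 7,
     "character_fit": 9, "timing_precision": 5},
    {"setup_quality": 6, "misdirection": 6, "subversion": 6,
     "character_fit": 6, "timing_precision": 6},
]
toy_episode_craft = episode_craft(toy_jokes)  # (7 + 6) / 2 = 6.5
```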
Setup Quality
How efficiently and elegantly the premise is established. A great setup is invisible — it plants the seed without telegraphing the punchline.
Example: "George is a marine biologist now." (Seinfeld S5E14)
Misdirection
The degree to which the joke leads the audience toward a false expectation before subverting it. True misdirection is earned, not cheap.
Example: "I am not superstitious... but I am a little stitious." (The Office)
Subversion / Surprise
Does the punchline go somewhere unexpected? Subversion is the delta between what the audience anticipated and what actually happened.
Example: The Soup Nazi's final line in the episode finale.
Character Fit
Could only this character have delivered this joke? High character fit means the joke reveals or reinforces something true about who the character is.
Example: Every Dwight exit from a room.
Timing Precision
Beat length, pause placement, delivery speed. Timing is harder to score but we proxy it through editing rhythm, reaction shot positioning, and line overlap.
Example: Jim's camera looks. George's pauses before responding.
Format Adjustment
The Laugh Track Correction
Multi-camera shows with sweetened laugh tracks structurally inflate joke density: the laugh track signals joke boundaries that a single-camera show leaves implicit. To compare across formats, earlier versions applied a format coefficient to the raw Humor Index. Every coefficient is now 1.00×, so the published score is the raw score.
| Format | Coefficient | Effect |
|---|---|---|
| Single Camera | 1.00× | Baseline |
| Hybrid | 1.00× | Baseline |
| Multi-Camera (Live) | 1.00× | Baseline |
| Multi-Camera (Sweetened) | 1.00× | Baseline |
| Animation | 1.00× | Baseline |
Deprecated as of 2026-04-16. Coefficients have all been set to 1.0 (no format adjustment). The old coefficients were confounded with show identity and calibrated against an opaque small sample. We now publish raw scores and let you filter by format on the shows and rankings pages. See the "Why we don't adjust for format" section below.
Policy
Why We Don't Adjust for Format
An earlier version of the Humor Index silently penalized multi-cam shows (live audience, laugh track) by 15–25%, on the premise that audience reaction inflated perceived impact. That adjustment had three problems:
- Confounding. With only three scored shows, the "format effect" can't be statistically separated from show-level idiosyncrasies. Each format is represented by just one or two shows, which is too few to identify a format coefficient.
- Opaque calibration. The coefficients were point estimates with no published confidence interval and a sample we can't re-examine.
- Silent correction. Friends' 72.8 and Seinfeld's 77.9 partly reflected a 15% multi-cam tax, without that being visible to readers.
Our fix: report raw scores, tag every show with its format, and offer format-filtered leaderboards so you can compare like against like. Multi-cam shows and single-cam shows aren't directly comparable on a single scale — they're different comedy traditions with different conventions.
Score changes from the removal: Seinfeld 77.9 → 83.9, Friends 72.8 → 78.7, The Office 81.0 → 80.2. The original scores are preserved on each page as humor_index_v1 for transparency.
Subsequent update (April 18, 2026): we then discovered that Jerry's stand-up bits at The Improv were being scored as sitcom comedy. Applying a 0.30 standup weighting and rescoring all 172 Seinfeld episodes with 3-run consensus moved Seinfeld from 83.9 → 77.8.
Reconciliation (May 25, 2026): as new shows were added the live leaderboard drifted from the canonical aggregation, so we re-aggregated all nine scored shows with one consistent method and added bootstrap 95% confidence intervals. Current published order:
- 30 Rock 84.4
- Arrested Development 82.0
- The Office 79.2
- Community 77.9
- Parks and Recreation 77.7
- Taxi 77.3
- Schitt's Creek 77.3
- Seinfeld 77.0
- Friends 73.3
30 Rock and Arrested Development pull clear at the top; the middle six cluster inside each other's intervals; Friends now sits clearly below the pack.
Reliability
Our Scorer's Noise Floor
We ran a test-retest study: 30 episodes were scored twice, both in blind mode, with different random seeds. Results:
Intraclass correlation (ICC)
- Humor Index (0–100): 0.28 (poor, < 0.40)
- avg_craft (0–10): 0.28
- avg_impact (0–10): 0.24
- total_jokes detected: 0.67 (moderate)
- JPM: 0.53 (moderate)
Mean absolute difference between two blind runs of the same episode: 10.7 Humor Index points. 72% of single-run Humor Index variance is run-to-run scorer noise; only 28% is real episode signal.
Why: joke detection is stable (67% signal), but per-joke craft/impact ratings jitter by ~5% between runs. The Humor Index formula amplifies that noise through threshold-based metrics (peak_density, memorability_bonus) that flip on small score changes.
What this means for rankings:
- Show-level rankings hold up. Averaging over 170–236 episodes drives the standard error on show Humor Index to roughly ±0.4 points. The 3–6 point gaps between Seinfeld, Office, and Friends are well above that floor.
- Individual episode rankings have ±5 point noise. Two episodes within ~10 Humor Index points of each other are essentially tied under single-run scoring.
- Extreme episodes still stand out. Dinner Party (100) vs an average 75 is comfortably above the noise. It's the close-finish ordering that's uncertain.
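What we're doing: consensus scoring. Our pipeline supports multi-run scoring via --num-runs. All new shows starting with Parks and Recreation will be scored 3× per episode and averaged. If the runs are independent, the Spearman-Brown projection from a single-run ICC of 0.28 puts the 3-run average at about 0.54, which is moderate (≥0.40). Five runs would reach only about 0.66; the "good" threshold (≥0.75) would take roughly eight runs.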
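The effect of averaging runs can be projected with the Spearman-Brown formula. This is a sketch under the assumption that runs are independent and equally noisy; the function names are illustrative, and the only input taken from the study is the single-run Humor Index ICC of 0.28.

```python
import math


def spearman_brown(icc_single, k):
    """Projected reliability of the mean of k independent runs."""
    return k * icc_single / (1 + (k - 1) * icc_single)


def runs_needed(icc_single, target):
    """Smallest number of averaged runs whose projected ICC reaches `target`."""
    return math.ceil(target * (1 - icc_single) / (icc_single * (1 - target)))


SINGLE_RUN_ICC = 0.28  # Humor Index ICC from the test-retest study

icc_3 = spearman_brown(SINGLE_RUN_ICC, 3)            # about 0.54
icc_5 = spearman_brown(SINGLE_RUN_ICC, 5)            # about 0.66
runs_for_good = runs_needed(SINGLE_RUN_ICC, 0.75)    # 8
```

Under these assumptions three runs clear the 0.40 line comfortably, while the 0.75 line needs about eight runs rather than five.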
Full study: see the blog post.
Bias
Show-Identity Bias: Small and Not Significant
We scored 99 episodes in blind mode (no show name, no character list, no description fed to the LLM) and compared them to their non-blind production scores. Paired differences (non-blind minus blind, so a negative value means the blind score was higher):
- Pooled (n=99): −1.47 HI points (95% CI: [−3.72, +0.79])
- Seinfeld (n=33): −2.45 (CI [−5.71, +0.82])
- The Office (n=33): −1.23 (CI [−5.11, +2.65])
- Friends (n=33): −0.72 (CI [−5.29, +3.84])
All CIs include zero, so no show has a statistically significant bias from the scorer knowing the show name. The direction (blind scores slightly higher, not lower) is the opposite of what a naive "AI favors famous shows" hypothesis would predict.
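A paired comparison of this kind can be sketched as follows. The function name is hypothetical, the interval uses a normal approximation (the study may have used a t-distribution or another method), and the scores are toy values, not study data.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_difference_ci(non_blind, blind, confidence=0.95):
    """Mean of (non_blind - blind) with a normal-approximation CI.

    ASSUMPTION: normal approximation; with n=33 per show a t-based
    interval would be slightly wider.
    """
    if len(non_blind) != len(blind):
        raise ValueError("score lists must be paired episode by episode")
    diffs = [nb - b for nb, b in zip(non_blind, blind)]
    center = mean(diffs)
    std_err = stdev(diffs) / sqrt(len(diffs))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center, center - z * std_err, center + z * std_err


# Toy scores for four episodes, for illustration only.
toy_non_blind = [80, 75, 70, 90]
toy_blind = [82, 74, 73, 91]
toy_mean, toy_low, toy_high = paired_difference_ci(toy_non_blind, toy_blind)
# toy_mean is -1.25, and the interval spans zero.
```

An interval that spans zero, as in the toy example, is what "not statistically significant" means for each row above.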
Bayesian Model
What a Hierarchical Model Actually Finds
We fit a hierarchical Bayesian model to 15,000 jokes (5,000 per scored show) predicting joke-level impact from show, format, joke type, episode, and character. Here's what came out.
Format effect
Single-cam vs multi-cam: −0.052 (95% CrI: [−0.590, +0.442])
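The 95% credible interval straddles zero. After controlling for show, joke type, episode, and character, we cannot statistically distinguish single-cam from multi-cam on impact. The interval is wide, so this is an absence of evidence for a format effect rather than proof that none exists, but it is consistent with the decision to set the format coefficient to 1.0.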
Show effects (impact deflection from grand mean)
- Seinfeld: +0.154 (95% CrI: [−0.224, +0.530])
- The Office: −0.007 (95% CrI: [−0.505, +0.456])
- Friends: −0.131 (95% CrI: [−0.498, +0.235])
All three intervals overlap. The posterior ordering (Seinfeld > Office > Friends) agrees with the Humor Index in placing Friends last of the three, but the current published order puts The Office (79.2) above Seinfeld (77.0), and the differences are within the statistical noise of this model. The probability that Seinfeld beats Friends on show effect is approximately 82%: better than a coin flip, but not 99%+ certain.
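A probability such as "Seinfeld beats Friends" is typically read off the posterior draws as the share of draws in which one show effect exceeds the other. The sketch below assumes paired draws from the same sampler iterations; the function name and the five toy draws are hypothetical, not the model's output.

```python
def prob_greater(draws_a, draws_b):
    """Share of paired posterior draws in which effect A exceeds effect B."""
    if len(draws_a) != len(draws_b):
        raise ValueError("draws must come from the same posterior iterations")
    wins = sum(a > b for a, b in zip(draws_a, draws_b))
    return wins / len(draws_a)


# Toy draws for illustration only; a real posterior has thousands.
toy_show_a = [0.2, 0.1, -0.1, 0.3, 0.0]
toy_show_b = [-0.1, 0.2, -0.2, 0.1, -0.3]
toy_prob = prob_greater(toy_show_a, toy_show_b)  # 4 of 5 draws -> 0.8
```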
Variance decomposition
- 63.9% — within-joke residual (unexplained)
- 11.8% — between episodes within a show
- 8.9% — between joke types
- 7.9% — between shows
- 7.5% — between characters
Shows explain only 7.9% of total joke-level variance, so rankings between shows capture a small fraction of what makes a joke score well. Nearly two-thirds of the variance (63.9%) is unexplained within-joke residual: some real (the same joke type executed better or worse), some LLM noise.
Full Bayesian model outputs are published at /data/format_posteriors.json, /data/show_credible_intervals.json, and /data/variance_decomposition.json.
Uncertainty
Confidence Intervals & Percentiles
Every score on the site is a point estimate with real uncertainty. We now publish:
- 95% bootstrap confidence intervals on episode, season, and show Humor Indexes. Episode-level CIs resample the episode's own jokes with replacement 200× and take the 2.5th and 97.5th percentiles of the resulting score distribution. Season/show-level CIs resample episodes.
- Show-relative percentile on every episode. An episode at p90 in Friends means it's funnier than 90% of scored Friends episodes, independent of the absolute score.
- Z-scores within show and within season. Useful for cross-show comparisons that control for the show's overall comedy baseline.
These are sampling-uncertainty estimates: they capture how much the score would jitter if we resampled jokes or episodes. They do not capture structural bias (LLM compression, format effects, etc.). See the Known Limitations section.
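The three published quantities can be sketched as below. The function names are hypothetical, `score_fn` stands in for whatever turns a list of jokes (or episodes) into a Humor Index, the percentile uses a strictly-lower convention, and the z-score uses the population standard deviation; those last two choices are assumptions, not confirmed details of the pipeline.

```python
import random
from statistics import mean, pstdev, quantiles


def bootstrap_ci(items, score_fn, n_resamples=200, seed=0):
    """95% percentile bootstrap CI: resample items with replacement."""
    rng = random.Random(seed)
    scores = [
        score_fn(rng.choices(items, k=len(items)))
        for _ in range(n_resamples)
    ]
    # n=40 yields cut points every 2.5%; first is 2.5th, last is 97.5th.
    cuts = quantiles(scores, n=40, method="inclusive")
    return cuts[0], cuts[-1]


def show_percentile(score, show_scores):
    """Percent of the show's scored episodes with a strictly lower score."""
    return 100 * sum(s < score for s in show_scores) / len(show_scores)


def z_score(score, group_scores):
    """Standardized distance from the group mean (show or season)."""
    return (score - mean(group_scores)) / pstdev(group_scores)


# Toy data for illustration only; `mean` is a stand-in for the real formula.
toy_joke_scores = [6.0, 7.0, 8.0, 5.0, 9.0, 7.5]
ci_low, ci_high = bootstrap_ci(toy_joke_scores, mean)

toy_show_scores = [70, 75, 80, 85, 90]
toy_pct = show_percentile(85, toy_show_scores)  # 3 of 5 lower -> 60.0
toy_z = z_score(90, toy_show_scores)            # 10 / sqrt(50), about 1.41
```

Passing a fixed `seed` keeps the interval reproducible between page builds, which matters when only 200 resamples are drawn.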
Career Value
Comedy WAR (Wins Above Replacement)
Career WAR ranks characters by total comedic contribution relative to a "replacement-level" bench player. Higher WAR means more jokes at higher average quality than you'd get from a typical recurring character.
Formula (v2)
WAR = total_jokes × max(shrunk_quality − replacement_quality, 0)
shrunk_quality = (n · observed_quality + 30 · league_median) / (n + 30)
observed_quality = (avg_craft + avg_impact) / 2
- Replacement quality = 25th percentile of the (craft+impact)/2 quality metric among bench-player characters (10–50 analyzed jokes). As of 2026-04-16 that level sits at 6.555.
- Bayesian shrinkage (k=30) pulls small-sample estimates toward the league median (6.775), preventing a 10-joke guest star with a lucky mean from beating a 1,000-joke lead.
- WAR/Episode = WAR ÷ episodes appeared. Use this for cross-era and cross-run-length comparisons.
History: v1 used the fixed scale midpoint of 5 as the replacement baseline (quality − 5). Because quality scores cluster near 6.5 (the league median is 6.775), WAR collapsed to roughly 1.5 × total_jokes, effectively a screen-time metric. v2 swaps in an empirical replacement level and adds Bayesian shrinkage, so rankings now reflect quality × volume.
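The v2 formula can be worked through directly. The constants are the published values as of 2026-04-16; the function names and the three example characters are hypothetical.

```python
LEAGUE_MEDIAN = 6.775        # published league median quality
REPLACEMENT_QUALITY = 6.555  # published replacement level (2026-04-16)
SHRINKAGE_K = 30             # published shrinkage strength


def shrunk_quality(n_jokes, avg_craft, avg_impact):
    """Pull the observed quality toward the league median by k pseudo-jokes."""
    observed = (avg_craft + avg_impact) / 2
    return (n_jokes * observed + SHRINKAGE_K * LEAGUE_MEDIAN) / (n_jokes + SHRINKAGE_K)


def war(n_jokes, avg_craft, avg_impact):
    """WAR = total_jokes x max(shrunk_quality - replacement_quality, 0)."""
    margin = shrunk_quality(n_jokes, avg_craft, avg_impact) - REPLACEMENT_QUALITY
    return n_jokes * max(margin, 0)


def war_per_episode(total_war, episodes_appeared):
    """Normalize career WAR by appearances for cross-era comparisons."""
    return total_war / episodes_appeared


# Hypothetical characters for illustration only.
guest_war = war(10, avg_craft=8.0, avg_impact=8.0)    # about 5.26
lead_war = war(1000, avg_craft=7.2, avg_impact=7.0)   # about 535.5
bench_war = war(100, avg_craft=6.0, avg_impact=6.0)   # below replacement -> 0
```

The guest star's observed quality of 8.0 shrinks to about 7.08 after ten jokes, so a lucky small sample cannot outrank a high-volume lead whose 7.1 average barely moves.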
Honesty
Known Limitations
LLM score compression
Across 594 scored episodes, average craft scores have a standard deviation of just 0.36 (nominal 0–10 scale). Most of the headline signal in the Humor Index comes from joke count and peak density, not fine-grained craft differences between episodes.
Small samples and Bayesian shrinkage
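Quality estimates for characters with fewer than 50 analyzed jokes are shrunk aggressively toward the league median, which compresses their WAR: at 50 jokes the median still carries 30/80 = 37.5% of the weight. Rankings stabilize once a character crosses ~100 jokes.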
Visual gags are underweighted
AI works from transcripts and scene descriptions. Physical comedy — a pratfall, a facial expression, Kramer's entrances — is harder to capture and likely underscored.
Sarcasm and irony are hard
Tone doesn't exist in a transcript. Ironic deadpan (e.g. Jim's camera looks) is identifiable through context, but subtle sarcasm likely gets miscategorized at the edges.
Cultural context decays
Jokes referencing specific 90s or 00s cultural moments may score lower on "cultural footprint" than they should, since the model's resonance signals are present-weighted.
JPM uses estimated runtime
Jokes Per Minute currently divides by an LLM-estimated episode runtime rather than an authoritative TMDB runtime. This makes JPM slightly self-correlated with joke count and format. Switching to TMDB runtime is planned.
Scoring is not blind to show identity
The LLM sees the show name and character list when scoring. This can introduce show-level priors. We ran a 99-episode blind-mode study (above): pooled difference was −1.47 HI points, 95% CI [−3.72, +0.79] — no statistically significant bias, and the direction was opposite to what a naive "AI favors famous shows" hypothesis would predict. A full-corpus rescoring is still future work.
No audience data
We don't use ratings, streaming numbers, or social media sentiment in the score itself. Across 591 episodes, the Humor Index correlates with IMDb audience ratings at r = −0.005 — they measure different things.
Only scripted comedy
Stand-up specials, improv, and sketch comedy require a different methodology. This pipeline is calibrated for scripted, episodic television only.
Glossary
Key Terms, Defined
Jokes Per Minute (JPM)
The number of distinct, gradeable jokes per minute of runtime. We count separable jokes a viewer could point to — not every comedic line — so JPM is comparable across every show.
Humor Index
A 0–100 composite of peak joke density, craft, impact, and weighted JPM. Every joke is scored, then rolled up to episode, season, and show level.
Craft
The writing quality of a joke, scored 1–10 across setup quality, misdirection, subversion/surprise, character fit, and timing precision.
Impact
Audience resonance of a joke (1–10): quotability, rewatch value, cultural footprint, and callback payoff.
WAR (Wins Above Replacement)
A character’s joke count × how far their shrunk average quality exceeds a replacement-level baseline. Rewards both volume and per-joke quality.
Replacement level
The quality of a 25th-percentile "bench" character. WAR measures value above this floor, so merely showing up doesn’t accrue value.