What is jokes per minute (JPM)?

Jokes per minute (JPM) is the number of distinct, separately identifiable jokes a show lands per minute of runtime. The Humor Index counts gradeable jokes — not every comedic line or laugh-track beat — so the figure is lower than informal counts but consistent across every show.

What is the Humor Index?

The Humor Index is a 0–100 score combining four signals: peak joke density, craft (writing quality), impact (audience resonance), and weighted jokes per minute. Every joke in an episode is scored by AI, then aggregated to episode, season, and show level.

What is comedy WAR (Wins Above Replacement)?

WAR adapts baseball’s Wins Above Replacement to comedy: a character’s joke count multiplied by how much their shrunk average quality exceeds a replacement-level baseline (the 25th-percentile bench character). It rewards both volume and quality, so high-output stars rank above one-line scene-stealers.

What do craft and impact mean?

Craft is the writing quality of a joke across five sub-dimensions (originality, structure, character integration, economy, earned-vs-cheap). Impact is audience resonance — quotability, rewatch value, cultural footprint, and callback payoff. Both are scored 1–10 per joke.

The Show Your Work

How we turn jokes into numbers.

Every score on this site comes from the same pipeline — AI reading the whole transcript, tagging every laugh, rating each on a five-axis craft rubric, then aggregating up to the episode, season, and show. Here's the math, the rubric, and the noise floor we've measured against it.

Density: 25%
Craft: 40%
Impact: 35%

The Formula

Three Pillars

The Humor Index is a weighted composite of three independently scored dimensions. Each episode is analyzed at the joke level, then aggregated to season and show level.

Density (JPM + Peak)

25%

Raw comedic density across the episode plus peak density across its best runs. We count every distinct joke that lands — setups with identifiable punchlines, physical gags, callbacks, and running gags — then divide by the episode's net comedy runtime. The peak-density component (15% of the final score) rewards episodes that hit elite stretches; the weighted-JPM component (10%) rewards sustained density over the full episode.

Craft Score

40%

Quality over quantity. Each joke is scored across five sub-dimensions (see below). Craft rewards structural sophistication: misdirection, subverted expectations, perfect character fit, and timing precision. A show can have low JPM and high Craft (slow-burn prestige comedy) or vice versa.

Impact Score

35%

Resonance and staying power. Quotability (does this line get repeated?), rewatch value (is it funnier the second time?), cultural footprint, and callback payoff. Impact captures what pure craft analysis misses — a technically average joke that becomes a catchphrase scores higher here. A small memorability bonus is added on top for episodes packed with iconic moments.
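The three pillars can be sketched as a plain weighted sum. This is a minimal illustration, not the production pipeline: it assumes every component has already been normalized to a 0-100 scale (the normalization is not described here), that the memorability bonus is simply added on top, and that the result is capped at 100. The names `jokes_per_minute`, `humor_index`, and `WEIGHTS` are hypothetical.

```python
from typing import Mapping

# Weights from the text: peak density 15% + weighted JPM 10% (Density, 25%),
# Craft 40%, Impact 35%.
WEIGHTS = {
    "peak_density": 0.15,
    "weighted_jpm": 0.10,
    "craft": 0.40,
    "impact": 0.35,
}


def jokes_per_minute(joke_count: int, net_runtime_minutes: float) -> float:
    """Distinct jokes divided by the episode's net comedy runtime."""
    if net_runtime_minutes <= 0:
        raise ValueError("net_runtime_minutes must be positive")
    return joke_count / net_runtime_minutes


def humor_index(components: Mapping[str, float], memorability_bonus: float = 0.0) -> float:
    """Weighted sum of the four components plus an optional bonus.

    ASSUMPTION: each component is pre-normalized to 0-100, and the final
    score is clamped to the 0-100 range.
    """
    score = sum(weight * components[name] for name, weight in WEIGHTS.items())
    return max(0.0, min(100.0, score + memorability_bonus))


# Hypothetical episode: 66 jokes over 22 minutes of net runtime.
print(jokes_per_minute(66, 22.0))  # 3.0 jokes per minute
example = {"peak_density": 80.0, "weighted_jpm": 70.0, "craft": 75.0, "impact": 60.0}
print(round(humor_index(example), 1))  # 12 + 7 + 30 + 21 = 70.0
```

Because Craft and Impact together carry 75% of the weight, a 10-point change in either moves the composite more than the same change in either density component.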

Craft Breakdown

The Craft Rubric

Craft is scored 1–10 across five sub-dimensions, equally weighted. Each is assessed per joke, then averaged across the episode.

Setup Quality

How efficiently and elegantly the premise is established. A great setup is invisible — it plants the seed without telegraphing the punchline.

Example: "George is a marine biologist now." (Seinfeld S5E14)

Misdirection

The degree to which the joke leads the audience toward a false expectation before subverting it. True misdirection is earned, not cheap.

Example: "I am not superstitious... but I am a little stitious." (The Office)

Subversion / Surprise

Does the punchline go somewhere unexpected? Subversion is the delta between what the audience anticipated and what actually happened.

Example: The Soup Nazi's final line in the episode finale.

Character Fit

Could only this character have delivered this joke? High character fit means the joke reveals or reinforces something true about who the character is.

Example: Every Dwight exit from a room.

Timing Precision

Beat length, pause placement, delivery speed. Timing is harder to score but we proxy it through editing rhythm, reaction shot positioning, and line overlap.

Example: Jim's camera looks. George's pauses before responding.
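The equal-weight averaging described at the top of this rubric can be sketched as follows. The dictionary keys are hypothetical field names chosen to mirror the five sub-dimensions, and the scores are made up for illustration.

```python
from statistics import mean

# Hypothetical field names for the five equally weighted sub-dimensions.
SUB_DIMENSIONS = (
    "setup_quality",
    "misdirection",
    "subversion",
    "character_fit",
    "timing_precision",
)


def joke_craft(scores):
    """Craft for one joke: unweighted mean of its five 1-10 sub-scores."""
    return mean(scores[name] for name in SUB_DIMENSIONS)


def episode_craft(jokes):
    """Episode craft: mean of the per-joke craft scores."""
    return mean(joke_craft(joke) for joke in jokes)


# Two made-up jokes, for illustration only.
jokes = [
    dict(zip(SUB_DIMENSIONS, (8.0, 6.0, 7.0, 9.0, 5.0))),  # mean 7.0
    dict(zip(SUB_DIMENSIONS, (6.0, 6.0, 6.0, 6.0, 6.0))),  # mean 6.0
]
print(episode_craft(jokes))  # 6.5
```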

Format Adjustment

The Laugh Track Correction

Multi-camera shows with sweetened laugh tracks structurally inflate joke density — the laugh track signals joke boundaries that a single-camera show leaves implicit. To compare across formats fairly, we apply a format coefficient to the raw Humor Index.

Format	Coefficient	Effect
Single Camera	1.00×	Baseline
Hybrid	1.00×	Baseline
Multi-Camera (Live)	1.00×	Baseline
Multi-Camera (Sweetened)	1.00×	Baseline
Animation	1.00×	Baseline
Sketch	1.00×	Baseline

Deprecated as of 2026-04-16. Coefficients have all been set to 1.0 (no format adjustment). The old coefficients were confounded with show identity and calibrated against an opaque small sample. We now publish raw scores and let you filter by format on the shows and rankings pages. See the "Why we don't adjust for format" section below.

Policy

Why We Don't Adjust for Format

An earlier version of the Humor Index silently penalized multi-cam shows (live audience, laugh track) by 15–25%, on the premise that audience reaction inflated perceived impact. That adjustment had three problems:

Confounding. With only three scored shows, the "format effect" can't be statistically separated from show-level idiosyncrasies. You can't identify a format coefficient with that few levels of the treatment variable.
Opaque calibration. The coefficients were point estimates with no published confidence interval and a sample we can't re-examine.
Silent correction. Friends' 72.8 and Seinfeld's 77.9 partly reflected a 15% multi-cam tax, without that being visible to readers.

Our fix: report raw scores, tag every show with its format, and offer format-filtered leaderboards so you can compare like against like. Multi-cam shows and single-cam shows aren't directly comparable on a single scale — they're different comedy traditions with different conventions.

Score changes from the removal: Seinfeld 77.9 → 83.9, Friends 72.8 → 78.7, The Office 81.0 → 80.2. The original scores are preserved on each page as humor_index_v1 for transparency.

Subsequent update (April 18, 2026): we then discovered that Jerry's stand-up bits at The Improv were being scored as sitcom comedy. Applying a 0.30 standup weighting and rescoring all 172 Seinfeld episodes with 3-run consensus moved Seinfeld from 83.9 → 77.8.

Reconciliation (May 25, 2026): as new shows were added the live leaderboard drifted from the canonical aggregation, so we re-aggregated every scored show with one consistent method and added bootstrap 95% confidence intervals.

Current published order (21 shows):

Fleabag: 95.8
Chappelle's Show: 95.0
Veep: 94.8
Seinfeld: 94.5
Flight of the Conchords: 87.6
Broad City: 86.9
30 Rock: 84.4
Arrested Development: 82.0
Curb Your Enthusiasm: 80.8
Freaks and Geeks: 80.0
The Office: 79.4
It's Always Sunny in Philadelphia: 79.2
The Simpsons: 79.0
Community: 77.9
Parks and Recreation: 77.8
Futurama: 77.6
Taxi: 77.4
Schitt's Creek: 77.2
The Larry Sanders Show: 76.5
The Fresh Prince of Bel-Air: 76.0
Friends: 73.5

30 Rock and Arrested Development pull clear at the top; the broad middle clusters inside each other's intervals; Friends and the single-season Freaks and Geeks sit below the pack.

Reliability

Our Scorer's Noise Floor

We ran a test-retest study: 30 episodes were scored twice, both in blind mode, with different random seeds. Results:

Intraclass correlation (ICC)

Humor Index (0–100): 0.28 (poor, < 0.40)
avg_craft (0–10): 0.28
avg_impact (0–10): 0.24
total_jokes detected: 0.67 (moderate)
JPM: 0.53 (moderate)

Mean absolute difference between two blind runs of the same episode: 10.7 Humor Index points. 72% of single-run Humor Index variance is run-to-run scorer noise; only 28% is real episode signal.

Why: joke detection is stable (67% signal), but per-joke craft/impact ratings jitter by ~5% between runs. The Humor Index formula amplifies that noise through threshold-based metrics (peak_density, memorability_bonus) that flip on small score changes.

What this means for rankings:

Show-level rankings hold up. Averaging over 170–236 episodes drives the standard error on show Humor Index to roughly ±0.4 points. The 3–6 point gaps between Seinfeld, The Office, and Friends are well above that floor.
Individual episode rankings have ±5 point noise. Two episodes within ~10 Humor Index points of each other are essentially tied under single-run scoring.
Extreme episodes still stand out. Dinner Party (100) vs an average 75 is comfortably above the noise. It's the close-finish ordering that's uncertain.

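What we're doing: consensus scoring. Our pipeline supports multi-run scoring via --num-runs. All new shows starting with Parks and Recreation will be scored 3× per episode and averaged. By the Spearman-Brown projection from the single-run ICC of 0.28, this should raise ICC to about 0.54, into the moderate range (≥0.40). The same projection puts five runs near 0.66, so reaching the "good" threshold (≥0.75) would take about eight runs.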
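One standard way to project how averaging repeated runs affects reliability is the Spearman-Brown formula. The sketch below assumes that formula applies to this scorer, meaning runs are exchangeable and their noise is independent. The function name `spearman_brown` is hypothetical and not part of the pipeline.

```python
def spearman_brown(single_run_icc: float, num_runs: int) -> float:
    """Projected ICC of the mean of num_runs independent scoring runs.

    ASSUMPTION: runs are exchangeable and their noise is independent, so
    averaging shrinks noise variance by 1/num_runs and leaves signal intact.
    """
    if not 0.0 <= single_run_icc <= 1.0:
        raise ValueError("single_run_icc must be between 0 and 1")
    if num_runs < 1:
        raise ValueError("num_runs must be at least 1")
    return num_runs * single_run_icc / (1.0 + (num_runs - 1) * single_run_icc)


# Measured single-run Humor Index ICC from the study above.
print(round(spearman_brown(0.28, 3), 2))  # 0.54 for 3-run consensus
```

If run-to-run noise is correlated, for example because every run shares the same systematic misreading of a transcript, the real gain from averaging would be smaller than this projection.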

Full study: see the blog post.

Bias

Show-Identity Bias: Small and Not Significant

We scored 99 episodes in blind mode (no show name, no character list, no description fed to the LLM) and compared them to their non-blind production scores. Paired differences (non-blind minus blind):

Pooled (n=99): −1.47 HI points (95% CI: [−3.72, +0.79])
Seinfeld (n=33): −2.45 (CI [−5.71, +0.82])
The Office (n=33): −1.23 (CI [−5.11, +2.65])
Friends (n=33): −0.72 (CI [−5.29, +3.84])

All CIs include zero, so no show exhibits a statistically significant bias from knowing the show name. The direction (blind scores slightly higher, not lower) is the opposite of what a naive "AI favors famous shows" hypothesis would predict.
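A paired comparison like the one above can be sketched as a mean difference with a normal-approximation interval. Three things here are assumptions: the sign convention (non-blind minus blind), the use of z = 1.96 rather than a t quantile, and the function name `paired_difference_ci`. The scores in the example are made up and are not study data.

```python
import math
from statistics import mean, stdev


def paired_difference_ci(non_blind, blind, z=1.96):
    """Mean paired difference and an approximate 95% confidence interval.

    ASSUMPTION: difference = non-blind score minus blind score, with a
    normal approximation; a t quantile would widen the interval slightly
    for small samples.
    """
    if len(non_blind) != len(blind):
        raise ValueError("each episode needs both a blind and a non-blind score")
    diffs = [a - b for a, b in zip(non_blind, blind)]
    center = mean(diffs)
    half_width = z * stdev(diffs) / math.sqrt(len(diffs))
    return center, center - half_width, center + half_width


# Made-up scores for four episodes, for illustration only.
center, low, high = paired_difference_ci([78, 82, 75, 90], [80, 81, 79, 92])
print(center)            # -1.75, i.e. blind scores are higher on average
print(low < 0 < high)    # True: the interval includes zero
```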

Bayesian Model

What a Hierarchical Model Actually Finds

We fit a hierarchical Bayesian model to 15,000 jokes (5,000 per scored show) predicting joke-level impact from show, format, joke type, episode, and character. Here's what came out.

Format effect

Single-cam vs multi-cam: −0.052 (95% CrI: [−0.590, +0.442])

The 95% credible interval straddles zero. After controlling for show, joke type, episode, and character, we cannot statistically distinguish single-cam from multi-cam on impact. That is consistent with the decision to set the format coefficient to 1.0, though an interval this wide does not rule out a modest effect in either direction.

Show effects (impact deflection from grand mean)

Seinfeld: +0.154 (95% CrI: [−0.224, +0.530])
The Office: −0.007 (95% CrI: [−0.505, +0.456])
Friends: −0.131 (95% CrI: [−0.498, +0.235])

All three intervals overlap. The posterior ordering (Seinfeld > The Office > Friends) matches our Humor Index rankings, but the differences are within the statistical noise of this model. The probability that Seinfeld beats Friends on show effect is approximately 82%: better than a coin flip, but not 99%+ certain.
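A probability such as "Seinfeld beats Friends" is usually read off posterior draws by counting how often one show's effect exceeds the other's within the same draw. The sketch below shows only that counting step. The draws are made up, so the result has no relation to the 82% reported above, and `prob_greater` is a hypothetical helper name.

```python
def prob_greater(draws_a, draws_b):
    """Share of paired posterior draws in which effect A exceeds effect B.

    Draws must be paired (same MCMC iteration) so that any posterior
    correlation between the two effects is respected.
    """
    if len(draws_a) != len(draws_b) or not draws_a:
        raise ValueError("need two non-empty, equally long sequences of draws")
    wins = sum(a > b for a, b in zip(draws_a, draws_b))
    return wins / len(draws_a)


# Five made-up paired draws, for illustration only.
show_a = [0.3, 0.1, -0.1, 0.2, 0.4]
show_b = [-0.2, 0.2, -0.3, 0.0, 0.1]
print(prob_greater(show_a, show_b))  # 0.8 (A wins in 4 of 5 draws)
```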

Variance decomposition

63.9% — within-joke residual (unexplained)
11.8% — between episodes within a show
8.9% — between joke types
7.9% — between shows
7.5% — between characters

Shows explain only 7.9% of total joke-level variance. Rankings between shows capture a small fraction of what makes a joke score well. Two-thirds of the variance is unexplained within-joke residual — some real (same joke type executed better or worse), some LLM noise.

Full Bayesian model outputs are published at /data/format_posteriors.json, /data/show_credible_intervals.json, and /data/variance_decomposition.json.

Uncertainty

Confidence Intervals & Percentiles

Every score on the site is a point estimate with real uncertainty. We now publish:

95% bootstrap confidence intervals on episode, season, and show Humor Indexes. Episode-level CIs resample the episode's own jokes with replacement 200× and take the 2.5th and 97.5th percentiles of the resulting score distribution. Season/show-level CIs resample episodes.
Show-relative percentile on every episode. An episode at p90 in Friends means it's funnier than 90% of scored Friends episodes, independent of the absolute score.
Z-scores within show and within season. Useful for cross-show comparisons that control for the show's overall comedy baseline.

These are model-uncertainty estimates — they capture how much the score would jitter if we resampled jokes or episodes. They do not capture structural bias (LLM compression, format effects, etc.). See the Known Limitations section.
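The three published quantities can be sketched with the standard library. This is an illustration under stated assumptions, not the production code: the real pipeline recomputes the full Humor Index on each resample, whereas this sketch uses a plain mean as a stand-in score function, and all function names are hypothetical.

```python
import random
from statistics import mean, quantiles, stdev


def bootstrap_ci(values, score_fn=mean, n_resamples=200, seed=0):
    """95% percentile-bootstrap interval for score_fn over values.

    ASSUMPTION: score_fn stands in for the real episode-scoring function.
    """
    if len(values) < 2:
        raise ValueError("need at least two observations to resample")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    resampled = [
        score_fn(rng.choices(values, k=len(values)))  # sample with replacement
        for _ in range(n_resamples)
    ]
    cuts = quantiles(resampled, n=40)  # 39 cut points, one every 2.5%
    return cuts[0], cuts[-1]  # 2.5th and 97.5th percentiles


def percentile_within_show(score, show_scores):
    """Percent of the show's scored episodes that fall strictly below score."""
    return 100.0 * sum(s < score for s in show_scores) / len(show_scores)


def z_score(score, show_scores):
    """Distance from the show mean in sample standard deviations."""
    return (score - mean(show_scores)) / stdev(show_scores)


# Made-up per-joke scores and episode scores, for illustration only.
low, high = bootstrap_ci([6.0, 7.5, 5.5, 8.0, 6.5, 7.0, 9.0, 6.0])
print(low <= high)                                        # True
print(percentile_within_show(85, [70, 75, 80, 85, 90]))   # 60.0
```

For season and show intervals the same function applies with episode scores passed in place of per-joke scores, matching the "resample episodes" rule above.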

Career Value

Comedy WAR (Wins Above Replacement)

Career WAR ranks characters by total comedic contribution relative to a "replacement-level" bench player. Higher WAR means more jokes at higher average quality than you'd get from a typical recurring character.

Formula (v2)

WAR = total_jokes × max(shrunk_quality − replacement_quality, 0)

shrunk_quality = (n · observed_quality + 30 · league_median) / (n + 30)

observed_quality = (avg_craft + avg_impact) / 2

Replacement quality = 25th percentile of the (craft+impact)/2 quality metric among bench-player characters (10–50 analyzed jokes). As of 2026-04-16 that level sits at 6.555.
Bayesian shrinkage (k=30) pulls small-sample estimates toward the league median (6.775), preventing a 10-joke guest star with a lucky mean from beating a 1,000-joke lead.
WAR/Episode = WAR ÷ episodes appeared. Use this for cross-era and cross-run-length comparisons.
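The v2 formula and the shrinkage example can be worked through in a few lines. The constants come from the text above (league median 6.775, replacement level 6.555, k = 30). The function names and the two characters are hypothetical.

```python
LEAGUE_MEDIAN = 6.775        # league median quality, from the text
REPLACEMENT_QUALITY = 6.555  # 25th-percentile bench level as of 2026-04-16
SHRINKAGE_K = 30             # Bayesian shrinkage strength


def shrunk_quality(n_jokes, avg_craft, avg_impact):
    """Observed quality pulled toward the league median by k pseudo-jokes."""
    observed = (avg_craft + avg_impact) / 2
    return (n_jokes * observed + SHRINKAGE_K * LEAGUE_MEDIAN) / (n_jokes + SHRINKAGE_K)


def war(n_jokes, avg_craft, avg_impact):
    """WAR = total_jokes x max(shrunk_quality - replacement_quality, 0)."""
    surplus = shrunk_quality(n_jokes, avg_craft, avg_impact) - REPLACEMENT_QUALITY
    return n_jokes * max(surplus, 0.0)


def war_per_episode(total_war, episodes_appeared):
    """WAR divided by episodes appeared, for cross-era comparisons."""
    if episodes_appeared <= 0:
        raise ValueError("episodes_appeared must be positive")
    return total_war / episodes_appeared


# Hypothetical 10-joke guest star with a lucky 8.0 observed quality:
# shrunk to about 7.08, so WAR is about 5.26.
print(round(war(10, 8.2, 7.8), 2))
# Hypothetical 1,000-joke lead at 7.0 observed quality:
# barely shrunk (about 6.99), so WAR is about 438.45.
print(round(war(1000, 7.0, 7.0), 2))
```

The guest star's per-joke surplus is still higher after shrinkage (about 0.53 versus 0.44), but volume dominates the product, which is the behavior the formula is meant to produce.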

History: v1 subtracted a fixed scale midpoint of 5 (the "−5" baseline) as the replacement level. Because typical per-joke quality sits about 1.5 points above that midpoint, WAR collapsed to roughly 1.5 × total_jokes, effectively a screen-time metric. v2 swaps in an empirical replacement level and adds Bayesian shrinkage, so rankings now reflect quality × volume.

Honesty

Known Limitations

LLM score compression

Across 594 scored episodes, average craft scores have a standard deviation of just 0.36 (nominal 0–10 scale). Most of the headline signal in the Humor Index comes from joke count and peak density, not fine-grained craft differences between episodes.

Small samples and Bayesian shrinkage

Character WAR with fewer than 50 analyzed jokes is shrunk aggressively toward the league median. Rankings stabilize once a character crosses ~100 jokes.

Visual gags are underweighted

AI works from transcripts and scene descriptions. Physical comedy — a pratfall, a facial expression, Kramer's entrances — is harder to capture and likely underscored.

Sarcasm and irony are hard

Tone doesn't exist in a transcript. Ironic deadpan (e.g. Jim's camera looks) is identifiable through context, but subtle sarcasm likely gets miscategorized at the edges.

Cultural context decays

Jokes referencing specific 90s or 00s cultural moments may score lower on "cultural footprint" than they should, since the model's resonance signals are present-weighted.

JPM uses estimated runtime

Jokes Per Minute currently divides by an LLM-estimated episode runtime rather than an authoritative TMDB runtime. This makes JPM slightly self-correlated with joke count and format. Switching to TMDB runtime is planned.

Scoring is not blind to show identity

The LLM sees the show name and character list when scoring. This can introduce show-level priors. We ran a 99-episode blind-mode study (above): pooled difference was −1.47 HI points, 95% CI [−3.72, +0.79] — no statistically significant bias, and the direction was opposite to what a naive "AI favors famous shows" hypothesis would predict. A full-corpus rescoring is still future work.

No audience data

We don't use ratings, streaming numbers, or social media sentiment in the score itself. Across 591 episodes, the Humor Index correlates with IMDb audience ratings at r = −0.005 — they measure different things.

Only scripted comedy

Stand-up specials, improv, and sketch comedy require a different methodology. This pipeline is calibrated for scripted, episodic television only.

Glossary

Key Terms, Defined

Jokes Per Minute (JPM)

The number of distinct, gradeable jokes per minute of runtime. We count separable jokes a viewer could point to — not every comedic line — so JPM is comparable across every show.

Humor Index

A 0–100 composite of peak joke density, craft, impact, and weighted JPM. Every joke is scored, then rolled up to episode, season, and show level.

Craft

The writing quality of a joke, scored 1–10 across originality, structure, character integration, economy, and earned-vs-cheap.

Impact

Audience resonance of a joke (1–10): quotability, rewatch value, cultural footprint, and callback payoff.

WAR (Wins Above Replacement)

A character’s joke count × how far their shrunk average quality exceeds a replacement-level baseline. Rewards both volume and per-joke quality.

Replacement level

The quality of a 25th-percentile "bench" character. WAR measures value above this floor, so merely showing up doesn’t accrue value.