Methodology deep-dive · By Marcus Halberg · Methodology desk

Calibration, plain English: how DissMarket grades a forecast

A forecaster who calls every event at 70% should see those events happen 70% of the time across the sample. That sentence is the whole job. Everything below is what it takes to actually grade a forecaster against it — the Brier score, the calibration curve, the sample-size floor, and the specific reasons prediction markets are harder to grade than pollsters even when the math is identical.

What calibration is (and why it isn’t accuracy)

Calibration is a property of a sequence of forecasts. Specifically: across all the times a forecaster says “70% chance of X,” X should happen close to 70% of the time. Across all their 80% calls, 80% of the time. Across all their 30% calls, 30% of the time. A forecaster whose stated probabilities track the actual long-run rates is well-calibrated — whatever they think they know, the world is paying out at roughly the rate they said it would.

This is a sharply different thing from accuracy, and the two are often confused. Accuracy is a single-call concept: was the most likely outcome the one that happened? If a forecaster says “70% chance of rain” and it rains, that’s a “correct” call in the accuracy sense. But a single right-or-wrong outcome tells you almost nothing about whether the forecaster is well-calibrated, because the same call from a forecaster who says “99% chance of rain” would also count as correct — and those two forecasters are not equivalent. On this single piece of evidence, neither one can be distinguished from a coin flip dressed up in a probability.

This matters enormously for prediction-market commentary. Most cable-news framing of “Polymarket got it right” or “Polymarket got it wrong” treats a single resolved market as evidence about Polymarket as a forecaster. It is not (more on the n problem below). At best, it is one data point in a calibration assessment that requires dozens to hundreds of similar points to have anything to say. Treating any individual market resolution as a verdict on prediction markets in general is a category error — the same one as treating one heads on a coin flip as a verdict on whether the coin is fair.

The way to internalise this: calibration is a property of the sequence, not the call. Any individual call can look brilliant or terrible. The forecaster underneath might be either, and you can’t tell from one outcome which one they are.

The Brier score

The standard way to put a number on calibration over a sequence of forecasts is the Brier score. Mechanically, it’s the mean squared error between the forecast probability and the outcome — where the outcome is 1 if the event happened and 0 if it didn’t. Lower is better. A score of 0 means perfect (full confidence on the right side of every question). A score of 1 means worst possible (full confidence on the wrong side of every question). The no-information baseline — calling everything 50% — comes in at 0.25.

The math is straightforward enough to walk through in plain text. A forecaster says 70% chance the event happens. If the event happens, the squared error is (1 − 0.7)² = 0.09. If the event doesn’t happen, the squared error is (0 − 0.7)² = 0.49. Average those squared errors across a whole sequence of forecasts, and you have the Brier score for that forecaster on that sequence. The further the forecast probability sits from the actual outcome, the more it costs.

Here is one cycle of five forecasts, scored:

Forecast   Outcome         Squared error
0.70       Happened (1)    (1 − 0.70)² = 0.09
0.30       Didn’t (0)      (0 − 0.30)² = 0.09
0.90       Happened (1)    (1 − 0.90)² = 0.01
0.55       Didn’t (0)      (0 − 0.55)² = 0.3025
0.20       Happened (1)    (1 − 0.20)² = 0.64

Total squared error: 1.1325. Brier score: 1.1325 / 5 = 0.2265, call it 0.227. That’s slightly better than the no-information baseline (0.25), though on a five-forecast sample the difference means essentially nothing. The forecaster got two of five “most likely” calls wrong, and those two misses (the 0.20 that hit, the 0.55 that didn’t) drove most of the score. The 0.90 that happened was the cheapest forecast in the sample.
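
If you want to check that arithmetic yourself, here is a minimal Python sketch (my own illustration, not code from any scoring library) that computes the Brier score for the five forecasts above, with outcomes coded 1 for happened and 0 for didn’t.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be the same length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# The five forecasts from the table above, outcomes coded 1 = happened, 0 = didn't.
forecasts = [0.70, 0.30, 0.90, 0.55, 0.20]
outcomes = [1, 0, 1, 0, 1]

print(brier_score(forecasts, outcomes))   # 0.2265 (up to float rounding)
print(brier_score([0.5] * 5, outcomes))   # 0.25, the always-say-50% baseline
```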

The Brier score is the standard for a specific reason: it is strictly proper. A scoring rule is strictly proper if a forecaster minimises their expected score by reporting their true belief. You cannot game the Brier score by hedging away from your real probability — if you actually believe an event is 70% likely and you report 60% to look more cautious, your expected Brier score gets worse, not better. That property is what makes it usable for grading forecasters who know they’re being graded. It is also why the academic forecasting literature keeps coming back to it after sixty years.
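
A quick way to convince yourself of the strictly-proper property is to compute the expected score directly. The sketch below (my own, with hypothetical names) fixes a true belief of 70% and shows that the expected Brier score bottoms out when the reported probability is exactly 0.70; hedging toward 0.60 or 0.50 only makes the expectation worse.

```python
def expected_brier(reported, true_prob):
    """Expected Brier score when the event truly occurs with probability true_prob
    and the forecaster reports `reported`."""
    return true_prob * (1 - reported) ** 2 + (1 - true_prob) * reported ** 2

true_belief = 0.70
for reported in (0.50, 0.60, 0.70, 0.80, 0.90):
    print(reported, round(expected_brier(reported, true_belief), 3))
# 0.5 -> 0.25, 0.6 -> 0.22, 0.7 -> 0.21 (the minimum), 0.8 -> 0.22, 0.9 -> 0.25
```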

Two caveats worth flagging. First, the Brier score conflates calibration and resolution — the score is improved both by saying the right probabilities and by saying confident probabilities when you have real information. You can decompose it into those pieces (the Murphy decomposition), but the raw score doesn’t separate them. Second, the always-say-50% baseline of 0.25 is not the only no-skill benchmark: a forecaster who simply reports the base rate on every question scores p(1 − p), which is about 0.09 when the true rate is 10% versus 0.25 on 50/50 questions. A low raw score on a book of long-shot questions is not automatically evidence of skill, and comparisons across question types are not apples to apples.
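
For readers who want the decomposition spelled out, here is a rough sketch of the Murphy decomposition in Python. The uncertainty term is fixed by the outcomes themselves, so the two forecaster-facing pieces are reliability (the calibration term) and resolution (the information term). The binning rule and names are my choices, and the identity is exact only when each bin holds a single forecast value, so treat this as illustrative rather than a reference implementation.

```python
from collections import defaultdict

def murphy_decomposition(forecasts, outcomes, n_bins=10):
    """Split a Brier score into reliability - resolution + uncertainty,
    binning forecasts by reported probability."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        k = min(int(f * n_bins), n_bins - 1)   # a forecast of 1.0 goes in the top bin
        bins[k].append((f, o))
    reliability = resolution = 0.0
    for members in bins.values():
        n_k = len(members)
        mean_f = sum(f for f, _ in members) / n_k
        obs_rate = sum(o for _, o in members) / n_k
        reliability += n_k * (mean_f - obs_rate) ** 2    # calibration term: lower is better
        resolution += n_k * (obs_rate - base_rate) ** 2  # information term: higher is better
    uncertainty = base_rate * (1 - base_rate)            # fixed by the outcomes themselves
    return reliability / n, resolution / n, uncertainty
```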

Calibration plots, plain English

The Brier score gives you one number for a whole sequence of forecasts. A calibration plot gives you the shape of the error. The construction is mechanical: bin the forecaster’s forecasts by the probability they reported (0–10%, 10–20%, …, 90–100%), and for each bin, plot the average forecast probability against the actual share of outcomes that resolved positive. A perfectly calibrated forecaster sits on the y=x diagonal — the 70% bin happens 70% of the time, the 30% bin happens 30% of the time, all the way across.
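
The binning step is mechanical enough to sketch. Assuming ten equal-width bins (published plots differ on this choice), the points of a calibration plot come out of something like the following; the function name and layout are mine, not any particular site’s code.

```python
def calibration_curve(forecasts, outcomes, n_bins=10):
    """For each non-empty probability bin, return (mean reported probability,
    observed frequency of positive resolutions, number of forecasts in the bin)."""
    binned = [[] for _ in range(n_bins)]
    for f, o in zip(forecasts, outcomes):
        k = min(int(f * n_bins), n_bins - 1)   # a forecast of 1.0 goes in the top bin
        binned[k].append((f, o))
    points = []
    for members in binned:
        if not members:
            continue
        mean_f = sum(f for f, _ in members) / len(members)
        obs_rate = sum(o for _, o in members) / len(members)
        points.append((mean_f, obs_rate, len(members)))
    return points   # a well-calibrated forecaster's points hug the y = x diagonal
```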

Deviations from y=x have specific interpretations. A forecaster whose plot sits below the diagonal at high probabilities (says 80%, only happens 60% of the time) is over-confident on the upside — they assert events more strongly than the world supports. A forecaster whose plot sits above the diagonal at low probabilities (says 10%, happens 25% of the time) is over-confident on the downside — they dismiss events more strongly than the world supports. The shape of a real calibration curve usually combines both, with an S-curve or a reverse-S that diagnoses where the forecaster’s priors are pulling them away from reality.

Public calibration plots exist for Polymarket, Manifold, FiveThirtyEight, and several academic forecasting tournaments across the 2024 US election cycle, and they’re worth looking at if you want to see what well-calibrated and not-well-calibrated forecasters actually look like in practice. I’m not going to quote specific percentages here, because the calibration numbers floating around in commentary are often computed on a sample of resolved questions chosen post-hoc (a selection effect we’ll come back to), and I don’t want to launder a number I haven’t verified out-of-sample. The methodology page links a few of the more rigorously assembled public datasets when we’ve checked them.

Why calibration is hard with prediction markets

Calibration requires resolved forecasts. Lots of them. The 2024 US election cycle produced a usable n for Polymarket calibration assessment precisely because hundreds of markets — state-level results, Senate seats, House seats, governor races, propositions — resolved within a few months of each other on roughly comparable resolution criteria. That was, by historical standards, an exceptional year for the question. The 2025–2027 stretch will give us tens of resolutions in the same category, not hundreds. Calibration assessments built on those tens will have wide confidence intervals.

The deeper problem is structural. Most political prediction markets resolve only once. A market on the 2028 Democratic nominee will resolve in 2028 (the resolution criterion is the actual nomination) and produce exactly one data point. To assess how well-calibrated the Polymarket book on that question was, you would need to bin it against many other comparable one-shot markets that resolved in the same window. That cross-market binning is itself a methodological choice, not a free observation — you have to decide which markets are comparable, over what time horizons, with what resolution-criterion families. Different choices yield different calibration verdicts on the same underlying book.

This is the deeper reason DissMarket triangulates against pollsters rather than evaluating Polymarket in isolation. The named-pollster ecosystem (Pew, Marist, Gallup, Echelon, YouGov, and the others) has decades of out-of-sample calibration history. We know roughly what their Brier scores look like against the resolution criterion of “the actual vote count two months from now,” because two-month-out election polls have been graded for the better part of a century. Prediction-market calibration data is a fraction of that depth. Comparing the two signals on the same question on the same day is more informative, in this period, than trying to grade either one in isolation against an n that hasn’t accumulated yet. The longer version of that argument lives in the four-ways piece if you want it.

How DissMarket plans to score voters’ calibration (Phase 3)

Per the main methodology page, DissMarket is rolling out in three phases. Phase 1 is the current X polls — directional public-opinion reads, not statistically representative, labelled as such. Phase 2 is the verified panel: a registered respondent base with demographic capture and survey-weighted aggregation. Phase 3 is the calibration-weighted aggregation, and the calibration scoring described above is what makes Phase 3 possible.

The mechanism is straightforward. Each registered user on the verified panel will make probability-style predictions on the questions DissMarket mirrors from Polymarket and other markets. When those markets resolve, the user’s forecasts are scored with the Brier score against the realised outcome. Over many resolved markets, each user accumulates a personal Brier-score history — the resolution criterion is published per question, and the calibration record is publicly viewable per user. Users with stronger calibration history get more weight in the calibration-weighted aggregate signal.

The aggregate Phase 3 signal will be published in three flavours: raw (one vote per registered user), demographically weighted (the Phase 2 signal, weighted to a representative US adult universe), and calibration-weighted (each user’s vote weighted by their accumulated Brier-score performance on resolved markets). Each is useful for a different question. The calibration-weighted version is the one designed for cases where prediction quality is the goal — quant alpha signals, or any use case that prizes “who has been right out-of-sample” over “who is representative of the country.”
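
To make the calibration-weighted flavour concrete, here is one way such a weighting could look mechanically. To be clear, this is my illustration under an assumed weighting rule (distance below the 0.25 baseline, floored at zero), not DissMarket’s published formula, which this piece does not specify.

```python
def calibration_weight(user_brier, baseline=0.25):
    """Illustrative weight: how far a user's accumulated Brier score sits below
    the always-say-50% baseline. Users at or above the baseline get zero weight.
    This is an assumed formula, not DissMarket's published one."""
    return max(0.0, baseline - user_brier)

def calibration_weighted_aggregate(votes):
    """votes: list of (reported_probability, user_brier_history) pairs."""
    weights = [calibration_weight(b) for _, b in votes]
    total = sum(weights)
    if total == 0:
        # No one beats the baseline yet: fall back to a simple unweighted mean.
        return sum(p for p, _ in votes) / len(votes)
    return sum(p * w for (p, _), w in zip(votes, weights)) / total

# Three users; the best-calibrated record (Brier 0.12) pulls the aggregate toward 0.60.
print(calibration_weighted_aggregate([(0.60, 0.12), (0.40, 0.22), (0.55, 0.30)]))  # ~0.56
```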

This is the long unlock. Phase 1 is a public-opinion read. Phase 2 is a survey-quality public-opinion read. Phase 3 is a calibration-graded crowd forecast — closer in kind to what the academic forecasting tournaments do, and built on a much larger respondent pool. None of it works without enough resolved markets per user to make the calibration score meaningful, which is the binding constraint and the one we’ll be honest about as the data accumulates.

What I’d want to see before calling anything well-calibrated

Two numbers, and the reasoning behind both. The first is n=30 resolved questions, minimum, before I’d offer a directional read on whether a forecaster is well-calibrated. Below that, the confidence interval on any Brier-score comparison is wide enough that the apparent ranking can flip from sample to sample — you’d be telling a story the data doesn’t actually support. The second is n=100+ resolved questions before I’d call the assessment publishable as a calibration claim per se. At that sample size, the Brier-score estimate is tight enough to distinguish well-calibrated forecasters from middling ones with reasonable confidence, and the calibration curve has enough shape that the over-confidence and under-confidence regions are visible rather than just noise.
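
Those floors come from sampling variability, which is easy to see by simulation. The sketch below (my own, with arbitrary parameters) simulates a perfectly calibrated forecaster whose reported probabilities are drawn uniformly, then looks at how much the Brier score wobbles across repeated samples of 30 versus 100 resolved questions; the point is the width of the spread, not the specific numbers.

```python
import random

def simulated_brier(n, rng):
    """Brier score for a perfectly calibrated forecaster over n resolved questions,
    with reported probabilities drawn uniformly from [0.05, 0.95]."""
    total = 0.0
    for _ in range(n):
        p = rng.uniform(0.05, 0.95)             # the reported (and true) probability
        outcome = 1 if rng.random() < p else 0
        total += (p - outcome) ** 2
    return total / n

rng = random.Random(0)
for n in (30, 100):
    scores = sorted(simulated_brier(n, rng) for _ in range(2000))
    lo, hi = scores[50], scores[1949]           # rough central 95% of the simulated spread
    print(f"n={n}: Brier score ranges roughly from {lo:.3f} to {hi:.3f}")
```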

This is why I’m cautious about anyone — including us — making strong calibration claims this early. The launch X poll is, in calibration terms, an n=1 resolution event, and the resolution criterion (the 2028 Democratic nominee) won’t fire for roughly two years. We can say things about how the public opinion read related to the Polymarket price on the day of the poll. We cannot say anything about whether either signal was well-calibrated on that question for at least two more election cycles’ worth of data — and that’s before getting to the question of how to bin one-shot political markets at all, which is the deeper problem above.

Two related caveats, both worth stating because they get glossed over in most calibration commentary. Selection effect: if you grade a forecaster on the subset of their forecasts that happen to have resolved, you are scoring an in-sample slice, not the population. Forecasters who concentrate their high-confidence calls on questions that resolve quickly will look better than ones who take long-horizon positions, even if their underlying calibration is identical. Question-mix: a forecaster who only ever calls 90/10 markets has a different scoring profile than one who specialises in 55/45 markets, and Brier-score comparisons across those mixes aren’t apples to apples. The base-rate read on the House 2026 market is the kind of question where these caveats matter — it’s a relatively long-horizon, mid-confidence market, and grading it against, say, a same-day prop bet would mislead in either direction.

If you’re reporting on this beat, the operating heuristic I’d offer is this. Before believing any “X is well-calibrated” claim, look for four things: an explicit n on the resolved-question sample, a disclosed binning rule, the resolution criterion attached to each market in the sample, and a calibration curve — not just a Brier number. If any of those four is missing, the claim is decorative, not falsifiable. The archive of pieces we’ve published so far, including the daily-recap thread in the launch recap, tries to make those four explicit per market, with epistemic humility about what hasn’t resolved yet.

What this means for next time

When you read a calibration claim about any pollster or prediction market this cycle, write down four things before you cite it: the n of resolved forecasts, the binning rule used to aggregate them, the resolution criterion attached to each, and whether a calibration curve is published alongside the Brier number. If three or four of those are present, the claim is doing real work. If two or more are missing, the claim is doing rhetorical work. Treat it accordingly.