Most forecasters ask to be trusted. Meridian leaves a record.
The top of this page is for executive review. Live scores show what resolved recently. Resolved calls show examples. The method and audit appendix explain how the record is produced.
Compared to consultancy reports
We score against unseen years, not the past.
An accuracy score on every claim, not a narrative.
Thirty-year horizon. EIU publishes five.
Audit trail per cell, not an executive summary.
Compared to in-house models
Cross-domain calibration. Nine subjects on one score.
No retroactive tuning. Versions tracked separately.
ForecastBench-aligned. Replicable by an external team.
139 verified corrections, each dated and sourced.
5.61-point average miss on a 100-point scale
We hid the target years from the model, then scored what it predicted against what actually happened. Across 297 forecast-vs-actual pairs the average miss was 5.61 points on a 0-to-100 scale. A separate 25-year structural backtest covering 17,875 pairs comes in tighter at ±4.7 on the pre-2020 subset.
99 forecast surfaces running 2020 to 2050
9 subjects times 11 regions makes 99 forecast surfaces. Each runs from 2020 to 2050 and updates as new data arrives. Tap any subject card below for sources and the regions covered.
297 forecast-vs-actual pairs already scored
297 forecasts already scored against what actually happened. 33 per subject. 27 per region. Both more than enough for reliable inference. Every new forecast joins the registry.
Thirty years forward, on every surface we run
Each of the 99 forecast surfaces runs from 2020 to 2050. That is the horizon a board needs to read its own decision against the regime it will live inside, not just the next quarter or fiscal year.
Lower is better. 0.25 is a coin flip.
Calibration you can audit, not just trust.
Brier score measures how close a probability was to what actually happened. A perfect forecast scores 0. A confident wrong call scores 1. The current 7-day window scores 0.031 on 36 resolved predictions in 1-day binary financial threshold templates (will the S&P 500 close above X, will VIX be above Y, etc.). That figure is not directly comparable to Tetlock or ForecastBench numbers, which benchmark multi-month forecasts on different question mixes. Skill validation runs through a documented gate sequence: Brier Skill Score (BSS) vs persistence baseline, then BSS vs market-implied probability, then live P&L. See below for the gate state.
382 predictions resolved in the last 30 days
Live production loop. Templates score themselves as outcomes resolve, and the rolling Brier you see updates automatically. The number is small enough to read at a glance, large enough for honest inference. Multi-class bucket templates (BTC ranges) are excluded from this headline pending per-event multinomial-Brier aggregation; they are tracked separately.
Most forecasters do not publish their accuracy. We do.
Six examples from the last 30 days. Five confident wins, one directional-right under-confident miss. The registry publishes both.
| Subject | Claim | Meridian | Resolved | Outcome | Source | Verdict | Brier |
|---|---|---|---|---|---|---|---|
| Economy | WTI crude will close above $80 a barrel | 97% Yes | 2026-05-19 | Yes, $101.56 | FRED DCOILWTICO | Right call, strongly held | 0.001 |
| Economy | Weekly initial jobless claims will exceed 250,000 | 9% No | 2026-05-19 | No, 211,000 | FRED ICSA | Right call, strongly held | 0.008 |
| Money | Fed Funds Effective Rate will be above 4.0% | 15% No | 2026-05-19 | No, 3.63% | FRED EFFR | Right call, confidently held | 0.023 |
| Money | 2Y-10Y Treasury spread will be positive (no inversion) | 89% Yes | 2026-05-19 | Yes, +0.54 | FRED T10Y2Y | Right call, strongly held | 0.013 |
| Economy | VIX will exceed 30 (panic territory) | 6% No | 2026-05-19 | No, 18.43 | FRED VIXCLS | Right call, strongly held | 0.003 |
| Money | Gold will close above $3,200 an ounce | 65% Yes | 2026-05-19 | Yes, $4,587 | Yahoo GC=F | Right call, under-confident, right side | 0.125 |
Sample of resolved calls from the production registry. The full registry, scoring details, and reasoning trail are shared with engaged clients under confidentiality.
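As a rough illustration of how the per-call Brier values arise, the sketch below squares the gap between the stated probability of Yes and the 0/1 outcome for the six sample calls. The function name `brier` and the shortened claim labels are hypothetical. The inputs are the rounded percentages shown on the page, so results can differ from the published figures in the third decimal.

```python
def brier(prob_yes: float, outcome_yes: bool) -> float:
    """Squared error between the forecast probability of Yes and the 0/1 outcome."""
    return (prob_yes - (1.0 if outcome_yes else 0.0)) ** 2


# (claim, stated probability of Yes, resolved Yes?) from the six sample calls.
calls = [
    ("WTI above $80", 0.97, True),
    ("Jobless claims above 250,000", 0.09, False),
    ("Fed Funds above 4.0%", 0.15, False),
    ("2Y-10Y spread positive", 0.89, True),
    ("VIX above 30", 0.06, False),
    ("Gold above $3,200", 0.65, True),
]

scores = {claim: brier(p, y) for claim, p, y in calls}
mean_brier = sum(scores.values()) / len(scores)

for claim, score in scores.items():
    print(f"{claim}: {score:.4f}")
print(f"mean of the six: {mean_brier:.4f}")
```

A 50% forecast scores 0.25 whichever way the question resolves, which is why 0.25 is the coin-flip reference point.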
Live production registry
Snapshot 2026-06-16
7-day rolling window, registry snapshot
- Calibration miss: 0.0309
- Resolution lift: 0.2431
- Climatology variance: 0.2431
- Economy: n = 11, Brier 0.010
- Society: n = 10, Brier 0.012
- Money: n = 6, Brier 0.065
- Geopolitics: n = 5, Brier 0.070
- AI & compute: n = 4, Brier 0.042
14-day rolling window, registry snapshot
- Calibration miss: 0.0122
- Resolution lift: 0.2298
- Climatology variance: 0.2494
- Money: n = 63, Brier 0.008
- Economy: n = 43, Brier 0.045
- Society: n = 18, Brier 0.022
- Geopolitics: n = 9, Brier 0.083
- AI & compute: n = 8, Brier 0.110
30-day rolling window, registry snapshot
- Calibration miss: 0.0118
- Resolution lift: 0.2229
- Climatology variance: 0.2500
- Money: n = 193, Brier 0.041
- Economy: n = 146, Brier 0.023
- Society: n = 22, Brier 0.029
- Geopolitics: n = 12, Brier 0.152
- AI & compute: n = 9, Brier 0.163
The longer window
17,875 pairs · 25 years
Long-window structural accuracy across the five subjects with continuous public data back to the year 2000. Aggregate ±4.7 MAE across 10,450 pre-2020 pairs.
| Subject | Shock 20–24 | Pre-2020 | Rank ρ |
|---|---|---|---|
| Climate | ±5.2 | ±3.0 | 0.97 |
| Society | ±7.0 | ±3.5 | 0.86 |
| Education | ±5.4 | ±5.2 | 0.92 |
| Economy | ±7.4 | ±5.5 | 0.70 |
| Geopolitics | ±8.7 | ±6.4 | 0.95 |
The scorecard is computed only over 2020 to 2024. That window is the compound-shock era: COVID, Russia-Ukraine, Iran-Israel, sovereign debt repricing, AI capability acceleration all overlapping at once. Expanding the test back 25 years gives a long-window view that includes the 2008 GFC, 9/11, Crimea 2014, Brexit, and the 2018 to 2019 trade war as named single-shock events.
The structural slice is the pre-2020 subset where both base year and target year fall before 2020. Pre-2020 is not calm in the absolute sense; it is calmer only relative to the 2020 to 2024 overlap of disruptions. Long-window structural is the accurate phrasing.
Aggregate across all five subjects. 17,875 forecast-vs-actual pairs over the full 25-year window come out to a weighted MAE of ±6.1, or ±4.7 on the 10,450-pair pre-2020 structural subset.
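A pair-weighted aggregate of this kind can be sketched as below. The per-subject MAE values are the Pre-2020 column of the table. The even split of the 10,450 pre-2020 pairs across five subjects (2,090 each) is an assumption made only for illustration, since the per-subject counts for that subset are not stated here. The function name `weighted_mae` is hypothetical.

```python
def weighted_mae(per_subject: dict[str, tuple[float, int]]) -> float:
    """Pair-weighted mean of per-subject MAE values given as (mae, n_pairs)."""
    total_pairs = sum(n for _, n in per_subject.values())
    return sum(mae * n for mae, n in per_subject.values()) / total_pairs


# Pre-2020 MAE per subject from the table; 2,090 pairs each is an assumed even split.
pre_2020 = {
    "Climate": (3.0, 2090),
    "Society": (3.5, 2090),
    "Education": (5.2, 2090),
    "Economy": (5.5, 2090),
    "Geopolitics": (6.4, 2090),
}

aggregate = weighted_mae(pre_2020)
print(f"pre-2020 aggregate MAE: {aggregate:.2f}")
```

Under that even-split assumption the result rounds to the ±4.7 quoted for the structural subset.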
Why these five. Climate, society, education, economy, and geopolitics are the subjects with continuous public data back to 2000 across all 11 regions. Tech subjects don't have 25 years of base values, so their hold-out numbers stand on their own.
Sources. V-Dem v16 indicators (1900 to present), World Bank WDI macro and environmental series (1960 to present), editorial event overlay for documented shocks.
99 forecasts in motion.
Money
Debt service rising, monetary regimes unsettled.
2020 to 2050
- What it covers: Government debt, currency stability, capital flows.
- Sources: IMF, BIS, World Bank, national central banks
- Region coverage: All 11 regions, continuous 2020 to 2050
AI & compute
Capability compounding past AGI, into ASI, no plateau in sight.
2020 to 2050
- What it covers: Top-end AI capability, who controls the compute, where it gets deployed.
- Sources: Epoch AI, Stanford HAI, regulatory filings, public capex disclosures, METR task-length benchmarks
- Region coverage: All 11 regions, with a separate view of top AI clusters
Tech frontier
Long-cycle technologies arriving in compressed sequence.
2020 to 2050
- What it covers: Quantum computing, biotech, energy transition, long-cycle resets.
- Sources: NIH, NSF, IEA, journal publication rates, public capex
- Region coverage: All 11 regions, plus quantum and biotech sub-views
Society
Trust eroding, accelerating through post-labor shock.
2020 to 2050
- What it covers: Trust in institutions, demographics, cohesion vs fragmentation.
- Sources: Edelman Trust Barometer, World Values Survey, national census
- Region coverage: All 11 regions, with an institutional vs societal split
Geopolitics
Volatility rising sharply through the 2032 governance crisis.
2020 to 2050
- What it covers: Alliances, conflict probability, how governments behave.
- Sources: ACLED, CSIS, sanctions registries, public defense spending
- Region coverage: All 11 regions, two-country and multi-country views
Climate
Physical risk rising; ASI mitigation cascade still past the 2050 horizon.
2020 to 2050
- What it covers: Emissions, physical risk, how fast the transition is happening.
- Sources: NOAA, IPCC AR cycles, IEA, national emissions inventories
- Region coverage: All 11 regions, with an ecology sub-view
Economy
Disruption peaks at 2030 post-labor shock, restructured 2033.
2020 to 2050
- What it covers: Growth, jobs, productivity, which sectors are rising and falling.
- Sources: World Bank, OECD, IMF WEO, national statistical agencies
- Region coverage: All 11 regions, with sector-level views
Education
Traditional value falling as marginal cost of knowledge approaches zero.
2020 to 2050
- What it covers: Schooling, skills, talent flows, attainment by region.
- Sources: UNESCO, OECD PISA, World Bank Ed Stats, national education ministries
- Region coverage: All 11 regions, attainment and skills-gap sub-views
Positive signals
Quiet trajectories the foresight literature underweights.
2020 to 2050
- What it covers: Underweighted good news most foresight leaves out.
- Sources: WHO, UNESCO, Our World in Data, World Bank
- Region coverage: All 11 regions, the counter-narrative view
Tap any card for the technical detail behind the plain-language summary.
- 01
We predict a year we haven't seen.
Train on data through 2024, predict 2025, score against what actually happened. 297 forecast-vs-actual pairs. No data leakage.
The technical name is a held-out backtest. The model only sees data on or before a chosen base year, projects forward to a target year it has never seen, then gets scored against what actually happened.
Current sweep: base year 2020, target years 2022, 2024, 2026. That produces 297 forecast-vs-actual pairs at the configuration behind the 5.61-point miss.
An expanded economy-only sweep over 2000 to 2024 base years adds 3,575 pairs, using V-Dem and World Bank WDI to backfill. That sweep produces the structural-vs-shock split shown below.
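The held-out discipline described above can be sketched in a few lines. This is a minimal illustration, not Meridian's pipeline: the series values are invented, the helper names are hypothetical, and the stand-in model is simple persistence (carry the last visible value forward), chosen only so the example runs.

```python
from statistics import mean


def held_out_view(series: dict[int, float], base_year: int) -> dict[int, float]:
    """Return only the observations dated on or before the base year."""
    return {year: value for year, value in series.items() if year <= base_year}


def persistence_forecast(visible: dict[int, float], target_year: int) -> float:
    """Placeholder model: repeat the latest visible value for any target year."""
    return visible[max(visible)]


def backtest(series, base_year, target_years, model=persistence_forecast):
    """Forecast unseen target years from the held-out view, then score by MAE."""
    visible = held_out_view(series, base_year)
    pairs = [(model(visible, t), series[t]) for t in target_years if t in series]
    mae = mean(abs(forecast - actual) for forecast, actual in pairs)
    return pairs, mae


# Invented 0-to-100 index values for one hypothetical forecast surface.
surface = {2018: 50.0, 2019: 52.0, 2020: 55.0, 2022: 58.0, 2024: 61.0, 2026: 60.0}
pairs, mae = backtest(surface, base_year=2020, target_years=[2022, 2024, 2026])

# 9 subjects x 11 regions x 3 target years matches the 297 pairs in the current sweep.
total_pairs = 9 * 11 * 3
print(pairs, round(mae, 2), total_pairs)
```

The point of `held_out_view` is that the model function never receives a value dated after the base year, so leakage has to be introduced deliberately rather than by accident.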
- 02
We grade confidence, not just correctness.
You get penalized for being wrong, and also for being overconfident when you shouldn't be. The score breaks down three ways a forecast can fail.
The score itself is the Brier score, mean squared error between forecast and outcome. A rolling 7-day Brier runs continuously across all resolved predictions. A regression alert fires if it slips by more than 0.066 against the trailing baseline.
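A rolling window with a regression alert might look like the sketch below. The 0.066 threshold comes from the text. Everything else is an assumption: the record layout, the function names, and in particular the choice of the immediately preceding window as the trailing baseline, which the text does not define.

```python
from datetime import date, timedelta

ALERT_THRESHOLD = 0.066  # from the text; the baseline definition below is assumed


def window_brier(resolved, end: date, days: int = 7):
    """Mean Brier over predictions resolved in the `days` ending at `end`, inclusive."""
    start = end - timedelta(days=days - 1)
    scores = [(p - y) ** 2 for d, p, y in resolved if start <= d <= end]
    return sum(scores) / len(scores) if scores else None


def regression_alert(resolved, today: date, days: int = 7) -> bool:
    """True when the current window is worse than the preceding one by more than the threshold."""
    current = window_brier(resolved, today, days)
    baseline = window_brier(resolved, today - timedelta(days=days), days)
    if current is None or baseline is None:
        return False  # not enough data to compare
    return (current - baseline) > ALERT_THRESHOLD


# Invented records: (resolution date, probability of Yes, outcome as 0 or 1).
today = date(2026, 6, 16)
resolved = [
    (today - timedelta(days=10), 0.90, 1),
    (today - timedelta(days=9), 0.10, 0),
    (today - timedelta(days=2), 0.80, 0),
    (today - timedelta(days=1), 0.30, 1),
]

print(window_brier(resolved, today), regression_alert(resolved, today))
```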
To find why a Brier score moves, we apply the Murphy decomposition. It splits the score into three terms, two that reflect forecasting skill and one that reflects how hard the questions were:
Calibration. When you say 70%, does it happen 70% of the time?
Resolution. Can you tell different cases apart, or is every forecast roughly the same?
Uncertainty. How unpredictable were the outcomes themselves? This is the climatology variance in the registry. It peaks at 0.25 when the base rate is 50/50 and does not depend on the forecaster.
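In its standard form the Murphy decomposition writes the Brier score as reliability minus resolution plus uncertainty, which line up with the calibration miss, resolution lift, and climatology variance reported in the registry. The sketch below groups forecasts by their exact probability value, in which case the identity holds exactly. The function name and the sample data are invented for illustration.

```python
from collections import defaultdict


def murphy_decomposition(probs, outcomes):
    """Return (brier, reliability, resolution, uncertainty), grouping by forecast value."""
    n = len(probs)
    base_rate = sum(outcomes) / n

    groups = defaultdict(list)
    for p, y in zip(probs, outcomes):
        groups[p].append(y)

    # Reliability: gap between each stated probability and its observed frequency.
    reliability = sum(
        len(ys) * (p - sum(ys) / len(ys)) ** 2 for p, ys in groups.items()
    ) / n
    # Resolution: how far each group's observed frequency sits from the base rate.
    resolution = sum(
        len(ys) * (sum(ys) / len(ys) - base_rate) ** 2 for ys in groups.values()
    ) / n
    # Uncertainty: variance of the outcomes, independent of the forecasts.
    uncertainty = base_rate * (1 - base_rate)

    brier = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / n
    return brier, reliability, resolution, uncertainty


# Invented sample: eight binary questions, two distinct forecast levels.
probs = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
outcomes = [1, 1, 1, 0, 0, 0, 0, 1]

brier, rel, res, unc = murphy_decomposition(probs, outcomes)
print(brier, rel, res, unc)
```

Assuming each registry value pairs with the label listed after it, the same identity applied to the 14-day snapshot gives 0.0122 − 0.2298 + 0.2494 = 0.0318.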
- 03
We combine five forecasters, not one.
Five model families, each from a distinct lineage. Members are weighted by their track record, not by reputation or how recent they are.
The technical name is a cross-family ensemble. Five forecasters drawn from different model families, combined with a skew-adjusted aggregation method (Powell, Satopää, MacKay, Tetlock, 2024). An evolution of the techniques developed during the IARPA forecasting tournaments.
Forecasters that miss more often get less weight, based on track record. Not by reputation. Not by how recent the work is.
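To make the weighting idea concrete, here is a deliberately simple stand-in: inverse-Brier weights followed by a weighted average. This is not the skew-adjusted aggregation method cited above, whose details are not given here. The member names, track records, and forecasts are all invented.

```python
def track_record_weights(past_brier: dict[str, float], eps: float = 1e-6) -> dict[str, float]:
    """Weight each member by inverse historical Brier, normalized to sum to 1."""
    inverse = {name: 1.0 / (score + eps) for name, score in past_brier.items()}
    total = sum(inverse.values())
    return {name: w / total for name, w in inverse.items()}


def combine(forecasts: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of member probabilities."""
    return sum(weights[name] * p for name, p in forecasts.items())


# Invented track records (lower Brier is better) and forecasts for five members.
past_brier = {"a": 0.05, "b": 0.10, "c": 0.10, "d": 0.20, "e": 0.25}
forecasts = {"a": 0.80, "b": 0.70, "c": 0.75, "d": 0.40, "e": 0.50}

weights = track_record_weights(past_brier)
ensemble_prob = combine(forecasts, weights)
simple_mean = sum(forecasts.values()) / len(forecasts)

print(weights)
print(round(ensemble_prob, 3), round(simple_mean, 3))
```

In this toy case the ensemble lands closer to the members with the better records than a plain average would.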
A second test separates ordering skill from level error: a model that is consistently off in one direction scores poorly on Brier but can still rank every case correctly. Two rank correlation metrics, Spearman's ρ and Kendall's τ-b, measure that ordering directly. Current sweep: ρ = 0.906, τ-b = 0.759 across 297 pairs. A value close to 1.0 means the rank ordering is nearly perfect.
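The two rank metrics can be computed without external libraries, as in the sketch below. The sample values are invented. The first forecast set is the actual series shifted up by ten points, which shows a constant bias leaving both rank correlations at 1. The second swaps one adjacent pair.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b via an O(n^2) pair scan, with the usual tie correction."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: dropped from both denominator terms
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = ((untied + only_x_tied) * (untied + only_y_tied)) ** 0.5
    return (concordant - discordant) / denom


actual = [10, 20, 30, 40, 50]
biased = [a + 10 for a in actual]   # constant 10-point miss, perfect ordering
swapped = [20, 10, 30, 40, 50]      # one adjacent pair out of order

print(spearman_rho(actual, biased), kendall_tau_b(actual, biased))
print(spearman_rho(actual, swapped), kendall_tau_b(actual, swapped))
```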
- 04
Every number traces to a named public source.
139 verified corrections to historical baselines so far. Every cell in the data substrate is auditable.
Historical data is built from named, citable public sources, ingested through an audit-trailed pipeline. An ongoing recalibration process has applied 139 verified corrections to historical baselines to date, improving the ground truth against which every forecast is scored.
Each correction is dated, sourced, and linked to a specific historical cell. The audit register records what was changed, why, and against which public source the change resolves.
The forecast schema also aligns with ForecastBench (Karger et al. 2024), the open academic benchmark for forecasting systems. Any prediction can be re-scored from scratch against ground truth. Different model versions are tracked separately, so improvements are not accidentally credited to old work.
Each card opens to the decision frame, what you walk away with, and where the rigor comes from.
- 01
Holds up under multi-year program review.
A persistent forecast memory with a Brier history per claim and an audit appendix any external reviewer can replicate.
The decision: Which long-horizon programs and partnerships are tracking as intended, and which need to be rethought before the next funding cycle. The calls that resolve over decades and have to defend themselves year after year.
What you walk away with: A forecast registry your board can inspect, year over year, with a calibration trend that improves as the program runs.
Where the rigor comes from:
- Held-out backtest at 297 pairs (2020 → 2022/2024/2026)
- Brier score per claim with Murphy decomposition
- 139 verified corrections to historical baselines, each dated and sourced
- Rank correlation ρ = 0.906 across the 297 pairs
Replaces or augments: NIC Global Trends, CSIS, and RAND scenario work. Those disclaim prediction and refresh narrative scenarios every four years. This produces scored, continuously updated forecasts with the audit trail institutional review demands.
- 02
Reads many regions at once.
Nine subjects across eleven regions, running simultaneously. The intersections are the part most foresight work doesn't reach.
The decision: Where momentum is real across regions, and where it's local elite consensus that won't generalize. The cross-region read that tells you which signals reinforce each other across the map and which are isolated to one cluster.
What you walk away with: A one-page cross-region brief naming the three highest-confidence subject-region intersections for the question on your desk, with the data trail underneath.
Where the rigor comes from:
- 9 subjects × 11 regions = 99 surfaces, each on the same 0-to-100 calibration
- Sources include IMF, BIS, World Bank, IPCC, IEA, Edelman, WHO, NOAA, ACLED, Stanford HAI, Epoch AI, UNESCO
- Positive signals tracked as a first-class surface, not an afterthought
- Cross-family ensemble of five forecasters, skew-adjusted aggregation
Replaces or augments: Verisk Maplecroft and other current-state index families. Those publish current-state only, no trajectory, no Brier. This adds the 30-year forward view and the per-claim accuracy score those don't produce.
- 03
Backs up the calls a commission has to vote on.
A held-out backtest, a Brier per claim, and a calm-window economy number of 5.5 points that survives appropriations review.
The decision: The tactical calls that get voted on and have to survive minutes-of-meeting scrutiny. Industry-mix shifts, workforce projection, infrastructure justification, appropriations defense. These are reads where the question after the vote is always: where did this number come from?
What you walk away with: A forecast appendix with a Brier score per claim and a held-out backtest disclosure that survives audit-committee inspection. A scenario range for shock years, a point estimate for calm years.
Where the rigor comes from:
- Held-out 2025 backtest: the model only saw data through 2024
- Murphy decomposition on every claim
- No retroactive tuning. Model versions tracked separately.
- 139 verified corrections, dated and source-linked
- ForecastBench-schema aligned (Karger et al. 2024)
Replaces or augments: Eurasia Group, EIU, S&P Global consulting reports for the board-and-commission context. Those publish no held-out backtest and no Brier on their methodology pages. This produces the verify-me artifacts that trust-me advisory cannot.
- 04
When to act on a number, when to plan for a range.
Calm regimes get a point estimate. Shock regimes get a scenario range. The calibration tells you which.
The decision: Industry-mix shifts. Site selection. Workforce projection. Whether to commit to a single sector outlook or hedge across a range. The calibration tells you when the regime is calm enough to act on a point estimate and when the regime is in shock and you need a scenario range.
What you walk away with: A forecast appendix that names the regime (calm or shock), gives you a number or a range with that regime's measured uncertainty, and carries a Brier history on the exact claim.
Where the rigor comes from:
- Regime-aware backtest. 2000-2024 expanded economy sweep, 3,575 pairs.
- Calm window scored at ±5.5, shock window at ±9.8, separately
- Tightest measured slice: Tech · Quantum at ±2.8 on the 2025 hold-out
- Widest measured slice: Money · Crypto at ±8.9 on the 2025 hold-out
- Cross-family ensemble, weighted by track record not reputation
Replaces or augments: Oxford Economics (87 economies, economic-only, no Brier) and EIU (5-year horizon). This adds a 30-year horizon and cross-domain STEEPE signals into the same calibration so an industry-mix call factors in the forces that bend it, not only its own historical trend.
Registry and calibration trend are shared with engaged clients under confidentiality.
- Snapshot date
- 2026-06-16
- Benchmark accuracy
- Base year 2020 → target years 2022, 2024, 2026. 297 forecast-vs-actual pairs. Model trained on data through the base year only.
- Brier window
- 7-day rolling on the production loop, n=36 resolved predictions. Three windows reported (7d, 14d, 30d) for cross-section. Multi-class bucket templates (BTC price ranges) are excluded from the headline pending per-event multinomial-Brier aggregation, and are tracked separately in the audit module.
- Long-window backtest
- 17,875 pairs across 25 years on five long-history subjects. Sources: V-Dem v16, World Bank WDI.
- Reference scale and known gaps
- Brier is computed on the standard scale (Brier 1950). External benchmarks like ForecastBench (Karger et al. 2024, arXiv:2409.19839) and Tetlock GJP score multi-month forecasting on questions with non-zero base-rate uncertainty. The registry above is dominantly 1-day binary macro thresholds; raw aggregate numbers are not directly comparable to those references. Skill is validated through the gate sequence below, not by direct Brier comparison.
- Validation gates
- A template ships to the live trade harness only after clearing, in order: (a) Brier Skill Score vs persistence baseline above +0.10 over a backtest of n ≥ 200; (b) BSS vs market-implied probability above +0.03 over the same; (c) net-positive paper-trade P&L over 60 days. Each gate is reproducible from the public audit module (brier-audit.ts in the exponentialworld repo) and is the published reason a template is or is not in live use.
- Cohort status
- v1-live-fred-20260504 — accruing, n=5 of 30 resolved
- v3-cai-veda-r2-20260607 — no resolutions yet
- v2-ic4wsa-blend-20260605 — no resolutions yet
- pre-versioned — no resolutions yet
Template-version cohorts surface separately so the post-Round-2 macro-only cohort (v3-cai-veda-r2-*) accrues to N=30 before being scored, rather than being pooled with the legacy financial-heavy v2 into a single misleading headline. The N=30 floor follows Bradley, Schwartz, and Hashino (2008).
- Audit access
- Full registry, calibration trend, and per-cell correction log shared with engaged clients under confidentiality.
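The validation gates listed above can be expressed as an ordered check. The sketch below uses the thresholds stated in the gate description (BSS above +0.10 vs persistence at n ≥ 200, BSS above +0.03 vs market-implied probability, net-positive paper P&L over 60 days) and the standard Brier Skill Score formula. It is not the contents of brier-audit.ts: the class, field, and function names are hypothetical, and the sample figures are invented.

```python
from dataclasses import dataclass
from typing import Optional


def brier_skill_score(model_brier: float, reference_brier: float) -> float:
    """BSS = 1 - BS_model / BS_reference; positive means better than the reference."""
    return 1.0 - model_brier / reference_brier


@dataclass
class TemplateEvidence:
    n_backtest: int           # resolved predictions in the backtest
    model_brier: float        # template's Brier over the backtest
    persistence_brier: float  # Brier of the persistence baseline on the same questions
    market_brier: float       # Brier of market-implied probabilities on the same questions
    paper_pnl_60d: float      # net paper-trade P&L over 60 days


def first_failed_gate(e: TemplateEvidence) -> Optional[str]:
    """Return the label of the first gate not cleared, or None if all three pass."""
    if e.n_backtest < 200 or brier_skill_score(e.model_brier, e.persistence_brier) <= 0.10:
        return "a"
    if brier_skill_score(e.model_brier, e.market_brier) <= 0.03:
        return "b"
    if e.paper_pnl_60d <= 0:
        return "c"
    return None


# Invented evidence for three hypothetical templates.
passes_all = TemplateEvidence(250, 0.030, 0.040, 0.032, 1200.0)
fails_market = TemplateEvidence(250, 0.030, 0.040, 0.0305, 1200.0)
too_small = TemplateEvidence(150, 0.030, 0.040, 0.032, 1200.0)

for evidence in (passes_all, fails_market, too_small):
    print(first_failed_gate(evidence))
```

Checking the gates in order means a template that fails against persistence is never credited for a lucky paper-trade run.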