Methodology

A forecast earns trust after reality answers.

Meridian is measured every day as calls resolve. Wins, misses, rolling Brier windows, hold-out tests, and cohort status remain visible here because forecast credibility should compound in public, not hide in a deck.

Updated
Daily
Snapshot
2026-06-16
Start with
Live scores

Most forecasters ask to be trusted. Meridian leaves a record.

The top of this page is for executive review. Live scores show what resolved recently. Resolved calls show examples. The method and audit appendix explain how the record is produced.

What makes it different

Compared to consultancy reports

  • We score against unseen years, not the past.

  • An accuracy score on every claim, not a narrative.

  • Thirty-year horizon. EIU publishes five.

  • Audit trail per cell, not an executive summary.

Compared to in-house models

  • Cross-domain calibration. Nine subjects on one score.

  • No retroactive tuning. Versions tracked separately.

  • ForecastBench-aligned. Replicable by an external team.

  • 139 verified corrections, each dated and sourced.

Benchmark · 2020-2026 hold-out · 297 pairs
Accuracy
2020-2026 hold-out · n=297

5.61-point average miss on a 100-point scale

We hid the target years from the model, then scored what it predicted against what actually happened. Across 297 forecast-vs-actual pairs the average miss was 5.61 points on a 0-to-100 scale. A separate 25-year structural backtest covering 17,875 pairs comes in tighter at ±4.7 on the pre-2020 subset.

99
Things we track
9 subjects × 11 regions

99 forecast surfaces running 2020 to 2050

9 subjects times 11 regions makes 99 forecast surfaces. Each runs from 2020 to 2050 and updates as new data arrives. The subject cards below list sources and the regions covered.

297
Pairs graded
Target years unseen by the model

297 forecast-vs-actual pairs already scored

297 forecasts already scored against what actually happened: 33 per subject and 27 per region, both more than enough for reliable inference. Every new forecast joins the registry.
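The counts above can be reproduced with a short enumeration. This is a sketch under two assumptions: that each of the 99 surfaces contributes exactly one pair per target year of the current sweep (2022, 2024, 2026), and that regions can be stood in for by placeholder labels, since the page does not list their names.

```python
from collections import Counter
from itertools import product

SUBJECTS = [
    "Money", "AI & compute", "Tech frontier", "Society", "Geopolitics",
    "Climate", "Economy", "Education", "Positive signals",
]
# Region names are not given on this page; these labels are placeholders.
REGIONS = [f"region-{i}" for i in range(1, 12)]
# Target years of the current sweep (base year 2020).
TARGET_YEARS = [2022, 2024, 2026]

# One surface per (subject, region); one pair per surface per target year (assumed).
surfaces = list(product(SUBJECTS, REGIONS))
pairs = [(s, r, y) for (s, r) in surfaces for y in TARGET_YEARS]

per_subject = Counter(s for s, _, _ in pairs)
per_region = Counter(r for _, r, _ in pairs)

print(len(surfaces), len(pairs))      # 99 297
print(set(per_subject.values()))      # {33}
print(set(per_region.values()))       # {27}
```

Under that assumption, 9 × 11 × 3 gives the 297 pairs, 11 × 3 gives 33 per subject, and 9 × 3 gives 27 per region.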

30 yr
Forecast horizon
2020 to 2050

Thirty years forward, on every surface we run

Each of the 99 forecast surfaces runs from 2020 to 2050. That is the horizon a board needs to read its own decision against the regime it will live inside, not just the next quarter or fiscal year.

Production calibration · current
0.031
Brier score
7-day rolling · n=36 · live

Lower is better. 0.25 is a coin flip.

Calibration you can audit, not just trust.

Brier score measures how close a probability was to what actually happened. A perfect forecast scores 0. A confident wrong call scores 1. This window scores 0.031 on 36 resolved predictions in 1-day binary financial threshold templates (will SP500 close above X, will VIX be above Y, etc.). That figure is not directly comparable to Tetlock or ForecastBench numbers, which benchmark multi-month forecasts on different question mixes. Skill validation runs through a documented gate sequence: Brier Skill Score (BSS) vs persistence baseline, then BSS vs market-implied probability, then live P&L. See below for the gate state.

382
Resolved
Predictions with outcomes · 30d

382 predictions resolved in the last 30 days

Live production loop. Templates score themselves as outcomes resolve, and the rolling Brier you see updates automatically. The number is small enough to read at a glance, large enough for honest inference. Multi-class bucket templates (BTC ranges) are excluded from this headline pending per-event multinomial-Brier aggregation; they are tracked separately.

Chart: average miss of 5.61 points · 2020-2026 hold-out · actuals 2022-2026
Average miss: 5.61 points on a 0-to-100 scale. 297 forecast-vs-actual pairs across the 2020-2026 hold-out.

Most forecasters do not publish their accuracy. We do.

01 · Calls
Resolved calls from the production registry

Six examples from the last 30 days. Five confident wins and one under-confident call that landed on the right side. The registry publishes both.

  1. Economy

    WTI crude will close above $80 a barrel

    Meridian
    97% Yes
    Resolved 2026-05-19
    Yes, $101.56
    FRED DCOILWTICO
    Verdict
    Right call
    Strongly held
    Brier 0.001
  2. Economy

    Weekly initial jobless claims will exceed 250,000

    Meridian
    9% No
    Resolved 2026-05-19
    No, 211,000
    FRED ICSA
    Verdict
    Right call
    Strongly held
    Brier 0.008
  3. Money

    Fed Funds Effective Rate will be above 4.0%

    Meridian
    15% No
    Resolved 2026-05-19
    No, 3.63%
    FRED EFFR
    Verdict
    Right call
    Confidently held
    Brier 0.023
  4. Money

    2Y-10Y Treasury spread will be positive (no inversion)

    Meridian
    89% Yes
    Resolved 2026-05-19
    Yes, +0.54
    FRED T10Y2Y
    Verdict
    Right call
    Strongly held
    Brier 0.013
  5. Economy

    VIX will exceed 30 (panic territory)

    Meridian
    6% No
    Resolved 2026-05-19
    No, 18.43
    FRED VIXCLS
    Verdict
    Right call
    Strongly held
    Brier 0.003
  6. Money

    Gold will close above $3,200 an ounce

    Meridian
    65% Yes
    Resolved 2026-05-19
    Yes, $4,587
    Yahoo GC=F
    Verdict
    Right call
    Under-confident, right side
    Brier 0.125

Sample of resolved calls from the production registry. The full registry, scoring details, and reasoning trail are shared with engaged clients under confidentiality.
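The per-call Brier values above can be recomputed from the displayed probabilities. This is a sketch that assumes each percentage is the stated probability of "Yes" (so "9% No" means a 9% chance of Yes, resolved No). The displayed percentages appear to be rounded, so recomputed values can differ from the published ones in the third decimal.

```python
def brier(p_yes: float, outcome_yes: bool) -> float:
    """Squared error between the forecast probability and the 0/1 outcome."""
    return (p_yes - (1.0 if outcome_yes else 0.0)) ** 2


# (call, displayed probability of Yes, resolved Yes?, published Brier)
CALLS = [
    ("WTI above $80", 0.97, True, 0.001),
    ("Jobless claims above 250,000", 0.09, False, 0.008),
    ("Fed Funds above 4.0%", 0.15, False, 0.023),
    ("2Y-10Y spread positive", 0.89, True, 0.013),
    ("VIX above 30", 0.06, False, 0.003),
    ("Gold above $3,200", 0.65, True, 0.125),
]

for name, p_yes, outcome_yes, published in CALLS:
    recomputed = brier(p_yes, outcome_yes)
    print(f"{name:30s} recomputed {recomputed:.4f}  published {published:.3f}")
```

The gold call shows why a right call can still carry the largest score in the sample: at 65% the forecast was on the correct side but left a 35-point gap to the outcome, and 0.35 squared is roughly 0.12.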

02 · Accuracy
The 2025 test
Brier score scale
  • Perfect · 0.00
  • Top AI · ~0.15
  • Best humans · ~0.20
  • Coin flip · 0.25
  • Pundits · ~0.35
  • Worst · 1.00
  • We are here · 0.031 (measured 7-day, n=36)
03 · Live production registry
Rolling Brier on production templates, three windows

Live production registry

Snapshot 2026-06-16
Last 7 days
0.031
n = 36

7-day rolling window, registry snapshot

Murphy decomposition
  • Calibration miss · 0.0309
  • Resolution lift · 0.2431
  • Climatology variance · 0.2431
By category
  • Economy · n = 11 · Brier 0.010
  • Society · n = 10 · Brier 0.012
  • Money · n = 6 · Brier 0.065
  • Geopolitics · n = 5 · Brier 0.070
  • AI & compute · n = 4 · Brier 0.042
Last 14 days
0.032
n = 141

14-day rolling window, registry snapshot

Murphy decomposition
  • Calibration miss · 0.0122
  • Resolution lift · 0.2298
  • Climatology variance · 0.2494
By category
  • Money · n = 63 · Brier 0.008
  • Economy · n = 43 · Brier 0.045
  • Society · n = 18 · Brier 0.022
  • Geopolitics · n = 9 · Brier 0.083
  • AI & compute · n = 8 · Brier 0.110
Last 30 days
0.040
n = 382

30-day rolling window, registry snapshot

Murphy decomposition
  • Calibration miss · 0.0118
  • Resolution lift · 0.2229
  • Climatology variance · 0.2500
By category
  • Money · n = 193 · Brier 0.041
  • Economy · n = 146 · Brier 0.023
  • Society · n = 22 · Brier 0.029
  • Geopolitics · n = 12 · Brier 0.152
  • AI & compute · n = 9 · Brier 0.163
Three rolling windows on the production registry. The Murphy decomposition and per-category breakdown sit underneath each headline number.
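A reader can cross-check each headline two ways from the published figures: by recombining the Murphy terms (calibration miss minus resolution lift plus climatology variance) and by taking the count-weighted mean of the category scores. The sketch below uses the numbers shown for the three windows. The per-category split of each line into a count and a score is a reading of the page layout, supported by the counts summing to each window's n. Small differences are expected because the inputs are rounded and a binned decomposition is approximate.

```python
# Published figures per window: headline Brier, n, Murphy terms, and
# (count, Brier) per category in the order shown on the page.
WINDOWS = {
    "7d": {"brier": 0.031, "n": 36, "rel": 0.0309, "res": 0.2431, "unc": 0.2431,
           "cats": [(11, 0.010), (10, 0.012), (6, 0.065), (5, 0.070), (4, 0.042)]},
    "14d": {"brier": 0.032, "n": 141, "rel": 0.0122, "res": 0.2298, "unc": 0.2494,
            "cats": [(63, 0.008), (43, 0.045), (18, 0.022), (9, 0.083), (8, 0.110)]},
    "30d": {"brier": 0.040, "n": 382, "rel": 0.0118, "res": 0.2229, "unc": 0.2500,
            "cats": [(193, 0.041), (146, 0.023), (22, 0.029), (12, 0.152), (9, 0.163)]},
}


def murphy_total(window: dict) -> float:
    """Brier recombined as reliability - resolution + uncertainty."""
    return window["rel"] - window["res"] + window["unc"]


def category_weighted(window: dict) -> float:
    """Count-weighted mean of the per-category Brier scores."""
    total_n = sum(n for n, _ in window["cats"])
    return sum(n * b for n, b in window["cats"]) / total_n


for name, w in WINDOWS.items():
    print(f"{name}: headline {w['brier']:.3f}  "
          f"murphy {murphy_total(w):.4f}  categories {category_weighted(w):.4f}")
```

In all three windows both recombinations land within about 0.002 of the headline figure.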
04 · Long window
Twenty-five years of structural accuracy

The longer window

17,875 pairs · 25 years

Long-window structural accuracy across the five subjects with continuous public data back to the year 2000. Aggregate ±4.7 MAE across 10,450 pre-2020 pairs.

Subject        Shock 20–24   Pre-2020   Rank ρ
Climate        ±5.2          ±3.0       0.97
Society        ±7.0          ±3.5       0.86
Education      ±5.4          ±5.2       0.92
Economy        ±7.4          ±5.5       0.70
Geopolitics    ±8.7          ±6.4       0.95

The scorecard is computed only over 2020 to 2024. That window is the compound-shock era: COVID, Russia-Ukraine, Iran-Israel, sovereign debt repricing, AI capability acceleration all overlapping at once. Expanding the test back 25 years gives a long-window view that includes the 2008 GFC, 9/11, Crimea 2014, Brexit, and the 2018 to 2019 trade war as named single-shock events.

The structural slice is the pre-2020 subset where both base year and target year fall before 2020. Pre-2020 is not calm in the absolute sense; it is calmer relative to the 2020 to 2024 overlap of disruptions. Long-window structural is the accurate phrasing.

Aggregate across all five subjects. 17,875 forecast-vs-actual pairs over the full 25-year window come out to a weighted MAE of ±6.1, or ±4.7 on the 10,450-pair pre-2020 structural subset.
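The aggregate is a pair-weighted mean of the per-subject errors. As a sketch, the pre-2020 figure can be recovered from the subject table if the 10,450 pre-2020 pairs are split evenly across the five subjects. That even split is an assumption, not something the page states.

```python
# Pre-2020 MAE per subject, from the table above (points on the 0-to-100 scale).
PRE_2020_MAE = {
    "Climate": 3.0,
    "Society": 3.5,
    "Education": 5.2,
    "Economy": 5.5,
    "Geopolitics": 6.4,
}

# Assumption: the 10,450 pre-2020 pairs are split evenly across subjects.
PAIRS_PER_SUBJECT = 10_450 // len(PRE_2020_MAE)


def pooled_mae(mae_by_group: dict, n_by_group: dict) -> float:
    """Pair-weighted mean of per-group mean absolute errors."""
    total_pairs = sum(n_by_group.values())
    weighted_sum = sum(mae_by_group[g] * n_by_group[g] for g in mae_by_group)
    return weighted_sum / total_pairs


counts = {subject: PAIRS_PER_SUBJECT for subject in PRE_2020_MAE}
print(round(pooled_mae(PRE_2020_MAE, counts), 2))  # 4.72
```

The result, 4.72, matches the ±4.7 quoted for the structural subset. If the subjects carried unequal pair counts, the same function applies with the real counts in place of the even split.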

Why these five. Climate, society, education, economy, and geopolitics are the subjects with continuous public data back to 2000 across all 11 regions. Tech subjects don't have 25 years of base values, so their hold-out numbers stand on their own.

Sources. V-Dem v16 indicators (1900 to present), World Bank WDI macro and environmental series (1960 to present), editorial event overlay for documented shocks.

05 · What we track
Nine subjects, eleven regions, thirty years

99 forecasts in motion.

  1. Money

    Debt service rising, monetary regimes unsettled.

    2020 to 2050
    What it covers
    Government debt, currency stability, capital flows.
    Sources
    IMF, BIS, World Bank, national central banks
    Region coverage
    All 11 regions, continuous 2020 to 2050
  2. AI & compute

    Capability compounding past AGI, into ASI, no plateau in sight.

    2020 to 2050
    What it covers
    Top-end AI capability, who controls the compute, where it gets deployed.
    Sources
    Epoch AI, Stanford HAI, regulatory filings, public capex disclosures, METR task-length benchmarks
    Region coverage
    All 11 regions, with a separate view of top AI clusters
  3. Tech frontier

    Long-cycle technologies arriving in compressed sequence.

    2020 to 2050
    What it covers
    Quantum computing, biotech, energy transition, long-cycle resets.
    Sources
    NIH, NSF, IEA, journal publication rates, public capex
    Region coverage
    All 11 regions, plus quantum and biotech sub-views
  4. Society

    Trust eroding, accelerating through post-labor shock.

    2020 to 2050
    What it covers
    Trust in institutions, demographics, cohesion vs fragmentation.
    Sources
    Edelman Trust Barometer, World Values Survey, national census
    Region coverage
    All 11 regions, with an institutional vs societal split
  5. Geopolitics

    Volatility rising sharply through the 2032 governance crisis.

    2020 to 2050
    What it covers
    Alliances, conflict probability, how governments behave.
    Sources
    ACLED, CSIS, sanctions registries, public defense spending
    Region coverage
    All 11 regions, two-country and multi-country views
  6. Climate

    Physical risk rising; ASI mitigation cascade still past the 2050 horizon.

    2020 to 2050
    What it covers
    Emissions, physical risk, how fast the transition is happening.
    Sources
    NOAA, IPCC AR cycles, IEA, national emissions inventories
    Region coverage
    All 11 regions, with an ecology sub-view
  7. Economy

    Disruption peaks at 2030 post-labor shock, restructured 2033.

    2020 to 2050
    What it covers
    Growth, jobs, productivity, which sectors are rising and falling.
    Sources
    World Bank, OECD, IMF WEO, national statistical agencies
    Region coverage
    All 11 regions, with sector-level views
  8. Education

    Traditional value falling as marginal cost of knowledge approaches zero.

    2020 to 2050
    What it covers
    Schooling, skills, talent flows, attainment by region.
    Sources
    UNESCO, OECD PISA, World Bank Ed Stats, national education ministries
    Region coverage
    All 11 regions, attainment and skills-gap sub-views
  9. Positive signals

    Quiet trajectories the foresight literature underweights.

    2020 to 2050
    What it covers
    Underweighted good news most foresight leaves out.
    Sources
    WHO, UNESCO, Our World in Data, World Bank
    Region coverage
    All 11 regions, the counter-narrative view
06 · How it works
Four checks, all published, all auditable

Each card gives a plain-language summary followed by the technical detail behind it.

  1. 01

    We predict a year we haven't seen.

    Train on data through a base year, predict target years the model has never seen, score against what actually happened. 297 forecast-vs-actual pairs in the current sweep. No data leakage.

    The technical name is a held-out backtest. The model only sees data on or before a chosen base year, projects forward to a target year it has never seen, then gets scored against what actually happened.

    Current sweep: base year 2020, target years 2022, 2024, 2026. That produces 297 forecast-vs-actual pairs at the configuration behind the 5.61-point miss.

    An expanded economy-only sweep over 2000 to 2024 base years adds 3,575 pairs, using V-Dem and World Bank WDI to backfill. That sweep produces the structural-vs-shock split shown below.
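The held-out procedure described in this card can be sketched in a few lines. The series values below are hypothetical, and the least-squares trend line is only a stand-in for whatever model is actually used. The point of the sketch is the cut-off: the forecaster receives nothing dated after the base year.

```python
from statistics import mean


def linear_trend_forecast(history: dict[int, float], target_year: int) -> float:
    """Stand-in model: least-squares line through (year, value), clipped to 0-100."""
    years = list(history)
    values = [history[y] for y in years]
    year_bar, value_bar = mean(years), mean(values)
    denom = sum((y - year_bar) ** 2 for y in years)
    slope = sum((y - year_bar) * (v - value_bar) for y, v in zip(years, values)) / denom
    return min(100.0, max(0.0, value_bar + slope * (target_year - year_bar)))


def held_out_backtest(series: dict[int, float], base_year: int, target_years: list[int]):
    """Forecast each target year using only data on or before the base year."""
    visible = {y: v for y, v in series.items() if y <= base_year}  # leakage guard
    return [(linear_trend_forecast(visible, t), series[t]) for t in target_years]


def mean_absolute_error(pairs) -> float:
    """Average miss across forecast-vs-actual pairs."""
    return mean(abs(forecast - actual) for forecast, actual in pairs)


# Hypothetical 0-to-100 series for one surface; not real registry data.
series = {2016: 50.0, 2017: 51.0, 2018: 52.0, 2019: 53.0, 2020: 54.0,
          2022: 58.0, 2024: 57.0, 2026: 62.0}

pairs = held_out_backtest(series, base_year=2020, target_years=[2022, 2024, 2026])
print(pairs)                       # forecasts 56, 58, 60 against actuals 58, 57, 62
print(mean_absolute_error(pairs))  # 5/3, about 1.67 points
```

Running the same function over every surface and pooling the pairs is what would produce a single average-miss figure like the one reported for the sweep.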

  2. 02

    We grade confidence, not just correctness.

    You get penalized for being wrong, and also for being overconfident when you shouldn't be. The score breaks down into three components that show where a forecast fell short.

    The score itself is the Brier score, the mean squared error between forecast probability and outcome. A rolling 7-day Brier runs continuously across all resolved predictions. A regression alert fires if it rises (worsens) by more than 0.066 against the trailing baseline.

    To find why a Brier score moves, we apply the Murphy decomposition. It splits the score into three terms, Brier = calibration miss − resolution + uncertainty:

    Calibration. When you say 70%, does it happen 70% of the time?

    Resolution. Can you tell different cases apart, or is every forecast roughly the same?

    Uncertainty. How unpredictable were the outcomes themselves? This is the climatology variance: it depends only on the base rate, peaks at 0.25 for a 50/50 base rate, and is outside the forecaster's control.
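A minimal sketch of the decomposition named in this card, in Murphy's standard form: Brier = reliability − resolution + uncertainty. These three terms correspond to the registry's calibration miss, resolution lift, and climatology variance. The forecasts and outcomes below are toy data. Grouping by distinct forecast value, as done here, makes the identity exact; with wider probability bins it holds only approximately.

```python
from collections import defaultdict


def brier_score(forecasts, outcomes) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by forecast value."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    # Reliability: gap between each stated probability and its observed frequency.
    reliability = sum(
        len(obs) * (f - sum(obs) / len(obs)) ** 2 for f, obs in groups.items()
    ) / n
    # Resolution: how far each group's observed frequency sits from the base rate.
    resolution = sum(
        len(obs) * (sum(obs) / len(obs) - base_rate) ** 2 for obs in groups.values()
    ) / n
    # Uncertainty: variance of the outcomes, fixed by the base rate alone.
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty


# Toy data, not registry data.
forecasts = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.5, 0.5]
outcomes = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

rel, res, unc = murphy_decomposition(forecasts, outcomes)
print(brier_score(forecasts, outcomes))  # about 0.218
print(rel, res, unc)                     # about 0.018, 0.05, 0.25
```

Lower reliability and higher resolution both pull the Brier score down, while the uncertainty term sets the level a constant base-rate forecast would score.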

  3. 03

    We combine five forecasters, not one.

    Five model families, each from a distinct lineage. Members are weighted by their track record, not by reputation or how recent they are.

    The technical name is a cross-family ensemble. Five forecasters drawn from different model families, combined with a skew-adjusted aggregation method (Powell, Satopää, MacKay, Tetlock, 2024). An evolution of the techniques developed during the IARPA forecasting tournaments.

    Forecasters that miss more often get less weight, based on track record. Not by reputation. Not by how recent the work is.

    A second test checks what an error score alone cannot show: whether the ordering is right. A model that is consistently off in one direction scores poorly on Brier but still ranks everything perfectly. Two rank correlation metrics measure that ordering directly. Current sweep: ρ = 0.906, τ-b = 0.759 across 297 pairs. Close to 1.0 means the rank ordering is nearly perfect.
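The two rank metrics named here, Spearman's ρ and Kendall's τ-b, can be sketched in plain Python. The data is hypothetical. The first example is a forecaster that is 5 points high everywhere: its absolute error is 5 points, yet both rank metrics are a perfect 1.0. The second swaps one adjacent pair.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def spearman_rho(x, y) -> float:
    """Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y) -> float:
    """Concordant minus discordant pairs, with the tau-b correction for ties."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt((untied + tied_x_only) * (untied + tied_y_only))


actual = [20, 35, 50, 65, 80]            # hypothetical actuals
biased = [a + 5 for a in actual]         # always 5 points too high
swapped = [22, 33, 66, 52, 79]           # one adjacent pair out of order

print(spearman_rho(biased, actual), kendall_tau_b(biased, actual))    # 1.0 1.0
print(spearman_rho(swapped, actual), kendall_tau_b(swapped, actual))  # about 0.9 and 0.8
```

A single swapped pair out of ten costs τ-b more than ρ, which is why the two metrics are usually reported together.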

  4. 04

    Every number traces to a named public source.

    139 verified corrections to historical baselines so far. Every cell in the data substrate is auditable.

    Historical data is built from named, citable public sources, ingested through an audit-trailed pipeline. An ongoing recalibration process has applied 139 verified corrections to historical baselines to date, improving the ground truth against which every forecast is scored.

    Each correction is dated, sourced, and linked to a specific historical cell. The audit register records what was changed, why, and against which public source the change resolves.

    The forecast schema also aligns with ForecastBench (Karger et al. 2024), the open academic benchmark for forecasting systems. Any prediction can be re-scored from scratch against ground truth. Different model versions are tracked separately, so improvements are not accidentally credited to old work.

07 · Built for
Four decisions this is shaped for

Each card opens to the decision frame, what you walk away with, and where the rigor comes from.

  1. 01

    Holds up under multi-year program review.

    A persistent forecast memory with a Brier history per claim and an audit appendix any external reviewer can replicate.

    The decision

    Which long-horizon programs and partnerships are tracking as intended, and which need to be rethought before the next funding cycle. The calls that resolve over decades and have to defend themselves year after year.

    What you walk away with

    A forecast registry your board can inspect, year over year, with a calibration trend that improves as the program runs.

    Where the rigor comes from
    • Held-out backtest at 297 pairs (2020 → 2022/2024/2026)
    • Brier score per claim with Murphy decomposition
    • 139 verified corrections to historical baselines, each dated and sourced
    • Rank correlation ρ = 0.906 across the 297 pairs
    Replaces or augments

    NIC Global Trends, CSIS, and RAND scenario work. Those disclaim prediction and refresh narrative scenarios every four years. This produces scored, continuously updated forecasts with the audit trail institutional review demands.

  2. 02

    Reads many regions at once.

    Nine subjects across eleven regions, running simultaneously. The intersections are the part most foresight work doesn't reach.

    The decision

    Where momentum is real across regions, and where it's local elite consensus that won't generalize. The cross-region read that tells you which signals reinforce each other across the map and which are isolated to one cluster.

    What you walk away with

    A one-page cross-region brief naming the three highest-confidence subject-region intersections for the question on your desk, with the data trail underneath.

    Where the rigor comes from
    • 9 subjects × 11 regions = 99 surfaces, each on the same 0-to-100 calibration
    • Sources include IMF, BIS, World Bank, IPCC, IEA, Edelman, WHO, NOAA, ACLED, Stanford HAI, Epoch AI, UNESCO
    • Positive signals tracked as a first-class surface, not an afterthought
    • Cross-family ensemble of five forecasters, skew-adjusted aggregation
    Replaces or augments

    Verisk Maplecroft and other current-state index families. Those publish current-state only, no trajectory, no Brier. This adds the 30-year forward view and the per-claim accuracy score those don't produce.

  3. 03

    Backs up the calls a commission has to vote on.

    A held-out backtest, a Brier per claim, and a calm-window economy number of 5.5 points that survives appropriations review.

    The decision

    The tactical calls that get voted on and have to survive minutes-of-meeting scrutiny. Industry-mix shifts, workforce projection, infrastructure justification, appropriations defense. These are the reads where the question after the vote is always "where did this number come from?"

    What you walk away with

    A forecast appendix with a Brier score per claim and a held-out backtest disclosure that survives audit-committee inspection. A scenario range for shock years, a point estimate for calm years.

    Where the rigor comes from
    • Held-out 2025 backtest: the model only saw data through 2024
    • Murphy decomposition on every claim
    • No retroactive tuning. Model versions tracked separately.
    • 139 verified corrections, dated and source-linked
    • ForecastBench-schema aligned (Karger et al. 2024)
    Replaces or augments

    Eurasia Group, EIU, S&P Global consulting reports for the board-and-commission context. Those publish no held-out backtest and no Brier on their methodology pages. This produces the verify-me artifacts that trust-me advisory cannot.

  4. 04

    When to act on a number, when to plan for a range.

    Calm regimes get a point estimate. Shock regimes get a scenario range. The calibration tells you which.

    The decision

    Industry-mix shifts. Site selection. Workforce projection. Whether to commit to a single sector outlook or hedge across a range. The calibration tells you when the regime is calm enough to act on a point estimate and when the regime is in shock and you need a scenario range.

    What you walk away with

    A forecast appendix that names the regime (calm or shock), gives you a number or a range with that regime's measured uncertainty, and carries a Brier history on the exact claim.

    Where the rigor comes from
    • Regime-aware backtest. 2000-2024 expanded economy sweep, 3,575 pairs.
    • Calm window scored at ±5.5, shock window at ±9.8, separately
    • Tightest measured slice: Tech · Quantum at ±2.8 on the 2025 hold-out
    • Widest measured slice: Money · Crypto at ±8.9 on the 2025 hold-out
    • Cross-family ensemble, weighted by track record not reputation
    Replaces or augments

    Oxford Economics (87 economies, economic-only, no Brier) and EIU (5-year horizon). This adds a 30-year horizon and cross-domain STEEPE signals into the same calibration so an industry-mix call factors in the forces that bend it, not only its own historical trend.

Registry and calibration trend are shared with engaged clients under confidentiality.

Request an introduction →

Appendix
Audit reference
Snapshot date
2026-06-16
Benchmark accuracy
Base year 2020 → target years 2022, 2024, 2026. 297 forecast-vs-actual pairs. Model trained on data through the base year only.
Brier window
7-day rolling on the production loop, n=36 resolved predictions. Three windows reported (7d, 14d, 30d) for cross-section. Multi-class bucket templates (BTC price ranges) are excluded from the headline pending per-event multinomial-Brier aggregation, and are tracked separately in the audit module.
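The rolling window and the regression alert from the method section can be sketched as follows. The resolved predictions below are hypothetical. The page does not define "trailing baseline", so this sketch assumes it is the previous 7-day window. The 0.066 threshold is the one quoted in the method section.

```python
from datetime import date, timedelta

ALERT_THRESHOLD = 0.066  # quoted regression threshold


def rolling_brier(resolved, as_of: date, days: int = 7):
    """Mean Brier over predictions resolved in the `days` days ending at `as_of`.

    `resolved` holds (resolution_date, probability_of_yes, outcome) with outcome 0 or 1.
    Returns None when nothing resolved inside the window.
    """
    start = as_of - timedelta(days=days)
    window = [(p, o) for d, p, o in resolved if start < d <= as_of]
    if not window:
        return None
    return sum((p - o) ** 2 for p, o in window) / len(window)


def regression_alert(current, baseline, threshold: float = ALERT_THRESHOLD) -> bool:
    """True when the current window is worse (higher) than baseline by more than threshold."""
    if current is None or baseline is None:
        return False
    return current - baseline > threshold


# Hypothetical resolved predictions, not registry data.
resolved = [
    (date(2026, 6, 5), 0.9, 1),
    (date(2026, 6, 8), 0.8, 1),
    (date(2026, 6, 12), 0.9, 0),   # confident miss
    (date(2026, 6, 15), 0.2, 0),
]

current = rolling_brier(resolved, as_of=date(2026, 6, 16))
baseline = rolling_brier(resolved, as_of=date(2026, 6, 9))  # assumed: prior 7-day window
print(current, baseline, regression_alert(current, baseline))
```

Here one confident miss lifts the current window to about 0.425 against a baseline of about 0.025, so the alert fires.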
Long-window backtest
17,875 pairs across 25 years on five long-history subjects. Sources: V-Dem v16, World Bank WDI.
Reference scale and known gaps
Brier is computed on the standard scale (Brier 1950). External benchmarks like ForecastBench (Karger et al. 2024, arXiv:2409.19839) and Tetlock GJP score multi-month forecasting on questions with non-zero base-rate uncertainty. The registry above is dominantly 1-day binary macro thresholds; raw aggregate numbers are not directly comparable to those references. Skill is validated through the gate sequence below, not by direct Brier comparison.
Validation gates
A template ships to the live trade harness only after clearing, in order: (a) Brier Skill Score vs persistence baseline above +0.10 over a backtest of n ≥ 200; (b) BSS vs market-implied probability above +0.03 over the same; (c) net-positive paper-trade P&L over 60 days. Each gate is reproducible from the public audit module (brier-audit.ts in the exponentialworld repo) and is the published reason a template is or is not in live use.
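The gate order can be sketched as a simple check. The Brier Skill Score is the standard 1 − BS_model / BS_reference, and the thresholds are the ones listed above. The `TemplateEvidence` record and its field names are hypothetical, introduced only for illustration, and do not describe the actual audit module.

```python
from dataclasses import dataclass


def brier_skill_score(brier_model: float, brier_reference: float) -> float:
    """Skill relative to a reference forecast: positive means better than reference."""
    return 1.0 - brier_model / brier_reference


@dataclass
class TemplateEvidence:
    """Hypothetical summary of what one template has shown so far."""
    n_backtest: int
    brier_model: float
    brier_persistence: float
    brier_market: float
    paper_pnl_60d: float


def gate_state(e: TemplateEvidence) -> str:
    """Apply gates (a), (b), (c) in order and report where the template stands."""
    if e.n_backtest < 200:
        return "blocked: backtest n below 200"
    if brier_skill_score(e.brier_model, e.brier_persistence) <= 0.10:
        return "blocked at gate (a): BSS vs persistence"
    if brier_skill_score(e.brier_model, e.brier_market) <= 0.03:
        return "blocked at gate (b): BSS vs market-implied"
    if e.paper_pnl_60d <= 0:
        return "blocked at gate (c): paper-trade P&L"
    return "cleared: eligible for live trade harness"


# Hypothetical templates, not registry data.
beats_persistence_only = TemplateEvidence(250, 0.040, 0.050, 0.041, 1200.0)
beats_market_too = TemplateEvidence(250, 0.040, 0.050, 0.045, 1200.0)

print(gate_state(beats_persistence_only))  # blocked at gate (b): BSS vs market-implied
print(gate_state(beats_market_too))        # cleared: eligible for live trade harness
```

The first template beats persistence comfortably (BSS 0.20) but edges the market by only about 0.024, so it stops at gate (b). That is the case the gate order is designed to catch.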
Cohort status
  • v1-live-fred-20260504 · accruing, n=5 of 30 resolved
  • v3-cai-veda-r2-20260607 · no resolutions yet
  • v2-ic4wsa-blend-20260605 · no resolutions yet
  • pre-versioned · no resolutions yet

Template-version cohorts surface separately so the post-Round-2 macro-only cohort (v3-cai-veda-r2-*) accrues to N=30 before being scored, rather than being pooled with the legacy financial-heavy v2 into a single misleading headline. The N=30 floor follows Bradley, Schwartz, and Hashino (2008).

Audit access
Full registry, calibration trend, and per-cell correction log shared with engaged clients under confidentiality.