
How we price

A buyer-readable overview of the pricing engine the Shadow Pilot uses. A detailed model card and audit logs are available on request after we sign a mutual NDA.

Pipeline at a glance

  1. Match data from a licensed esports stats provider, ingested hourly. Roster changes, results, scores, and tournament tier are all tracked.
  2. Bayesian rating engine (OpenSkill PlackettLuce) runs over our historical match DB (~27,000 matches across CS2, Dota 2, Call of Duty, Rocket League). Each team's posterior is a mean (μ) and an uncertainty (σ); a minimal usage sketch follows this list.
  3. Per-sport recency + tier weighting. CS2 / RL / CoD use a 90-day half-life; Dota uses uniform weighting (validated empirically: Dota team-chemistry signal is more durable than in the other sports). Tournament tier (S/A/B/C/D) multiplies the effective match weight.
  4. Roster-change σ bumps. We detect lineup changes via the stats feed and inflate σ for affected teams until the next five matches of new-roster signal accumulate (see the weighting and σ-bump sketch after this list).
  5. Bradley-Terry binomial Bo-N pricer turns the rating delta into a per-game win probability, then computes the match-level moneyline, map spread, total maps, race-to-N, and a full correct-score grid (see the pricer sketch below).
  6. Per-sport calibration curves (piecewise-linear isotonic regression) correct the rating engine's natural overconfidence in the 60-85% probability bucket. Curves are re-fit weekly on rolling 6-month windows.
  7. Fair-value output: pre-event implied probability per side, with optional book margin applied. The Shadow Pilot uses margin = 0, so we compare your offered line to a true fair value.
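
For reference, a minimal sketch of the step-2 rating update using the open-source openskill.py library's PlackettLuce model. The exact call shape depends on the library version (this follows the 5.x models API); the production engine wraps this update with the weighting described in steps 3-4.

    from openskill.models import PlackettLuce

    model = PlackettLuce()                    # rating engine from step 2
    team_a = [model.rating()]                 # default prior: mu = 25.0, sigma = 25/3
    team_b = [model.rating()]

    # Record a match result. By default the first team listed is the winner.
    team_a, team_b = model.rate([team_a, team_b])

    print(team_a[0].mu, team_a[0].sigma)      # per-team posterior mean and uncertainty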
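
Steps 3 and 4 change how much each match moves the ratings. A sketch under assumed constants: the 90-day half-life and Dota's uniform weighting come from the text above, while the tier multipliers and the σ-inflation factor are placeholders rather than the engine's tuned values.

    from dataclasses import dataclass

    @dataclass
    class TeamRating:
        mu: float       # posterior mean skill
        sigma: float    # posterior uncertainty

    HALF_LIFE_DAYS = {"cs2": 90, "rl": 90, "cod": 90, "dota2": None}   # None = uniform
    TIER_WEIGHT = {"S": 1.5, "A": 1.25, "B": 1.0, "C": 0.8, "D": 0.6}  # placeholder values
    ROSTER_SIGMA_INFLATION = 1.5   # placeholder; applied when a lineup change is flagged
    NEW_ROSTER_MATCHES = 5         # matches of new-roster signal before the bump lapses

    def match_weight(sport: str, days_ago: float, tier: str) -> float:
        """Effective weight of one historical match: recency decay x tournament tier."""
        half_life = HALF_LIFE_DAYS[sport]
        recency = 1.0 if half_life is None else 0.5 ** (days_ago / half_life)
        return recency * TIER_WEIGHT[tier]

    def bump_sigma_on_roster_change(rating: TeamRating) -> TeamRating:
        """Step 4: inflate uncertainty for a team whose lineup just changed."""
        return TeamRating(mu=rating.mu, sigma=rating.sigma * ROSTER_SIGMA_INFLATION)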
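
Step 5's series pricing follows from a single per-game win probability. A self-contained sketch: the rating-to-probability scale constant is illustrative only, and in production the calibrated probability from step 6 feeds these functions.

    from math import comb

    def game_win_prob(mu_a: float, mu_b: float, scale: float = 8.0) -> float:
        """Bradley-Terry-style per-game win probability from the rating delta.
        The scale constant is illustrative, not the engine's tuned value."""
        return 1.0 / (1.0 + 10.0 ** (-(mu_a - mu_b) / scale))

    def match_win_prob(p: float, best_of: int) -> float:
        """P(win a best-of-N series) given per-game win probability p,
        treating games as independent (first to k = N // 2 + 1 wins)."""
        k = best_of // 2 + 1
        return sum(comb(k - 1 + j, j) * p ** k * (1 - p) ** j for j in range(k))

    def correct_score_grid(p: float, best_of: int) -> dict:
        """Probability of every exact series score, e.g. {(2, 0): ..., (2, 1): ..., ...}."""
        k = best_of // 2 + 1
        grid = {}
        for j in range(k):
            grid[(k, j)] = comb(k - 1 + j, j) * p ** k * (1 - p) ** j    # we win, they take j
            grid[(j, k)] = comb(k - 1 + j, j) * (1 - p) ** k * p ** j    # they win, we take j
        return grid

Map spread, total maps, and race-to-N prices are sums over cells of this grid; with margin = 0 (step 7) the fair implied probabilities are published untouched.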

What we measure in the Shadow Report

fair_value_decimal: Decimal odds implied by our calibrated true probability.
pricing_disagreement_pp: (1 / fair_decimal − 1 / offered_decimal) × 100. Positive ⇒ your offer was longer than fair (potentially soft). Negative ⇒ your offer was shorter (potentially overpriced to the bettor).
leak_pp: Same as pricing_disagreement_pp, but computed only on rows where the bet was actually struck (not just offered) and the result favored the bettor. Real money lost vs. fair, not theoretical.
stake_weighted_*: Same as the unweighted versions, weighted by bet stake (when you provide it). Reflects actual book impact rather than line counts.
pinnacle_disagreement_pp: The same disagreement formula, but vs. Pinnacle's closing line, as a third-party check on our fair value. We surface this so you can verify our number is sane.
benchmark_coverage_pct: % of audited rows for which a Pinnacle close was available. We flag this honestly per report.
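
As a worked example of the disagreement formula (with made-up prices): a fair price of 1.80 against an offered 1.95 gives (1 / 1.80 − 1 / 1.95) × 100 ≈ +4.3pp, so the offer is longer than fair.

    def pricing_disagreement_pp(fair_decimal: float, offered_decimal: float) -> float:
        # (1 / fair - 1 / offered) x 100; positive => offered line longer than fair.
        return (1.0 / fair_decimal - 1.0 / offered_decimal) * 100.0

    print(round(pricing_disagreement_pp(1.80, 1.95), 1))   # 4.3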

Time-split discipline

We never use post-event data to evaluate a pre-event quote. The model and calibration curve used to fair-value a match must be trained only on data available before that match's start time. Backtest reports are strictly walk-forward; live-mode reports use the latest pre-event snapshot.

What we do not do

Backtest snapshot (walk-forward, last 6 months)

Walk-forward backtest replays every match in chronological order — for each match, we use the rating state before that match to predict the outcome, then update ratings after recording the prediction. No look-ahead.

Sport               Matches   Accuracy   Brier   Log loss   Verdict
Rocket League         5,743      66.3%   0.218      0.631   Moderate signal
Dota 2                7,318      65.4%   0.223      0.649   Moderate signal
Call of Duty          1,546      64.5%   0.227      0.659   Moderate signal
Counter-Strike 2     15,498      61.0%   0.233      0.660   Weak raw / strong calibrated

Lower Brier and log-loss = better. Random baseline is 0.250 / 0.693. All four sports beat random meaningfully.
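
A schematic of that replay loop, assuming hypothetical predict/update methods on the rating engine (the real engine's interface differs):

    from math import log

    def walk_forward_backtest(matches, rating_engine):
        """Score each match with the rating state as it stood before the match,
        then update the ratings. Chronological order guarantees no look-ahead."""
        brier, logloss, correct, n = 0.0, 0.0, 0, 0
        for match in sorted(matches, key=lambda m: m.start_time):
            p = rating_engine.predict(match.team_a, match.team_b)   # pre-match state only
            outcome = 1.0 if match.winner == match.team_a else 0.0
            brier += (p - outcome) ** 2
            logloss -= outcome * log(p) + (1.0 - outcome) * log(1.0 - p)
            correct += int((p >= 0.5) == (outcome == 1.0))
            n += 1
            rating_engine.update(match)   # ratings move only after the prediction is recorded
        return {"accuracy": correct / n, "brier": brier / n, "log_loss": logloss / n}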

Calibration quality (more important than raw accuracy)

Production serves calibrated probabilities, not raw rating deltas. The calibration curve corrects the rating engine's natural overconfidence at high probabilities. Deviation = how far our calibrated prediction is from observed win rate, by predicted-probability bucket.
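
A sketch of the calibration fit and the bucket-deviation check. It uses scikit-learn's IsotonicRegression, whose predictions interpolate linearly between fitted knots, matching the piecewise-linear curve described above; the rolling window and bucket edges are as stated in the text.

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def fit_calibration(raw_probs, outcomes):
        """Fit a monotone calibration curve on a rolling window of
        (raw model probability, 0/1 result) pairs."""
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(raw_probs, outcomes)
        return iso   # iso.predict(new_raw_probs) gives calibrated probabilities

    def bucket_deviation_pp(calibrated_probs, outcomes, lo, hi):
        """Gap (in pp) between mean predicted probability and observed win rate
        within one predicted-probability bucket, e.g. lo=0.65, hi=0.80."""
        p = np.asarray(calibrated_probs, dtype=float)
        y = np.asarray(outcomes, dtype=float)
        mask = (p >= lo) & (p < hi)
        if not mask.any():
            return None
        return abs(p[mask].mean() - y[mask].mean()) * 100.0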

Sport               50-65% bucket dev.   65-80% bucket dev.   80%+ bucket dev.
Rocket League       ≤1pp                 ≤2pp                 ≤4pp
Dota 2              ≤1pp                 ≤2pp                 ≤4pp
Call of Duty        ≤2pp                 ≤4pp                 ≤6pp
Counter-Strike 2    ≤2pp                 ≤4pp                 ≤8pp

A trading desk reads "calibrated within 3pp through the 70% bucket" as "this is a usable model." Above the 80% bucket all sports get noisier — which is why our Shadow Reports flag the 80%+ confidence bets honestly rather than pretending they're as well-calibrated as the mid-range.

Honest disclosure on CS2

CS2 raw signal is genuinely weaker than in the other three sports we cover: 61.0% accuracy vs. 64-66% on RL/Dota/CoD. Two structural reasons: pro CS2 has higher round-level variance (a 24-round Bo1 has more room to swing than RL's overtime decisions), and CS2 rosters change more frequently than in other esports. Our team-level Bayesian rating treats "Vitality" as the same rated entity even after a major roster swap, which is wrong.

Roadmap fix (v2, mid-2026): player-level OpenSkill ratings aggregated per current roster. We're already capturing roster snapshots and player histories; the remaining work is wiring per-match participation data and an aggregated-team-rating function into the pricer. Expected lift: +2-4pp on CS2.
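
One plausible shape for that aggregation function, purely illustrative (the actual aggregation is part of the remaining v2 work described above):

    from dataclasses import dataclass
    from math import sqrt

    @dataclass
    class PlayerRating:
        mu: float
        sigma: float

    def team_rating_from_roster(players: list) -> tuple:
        """Aggregate per-player posteriors into a team-level (mu, sigma):
        mean of the player means, combined uncertainty of the current lineup."""
        n = len(players)
        mu = sum(p.mu for p in players) / n
        sigma = sqrt(sum(p.sigma ** 2 for p in players)) / n
        return mu, sigma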

For now, the Shadow Pilot's value on CS2 doesn't depend on raw signal. The pilot detects pricing disagreements at ≥4pp — that signal is sharp regardless of whether our model is 61% or 65% accurate, because the disagreement is between your offered line and our fair value, not between our model and the actual outcome.

v2 roadmap

A detailed model card and audit logs are available on a call, under NDA. The walk-forward methodology and rating-engine source are inspectable end-to-end during a pilot.