Pipeline at a glance
- Match data from a licensed esports stats provider, ingested hourly. Roster changes, results, scores, and tournament tier are all tracked.
- Bayesian rating engine — OpenSkill PlackettLuce — runs over our historical match DB (~27,000 matches across CS2, Dota 2, Call of Duty, Rocket League). Per-team posterior is mean (μ) and uncertainty (σ).
- Per-sport recency + tier weighting. CS2 / RL / CoD use a 90-day half-life; Dota uses uniform weighting (validated empirically — Dota team chemistry signal is more durable than other sports). Tournament tier (S/A/B/C/D) multiplies the effective match weight.
- Roster-change σ bumps — we detect lineup changes via the stats feed and inflate σ for affected teams until the next 5 matches of new-roster signal accumulates.
- Bradley-Terry binomial Bo-N pricer turns the rating delta into a per-game win probability, then computes match-level moneyline, map spread, total maps, race-to-N, and full correct-score grid.
- Per-sport calibration curves (piecewise-linear isotonic regression) correct the rating engine's natural overconfidence in the 60-85% probability bucket. Curves are re-fit weekly on rolling 6-month windows.
- Fair-value output — pre-event implied probability per side, with optional book margin applied. The Shadow Pilot uses margin = 0, so we compare your offered line to a true fair value.
What we measure in the Shadow Report
| Metric | Definition |
|---|---|
| fair_value_decimal | Decimal odds implied by our calibrated true probability |
| pricing_disagreement_pp | (1 / fair_decimal − 1 / offered_decimal) × 100. Positive ⇒ your offer was longer than fair (potentially soft). Negative ⇒ your offer was shorter (potentially overpriced to the bettor). |
| leak_pp | Same as pricing_disagreement_pp but only computed on rows where the bet was actually struck (not just offered) and the result favored the bettor. Real money lost vs. fair, not theoretical. |
| stake_weighted_* | Same as the unweighted versions, weighted by bet stake (when you provide it). Reflects actual book impact rather than line counts. |
| pinnacle_disagreement_pp | Same disagreement formula but vs. Pinnacle's closing line as a third-party check on our fair value. We surface this so you can verify our number is sane. |
| benchmark_coverage_pct | % of audited rows for which a Pinnacle close was available. We flag this honestly per report. |
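The disagreement metric in the table reduces to a one-liner; this sketch uses the exact formula from the definition above (function name is ours, for illustration):

```python
def pricing_disagreement_pp(fair_decimal: float, offered_decimal: float) -> float:
    """(1 / fair - 1 / offered) * 100, in percentage points.
    Positive -> the offered line was longer than fair (potentially soft);
    negative -> shorter than fair (overpriced to the bettor)."""
    return (1.0 / fair_decimal - 1.0 / offered_decimal) * 100.0
```

For example, if our calibrated fair price is 1.80 (55.6% implied) and you offered 1.90 (52.6% implied), the disagreement is about +2.92pp — a soft line by roughly three points of probability.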
Time-split discipline
Every fit respects chronological order: calibration curves are re-fit weekly on trailing windows of past matches only, and the backtest below is strictly walk-forward — no model ever sees data from after the match it is pricing.
What we do not do
- Live in-play game-state pricing. We do not ingest per-round / per-event esports data feeds (kills, gold leads, draft state). Our /quote/live endpoint is a series-state reprice — given a current map score in a Bo3/Bo5/Bo7, we reprice the rest of the series. It is not a tick-level in-play product.
- Settlement / liability / risk-management decisions. We are a decision-support layer. You retain full control of margin, line suspension, max stake, customer limits, and final exposure.
- Player-prop pricing for every esport. Currently limited to CS2 (kills) and Dota 2 (kills/last-hits). RL and CoD player props are on the roadmap.
- Soccer, basketball, or non-esports markets. We are deliberately esports-only.
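The series-state reprice behind /quote/live can be sketched as a short recursion: condition on the current map score and assume a constant per-map win probability for the remaining maps (function name and signature are illustrative, not our API):

```python
def series_reprice(p_map: float, best_of: int, a_wins: int, b_wins: int) -> float:
    """Probability team A wins a best-of-N series from the current map
    score (a_wins, b_wins), assuming each remaining map is won by A with
    constant probability p_map. A sketch of a series-state reprice, not a
    tick-level in-play model."""
    need = best_of // 2 + 1
    if a_wins >= need:
        return 1.0
    if b_wins >= need:
        return 0.0
    # Recurse over the two outcomes of the next map.
    return (p_map * series_reprice(p_map, best_of, a_wins + 1, b_wins)
            + (1 - p_map) * series_reprice(p_map, best_of, a_wins, b_wins + 1))
```

From 0-0 this reproduces the pre-match binomial price (0.648 for a 60% per-map favorite in a Bo3); going up 1-0 in a Bo3 lifts a coin-flip team to 75%.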
Backtest snapshot (walk-forward, last 6 months)
Walk-forward backtest replays every match in chronological order — for each match, we use the rating state before that match to predict the outcome, then update ratings after recording the prediction. No look-ahead.
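The predict-before-update discipline can be shown in a few lines. This toy loop uses an Elo-style update purely as a stand-in for the OpenSkill engine (the rating math differs; the ordering guarantee is the point):

```python
def walk_forward(matches, k: float = 32.0):
    """Replay matches in chronological order. For each match, predict with
    the rating state *before* the match, record the prediction, then update.
    `matches` is a date-sorted list of (date, team_a, team_b, a_won) tuples.
    Elo-style toy update standing in for OpenSkill PlackettLuce."""
    ratings: dict[str, float] = {}
    preds = []
    for _, a, b, a_won in matches:
        ra, rb = ratings.get(a, 1500.0), ratings.get(b, 1500.0)
        p_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))  # predict first ...
        preds.append((p_a, a_won))                     # ... record, no look-ahead
        ratings[a] = ra + k * ((1.0 if a_won else 0.0) - p_a)      # then update
        ratings[b] = rb + k * ((0.0 if a_won else 1.0) - (1.0 - p_a))
    return preds, ratings
```

Because the prediction is recorded before the rating update, the accuracy/Brier/log-loss numbers in the table below are out-of-sample by construction.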
| Sport | Matches | Accuracy | Brier | Log loss | Verdict |
|---|---|---|---|---|---|
| Rocket League | 5,743 | 66.3% | 0.218 | 0.631 | Moderate signal |
| Dota 2 | 7,318 | 65.4% | 0.223 | 0.649 | Moderate signal |
| Call of Duty | 1,546 | 64.5% | 0.227 | 0.659 | Moderate signal |
| Counter-Strike 2 | 15,498 | 61.0% | 0.233 | 0.660 | Weak raw / strong calibrated |
Lower Brier and log loss are better. The random baseline is 0.250 / 0.693; all four sports beat it meaningfully.
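Both scores are standard and easy to verify. A coin-flip model (p = 0.5 on every match) scores exactly the random baseline quoted above: Brier 0.250 and log loss ln 2 ≈ 0.693:

```python
import math

def brier(preds):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in preds) / len(preds)

def log_loss(preds):
    """Mean negative log-likelihood of the outcomes (natural log)."""
    return -sum(math.log(p if y else 1.0 - p) for p, y in preds) / len(preds)
```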
Calibration quality (more important than raw accuracy)
Production serves calibrated probabilities, not raw rating deltas. The calibration curve corrects the rating engine's natural overconfidence at high probabilities. Deviation = how far our calibrated prediction is from observed win rate, by predicted-probability bucket.
| Sport | 50-65% bucket dev. | 65-80% bucket dev. | 80%+ bucket dev. |
|---|---|---|---|
| Rocket League | ≤1pp | ≤2pp | ≤4pp |
| Dota 2 | ≤1pp | ≤2pp | ≤4pp |
| Call of Duty | ≤2pp | ≤4pp | ≤6pp |
| Counter-Strike 2 | ≤2pp | ≤4pp | ≤8pp |
A trading desk reads "calibrated within 3pp through the 70% bucket" as "this is a usable model." Above the 80% bucket all sports get noisier — which is why our Shadow Reports flag the 80%+ confidence bets honestly rather than pretending they're as well-calibrated as the mid-range.
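The core of an isotonic calibration fit is the pool-adjacent-violators (PAV) algorithm: sort predictions, then merge adjacent buckets whenever observed win rates decrease. This is a simplified, piecewise-constant sketch (our production curves are piecewise-linear and per-sport; names here are illustrative):

```python
def fit_isotonic(raw_probs, outcomes):
    """Pool-adjacent-violators fit of a monotone map from raw model
    probability to observed win rate. Returns (x_lo, x_hi, rate) blocks."""
    pts = sorted(zip(raw_probs, outcomes))
    blocks = []  # each block: [sum_of_outcomes, count, x_lo, x_hi]
    for x, y in pts:
        blocks.append([float(y), 1, x, x])
        # Merge while adjacent block means violate monotonicity.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s, n, _, hi = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][3] = hi
    return [(lo, hi, s / n) for s, n, lo, hi in blocks]

def calibrate(knots, p):
    """Calibrated probability for raw p: the first block covering p."""
    for _, hi, v in knots:
        if p <= hi:
            return v
    return knots[-1][2]
```

Re-fitting weekly on a rolling window means the curve tracks drift in the rating engine's overconfidence without ever training on future outcomes.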
Honest disclosure on CS2
CS2 is our weakest raw signal: 61.0% walk-forward accuracy, the lowest of the four sports. Team-level ratings lose signal across CS2's frequent roster changes — hence the v2 fix below — and while calibration recovers usable probabilities, the raw rating delta underperforms the other titles.
Roadmap fix (v2, mid-2026): player-level OpenSkill ratings aggregated per current roster. We're already capturing roster snapshots and player histories; the remaining work is wiring per-match participation data and an aggregated-team-rating function into the pricer. Expected lift: +2-4pp on CS2.
For now, the Shadow Pilot's value on CS2 doesn't depend on raw signal. The pilot detects pricing disagreements at ≥4pp — that signal is sharp regardless of whether our model is 61% or 65% accurate, because the disagreement is between your offered line and our fair value, not between our model and the actual outcome.
v2 roadmap
- Q3 2026: player-level CS2 ratings (above)
- Q3 2026: League of Legends + Valorant ingestion (Riot's GRID licensing path under review)
- Q4 2026: SOC 2 Type 1 audit
- Q4 2026: stake-weighted backtest reporting (today the backtest is unit-weighted; with customer stake data we can compute realized $-EV against the offered line, not just hit rate)
- 2027: per-map CS2 / Valorant ratings for live between-map repricing
Detailed model-card and audit logs available on call, under NDA. Walk-forward methodology and rating engine source are inspectable end-to-end during a pilot.