Pipeline at a glance
- Match data from a licensed esports stats provider, ingested hourly. Roster changes, results, scores, and tournament tier are all tracked.
- Bayesian rating engine — OpenSkill PlackettLuce — runs over our historical match DB (~27,000 matches across CS2, Dota 2, Call of Duty, Rocket League). Per-team posterior is mean (μ) and uncertainty (σ).
- Per-sport recency + tier weighting. CS2 / RL / CoD use a 90-day half-life; Dota uses uniform weighting (validated empirically — Dota team chemistry signal is more durable than other sports). Tournament tier (S/A/B/C/D) multiplies the effective match weight.
- Roster-change σ bumps — we detect lineup changes via the stats feed and inflate σ for affected teams until the next 5 matches of new-roster signal accumulates.
- Bradley-Terry binomial Bo-N pricer turns the rating delta into a per-game win probability, then computes match-level moneyline, map spread, total maps, race-to-N, and full correct-score grid.
- Per-sport calibration curves (piecewise-linear isotonic regression) correct the rating engine's natural overconfidence in the 60-85% probability bucket. Curves are re-fit weekly on rolling 6-month windows.
- Fair-value output — pre-event implied probability per side, with optional book margin applied. The Shadow Pilot uses margin = 0, so we compare your offered line to a true fair value.
What we measure in the Shadow Report
| Metric | Definition |
|---|---|
| fair_value_decimal | Decimal odds implied by our calibrated true probability |
| pricing_disagreement_pp | (1 / fair_decimal − 1 / offered_decimal) × 100. Positive ⇒ your offer was longer than fair (potentially soft). Negative ⇒ your offer was shorter (potentially overpriced to the bettor). |
| leak_pp | Same as pricing_disagreement_pp but only computed on rows where the bet was actually struck (not just offered) and the result favored the bettor. Real money lost vs. fair, not theoretical. |
| stake_weighted_* | Same as the unweighted versions, weighted by bet stake (when you provide it). Reflects actual book impact rather than line counts. |
| pinnacle_disagreement_pp | Same disagreement formula but vs. Pinnacle's closing line as a third-party check on our fair value. We surface this so you can verify our number is sane. |
| benchmark_coverage_pct | % of audited rows for which a Pinnacle close was available. We flag this honestly per report. |
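The disagreement metric in the table reduces to a one-liner; this sketch uses the exact formula from the definition above (function name is ours, for illustration):

```python
def pricing_disagreement_pp(fair_decimal: float, offered_decimal: float) -> float:
    """(1 / fair - 1 / offered) * 100, in percentage points.
    Positive -> the offered line was longer than fair (potentially soft);
    negative -> shorter than fair (overpriced to the bettor)."""
    return (1.0 / fair_decimal - 1.0 / offered_decimal) * 100.0
```

For example, if our calibrated fair price is 1.80 (55.6% implied) and you offered 1.90 (52.6% implied), the disagreement is about +2.92pp — a soft line by roughly three points of probability.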
Time-split discipline
Every fit respects chronological order: calibration curves are re-fit weekly on trailing windows of past matches only, and the backtest below is strictly walk-forward — no model ever sees data from after the match it is pricing.
What we do not do
- Live in-play game-state pricing. We do not ingest per-round / per-event esports data feeds (kills, gold leads, draft state). Our /quote/live endpoint is a series-state reprice — given a current map score in a Bo3/Bo5/Bo7, we reprice the rest of the series. It is not a tick-level in-play product.
- Settlement / liability / risk-management decisions. We are a decision-support layer. You retain full control of margin, line suspension, max stake, customer limits, and final exposure.
- Player-prop pricing for every esport. Currently limited to CS2 (kills) and Dota 2 (kills/last-hits). RL and CoD player props are on the roadmap.
- Soccer, basketball, or non-esports markets. We are deliberately esports-only.
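The series-state reprice behind /quote/live can be sketched as a short recursion: condition on the current map score and assume a constant per-map win probability for the remaining maps (function name and signature are illustrative, not our API):

```python
def series_reprice(p_map: float, best_of: int, a_wins: int, b_wins: int) -> float:
    """Probability team A wins a best-of-N series from the current map
    score (a_wins, b_wins), assuming each remaining map is won by A with
    constant probability p_map. A sketch of a series-state reprice, not a
    tick-level in-play model."""
    need = best_of // 2 + 1
    if a_wins >= need:
        return 1.0
    if b_wins >= need:
        return 0.0
    # Recurse over the two outcomes of the next map.
    return (p_map * series_reprice(p_map, best_of, a_wins + 1, b_wins)
            + (1 - p_map) * series_reprice(p_map, best_of, a_wins, b_wins + 1))
```

From 0-0 this reproduces the pre-match binomial price (0.648 for a 60% per-map favorite in a Bo3); going up 1-0 in a Bo3 lifts a coin-flip team to 75%.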
Backtest snapshot (walk-forward, last 6 months)
Walk-forward backtest replays every match in chronological order — for each match, we use the rating state before that match to predict the outcome, then update ratings after recording the prediction. No look-ahead.
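The predict-before-update discipline can be shown in a few lines. This toy loop uses an Elo-style update purely as a stand-in for the OpenSkill engine (the rating math differs; the ordering guarantee is the point):

```python
def walk_forward(matches, k: float = 32.0):
    """Replay matches in chronological order. For each match, predict with
    the rating state *before* the match, record the prediction, then update.
    `matches` is a date-sorted list of (date, team_a, team_b, a_won) tuples.
    Elo-style toy update standing in for OpenSkill PlackettLuce."""
    ratings: dict[str, float] = {}
    preds = []
    for _, a, b, a_won in matches:
        ra, rb = ratings.get(a, 1500.0), ratings.get(b, 1500.0)
        p_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))  # predict first ...
        preds.append((p_a, a_won))                     # ... record, no look-ahead
        ratings[a] = ra + k * ((1.0 if a_won else 0.0) - p_a)      # then update
        ratings[b] = rb + k * ((0.0 if a_won else 1.0) - (1.0 - p_a))
    return preds, ratings
```

Because the prediction is recorded before the rating update, the accuracy/Brier/log-loss numbers in the table below are out-of-sample by construction.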
| Sport | Matches | Accuracy | Brier | Log loss | Verdict |
|---|---|---|---|---|---|
| Rocket League | 5,743 | 66.3% | 0.218 | 0.631 | Moderate signal |
| Dota 2 | 7,318 | 65.4% | 0.223 | 0.649 | Moderate signal |
| Call of Duty | 1,546 | 64.5% | 0.227 | 0.659 | Moderate signal |
| Counter-Strike 2 | 15,498 | 61.0% | 0.233 | 0.660 | Weak raw / strong calibrated |
Lower Brier and log loss are better. The random baseline is 0.250 / 0.693; all four sports beat it meaningfully.
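Both scores are standard and easy to verify. A coin-flip model (p = 0.5 on every match) scores exactly the random baseline quoted above: Brier 0.250 and log loss ln 2 ≈ 0.693:

```python
import math

def brier(preds):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in preds) / len(preds)

def log_loss(preds):
    """Mean negative log-likelihood of the outcomes (natural log)."""
    return -sum(math.log(p if y else 1.0 - p) for p, y in preds) / len(preds)
```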
Calibration quality (more important than raw accuracy)
Production serves calibrated probabilities, not raw rating deltas. The calibration curve corrects the rating engine's natural overconfidence at high probabilities. Deviation = how far our calibrated prediction is from observed win rate, by predicted-probability bucket.
| Sport | 50-65% bucket dev. | 65-80% bucket dev. | 80%+ bucket dev. |
|---|---|---|---|
| Rocket League | ≤1pp | ≤2pp | ≤4pp |
| Dota 2 | ≤1pp | ≤2pp | ≤4pp |
| Call of Duty | ≤2pp | ≤4pp | ≤6pp |
| Counter-Strike 2 | ≤2pp | ≤4pp | ≤8pp |
A trading desk reads "calibrated within 3pp through the 70% bucket" as "this is a usable model." Above the 80% bucket all sports get noisier — which is why our Shadow Reports flag the 80%+ confidence bets honestly rather than pretending they're as well-calibrated as the mid-range.
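The core of an isotonic calibration fit is the pool-adjacent-violators (PAV) algorithm: sort predictions, then merge adjacent buckets whenever observed win rates decrease. This is a simplified, piecewise-constant sketch (our production curves are piecewise-linear and per-sport; names here are illustrative):

```python
def fit_isotonic(raw_probs, outcomes):
    """Pool-adjacent-violators fit of a monotone map from raw model
    probability to observed win rate. Returns (x_lo, x_hi, rate) blocks."""
    pts = sorted(zip(raw_probs, outcomes))
    blocks = []  # each block: [sum_of_outcomes, count, x_lo, x_hi]
    for x, y in pts:
        blocks.append([float(y), 1, x, x])
        # Merge while adjacent block means violate monotonicity.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s, n, _, hi = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][3] = hi
    return [(lo, hi, s / n) for s, n, lo, hi in blocks]

def calibrate(knots, p):
    """Calibrated probability for raw p: the first block covering p."""
    for _, hi, v in knots:
        if p <= hi:
            return v
    return knots[-1][2]
```

Re-fitting weekly on a rolling window means the curve tracks drift in the rating engine's overconfidence without ever training on future outcomes.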
Honest disclosure on CS2
CS2 is our weakest raw signal: 61.0% walk-forward accuracy, the lowest of the four sports. Team-level ratings lose signal across CS2's frequent roster changes — hence the v2 fix below — and while calibration recovers usable probabilities, the raw rating delta underperforms the other titles.
Roadmap fix (v2, mid-2026): player-level OpenSkill ratings aggregated per current roster. We're already capturing roster snapshots and player histories; the remaining work is wiring per-match participation data and an aggregated-team-rating function into the pricer. Expected lift: +2-4pp on CS2.
For now, the Shadow Pilot's value on CS2 doesn't depend on raw signal. The pilot detects pricing disagreements at ≥4pp — that signal is sharp regardless of whether our model is 61% or 65% accurate, because the disagreement is between your offered line and our fair value, not between our model and the actual outcome.
v2 roadmap
- Q3 2026: player-level CS2 ratings (above)
- Q3 2026: League of Legends + Valorant ingestion (Riot's GRID licensing path under review)
- Q4 2026: SOC 2 Type 1 audit
- Q4 2026: stake-weighted backtest reporting (today the backtest is unit-weighted; with customer stake data we can compute realized $-EV against the offered line, not just hit rate)
- 2027: per-map CS2 / Valorant ratings for live between-map repricing
Detailed model-card and audit logs available on call, under NDA. Walk-forward methodology and rating engine source are inspectable end-to-end during a pilot.