Built an MLB Projection Engine from scratch

I've been lurking here for a while and figured I'd share what I've been working on. I've been building an MLB projection model that, rather than predicting the winner directly, simulates full nine-inning games from the pitch level up.

The basic idea:

Baseball decomposes into discrete events better than any other sport. One pitcher throws to one batter, and the outcome of that matchup depends on a relatively contained set of factors. So instead of building one model to predict "who wins," I built a pipeline of interconnected models that each handle a narrow piece of the game, then connected them through thousands of simulated plate appearances.

The pipeline has four stages:

  1. Player profiles — Built from 3.4M+ pitches of Statcast data (2021-2025). The key innovation is separating skill from noise using batted-ball physics rather than outcomes. A batter making consistently hard contact at good angles who's batting .220 is projected more favorably than a .280 hitter surviving on soft contact and fortunate placement. Expected metrics (xBA, xSLG, etc.) are derived from contact quality, not box-score results.
  2. H2H engine — A dedicated ML model that takes a specific batter and specific pitcher and produces a probability distribution across all plate appearance outcomes (K, BB, 1B, 2B, 3B, HR, outs). It learns the non-linear interactions between player types. A batter who struggles with high-velocity fastballs is projected differently against a power arm vs. a control artist, even if both pitchers have similar overall lines. Platoon splits are player-specific rather than applying a blanket adjustment.
  3. Environment layer — Real-time weather (temp, humidity, barometric pressure, wind speed + direction) fed into a physics model for batted-ball carry. This enters at the batted-ball level, not as a simple run multiplier. Park factors are per-outcome (a park can suppress HRs but boost doubles). Umpire zone tendencies are included when assignments are known.
  4. Monte Carlo simulation — 5,000 full nine-inning games simulated per matchup. Each sim plays out every plate appearance with the game state evolving naturally. Runners advance based on empirically calibrated probabilities (runner speed, outfield arm, hit type). ~140K individual events per game.
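
To make stage 4 concrete, here is a stripped-down sketch of a simulation loop. It is illustrative only, not the actual engine. It assumes one fixed outcome distribution for every batter (the real pipeline would feed in a per-matchup distribution from stage 2), a crude advancement rule where every runner moves up exactly as many bases as the batter, and no extra innings. The names `PA_PROBS`, `simulate_half_inning`, `simulate_game`, and `simulate_matchup`, and all the probabilities, are made up for the example.

```python
import random

# Hypothetical league-average plate appearance probabilities (illustrative only).
PA_PROBS = {"OUT": 0.46, "K": 0.22, "BB": 0.09, "1B": 0.14,
            "2B": 0.045, "3B": 0.005, "HR": 0.04}
BASES_GAINED = {"1B": 1, "2B": 2, "3B": 3, "HR": 4}


def simulate_half_inning(rng, probs=PA_PROBS):
    """Play plate appearances until three outs; return runs scored."""
    outcomes, weights = zip(*probs.items())
    outs, runs = 0, 0
    bases = [False, False, False]  # occupancy of first, second, third
    while outs < 3:
        event = rng.choices(outcomes, weights=weights)[0]
        if event in ("OUT", "K"):
            outs += 1
        elif event == "BB":
            # Walk: only forced runners advance.
            if all(bases):
                runs += 1
            elif bases[0] and bases[1]:
                bases[2] = True
            elif bases[0]:
                bases[1] = True
            bases[0] = True
        else:
            n = BASES_GAINED[event]
            new_bases = [False, False, False]
            # Simplifying assumption: each runner advances n bases.
            for i in (2, 1, 0):
                if bases[i]:
                    if i + n >= 3:
                        runs += 1
                    else:
                        new_bases[i + n] = True
            if n == 4:
                runs += 1  # the batter scores on a home run
            else:
                new_bases[n - 1] = True
            bases = new_bases
    return runs


def simulate_game(rng, innings=9):
    """Return (away_runs, home_runs). Ties are possible in this sketch."""
    away = sum(simulate_half_inning(rng) for _ in range(innings))
    home = sum(simulate_half_inning(rng) for _ in range(innings))
    return away, home


def simulate_matchup(n_sims=5000, seed=0):
    """Simulate the same matchup n_sims times with a seeded generator."""
    rng = random.Random(seed)
    return [simulate_game(rng) for _ in range(n_sims)]


if __name__ == "__main__":
    results = simulate_matchup()
    print(sum(a + h for a, h in results) / len(results))  # mean total runs
```

The point of the structure is that the base/out state carries from one plate appearance to the next, so a walk ahead of a home run is worth more than the same two events in separate innings.
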

Why simulate instead of using formulas?

A formula can predict expected runs or win probability, but it can't capture cascading dependencies. When a leadoff batter reaches base, the simulation plays out the subsequent at-bats with a runner on, where each outcome has different consequences than it would with the bases empty. Everything (win probability, run distributions, O/U lines at any number, and player props) falls out of the same simulation, so the outputs are internally consistent.
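
As a rough illustration of the "same simulation, many outputs" point, the sketch below takes a list of simulated final scores and reads the win probability, an over probability, and the mean total off that one list. The function name `summarize`, the 8.5 default line, and the choice to drop tied sims are all assumptions for the example, not details from the model itself.

```python
def summarize(scores, total_line=8.5):
    """Derive several market numbers from one set of simulated scores.

    scores: list of (away_runs, home_runs) tuples, one per simulated game.
    Ties are excluded from the win probability (an assumption of this
    sketch; a full simulation would play extra innings instead).
    """
    n = len(scores)
    home_wins = sum(h > a for a, h in scores)
    ties = sum(h == a for a, h in scores)
    decided = n - ties
    overs = sum(a + h > total_line for a, h in scores)
    return {
        "home_win_prob": home_wins / decided if decided else 0.5,
        "over_prob": overs / n,
        "mean_total": sum(a + h for a, h in scores) / n,
    }


# Tiny hand-made sample just to show the call shape.
sample = [(3, 5), (4, 2), (1, 0), (6, 7)]
print(summarize(sample))
```

Because every number comes from the same simulated games, a change to one input (say, a stronger wind blowing out) moves the moneyline, the total, and the props together rather than each being adjusted by a separate formula.
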

For thin samples (rookies, callups, early season):

Bayesian regression blends each player's numbers toward population baselines, with the weight on the player's own data growing with sample size. The crossover point, where individual signal overtakes the prior, is around 100 PA. This prevents wild early-season projections while still allowing breakout players to emerge as data accumulates.
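
One common way to get this behavior is simple shrinkage, where the prior acts like a fixed number of extra plate appearances at the league rate. The sketch below assumes that form, with `prior_pa=100` chosen so the player's own data gets exactly half the weight at 100 PA. The function names and the single shared prior strength are assumptions; the actual model may use a different prior per stat.

```python
def data_weight(pa, prior_pa=100):
    """Share of the estimate that comes from the player's own sample."""
    return pa / (pa + prior_pa)


def shrunk_rate(successes, pa, prior_rate, prior_pa=100):
    """Blend an observed rate toward a population baseline.

    The prior behaves like prior_pa plate appearances at prior_rate,
    so the blend equals data_weight * observed + (1 - data_weight) * prior.
    """
    return (successes + prior_rate * prior_pa) / (pa + prior_pa)


# A callup with 12 hits in 30 PA against a .250 baseline.
print(round(shrunk_rate(12, 30, 0.250), 3))
print(data_weight(100))
```

With these settings, a .400 start over 30 PA is pulled down to roughly .285, and the same .400 pace would have to hold for a few hundred PA before the projection gets close to it.
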

Some honest limitations I'm still working on:
  • Bullpen modeling is aggregate rather than pitcher-by-pitcher for relief innings (~35-40% of total innings)
  • Individual defensive positioning isn't modeled yet
  • Baseball's irreducible randomness means even a perfect model would be "wrong" roughly 1 in 3 games

Here are my results from testing it on the 2025 season:
  • Overall winner accuracy: 64.5% (2,416 games)
  • High-confidence picks (≥60% win prob): 71.7% (1,267 games)
  • Brier score: 0.22 (lower is better — measures probabilistic accuracy)
  • Run total bias: −0.19 (nearly zero; slight under-projection)
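
For anyone who wants to compute the same two metrics on their own picks, here is a minimal sketch. It assumes the standard definitions: Brier score as the mean squared error between the forecast probability and the 0/1 result, and bias as the mean of projected minus actual totals, so a negative value means under-projection. The function names are just for illustration.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 results."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def mean_bias(projected, actual):
    """Average of projected minus actual; negative means under-projection."""
    return sum(p - a for p, a in zip(projected, actual)) / len(projected)


# Forecast probabilities for the picked side, and whether that side won.
probs = [0.70, 0.60, 0.55]
won = [1, 0, 1]
print(brier_score(probs, won))

# Projected vs. actual run totals.
print(mean_bias([8.0, 9.0], [9.0, 8.5]))
```

For context, a forecaster who says 50% on every game gets a Brier score of exactly 0.25, so that is the number to beat.
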

Happy to answer any questions about the methodology. Still improving this thing every week and hoping to have another big season.

A couple of questions:
  • For those who've built projection systems, how are you handling bullpen modeling? That's my biggest gap right now.
  • Is anyone else using batted-ball physics for expected stats rather than Statcast's public xBA/xSLG?
 
