the ledger // linehype

jun 15, 2026 · build · totals

we built a whole second model. it doesn’t beat vegas — and that’s worth saying.

After two misses on the win model, we changed targets: instead of who wins, predict how many runs get scored, i.e. the over/under. The theory was that park, weather, and pitching have real leverage on totals and, unlike the bullpen and platoon signals our Elo had already absorbed, aren’t already baked into a team’s win-loss record.

We built it from the ground up: each team’s offense and run-prevention, opponent-adjusted, park-adjusted, then — the big one — the starting pitcher folded into the run-prevention side, since an ace on the mound suppresses a game’s total. Walk-forward across the season, graded against both the actual totals and the market’s closing line.

The starter did exactly what we hoped. Our first version was pitcher-blind and its disagreements with the line were systematically wrong — under 50%, worse the more it disagreed, because it kept screaming "over" on games two aces were quietly turning into pitchers’ duels. Adding the starter fixed that bias and pushed our over/under calls from losing to roughly break-even.

3.63 · model MAE, in runs (market: 3.46)
~51% · o/u calls (need 52.4% to beat the vig)
honest · shown as analysis, not a tip

But "roughly break-even" isn’t an edge. To beat a standard −110 total you need to be right more than 52.4% of the time (risk 110 to win 100, so break-even is 110/210), and we land between 50% and 52%. The model predicts totals nearly as well as the market, within two-tenths of a run of MAE (3.63 vs 3.46), but it does not beat the market. The honest conclusion is the one nobody selling picks will give you: the over/under market is efficient, and our public-data model, good as it is, doesn’t crack it.
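For readers who want to check the 52.4% figure, here is a minimal sketch of the break-even arithmetic for American odds. The function names are ours for illustration and are not part of the model's code.

```python
def breakeven_probability(american_odds: int) -> float:
    """Win rate needed to break even at the given American odds."""
    if american_odds < 0:
        risk, win = -american_odds, 100   # e.g. -110: risk 110 to win 100
    else:
        risk, win = 100, american_odds    # e.g. +120: risk 100 to win 120
    return risk / (risk + win)


def expected_profit_per_unit(hit_rate: float, american_odds: int) -> float:
    """Expected profit per 1 unit risked, for a given long-run hit rate."""
    payout = 100 / -american_odds if american_odds < 0 else american_odds / 100
    return hit_rate * payout - (1 - hit_rate)


print(round(breakeven_probability(-110), 4))           # 0.5238
print(round(expected_profit_per_unit(0.51, -110), 4))  # -0.0264
```

At a 51% hit rate the second number is negative: a slow leak of about 2.6 cents per unit risked, which is why "roughly break-even" still loses to the vig.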

So we’ll show our projected total next to the line as analysis — useful context, not a wager. The one lever that could still move it is weather: wind and temperature swing run-scoring more than almost anything, and we don’t pull that data yet. If we find a clean source, it goes through the same gauntlet, and the result — beat or bust — lands right here.

// methodology: multiplicative run model (team offense × opponent run-prevention ÷ league), park factors, team-relative recent-form starter adjustment; historical over/under from ESPN’s odds archive. Walk-forward, no hindsight. Drafted with AI from the model’s real backtest output and reviewed before posting.
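The multiplicative form named in the methodology note can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: the dictionary keys, the way the park factor is applied (as a single multiplier on both sides), and the example numbers are all hypothetical, and the starter adjustment is left out.

```python
def expected_runs(offense_rpg, opp_prevention_rpg, league_rpg, park_factor=1.0):
    """Team offense x opponent run-prevention / league average, scaled by park.

    offense_rpg:        runs the batting team scores per game
    opp_prevention_rpg: runs the opposing team allows per game
    league_rpg:         league-average runs per team per game
    park_factor:        assumed multiplier, 1.0 = neutral park
    """
    return offense_rpg * opp_prevention_rpg / league_rpg * park_factor


def projected_total(home, away, league_rpg, park_factor=1.0):
    """Projected game total: the sum of both sides' expected runs."""
    home_runs = expected_runs(home["offense"], away["prevention"], league_rpg, park_factor)
    away_runs = expected_runs(away["offense"], home["prevention"], league_rpg, park_factor)
    return home_runs + away_runs


# Made-up inputs, for illustration only.
home = {"offense": 4.8, "prevention": 4.1}
away = {"offense": 4.2, "prevention": 4.6}
print(round(projected_total(home, away, league_rpg=4.4), 2))
```

Dividing by the league average keeps the scale sane: two exactly league-average teams in a neutral park project to twice the league runs-per-game figure.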

jun 15, 2026 · test · platoon

platoon offense: a swing and a miss — and what two of them are telling us.

Next up: platoon offense — the idea that a lineup which mashes left-handers but fades against righties should be rated differently depending on who’s starting against them. Real effect, well documented. We expected this one to land.

It hit a data wall first. The clean measure — a team’s OPS versus lefties and righties — exists, but the source only serves full-season totals. Using those to "predict" a game from May would mean feeding the model June’s results. That’s hindsight, and we don’t do that. So we built it the honest way: each team’s runs scored in games started by a lefty versus a righty, accumulated game by game, as it stood beforehand.
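The "as it stood beforehand" accumulation can be sketched like this. The input layout (date, team, opposing starter's hand, runs scored) is an assumption made for illustration. The key point is the order of operations: read the running totals first, then add the current game.

```python
from collections import defaultdict


def as_of_platoon_splits(games):
    """Return each team's prior runs per game vs LHP/RHP starters, as of before each game.

    games: iterable of (date, team, opp_starter_hand, runs_scored),
           already sorted chronologically (hypothetical layout).
    """
    totals = defaultdict(lambda: [0, 0])  # (team, hand) -> [games, runs]
    rows = []
    for date, team, hand, runs in games:
        n, r = totals[(team, hand)]
        # Snapshot BEFORE this game is counted, so no result leaks backwards.
        rows.append((date, team, hand, n, r / n if n else None))
        totals[(team, hand)][0] += 1
        totals[(team, hand)][1] += runs
    return rows


games = [
    ("2026-04-01", "MIL", "L", 5),
    ("2026-04-02", "MIL", "R", 2),
    ("2026-04-03", "MIL", "L", 3),
    ("2026-04-04", "MIL", "L", 1),
]
for row in as_of_platoon_splits(games):
    print(row)
```

The first game against each hand has no history at all (`None`), which is the small-sample problem in miniature.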

+0.0 · accuracy pts (no move)
−0.0002 · log-loss (noise)
cut · not shipped

Nothing. The honest, as-of version of platoon is just too noisy — small samples per hand, bullpens of mixed handedness muddying the runs, opponent quality bleeding in. Whatever real signal exists, our Elo and the starter rating had already eaten it.

And here’s the part worth saying out loud. That’s two straight swings — the bullpen, now platoon — that didn’t clear the bar. When good ideas keep failing to add value, that’s not bad luck; it’s a message. We think we’re near the ceiling of predicting who wins a baseball game: even the sharpest models top out around 56–57%, because so many games are coin flips. Stacking more tweaks onto win probability is chasing decimal dust.

So we’re changing direction. The next build isn’t another win-prob feature — it’s a separate totals model (over/under on runs). That’s where park, weather, and offense actually move the needle, and where they aren’t already baked into a team’s win-loss record. New target, same rules: walk-forward, no hindsight, and whatever it says lands here.

// methodology: as-of platoon built from runs scored vs LHP/RHP starters, regressed for sample size, walk-forward on top of v2. Drafted with AI from the model’s real backtest output and reviewed before posting.
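"Regressed for sample size" can mean several things. One common form, shown here as an assumption rather than the exact method used, shrinks an observed split toward a prior mean by adding `k` pseudo-games at that prior. The value `k=30` is a placeholder, not a tuned number.

```python
def regress_to_mean(observed_mean, n_games, prior_mean, k=30.0):
    """Shrink an observed per-game average toward prior_mean.

    With n_games = 0 the result is the prior; as n_games grows,
    the observed average takes over. k is a hypothetical strength.
    """
    return (n_games * observed_mean + k * prior_mean) / (n_games + k)


# A team averaging 6.0 runs over 10 games vs lefties, prior of 4.4:
print(round(regress_to_mean(6.0, 10, 4.4), 2))  # 4.8
```

Ten games of 6.0 runs barely move the estimate off the prior, which matches the finding above: per-hand samples are too small to carry much signal.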

jun 15, 2026 · test · bullpen

we tested the bullpen. it didn’t make the cut.

After starting pitchers, the obvious next lever was the bullpen — relievers throw roughly four of every nine innings, and our model knew nothing about them. So we built it and tested it, same as always: walk-forward across the full season, no hindsight.

One trap we flagged up front: a bullpen is the same group every game, so its overall quality is already inside a team’s Elo rating — crediting it again would double-count, exactly the mistake that sank our first pitcher attempt. The only thing Elo can’t see is recent bullpen form: a pen that’s been lights-out or gassed over the last two weeks. So that’s what we measured — each pen’s last-14-days versus its own season baseline.
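A rough sketch of the "last 14 days versus its own season baseline" measurement. The choice of runs allowed per nine innings as the quality metric and the tuple layout are assumptions for illustration; the post does not say which stat was used. Innings are plain decimals here (4.5 means four and a half), not baseball's .1/.2 notation.

```python
from datetime import date, timedelta


def runs_per_nine(earned_runs, innings):
    """Runs allowed per nine innings, or None with no innings pitched."""
    return 9.0 * earned_runs / innings if innings > 0 else None


def recent_form_deviation(appearances, game_date, window_days=14):
    """Bullpen recent form minus season baseline, using only earlier games.

    appearances: list of (date, innings, earned_runs), one bullpen line per game.
    Negative result = the pen has been better than its own baseline lately.
    """
    prior = [a for a in appearances if a[0] < game_date]  # no hindsight
    cutoff = game_date - timedelta(days=window_days)
    recent = [a for a in prior if a[0] >= cutoff]
    season = runs_per_nine(sum(a[2] for a in prior), sum(a[1] for a in prior))
    last = runs_per_nine(sum(a[2] for a in recent), sum(a[1] for a in recent))
    if season is None or last is None:
        return None
    return last - season


pen = [(date(2026, 5, 1), 3.0, 3), (date(2026, 6, 10), 3.0, 0)]
print(recent_form_deviation(pen, date(2026, 6, 15)))  # -4.5
```

The toy example also shows the weakness: a single scoreless three-inning night swings the "form" number by 4.5 runs per nine.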

−0.3 · accuracy pts (it got worse)
+0.0006 · log-loss (worse calibration)
cut · not shipped

It didn’t just fail to help — every weight we tried made the model worse. The reason: two weeks of bullpen innings is a tiny, noisy sample, mostly luck and sequencing. The "recent form" we were feeding in was closer to random than to signal, so it just added static. The pen’s real quality was already in Elo, right where we suspected.

This is the part we promised you: we’re not shipping it. A change has to earn its place, and this one didn’t. The bullpen idea isn’t dead — but the version that might work is availability and fatigue (are a team’s best relievers rested or used up tonight), which needs pitch-count data we don’t yet pull, not aggregate recent stats. Next on the bench: a hitter’s platoon matchup against the day’s starter, and a separate over/under model where park and weather actually move the needle.

// methodology: bullpen line = team pitching total minus the starter’s line, per game; recent-form deviation vs season baseline, walk-forward. Drafted with AI from the model’s real backtest output and reviewed before posting.

jun 15, 2026 · model · v1 → v2

we added starting pitchers. the first try failed.

It started with a bad night. On June 14 our MLB model went 5 for 14 — 36%, worse than a coin flip. One night is noise (our picks that night were near-tossups, 50–60% leans on almost every game), but it sent us to the obvious question: the model didn’t know who was pitching. In baseball, that’s supposed to be everything. So we tried to add it.

Attempt one: rate every starter by his season FIP versus the league, and nudge the team’s rating. We backtested it walk-forward across all 1,159 games this season — predicting each game using only what was known beforehand, no hindsight. The result was almost insulting in its flatness: accuracy unchanged, prediction error down by a rounding error. Adding the supposedly-most-important factor in baseball did nothing.
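For anyone unfamiliar with walk-forward grading, here is a minimal sketch using a bare Elo model. It is not elo_v1: the K-factor, the 1500 starting rating, and the absence of home-field advantage are simplifying assumptions. What it does show is the discipline described above, where each game is predicted from pre-game ratings and only then used to update them.

```python
import math


def elo_win_prob(rating_a, rating_b):
    """Standard Elo expected score for A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def walk_forward(games, k=4.0, start=1500.0):
    """Grade an Elo model game by game, with no hindsight.

    games: list of (home, away, home_won) in chronological order,
           where home_won is a bool.
    Returns (accuracy, mean log-loss).
    """
    ratings = {}
    correct = 0
    log_loss = 0.0
    for home, away, home_won in games:
        rh = ratings.get(home, start)
        ra = ratings.get(away, start)
        p = elo_win_prob(rh, ra)           # predict first, from pre-game ratings
        y = 1.0 if home_won else 0.0
        correct += (p >= 0.5) == home_won  # straight-up pick graded
        log_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        ratings[home] = rh + k * (y - p)   # update only after grading
        ratings[away] = ra - k * (y - p)
    n = len(games)
    return correct / n, log_loss / n


acc, ll = walk_forward([("A", "B", True), ("B", "A", False), ("A", "B", True)])
print(acc, round(ll, 4))
```

A candidate feature, such as a starter adjustment added to the rating before `elo_win_prob` is called, would be judged by whether it moves these two numbers over the full season.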

The diagnosis was the interesting part. We were double-counting. A team’s Elo rating already reflects how it plays with its normal rotation: a team with a great staff is already rated highly. Crediting a starter again versus the league piled the same information on twice, and it washed out. Worse, when we cranked the weight up, accuracy got worse, which is evidence that the lazy take ("pitching is everything") is wrong once team strength is already in the model.

Attempt two fixed the logic. We rate a starter against his own team’s typical starter, not the league — so the adjustment only captures the part Elo doesn’t already know: is today’s guy better or worse than what this team usually runs out there? And we measure him by recent form (his last five starts), because a pitcher in June is not his April self.
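The team-relative idea can be sketched as follows. The function names, the simple unweighted average of per-start FIPs, and the `weight` of 20 Elo points per run of FIP are all hypothetical. The post says only that the weight was chosen by minimizing log-loss.

```python
def mean_last(values, n=5):
    """Average of the last n values, or None if there are none."""
    recent = values[-n:]
    return sum(recent) / len(recent) if recent else None


def starter_adjustment(starter_fips, team_starter_fips, weight=20.0, n=5):
    """Elo points to add to a team's rating for today's starter.

    starter_fips:      this pitcher's per-start FIP, oldest first
    team_starter_fips: per-start FIP for all of this team's starters so far
    Lower FIP is better, so (team baseline - recent form) is positive
    when today's starter is better than what the team usually runs out.
    """
    recent = mean_last(starter_fips, n)
    if recent is None or not team_starter_fips:
        return 0.0  # no information: leave the Elo rating alone
    baseline = sum(team_starter_fips) / len(team_starter_fips)
    return weight * (baseline - recent)


# An ace running a 3.00 FIP over his last five starts on a 4.00-FIP staff:
print(starter_adjustment([5.0, 3.0, 3.0, 3.0, 3.0, 3.0], [4.0, 4.0]))  # 20.0
```

Because the baseline is the team's own typical starter, an average pitcher on a great staff and an average pitcher on a poor staff get different adjustments, and a team's usual rotation nets out to roughly zero, which is what avoids the double-count.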

+0.7 · accuracy pts (55.0% → 55.7%)
−0.0013 · log-loss (better calibrated)
stable · across nearby settings = not overfit

Modest. Real. Honest. Not the leap intuition promised — because MLB is genuinely coin-flippy, and even Vegas, with every input we’ll never have, hits only ~55–58% straight up. But it improves every metric, it makes individual game predictions more sensible (an ace-vs-spot-starter mismatch now shows up), and it’s the right direction. elo_v2 is now the MLB model.

What’s next, and where we think the bigger gains hide: an opponent-offense adjustment (a great pitcher facing a great lineup is less great), bullpen quality, and park & weather. We’ll test each the same way — walk-forward, no hindsight — and whatever the backtest says, win or lose, it lands here.

// methodology: walk-forward backtest, FIP from K/BB/HR, team-relative recent-form adjustment, weight chosen to minimize log-loss. Full detail on the methodology page. This recap was drafted with AI from the model’s real backtest output and reviewed before posting.
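For reference, the standard FIP formula looks like this in code. The textbook version includes hit-by-pitches alongside walks, while the note above lists only K/BB/HR, so `hbp` defaults to zero here. The league constant changes every season. The 3.10 default is a placeholder assumption, not a figure from this model.

```python
def fip(hr, bb, k, ip, hbp=0, constant=3.10):
    """Fielding Independent Pitching.

    FIP = (13*HR + 3*(BB + HBP) - 2*K) / IP + constant
    ip is innings pitched as a decimal (6.333 rather than baseball's 6.1).
    """
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# One start: 6 innings, 1 homer, 2 walks, 6 strikeouts.
print(round(fip(hr=1, bb=2, k=6, ip=6.0), 2))  # 4.27
```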

jun 14, 2026 · launch · v1

the model goes live — and so does the scoreboard on us.

We shipped elo_v1: our own power ratings, built from chess’s Elo system and every game result this season, measured live against the betting market. Where our win probability and the market’s implied probability disagree by five percentage points or more, we flag an edge.
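A minimal sketch of how an edge flag like this could work. Whether the site strips the bookmaker's margin before comparing is not stated, so the proportional no-vig normalization below is an assumption, as are the function names.

```python
def implied_prob(american_odds):
    """Raw implied probability of a moneyline price (vig still included)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


def no_vig_probs(home_odds, away_odds):
    """Scale both sides so they sum to 1 (simple proportional de-vig)."""
    h, a = implied_prob(home_odds), implied_prob(away_odds)
    return h / (h + a), a / (h + a)


def flag_edge(model_home_prob, home_odds, away_odds, threshold=0.05):
    """Return (side, signed gap on the home side) when model and market
    differ by at least `threshold`, else None."""
    market_home, _ = no_vig_probs(home_odds, away_odds)
    diff = model_home_prob - market_home
    if abs(diff) >= threshold:
        return ("home" if diff > 0 else "away", diff)
    return None


print(flag_edge(0.56, -110, -110))  # flagged: model likes the home side
print(flag_edge(0.52, -110, -110))  # None: inside the five-point band
```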

And we shipped this page alongside it on purpose. From day one, the model’s record is public — including the bad days. Through 1,159 games, v1 graded out at 55.0% straight up, which for MLB is respectable and unspectacular, exactly as advertised. The flagged edges on June 14 went 4 of 6, including a +14% call on Milwaukee that hit — a small sample, but the kind of disagreement-with-the-market that’s the whole point.

The deal we’re making: every number here is computed from public data, the methods are written down, and when the model is wrong we’ll tell you, dig into why, and adjust. That last part is not a disclaimer. It’s the product.