CO High School Baseball Projection Project

Introduction

This project tries to answer a simple question every coach, player, and fan asks in the offseason: “What’s going to happen next year?”

We built a complete projection pipeline that takes historical player statistics from MaxPreps (2022-2025), applies development curves to returning players, fills roster gaps with realistic replacement-level talent, simulates an entire season’s worth of games, and produces power rankings for 37 (and growing) Colorado 5A teams.

The system was designed for Colorado 5A baseball, specifically to project Rocky Mountain High School’s 2026 season. However, the approach is portable to any high school program with sufficient historical data.

What this is: A data-driven scouting report generator. It tells you which returning juniors are poised for breakout seasons, which roster spots need reinforcements, which teams are filled with veterans, which are rebuilding, and which games on the schedule are likely wins, toss-ups, or uphill battles.

What this is not: A crystal ball. Baseball is chaotic. A 70% win probability means you lose 3 out of 10 times. The value here is in setting realistic expectations and identifying where your team’s strengths and weaknesses lie before the first pitch of spring.

What this project doesn’t do: Every team and every player has a story: “Pedro had trouble throwing strikes last season, but he worked it out in the summer and he’s untouchable now”; or, “Dustin was hurt but now he’s hitting lasers.” There has been no attempt to put a “thumb on the scale” to address these. For every story that I know, there are 10 others. So, if a player in the projected rosters has numbers you wouldn’t expect, look at last year’s performance and you’ll probably see the reason.

How This Project Developed

As the 2026 RMHS season approached, I started to wonder how good this team would really be, and whether I could get some additional information to use while running GameChanger broadcasts. There had to be a fast way to get that kind of information with AI agents, right? I also work in data (healthtech) and work with AI tools a lot.

Chatting with AI was not helpful, and trying to cajole an AI bot into generating this information in a repeatable way was not effective. However, AI tools are very good at writing code, so I worked with two AI tools (Gemini and Claude) to develop a set of Python scripts that consume raw MaxPreps information and deliver the outputs.

I also used AI to fact-check itself. I created an “adversarial review” prompt and gave it to both Claude and Gemini. I then had the two bots critique each other’s work. After several rounds of this, the results were “good enough” for a first pass. I also used these AI tools to generate complete and readable information.

The prompts and critiques are available in the documentation.

Outputs

The process produces four key deliverables:

Team Power Rankings

Every team gets an Overall Power score from 0 to 100, where 100 is the strongest team in the league. This score is simply the average of two components:

Offensive Power: How many runs can this lineup produce? We add up the Runs Created scores for the team’s top 10 hitters.
Pitching Power: How well can this staff prevent runs? We add up the Pitching Dominance scores for the team’s top 6 pitchers.

Both components are scaled so that the league’s best offense and best pitching staff each score 100. A team with an Overall Power of 50 is roughly half as strong as the top team — though as any baseball fan knows, they can still win on any given day. The rankings also show roster composition metrics that help you understand why a team is strong or weak:

Returning Players: How many real players (not generic backfill) are on the projected roster
Returning Seniors: Experienced players in their final year — often a team’s backbone
Total Varsity Years: The sum of all varsity experience on the roster — a team with 45 collective years has seen a lot more baseball than one with 20

These numbers help distinguish between a veteran-loaded squad expecting a deep playoff run and a young team that’s a year away.
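The Overall Power calculation described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the pipeline's actual code: the function names and the sample teams are hypothetical, and it assumes every team's summed score is positive so that scaling against the league best is meaningful.

```python
def scale_to_100(totals):
    """Scale raw team totals so the league-best team scores exactly 100."""
    best = max(totals.values())
    return {team: 100.0 * value / best for team, value in totals.items()}


def overall_power(rc_by_team, pitch_by_team, n_hitters=10, n_pitchers=6):
    """Average of scaled offense (top hitters) and scaled pitching (top pitchers)."""
    # Sum each team's best individual scores (top 10 hitters, top 6 pitchers).
    offense = {t: sum(sorted(s, reverse=True)[:n_hitters]) for t, s in rc_by_team.items()}
    pitching = {t: sum(sorted(s, reverse=True)[:n_pitchers]) for t, s in pitch_by_team.items()}
    off_index = scale_to_100(offense)
    pit_index = scale_to_100(pitching)
    return {t: (off_index[t] + pit_index[t]) / 2 for t in off_index}


# Hypothetical two-team league with made-up player scores.
rc_scores = {"Team A": [20, 15, 10], "Team B": [10, 10, 2.5]}
pitch_scores = {"Team A": [40, 40], "Team B": [30, 10]}
power = overall_power(rc_scores, pitch_scores)
print(power)
```

In this toy league Team A has the best offense and the best staff, so it scores 100; Team B totals half as much on both sides and lands at 50.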

Note: Unfortunately, I had to exclude Valor Christian. Their data in MaxPreps is fatally flawed and I would not be able to integrate it without a lot of manual work.

Regular Season Simulation

For each game on the schedule, we ask: “If these two teams played 1,000 times, how often would each team win?” That’s the core idea behind Monte Carlo simulation — instead of trying to calculate a precise answer, we let the computer play out the game over and over with randomized scoring (using that “clumpy innings” model described below) and see what happens. After 1,000 simulated games, we can say things like “Rocky Mountain wins 623 of them” — which becomes a 62.3% win probability. We also track the average score across all simulations to give you a projected final like “6.2-5.1.” To make this actionable, we translate win probabilities into confidence labels:

Lock (W/L): Greater than 90% or less than 10% — barring disaster, you know who wins
Solid (W/L): 65-90% or 10-35% — clear favorite, but upsets happen
Toss-up: 35-65% — bring your lucky socks
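As a sketch, the label thresholds above translate into a small lookup function. The function name is hypothetical, and because the published ranges share their endpoints (65% appears in both Solid and Toss-up), this version assumes that exactly 65% or 35% counts as Solid.

```python
def confidence_label(win_prob):
    """Map a win probability (0.0 to 1.0) to a confidence label."""
    if win_prob > 0.90:
        return "Lock (W)"
    if win_prob < 0.10:
        return "Lock (L)"
    # Assumption: the shared boundaries (65% and 35%) fall on the Solid side.
    if win_prob >= 0.65:
        return "Solid (W)"
    if win_prob <= 0.35:
        return "Solid (L)"
    return "Toss-up"


# 623 wins out of 1,000 simulated games is a 62.3% win probability.
print(confidence_label(623 / 1000))
```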

We also simulate the entire season 1,000 times to generate floor and ceiling win totals. A line like “Average Record: 14-9, Ceiling: 18 wins, Floor: 10 wins” tells you the range of realistic outcomes, not just the single most likely one.

Note: This approach uses team-level Power Rankings to set each team’s offensive and pitching strength. It doesn’t model individual pitcher matchups or in-game strategy. Think of it as “what happens when Team A’s overall roster faces Team B’s overall roster” rather than “what happens when their ace faces our cleanup hitter.”
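To show how a floor and ceiling can fall out of repeated season simulations, here is a simplified sketch. It starts from per-game win probabilities rather than simulated scores, and it assumes the floor and ceiling are the 5th and 95th percentile win totals; the project may define them differently. The function name and the 23-game schedule are hypothetical.

```python
import random


def simulate_season(win_probs, n_sims=1000, seed=0):
    """Simulate a season n_sims times from per-game win probabilities."""
    rng = random.Random(seed)
    # Each simulated season: one weighted coin flip per game, summed into a win total.
    totals = sorted(
        sum(rng.random() < p for p in win_probs) for _ in range(n_sims)
    )
    return {
        "average_wins": sum(totals) / n_sims,
        # Assumption: floor/ceiling are the 5th/95th percentile win totals.
        "floor": totals[int(0.05 * (n_sims - 1))],
        "ceiling": totals[int(0.95 * (n_sims - 1))],
    }


# Hypothetical 23-game schedule where every game is a 60% win probability.
summary = simulate_season([0.60] * 23)
print(summary)
```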

Projected Rosters

For every team in the dataset, we build a complete 2026 roster by asking: “Who’s coming back, and how much better will they be?”

Returning Players: Any non-senior from 2025 gets carried forward with projected stats. We apply development multipliers based on historical patterns. For example, if the typical junior-to-senior transition brings a 20% increase in strikeouts, we apply that to each returning junior pitcher. Elite programs (Cherry Creek, Rocky Mountain, etc.) get their own multipliers since their year-round development produces different growth curves than standard programs.

Player Rankings: Each player is ranked both within their team and across the entire league for batting (using Runs Created) and pitching (using Pitching Dominance Score). This lets you quickly spot who the #1 hitter is on each team and how they stack up against the rest of the league.

Filling the Gaps: Seniors graduate. Some kids quit. Rosters need bodies. When a team doesn’t have enough projected players to field a full squad, we fill the empty spots with “generic” players — statistical profiles based on what a typical JV call-up (sophomore) looks like. Elite programs get better generic players (50th percentile) than standard programs (30th percentile), reflecting the deeper talent pools at powerhouse schools.

The result: a realistic roster for every team, not just the ones with perfect data.

Historical Player Statistics

The foundation of everything else — four years (2022-2025) of player statistics across 37 Colorado 5A teams, all pulled from MaxPreps and combined into a single file.

This is the raw material that makes projections possible. Every batting average, every ERA, every player’s class year — scraped from MaxPreps’ print-friendly stat pages and cleaned up for analysis. When we calculate development multipliers (“how much do sophomores typically improve as juniors?”), this is the dataset we’re learning from. When we build generic player profiles, we’re pulling from real sophomores in this history.

The dataset keeps growing as more teams are added. The more history we have, the more reliable the multipliers become — especially for rarer transitions like freshman-to-sophomore, where sample sizes matter.

The documentation links to a Data Dictionary that explains the outputs of all the models in detail, including what each column means.

Methodology

The Sabermetrics Foundation

This system stands on the shoulders of established baseball analytics research. Every calculation is traceable to published methodology.

Player Valuation — Runs Created (RC)

We measure offensive contribution using Bill James’ Runs Created formula, first published in the 1979 Baseball Abstract:

RC = (H + BB) × TB / (AB + BB)

Where TB (Total Bases) = 1B + (2 × 2B) + (3 × 3B) + (4 × HR). This formula correlates with actual team runs scored at r > 0.95 across MLB seasons. While more sophisticated versions exist (wOBA, wRC+), the basic formula is appropriate for high school data where sample sizes are smaller and advanced inputs are often unavailable.
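A direct translation of the basic formula into Python might look like the sketch below. The function names and the sample stat line are hypothetical; singles are derived from hits because stat sheets typically list H, 2B, 3B, and HR rather than 1B.

```python
def total_bases(singles, doubles, triples, home_runs):
    """TB = 1B + (2 x 2B) + (3 x 3B) + (4 x HR)."""
    return singles + 2 * doubles + 3 * triples + 4 * home_runs


def runs_created(hits, walks, at_bats, doubles, triples, home_runs):
    """Basic Runs Created: (H + BB) x TB / (AB + BB)."""
    singles = hits - doubles - triples - home_runs
    tb = total_bases(singles, doubles, triples, home_runs)
    opportunities = at_bats + walks
    if opportunities == 0:
        return 0.0  # a player with no plate appearances creates no runs
    return (hits + walks) * tb / opportunities


# Made-up season line: 30 H, 10 BB, 90 AB, 6 2B, 1 3B, 2 HR.
# Singles = 21, TB = 21 + 12 + 3 + 8 = 44, RC = 40 x 44 / 100 = 17.6
print(runs_created(hits=30, walks=10, at_bats=90, doubles=6, triples=1, home_runs=2))
```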

Player Valuation — Pitching Dominance Score

Pitchers are evaluated using a simplified adaptation of Bill James’ Game Score metric:

Pitching_Score = (IP × 1.5) + (K × 1.0) - (BB × 1.0) - (ER × 2.0)

This weights innings (durability) and strikeouts (dominance) positively while penalizing walks and earned runs. It’s less granular than FIP (Fielding Independent Pitching) but appropriate for high school where defensive quality varies significantly and batted ball data is unavailable.
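The score is simple enough to compute by hand, but a small function makes the weights explicit. This sketch uses a hypothetical function name and a made-up stat line, and it expects innings pitched as a true decimal (10⅓ innings as 10.333, not 10.1).

```python
def pitching_score(innings, strikeouts, walks, earned_runs):
    """Pitching Dominance Score: reward innings and strikeouts, penalize walks and earned runs."""
    return innings * 1.5 + strikeouts * 1.0 - walks * 1.0 - earned_runs * 2.0


# Made-up season: 40 IP, 50 K, 15 BB, 12 ER.
# 60 + 50 - 15 - 24 = 71
print(pitching_score(innings=40.0, strikeouts=50, walks=15, earned_runs=12))
```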

Note on Innings Pitched: Baseball uses a special notation where .1 means ⅓ of an inning and .2 means ⅔ of an inning (representing outs). For example, 10.1 IP means 10⅓ innings, not 10.1 decimal innings. The system automatically converts this notation to proper decimals for calculations (10.1 → 10.333) and converts back to baseball notation for output display.
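The conversion in both directions can be sketched as follows. The function names are hypothetical and this is not necessarily how utils.py implements it; the idea is to treat the digit after the decimal point as a count of outs.

```python
def ip_to_decimal(ip_notation):
    """Convert baseball notation (10.1 means 10 1/3 innings) to decimal innings."""
    whole, _, frac = str(ip_notation).partition(".")
    if frac not in ("", "0", "1", "2"):
        raise ValueError(f"Not valid innings-pitched notation: {ip_notation!r}")
    outs = int(frac) if frac else 0
    return int(whole) + outs / 3


def decimal_to_ip(innings):
    """Convert decimal innings back to baseball notation for display."""
    total_outs = round(innings * 3)
    whole, outs = divmod(total_outs, 3)
    return f"{whole}.{outs}"


print(round(ip_to_decimal("10.1"), 3))
# Adding in decimal form avoids the classic mistake of 10.1 + 5.2 = 15.3.
print(decimal_to_ip(ip_to_decimal("10.1") + ip_to_decimal("5.2")))
```

Summing in decimal form and converting back only for display is what keeps team totals correct: 10⅓ plus 5⅔ innings is 16 innings, not "15.3".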

Development Curves — Aging Methodology

Player projections use year-over-year multipliers derived from historical cohort analysis. We calculate the median ratio of Year N+1 stats to Year N stats for each class transition (Freshman→Sophomore, etc.).

The multiplier hierarchy prioritizes:

Class (Age-Based): Largest sample sizes, most representative population
Class + Tenure (Specific): Higher specificity when available
Tenure Only: Fallback, but prone to survivor bias

This ordering follows guidance from Tango, Lichtman, and Dolphin (The Book, 2006) on aging curve methodology.
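The median-ratio calculation can be sketched like this. It is a simplified, standalone illustration: the record layout (transition label, Year N value, Year N+1 value), the function name, and the sample numbers are all assumptions, and the real workflow operates on the aggregated stats file.

```python
from collections import defaultdict
from statistics import median


def median_multipliers(records, min_n=1):
    """Median of (Year N+1 / Year N) ratios for each class transition."""
    ratios = defaultdict(list)
    for transition, year_n, year_n1 in records:
        if year_n > 0:  # a zero baseline has no meaningful ratio, so skip it
            ratios[transition].append(year_n1 / year_n)
    # Drop transitions with too few players to trust the median.
    return {t: median(r) for t, r in ratios.items() if len(r) >= min_n}


# Made-up strikeout totals for players who appear in consecutive seasons.
strikeout_records = [
    ("Junior->Senior", 20, 30),
    ("Junior->Senior", 10, 12),
    ("Junior->Senior", 40, 40),
    ("Sophomore->Junior", 10, 15),
]
multipliers = median_multipliers(strikeout_records)
print(multipliers)
```

Using the median rather than the mean keeps one player's breakout season (or collapse) from dragging the multiplier for the whole class.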

We also found that Elite programs had different development curves than other programs, so we apply separate multipliers to each group. These can be seen in the Development Multipliers output. Examples:

================================================================================
KEY FINDINGS SUMMARY (Dynamically Generated)
================================================================================

1. JUNIOR → SENIOR PITCHING (N: 108 elite, 501 standard)
   - Strikeouts: Elite 1.450 vs Standard 0.975 (Elite gains 48% more strikeouts)
   - Earned Runs: Elite 0.750 vs Standard 0.899 (Elite reduces runs 15% more)
   - Walks: Elite 0.800 vs Standard 0.944 (Elite cuts walks 14% more)

2. SOPHOMORE → JUNIOR PITCHING (N: 59 elite, 299 standard)
   - Innings Pitched: Elite 1.143 vs Standard 1.362 (Standard grows 22% more)
   - Strikeouts: Elite 1.273 vs Standard 1.474 (Standard grows 20% more)

3. BATTING DEVELOPMENT
   - Soph→Jr: Hits (1.348 vs 1.485, Elite -14%), OPS (1.026 vs 1.125, Elite -10%)
   - Jr→Sr: Hits (1.231 vs 1.174, similar), OPS (1.086 vs 1.067, similar)

4. SAMPLE SIZES
   - Elite program transitions: 185
   - Standard program transitions: 957
   - Total transitions analyzed: 1142
   - Minimum N = 14 (marginally robust, interpret with caution)

5. RECOMMENDATION
   - Elite programs show meaningful pitching development advantages
   - USE elite_development_multipliers.csv for: Broomfield, Cherry Creek, Mountain Vista, Cherokee Trail, Regis Jesuit, Rocky Mountain
   - USE standard_development_multipliers.csv for all other teams

Survivor Bias Adjustment: All projected statistics are reduced by 5% to account for survivor bias in the historical multipliers. Development curves are calculated only from players who returned the following year, which excludes players who quit or were cut. This creates an upward bias that the 5% reduction partially corrects.
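Put together, a single projected stat is last year's value times the development multiplier times the survivor-bias factor. The sketch below uses a hypothetical function name and a made-up pitcher, with the elite Junior→Senior strikeout multiplier (1.450) taken from the findings above.

```python
SURVIVOR_BIAS_FACTOR = 0.95  # the 5% reduction applied to all projected statistics


def project_stat(last_year_value, multiplier, bias_factor=SURVIVOR_BIAS_FACTOR):
    """Project next year's stat: last year x development multiplier x bias factor."""
    return last_year_value * multiplier * bias_factor


# A hypothetical elite-program junior with 40 strikeouts last season:
# 40 x 1.450 x 0.95 = 55.1 projected strikeouts as a senior.
projected_k = project_stat(40, 1.450)
print(round(projected_k, 1))
```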

Replacement Level — Generic Players

Roster backfill uses percentile-based profiles rather than mean imputation. This preserves population variance and aligns with the “Replacement Level” concept in WAR calculations, which sits at roughly the 20th-30th percentile of available talent. Elite programs (identified by historical state and regional championship performance) receive higher-quality generic players than non-elite teams. The percentiles below are measured against the performance of historical sophomores across all 37 teams.

Elite teams: First generic at 50th percentile, second at 20th percentile, third at 10th percentile, additional at 10th
Standard teams: First generic at 30th percentile, second at 10th percentile, additional at 10th
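The tier rules above can be expressed as a lookup plus a percentile function. This is an illustrative sketch only: the names are hypothetical, the sample values are made up, and the linear-interpolation percentile is an assumption about how a profile value might be read off the sophomore population.

```python
GENERIC_TIERS = {"elite": [50, 20, 10], "standard": [30, 10]}
ADDITIONAL_PERCENTILE = 10  # every generic player beyond the listed tiers


def generic_percentiles(program_type, n_slots):
    """Percentile assigned to each generic player, in the order slots are filled."""
    tiers = GENERIC_TIERS[program_type]
    return [tiers[i] if i < len(tiers) else ADDITIONAL_PERCENTILE for i in range(n_slots)]


def percentile(values, pct):
    """Percentile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty list is undefined")
    position = (len(ordered) - 1) * pct / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


# Made-up hit totals for historical sophomores.
sophomore_hits = [0, 10, 20, 30, 40]
for pct in generic_percentiles("elite", 5):
    print(pct, percentile(sophomore_hits, pct))
```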

Game Simulation — How We Model “Big Innings”

If you’ve watched any baseball, you know runs don’t arrive one at a time like clockwork. They come in bunches — a team might scratch out 1 run through 6 innings, then explode for 5 in the 7th. That’s baseball.

Simple statistical models (like the Poisson distribution) assume each run is an independent event — as if every half-inning resets to zero. But that doesn’t match reality. When a team strings together hits, walks, and errors, the floodgates open. One baserunner changes the entire complexion of an inning.

To capture this, we use the Negative Binomial distribution, which has a “clumpiness dial” we can tune. We set that dial (technically called the dispersion parameter) to 1.3, which produces realistic game scores — including the occasional 12-2 blowout or 1-0 pitchers’ duel that simple models miss. When we analyzed our historical data, we found that actual scoring variance was about 13 times higher than what a “runs are independent” model would predict. The Negative Binomial handles that reality; simpler models can’t.

We run each game 1,000 times with this randomized scoring model. The result: a realistic spread of outcomes, not just endless 5-4 games.

Reference: Lindsey, G.R. “An Investigation of Strategies in Baseball.” Operations Research 11.4 (1963): 477-501.
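A standalone sketch of this kind of simulation is shown below. It is not the code in game_simulator.py. It assumes the 1.3 dispersion is the Negative Binomial "size" parameter r (so variance = mean + mean²/r), draws runs through the equivalent Gamma-Poisson mixture so that only the standard library is needed, and counts a tied score as half a win. All of those choices, and the function names, are assumptions for illustration.

```python
import math
import random


def sample_runs(mean_runs, dispersion, rng):
    """Draw one team's runs from a Negative Binomial via a Gamma-Poisson mixture."""
    # Gamma-distributed scoring rate with mean `mean_runs`; a smaller dispersion
    # spreads the rate out more, which makes scoring "clumpier".
    rate = rng.gammavariate(dispersion, mean_runs / dispersion)
    # Knuth's method for a single Poisson draw with that rate.
    threshold = math.exp(-rate)
    runs, product = 0, rng.random()
    while product > threshold:
        runs += 1
        product *= rng.random()
    return runs


def win_probability(mean_a, mean_b, n_sims=1000, dispersion=1.3, seed=0):
    """Share of simulated games won by team A."""
    rng = random.Random(seed)
    wins = 0.0
    for _ in range(n_sims):
        a = sample_runs(mean_a, dispersion, rng)
        b = sample_runs(mean_b, dispersion, rng)
        if a > b:
            wins += 1.0
        elif a == b:
            wins += 0.5  # assumption: a tied score counts as half a win
    return wins / n_sims


# Hypothetical matchup: one team expected to score 6.2 runs, the other 5.1.
print(win_probability(6.2, 5.1))
```

A real game cannot end in a tie, so a fuller version would need some rule for extra innings; the half-win shortcut just keeps the sketch short.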

Run Expectancy — Keeping Projections Realistic

Raw team strength numbers can produce silly results. If Team A has twice the offensive firepower of Team B, does that mean they’ll score 12 runs against an average pitching staff? Of course not — baseball doesn’t scale that way. A lineup stacked with .400 hitters doesn’t score twice as many runs as one with .300 hitters.

We apply a “dampening” adjustment using a square root transformation. The idea is in the spirit of Bill James’ Pythagorean Expectation formula (the same math behind the “expected win-loss record” you see on Baseball Reference), which also relates runs to outcomes through a power rather than a straight line. This compresses extreme matchups into realistic scoring ranges.

Here’s the intuition: the difference between a great offense and an average one might be 1-2 runs per game, not 6. A team with an Offense Index of 100 facing a team with a Pitching Index of 25 shouldn’t be projected to score 24 runs — but without dampening, that’s exactly what raw multiplication would give you. The square root keeps elite teams elite while preventing the math from running away from baseball reality.

Expected_Runs = League_Base × √(Off_Index) × (1 / √(Opp_Pit_Index))

Translation: Start with a typical high school game (about 6 runs per team), adjust up for good hitting, adjust down for good opposing pitching — but the square roots ensure neither adjustment gets out of hand. A team that’s “twice as good” on paper gets maybe a 40% boost, not a 100% boost.
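The formula is easy to check numerically. In this sketch the function name is hypothetical and the league base of 6 runs comes from the description above; the indexes are plugged in exactly as the formula is written, without assuming any extra normalization.

```python
import math


def expected_runs(off_index, opp_pit_index, league_base=6.0):
    """Expected_Runs = League_Base x sqrt(Off_Index) x (1 / sqrt(Opp_Pit_Index))."""
    return league_base * math.sqrt(off_index) / math.sqrt(opp_pit_index)


# The example from the text: Offense Index 100 against Pitching Index 25.
raw = 6.0 * 100 / 25                 # undampened multiplication: 24 runs
dampened = expected_runs(100, 25)    # with square roots: 12 runs
print(raw, dampened)

# "Twice as good" on paper is about a 41% boost, since sqrt(2) is roughly 1.414.
print(expected_runs(2, 1) / expected_runs(1, 1))
```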

Technical Architecture

The pipeline is structured as a sequential DAG (Directed Acyclic Graph) with clear data dependencies.

┌─────────────────────────────────────────────────────────────────────┐
│                         DATA LAYER                                   │
├─────────────────────────────────────────────────────────────────────┤
│  data/raw/{team}/{period}/*.html    ← MaxPreps scraped pages        │
│  data/processed/{team}/{period}/    ← Per-team intermediate CSVs    │
│  data/input/                        ← Schedule files                │
│  data/output/                       ← Final deliverables            │
│    ├── historical_stats/            ← aggregated_stats.csv          │
│    ├── development_multipliers/     ← development_multipliers.csv   │
│    │                                  elite_development_multipliers.csv
│    │                                  standard_development_multipliers.csv
│    ├── generic_players/             ← generic_players.csv           │
│    ├── roster_prediction/           ← 2026_roster_prediction.csv    │
│    └── team_strength/               ← team_strength_rankings.csv    │
│                                       rocky_mountain_monte_carlo.csv│
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      ETL LAYER (src/etl/)                           │
├─────────────────────────────────────────────────────────────────────┤
│  metadata.py        → Extracts team/season from HTML utag_data      │
│  stat_extraction.py → Parses player stats using STAT_SCHEMA map     │
│  class_inference.py → Imputes missing class years via backfill      │
│  class_cleansing.py → Fixes class progression errors                │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                  ANALYTICS LAYER (src/workflows/)                   │
├─────────────────────────────────────────────────────────────────────┤
│  Step 1: development_multipliers.py                                 │
│          └─ Calculates YoY ratios from historical cohorts           │
│             Generates pooled, elite, and standard multiplier files  │
│                                                                     │
│  Step 2: profile_generator.py                                       │
│          └─ Creates percentile-based generic player profiles        │
│                                                                     │
│  Step 3: roster_prediction.py                                       │
│          └─ Projects returning players + backfills rosters          │
│             Uses tiered multipliers based on program classification │
│                                                                     │
│  Step 4: team_strength_analysis.py                                  │
│          └─ Aggregates to team-level power rankings                 │
│                                                                     │
│  Step 5: game_simulator.py                                          │
│          └─ Monte Carlo simulation of season schedule               │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    MODELS LAYER (src/models/)                       │
├─────────────────────────────────────────────────────────────────────┤
│  advanced_ranking.py → RC Score + Pitching Score calculations       │
│                        Applied as Window Functions (RANK OVER)      │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    CONFIG LAYER (src/utils/)                        │
├─────────────────────────────────────────────────────────────────────┤
│  config.py → STAT_SCHEMA (column mappings), PATHS, ELITE_TEAMS      │
│  utils.py  → Identity resolution, Varsity_Year calculation,         │
│              IP notation conversion                                 │
└─────────────────────────────────────────────────────────────────────┘

Key Technical Patterns:

Concept	Python Implementation	SQL Equivalent
Filtering players	`df[df['PA'] > 10]`	`WHERE PA > 10`
Percentile ranking	`df['col'].rank(pct=True)`	`PERCENT_RANK() OVER (ORDER BY col)`
Team-level aggregation	`df.groupby('Team').agg(...)`	`GROUP BY Team`
Multiplier lookup	`df_multipliers.loc[transition]`	`JOIN multipliers ON transition`
Top-N selection	`df.nlargest(10, 'RC_Score')`	`SELECT TOP 10 ... ORDER BY RC_Score DESC`
Null handling	`df['col'].fillna(0)`	`COALESCE(col, 0)`

Pipeline Execution

Prerequisites

# Python 3.8+
pip install pandas numpy beautifulsoup4 lxml

Directory Structure

maxpreps_bb_stats/
├── data/
│   ├── raw/                    # HTML files from MaxPreps
│   │   ├── {team_name}/
│   │   │   ├── history/        # Historical seasons (2022-2024)
│   │   │   └── 2025/           # Current season
│   ├── processed/              # Per-team intermediate CSVs
│   │   └── {team_name}/{period}/
│   ├── input/                  # Schedule files
│   │   └── rocky_mountain_schedule.csv
│   └── output/                 # Final outputs (auto-generated)
│       ├── historical_stats/
│       ├── development_multipliers/
│       ├── generic_players/
│       ├── roster_prediction/
│       └── team_strength/
├── src/
│   ├── etl/
│   ├── models/
│   ├── utils/
│   └── workflows/
└── run_pipeline.py

Running the Pipeline

Full execution (ETL + Analytics):

python run_pipeline.py --period history --teams all

ETL only (skip projections):

python run_pipeline.py --period 2025 --teams all --skip-analysis

Single team:

python run_pipeline.py --period history --teams "Rocky Mountain"

Pipeline Arguments

Argument	Description	Example
`--period`	Target subfolder in raw data	`history`, `2025`
`--teams`	Teams to process (or `all`)	`"Rocky Mountain" "Fossil Ridge"`
`--skip-analysis`	Run ETL only, skip projections	Flag, no value
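For readers who want to see how arguments like these are usually declared, here is a minimal argparse sketch. It is a guess at the shape of the interface, not the contents of run_pipeline.py; in particular, making --period required and defaulting --teams to all are assumptions.

```python
import argparse


def build_parser():
    """Declare the three documented arguments (a sketch, not the project's parser)."""
    parser = argparse.ArgumentParser(description="ETL and projection pipeline (sketch)")
    parser.add_argument("--period", required=True,
                        help="Target subfolder in raw data, e.g. history or 2025")
    parser.add_argument("--teams", nargs="+", default=["all"],
                        help='One or more team names, or "all"')
    parser.add_argument("--skip-analysis", action="store_true",
                        help="Run ETL only, skip projections")
    return parser


# Equivalent to: --period history --teams "Rocky Mountain" "Fossil Ridge"
args = build_parser().parse_args(
    ["--period", "history", "--teams", "Rocky Mountain", "Fossil Ridge"]
)
print(args.period, args.teams, args.skip_analysis)
```

Quoting matters on the command line: without the quotes, "Rocky" and "Mountain" would arrive as two separate team names.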

Expected Console Output

--- PHASE 1: ETL EXECUTION (history) ---
Scanning: Rocky Mountain/history...
   -> Running Class Inference...
   -> Saved: data/processed/Rocky Mountain/history/Rocky Mountain_history_stats.csv (847 records)
...
--- Aggregation Complete ---

==================================================
STARTING ANALYTICS CHAIN
==================================================

--- Step 1: Updating Development Models ---
Found 1,247 year-over-year player transitions.
Saved multipliers to 'data/output/development_multipliers/development_multipliers.csv'

--- Step 2: Generating Replacement Profiles ---
Success. Saved 10 generic profiles.

--- Step 3: Predicting 2026 Rosters ---
[Pipeline Log] Returning players: 420
[Pipeline Log] Backfilling 193 generic player slots.
Success! Generated projected roster with 613 players.

--- Step 4: Analyzing Team Strength ---
=== 2026 PROJECTED TEAM POWER RANKINGS ===
...

--- Step 5: Game Simulator ---
Date         Opponent                       Win %    Avg Score    ...
...
SEASON PROJECTION (1000 Sims):
Average Record: 14.2 - 8.8

Data Acquisition

Obtaining Data from MaxPreps

Each high school baseball team has a page on the MaxPreps website. This information is generally accurate (particularly for varsity), as it is what CHSAA uses to generate their state rankings. These pages contain detailed season-level and player-level statistics.

However, because of the way that the MaxPreps website is structured, these statistics are difficult to scrape in an automated way. (Another project!)

To obtain this information, I followed these steps.

Start from the school’s main baseball page (e.g., https://www.maxpreps.com/co/fort-collins/rocky-mountain-lobos/baseball/):

Navigate to the appropriate season.
Click “Stats” in the upper navigation bar.
Click “Player Stats” in the lower navigation bar.
Scroll down past the second “Batting” table.
Right-click “Print” and choose to open the link in a new tab or window.
The page will open, followed by a print dialog box. Close the dialog by selecting “Cancel.”
On the stats page that is now open, select “Save As” from the page menu.
Save as “Webpage, HTML Only” (as it is called in Google Chrome) as opposed to “Complete Page”.
Save the file, named something like “rocky_22”, in the target directory (data/raw/{team_name}/{period}/). This is the file the ETL ingestion process will read.
Repeat for as many teams and years as you want.

Configuration

Adding Teams to ELITE_TEAMS

Edit src/utils/config.py to modify which programs receive the “elite” level backfill and development multipliers:

ELITE_TEAMS = [
    "Broomfield (CO)",
    "Cherry Creek (Greenwood Village, CO)",
    "Mountain Vista (Highlands Ranch, CO)",
    "Cherokee Trail (Aurora, CO)",
    "Regis Jesuit (Aurora, CO)",
    "Rocky Mountain (Fort Collins, CO)",
    # Add teams with consistent state tournament appearances
]
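One way a list like this could drive the choice of multiplier file is sketched below. The helper name is hypothetical and the real lookup in the pipeline may differ; the file names are the ones listed in the outputs. Note that the match is on the full MaxPreps-style name, including the city suffix.

```python
ELITE_TEAMS = [
    "Broomfield (CO)",
    "Cherry Creek (Greenwood Village, CO)",
    "Mountain Vista (Highlands Ranch, CO)",
    "Cherokee Trail (Aurora, CO)",
    "Regis Jesuit (Aurora, CO)",
    "Rocky Mountain (Fort Collins, CO)",
]


def multiplier_file(team_name):
    """Pick the development multiplier file for a team (hypothetical helper)."""
    if team_name in ELITE_TEAMS:
        return "elite_development_multipliers.csv"
    return "standard_development_multipliers.csv"


print(multiplier_file("Rocky Mountain (Fort Collins, CO)"))
# A bare name does not match the list entry, so it falls through to standard.
print(multiplier_file("Rocky Mountain"))
```

Because an exact string match silently falls back to the standard file, a typo in a team name would quietly change that team's projections.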

Modifying Stat Schema

The STAT_SCHEMA in config.py maps MaxPreps HTML classes to internal column names. If MaxPreps changes their markup, update the max_preps_class values:

STAT_SCHEMA = [
    {"abbreviation": "H", "max_preps_class": "hits stat dw", "stat_type": "Batting"},
    # ...
]

Limitations & Future Work

Current Limitations:

No park effects (all fields treated equally)
No strength-of-schedule adjustment in power rankings
Dispersion parameter (1.3) not calibrated to actual HS game data
No pitcher matchup modeling (uses aggregate staff strength)

Potential Enhancements:

Backtest against historical game results
Add field dimension adjustments for altitude/fence distances
Implement Bayesian regression for small-sample players
Add confidence intervals to projections (currently point estimates only)

See docs/FUTURE_OPPORTUNITIES.md for detailed implementation notes.

References

James, Bill. The Bill James Baseball Abstract (1979-1988). Self-published.
Tango, Tom, Mitchel Lichtman, and Andrew Dolphin. The Book: Playing the Percentages in Baseball. Potomac Books, 2006.
Lindsey, G.R. “An Investigation of Strategies in Baseball.” Operations Research 11.4 (1963): 477-501.
Cameron, Dave. “The Beginner’s Guide to Replacement Level.” FanGraphs, 2010.

License

This project is for educational and personal use. MaxPreps data is subject to their Terms of Service.