Data Dictionary

Key Outputs

Team Power Rankings

Purpose: Team-level power rankings that aggregate individual player performance into composite indices. They show the relative strength of teams and summarize each team's returning players and experience level. The top-ranked team in Offense or Pitching is always assigned 100; read any lower number as that team's offense or pitching expressed as a percentage of the top-ranked team's.

Location: data/output/team_strength/team_strength_rankings.csv

Column | Data Type | Description | Calculation
Team | String | Full team name with location | Direct from roster
Total_Power_Index | Float | Composite team strength (0-100 scale) | (Offense_Index + Pitching_Index) / 2
Offense_Index | Float | Relative offensive strength (0-100) | (Projected_Runs / max_Projected_Runs) × 100
Pitching_Index | Float | Relative pitching strength (0-100) | (Pitching_dominance / max_Pitching_dominance) × 100
Projected_Runs | Float | Sum of RC_Score for top 10 batters | SUM(TOP 10 RC_Score)
Pitching_dominance | Float | Sum of Pitching_Score for top 6 pitchers | SUM(TOP 6 Pitching_Score)
Batters_Count | Integer | Number of qualified batters (RC > 0.1) in top 10 | Count
Pitchers_Count | Integer | Number of qualified pitchers (Score > 0.1) in top 6 | Count
Returning_Players | Integer | Real players (not generic backfill) | COUNT WHERE Projection_Method NOT LIKE '%Generic%'
Returning_Seniors | Integer | Returning players projected as Seniors | Count
Returning_Juniors | Integer | Returning players projected as Juniors | Count
Returning_Sophs | Integer | Returning players projected as Sophomores | Count
Total_Varsity_Years | Integer | Sum of Varsity_Year for all returning players | SUM(Varsity_Year) WHERE Is_Returning = True
Avg_Varsity_Years | Float | Average experience level | AVG(Varsity_Year) WHERE Is_Returning = True
Top_Hitter | String | Name of team’s best batter | Player with highest RC_Score
Top_Hitter_RC | Float | RC_Score of top hitter | Max RC_Score for team
Ace_Pitcher | String | Name of team’s best pitcher | Player with highest Pitching_Score
Ace_Score | Float | Pitching_Score of ace | Max Pitching_Score for team
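
The index calculations in the table can be sketched in a few lines of plain Python. This is an illustration of the documented formulas only, not the project's actual code: the function names and the input layout (a dict of per-team score lists) are assumptions, and the sketch assumes the league maximums are positive.

```python
def top_n_sum(scores, n):
    """Sum the n largest scores (or all of them if there are fewer than n)."""
    return sum(sorted(scores, reverse=True)[:n])


def power_indices(teams):
    """Compute the three indices for every team.

    `teams` is a hypothetical layout: team name -> {"rc": [...], "pitching": [...]},
    holding each player's RC_Score and Pitching_Score.
    """
    projected_runs = {t: top_n_sum(d["rc"], 10) for t, d in teams.items()}
    dominance = {t: top_n_sum(d["pitching"], 6) for t, d in teams.items()}
    max_runs = max(projected_runs.values())  # assumed to be > 0
    max_dom = max(dominance.values())        # assumed to be > 0

    result = {}
    for team in teams:
        offense = projected_runs[team] / max_runs * 100
        pitching = dominance[team] / max_dom * 100
        result[team] = {
            "Offense_Index": offense,
            "Pitching_Index": pitching,
            "Total_Power_Index": (offense + pitching) / 2,
        }
    return result


example = power_indices({
    "Team A": {"rc": [20.0, 15.0, 5.0], "pitching": [40.0, 10.0]},
    "Team B": {"rc": [10.0, 10.0], "pitching": [60.0, 40.0]},
})
```

In this made-up example each team leads one index, so neither reaches a Total_Power_Index of 100: the composite is an average of two separately normalized indices.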

Index Interpretation:

  • 100.0 = League leader in that index (Total_Power_Index reaches 100.0 only when one team leads both Offense and Pitching)
  • 50.0 = Half as strong as league leader
  • <30 = Significantly weaker program

Season Game-by-Game Simulation

Purpose: Game-by-game season simulation results for Rocky Mountain’s schedule. Each row shows the win probability, projected score, a confidence label for the projection, and a short narrative describing the outcome.

Location: data/output/team_strength/rocky_mountain_monte_carlo.csv

Column | Data Type | Description
Date | String | Game date
Opponent | String | Opposing team name
Win_Pct | Float | Win probability (0.0-1.0) based on 1,000 simulations
Proj_Score | String | Average projected score (e.g., “6.2-5.1”)
Confidence | String | Game classification: “Lock (W)”, “Solid (W)”, “Toss-up”, “Solid (L)”, “Lock (L)”
Analysis | String | Narrative explanation of matchup factors
My_Off_Idx | Float | Rocky Mountain’s Offense Index
My_Pit_Idx | Float | Rocky Mountain’s Pitching Index
Opp_Off_Idx | Float | Opponent’s Offense Index
Opp_Pit_Idx | Float | Opponent’s Pitching Index

Confidence Labels:

Label | Win_Pct Range | Interpretation
Lock (W) | >90% | Very likely win
Solid (W) | 65-90% | Favorable matchup
Toss-up | 35-65% | Could go either way
Solid (L) | 10-35% | Unfavorable matchup
Lock (L) | <10% | Very likely loss
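
As a rough illustration, the label lookup could be written as below. The function name is hypothetical. The table does not say which label applies exactly at the shared 65% and 35% boundaries, so assigning those values to the "Solid" labels is an assumption of this sketch.

```python
def confidence_label(win_pct):
    """Map a Win_Pct value (0.0-1.0) to a confidence label.

    Assumption: exactly 0.65 and exactly 0.35 fall into the "Solid" bands;
    the table leaves those shared boundaries unspecified.
    """
    if win_pct > 0.90:
        return "Lock (W)"
    if win_pct >= 0.65:
        return "Solid (W)"
    if win_pct > 0.35:
        return "Toss-up"
    if win_pct >= 0.10:
        return "Solid (L)"
    return "Lock (L)"
```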

Simulation Parameters:

  • 1,000 iterations per game
  • Negative Binomial distribution with dispersion = 1.3
  • Home field advantage = 10% scoring boost
  • Average runs per game in HS baseball = 6.0
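
To make the parameters above concrete, here is a minimal, standard-library sketch of one simulated matchup. Several details are assumptions, because this page does not specify them: that "dispersion" means the variance-to-mean ratio of runs scored, that the home boost multiplies the home team's expected runs, that ties count as half a win, and that expected runs are supplied directly (how they are derived from the team indices is not covered here). All names are hypothetical.

```python
import math
import random


def draw_runs(mean, dispersion, rng):
    """Draw one negative binomial run total as a gamma-Poisson mixture.

    Assumption: `dispersion` is the variance-to-mean ratio and is > 1.
    """
    scale = dispersion - 1.0
    rate = rng.gammavariate(mean / scale, scale)  # gamma with the requested mean
    # Knuth's Poisson sampler; adequate for small rates such as runs per game.
    limit, count, product = math.exp(-rate), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_game(my_mean, opp_mean, home, n=1000, dispersion=1.3, seed=0):
    """Return (win probability, average projected score string)."""
    rng = random.Random(seed)
    boost = 1.10  # 10% scoring boost for whichever side is at home (assumed)
    my_mu = my_mean * (boost if home else 1.0)
    opp_mu = opp_mean * (1.0 if home else boost)
    wins, my_total, opp_total = 0.0, 0, 0
    for _ in range(n):
        mine = draw_runs(my_mu, dispersion, rng)
        theirs = draw_runs(opp_mu, dispersion, rng)
        my_total += mine
        opp_total += theirs
        if mine > theirs:
            wins += 1.0
        elif mine == theirs:
            wins += 0.5  # assumption: a tied simulation counts as half a win
    return wins / n, f"{my_total / n:.1f}-{opp_total / n:.1f}"


# Two league-average teams (6.0 expected runs each), played at home.
win_pct, proj_score = simulate_game(6.0, 6.0, home=True)
```

Fixing the seed makes a run reproducible; with only 1,000 iterations, Win_Pct for an evenly matched game can still move by a few percentage points from seed to seed.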

2026 Roster Projection

Purpose: Complete projected roster for every team, with individual player statistics and rankings. Non-senior varsity players are carried forward onto the 2026 roster with an improvement multiplier based on historical trends, and empty roster spots are filled with generic players that are also derived from historical data.

Location: data/output/roster_prediction/2026_roster_prediction.csv

Column | Data Type | Description | Source/Calculation
Team | String | Full team name with location (e.g., “Rocky Mountain (Fort Collins, CO)”) | ETL: metadata.py
Name | String | Player name or generic player identifier | ETL: stat_extraction.py / profile_generator.py
Season_Cleaned | Integer | Projected season year (e.g., 2026) | Calculated: base year + 1
Class_Cleaned | String | Projected grade level (Freshman, Sophomore, Junior, Senior) | Calculated: next_class_map lookup
Varsity_Year | Integer | Number of completed varsity seasons (1-4) | Calculated: cumcount() + 1 from historical data
Projection_Method | String | Method used to project stats | One of: “Class (Age-Based) - Elite”, “Class (Age-Based) - Standard”, “Class_Tenure (Specific)”, “Tenure (Experience-Based)”, “Default (1.0)”, “Backfill (Elite Step-Down)”, “Backfill (Standard Step-Down)”, “Generic Baseline”
Offensive_Rank_Team | Integer | Within-team batting rank (1 = best batter on team) | RANK() OVER (PARTITION BY Team ORDER BY RC_Score DESC)
Pitching_Rank_Team | Integer | Within-team pitching rank (1 = best pitcher on team) | RANK() OVER (PARTITION BY Team ORDER BY Pitching_Score DESC)

Batting Statistics:

Column | Data Type | Description | Unit
PA | Float | Plate Appearances | Count
AB | Float | At Bats | Count
AVG | Float | Batting Average | Ratio (0.000-1.000)
H | Float | Hits | Count
2B | Float | Doubles | Count
3B | Float | Triples | Count
HR | Float | Home Runs | Count
RBI | Float | Runs Batted In | Count
R | Float | Runs Scored | Count
SF | Float | Sacrifice Flies | Count
BB | Float | Walks (Base on Balls) | Count
K | Float | Strikeouts | Count
HBP | Float | Hit By Pitch | Count
OBP | Float | On-Base Percentage | Ratio
SLG | Float | Slugging Percentage | Ratio
OPS | Float | On-Base Plus Slugging | Ratio
SB | Float | Stolen Bases | Count

Pitching Statistics:

Column | Data Type | Description | Unit
APP | Float | Appearances | Count
IP | Float | Innings Pitched (baseball notation: X.1 = X⅓, X.2 = X⅔) | Innings
ERA | Float | Earned Run Average | Runs per 9 innings
BF | Float | Batters Faced | Count
K_P | Float | Strikeouts (Pitching) | Count
ER | Float | Earned Runs | Count
H_P | Float | Hits Against | Count
2B_P | Float | Doubles Against | Count
3B_P | Float | Triples Against | Count
HR_P | Float | Home Runs Against | Count
BB_P | Float | Walks Against | Count
BAA | Float | Batting Average Against | Ratio
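
Because IP uses baseball notation, the digit after the decimal point counts outs rather than tenths, so the value has to be converted before it is used in arithmetic. A minimal sketch of that conversion follows; the function name is hypothetical, and rejecting any fractional digit other than 0, 1, or 2 is a choice made for this example.

```python
def ip_to_decimal(ip):
    """Convert innings pitched from baseball notation to true decimal innings.

    5.1 means 5 innings plus 1 out (5 1/3); 5.2 means 5 innings plus 2 outs (5 2/3).
    """
    whole = int(ip)
    outs = round((ip - whole) * 10)  # round() absorbs floating-point noise
    if outs not in (0, 1, 2):
        raise ValueError(f"not valid baseball notation: {ip}")
    return whole + outs / 3
```

For example, `ip_to_decimal(5.1)` is roughly 5.333 and `ip_to_decimal(5.2)` is roughly 5.667.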

Fielding Statistics:

Column | Data Type | Description | Unit
FP | Float | Fielding Percentage | Ratio
TC | Float | Total Chances | Count
PO | Float | Putouts | Count
A | Float | Assists | Count
E | Float | Errors | Count
DP | Float | Double Plays | Count

Derived/Calculated Fields:

Column | Data Type | Description | Calculation
Is_Batter | Boolean | Qualifies as batter (AB ≥ 10) | df['AB'].fillna(0) >= 10
Is_Pitcher | Boolean | Qualifies as pitcher (IP ≥ 5) | df['IP'].fillna(0) >= 5
Offensive_Rank | Integer | League-wide batting rank | RANK() OVER (ORDER BY RC_Score DESC)
Pitching_Rank | Integer | League-wide pitching rank | RANK() OVER (ORDER BY Pitching_Score DESC)
RC_Score | Float | Runs Created offensive value | (H + BB) × TB / (AB + BB) where TB = 1B + 2×2B + 3×3B + 4×HR
Pitching_Score | Float | Pitching Dominance Score | (IP_decimal × 1.5) + (K_P × 1.0) - (BB_P × 1.0) - (ER × 2.0)
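
The two score formulas can be checked by hand with a short sketch like the one below. The function names are hypothetical. Singles are not a column in this file, so the sketch derives them as H - 2B - 3B - HR, and it returns 0.0 when AB + BB is zero, which is an assumption about how empty stat lines are treated.

```python
def rc_score(h, doubles, triples, hr, bb, ab):
    """Runs Created: (H + BB) x TB / (AB + BB)."""
    singles = h - doubles - triples - hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    denominator = ab + bb
    if denominator == 0:
        return 0.0  # assumption: no plate activity means no offensive value
    return (h + bb) * total_bases / denominator


def pitching_score(ip_decimal, k_p, bb_p, er):
    """Pitching Dominance Score; IP must already be in true decimal innings."""
    return ip_decimal * 1.5 + k_p * 1.0 - bb_p * 1.0 - er * 2.0
```

A batter with 30 hits (5 doubles, 1 triple, 2 home runs), 10 walks, and 100 at bats has 43 total bases, so RC_Score = 40 × 43 / 110 ≈ 15.64. A pitcher with 20⅓ innings, 25 strikeouts, 8 walks, and 10 earned runs scores 30.5 + 25 - 8 - 20 = 27.5.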

Special Values:

  • 9999 in Offensive_Rank/Pitching_Rank: Player doesn’t qualify for that role
  • “Generic Batter/Pitcher” in Name: Synthetic backfill player
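
When working with the rank columns, the 9999 sentinel needs to be filtered out before sorting or averaging. The sketch below shows one way a league-wide rank with that sentinel could be produced, using SQL RANK() semantics (ties share a rank and leave a gap). The function name and the idea that qualification comes from the Is_Batter / Is_Pitcher flags are assumptions for illustration.

```python
NOT_QUALIFIED = 9999  # sentinel documented under Special Values


def league_ranks(players, score_key, flag_key):
    """Return one rank per player, in input order.

    Qualified players get 1 + the number of qualified players with a strictly
    higher score; everyone else gets the 9999 sentinel.
    """
    qualified_scores = [p[score_key] for p in players if p[flag_key]]
    ranks = []
    for player in players:
        if not player[flag_key]:
            ranks.append(NOT_QUALIFIED)
        else:
            higher = sum(score > player[score_key] for score in qualified_scores)
            ranks.append(1 + higher)
    return ranks


# Hypothetical rows: two qualified batters tie for first, one does not qualify.
players = [
    {"Name": "A", "RC_Score": 10.0, "Is_Batter": True},
    {"Name": "B", "RC_Score": 10.0, "Is_Batter": True},
    {"Name": "C", "RC_Score": 5.0, "Is_Batter": True},
    {"Name": "D", "RC_Score": 20.0, "Is_Batter": False},
]
ranks = league_ranks(players, "RC_Score", "Is_Batter")
real_ranks = [r for r in ranks if r != NOT_QUALIFIED]
```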

Historical Rosters

Purpose: Historical player statistics across all teams and seasons, serving as the source data for projections. This is an aggregate of all historical MaxPreps data collected so far (currently 37 teams over four seasons, 2022-2025). It has the same statistics columns as the Roster Projection data, but the values are actuals pulled from MaxPreps.

Location: data/output/historical_stats/aggregated_stats.csv

Column | Data Type | Description | Source
Season | String | Raw season identifier (e.g., “23-24”) | metadata.py: utag_data.year
Season_Cleaned | String/Int | Normalized year (e.g., “2024”) | metadata.py: parsed from Season
Team | String | School name with location | metadata.py: utag_data.schoolName
Level | String | Competition level (typically “Varsity”) | metadata.py: utag_data.teamLevel
Source_File | String | Original HTML filename for data lineage | ETL: filename
Name | String | Player display name | stat_extraction.py: link text
Class | String | Original class year from MaxPreps | stat_extraction.py: abbr.class-year
Class_Cleaned | String | Inferred/corrected class year | class_inference.py + class_cleansing.py
Athlete_ID | String (UUID) | MaxPreps unique player identifier (per season) | stat_extraction.py: href athleteid parameter

Statistics Columns: Same as roster prediction (PA through DP) but representing actual historical values, not projections.

Data Quality Notes:

  • Class = 'Unknown': MaxPreps didn’t provide class information; inference attempted
  • Class_Cleaned may differ from Class if corrected by progression fixer
  • Athlete_ID changes between seasons; Name-based matching used for longitudinal tracking
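
Since Athlete_ID cannot link a player across seasons, longitudinal fields such as Varsity_Year have to be built from names. The sketch below illustrates the idea with a running count per player, mirroring the cumcount() + 1 calculation mentioned for Varsity_Year. Using (Team, Name) as the matching key is an assumption; the project's actual matching rules may differ.

```python
from collections import defaultdict


def assign_varsity_year(rows):
    """Number each player's seasons 1, 2, 3, ... in chronological order.

    Assumption: a player is identified by the (Team, Name) pair.
    """
    seasons_seen = defaultdict(int)
    numbered = []
    for row in sorted(rows, key=lambda r: int(r["Season_Cleaned"])):
        key = (row["Team"], row["Name"])
        seasons_seen[key] += 1
        numbered.append({**row, "Varsity_Year": seasons_seen[key]})
    return numbered
```

Name-based keys have known limits: two teammates with the same name would be merged, and under this key a player who changes schools would restart at Varsity_Year 1.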

Supporting Files

Development Multipliers

Purpose: Year-over-year performance ratios used to project player development. These multipliers are derived from historical performance. For example, across all Juniors becoming Seniors, what was the median change in ERA? That multiplier is then applied to that age transition.

Location: data/output/development_multipliers/

The system produces three multiplier files:

File | Description
development_multipliers.csv | Pooled multipliers (all programs combined, backward compatible)
elite_development_multipliers.csv | Multipliers for elite programs only
standard_development_multipliers.csv | Multipliers for standard programs only

Column Structure (all three files):

Column | Data Type | Description
Transition | String (Index) | Transition identifier (e.g., “Sophomore_to_Junior”, “Freshman_Y1_to_Sophomore_Y2”)
Type | String | Category: “Class”, “Tenure”, or “Class_Tenure”
Sample_Size | Integer | Number of players in cohort (N)
Avg_Volatility | Float | Average standard deviation across all stat multipliers (lower = more reliable)
[Stat Columns] | Float | Median YoY ratio for each statistic

Transition Types Explained:

Type | Example | Description | Reliability
Class | Sophomore_to_Junior | All players making this class transition | Highest N, most stable
Tenure | Varsity_Year1_to_Year2 | Players by varsity experience | Prone to survivor bias
Class_Tenure | Sophomore_Y1_to_Junior_Y2 | Specific combination | Highest specificity, smaller N

Elite vs Standard Programs:

Elite programs (defined in config.py based on regional/state championships since 2016) show measurably different development curves than standard programs. Key differences for Junior→Senior pitching:

Metric | Description | Elite Effect
K_P | Strikeouts | Elite pitchers gain more strikeouts
ER | Earned Runs | Elite pitchers reduce runs more
BB_P | Walks | Elite pitchers cut walks more

Run development_multipliers.py to see current values with dynamic sample sizes.

Multiplier Interpretation:

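  • 1.0 = No change expected
  • >1.0 = Expected increase (e.g., 1.2 = 20% increase); an improvement for most stats, but a decline for stats where lower is better (e.g., ERA, ER, BB_P)
  • <1.0 = Expected decrease; a decline for most stats, but an improvement for stats where lower is better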
  • 0.0 or extreme values = Small sample noise (use with caution)
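
To show where a multiplier value comes from, here is a minimal sketch of a median year-over-year ratio for one statistic and one transition. The function name and input layout are hypothetical. Skipping players whose earlier-season value is zero and falling back to 1.0 when no usable pairs remain are both assumptions.

```python
from statistics import median


def median_multiplier(pairs):
    """Median of (later season / earlier season) over a cohort.

    `pairs` holds (before, after) values of one stat for each player who made
    the transition. Zero baselines are skipped to avoid division by zero.
    """
    ratios = [after / before for before, after in pairs if before > 0]
    return median(ratios) if ratios else 1.0


# Hypothetical strikeout totals for three pitchers in consecutive seasons,
# plus one pitcher with no baseline who is ignored.
k_multiplier = median_multiplier([(10, 12), (20, 20), (5, 10), (0, 3)])
```

Here the usable ratios are 1.2, 1.0, and 2.0, so the multiplier is 1.2. Using the median rather than the mean keeps the one player who doubled his total from dominating the result.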

Special Handling:

  • Triples (3B), Home Runs (HR), and pitching equivalents use Laplacian smoothing to reduce noise
  • Multipliers are capped/floored when applied to prevent extreme projections
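
The two safeguards above might look like the following sketch. Everything specific here is an assumption made for illustration: the add-one form of the smoothing, the smoothing constant, and the 0.5 and 2.0 bounds are placeholders, not the project's configured values.

```python
def smoothed_ratio(before, after, alpha=1.0):
    """Additive (Laplace-style) smoothing of a year-over-year ratio.

    Adding `alpha` to both seasons keeps rare-event stats such as triples or
    home runs from producing ratios of 0 or dividing by zero.
    """
    return (after + alpha) / (before + alpha)


def apply_multiplier(value, multiplier, floor=0.5, cap=2.0):
    """Project a stat forward with the multiplier clamped to [floor, cap]."""
    clamped = min(max(multiplier, floor), cap)
    return value * clamped
```

With these placeholder settings, a player going from 0 to 1 home runs yields a smoothed ratio of 2.0 instead of an undefined one, and a noisy multiplier of 5.0 applied to 10 hits projects 20 rather than 50.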

Generic Players

Purpose: Synthetic “replacement level” player profiles that are used for roster backfilling.

Location: data/output/generic_players/generic_players.csv

Column | Data Type | Description
Name | String | Descriptive identifier (e.g., “Generic Sophomore Batter (30th %ile)”)
Role | String | “Batter” or “Pitcher”
Class_Cleaned | String | Always “Sophomore” (typical call-up class)
Varsity_Year | Integer | Always 1 (first-year varsity)
Projection_Method | String | “Generic Baseline”
Percentile_Tier | Float | Statistical percentile (0.1, 0.2, 0.3, 0.4, 0.5)
[Stat Columns] | Float | Median values for that percentile bucket
AB_Original / PA_Original / IP_Original | Float | Original values before minimum enforcement

Percentile Tier Usage:

Tier | Description | Used For
0.5 (50th) | Median sophomore | Elite teams’ first backfill slot
0.4 (40th) | Above-average starter |
0.3 (30th) | Replacement level | Standard teams’ first backfill slot
0.2 (20th) | Below-average regular | Elite teams’ second slot
0.1 (10th) | Marginal player | Floor for all teams

Role-Based Masking:

  • Batter profiles have all pitching columns zeroed
  • Pitcher profiles have all batting columns zeroed
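
Role-based masking can be pictured with the sketch below, which zeroes the opposite role's columns on a profile. The column lists are taken from the Batting Statistics and Pitching Statistics tables in this dictionary; the function name is hypothetical, and leaving fielding columns untouched is an assumption.

```python
BATTING_COLS = ["PA", "AB", "AVG", "H", "2B", "3B", "HR", "RBI", "R",
                "SF", "BB", "K", "HBP", "OBP", "SLG", "OPS", "SB"]
PITCHING_COLS = ["APP", "IP", "ERA", "BF", "K_P", "ER", "H_P",
                 "2B_P", "3B_P", "HR_P", "BB_P", "BAA"]


def mask_profile(profile):
    """Return a copy of the profile with the other role's stats set to 0.0."""
    to_zero = PITCHING_COLS if profile["Role"] == "Batter" else BATTING_COLS
    masked = dict(profile)
    for column in to_zero:
        if column in masked:
            masked[column] = 0.0
    return masked


generic_batter = mask_profile(
    {"Role": "Batter", "AB": 40.0, "HR": 2.0, "IP": 3.0, "K_P": 5.0, "FP": 0.9}
)
```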


Copyright © 2025 Tim Coultas
