Data Dictionary
Key Outputs
Team Power Rankings
Purpose: Team-level power rankings that aggregate individual player performance into composite indices. They show the relative strength of teams and provide information about returning players and experience level. The highest-ranked team in Offense or Pitching always receives 100; read any lower value as a percentage of the top-ranked team's offense or pitching.
Location: data/output/team_strength/team_strength_rankings.csv
| Column | Data Type | Description | Calculation |
|---|---|---|---|
| Team | String | Full team name with location | Direct from roster |
| Total_Power_Index | Float | Composite team strength (0-100 scale) | (Offense_Index + Pitching_Index) / 2 |
| Offense_Index | Float | Relative offensive strength (0-100) | (Projected_Runs / max_Projected_Runs) × 100 |
| Pitching_Index | Float | Relative pitching strength (0-100) | (Pitching_dominance / max_Pitching_dominance) × 100 |
| Projected_Runs | Float | Sum of RC_Score for top 10 batters | SUM(TOP 10 RC_Score) |
| Pitching_dominance | Float | Sum of Pitching_Score for top 6 pitchers | SUM(TOP 6 Pitching_Score) |
| Batters_Count | Integer | Number of qualified batters (RC > 0.1) in top 10 | Count |
| Pitchers_Count | Integer | Number of qualified pitchers (Score > 0.1) in top 6 | Count |
| Returning_Players | Integer | Real players (not generic backfill) | COUNT WHERE Projection_Method NOT LIKE '%Generic%' |
| Returning_Seniors | Integer | Returning players projected as Seniors | Count |
| Returning_Juniors | Integer | Returning players projected as Juniors | Count |
| Returning_Sophs | Integer | Returning players projected as Sophomores | Count |
| Total_Varsity_Years | Integer | Sum of Varsity_Year for all returning players | SUM(Varsity_Year) WHERE Is_Returning = True |
| Avg_Varsity_Years | Float | Average experience level | AVG(Varsity_Year) WHERE Is_Returning = True |
| Top_Hitter | String | Name of team’s best batter | Player with highest RC_Score |
| Top_Hitter_RC | Float | RC_Score of top hitter | Max RC_Score for team |
| Ace_Pitcher | String | Name of team’s best pitcher | Player with highest Pitching_Score |
| Ace_Score | Float | Pitching_Score of ace | Max Pitching_Score for team |
Index Interpretation:
- 100.0 = League leader (best team)
- 50.0 = Half as strong as league leader
- <30 = Significantly weaker program
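The index calculations in the table above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function name `team_indices`, the dict-based input, and the sample players are assumptions made for the example, and it assumes at least one team has a positive score so the maximum is non-zero.

```python
from collections import defaultdict

def team_indices(players, top_batters=10, top_pitchers=6):
    """players: iterable of dicts with Team, RC_Score and Pitching_Score."""
    rc, ps = defaultdict(list), defaultdict(list)
    for p in players:
        rc[p["Team"]].append(p["RC_Score"])
        ps[p["Team"]].append(p["Pitching_Score"])

    # Projected_Runs / Pitching_dominance: sum of each team's best scores.
    runs = {t: sum(sorted(v, reverse=True)[:top_batters]) for t, v in rc.items()}
    dom = {t: sum(sorted(v, reverse=True)[:top_pitchers]) for t, v in ps.items()}
    max_runs, max_dom = max(runs.values()), max(dom.values())

    out = {}
    for team in runs:
        off = runs[team] / max_runs * 100  # Offense_Index
        pit = dom[team] / max_dom * 100    # Pitching_Index
        out[team] = {
            "Offense_Index": off,
            "Pitching_Index": pit,
            "Total_Power_Index": (off + pit) / 2,
        }
    return out

# Hypothetical sample data, for illustration only.
sample = [
    {"Team": "A", "RC_Score": 20.0, "Pitching_Score": 30.0},
    {"Team": "A", "RC_Score": 10.0, "Pitching_Score": 0.0},
    {"Team": "B", "RC_Score": 15.0, "Pitching_Score": 40.0},
]
indices = team_indices(sample)
```

In this sample, team A leads on offense and team B leads on pitching, so neither reaches a Total_Power_Index of 100; that only happens when one team leads both categories.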
Season Game-by-Game Simulation
Purpose: Provides game-by-game season simulation results for Rocky Mountain’s schedule. Shows win probability, projected scores, confidence of the projection, and a short narrative describing the outcome.
Location: data/output/team_strength/rocky_mountain_monte_carlo.csv
| Column | Data Type | Description |
|---|---|---|
| Date | String | Game date |
| Opponent | String | Opposing team name |
| Win_Pct | Float | Win probability (0.0-1.0) based on 1,000 simulations |
| Proj_Score | String | Average projected score (e.g., “6.2-5.1”) |
| Confidence | String | Game classification: “Lock (W)”, “Solid (W)”, “Toss-up”, “Solid (L)”, “Lock (L)” |
| Analysis | String | Narrative explanation of matchup factors |
| My_Off_Idx | Float | Rocky Mountain’s Offense Index |
| My_Pit_Idx | Float | Rocky Mountain’s Pitching Index |
| Opp_Off_Idx | Float | Opponent’s Offense Index |
| Opp_Pit_Idx | Float | Opponent’s Pitching Index |
Confidence Labels:
| Label | Win_Pct Range | Interpretation |
|---|---|---|
| Lock (W) | >90% | Very likely win |
| Solid (W) | 65-90% | Favorable matchup |
| Toss-up | 35-65% | Could go either way |
| Solid (L) | 10-35% | Unfavorable matchup |
| Lock (L) | <10% | Very likely loss |
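
The label ranges above can be expressed as a small lookup. This is a sketch under one stated assumption: the table does not say which label applies exactly at 65% or 35%, so the function below assigns those boundary values to the "Solid" labels. The function name `confidence_label` is hypothetical.

```python
def confidence_label(win_pct):
    """Map a Win_Pct in [0.0, 1.0] to a confidence label.

    ASSUMPTION: exact boundary values (0.90, 0.65, 0.35, 0.10) fall into
    the "Solid" buckets; the table leaves 0.65 and 0.35 ambiguous.
    """
    if win_pct > 0.90:
        return "Lock (W)"
    if win_pct >= 0.65:
        return "Solid (W)"
    if win_pct > 0.35:
        return "Toss-up"
    if win_pct >= 0.10:
        return "Solid (L)"
    return "Lock (L)"
```

Note that Win_Pct is stored as a ratio (0.0-1.0), while the table states the ranges as percentages.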
Simulation Parameters:
- 1,000 iterations per game
- Negative Binomial distribution with dispersion = 1.3
- Home field advantage = 10% scoring boost
- Average runs per game in HS baseball = 6.0
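To make these parameters concrete, here is a rough standard-library sketch of one simulated matchup. Several points are assumptions, not documented behavior: "dispersion" is treated as the variance-to-mean ratio, the 10% boost is applied to the home team's expected runs, ties count as half a win, and the mapping from team indices to expected runs is not shown here, so the function takes expected runs directly. The names `sample_runs` and `simulate_game` are hypothetical.

```python
import math
import random

def sample_runs(rng, mean_runs, dispersion=1.3):
    """Draw one negative-binomial run total via a gamma-Poisson mixture.

    ASSUMPTION: dispersion is the variance-to-mean ratio (must be > 1).
    """
    shape = mean_runs / (dispersion - 1.0)
    lam = rng.gammavariate(shape, mean_runs / shape)  # gamma mean = mean_runs
    # Knuth's Poisson sampler; adequate for the small means of baseball scores.
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_game(my_mean, opp_mean, home=True, n_sims=1000, seed=0):
    """Return (win_pct, my_avg_runs, opp_avg_runs) over n_sims iterations."""
    rng = random.Random(seed)
    boost = 1.10  # home field advantage: 10% scoring boost
    my_mu = my_mean * boost if home else my_mean
    opp_mu = opp_mean if home else opp_mean * boost
    wins, my_total, opp_total = 0.0, 0, 0
    for _ in range(n_sims):
        mine, theirs = sample_runs(rng, my_mu), sample_runs(rng, opp_mu)
        my_total += mine
        opp_total += theirs
        # ASSUMPTION: a tied simulation counts as half a win.
        wins += 1.0 if mine > theirs else 0.5 if mine == theirs else 0.0
    return wins / n_sims, my_total / n_sims, opp_total / n_sims

win_pct, my_avg, opp_avg = simulate_game(6.0, 6.0, home=True)
```

The two averages correspond to the Proj_Score column, and `win_pct` to Win_Pct.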
2026 Roster Projection
Purpose: Complete projected roster for every team with individual player statistics and rankings. This carries non-senior varsity players forward onto the 2026 roster, applies an improvement multiplier to them based on historical trends, and fills in empty roster spots with generic players also based on historical data.
Location: data/output/roster_prediction/2026_roster_prediction.csv
| Column | Data Type | Description | Source/Calculation |
|---|---|---|---|
| Team | String | Full team name with location (e.g., “Rocky Mountain (Fort Collins, CO)”) | ETL: metadata.py |
| Name | String | Player name or generic player identifier | ETL: stat_extraction.py / profile_generator.py |
| Season_Cleaned | Integer | Projected season year (e.g., 2026) | Calculated: base year + 1 |
| Class_Cleaned | String | Projected grade level (Freshman, Sophomore, Junior, Senior) | Calculated: next_class_map lookup |
| Varsity_Year | Integer | Number of completed varsity seasons (1-4) | Calculated: cumcount() + 1 from historical data |
| Projection_Method | String | Method used to project stats | One of: “Class (Age-Based) - Elite”, “Class (Age-Based) - Standard”, “Class_Tenure (Specific)”, “Tenure (Experience-Based)”, “Default (1.0)”, “Backfill (Elite Step-Down)”, “Backfill (Standard Step-Down)”, “Generic Baseline” |
| Offensive_Rank_Team | Integer | Within-team batting rank (1 = best batter on team) | RANK() OVER (PARTITION BY Team ORDER BY RC_Score DESC) |
| Pitching_Rank_Team | Integer | Within-team pitching rank (1 = best pitcher on team) | RANK() OVER (PARTITION BY Team ORDER BY Pitching_Score DESC) |
Batting Statistics:
| Column | Data Type | Description | Unit |
|---|---|---|---|
| PA | Float | Plate Appearances | Count |
| AB | Float | At Bats | Count |
| AVG | Float | Batting Average | Ratio (0.000-1.000) |
| H | Float | Hits | Count |
| 2B | Float | Doubles | Count |
| 3B | Float | Triples | Count |
| HR | Float | Home Runs | Count |
| RBI | Float | Runs Batted In | Count |
| R | Float | Runs Scored | Count |
| SF | Float | Sacrifice Flies | Count |
| BB | Float | Walks (Base on Balls) | Count |
| K | Float | Strikeouts | Count |
| HBP | Float | Hit By Pitch | Count |
| OBP | Float | On-Base Percentage | Ratio |
| SLG | Float | Slugging Percentage | Ratio |
| OPS | Float | On-Base Plus Slugging | Ratio |
| SB | Float | Stolen Bases | Count |
Pitching Statistics:
| Column | Data Type | Description | Unit |
|---|---|---|---|
| APP | Float | Appearances | Count |
| IP | Float | Innings Pitched (baseball notation: X.1 = X⅓, X.2 = X⅔) | Innings |
| ERA | Float | Earned Run Average | Runs per 9 innings |
| BF | Float | Batters Faced | Count |
| K_P | Float | Strikeouts (Pitching) | Count |
| ER | Float | Earned Runs | Count |
| H_P | Float | Hits Against | Count |
| 2B_P | Float | Doubles Against | Count |
| 3B_P | Float | Triples Against | Count |
| HR_P | Float | Home Runs Against | Count |
| BB_P | Float | Walks Against | Count |
| BAA | Float | Batting Average Against | Ratio |
Fielding Statistics:
| Column | Data Type | Description | Unit |
|---|---|---|---|
| FP | Float | Fielding Percentage | Ratio |
| TC | Float | Total Chances | Count |
| PO | Float | Putouts | Count |
| A | Float | Assists | Count |
| E | Float | Errors | Count |
| DP | Float | Double Plays | Count |
Derived/Calculated Fields:
| Column | Data Type | Description | Calculation |
|---|---|---|---|
| Is_Batter | Boolean | Qualifies as batter (AB ≥ 10) | df['AB'].fillna(0) >= 10 |
| Is_Pitcher | Boolean | Qualifies as pitcher (IP ≥ 5) | df['IP'].fillna(0) >= 5 |
| Offensive_Rank | Integer | League-wide batting rank | RANK() OVER (ORDER BY RC_Score DESC) |
| Pitching_Rank | Integer | League-wide pitching rank | RANK() OVER (ORDER BY Pitching_Score DESC) |
| RC_Score | Float | Runs Created offensive value | (H + BB) × TB / (AB + BB) where TB = 1B + 2×2B + 3×3B + 4×HR |
| Pitching_Score | Float | Pitching Dominance Score | (IP_decimal × 1.5) + (K_P × 1.0) - (BB_P × 1.0) - (ER × 2.0) |
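
The two score formulas above can be sketched as follows. Singles are derived as 1B = H - 2B - 3B - HR, and IP is converted from baseball notation to decimal innings before weighting. The function names are hypothetical, and returning 0.0 when AB + BB is zero is an assumption made to keep the example safe.

```python
def ip_to_decimal(ip):
    """Convert baseball notation (X.1 = X 1/3, X.2 = X 2/3) to decimal innings."""
    whole = int(ip)
    outs = round((ip - whole) * 10)  # 0, 1 or 2 outs in the partial inning
    return whole + outs / 3

def rc_score(h, bb, ab, doubles, triples, hr):
    """Runs Created: (H + BB) x TB / (AB + BB)."""
    singles = h - doubles - triples - hr
    tb = singles + 2 * doubles + 3 * triples + 4 * hr
    denom = ab + bb
    # ASSUMPTION: a player with no AB or BB scores 0.0.
    return (h + bb) * tb / denom if denom else 0.0

def pitching_score(ip, k_p, bb_p, er):
    """Pitching Dominance Score with the weights from the table above."""
    return ip_to_decimal(ip) * 1.5 + k_p * 1.0 - bb_p * 1.0 - er * 2.0
```

For example, a pitcher with IP = 20.1, 25 strikeouts, 8 walks, and 6 earned runs has 20⅓ decimal innings, giving 30.5 + 25 - 8 - 12 = 35.5.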
Special Values:
- 9999 in Offensive_Rank/Pitching_Rank: Player doesn’t qualify for that role
- “Generic Batter/Pitcher” in Name: Synthetic backfill player
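A sketch of how a RANK()-style ordering combines with the 9999 sentinel. It assumes non-qualifying players are excluded before ranking (the source does not say whether they are ranked first and then overwritten); `rank_desc` and `NOT_QUALIFIED` are hypothetical names.

```python
NOT_QUALIFIED = 9999  # sentinel for players who don't qualify for the role

def rank_desc(scores, qualifies):
    """SQL-style RANK() by descending score.

    Ties share a rank and the next rank skips ahead; non-qualifiers
    receive the 9999 sentinel instead of a rank.
    """
    ordered = sorted((s for s, q in zip(scores, qualifies) if q), reverse=True)
    first_pos = {}
    for pos, score in enumerate(ordered, start=1):
        first_pos.setdefault(score, pos)  # keep the first (best) position per score
    return [first_pos[s] if q else NOT_QUALIFIED for s, q in zip(scores, qualifies)]

# Hypothetical RC_Score values; the last player fails the Is_Batter check.
ranks = rank_desc([12.0, 8.5, 12.0, 3.0, 0.0], [True, True, True, True, False])
```

Applying the same function to each team's players separately would give the within-team ranks (Offensive_Rank_Team, Pitching_Rank_Team).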
Historical Rosters
Purpose: Historical player statistics across all teams and seasons, serving as the source data for projections. This is an aggregate of all historical MaxPreps data collected so far (currently 37 teams across four seasons, 2022-2025). It provides the same statistics columns as the Roster Projection data, but the values are actuals pulled from MaxPreps.
Location: data/output/historical_stats/aggregated_stats.csv
| Column | Data Type | Description | Source |
|---|---|---|---|
| Season | String | Raw season identifier (e.g., “23-24”) | metadata.py: utag_data.year |
| Season_Cleaned | String/Int | Normalized year (e.g., “2024”) | metadata.py: parsed from Season |
| Team | String | School name with location | metadata.py: utag_data.schoolName |
| Level | String | Competition level (typically “Varsity”) | metadata.py: utag_data.teamLevel |
| Source_File | String | Original HTML filename for data lineage | ETL: filename |
| Name | String | Player display name | stat_extraction.py: link text |
| Class | String | Original class year from MaxPreps | stat_extraction.py: abbr.class-year |
| Class_Cleaned | String | Inferred/corrected class year | class_inference.py + class_cleansing.py |
| Athlete_ID | String (UUID) | MaxPreps unique player identifier (per season) | stat_extraction.py: href athleteid parameter |
Statistics Columns: Same as roster prediction (PA through DP) but representing actual historical values, not projections.
Data Quality Notes:
- Class = 'Unknown': MaxPreps didn’t provide class information; inference attempted
- Class_Cleaned may differ from Class if corrected by progression fixer
- Athlete_ID changes between seasons; Name-based matching used for longitudinal tracking
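Because Athlete_ID is not stable across seasons, longitudinal fields such as Varsity_Year have to be built from names. The sketch below shows one way this could work and is not the project's implementation: the match key of (Team, lower-cased Name) and the function name `add_varsity_year` are assumptions, and it mirrors the "cumcount() + 1" idea without pandas.

```python
from collections import defaultdict

def add_varsity_year(rows):
    """rows: dicts with Team, Name, Season_Cleaned.

    Returns new dicts, ordered by season, each with a Varsity_Year counter.
    """
    seen = defaultdict(int)
    out = []
    for row in sorted(rows, key=lambda r: int(r["Season_Cleaned"])):
        # ASSUMPTION: players are matched on team plus normalized name.
        key = (row["Team"], row["Name"].strip().lower())
        seen[key] += 1
        out.append({**row, "Varsity_Year": seen[key]})
    return out

# Hypothetical rows, for illustration only.
history = add_varsity_year([
    {"Team": "A", "Name": "J. Smith", "Season_Cleaned": "2024"},
    {"Team": "A", "Name": "J. Smith", "Season_Cleaned": "2023"},
    {"Team": "A", "Name": "B. Jones", "Season_Cleaned": "2024"},
])
```

Name-based matching can merge two different players who share a name on the same team, which is one reason the match key is worth documenting explicitly.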
Supporting Files
Development Multipliers
Purpose: Year-over-year performance ratios used to project player development. These multipliers are derived from historical performance. For example, across all Juniors becoming Seniors, what was the median change in ERA? That multiplier is then applied to that age transition.
Location: data/output/development_multipliers/
The system produces three multiplier files:
| File | Description |
|---|---|
| development_multipliers.csv | Pooled multipliers (all programs combined, backward compatible) |
| elite_development_multipliers.csv | Multipliers for elite programs only |
| standard_development_multipliers.csv | Multipliers for standard programs only |
Column Structure (all three files):
| Column | Data Type | Description |
|---|---|---|
| Transition | String (Index) | Transition identifier (e.g., “Sophomore_to_Junior”, “Freshman_Y1_to_Sophomore_Y2”) |
| Type | String | Category: “Class”, “Tenure”, or “Class_Tenure” |
| Sample_Size | Integer | Number of players in cohort (N) |
| Avg_Volatility | Float | Average standard deviation across all stat multipliers (lower = more reliable) |
| [Stat Columns] | Float | Median YoY ratio for each statistic |
Transition Types Explained:
| Type | Example | Description | Reliability |
|---|---|---|---|
| Class | Sophomore_to_Junior | All players making this class transition | Highest N, most stable |
| Tenure | Varsity_Year1_to_Year2 | Players by varsity experience | Prone to survivor bias |
| Class_Tenure | Sophomore_Y1_to_Junior_Y2 | Specific combination | Highest specificity, smaller N |
Elite vs Standard Programs:
Elite programs (defined in config.py based on regional/state championships since 2016) show measurably different development curves than standard programs. Key differences for Junior→Senior pitching:
| Metric | Description | Elite Effect |
|---|---|---|
| K_P | Strikeouts | Elite pitchers gain more strikeouts |
| ER | Earned Runs | Elite pitchers reduce runs more |
| BB_P | Walks | Elite pitchers cut walks more |
Run development_multipliers.py to see current values with dynamic sample sizes.
Multiplier Interpretation:
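- 1.0 = No change expected
- >1.0 = Expected increase in the stat (e.g., 1.2 = 20% increase); an improvement for most stats, but a regression for lower-is-better stats such as ERA or ER
- <1.0 = Expected decrease in the stat; a decline for most stats, but an improvement for lower-is-better stats
- 0.0 or extreme values = Small sample noise (use with caution)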
Special Handling:
- Triples (3B), Home Runs (HR), and pitching equivalents use Laplacian smoothing to reduce noise
- Multipliers are capped/floored when applied to prevent extreme projections
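The multiplier logic described in this section can be sketched as a median of year-over-year ratios, followed by a clamp at application time. Everything specific here is an assumption for illustration: the additive form of the smoothing, the `floor` and `cap` values, the fallback to 1.0 when no ratio can be computed, and the function names.

```python
from statistics import median

def yoy_multiplier(pairs, smoothing=0.0):
    """Median of next/prev ratios; pairs are (prev_value, next_value) per player.

    ASSUMPTION: smoothing adds a constant to both sides of the ratio, which
    keeps rare-event stats such as 3B or HR from producing zero-division
    or wildly unstable ratios.
    """
    ratios = [
        (nxt + smoothing) / (prev + smoothing)
        for prev, nxt in pairs
        if prev + smoothing > 0
    ]
    return median(ratios) if ratios else 1.0  # ASSUMPTION: no data means no change

def apply_multiplier(value, multiplier, floor=0.5, cap=2.0):
    """Scale a stat by a multiplier clamped to [floor, cap] (hypothetical bounds)."""
    return value * min(max(multiplier, floor), cap)

# Hypothetical Junior-to-Senior strikeout pairs for three pitchers.
k_mult = yoy_multiplier([(10, 12), (20, 22), (5, 10)])
```

Using the median rather than the mean keeps a single breakout season from dominating the multiplier for a small cohort.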
Generic Players
Purpose: Synthetic “replacement level” player profiles that are used for roster backfilling.
Location: data/output/generic_players/generic_players.csv
| Column | Data Type | Description |
|---|---|---|
| Name | String | Descriptive identifier (e.g., “Generic Sophomore Batter (30th %ile)”) |
| Role | String | “Batter” or “Pitcher” |
| Class_Cleaned | String | Always “Sophomore” (typical call-up class) |
| Varsity_Year | Integer | Always 1 (first-year varsity) |
| Projection_Method | String | “Generic Baseline” |
| Percentile_Tier | Float | Statistical percentile (0.1, 0.2, 0.3, 0.4, 0.5) |
| [Stat Columns] | Float | Median values for that percentile bucket |
| AB_Original / PA_Original / IP_Original | Float | Original values before minimum enforcement |
Percentile Tier Usage:
| Tier | Description | Used For |
|---|---|---|
| 0.5 (50th) | Median sophomore | Elite teams’ first backfill slot |
| 0.4 (40th) | Above-average starter | |
| 0.3 (30th) | Replacement level | Standard teams’ first backfill slot |
| 0.2 (20th) | Below-average regular | Elite teams’ second slot |
| 0.1 (10th) | Marginal player | Floor for all teams |
Role-Based Masking:
- Batter profiles have all pitching columns zeroed
- Pitcher profiles have all batting columns zeroed
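A small sketch of the masking rule, using the batting and pitching column names listed in the roster projection tables. The function name `mask_by_role` and the dict-based profile are assumptions made for the example.

```python
PITCHING_COLS = ["APP", "IP", "ERA", "BF", "K_P", "ER", "H_P",
                 "2B_P", "3B_P", "HR_P", "BB_P", "BAA"]
BATTING_COLS = ["PA", "AB", "AVG", "H", "2B", "3B", "HR", "RBI", "R",
                "SF", "BB", "K", "HBP", "OBP", "SLG", "OPS", "SB"]

def mask_by_role(profile):
    """Zero the stat columns that don't belong to the profile's Role."""
    zeroed = PITCHING_COLS if profile["Role"] == "Batter" else BATTING_COLS
    return {k: (0.0 if k in zeroed else v) for k, v in profile.items()}

# Hypothetical generic batter profile with stray pitching values.
masked = mask_by_role({
    "Name": "Generic Sophomore Batter (30th %ile)",
    "Role": "Batter",
    "AB": 40.0, "H": 10.0, "IP": 3.0, "K_P": 2.0,
})
```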