Adversarial Review Comparison - Gemini vs. Claude
Date: December 16, 2025 Subject: Comparison of AI-generated adversarial reviews for the CO High School Baseball Projection System Agents: Gemini (Google) vs. Claude (Anthropic)
- Executive Summary
This document compares two independent adversarial reviews of the codebase. The reviews were highly complementary, with almost no overlap in their “Critical” findings.
Gemini focused on Mathematical Logic & Semantics, correctly identifying that the system was performing literal decimal math on baseball notations (the “IP Decimal Bug”).
Claude focused on Data Hygiene & Integrity, correctly identifying that “Generic” templates were polluted with cross-role statistics (Batters having ERAs).
Verdict: Neither review was sufficient on its own. Together, they cover both the hidden mathematical flaws (Gemini) and the visible data structure flaws (Claude).
- Side-by-Side Finding Matrix
Category
Gemini Finding
Claude Finding
Overlap?
Critical Defect (Math)
IP Decimal Error: System treats 10.1 IP as 10.1 (decimal) instead of 10.33. Results in ~2.3% error in all rate stats.
NaN Propagation: Identified 4 players with RC_Score = 0 due to NaN values in the Runs Created formula.
No
Critical Defect (Data)
N/A
Role Contamination: “Generic Batters” contain pitching stats (ERA, IP), which pollutes team aggregation logic.
No
Statistical Bias
Survivor Bias: Aging curves exclude dropouts, inflating projections.
Survivor Bias: (Same finding).
Yes
Volatility / Noise
Replacement Floor: Generic players (10th %ile) have samples so small (12 PA) they are statistically meaningless.
Rare Event Multipliers: Huge volatility in 3B and HR multipliers due to small denominators.
Partial
Validation
Checked logic flow.
Variance/Mean Ratio: Calculated 13.4 ratio to validate the Negative Binomial simulation assumption.
No
- Deep Dive: Unique Wins
Gemini’s Key Win: The “IP Decimal” Bug
The Issue: In baseball box scores, 10.1 means “10 innings and 1 out” (10.333). In Python float, 10.1 is just 10.1.
Evidence: Gemini spotted IP: 3.02 in the output CSV. In baseball notation, .02 is impossible (outs go .1, .2, 1.0). This proved the code was doing raw division on the box score string.
Impact: Massive. Every single ERA, WHIP, and K/9 calculation in the entire system was mathematically wrong.
Status: Fixed in the current Canvas (src/utils/math_fixes.py).
Claude’s Key Win: Role Contamination
The Issue: The generic_players.csv file defines “Generic Sophomore Batter”. However, Claude noticed these rows contained valid data for IP, ERA, and K_P.
Impact: If the Team Strength calculator sums the Pitching_Index of all players, it effectively treats every replacement-level bench bat as a relief pitcher, artificially inflating the pitching depth of bad teams.
Status: Unfixed. Requires filtering logic in profile_generator.py.
- Evaluation of Completeness
Gemini Review
Strengths: Deep code analysis regarding domain-specific logic (Baseball math). Excellent detection of “invisible” logic errors that don’t throw exceptions but produce wrong numbers.
Weaknesses: Missed the NaN data holes in the output CSV. Did not run high-level statistical validation (Variance/Mean) on the simulation results.
Claude Review
Strengths: Excellent data profiling. Found specific rows (NaNs) and specific column anomalies (Batters with ERAs). Good statistical validation of the simulation distribution.
Weaknesses: Missed the fundamental domain-math error (IP format). Assumed the numbers were calculated correctly and focused on their presence/absence.
- Consolidated Recommendations
To achieve a 1.0 release quality, the following roadmap is derived from combining both reports:
Apply Math Fixes (Gemini):
Deploy correct_innings_pitched (already in Canvas) to all ingestion pipelines.
Deploy smooth_multiplier (already in Canvas) to tame the volatility found by Claude.
Apply Data Sanitization (Claude):
Role Masking: In profile_generator.py, ensure that if Role == ‘Batter’, all Pitching columns are set to 0 or NaN.
NaN Safety: Wrap the Runs Created formula in a try/except or fill NaN with 0 before calculation to prevent the “Zero RC” bug.
Address Survivor Bias (Both):
The current multipliers are optimistic. A global “Churn Penalty” (e.g., reducing all improvement multipliers by 5-8%) is a quick heuristic fix recommended by both agents to offset the lack of “dropout” data.