Adversarial Review Comparison - Claude vs Gemini
Date: December 16, 2025
Subject: Colorado High School Baseball Projection System
Purpose: Cross-validation of independent AI adversarial reviews
Executive Summary
Two independent AI agents (Claude and Gemini) conducted adversarial reviews of the same codebase using the same methodology document. This comparison reveals that both reviews identified valid issues, but each missed critical findings the other caught. The optimal outcome is a merged findings list incorporating discoveries from both reviews.
| Metric | Claude | Gemini |
|---|---|---|
| Critical Issues Found | 2 | 2 |
| Statistical Concerns | 3 | 2 |
| Code Quality Issues | 3 | 2 |
| Explicit Validation Checks | 10 | 0 |
| Provided Code Fixes | Inline snippets | Standalone module |
Key Insight: Neither review alone was complete. Claude found concrete bugs affecting specific players in the output; Gemini found a systemic mathematical error in IP handling that Claude missed entirely.
Side-by-Side Findings Comparison
Critical Issues
| Issue | Claude | Gemini | Verified? |
|---|---|---|---|
| NaN Propagation in RC Calculation | ✅ Found | ❌ Missed | Yes — 4 Denver North players have RC_Score=0 despite 8-36 hits |
| Generic Batter Profile Contamination | ✅ Found | ❌ Missed | Yes — 72 players have negative Pitching_Score including generic batters |
| IP Decimal Notation Error | ❌ Missed | ✅ Found | Yes — Output shows values like IP: 3.02 which is invalid baseball notation |
| Survivor Bias in Aging Curves | ⚠️ Partial | ✅ Complete | Yes — Both identified, Gemini’s analysis more thorough |
Statistical Concerns
| Concern | Claude | Gemini | Assessment |
|---|---|---|---|
| Rare Event Multiplier Noise (3B, HR_P) | ✅ Found | ✅ Found | Both identified; Gemini proposed Laplacian smoothing, Claude proposed Bayesian shrinkage |
| Freshman→Sophomore 2.25x Multiplier | ✅ Found | ⚠️ Covered under survivor bias | Same root cause, different framing |
| 50/50 Offense/Pitching Weighting | ❌ Missed | ✅ Found | Valid methodological question |
| Replacement Level Floor Instability | ✅ Noted prior fix | ⚠️ Outdated concern | Code already uses 30th percentile minimum |
Code Quality Issues
| Issue | Claude | Gemini | Assessment |
|---|---|---|---|
| Documentation mismatch (Top 9 vs All) | ✅ Found | ❌ Missed | README says “Top 9 batters” but code sums all qualified |
| Unused volatility metrics | ✅ Found | ❌ Missed | Opportunity for confidence intervals |
| IP output formatting artifacts | ❌ Missed | ✅ Found | Values like 3.02 should be 3.0, 3.1, or 3.2 |
| High generic roster dependency | ✅ Found | ❌ Missed | 31.5% of roster is synthetic backfill |
Detailed Analysis of Divergent Findings
Finding Claude Caught, Gemini Missed: NaN Propagation
The Bug: When a player has hits but NaN values in extra-base hit columns (2B, 3B, HR), the Total Bases calculation produces NaN, which propagates to RC_Score = 0.
Evidence:
| Player | Team | Hits | 2B | 3B | HR | RC_Score |
|---|---|---|---|---|---|---|
| S. Vital | Denver North | 28.8 | 2.0 | 0.5 | NaN | 0.0 ❌ |
| D. Sanudo | Denver North | 36.0 | 4.0 | NaN | NaN | 0.0 ❌ |
| R. Avalos | Denver North | 20.4 | 2.0 | NaN | 2.0 | 0.0 ❌ |
| R. Martinez | Denver North | 8.4 | 1.0 | NaN | NaN | 0.0 ❌ |
Why Gemini Missed It: Gemini’s review focused on methodology and formula correctness rather than tracing actual data through the pipeline. This bug only manifests when examining specific player records in the output.
Impact: Denver North’s team strength is artificially deflated because their best hitters contribute zero to offensive calculations.
Finding Gemini Caught, Claude Missed: IP Decimal Notation
The Bug: Standard baseball notation uses .1 = 1/3 inning and .2 = 2/3 inning. If the code treats 10.1 IP as the literal decimal 10.1 instead of 10.333..., all rate statistics (ERA, WHIP, K/9) will be systematically incorrect.
Evidence:
Player Z. Quimby: IP = 10.1, ER = 4, Reported ERA = 2.71
Scenario A (Literal Decimal): 7 × 4 / 10.1 = 2.77
Scenario B (Correct Conversion): 7 × 4 / 10.333 = 2.71 ✓
The source data appears correct (MaxPreps provides pre-calculated ERA).
But projected values show artifacts like IP: 3.02, which cannot exist in baseball.
Why Claude Missed It: Claude’s validation focused on RC formula correctness and pitching score correlation with ERA, but did not examine whether IP values were being converted properly before calculations. The correlation check (-0.50) passed because the direction was correct even if magnitudes had small errors.
Impact: Systematic ~2.3% error per third of an inning in all pitching rate statistics for projected players.
Finding Both Identified Differently: Survivor Bias
Claude’s Framing:
“The Freshman_Y1→Sophomore_Y2 transition shows a 2.25x multiplier for Hits. This likely reflects survivor bias: only the most elite freshmen earn varsity playing time in Year 1.”
Gemini’s Framing:
“High school attrition is non-random. Bad players quit; good players stay. By calculating multipliers only on the ‘Survivors’, the model calculates ‘Conditional Growth’ (Growth given you didn’t quit). When applied to a full current roster, it over-projects team aggregate strength.”
Assessment: Gemini’s analysis is more complete because it:
- Explains the mechanism (attrition is non-random)
- Identifies the downstream impact (team strength inflation)
- Proposes a concrete fix (Churn Rate penalty or Zero-Fill method)
Claude identified a symptom (high freshman multiplier); Gemini identified the systemic cause.
Quality of Recommendations
Claude’s Recommendations
| Issue | Recommendation | Actionability |
|---|---|---|
| NaN Propagation | Add fillna(0) before TB calculation | ✅ Direct code fix |
| Profile Contamination | Clear irrelevant stats by role | ✅ Direct code fix |
| Rare Event Multipliers | Bayesian shrinkage toward 1.0 | ⚠️ Conceptual, no code |
| Survivor Bias | Cap at 1.5x or regress toward class multiplier | ⚠️ Partial solution |
Gemini’s Recommendations
| Issue | Recommendation | Actionability |
|---|---|---|
| IP Notation | convert_ip_to_decimal() utility | ✅ Provided complete code |
| Survivor Bias | Churn Rate penalty or Zero-Fill method | ⚠️ Conceptual, no code |
| Rare Events | Laplacian Smoothing | ✅ Provided code in math_fixes.py |
| Replacement Level | Increase min PA to 25 | ✅ Simple parameter change |
Assessment: Gemini provided more complete code solutions via math_fixes.py. Claude provided more targeted inline fixes for the specific bugs found.
Validation Methodology Comparison
Claude’s Approach
Claude executed explicit validation checks and documented results:
1. ✅ RC formula produces expected values
2. ✅ Pitching-ERA correlation = -0.50
3. ✅ All multiplier transitions have N ≥ 40
4. ✅ Generic profiles have non-zero RC
5. ✅ Variance/Mean = 13.4 (validates NB distribution)
6. ✅ No zero/negative team strength indices
7. ✅ 100% of real players use Class-based projection
8. ✅ Elite teams get 50th percentile backfill
9. ✅ All teams meet roster minimums
10. ❌ NaN propagation in RC for 4 players
Gemini’s Approach
Gemini performed analytical reasoning about methodology but did not enumerate specific validation checks or trace data through the pipeline.
Assessment: Claude’s approach was more rigorous for finding data-level bugs; Gemini’s approach was more effective for finding systemic mathematical issues.
Consolidated Critical Issues List
Based on both reviews, the complete list of critical issues requiring fixes:
Priority 1: Must Fix Before Production
- NaN Propagation in RC Calculation (Claude)
- Location:
src/models/advanced_ranking.py:calculate_offensive_score() - Fix: Add
fillna(0)for 2B, 3B, HR before TB calculation
- Location:
- IP Decimal Notation Error (Gemini)
- Location: All files performing IP-based calculations
- Fix: Implement
correct_innings_pitched()frommath_fixes.py
- Generic Batter Profile Contamination (Claude)
- Location:
src/workflows/profile_generator.py - Fix: Clear pitching stats for Batter role, batting stats for Pitcher role
- Location:
Priority 2: Should Fix
- Survivor Bias Adjustment (Both)
- Apply 5-10% reduction to multipliers, or implement Zero-Fill method
- Rare Event Smoothing (Both)
- Apply Laplacian smoothing or Bayesian shrinkage to 3B, HR_P, 2B_P, 3B_P multipliers
- IP Output Formatting (Gemini)
- Round projected IP to valid baseball notation (.0, .1, .2 only)
Lessons Learned
For Future Adversarial Reviews
- Multiple reviewers catch more issues — Neither AI alone found all critical bugs
- Data tracing vs. methodology analysis — Both approaches are necessary
- Explicit validation checklists — Force examination of specific edge cases
- Code execution — Actually running validation scripts catches bugs that reasoning alone misses
Strengths by Reviewer
| Strength | Claude | Gemini |
|---|---|---|
| Finding data-level bugs | ✅ Strong | ⚠️ Weak |
| Finding mathematical errors | ⚠️ Weak | ✅ Strong |
| Providing code fixes | ⚠️ Inline only | ✅ Complete module |
| Explicit validation | ✅ Strong | ⚠️ Weak |
| Systemic analysis | ⚠️ Moderate | ✅ Strong |
| Awareness of prior fixes | ✅ Yes | ❌ No |
Conclusion
The cross-validation exercise demonstrates that adversarial reviews benefit significantly from multiple independent reviewers. Claude’s review excelled at finding concrete bugs affecting specific players but missed a systemic mathematical error. Gemini’s review excelled at identifying methodological flaws but missed data-level issues visible only through pipeline tracing.
Recommended Action: Implement fixes from both reviews:
- Merge
math_fixes.pyinto the codebase (Gemini) - Apply NaN and profile contamination fixes (Claude)
- Address survivor bias using Gemini’s more complete framework
- Run the consolidated validation checklist after fixes
The combined findings list represents a more complete picture than either review alone could provide.
Appendix: Files Referenced
| File | Source |
|---|---|
adversarial_review_report.md | Claude’s review |
adversarial_review_report_gemini.md | Gemini’s review |
math_fixes.py | Gemini’s code fixes |
aggregated_stats.csv | Input data |
2026_roster_prediction.csv | Output data |
development_multipliers.csv | Multiplier reference |
generic_players.csv | Backfill profiles |
team_strength_rankings.csv | Team rankings output |