Adversarial Review Comparison - Claude vs Gemini

Date: December 16, 2025
Subject: Colorado High School Baseball Projection System
Purpose: Cross-validation of independent AI adversarial reviews


Executive Summary

Two independent AI agents (Claude and Gemini) conducted adversarial reviews of the same codebase using the same methodology document. This comparison reveals that both reviews identified valid issues, but each missed critical findings the other caught. The optimal outcome is a merged findings list incorporating discoveries from both reviews.

Metric Claude Gemini
Critical Issues Found 2 2
Statistical Concerns 3 2
Code Quality Issues 3 2
Explicit Validation Checks 10 0
Provided Code Fixes Inline snippets Standalone module

Key Insight: Neither review alone was complete. Claude found concrete bugs affecting specific players in the output; Gemini found a systemic mathematical error in IP handling that Claude missed entirely.


Side-by-Side Findings Comparison

Critical Issues

Issue Claude Gemini Verified?
NaN Propagation in RC Calculation ✅ Found ❌ Missed Yes — 4 Denver North players have RC_Score=0 despite 8-36 hits
Generic Batter Profile Contamination ✅ Found ❌ Missed Yes — 72 players have negative Pitching_Score including generic batters
IP Decimal Notation Error ❌ Missed ✅ Found Yes — Output shows values like IP: 3.02 which is invalid baseball notation
Survivor Bias in Aging Curves ⚠️ Partial ✅ Complete Yes — Both identified, Gemini’s analysis more thorough

Statistical Concerns

Concern Claude Gemini Assessment
Rare Event Multiplier Noise (3B, HR_P) ✅ Found ✅ Found Both identified; Gemini proposed Laplacian smoothing, Claude proposed Bayesian shrinkage
Freshman→Sophomore 2.25x Multiplier ✅ Found ⚠️ Covered under survivor bias Same root cause, different framing
50/50 Offense/Pitching Weighting ❌ Missed ✅ Found Valid methodological question
Replacement Level Floor Instability ✅ Noted prior fix ⚠️ Outdated concern Code already uses 30th percentile minimum

Code Quality Issues

Issue Claude Gemini Assessment
Documentation mismatch (Top 9 vs All) ✅ Found ❌ Missed README says “Top 9 batters” but code sums all qualified
Unused volatility metrics ✅ Found ❌ Missed Opportunity for confidence intervals
IP output formatting artifacts ❌ Missed ✅ Found Values like 3.02 should be 3.0, 3.1, or 3.2
High generic roster dependency ✅ Found ❌ Missed 31.5% of roster is synthetic backfill

Detailed Analysis of Divergent Findings

Finding Claude Caught, Gemini Missed: NaN Propagation

The Bug: When a player has hits but NaN values in extra-base hit columns (2B, 3B, HR), the Total Bases calculation produces NaN, which propagates to RC_Score = 0.

Evidence:

Player Team Hits 2B 3B HR RC_Score
S. Vital Denver North 28.8 2.0 0.5 NaN 0.0 ❌
D. Sanudo Denver North 36.0 4.0 NaN NaN 0.0 ❌
R. Avalos Denver North 20.4 2.0 NaN 2.0 0.0 ❌
R. Martinez Denver North 8.4 1.0 NaN NaN 0.0 ❌

Why Gemini Missed It: Gemini’s review focused on methodology and formula correctness rather than tracing actual data through the pipeline. This bug only manifests when examining specific player records in the output.

Impact: Denver North’s team strength is artificially deflated because their best hitters contribute zero to offensive calculations.


Finding Gemini Caught, Claude Missed: IP Decimal Notation

The Bug: Standard baseball notation uses .1 = 1/3 inning and .2 = 2/3 inning. If the code treats 10.1 IP as the literal decimal 10.1 instead of 10.333..., all rate statistics (ERA, WHIP, K/9) will be systematically incorrect.

Evidence:

Player Z. Quimby: IP = 10.1, ER = 4, Reported ERA = 2.71

Scenario A (Literal Decimal): 7 × 4 / 10.1 = 2.77
Scenario B (Correct Conversion): 7 × 4 / 10.333 = 2.71 ✓

The source data appears correct (MaxPreps provides pre-calculated ERA).
But projected values show artifacts like IP: 3.02, which cannot exist in baseball.

Why Claude Missed It: Claude’s validation focused on RC formula correctness and pitching score correlation with ERA, but did not examine whether IP values were being converted properly before calculations. The correlation check (-0.50) passed because the direction was correct even if magnitudes had small errors.

Impact: Systematic ~2.3% error per third of an inning in all pitching rate statistics for projected players.


Finding Both Identified Differently: Survivor Bias

Claude’s Framing:

“The Freshman_Y1→Sophomore_Y2 transition shows a 2.25x multiplier for Hits. This likely reflects survivor bias: only the most elite freshmen earn varsity playing time in Year 1.”

Gemini’s Framing:

“High school attrition is non-random. Bad players quit; good players stay. By calculating multipliers only on the ‘Survivors’, the model calculates ‘Conditional Growth’ (Growth given you didn’t quit). When applied to a full current roster, it over-projects team aggregate strength.”

Assessment: Gemini’s analysis is more complete because it:

  1. Explains the mechanism (attrition is non-random)
  2. Identifies the downstream impact (team strength inflation)
  3. Proposes a concrete fix (Churn Rate penalty or Zero-Fill method)

Claude identified a symptom (high freshman multiplier); Gemini identified the systemic cause.


Quality of Recommendations

Claude’s Recommendations

Issue Recommendation Actionability
NaN Propagation Add fillna(0) before TB calculation ✅ Direct code fix
Profile Contamination Clear irrelevant stats by role ✅ Direct code fix
Rare Event Multipliers Bayesian shrinkage toward 1.0 ⚠️ Conceptual, no code
Survivor Bias Cap at 1.5x or regress toward class multiplier ⚠️ Partial solution

Gemini’s Recommendations

Issue Recommendation Actionability
IP Notation convert_ip_to_decimal() utility ✅ Provided complete code
Survivor Bias Churn Rate penalty or Zero-Fill method ⚠️ Conceptual, no code
Rare Events Laplacian Smoothing ✅ Provided code in math_fixes.py
Replacement Level Increase min PA to 25 ✅ Simple parameter change

Assessment: Gemini provided more complete code solutions via math_fixes.py. Claude provided more targeted inline fixes for the specific bugs found.


Validation Methodology Comparison

Claude’s Approach

Claude executed explicit validation checks and documented results:

1. ✅ RC formula produces expected values
2. ✅ Pitching-ERA correlation = -0.50
3. ✅ All multiplier transitions have N ≥ 40
4. ✅ Generic profiles have non-zero RC
5. ✅ Variance/Mean = 13.4 (validates NB distribution)
6. ✅ No zero/negative team strength indices
7. ✅ 100% of real players use Class-based projection
8. ✅ Elite teams get 50th percentile backfill
9. ✅ All teams meet roster minimums
10. ❌ NaN propagation in RC for 4 players

Gemini’s Approach

Gemini performed analytical reasoning about methodology but did not enumerate specific validation checks or trace data through the pipeline.

Assessment: Claude’s approach was more rigorous for finding data-level bugs; Gemini’s approach was more effective for finding systemic mathematical issues.


Consolidated Critical Issues List

Based on both reviews, the complete list of critical issues requiring fixes:

Priority 1: Must Fix Before Production

  1. NaN Propagation in RC Calculation (Claude)
    • Location: src/models/advanced_ranking.py:calculate_offensive_score()
    • Fix: Add fillna(0) for 2B, 3B, HR before TB calculation
  2. IP Decimal Notation Error (Gemini)
    • Location: All files performing IP-based calculations
    • Fix: Implement correct_innings_pitched() from math_fixes.py
  3. Generic Batter Profile Contamination (Claude)
    • Location: src/workflows/profile_generator.py
    • Fix: Clear pitching stats for Batter role, batting stats for Pitcher role

Priority 2: Should Fix

  1. Survivor Bias Adjustment (Both)
    • Apply 5-10% reduction to multipliers, or implement Zero-Fill method
  2. Rare Event Smoothing (Both)
    • Apply Laplacian smoothing or Bayesian shrinkage to 3B, HR_P, 2B_P, 3B_P multipliers
  3. IP Output Formatting (Gemini)
    • Round projected IP to valid baseball notation (.0, .1, .2 only)

Lessons Learned

For Future Adversarial Reviews

  1. Multiple reviewers catch more issues — Neither AI alone found all critical bugs
  2. Data tracing vs. methodology analysis — Both approaches are necessary
  3. Explicit validation checklists — Force examination of specific edge cases
  4. Code execution — Actually running validation scripts catches bugs that reasoning alone misses

Strengths by Reviewer

Strength Claude Gemini
Finding data-level bugs ✅ Strong ⚠️ Weak
Finding mathematical errors ⚠️ Weak ✅ Strong
Providing code fixes ⚠️ Inline only ✅ Complete module
Explicit validation ✅ Strong ⚠️ Weak
Systemic analysis ⚠️ Moderate ✅ Strong
Awareness of prior fixes ✅ Yes ❌ No

Conclusion

The cross-validation exercise demonstrates that adversarial reviews benefit significantly from multiple independent reviewers. Claude’s review excelled at finding concrete bugs affecting specific players but missed a systemic mathematical error. Gemini’s review excelled at identifying methodological flaws but missed data-level issues visible only through pipeline tracing.

Recommended Action: Implement fixes from both reviews:

  1. Merge math_fixes.py into the codebase (Gemini)
  2. Apply NaN and profile contamination fixes (Claude)
  3. Address survivor bias using Gemini’s more complete framework
  4. Run the consolidated validation checklist after fixes

The combined findings list represents a more complete picture than either review alone could provide.


Appendix: Files Referenced

File Source
adversarial_review_report.md Claude’s review
adversarial_review_report_gemini.md Gemini’s review
math_fixes.py Gemini’s code fixes
aggregated_stats.csv Input data
2026_roster_prediction.csv Output data
development_multipliers.csv Multiplier reference
generic_players.csv Backfill profiles
team_strength_rankings.csv Team rankings output

Copyright © 2025 Tim Coultas

This site uses Just the Docs, a documentation theme for Jekyll.