Adversarial Review Comparison - Claude vs Gemini

Date: December 16, 2025
Subject: Colorado High School Baseball Projection System
Purpose: Cross-validation of independent AI adversarial reviews

Executive Summary

Two independent AI agents (Claude and Gemini) conducted adversarial reviews of the same codebase using the same methodology document. This comparison reveals that both reviews identified valid issues, but each missed critical findings the other caught. The optimal outcome is a merged findings list incorporating discoveries from both reviews.

Metric	Claude	Gemini
Critical Issues Found	2	2
Statistical Concerns	3	2
Code Quality Issues	3	2
Explicit Validation Checks	10	0
Provided Code Fixes	Inline snippets	Standalone module

Key Insight: Neither review alone was complete. Claude found concrete bugs affecting specific players in the output; Gemini found a systemic mathematical error in IP handling that Claude missed entirely.

Side-by-Side Findings Comparison

Critical Issues

Issue	Claude	Gemini	Verified?
NaN Propagation in RC Calculation	✅ Found	❌ Missed	Yes — 4 Denver North players have RC_Score=0 despite 8-36 hits
Generic Batter Profile Contamination	✅ Found	❌ Missed	Yes — 72 players have negative Pitching_Score including generic batters
IP Decimal Notation Error	❌ Missed	✅ Found	Yes — Output shows values like `IP: 3.02` which is invalid baseball notation
Survivor Bias in Aging Curves	⚠️ Partial	✅ Complete	Yes — Both identified, Gemini’s analysis more thorough

Statistical Concerns

Concern	Claude	Gemini	Assessment
Rare Event Multiplier Noise (3B, HR_P)	✅ Found	✅ Found	Both identified; Gemini proposed Laplace (additive) smoothing, Claude proposed Bayesian shrinkage
Freshman→Sophomore 2.25x Multiplier	✅ Found	⚠️ Covered under survivor bias	Same root cause, different framing
50/50 Offense/Pitching Weighting	❌ Missed	✅ Found	Valid methodological question
Replacement Level Floor Instability	✅ Noted prior fix	⚠️ Outdated concern	Code already uses 30th percentile minimum

Code Quality Issues

Issue	Claude	Gemini	Assessment
Documentation mismatch (Top 9 vs All)	✅ Found	❌ Missed	README says “Top 9 batters” but code sums all qualified
Unused volatility metrics	✅ Found	❌ Missed	Opportunity for confidence intervals
IP output formatting artifacts	❌ Missed	✅ Found	Values like 3.02 should be 3.0, 3.1, or 3.2
High generic roster dependency	✅ Found	❌ Missed	31.5% of roster is synthetic backfill

Detailed Analysis of Divergent Findings

Finding Claude Caught, Gemini Missed: NaN Propagation

The Bug: When a player has hits but NaN values in any extra-base hit column (2B, 3B, HR), the Total Bases calculation produces NaN. That NaN carries through the RC formula and surfaces in the output as RC_Score = 0.

Evidence:

Player	Team	Hits	2B	3B	HR
S. Vital	Denver North	28.8	2.0	0.5	NaN
D. Sanudo	Denver North	36.0	4.0	NaN	NaN
R. Avalos	Denver North	20.4	2.0	NaN	2.0
R. Martinez	Denver North	8.4	1.0	NaN	NaN

Why Gemini Missed It: Gemini’s review focused on methodology and formula correctness rather than tracing actual data through the pipeline. This bug only manifests when examining specific player records in the output.

Impact: Denver North’s team strength is artificially deflated because their best hitters contribute zero to offensive calculations.
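
The failure mode and the proposed `fillna(0)`-style guard can be illustrated with a minimal pure-Python sketch. It assumes Total Bases is computed with the standard identity TB = H + 2B + 2×3B + 3×HR; the actual expression in `calculate_offensive_score()` is not quoted in either review, and the function names below are hypothetical.

```python
import math

def total_bases(h, doubles, triples, hr):
    """Standard identity: TB = H + 2B + 2*3B + 3*HR (singles are H minus extra-base hits)."""
    return h + doubles + 2 * triples + 3 * hr

def nan_to_zero(x):
    """Treat a missing extra-base-hit count as zero, mirroring fillna(0) on one value."""
    return 0.0 if math.isnan(x) else x

# D. Sanudo's row from the evidence table: 36.0 hits, 4.0 doubles, 3B and HR missing.
h, doubles, triples, hr = 36.0, 4.0, math.nan, math.nan

# Any arithmetic involving NaN yields NaN, so one missing column poisons the total.
tb_raw = total_bases(h, doubles, triples, hr)

# Filling the extra-base-hit columns first keeps the hits the player actually has.
tb_fixed = total_bases(h, *(nan_to_zero(v) for v in (doubles, triples, hr)))

print(tb_raw, tb_fixed)
```

Filling with zero assumes a missing count means none were recorded. If NaN can also mean "not tracked" in the source data, the fill slightly understates Total Bases for those players, which is still far closer than zero.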

Finding Gemini Caught, Claude Missed: IP Decimal Notation

The Bug: Standard baseball notation uses .1 = 1/3 inning and .2 = 2/3 inning. If the code treats 10.1 IP as the literal decimal 10.1 instead of 10.333..., all rate statistics (ERA, WHIP, K/9) will be systematically incorrect.

Evidence:

Player Z. Quimby: IP = 10.1, ER = 4, Reported ERA = 2.71 (ERA = 7 × ER / IP, for 7-inning games)

Scenario A (Literal Decimal): 7 × 4 / 10.1 = 2.77
Scenario B (Correct Conversion): 7 × 4 / 10.333 = 2.71 ✓

The reported ERA matches Scenario B, so the source data appears correct (MaxPreps provides pre-calculated ERA).
Projected values, however, show artifacts such as IP: 3.02, which is not valid baseball notation.

Why Claude Missed It: Claude’s validation focused on RC formula correctness and pitching score correlation with ERA, but did not examine whether IP values were being converted properly before calculations. The correlation check (-0.50) passed because the direction was correct even if magnitudes had small errors.

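Impact: Parsing the notation as a literal decimal undercounts innings by about 0.233 for every recorded third (0.1 instead of 0.333), which inflates ERA, WHIP, and K/9 for projected players. In the Quimby example the ERA would be overstated by roughly 2.3% (2.77 vs. 2.71); the relative error is larger for pitchers with fewer innings.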
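
The conversion described in this section can be sketched as follows. This is not the utility from `math_fixes.py`, whose contents are not reproduced in this comparison; `ip_notation_to_innings` and `era` are hypothetical names, and the 7-inning multiplier follows the Quimby example above.

```python
from decimal import Decimal

# Fractional digit in baseball notation -> outs recorded in the partial inning.
_OUTS_BY_FRACTION = {Decimal("0"): 0, Decimal("0.1"): 1, Decimal("0.2"): 2}

def ip_notation_to_innings(ip):
    """Convert baseball IP notation (e.g. 10.1) to true innings (10.333...)."""
    value = Decimal(str(ip))  # go through str to avoid binary float noise
    if value < 0:
        raise ValueError(f"{ip!r} is negative")
    whole = int(value)
    fraction = value - whole
    if fraction not in _OUTS_BY_FRACTION:
        # Values such as 3.02 or 4.5 cannot occur in baseball notation.
        raise ValueError(f"{ip!r} is not valid IP notation")
    return whole + _OUTS_BY_FRACTION[fraction] / 3

def era(earned_runs, ip, game_innings=7):
    """ERA from notation-style IP, converting to true innings first."""
    return game_innings * earned_runs / ip_notation_to_innings(ip)

print(round(7 * 4 / 10.1, 2))   # Scenario A: literal decimal
print(round(era(4, 10.1), 2))   # Scenario B: converted innings
```

Rejecting invalid fractions rather than rounding them is a deliberate choice in this sketch: a value like 3.02 signals that arithmetic was performed on notation values somewhere upstream, and silently coercing it would hide that.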

Finding Both Identified Differently: Survivor Bias

Claude’s Framing:

“The Freshman_Y1→Sophomore_Y2 transition shows a 2.25x multiplier for Hits. This likely reflects survivor bias: only the most elite freshmen earn varsity playing time in Year 1.”

Gemini’s Framing:

“High school attrition is non-random. Bad players quit; good players stay. By calculating multipliers only on the ‘Survivors’, the model calculates ‘Conditional Growth’ (Growth given you didn’t quit). When applied to a full current roster, it over-projects team aggregate strength.”

Assessment: Gemini’s analysis is more complete because it:

Explains the mechanism (attrition is non-random)
Identifies the downstream impact (team strength inflation)
Proposes a concrete fix (Churn Rate penalty or Zero-Fill method)

Claude identified a symptom (high freshman multiplier); Gemini identified the systemic cause.
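
The gap between conditional growth and full-cohort growth can be shown with a small sketch. The interpretation of the Zero-Fill method used here (counting players who left as zero production in Year 2) is an assumption, since neither review supplies code for it. The function names are hypothetical and the cohort numbers are made up for illustration.

```python
def survivor_multiplier(cohort):
    """Growth computed only over players who appear in both seasons."""
    stayed = [(y1, y2) for y1, y2 in cohort if y2 is not None]
    return sum(y2 for _, y2 in stayed) / sum(y1 for y1, _ in stayed)

def zero_fill_multiplier(cohort):
    """Growth over the full Year 1 cohort, counting departed players as 0 in Year 2."""
    return sum(y2 or 0 for _, y2 in cohort) / sum(y1 for y1, _ in cohort)

# Made-up (Year 1 hits, Year 2 hits) pairs; None marks a player who left the program.
cohort = [(10, 20), (8, 16), (4, None), (2, None)]

print(survivor_multiplier(cohort))   # conditional growth, survivors only
print(zero_fill_multiplier(cohort))  # growth expected across the whole roster
```

In this toy cohort the survivor-only multiplier is 2.0 while the zero-fill multiplier is 1.5. Applying the former to every player on a current roster would overstate aggregate team production by the difference.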

Quality of Recommendations

Claude’s Recommendations

Issue	Recommendation	Actionability
NaN Propagation	Add `fillna(0)` before TB calculation	✅ Direct code fix
Profile Contamination	Clear irrelevant stats by role	✅ Direct code fix
Rare Event Multipliers	Bayesian shrinkage toward 1.0	⚠️ Conceptual, no code
Survivor Bias	Cap at 1.5x or regress toward class multiplier	⚠️ Partial solution
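
Bayesian shrinkage toward 1.0 is listed above without code. One common approximation is a sample-size-weighted average of the observed multiplier and 1.0. The sketch below is an assumption about what such a fix could look like, not code from either review; `shrink_multiplier` and `prior_strength` are hypothetical names, and the prior strength would need tuning against held-out seasons.

```python
def shrink_multiplier(observed, n, prior_strength=20.0, prior_mean=1.0):
    """Pull an observed multiplier toward prior_mean, more strongly when n is small.

    prior_strength acts like a count of pseudo-observations at prior_mean.
    """
    return (n * observed + prior_strength * prior_mean) / (n + prior_strength)

# A 2.25x multiplier backed by 40 observations, with a prior worth 40 observations,
# lands halfway between 2.25 and 1.0.
print(shrink_multiplier(2.25, n=40, prior_strength=40))

# With no observations the estimate falls back to the prior mean.
print(shrink_multiplier(3.0, n=0))
```

For rare events such as 3B, `n` would be the number of events behind the multiplier rather than the number of players, so sparse categories are shrunk the most.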

Gemini’s Recommendations

Issue	Recommendation	Actionability
IP Notation	`convert_ip_to_decimal()` utility	✅ Provided complete code
Survivor Bias	Churn Rate penalty or Zero-Fill method	⚠️ Conceptual, no code
Rare Events	Laplace (additive) smoothing	✅ Provided code in `math_fixes.py`
Replacement Level	Increase min PA to 25	✅ Simple parameter change

Assessment: Gemini provided more complete code solutions via math_fixes.py. Claude provided more targeted inline fixes for the specific bugs found.

Validation Methodology Comparison

Claude’s Approach

Claude executed explicit validation checks and documented results:

✅ RC formula produces expected values
✅ Pitching-ERA correlation = -0.50
✅ All multiplier transitions have N ≥ 40
✅ Generic profiles have non-zero RC
✅ Variance/Mean = 13.4 (overdispersed relative to Poisson, which supports the negative binomial distribution)
✅ No zero/negative team strength indices
✅ 100% of real players use Class-based projection
✅ Elite teams get 50th percentile backfill
✅ All teams meet roster minimums
❌ NaN propagation in RC for 4 players
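
The Variance/Mean check in the list above is straightforward to reproduce. A Poisson model implies a ratio near 1, so a ratio well above 1 indicates overdispersion and favors a negative binomial model. The sketch below uses made-up counts purely for illustration; `dispersion_index` is a hypothetical name, and the data Claude actually tested is not shown in the review.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by mean; close to 1 for Poisson-distributed counts."""
    return variance(counts) / mean(counts)

# Made-up counts for illustration only, not project data.
counts = [0, 0, 1, 2, 30]
ratio = dispersion_index(counts)

print(round(ratio, 2))
print("overdispersed" if ratio > 1 else "not overdispersed")
```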

Gemini’s Approach

Gemini performed analytical reasoning about methodology but did not enumerate specific validation checks or trace data through the pipeline.

Assessment: Claude’s approach was more rigorous for finding data-level bugs; Gemini’s approach was more effective for finding systemic mathematical issues.

Consolidated Critical Issues List

Taken together, the two reviews yield the following list of issues requiring fixes:

Priority 1: Must Fix Before Production

NaN Propagation in RC Calculation (Claude)
- Location: src/models/advanced_ranking.py:calculate_offensive_score()
- Fix: Add fillna(0) for 2B, 3B, HR before TB calculation
IP Decimal Notation Error (Gemini)
- Location: All files performing IP-based calculations
- Fix: Implement the IP conversion utility from math_fixes.py (correct_innings_pitched(), listed as convert_ip_to_decimal() under Gemini's recommendations)
Generic Batter Profile Contamination (Claude)
- Location: src/workflows/profile_generator.py
- Fix: Clear pitching stats for Batter role, batting stats for Pitcher role
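
The profile contamination fix above might look like the following sketch. The `Role` key and the column groupings are assumptions, because the schema used by `profile_generator.py` is not shown in either review. Only the `HR_P`, `2B_P`, and `3B_P` names come from this document.

```python
# Hypothetical column groupings; adjust to the real profile schema.
PITCHING_STATS = ("IP", "ER", "HR_P", "2B_P", "3B_P")
BATTING_STATS = ("H", "2B", "3B", "HR")

def clear_irrelevant_stats(profile):
    """Return a copy of a generic profile without the stats its role never accrues."""
    drop = {"Batter": PITCHING_STATS, "Pitcher": BATTING_STATS}.get(profile.get("Role"), ())
    return {key: value for key, value in profile.items() if key not in drop}

# A generic batter that wrongly carries pitching columns.
generic_batter = {"Role": "Batter", "H": 20.0, "2B": 3.0, "IP": 3.02, "ER": 2.0}
print(clear_irrelevant_stats(generic_batter))
```

Dropping the keys rather than zeroing them is one option. Whichever representation is used, the scoring code has to treat an absent pitching line as "no contribution" rather than as a poor one, otherwise the negative Pitching_Score problem can reappear.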

Priority 2: Should Fix

Survivor Bias Adjustment (Both)
- Apply 5-10% reduction to multipliers, or implement Zero-Fill method
Rare Event Smoothing (Both)
- Apply Laplace (additive) smoothing or Bayesian shrinkage to the 3B, HR_P, 2B_P, and 3B_P multipliers
IP Output Formatting (Gemini)
- Round projected IP to valid baseball notation (.0, .1, .2 only)
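
For the IP output formatting item above, one possible approach is to round to the nearest out and then render the result in notation form. The sketch assumes projected IP is held internally as true decimal innings (so 10.333 means ten and one third). If a value like 3.02 instead comes from arithmetic performed directly on notation values, the inputs need converting before projection and this step alone will not fix it. `innings_to_ip_notation` is a hypothetical name.

```python
def innings_to_ip_notation(innings):
    """Round true innings to the nearest out and render as baseball notation."""
    outs = round(innings * 3)          # three outs per inning
    whole, remainder = divmod(outs, 3)
    return f"{whole}.{remainder}"      # remainder is always 0, 1, or 2

for value in (3.02, 3.7, 3.9, 10 + 1 / 3):
    print(value, "->", innings_to_ip_notation(value))
```

Returning a string avoids reintroducing a float that later code might mistake for true innings.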

Lessons Learned

For Future Adversarial Reviews

Multiple reviewers catch more issues — Neither AI alone found all critical bugs
Data tracing vs. methodology analysis — Both approaches are necessary
Explicit validation checklists — Force examination of specific edge cases
Code execution — Actually running validation scripts catches bugs that reasoning alone misses

Strengths by Reviewer

Strength	Claude	Gemini
Finding data-level bugs	✅ Strong	⚠️ Weak
Finding mathematical errors	⚠️ Weak	✅ Strong
Providing code fixes	⚠️ Inline only	✅ Complete module
Explicit validation	✅ Strong	⚠️ Weak
Systemic analysis	⚠️ Moderate	✅ Strong
Awareness of prior fixes	✅ Yes	❌ No

Conclusion

The cross-validation exercise demonstrates that adversarial reviews benefit significantly from multiple independent reviewers. Claude’s review excelled at finding concrete bugs affecting specific players but missed a systemic mathematical error. Gemini’s review excelled at identifying methodological flaws but missed data-level issues visible only through pipeline tracing.

Recommended Action: Implement fixes from both reviews:

Merge math_fixes.py into the codebase (Gemini)
Apply NaN and profile contamination fixes (Claude)
Address survivor bias using Gemini’s more complete framework
Run the consolidated validation checklist after fixes

The combined findings list represents a more complete picture than either review alone could provide.

Appendix: Files Referenced

File	Source
`adversarial_review_report.md`	Claude’s review
`adversarial_review_report_gemini.md`	Gemini’s review
`math_fixes.py`	Gemini’s code fixes
`aggregated_stats.csv`	Input data
`2026_roster_prediction.csv`	Output data
`development_multipliers.csv`	Multiplier reference
`generic_players.csv`	Backfill profiles
`team_strength_rankings.csv`	Team rankings output