Adversarial Review Comparison - Gemini vs. Claude

Date: December 16, 2025 Subject: Comparison of AI-generated adversarial reviews for the CO High School Baseball Projection System Agents: Gemini (Google) vs. Claude (Anthropic)

  1. Executive Summary

This document compares two independent adversarial reviews of the codebase. The reviews were highly complementary, with almost no overlap in their “Critical” findings.

Gemini focused on Mathematical Logic & Semantics, correctly identifying that the system was performing literal decimal math on baseball notations (the “IP Decimal Bug”).

Claude focused on Data Hygiene & Integrity, correctly identifying that “Generic” templates were polluted with cross-role statistics (Batters having ERAs).

Verdict: Neither review was sufficient on its own. Together, they cover both the hidden mathematical flaws (Gemini) and the visible data structure flaws (Claude).

  1. Side-by-Side Finding Matrix

Category

Gemini Finding

Claude Finding

Overlap?

Critical Defect (Math)

IP Decimal Error: System treats 10.1 IP as 10.1 (decimal) instead of 10.33. Results in ~2.3% error in all rate stats.

NaN Propagation: Identified 4 players with RC_Score = 0 due to NaN values in the Runs Created formula.

No

Critical Defect (Data)

N/A

Role Contamination: “Generic Batters” contain pitching stats (ERA, IP), which pollutes team aggregation logic.

No

Statistical Bias

Survivor Bias: Aging curves exclude dropouts, inflating projections.

Survivor Bias: (Same finding).

Yes

Volatility / Noise

Replacement Floor: Generic players (10th %ile) have samples so small (12 PA) they are statistically meaningless.

Rare Event Multipliers: Huge volatility in 3B and HR multipliers due to small denominators.

Partial

Validation

Checked logic flow.

Variance/Mean Ratio: Calculated 13.4 ratio to validate the Negative Binomial simulation assumption.

No

  1. Deep Dive: Unique Wins

Gemini’s Key Win: The “IP Decimal” Bug

The Issue: In baseball box scores, 10.1 means “10 innings and 1 out” (10.333). In Python float, 10.1 is just 10.1.

Evidence: Gemini spotted IP: 3.02 in the output CSV. In baseball notation, .02 is impossible (outs go .1, .2, 1.0). This proved the code was doing raw division on the box score string.

Impact: Massive. Every single ERA, WHIP, and K/9 calculation in the entire system was mathematically wrong.

Status: Fixed in the current Canvas (src/utils/math_fixes.py).

Claude’s Key Win: Role Contamination

The Issue: The generic_players.csv file defines “Generic Sophomore Batter”. However, Claude noticed these rows contained valid data for IP, ERA, and K_P.

Impact: If the Team Strength calculator sums the Pitching_Index of all players, it effectively treats every replacement-level bench bat as a relief pitcher, artificially inflating the pitching depth of bad teams.

Status: Unfixed. Requires filtering logic in profile_generator.py.

  1. Evaluation of Completeness

Gemini Review

Strengths: Deep code analysis regarding domain-specific logic (Baseball math). Excellent detection of “invisible” logic errors that don’t throw exceptions but produce wrong numbers.

Weaknesses: Missed the NaN data holes in the output CSV. Did not run high-level statistical validation (Variance/Mean) on the simulation results.

Claude Review

Strengths: Excellent data profiling. Found specific rows (NaNs) and specific column anomalies (Batters with ERAs). Good statistical validation of the simulation distribution.

Weaknesses: Missed the fundamental domain-math error (IP format). Assumed the numbers were calculated correctly and focused on their presence/absence.

  1. Consolidated Recommendations

To achieve a 1.0 release quality, the following roadmap is derived from combining both reports:

Apply Math Fixes (Gemini):

Deploy correct_innings_pitched (already in Canvas) to all ingestion pipelines.

Deploy smooth_multiplier (already in Canvas) to tame the volatility found by Claude.

Apply Data Sanitization (Claude):

Role Masking: In profile_generator.py, ensure that if Role == ‘Batter’, all Pitching columns are set to 0 or NaN.

NaN Safety: Wrap the Runs Created formula in a try/except or fill NaN with 0 before calculation to prevent the “Zero RC” bug.

Address Survivor Bias (Both):

The current multipliers are optimistic. A global “Churn Penalty” (e.g., reducing all improvement multipliers by 5-8%) is a quick heuristic fix recommended by both agents to offset the lack of “dropout” data.


Copyright © 2025 Tim Coultas

This site uses Just the Docs, a documentation theme for Jekyll.