Overfitting to benchmarks is a problem common to all AI companies. xAI's Grok 4 shows weaknesses in prompt adherence, and the overfitting may have resulted from the reinforcement learning used to develop its reasoning capabilities.

Kimi K2 is doing well on real-world tests.

xAI will likely improve Grok 4 with new versions that could correct these flaws.

1. Goodhart’s Law Strikes LLMs: Benchmark-driven goals push teams to overfit, eroding the reliability of standard evaluations.

2. Grok 4’s Rank Reality: Marketed as #1, Grok 4 actually sits at #66 on Yupp.ai’s user-voted leaderboard, exposing a hype gap.

3. Real-World Exam Failure: In a five-task test covering summarization, data extraction, coding, table building, and RBAC checklists, Grok 4 trailed o3 and Opus 4.

4. Format & Code Weaknesses: The model ignored explicit formatting instructions and produced broken Python, signalling brittle prompt adherence and reasoning flaws.

5. Ideological & Compliance Risks: Grok 4 over-references Elon Musk and is up to 100× more likely to “snitch,” raising bias and trust concerns.

6. PR-Driven Overfit: xAI needed a headline win to justify a reported $200 billion valuation, incentivizing benchmark gaming over general capability.

7. Call for Honest Benchmarks: Real-world exams must replace leaderboard worship before any model earns “production-ready” status.

Good Grok 4 Results for Other Uses

omg.. Grok 4 Heavy is INSANE for game dev

SuperGrok builds your game prototype in mins, then bring it to Cursor, drop it into Grok 4 MAX, it’s like coding with an AI agent that work 24/7

Idea → Prototype → Code → Done

10 crazy examples: pic.twitter.com/vnCe2caXjp

— el.cine (@EHuanglu) July 14, 2025
