Overfitting to benchmarks is a common problem for all AI companies. xAI's Grok 4 has some problems with prompt adherence. The overfitting may have resulted from the reinforcement learning used to train the reasoning model.
Kimi K2 is doing well on real-world tests.
xAI will likely improve Grok 4 with new versions that could correct the flaws.
1. Goodhart’s Law Strikes LLMs: Benchmark-driven goals push teams to overfit, eroding the reliability of standard evaluations.
2. Grok 4’s Rank Reality: Marketed as #1, Grok 4 actually sits at #66 on Yupp.ai’s user-voted leaderboard, exposing a hype gap.
3. Real-World Exam Failure: In a five-task test covering summarization, data extraction, coding, table building, and RBAC checklists, Grok 4 trailed o3 and Opus 4.
4. Format & Code Weaknesses: The model ignored explicit formatting instructions and produced broken Python, signalling brittle prompt adherence and reasoning flaws.
5. Ideological & Compliance Risks: Grok 4 over-references Elon Musk and is up to 100× more likely to “snitch,” raising bias and trust concerns.
6. PR-Driven Overfit: xAI needed a headline win to justify a reported $200 billion valuation, incentivizing benchmark gaming over general capability.
7. Call for Honest Benchmarks: Real-world exams must replace leaderboard worship before any model earns “production-ready” status.
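To make the idea of a real-world exam with format checks concrete, here is a minimal Python sketch. It is not the test described above. The task names, the check functions (`bullet_count_is`, `json_has_keys`, `python_parses`), and the sample outputs are all hypothetical and exist only for illustration. It scores each task by whether the model output obeys an explicit formatting instruction or at least parses as Python.

```python
import ast
import json


def bullet_count_is(output, n):
    """True if the output contains exactly n '- ' or '* ' bullet lines."""
    bullets = [ln for ln in output.splitlines()
               if ln.lstrip().startswith(("- ", "* "))]
    return len(bullets) == n


def json_has_keys(output, required_keys):
    """True if the output is a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required_keys)


def python_parses(output):
    """True if the output is syntactically valid Python (it is not executed)."""
    try:
        ast.parse(output)
    except (SyntaxError, ValueError):
        return False
    return True


def pass_rate(outputs, checks):
    """Run each task's check on that task's output.

    outputs: task name -> model output text
    checks:  task name -> function returning True or False
    Returns (fraction of tasks passed, per-task results).
    """
    results = {task: bool(check(outputs.get(task, "")))
               for task, check in checks.items()}
    return sum(results.values()) / len(results), results


# Hypothetical checks for three of the task types mentioned in the list.
checks = {
    "summarization": lambda out: bullet_count_is(out, 3),
    "data_extraction": lambda out: json_has_keys(out, ["name", "date"]),
    "coding": python_parses,
}

# Made-up outputs, not real model responses.
sample_outputs = {
    "summarization": "- one\n- two\n- three",
    "data_extraction": '{"name": "Acme", "date": "2025-07-14"}',
    "coding": "def add(a, b)\n    return a + b",  # missing colon: broken Python
}

rate, details = pass_rate(sample_outputs, checks)
print(rate, details)
```

Checks like these only catch format and syntax failures. Judging whether a summary is accurate or whether code does the right thing still needs human review or task-specific tests.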
Good Grok 4 Results for Other Uses
omg.. Grok 4 Heavy is INSANE for game dev
SuperGrok builds your game prototype in mins, then bring it to Cursor, drop it into Grok 4 MAX, it’s like coding with an AI agent that work 24/7
Idea → Prototype → Code → Done
10 crazy examples: pic.twitter.com/vnCe2caXjp
— el.cine (@EHuanglu) July 14, 2025
Brian Wang is a Futurist Thought Leader and a popular Science blogger with 1 million readers per month. His blog Nextbigfuture.com is ranked #1 Science News Blog. It covers many disruptive technologies and trends including Space, Robotics, Artificial Intelligence, Medicine, Anti-aging Biotechnology, and Nanotechnology.