Elon Musk continues his habit of audacious product launches. During a livestream on X, he confirmed that xAI’s newest model, Grok 4, will roll out to developers and X Premium Plus subscribers this week. Musk went further, stating that by 2026 the system will create entirely new technology and perhaps uncover fresh laws of physics.
How Grok 4 performs today
Independent benchmarking firm Artificial Analysis was given early API access. Its headline Intelligence Index puts Grok 4 at 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, and Anthropic Claude 4 Opus at 64. Detailed results show:
- An all-time record on GPQA Diamond, with 88 percent
- The best-ever score on the text-only subset of Humanity’s Last Exam, with 24 percent, beating the previous record of 21 percent held by Gemini 2.5
- Joint-highest results on MMLU-Pro, at 87 percent, and on AIME 2024
The context window is 256,000 tokens, larger than GPT-4o’s 128,000 but smaller than Gemini 2.5 Pro’s one-million-token capacity. Pricing matches Grok 3 at three dollars per million input tokens and fifteen dollars per million output tokens. That keeps parity with Claude 4 Sonnet, but it is above the eight dollars per million output tokens that OpenAI charges for o3 after its recent price cut.
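To make the per-token pricing concrete, here is a minimal Python sketch of what a single request would cost at the quoted rates. The function names are hypothetical and this is not xAI’s API or billing logic. It also assumes, purely for illustration, that input and output tokens share the same 256,000-token window.

```python
# Rates and window size as quoted in the article (USD per million tokens).
INPUT_USD_PER_MILLION = 3.0
OUTPUT_USD_PER_MILLION = 15.0
CONTEXT_WINDOW_TOKENS = 256_000


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the quoted per-token rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


def fits_context(input_tokens: int, output_tokens: int) -> bool:
    """Assumption: prompt and completion count against one shared window."""
    return input_tokens + output_tokens <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    # A long prompt of 200,000 tokens with a 4,000-token answer.
    print(f"${estimate_cost_usd(200_000, 4_000):.2f}")  # $0.66
    print(fits_context(200_000, 4_000))                 # True
```

At these rates a prompt that nearly fills the window still costs well under a dollar, and output tokens are five times as expensive as input tokens.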
Lofty promises about new physics
Musk has a record of scientific hyperbole. Last year he predicted superhuman AI by the end of 2025, yet today’s leading models still hallucinate basic facts. Mainstream researchers therefore view the claim that Grok 4 will generate novel physical theory within twelve months with scepticism.
A history of bias and toxic responses
Grok 3 was criticised for extremist and conspiratorial answers. In May it regurgitated Holocaust-denial tropes before xAI blamed a rogue employee and patched the behaviour. Earlier tests logged anti-immigrant stereotypes and racially loaded jokes. xAI says Grok 4 now employs a two-tier moderation stack. Whether that will stand up to real-world red teaming remains to be seen.
Strategic context
- Microsoft Azure already hosts Grok 3 and is expected to add Grok 4 this quarter.
- Google and Anthropic plan Gemini 3 and Claude 5 later this year, both targeting higher reasoning scores and larger context windows.
- OpenAI o3 could receive a mid-cycle upgrade.
If Grok 4 sustains its benchmark leadership and avoids new bias scandals, xAI may secure serious enterprise revenue through its paid API. Failure would reinforce perceptions that Musk’s models trade robustness for rhetorical flair.
Where will this take Chipper
Grok 4’s scores signal progress in multi-step reasoning, but the leap from benchmarks to world-changing inventions is vast. For Musk’s prophecy to hold, the model will need reliable safeguards, transparent reasoning tools, and collaboration with domain experts. Equally critical will be whether xAI can purge the racist residues seen in earlier versions. The next twelve months will reveal whether Grok 4 becomes a landmark in AI progress or simply another headline in the long list of Muskian exaggerations.









