The much-awaited Grok-4 is finally here. Despite a delayed livestream, the model didn’t disappoint. xAI chief Elon Musk even declared that Grok-4 is “PhD-level in everything”, adding that the model performs at a postgraduate level across multiple disciplines, even on questions it has not seen before.

“If [it’s] given the SAT, it would get perfect [scores] every time,” Musk said. He added that on graduate-level tests like the GRE, Grok-4 reportedly scored near-perfect marks in every discipline, from humanities and languages to math, physics, and even engineering.

Musk compared the model’s reasoning to that of humans, adding that Grok-4 was able to solve problems it had not seen before. “These are not on the internet…Grok-4 is smarter than almost all graduate students in all disciplines simultaneously.”

“I would expect Grok to discover new technologies that are used, maybe by the end of this year. It might discover new physics next year,” he declared. 

Crushing All Benchmarks

xAI has launched two versions of its latest model: Grok 4, which the team described as a single-agent version, and Grok 4 Heavy, the multi-agent version. Both are available immediately and come bundled with access to SuperGrok tiers, where users can direct a network of Grok agents to assist with research and productivity.

The SuperGrok tier is offered as a new $300-per-month subscription plan. Moreover, Grok 4 will be deployed across hyperscalers and in xAI’s enterprise offerings.

Talking about the Grok 4 Heavy version, Musk said, “It spawns multiple agents in parallel…They compare notes and yield an answer.”

This “study group” style approach allows the model to solve more problems at test time, especially on complex benchmarks. “This is what we call test-time compute. We scale it up roughly by an order of magnitude,” Musk explained.
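xAI has not published how Grok 4 Heavy’s parallel agents actually “compare notes”. A minimal sketch of one common aggregation scheme, sampling several agents in parallel and taking a majority vote over their final answers, might look like the following. The function names and the stub model are illustrative assumptions, not xAI’s implementation:

```python
import collections
import concurrent.futures

def sample_agent_answer(seed: int) -> str:
    # Stand-in for one independent agent's final answer. A real agent
    # would be a full model call; this stub keeps the sketch runnable.
    candidates = ["42", "42", "41", "42"]
    return candidates[seed % len(candidates)]

def heavy_answer(n_agents: int = 4) -> str:
    """Spawn agents in parallel and return the majority answer."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(sample_agent_answer, range(n_agents)))
    # "Compare notes": the simplest comparison step is a majority vote.
    return collections.Counter(answers).most_common(1)[0][0]

print(heavy_answer())  # majority of ["42", "42", "41", "42"] -> "42"
```

Spending an order of magnitude more test-time compute, as Musk describes, then amounts to running more such agents (or longer reasoning traces) per query rather than training a larger model.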

Grok 4 is now accessible via API, and, according to xAI, the model leads key reasoning benchmarks. On the ARC-AGI-2 evaluation set, a benchmark designed to measure advanced reasoning, the model achieved 15.9% accuracy, reportedly doubling the performance of the next-best model, Claude Opus. “It was the only model in the last three months that broke the 10% barrier,” the xAI team noted.

Beyond accuracy, xAI argued that Grok 4 delivers “intelligence per dollar” that puts it in a “league of its own”.

During the live stream, the xAI team also demonstrated the model’s progress using a challenging benchmark called Humanity’s Last Exam (HLE), which consists of 2,500 expert-curated questions across subjects. 

On this benchmark, most models previously achieved single-digit accuracy. Grok-4, however, solved a quarter of the HLE problems without using tools. With tool capabilities, the multi-agent version, Grok-4 Heavy, was able to solve over 50% of the text-only subset of the HLE problems.

Moreover, according to Artificial Analysis, Grok-4 is now the leading AI model. “We have run our full suite of benchmarks, and Grok-4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude 4 Opus at 64, and DeepSeek R1 0528 at 68,” the company revealed.

Training and Computational Scale

xAI’s team revealed that the development of Grok-4 involved a significant increase in training compute.

“From Grok-2 to Grok-3 to Grok-4, we’ve increased the training by an order of magnitude in each case. It’s 100 times more training than Grok-2,” Yuhuai (Tony) Wu, co-founder of xAI, said.

Wu described two types of training compute: pretraining and reinforcement learning (RL). “We’re actually putting a lot of compute in reasoning, in RL,” Wu said. “With verifiable outcome rewards, you can train these models to think from first principles.”
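A “verifiable outcome reward” means the RL signal comes from mechanically checking the model’s final answer against ground truth, rather than from a learned reward model. A toy sketch of such a reward function for math-style problems might look like this (the answer-marker convention and function name are illustrative assumptions, not xAI’s):

```python
def outcome_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 iff the model's final answer matches the ground truth.

    Assumes the model ends its reasoning with a marker like
    "... final answer: 7"; everything before the marker is ignored.
    """
    answer = model_output.rsplit("final answer:", 1)[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

print(outcome_reward("x = 3 and y = 4, so final answer: 7", "7"))  # 1.0
print(outcome_reward("I believe final answer: 8", "7"))            # 0.0
```

Because the reward is exact-match rather than learned, it gives the “reliable signal to tell the model when it’s right or wrong” that Wu identifies as the bottleneck, but only for problems whose answers can be checked this way.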

He also pointed to a key bottleneck going forward: the availability of challenging problems for RL. “As the model gets smarter, the number of useful RL problems reduces. We need a reliable signal to tell the model when it’s right or wrong,” Wu added.

For Grok-4, all two lakh GPUs of xAI’s supercomputer Colossus were utilised for RL training, reportedly providing 10 times more RL compute than any previous model, an unprecedented scale for reinforcement learning.

“With Grok-4, RL is the new pre-training,” Jiayi Pan, xAI team member, wrote in a post on X.

From Simulations to Reality

Musk predicted future scenarios in which Grok-4 and its successors would operate not only in text but also in the physical world. “The ultimate reasoning test is reality,” he said. “You invent a new technology, does it work? Does the rocket get to orbit? Does the car drive? Does the medicine work?”

Moreover, Musk discussed integration with robotics. “Combine Grok with Optimus, and it can interact with the world. Formulate hypotheses, test them, confirm or reject.”

Grok 4’s practical utility was demonstrated across domains. In collaboration with Andon Labs, the model was tested on Vending-Bench, a simulation involving inventory management, pricing, and supplier contracts. Grok 4 topped the leaderboard, achieving roughly double the net worth of competing models.

In biomedical research, the Arc Institute used Grok 4 to sift through experiment logs and propose hypotheses within seconds. The model is already being evaluated for CRISPR (clustered regularly interspaced short palindromic repeats) research and has been found useful in examining medical imaging, like chest X-rays.

In finance, Grok 4 was described as capable of pulling real-time data and supporting decision-making. 

Voice Mode Introduced with New Natural Voices

Grok 4 also features updates to voice capabilities, offering reduced latency and a new set of voices. One standout is Sal, which comes with a deep voice reminiscent of a movie trailer narrator. Another is Eve, which has been described as a “British voice capable of rich emotions”.

During the livestream, Eve performed an operatic poem about Diet Coke and played a call-and-response number-repetition game. Compared to OpenAI’s voice mode, Grok Voice was “snappier” and less prone to interrupting the user. According to xAI, five voice options are now available, and Grok Voice has seen a 10-times increase in active users in eight weeks.

Upcoming Improvements and Models

xAI showcased how Grok 4 can help solo developers build games. Danny, a game designer, used the API to create a first-person shooter in four hours. Grok 4 sourced assets, generated textures, and assisted with design, removing the need for external sourcing.

Musk said the goal is for Grok to eventually “play the game” and assess whether it’s fun, a task that requires video understanding and tool integration with platforms like Unreal Engine.

The team acknowledged that the current multimodal performance has limitations. “It was so bad that Grok was effectively looking at the world squinting through glass.” Improvements in image, video, and audio understanding are scheduled for the next foundation model, which is expected to finish training this month.

The next steps include video generation, with xAI preparing to train a video model using more than one lakh GPUs. The purpose is to create infinite-scroll content that users can watch and edit, with full interactivity.

Moreover, Musk shared that xAI is developing a coding model focused on being “both fast and smart”, which is expected to be released in a few weeks.

The post Musk’s Grok-4 Crushes Benchmarks, Beats OpenAI & Google in RL appeared first on Analytics India Magazine.