In a unique digital experiment dubbed “PokerBattle.ai,” nine of the world’s leading large language models (LLMs) went head-to-head in a high-stakes, five-day No-Limit Texas Hold ’em poker tournament. The competition pitted models from major tech players, including OpenAI’s o3, Google’s Gemini 2.5 Pro, Meta’s Llama 4, and xAI’s Grok, against each other, each starting with a $100,000 bankroll.
OpenAI’s o3 model emerged as the undisputed champion, walking away $36,691 richer after thousands of hands. The key to its success was its consistent play, adhering closely to textbook pre-flop theory and taking down three of the five largest pots.
Rounding out the top three were Anthropic’s Claude Sonnet 4.5 and xAI’s Grok, which finished with profits of $33,641 and $28,796, respectively. The tournament also had its casualties: Meta’s Llama 4 flamed out early, losing its entire stack, and Moonshot’s Kimi K2 bled chips to finish at $86,030, a loss of $13,970 against its starting bankroll. Google’s Gemini turned a modest profit, landing in the middle of the pack.
This AI-run tournament was more than a mere stunt; it served as a revealing test of general-purpose AI strategy. Poker is often used to test AI because, unlike perfect-information games such as chess, it hides each player’s cards and so demands reasoning under uncertainty, a scenario analogous to real-world decision-making in business and negotiation.
The results showed that the top-performing AIs weren’t just executing pre-written commands; they were adapting, modeling opponents, and making judgment calls that impressed experts. The LLMs proved capable of making probabilistic judgments under pressure, suggesting they are getting smarter in ways that go beyond surface-level repetition.
However, the tournament also exposed consistent flaws in the current generation of LLMs. A recurring issue was the models’ overly aggressive strategies; they favored action-heavy play, often trying to win big pots even when folding would have been the mathematically sounder decision.
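Whether folding is the mathematically sounder choice comes down to pot odds and expected value. The Python below is a minimal sketch of that comparison, not anything taken from the tournament itself: the function names and the chip amounts are hypothetical, and the win probability (equity) is assumed to be known, which in real play it is only estimated.

```python
def break_even_equity(pot: float, to_call: float) -> float:
    """Minimum win probability at which calling breaks even.

    `pot` is the amount already in the middle (including the opponent's bet);
    `to_call` is what we must add to continue. Both are hypothetical inputs.
    """
    if to_call <= 0:
        raise ValueError("to_call must be positive")
    return to_call / (pot + to_call)


def call_ev(pot: float, to_call: float, equity: float) -> float:
    """Expected chips gained by calling, relative to folding (which is 0).

    We win `pot` with probability `equity` and lose `to_call` otherwise.
    """
    if not 0.0 <= equity <= 1.0:
        raise ValueError("equity must be between 0 and 1")
    return equity * pot - (1.0 - equity) * to_call


def should_call(pot: float, to_call: float, equity: float) -> bool:
    """Call only when the expected value of calling is positive."""
    return call_ev(pot, to_call, equity) > 0


if __name__ == "__main__":
    # Illustrative numbers only: a 3,000-chip pot and a 1,000-chip call.
    pot, to_call = 3000, 1000
    print(f"break-even equity: {break_even_equity(pot, to_call):.0%}")
    for equity in (0.20, 0.30):
        ev = call_ev(pot, to_call, equity)
        action = "call" if should_call(pot, to_call, equity) else "fold"
        print(f"equity {equity:.0%}: EV of calling {ev:+.0f} chips -> {action}")
```

With these illustrative numbers, a call needs 25% equity to break even, so a hand that wins only 20% of the time loses 200 chips on average by calling. Chasing that pot anyway is the kind of action-heavy play described above.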
Furthermore, the AIs struggled with sophisticated deception. While they attempted bluffs, these moves often stemmed from misread hands rather than from a calculated plan. They also sometimes failed to account for their “position” at the poker table, meaning the order in which they had to act, and made basic mathematical errors.
Ultimately, PokerBattle.ai provides a useful glimpse into the capabilities of leading LLMs. While consumers may never face an AI across a real poker table, the tournament highlights how these models are learning to navigate ambiguity and uncertainty, skills that could soon be applied to real-world decision-making across various industries.