ChatGPT is shockingly bad at poker
I’m impressed by large language models. So why can't they get the basics of poker right?

Generally speaking, I think large language models (LLMs) like ChatGPT and Claude are closer to underrated than overrated. Their adoption curve has been incredibly fast: by some measures, they’re the most rapidly growing technology in human history. It’s easy to cherry-pick examples of AI “hallucinations”; in my experience, these are becoming considerably less common, and I sometimes wonder whether people who write skeptically of LLMs are even using them at all.
I use these models almost every day, for tasks ranging from the whimsical (planning a trip to Norway1) to the mundane (checking Silver Bulletin articles for typos: admittedly, a few still slip through). And I’ve seen them improve to the point that I increasingly trust them to boost my productivity. When I was preparing this year’s edition of our NCAA tournament forecasts, they were valuable in reconciling various data sets2, effectively performing some annoying lookup tasks3, and generating short snippets of code that often worked seamlessly. That’s a long way from building a good model from scratch. But they often save a lot of tedious work, freeing up bandwidth for more creative and higher value-add tasks. Perhaps 80 percent of the time, their performance is somewhere between marginally helpful and incredibly impressive.
But that leaves the 20 percent of the time when they aren’t up to the task. I think LLMs are a substantial positive to my workflow, but that’s because I’m nearly always using them for tasks I could do myself, so that I can vet their answers and know when they’re wrong.4 For instance, when it comes to using LLMs for another subject I know well — poker — they’ve been exceptionally underwhelming.
I want to be careful about this. Like with the old Louis CK routine about people complaining about patchy Wi-Fi on planes, some of this undoubtedly reflects rising expectations for what is objectively an incredibly impressive technology. Still, an increasing number of people whom I respect think artificial general intelligence (AGI) — usually defined as being able to match or exceed human levels at nearly all (cognitive5) tasks — might be as little as a year or two away.
But if there are examples where LLMs already seem to have superhuman capabilities, they’re very far from it in poker. And I’d argue that poker is a better test of general intelligence than some of the more discrete tasks that ChatGPT performs so well.
Computers can play poker well, but ChatGPT can’t
Before we get into the weeds, let me clear up one point of potential confusion. To a strong first approximation, computers are already better than humans at poker. Let me explain what I mean by that, and why it doesn’t contradict what I just wrote above.
Poker strategy has been revolutionized by the use of solvers, algorithms that provide a close approximation of game-theory-optimal (GTO) strategy, more technically known as a Nash equilibrium. Personally, I don’t consider solvers themselves to be examples of “artificial intelligence” — they don’t use machine learning, and instead are basically solving complex equations in a prescribed way that converges on a deterministic solution. But this is a slightly fussy distinction. Solvers are highly impressive, and they’ve contributed to more sophisticated poker play. (A skilled amateur from today who has picked up some solver basics through osmosis might well be better than an elite player from 20 years ago.) Basically, you can now look up the “right” answer for how to play your hand in any given poker situation, provided that — and this is an important proviso6 — you’ve given it the right inputs. In addition, there are various computer programs that use more traditional AI/machine learning techniques, either to enhance solver outputs7 or to derive a poker strategy from scratch.
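For a flavor of what that prescribed, converging procedure looks like, here is a minimal sketch of regret matching, the building block of the counterfactual regret minimization (CFR) family that poker solvers are generally understood to build on. It’s applied to rock-paper-scissors rather than poker so the whole “solve” fits on a page; the average strategy converges on the Nash equilibrium of one-third each:

```python
import random

ACTIONS = 3  # 0 = rock, 1 = paper, 2 = scissors

def utility(a, b):
    """Payoff to the player choosing a against b: +1 win, 0 tie, -1 loss."""
    if a == b:
        return 0
    return 1 if (a - b) % 3 == 1 else -1

def current_strategy(regrets):
    """Regret matching: play each action in proportion to its positive regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    return [p / total for p in positive] if total > 0 else [1.0 / ACTIONS] * ACTIONS

def solve(iterations=200_000):
    regrets = [[0.0] * ACTIONS for _ in range(2)]
    strategy_sums = [[0.0] * ACTIONS for _ in range(2)]
    for _ in range(iterations):
        strategies = [current_strategy(regrets[p]) for p in range(2)]
        moves = [random.choices(range(ACTIONS), weights=strategies[p])[0] for p in range(2)]
        for p in range(2):
            opponent_move = moves[1 - p]
            realized = utility(moves[p], opponent_move)
            for a in range(ACTIONS):
                # Regret: how much better action a would have done than what we played
                regrets[p][a] += utility(a, opponent_move) - realized
                strategy_sums[p][a] += strategies[p][a]
    # It's the *average* strategy over all iterations that converges to equilibrium
    return [[s / iterations for s in strategy_sums[p]] for p in range(2)]

print(solve())  # both players' averages approach [1/3, 1/3, 1/3]
```

Poker solvers run essentially this sort of loop over billions of game states, which is why they demand so much compute.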
Those are computer tools specifically designed to play poker, however. What would be more impressive — more of an indication of general intelligence — is if large language models like ChatGPT that weren’t designed for poker could converge on a good strategy just by crunching textual data and learning to reason from it.8 But for now, they aren’t close to doing that well.
I asked ChatGPT to simulate a poker hand. It got pretty much everything wrong.
I gave ChatGPT’s o3 model — which I generally find reliable for a variety of tasks — the following prompt:
Simulate a Texas no limit hold 'em hand involving eight players with stacks of $5000 each in a $5/$10 game with a $25 straddle. Narrate the hand, including names/backstories for each player, but please draw the cards randomly and honestly rather than preordain the outcome.
Unavoidably, the rest of this post will require some forays into poker strategy and terminology. I’ll try to smooth things over as much as possible by consigning some of the details to footnotes. (There are going to be a lot of footnotes.) But this is a nerdy post — quite honestly, among the nerdiest Silver Bulletins ever.
This prompt describes a fairly typical setup for a no-limit hold ‘em cash game.9 Traditionally, hold ‘em hands start with two forced bets called blinds — here, the small blind is $5 and the big blind is $10 — that rotate around the table. (Without blinds or antes, there’s no pot to compete for, so poker is a broken game.10) In this example, as is common in live cash poker these days, the players have agreed to add a third blind bet called a straddle for $25 to induce additional action.
Let’s start by introducing the cast of characters that ChatGPT came up with:
I’d love to play in this game. Every player seems to be some species of fish who will either be predictably too tight or too loose. (Also, some of the ethnic stereotypes are a bit on the nose, like DJ “J-Chill” Jackson Lee — although I did once play against the famed DJ Steve Aoki late one night at the Bellagio after his set was done.)
So far, this isn’t so bad. ChatGPT labeled and ordered the positions correctly, for instance. No cards are duplicated, and the distribution of cards looks random enough — it’s not dealing everyone pocket aces and kings. The stack sizes are weird, though. Instead of everyone starting with $5,000 like I requested, players have mysteriously lost some chips off their stacks. This was a consistent problem in my testing — ChatGPT couldn’t reliably keep track of stacks11 — in some cases prematurely subtracting chips based on what would happen later in the hand.
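For what it’s worth, this is bookkeeping that a dozen lines of code handle trivially, which makes the failure all the more striking. A minimal sketch, using the player names from ChatGPT’s cast and illustrative bet sizes:

```python
class Table:
    """Minimal chip ledger: every action flows through bet(), so the
    pot and the stacks can never drift apart the way ChatGPT's did."""

    def __init__(self, players, stack=5000):
        self.stacks = {p: stack for p in players}
        self.pot = 0

    def bet(self, player, amount):
        if amount > self.stacks[player]:
            raise ValueError(f"{player} has only ${self.stacks[player]} behind")
        self.stacks[player] -= amount
        self.pot += amount

    def award(self, winner):
        self.stacks[winner] += self.pot
        self.pot = 0

table = Table(["Vic", "Kat", "Rob", "Grace"])
table.bet("Grace", 25)  # straddle
table.bet("Kat", 225)   # illustrative 3-bet size
print(table.pot, table.stacks["Kat"])  # 250 4775
```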
So let’s go through this hand in the style of Sam Greenwood’s Punt of the Day and see what happens once the cards are dealt:
This starts out reasonably enough. Vic opens the action by raising with a suited ace — A♠️6♠️ — which is borderline by GTO standards but totally fine. Kat re-raises with pocket 9’s (9♠️9♥️). This is definitely a hand you’ll want to play; calling rather than raising would be acceptable too, but you can’t fold. Then Rocket Rob re-re-raises (4-bets) with ace-king (A♥️K♣️). Technically, this is a close decision with two other players already having shown such strength.12 But the overwhelming majority of players in a real poker game are going to raise, just like Rob did.
Then things go off the rails. Facing three other players who have already shown aggression, Prof. Grace Lin calls with a meager hand, a low offsuit ace (A♦️2♣️) that has absolutely nothing going for it. ChatGPT says she’s “liking the direct odds,” but that doesn’t make any sense: her odds are terrible. She’s only put $25 in the pot and now has to call $625 more.
In fact, Grace is in just about the worst situation you can encounter in poker. She figures to have the fourth-best hand among the four players in the pot, and now she’s playing a huge pot where she’s putting her entire $5,000 at risk but will rarely have the goods to back it up.13 A solver says you’re basically only supposed to play pocket aces (e.g. A♦️A♣️) in her shoes, plus a smattering of bluffs14 — and all of these hands would be played for another raise rather than a call. Real-life players will almost certainly be looser than the computer, continuing with pocket kings, pocket queens, and probably pocket jacks and AK. Certainly, you’ll also encounter some players who find a hand like ace-queen suited (e.g. A♦️Q♦️) or pocket tens too pretty to fold. But even a 99.9th percentile fish isn’t playing a hand like A♦️2♣️ facing such bad odds and so much aggression. And although Grace is an inexperienced player according to ChatGPT’s backstory, she’s a “math professor taking a shot”, which implies that she should know the odds and is probably playing tight.15 The story doesn’t add up at all.
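It’s worth pausing on just how bad Grace’s price is. The required equity for a call is call / (pot + call); the exact pot here depends on raise sizes that ChatGPT itself miscounted, so treat the pot figure below as illustrative:

```python
def required_equity(call, pot):
    """Fraction of the time a call must win to break even,
    ignoring future betting (which only makes things worse out of position)."""
    return call / (pot + call)

# Grace must call $625 more; suppose roughly $1,000 is already in the middle
# after the blinds, straddle, open, 3-bet and 4-bet (an illustrative figure).
print(f"{required_equity(625, 1000):.0%}")  # ~38%
```

A♦️2♣️ against three raising ranges has nowhere near that much equity, and the deep stacks mean her true price is even worse than this naive calculation suggests.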
Vic, wisely, finds the obvious fold with his suited ace, even though it’s a better hand than Grace’s offsuit ace. Kat could probably fold too with her pocket 9’s, but continuing isn’t too bad.16 ChatGPT is also counting the pot slightly wrong17, but that’s the least of my concerns for now.
Onto the first three cards, a.k.a. the flop:
Every player just checks, and honestly, this is fine. Rob’s ace-king doesn’t like seeing two queens on the board, and although in theory you might bet his hand some of the time18, I wouldn’t bet it myself, because I’d assume Grace had a great hand after calling against three other opponents.
The next card, the turn, puts a second 4 on the board:
This is what poker players call a “blank”: it’s unlikely that any player reached this point in the hand with a 4, a weak card. So everybody checks again. But this is probably a mistake, at least for Kat.19 Facing so many checks so far, her pocket 9’s have increased a lot in relative value and are sometimes the best hand20; she also might be able to get a slightly stronger hand like TT or JJ to fold.21
The final card, the river, is another innocuous one, the 5♠️, that shouldn’t connect with any player’s hand:
Now Grace bets out, but it’s a small bet — $500 into the $2,070 pot. Although you will sometimes encounter players in the wild who make small bets like these when they get confused and don’t know what else to do, it’s a terrible play. If Grace is bluffing — and at this point, she is bluffing22 — the bet needs to be larger to put more pressure on her opponents.
ChatGPT’s explanation might sound smart if you don’t know poker, but it’s basically just word salad. Grace isn’t “playing her position” — she’s actually in the worst position. She does have some good reasons to bet, but either the bet needs to be larger, or her hand needs to be stronger.23 Kat’s call with pocket 9’s is correct, but given how weirdly Grace has played the hand and the odds Kat is getting, it’s an easy call, not a “sigh-call”.24 Finally, Rob’s call at the end is terrible. At best, he’s splitting25 the pot two or three ways against other ace-highs. But far more often, he’s losing to at least one of his opponents (as he is to Kat in this case). Raising as a bluff could be a sexy play26 — but if Rob isn’t raising, he should fold. So once again, ChatGPT has him picking the very worst option. And the jargon it uses — “recognizing that raising serves no purpose” — isn’t on-point either. Rob might well be able to bluff out better hands with a raise, which is one of the primary purposes of raising.
So what explains all these dubious decisions? Hold that thought, because we haven’t even gotten to the worst part of ChatGPT’s response. It turns out it didn’t even know how to calculate who won the hand:
In hold ‘em, a player’s final hand is the best five cards they can assemble from any combination of their two hole cards and the five shared community cards (the flop/turn/river). So in this case:
Grace and Rob both have two pair, Q’s and 4’s, with an ace kicker. That is, they now have the same holding.27
Kat, however, has a higher two pair — Q’s and 9’s, namely the pair of queens from the board and the pair of nines from her hand. The pair of fours is redundant for her hand28.
So Kat just won a huge pot. But instead, ChatGPT splits the chips between Grace and Rob. Once in a blue moon, you’ll see dealers make mistakes this bad, but it isn’t common. And if ChatGPT didn’t even know to read a poker board amidst all this excitement, one wonders how that affected its strategic choices throughout the rest of the hand.
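Reading the board is, again, purely mechanical: brute-forcing the best five of seven cards takes a few dozen lines of standard-library Python. Here’s a sketch (the suits of the queens and fours are illustrative, since ChatGPT’s transcript doesn’t pin them all down):

```python
from collections import Counter
from itertools import combinations

RANKS = {r: i for i, r in enumerate("23456789TJQKA", start=2)}

def score5(cards):
    """Rank a five-card hand as a comparable tuple: (category, tiebreak ranks).
    Wheel-straight tiebreaks are simplified; fine for this board."""
    ranks = sorted((RANKS[c[0]] for c in cards), reverse=True)
    counts = Counter(ranks)
    ordered = tuple(sorted(counts, key=lambda r: (counts[r], r), reverse=True))
    shape = sorted(counts.values(), reverse=True)
    flush = len({c[1] for c in cards}) == 1
    straight = len(counts) == 5 and (ranks[0] - ranks[4] == 4 or ranks == [14, 5, 4, 3, 2])
    if straight and flush:          category = 8
    elif shape == [4, 1]:           category = 7
    elif shape == [3, 2]:           category = 6
    elif flush:                     category = 5
    elif straight:                  category = 4
    elif shape == [3, 1, 1]:        category = 3
    elif shape == [2, 2, 1]:        category = 2
    elif shape == [2, 1, 1, 1]:     category = 1
    else:                           category = 0
    return (category, ordered)

def best_hand(seven):
    return max(score5(combo) for combo in combinations(seven, 5))

board = ["Qs", "Qd", "4c", "4h", "5s"]
holes = {"Kat": ["9s", "9h"], "Grace": ["Ad", "2c"], "Rob": ["Ah", "Kc"]}
scores = {name: best_hand(cards + board) for name, cards in holes.items()}
print(max(scores, key=scores.get))  # Kat: queens and nines beats queens and fours
```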
Can ChatGPT make any more errors? Uhhh… yes, actually:
ChatGPT says Kat lost $650 in the hand. But if you ignore that she should have been awarded the pot, she actually lost $1,150 in the hand, putting in $650 before the flop and then another $500 on the river. So to summarize, ChatGPT: 1) made inexplicably poor strategic decisions throughout the hand, one of which (Grace’s original call) was among the most -EV plays you’ll ever encounter in poker; 2) awarded the pot to the wrong player(s); 3) repeatedly failed to calculate the pot and the stacks correctly. I wasn’t expecting it to get everything right, and this is a difficult prompt, but it went wrong at almost every turn.
What does this tell us about the current state of LLMs?
I might be more sympathetic — but this was literally the first hand I asked ChatGPT to simulate for me. Still, to give it a second chance, I used Deep Research to play out a complete orbit of eight hands. Deep Research is expensive — a ChatGPT Pro subscription costs $200 a month — but it takes more time with its answers and brings more compute to the problem.
To Deep Research’s credit, it seemed to understand this was a difficult task: “That sounds like a rich simulation,” it told me. And it did better than in the original example — it could hardly have been worse. But it still made a lot of egregious errors.
You can find its response here or in the PDF above. Deep Research repeatedly failed to calculate stack sizes correctly. It also hallucinated (in hand #8) the presence of a flush draw in spades, which significantly altered the strategy. And it employed many poker terms in a garbled and incorrect way.
And although it’s a small sample size, it also showed signs of an imbalanced strategy, almost always bluffing for huge amounts when flush draws and straight draws missed — this would be easy to exploit29 — while being reluctant to bet for value.30 If employed in a poker room, this approach would lose money at an incredibly high rate.
To be fair, poker is a difficult game. Intermediate players rarely derive the right play from first principles. Instead, they weigh candidate plays against competing, sometimes crude heuristics.31
If you’re being attentive, you can pick up a lot of information from physical tells or priors based on a player’s backstory; if you’re not, you’ll often give away more information than you receive. A great player, at least in higher-leverage spots, needs to both focus intently on each decision and account for this background context. Often, there’s a trade-off.32
Likewise, LLMs seem to get stressed out when you ask them to do too much at once. For instance, if you just ask o3 to evaluate whether 99 or A2 wins on a QQ445 board as the initial prompt, without the context of simulating an entire poker hand, it gets the answer right. It’s when you layer tasks on top of one another that LLMs more often hallucinate or glitch out.
In part, this reflects how some of the techniques LLMs use to respond to more complex prompts with larger context windows are relatively new. Benjamin Todd has an excellent overview of this in his essay “The case for AGI by 2030”. The idea of building layers of “scaffolding” so that LLMs can break down a complex problem into discrete steps — for instance, in the context of poker, counting the pot correctly, reading the board correctly, and then making reasonable strategic decisions at various points in the hand — may work well enough if there are two or three steps in the chain. But the more steps you add, the more likely you are to get an error that renders the output useless or worse. Here’s Todd:
Any of these measures could significantly increase reliability, and as we’ve seen several times in this article, reliability improvements can suddenly unlock new capabilities:
Even a simple task like finding and booking a hotel that meets your preferences requires tens of steps. With a 90% chance of completing each step correctly, there’s only a 10% chance of completing 20 steps correctly.
However with 99% reliability per step, the overall chance of success leaps from 10% to 80% — the difference between not useful to very useful.
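Todd’s arithmetic is easy to check for yourself: per-step reliability raised to the number of steps is the whole story (he rounds 12 percent down to 10 and 82 up to 80):

```python
for per_step in (0.90, 0.99, 0.999):
    success = per_step ** 20  # probability of getting all 20 steps right
    print(f"{per_step:.1%} per step -> {success:.0%} over 20 steps")
# 90.0% per step -> 12% over 20 steps
# 99.0% per step -> 82% over 20 steps
# 99.9% per step -> 98% over 20 steps
```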
Like Todd, I expect LLMs to continue progressing to the point where we probably get something that most people would agree is AGI. However, I also suspect the current state of the models when employing this scaffolding process is worse than he assumes. I’ve mostly had a great experience with ChatGPT when I’m able to coax it along from step to step. And I’ve increasingly seen it be able to handle prompts well when I ask it to do two or three things together, e.g. “figure out the right mathematical function for this thing I’ve stated in natural language and then write some Stata code for it”. But add much more complexity than that, and I’ve found performance often drops off a cliff and that I’m wasting my time and my Deep Research tokens.
Why poker is a good test of AGI
Could poker be a good way to diagnose when and if these models do achieve AGI? Maybe. One reason I like using poker to test LLMs is that I suspect the major AI labs aren’t paying much attention to poker specifically.
In contrast, for higher-prestige questions like solving the Math Olympiad, there may be some degree of “teaching to the test”. That is, if the AI labs know which questions will be used to benchmark LLMs, and that they’ll get to brag about the results in press releases, they’ll work to optimize their performance for those.33 There’s also the issue that the answers to these problems may be contained somewhere in the training data, whereas publicly available text content on poker is often mediocre.34
I’d surmise it would be pretty easy to train the models to recognize when they were asked poker questions and summon additional resources accordingly, like referencing the poker rulebook or performing GTO Wizard simulations. Operating in this mode, LLMs serve as CEOs that learn to delegate specific tasks to other types of AIs in response to certain prompts, or even delegate them to humans.35
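The plumbing for that kind of delegation is simple; the hard part is everything behind the stubs. A toy sketch, where both backend functions are hypothetical placeholders rather than real APIs:

```python
def query_solver(prompt: str) -> str:
    # Stub for a hypothetical solver backend (e.g., a GTO Wizard-style API)
    return "solver says: ..."

def answer_with_llm(prompt: str) -> str:
    # Stub for the model's ordinary completion path
    return "llm says: ..."

POKER_TERMS = {"flop", "turn", "river", "straddle", "pot odds", "3-bet", "gto"}

def route(prompt: str) -> str:
    """Toy dispatcher: send poker-shaped prompts to the specialist tool."""
    text = prompt.lower()
    if any(term in text for term in POKER_TERMS):
        return query_solver(prompt)
    return answer_with_llm(prompt)
```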
But is that cheating? Well, it depends on whether you think LLMs themselves are a pathway to AGI, as opposed to AGI being achieved through a lattice of overlapping techniques. During the COVID pandemic, the phrase “Swiss Cheese Model” was sometimes used to denote the various interventions that were in place — masks, social distancing, vaccines, and so on. None of these were foolproof, but the hope was that the holes wouldn’t line up.
For LLMs on their own, however, poker is a hard problem because it requires working at various levels of abstraction:
On the one hand, the rules of the game are fixed and there is basically a mathematically correct answer to any given situation, but it’s extremely complicated to derive.
On the other hand, there are a lot of fuzzy contextual factors when humans play poker — more so than in chess, for example — from picking up on tells to making adjustments when opponents don’t play like a computer might.
Somewhere in between, there is an optimal strategy given the specific parameters of the situation. But these fuzzy factors can have cascading effects: a single deviation at any node of the game can radically alter the optimal strategy throughout the entire hand or even the entire game or tournament. So the solutions are relatively fragile as compared with a “game” like natural language processing, where, say, a misspelled word in a ChatGPT prompt will usually present no problem.
My theory is that the Nash equilibrium for poker is complex enough — full of mixed strategies and highly context-dependent assumptions — that it’s hard for it to emerge organically with an LLM or another model that isn’t explicitly trained for this purpose.36 So how about getting there through a patchwork of loose heuristics instead? Well, there’s a fine line between winning heuristics and losing ones. For instance, a heuristic that ChatGPT seems very fond of — “my draws missed, therefore I have to bluff, and since I have nothing I have to bluff huge” — is a phase that every human poker player has gone through at some point. You might even get away with it the first couple of times. But in general, you’re risking a lot to win a little. In theory, this approach is bad and it’s often even worse in practice.37
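The arithmetic behind “risking a lot to win a little” is simple: a pure bluff of size b into a pot of p must succeed b/(b+p) of the time just to break even, so the huge missed-draw bluffs ChatGPT favors have to work very often:

```python
def breakeven(bet, pot):
    """How often a pure bluff must succeed to break even: risk / (risk + reward)."""
    return bet / (bet + pot)

for bet in (500, 2070, 4140):  # small bet, pot-sized bet, 2x-pot overbet
    print(f"bluff ${bet:,} into $2,070: must work {breakeven(bet, 2070):.0%} of the time")
# 19%, 50%, 67%
```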
Expert human players iterate back and forth between learning through solvers and by experience on the poker felt. They might look up a solver solution from a hand that knocked them out of a tournament and realize they were playing a certain spot incorrectly. They try to implement the fix in their gameplay and usually, after some practice, they understand what the solver was doing and develop a better feel for the situation. But there can still be pitfalls to this approach. Some solver solutions are exceptionally complex. For instance, in this GTO Wizard simulation of a common poker spot38, the player in position uses a mix of checks and five different bet sizes, ranging from 20 percent to 125 percent of the pot. So players learn when there’s significant EV loss from simplifying their strategies and when they can get away with simpler strategies (“always bet one-third of the pot”) that are harder to screw up.
And in live poker, there are even more factors to consider, such as the past history with opponents and intuitions based on physical reads. Also, the solver solutions aren’t helpful if they don’t match how their real-life opponents are playing. I’ve seen plenty of real-life WSOP hands where some European whiz kid implemented a fancy solver-driven strategy when instead they could have maximized their profit by playing straightforwardly because their opponent was a fish.
My guess is that LLMs have trouble going back and forth between these different levels of abstraction. Solvers are quite computationally intensive, and yet they don’t even consider many of the factors that matter in real-world poker games. So when you ask an LLM to consider other contextual factors too — like creating backstories for each player — it’s just too heavy of a cognitive load.
They’ll probably get there eventually. But if the future is unevenly distributed, artificial intelligence may be, too. I expect the pathway to AGI to be patchy, with miraculous-seeming breakthroughs on some tasks but others where humans will continue to perform better than machines for many years to come.
My partner and I are hoping to spend 5-ish days in Norway in late August and take in the fjords and so on; we don’t care about cities at all. If you have recommendations on a good itinerary, please let me know!
Our college basketball ratings pull in data from more than a half-dozen sources, and each of these designates the 364 Division I teams in different ways, e.g. Michigan State University might be “Michigan St.” in one system, “MSU” in another, “MICH STATE” in a third, etc. There’s no universal standard. ChatGPT was remarkably good at matching up these different team names.
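For the curious, a crude version of this matching can be done with standard fuzzy string comparison; the sketch below (a toy team list, not our actual pipeline) also shows exactly where it breaks down and the LLM earns its keep:

```python
from difflib import get_close_matches

canonical = ["michigan state", "michigan", "mississippi state", "montana state"]

for alias in ["Michigan St.", "MICH STATE", "MSU"]:
    hit = get_close_matches(alias.lower(), canonical, n=1, cutoff=0.5)
    print(alias, "->", hit[0] if hit else "no match")
# "Michigan St." and "MICH STATE" resolve correctly; "MSU" finds nothing,
# which is exactly the case where ChatGPT's background knowledge helps.
```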
Another annoying task: our ratings incorporate the location (city and state) of where each game is played, to account for travel distance. But this data is often incomplete in our database of games: for instance, the name of the arena will be listed but not the city and state. ChatGPT was quite good at looking these up and identifying the right cities even when information was highly incomplete. I checked its work, but this took 20 minutes or so instead of the several hours it would have taken from scratch.
And I know a relatively large amount about their inner workings from the extensive reporting I did on AI in my book.
Some definitions include the “cognitive” qualifier — one common formulation is any task that can be performed via remote work or via a computer — while others do not. If you don’t include it, and also require the models to perform well at tasks that involve the manipulation of physical space — say, being a good plumber — AGI is clearly much further away.
If your assumptions are wrong — particularly if your opponent doesn’t play like the computer recommends — you can still have a GIGO problem. This isn’t really the solver’s fault, though. In fact, you can “node lock” a solver (that is, input deviations from optimal strategy that you expect your real-life opponents to make) and it will adjust accordingly.
Solver outputs take a long time to generate and require a lot of computing power. For instance, you might run one “solve” for a hand where each player starts with 100 big blinds, and another where each player starts with 50, but in the actual game you’re playing, every player has 62 big blinds instead. AI-enhanced solvers, at least as I understand them, can extrapolate from the solved positions to create a reasonably good approximation of any poker situation. Or at least, any situation where the action after the flop is heads-up (just two players remaining). They still have a long way to go when multiple players are involved in the hand.
ChatGPT also suggested to me that it lacks the ability for reliable reinforcement learning in poker because it’s relying on “sparse textual feedback” rather than the actual EV of certain plays, as poker-dedicated AIs do. The situation is better in other games because the text-based content on the Internet is better. “By contrast, models shine at chess or Go commentary because high-quality game transcripts are abundant and structured,” it said.
As opposed to a tournament. In tournaments, players keep competing until all but one player has been eliminated and the last player standing has all the chips. In a cash game, you can get up and leave at any time, keeping your winnings.
Basically, the strategy when there’s no money in the pot is only to play the best possible starting hand — pocket aces in Texas hold ‘em — and fold everything else. That isn’t very fun.
Here’s ChatGPT’s explanation for this: “Even when a model does track the state in its text window, it must update that state after each betting action. Transformer self-attention gives every token equal status, so earlier “pot = $2 070” tokens still influence later logits unless the prompt explicitly overwrites them. This leads to ‘ghost chips’.”
The solver GTO Wizard mixes in some folds with ace-king offsuit along with 4-bets.
Plus, she’s out of position — she’ll act first throughout the remainder of the hand, which is a disadvantage since she’ll have less information — and she isn’t even closing the action; Vic or Kat could still re-raise again. And the stacks are very deep; she’s putting all $5,000 potentially at risk if she continues in the hand. The other problem with A♦️2♣️ is that it’s often dominated: if she makes a pair of aces, one of her opponents will often make a better pair of aces with a higher kicker, like Rob with his A♥️K♣️.
The bluffs are mostly other hands that contain an ace, because the ace is a blocker: it makes it combinatorially less likely that other players hold pocket aces, one of the few hands they should continue with in this spot. If Grace were to re-re-re-raise (5-bet) as a bluff with her A♦️2♣️, that wouldn’t be that bad, though the solver prefers suited aces (e.g. A♦️5♦️) to give her a backup plan of being able to make a flush in case she gets called. But calling is terrible, much worse than either folding or raising.
“Taking a shot” means you’re playing for higher stakes than you usually do, which generally implies avoiding big pots.
If everyone else has folded and Kat is playing heads-up against Rocket Rob, 9’s are basically indifferent between calling and folding; she’s getting good odds, but she’s out of position and will often be playing against a bigger pair like kings. When Grace enters the pot too, though, Kat is supposed to have an incredibly tight range of hands because Grace’s range is also very tight.
It’s double-counting Grace’s straddle, since it’s both accounted for separately and included in the $650 it says she’s put in.
The queens aren’t actually that threatening because in GTO world, most hands containing a single queen are too weak to call given the extremely aggressive action so far. Instead, players’ ranges will mostly consist of pocket pairs like JJ or exactly ace-king (AK). They can have four-of-a-kind queens if they started with QQ, but this is combinatorially unlikely.
And possibly for Grace, too. She could consider bluffing, because the other players are clearly terrified by her call preflop.
Kat could be up against two players both holding AK, for instance. In addition to having the best hand currently, she is also incentivized to bet for protection because an ace or king on the river would give her opponents a better hand.
This will sometimes require another bet on the river. But JJ and TT should be heavily represented in both Grace’s and Rob’s ranges and probably need to fold if she bets twice.
Often, she’ll be targeting other ace-high hands like Rob’s AK so that she can win the whole pot instead of splitting it. But she’ll rarely have a better hand than even one of her opponents, let alone both.
If Grace did have a hand like JJ, she could make a small bet for value, hoping to get called by hands like AK or 99. Then she could balance that by making small bets with very strong hands like QQ (for quad queens), hoping to induce aggression from her opponents and then re-raise.
The more that a hand goes off the rails from GTO strategy and it’s impossible to guess what your opponents are doing, the more you sometimes just have to look at pot odds.
So calling is better than folding. But actually Kat’s best play might be to raise. Grace is often representing a hand like JJ here, and Kat looks like she might have been slowplaying AA, QQ or AQ the whole way. Rob might also have a hand like JJ, which could get curious and call Grace’s small bet even if Kat calls too, but will fold if Kat re-raises.
Rob can credibly represent QQ — for four-of-a-kind queens — or AQ for a full house. On the flop and turn, such hands are already strong enough that they basically can’t lose and so have some incentive to slowplay to induce action when opponents make a second-best hand.
Rob’s K doesn’t play over Grace’s 2 because each of them already has a complete five-card hand: two queens, two fours, and an ace.
In fact, she actually plays the 5 from the board for her kicker, not either of the 4’s.
You’d just trap/slowplay with various medium-to-strong hands to induce huge bluffs from your opponents, and then check-fold when the draws hit.
Except for Bob’s huge overbet with a set of queens in hand #8, which is a poor play because his opponent can have a lot of straights and flushes; he’s folding out her weaker hands, while getting called by better ones.
For instance, with a medium-strength hand — say, J♦️7♦️ on a flop of J♣️9♥️6♣️ — you might be torn between the heuristic that you should bet to protect your hand (almost every card on the turn potentially makes your opponent a better hand) and the heuristic that you should check to keep the pot small. These are tough situations to play.
I have a lot of respect for “feel players” who just go by vibes, but if they’re routinely making technical errors of a large enough magnitude, they have little hope of being a +EV player.
ChatGPT actually agrees with this critique! “There are emerging LLM poker benchmarks (e.g., wrapper agents that call PioSolver for adjudication), but they haven’t seeped into headline capability reports, so labs optimise for math Olympiads instead.”
Most of the best poker training material — I consume a lot of it — is delivered by video rather than text, through personal coaching sessions, or behind paywalls.
I actually encourage LLMs to do this when I use them, with prompts like “state your confidence level” or “leave the answer blank if you’re not sure so I can look it up myself”.
ChatGPT told me that coming up with reactive game theory exploits could be even harder. “Poker is adversarial and reactive: optimal play depends on how this set of agents deviates from GTO. LLMs trained on static corpora have no on-the-fly learning loop, so they can’t adapt ranges across hands.”
Under GTO, these huge bets are rarely made because players can meet their minimum defense frequency by calling only with very strong hands. In practice, they’ll get more ambitious about calling with bluff-catchers too if they see you making this play repeatedly.
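In code form, the minimum defense frequency is just pot / (pot + bet), which is why big overbets let the defender fold so much of their range:

```python
def mdf(pot, bet):
    """Minimum defense frequency: fold more often than this and a pure bluff auto-profits."""
    return pot / (pot + bet)

for mult in (0.5, 1, 2, 4):
    print(f"{mult}x-pot bet: defend {mdf(1, mult):.0%} of range")
# 0.5x: 67%, 1x: 50%, 2x: 33%, 4x: 20%
```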
200 big blinds deep in a no-limit cash game, after the cutoff opens and the big blind defends.