45 Comments
Lukas Nel:

The issue with your scenario is that you're hitting the things LLMs are bad at. LLMs don't see letters; they see parts of words (tokens), so they have historically struggled with counting letters and with any form of arithmetic, which makes asking one to do the math for the poker pot very difficult. In addition, you're having it simulate an entire game between players instead of recommending moves to play. This leads it to use narrative tropes to craft its answers, so you get a game of poker that reads like one in a novel, where the author doesn't really know the poker mechanics but knows which story beats he wants to hit.

The better evaluation would be to have it be a player in a series of live poker games and then see how well it does.

Christina Moraes:

Finish the article…

What you said is literally the entire thesis of Nate’s article and is stated pretty clearly to me in the conclusion

CJ in SF:

Lukas is making a statement of fact, not attempting to contradict Nate.

Nate doesn't understand how LLMs work, so he is describing a simple test he did and analyzing the results in detail.

Lukas is saying it was doomed from the start, and his substack page and a few Google searches make it clear he has the expertise to make this statement.

MarkS:

No LLM has ever managed to learn the rules of arithmetic. Why anyone believes that they are capable of learning the rules of far more complex systems is a great mystery to me.

CJ in SF:

LLMs don't learn rules.

And most humans don't reason with "rules" the way they believe they do.

TurboNick:

You say that AI models hallucinate very rarely, but even if they only do it 1% of the time, doesn’t that massively detract from the value of what they produce? If that 1% means you have to double check everything they do - even the 99% that’s correct - that undermines their value considerably.

Slaw:

Plus, more frequency means more mistakes. If you leverage some AI tool with a 1% failure rate once a year, that's maybe not too bad. If it's firing off millions of times an hour, that's an entirely different story.
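
The back-of-the-envelope arithmetic makes the point; the hourly volume below is just an illustrative assumption, not a figure from the article:

```python
# Illustrative only: a 1% failure rate at high volume means a lot of absolute failures.
failure_rate = 0.01
calls_per_hour = 1_000_000  # hypothetical volume
print(f"Expected failures per hour: {failure_rate * calls_per_hour:,.0f}")  # -> 10,000
```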

Mo Diddly:

Do you fire employees who get things wrong 1% of the time?

CJ in SF:

Is the employee a pharmacist, surgeon, bus driver, etc?

And how wrong are we talking about?

Dean Flamberg:

Using o3 instead of o4 seems like setting up ChatGPT for failure. I've found o4 superior even for basic data calculations like providing cumulative data for many election cycles. I asked ChatGPT o4 how it would likely do compared to o3 in poker, and it replied:

Overall Efficiency for Poker

Speed & Cost: o4 is around 2–3× more efficient operationally.

Strategic Depth: Both models perform similarly in terms of intelligence and reasoning, but o4’s larger context means better memory of prior actions, crucial for bluff detection, bet sizing, and tracking player styles.

Stamina: For continuous games or training simulations, o4 scales far better.

Matt Glassman:

I'm working on an Oh Hell book---draft deadline 7/1!---and this almost precisely mirrors my experience with ChatGPT. It has been incredibly helpful for a lot of the writing---copy editing, organizational suggestions, etc.---but it just cannot play the damn game. It makes laughable structural and substantive errors, and brings a hilarious confidence that puts you in the overt uncanny valley every time it talks at all about strategy. And I'm literally teaching it the strategy when I show it my writing. The whole experience has made me both more in awe of LLMs---no idea how long the manuscript would have taken me without LLM feedback---and very skeptical that we are super close to an AGI coming out of this particular paradigm absent another breakthrough.

Slaw:

Duh!

Caleb Begly:

The issue is that o3 doesn't actually understand poker at all. Rather, it is able to produce a simulacrum of a poker game based on previous written descriptions it has seen. This is why it is generally able to get the structure, steps, terminology, and form correct while completely whiffing on the actual logic of the game. This is also not something that a model like o3, due to its architecture, is likely to improve on, even with more training. It requires a fundamentally different architecture to approach this type of problem (for example, some of the reasoning architectures may produce better results over time).

Caleb Begly:

This is also one of the reasons transformer-based models in particular are so confusing to people in terms of their capabilities. They can (for example) easily generate code to count the number of letters in a word, but if you ask them to actually count the letters themselves, they struggle. If you understand the actual architecture of such models, this behavior is expected, but for someone who has been told the model is "nearly AGI" it is confusing.
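
To make the asymmetry concrete, here is a minimal sketch of the kind of code such a model can write effortlessly even when it can't reliably do the same task in its own output (the word and letter are arbitrary examples):

```python
# Counting letters is trivial for generated code, but hard for the model itself,
# because it processes tokens (chunks of characters) rather than individual letters.
def count_letter(word, letter):
    return sum(1 for ch in word.lower() if ch == letter.lower())

print(count_letter("strawberry", "r"))  # -> 3
```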

alguna rubia:

One question I think we should ask ourselves is whether we're expecting a human level of intelligence or whether we're expecting a smart expert level of intelligence. Right now I think AI is a lot like a college student who's used to BSing their way through essays. I think a lot of humans are bad at math and poker strategy and are likely to lose track of stacks and misunderstand poker strategy. Is AGI supposed to be on the level of an average human, or on the level of an expert?

For example, my company is trying to replace some of the Bangalorean entry-level contractors we use with AI, and I think for the most part, that's doable. Our Bangalorean entry-level contractors are often easily confused. If you are asking them to do anything other than their most common rote tasks, it's often much better to go through their permanent, very smart Bangalorean supervisors with your request so that you can make sure your request is fully understood. AI is absolutely not at the point where the supervisors can be replaced. But the entry-level temp contractors are mostly handling rote tasks and only occasionally asked to do something more complicated. Once you train the AI on the rote tasks, I'm sure it'll accomplish them much faster than the humans would have.

David Watson:

I'm curious why you didn't direct it to use code to handle randomizing the deal and evaluating the winning hands. Otherwise, the LLM is literally incapable of drawing the cards truly randomly and honestly.

https://sanand0.github.io/llmrandom/

Anyway, I gave the code version a shot, I just added "Please execute code to handle randomizing dealing and the hand." to the end of your original prompt.

https://chatgpt.com/share/682e5484-d4b8-800c-ab74-4a33df06470a

I have no idea how to evaluate the strategy, but at least you can be sure the deal is truly random.
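
For reference, "executing code to handle the dealing" amounts to something like this minimal sketch using Python's standard random module; the player count and five-card board are assumptions matching a hold'em setup, not the exact code ChatGPT ran:

```python
import random

RANKS = "23456789TJQKA"
SUITS = "shdc"  # spades, hearts, diamonds, clubs

def deal(num_players=8, seed=None):
    """Shuffle a 52-card deck, deal two hole cards per player, and lay out a five-card board."""
    rng = random.Random(seed)
    deck = [rank + suit for rank in RANKS for suit in SUITS]
    rng.shuffle(deck)
    hands = {f"Player {i + 1}": [deck.pop(), deck.pop()] for i in range(num_players)}
    board = [deck.pop() for _ in range(5)]  # flop, turn, river
    return hands, board

hands, board = deal()
print(hands)
print("Board:", board)
```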

zahmahkibo:

Tried the same prompt on o3, specifying "eight *elite professional* players." I do not have the chops to tell if it's any better.

--- Excerpt ---

Pre-flop action

Kenji ( J♠ 2♦ ) instantly mucks.

Laura ( 4♣ 8♠ ) folds.

Marco ( 8♥ 5♦ ) flings his suited wheel-draw hand away.

Priya ( 6♣ 8♣ ) opens to $80 from the CO, eyeing position and a double-suited board.

Victor ( 7♠ 2♣ ) doesn’t defend the button—fold.

In the small blind Alex peeks at Q♠ T♠, likes the suited Broadway texture, and calls $75 more.

Sofia ( J♣ 3♦ ) elects to fold the big blind.

Dmitri re-checks A♥ T♥, slides out a $280 3-bet, typical of his 11 % 3-bet range versus LP opens.

Alex tanks, decides the squeeze is too rich and surrenders his $80 dead money.

Priya, in position with a sexy two-gap suited connector, calls $200.

Pot to the flop: $655.

Seneca Plutarchus:

It is not better.

"Marco ( 8♥ 5♦ ) flings his suited wheel-draw hand away."

This doesn't actually describe anything real about his hand.

"In the small blind Alex peeks at Q♠ T♠, likes the suited Broadway texture, and calls $75 more."

Actually a terrible play for an elite professional; this is a 3-bet all the time, so this is actually a strategic error.

"Sofia ( J♣ 3♦ ) elects to fold the big blind.

Dmitri re-checks A♥ T♥, slides out a $280 3-bet, typical of his 11 % 3-bet range versus LP opens."

Was there anything about a straddle in your setup of this hand that you cut out of what you posted here? Because Dmitri should not be in this hand unless he is straddling UTG, which is not mentioned at all. This is where straddle action would take place, after the big blind folds. The presence of a straddle makes Alex's QTs call even worse, BTW.

Adam:

I assume it's Nate's exact setup, which includes a straddle.

Of course, then there is the nonsense of referring to Q10 as a "two-gap connector" which... just no.

Kathleen Weber:

When you are training something from the Internet, generally speaking you're dealing with a situation of mediocrity in, mediocrity out.

comex:

ChatGPT’s mistakes seem to extend to how it assesses its own limitations.

“LLMs trained on static corpora have no on-the-fly learning loop, so they can’t adapt ranges across hands.”

Not so. If you wanted to make an LLM actually play poker against others, you would put the history of past hands in the context. Even a smallish context window should be enough to fit several hands’ worth of game history, and the modern 200k-to-1m-token context windows are way more than enough. How well the LLM makes use of the context is another question, but the claim that LLMs can’t adapt at all is a wild exaggeration.
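
For what it's worth, here is a minimal sketch of what "putting the history of past hands in the context" looks like, assuming the OpenAI Python client; the model name, hand histories, and prompt wording are all placeholder assumptions:

```python
# Sketch: feed the running hand history back to the model so it can adjust to opponents.
# Requires the `openai` package and an API key; everything below is illustrative.
from openai import OpenAI

client = OpenAI()

hand_history = [
    "Hand 1: BTN opened 2.5bb, SB 3-bet to 9bb, BTN folded.",
    "Hand 2: BTN opened 2.5bb, SB 3-bet again, BTN shoved A5s, SB called with QQ and won.",
]

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are playing no-limit hold'em. Recommend exactly one action."},
        {
            "role": "user",
            "content": "History of previous hands:\n"
            + "\n".join(hand_history)
            + "\n\nCurrent hand: you are in the small blind with KQo; it folds to you. What do you do?",
        },
    ],
)
print(response.choices[0].message.content)
```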

“Transformer self-attention gives every token equal status, so earlier “pot = $2 070” tokens still influence later logits unless the prompt explicitly overwrites them.”

This is dubious.

First, while it’s true that everything in a context window will have *some* influence, LLMs are definitely capable of distinguishing earlier tokens from later tokens (due to positional encoding) and deciding which tokens are more important (that’s the “attention” part of “self-attention”). It may be that the earlier tokens had too much influence, but again it’s not as categorical a problem as suggested.

Second, the only way a prompt can “explicitly overwrite” earlier tokens is just by being long enough that the earlier tokens scroll out of context. The quote makes it sound like you can ask the LLM to overwrite part of its context and it’ll do it. But LLMs don’t have the ability to edit their own context like that. That’s not to say that instructing the LLM to ignore earlier information is useless; the LLM will see the instruction and try to follow it. It already does an okay job at avoiding outdated information by default, and explicitly reminding it might make it focus more on this aspect and ultimately perform better at it. But this doesn’t work by “overwriting” anything, and it won’t be perfectly effective.

Third, I don’t know if the full answer you quoted from mentioned chain-of-thought, but it definitely should have, since it’s likely one of the most important factors. Among other things, “prematurely subtracting chips based on what would happen later on the hand” is almost impossible without chain-of-thought, since LLMs by themselves are going token-by-token and don’t have that precise of an idea of what will follow.

Bram Cohen:

In Go, which AI completely dominates, there are exploitative strategies that humans can use to defeat the normally ludicrously superhuman computers. The problem is essentially that although the AI is both better at board evaluation than humans and better at alpha-beta pruning (because that's hand-coded into it), there's no bridge between the two, so a human can evaluate the positional consequences of local tactics where the AI can't. But that's on a completely different scale than the weaknesses this article talks about, which are just ridiculous. That said, a smart human can work out chess strategy without guidance while almost no one can do that for poker, so maybe it's mostly an overall harder game.

PJ Cummings:

Always interesting to read your take on poker stories

James Collins:

The stack sizes are weird, though. Instead of everyone starting with $5,000 like I requested,

But the overwhelming majority of players in a real poker game are going to raise, just like Rob did.

Etc. “As” not “like.”

Aaron C Brown:

This is an interesting experiment. A few comments.

In a way it's unfair to complain about the foolish play of the hand given that the AI didn't understand the basic rules of the game--such as what hand wins, how money is awarded and what a suited hand is. Perhaps the AI played perfectly given its understanding of the rules.

Although the criticisms of the AI hand are fully justified, I would call it about average compared to fictional poker hands shown in books, movies, and television shows. Even though poker is played widely and poker hands are common pivotal plot devices, most authors and screenwriters can't seem to get them right. So the AI might be fairly reflecting its training set.

I second the harsh words for Grace Lin's preflop call. She is all but drawing dead here. A2o is the hand on which the most money in poker is lost: it's actually an above-average hand in terms of chance of winning if every pot were dealt to the river, but against decent players it's almost incapable of winning a big pot (and even less likely to be confident of winning one), while being easily capable of losing a big pot. Given the betting, it's more likely than not that at least one player has an Ace with a higher kicker. Even if not, her hand is at a disadvantage against a pocket pair and doesn't have a significant advantage over any likely hand. Beating all three other hands--and being confident of beating them--almost requires unlikely combinations like 3-4-5 or 2-2 on the board.
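
As a rough sanity check on that equity point, here is a minimal Monte Carlo sketch; it assumes the third-party `treys` hand evaluator, and the opponent count, trial count, and tie handling are all simplifying assumptions:

```python
# Rough estimate of A2o's showdown equity versus random hands dealt to the river.
# Requires `pip install treys`; numbers are illustrative, not a rigorous analysis.
import random
from treys import Card, Evaluator

def a2o_equity(num_opponents=3, trials=20000, seed=1):
    rng = random.Random(seed)
    evaluator = Evaluator()
    hero = [Card.new("Ah"), Card.new("2c")]  # ace-deuce offsuit
    deck = [Card.new(r + s) for r in "23456789TJQKA" for s in "shdc"]
    deck = [c for c in deck if c not in hero]
    wins = 0.0
    for _ in range(trials):
        rng.shuffle(deck)
        opponents = [deck[2 * i : 2 * i + 2] for i in range(num_opponents)]
        board = deck[2 * num_opponents : 2 * num_opponents + 5]
        hero_score = evaluator.evaluate(board, hero)  # lower is better in treys
        best_opp = min(evaluator.evaluate(board, opp) for opp in opponents)
        if hero_score < best_opp:
            wins += 1
        elif hero_score == best_opp:
            wins += 0.5  # crude split-pot handling
    return wins / trials

print(f"A2o equity vs 3 random hands: ~{a2o_equity():.1%}")
```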

I've used AI for Liar's Poker, which is not a poker game but shares some game-theory similarities. The real game uses eight digits from 0 to 9 (an 8-by-10 size). AI seems to work great--better than human experts, with only moderate training--for 3x3 to 5x5 versions of the game, but breaks down for larger ones (possibly more training could raise the level). But the point is that performance doesn't degrade gradually so that it merely plays worse than human experts; it falls apart once the complexity passes a threshold. Mathematically it's the same problem for 3x3 and 8x10, and human players don't seem to care much about the size.

I object to calling poker with no blinds or antes "a broken game." The game was invented and spread over a wide region for half a century without them. Early poker players boasted that poker differed essentially from gambling because no one ever put money in the pot without reason to think it was advantageous.

Blinds and antes were added to the game deliberately to add a gambling element--making it more exciting for most players, and faster-paced. But it did destroy one of the founding principles of the game that distinguished poker from all earlier games.

It's true there's a highly abstract argument that no one should ever bet in a game without blinds and antes. Other people will only call or raise your bet if they have the advantage over you, so you always have a negative expectation to bet. But that principle has little relevance to real poker games, or for that matter, to betting and bluffing opportunities away from the poker table.

Poker without blinds and antes will never gain mass popularity nor be shown on television. But it has a purity that I admire. It's not broken.

CJ in SF:

LLMs don't "understand" anything, so the fact that ChatGPT hallucinated victory criteria has nothing to do with any form of rules.

Aaron C Brown:

That is the question. Have LLMs and other AI models progressed to the point where they evidence general intelligence? Or are they just exploiting tricks with correlations?

It's hard to know how complex AI models make judgments. In the case of poker, is it merely stringing together words and events it has found elsewhere? Or has it progressed to an understanding of the rules and of the mathematical logic needed to go from rules to strategy to decisions?

CJ in SF:

The real question you are asking is whether general intelligence is more than a trick with correlation.

LLMs don't reason from rules.

Some AI researchers are skeptical that humans do either. A rule can be thought of as a very strong probabilistic filter.

Aaron C Brown:

I don't consider that much of a question. I'm pretty confident general intelligence is not just correlation tricks. I don't know exactly what general intelligence is, but I think I know what it is not.

It's easier to see with chess. You might study millions of board positions and next moves of top chess players, and get pretty good at guessing a good next move from that, without knowing anything about the rules of the game or the goal.

I think general intelligence requires learning the rules and goal--either from observation or texts--and applying not just correlations from past games, but tools from other domains like logic and mathematics.

With correlation, a wrong move is a wrong move. With general intelligence, you can distinguish between moves that are wrong because the understanding of the game is wrong (the AI thinks a stalemate is a win, for example) and moves that are wrong given that understanding of the game.

CJ in SF:

The brain is pretty complicated. Some areas definitely operate in a manner very similar to an LLM.

I'm agnostic on the issue.

I don't think AGI is as close as the people pushing it claim.

I expect it is more like fusion, and will be 5 years away for at least a decade.
