The issue with your scenario is that you're hitting the things LLMs are bad at: LLMs don't see letters, they see parts of words, so they have historically struggled with counting letters and with any form of arithmetic, and asking one to do the math for the poker pot is therefore very difficult. In addition, you're having it simulate an entire table of players instead of recommending moves to play: this pushes it toward narrative tropes when crafting its answers, so you get a game of poker that reads like one from a novel, where the author doesn't really know the poker mechanics but knows what story beats he wants to hit.
The better evaluation would be to have it be a player in a series of live poker games and then see how well it does.
Finish the article…
What you said is literally the entire thesis of Nate’s article and is stated pretty clearly to me in the conclusion
Lukas is making a statement of fact, not attempting to contradict Nate.
Nate doesn't understand how LLMs work, so he is describing a simple test he did and analyzing the results in detail.
Lukas is saying it was doomed from the start, and his substack page and a few Google searches make it clear he has the expertise to make this statement.
Pontificating about things he has no idea about is Nate's specialty.
"Better evaluation" in what sense? Why isn't it interesting that it fails at this one?
In the sense that evaluating whether a screwdriver is a good hammer only really proves that the evaluator is confused.
No LLM has ever managed to learn the rules of arithmetic. Why anyone believes that they are capable of learning the rules of far more complex systems is a great mystery to me.
Literally, they just write Python or JavaScript programs when they need to do arithmetic and run those programs.
As an aside, most people would call Python more complex than arithmetic.
"they just write python or javascript programs when they need to do arithmetic" is not true.
Go ask any LLM to multiply 6784000213 times 2379980541. See if you get the correct answer of 16145788497079855233.
"multiply 6784000213 times 2379980541"
Claude 4 writes a program and then
"The result of 6,784,000,213 × 2,379,980,541 is 16,145,788,497,079,855,000.
Note that this number is so large (about 1.61 × 10¹⁹) that it exceeds JavaScript's safe integer range, so there may be some precision limitations in the final digits. For the most precise result with such large numbers, you'd want to use a big integer library or mathematical software designed for arbitrary precision arithmetic."
So it "wrote" a program that returned the wrong answer?
Huh? It did it in floating point, told me the answer rounded, said the answer is rounded, and said what you need to do to get an exact answer.
Do you really want me to type the words "Can you do that?" and hit enter? Okay, here it is:
Perfect! Using BigInt for exact precision, the result is:
16,145,788,497,079,855,233
This is the completely accurate result. As you can see, the last few digits are different from the previous calculation (233 instead of 000) because BigInt maintains full precision for arbitrarily large integers, while regular JavaScript numbers lose precision beyond about 15-16 digits.
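For anyone who wants to see the precision point concretely, here is a minimal Python sketch (my own check, not the code Claude ran): Python's built-in integers are arbitrary precision, so the exact product is free, while 64-bit floats round off exactly the way the first answer did.

```python
# Checking the two answers quoted above: exact integer math vs. 64-bit floats.
a, b = 6_784_000_213, 2_379_980_541

exact = a * b                      # Python ints are arbitrary precision
as_float = float(a) * float(b)     # doubles keep only ~15-16 significant digits

print(exact)              # 16145788497079855233
print(f"{as_float:.0f}")  # 16145788497079855104 -- precision already lost
```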
You're a human being. You know what the answer is supposed to be because MarkS _told_ you what the answer is.
By contrast the program told you "This answer MAY be rounded, but I'm not going to check if it is. Instead if you want to double check yourself here's some information on arbitrary precision."
It's apparent to a human being that if you know the answer could be wrong, you should do the due diligence to check whether it's correct and then do the extra work to actually get the right answer. The AI doesn't "understand" that. The ultimate gauge here is whether or not these tools are useful to human beings, and this is one example, of many, where the answer is technically correct but ultimately useless.
There is no way that Python is more complex than partial differential equations.
OK? That seems unrelated.
If by "arithmetic" you mean the level of math that's taught in grade school, maybe. But Python (and object-oriented programming in general) is not exactly intellectually demanding either.
LLMs don't learn rules.
And most humans don't reason with "rules" the way they believe they do.
You say that AI models hallucinate very rarely, but even if they only do it 1% of the time, doesn’t that massively detract from the value of what they produce? If that 1% means you have to double check everything they do - even the 99% that’s correct - that undermines their value considerably.
Plus more frequency means more mistakes. If you leverage some AI tool with a 1% failure rate once a year that's maybe not too bad. If it's firing off millions of times an hour that's an entirely different story.
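A quick back-of-the-envelope sketch of that frequency point (hypothetical numbers, purely to illustrate the compounding): with a 1% per-call error rate, errors are nearly guaranteed once the tool runs at any volume.

```python
# Probability of at least one error, and expected error count, at a 1% failure rate.
error_rate = 0.01
for calls in (100, 10_000, 1_000_000):
    p_any = 1 - (1 - error_rate) ** calls
    print(f"{calls:>9,} calls: ~{calls * error_rate:,.0f} expected errors, "
          f"P(at least one) = {p_any:.3f}")
```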
Do you fire employees who get things wrong 1% of the time?
Is the employee a pharmacist, surgeon, bus driver, etc?
And how wrong are we talking about?
Those are the jobs safer from wholesale elimination by AI, yes.
Well, no, for a few reasons.
1. Yeah just in practice verifying info is far easier than constructing it.
2. You can use Perplexity or a similar service which also dumps the necessary news/books citations on you.
3. "Individual contributors" in any knowledge work make mistakes way more than 1% of the time. So the question is if the senior engineer or the editor would like to automate away the IC. That is certainly happening to some hard-to-discern degree and the bottom rung of the career ladder is endangered.
4. In a lot of domains AI also has the ability to check itself. Programming/software engineering has very sophisticated layers of checking and testing, and AI can just check the results of those tests itself.
5. The scope of those checks in (4) is increasing rapidly as AI learns how to verify its work as well as produce it.
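On point 1, a tiny illustration (my example, not from the thread) of why verifying is so much cheaper than constructing: finding the factors of a large number takes real search, but checking a proposed factorization is a single multiplication. That same asymmetry is what the automated test suites in point 4 exploit.

```python
# Verifying a proposed answer is one multiply; producing it is the hard part.
def verify_factorization(n: int, factors: list[int]) -> bool:
    product = 1
    for f in factors:
        product *= f
    return product == n

print(verify_factorization(16145788497079855233, [6784000213, 2379980541]))  # True
print(verify_factorization(16145788497079855233, [6784000213, 2379980542]))  # False
```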
I'm working on an Oh Hell book---draft deadline 7/1!---and this almost precisely mirrors my experience with ChatGPT. It has been incredibly helpful for a lot of the writing---copy editing, organizational suggestions, etc.---but it just cannot play the damn game. It makes laughable structural and substantive errors, and brings a hilarious confidence that puts you in the overt uncanny valley every time it talks at all about strategy. And I'm literally teaching it the strategy when I show it my writing. The whole experience has made me both more in awe of LLMs---no idea how long the manuscript would have taken me without LLM feedback---and very skeptical that we are super close to an AGI coming out of this particular paradigm absent another breakthrough.
One question I think we should ask ourselves is whether we're expecting a human level of intelligence or whether we're expecting a smart expert level of intelligence. Right now I think AI is a lot like a college student who's used to BSing their way through essays. I think a lot of humans are bad at math and poker strategy and are likely to lose track of stacks and misunderstand poker strategy. Is AGI supposed to be on the level of an average human, or on the level of an expert?
For example, my company is trying to replace some of the Bangalorean entry-level contractors we use with AI, and I think for the most part, that's doable. Our Bangalorean entry-level contractors are often easily confused. If you are asking them to do anything other than their most common rote tasks, it's often much better to go through their permanent, very smart Bangalorean supervisors with your request so that you can make sure your request is fully understood. AI is absolutely not at the point where the supervisors can be replaced. But the entry-level temp contractors are mostly handling rote tasks and only occasionally asked to do something more complicated. Once you train the AI on the rote tasks, I'm sure it'll accomplish them much faster than the humans would have.
One major issue with the human-vs-computer intelligence comparison IMO is that humans and computers learn in very different ways. The vast majority of our skills and knowledge comes from spending years personally interacting with our environment and seeing how it responds. Computers learn using incredible amounts of already-existing text, images, audio files, etc. The reason LLMs sound so much like college students BSing their essays is because that's kind of what they are: they have consumed a great deal of raw information, but they have no "real life" experience.
I think it's pretty reasonable to assume LLMs will be able to perform rote intellectual tasks, like generating secondary reports based on a large number of primary sources, much better than humans. But will they ever, say, figure out how to facilitate communication between you and your Bangalorean entry-level contractors better than a Bangalorean supervisor who has spent years working with companies like yours? I'm not at all confident about that.
That's not really true. Like an hour ago I gave it a "debug this thing" problem which was sort of rote, but frankly I couldn't do it, and I'm not going to semantics away the fact that it could do it and I could not. Turns out a certain database system's docker image wasn't built for Mac by default, who knew.
I've figured out how to use AI to get over writer's block. For me personally -- "write this thing" or "correct my writing" doesn't work. What does work is to write my own draft and ask for feedback, which I find easy to respond to since it's very good at it.
I also frequently just dump in a PDF, and it's clear I want a summary, which it gives. I don't really need to explain to it how to summarize.
And this is generalist... I don't have to re-train a contractor each time.
The issue is that o3 doesn't actually understand poker at all. Rather, it is able to produce a simulacrum of a poker game based on previous written descriptions it has seen. This is why it is generally able to get the structure, steps, terminology, and form correct while completely whiffing on the actual logic of the game. It's also not something a model like o3 is likely to improve on, given its architecture, even with more training. Approaching this type of problem requires a fundamentally different architecture (for example, some of the reasoning architectures may produce better results over time).
This is also one of the reasons transformer-based models in particular are so confusing to people in terms of their capabilities. They can, for example, easily generate code to count the number of letters in a word, but if you ask them to actually count the number of letters in a word, they struggle. If you understand the actual architecture of such models, this behavior is expected, but for someone who has been told the model is "nearly AGI," it is confusing.
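For concreteness, this is the sort of trivial program such a model can generate on request, even though doing the same count "in its head" trips it up because it sees tokens rather than characters (a minimal sketch of my own, not from the original comment):

```python
# Counting occurrences of a letter in a word -- easy to write, hard for an
# LLM to do directly because it never sees the individual characters.
def count_letter(word: str, letter: str) -> int:
    return sum(1 for ch in word.lower() if ch == letter.lower())

print(count_letter("strawberry", "r"))  # 3
```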
Using o3 instead of o4 seems like setting up ChatGPT for failure. I've found o4 superior even for basic data calculations like providing cumulative data for many election cycles. I asked ChatGPT o4 how it would likely do compared to o3 in poker, and it replied:
Overall Efficiency for Poker
Speed & Cost: o4 is around 2–3× more efficient operationally.
Strategic Depth: Both models perform similarly in terms of intelligence and reasoning, but o4’s larger context means better memory of prior actions, crucial for bluff detection, bet sizing, and tracking player styles.
Stamina: For continuous games or training simulations, o4 scales far better.
If I had a nickel for every time an AI evangelist told me "use a different model"...
Duh!
I think that's an uninformed "Duh" though. If you're far enough behind on AI capability, you hear about a non-capability, feel your priors are confirmed, fire off a "told you so" on appropriate social media, and continue in your worldview.
I think the more appropriate question is why would anyone think that an LLM would be able to play poker in the first place. It's pure ignorance.
Because it's good at math and deception?
If you know anything at all about LLMs, then you know they're not good at math.
If you know anything about programming, then you know that the algorithms used by game programs are nothing like the algorithms used by LLMs.
If you know anything about LLMs, then you know that the basic responses are just statistical correlation. "Deception" is added in afterwards by teams of interns.
1. https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/ unless you're trying to go for "reasoning models don't technically count as LLMs," in case whatever who cares.
2. Try the prompt "write a reinforcement learning connect 4 solver, train it, and have it play against me" in an agentic programmer loop. I did it on Claude 3.7; it works. You could do it with Go, the code is the same, it just takes more compute than I have.
3. Yeah, idk what point you're making by pointing out that LLMs are just electronic-statistical models. Brains are just chemicals. Nuclear reactors are just teeny tiny atoms splitting. Etc.
4. It can do some deception just by prompting it, like write a short story about a detective. It's also learning deception on its own as an emergent behavior. Here's a bunch of examples https://redwoodresearch.substack.com/p/alignment-faking-in-large-language. This isn't good btw. Like it's cool they're powerful enough to make coding easier but it's bad they're powerful enough to essentially break out of their training environment.
1. Since Silver is testing an LLM, duh. More to the point, the algorithms employed by an LLM (which at their heart are based on statistical correlation) are not the same algorithms employed by something like AlphaGo.
Do laymen need to care about that? Of course not. They'll just judge the final product on how useful it is.
Do engineers need to care about that? You better believe they do. A few months ago all the chatter was about switching models in and out depending on the problem. That's a much harder problem than it first appears to be.
2. And I, or any halfway decent player, would kick its ass at Go. Or chess. Or whatever. Isn't that Silver's point?
It's also inherently unimpressive. Chess or go playing programs are widely discussed on the internet, so for an LLM to just regurgitate code is comparable to an LLM regurgitating the answers to the bar exam.
3. This is kind of unrelated and I think you're responding to another one of my posts, but from a historical perspective AI used to mean a machine that can think like a human being.
Unfortunately that turns out to be really hard. As a consequence you now see a shifting of the goal posts, with AI researchers asking questions like "How do we know that human consciousness isn't just correlation?" How convenient.
4. "Deception" can mean many things here. Does hallucination count? In terms of being tricksy in a scenario like a poker match that is an entirely different issue. I doubt any LLM can do it well and if any try I would bet that's accomplished with hand tuning rather than just relying on the training set.
I'm curious why you didn't direct it to use code to handle randomizing and evaluation of winning hands. Otherwise, the LLM is literally incapable of truly drawing the cards randomly and honestly.
https://sanand0.github.io/llmrandom/
Anyway, I gave the code version a shot, I just added "Please execute code to handle randomizing dealing and the hand." to the end of your original prompt.
https://chatgpt.com/share/682e5484-d4b8-800c-ab74-4a33df06470a
I have no idea how to evaluate the strategy, but at least you can be sure the deal is truly random
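For reference, a minimal sketch of what "execute code to handle the randomizing and dealing" buys you (my own illustration, not the code in the linked chat): a genuinely uniform shuffle of the full deck, instead of the model picking cards that merely look random.

```python
import random

RANKS = "23456789TJQKA"
SUITS = "shdc"
DECK = [r + s for r in RANKS for s in SUITS]   # 52 cards

def deal(num_players: int = 8, seed: int | None = None):
    rng = random.Random(seed)
    deck = DECK[:]
    rng.shuffle(deck)                           # uniform shuffle of the whole deck
    hands = [deck[2 * i: 2 * i + 2] for i in range(num_players)]
    board = deck[2 * num_players: 2 * num_players + 5]
    return hands, board

hands, board = deal(seed=42)
print(hands[0], board)
```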
“ChatGPT’s explanation might sound smart if you don’t know poker, but it’s basically just word salad.”
This is true for most LLM output on most subjects: the more you know about the subject, the less sense the output makes. I am deeply sceptical of anyone who is inexpert in such a large number of subjects that they take a long time to notice this.
Tried the same prompt on o3, specifying "eight *elite professional* players." I do not have the chops to tell if it's any better.
--- Excerpt ---
Pre-flop action
Kenji ( J♠ 2♦ ) instantly mucks.
Laura ( 4♣ 8♠ ) folds.
Marco ( 8♥ 5♦ ) flings his suited wheel-draw hand away.
Priya ( 6♣ 8♣ ) opens to $80 from the CO, eyeing position and a double-suited board.
Victor ( 7♠ 2♣ ) doesn’t defend the button—fold.
In the small blind Alex peeks at Q♠ T♠, likes the suited Broadway texture, and calls $75 more.
Sofia ( J♣ 3♦ ) elects to fold the big blind.
Dmitri re-checks A♥ T♥, slides out a $280 3-bet, typical of his 11 % 3-bet range versus LP opens.
Alex tanks, decides the squeeze is too rich and surrenders his $80 dead money.
Priya, in position with a sexy two-gap suited connector, calls $200.
Pot to the flop: $655.
It is not better.
"Marco ( 8♥ 5♦ ) flings his suited wheel-draw hand away."
This doesn't actually describe anything real about his hand.
"In the small blind Alex peeks at Q♠ T♠, likes the suited Broadway texture, and calls $75 more."
Actually a terrible play for an elite professional; this is a 3-bet all the time, so it's a strategic error.
"Sofia ( J♣ 3♦ ) elects to fold the big blind.
Dmitri re-checks A♥ T♥, slides out a $280 3-bet, typical of his 11 % 3-bet range versus LP opens."
Was there anything about a straddle in your setup of this hand that you cut out of what you posted here? Because Dmitri should not be in this hand unless he is straddling UTG, which is not mentioned at all. This is where straddle action would take place, after the big blind folds. The presence of a straddle makes Alex's QTs call even worse, BTW.
I assume it's Nate's exact setup, which includes a straddle.
Of course, then there is the nonsense of referring to Q10 as a "two-gap connector" which... just no.
Here's the full convo https://chatgpt.com/share/682e5c28-cefc-800c-8657-f4e833ef3bd0
Poker is solved-ish by computer, you could set up any AI model you want to play against such a computer and see what happens.
Nate, you write ‘Personally, I don’t consider solvers themselves to be examples of “artificial intelligence”’. It sounds like you’re intentionally abandoning the pre-2022 definition of “artificial intelligence”. That definition includes not only LLMs and other machine-learning technologies, but also state-space search, both deterministic and probabilistic reasoning techniques, automated planning, and knowledge representation. Is this a conscious choice on your part? There has been a significant drift in the post-2022 meaning of “artificial intelligence” to the general populace, to generally mean solely LLMs, so such a choice would certainly lean descriptivist.
ChatGPT coming out completely changed how the world uses the word AI. ChatGPT and "AI" suddenly became the hot new thing, and within a couple of years literally anything technological was being marketed to the general public as Artificial Intelligence. Maybe some of these things would've fit the pre-2022 definition of AI, but most of them didn't, and AI started not to mean anything anymore.
I’ve 100% noticed in the last year society has tightened the definition and we’re still figuring out where it’ll land. I personally think AI has to be generative and to a certain extent cannot be “put on rails” but it’s hard to say…
Of course. "A machine that thinks like a human being" turned out to be really, really hard and so the goal posts had to be moved.
All solvers do is just number crunch pre-programmed algorithms for every possible discrete poker hand and action and proceed from there. This is a LOT of possibilities, but it's just doing the same math problem a lot of times and is in no way learning.
Think about it more simply in a smaller game with a more understandable set of choices and combinations: blackjack. You can buy one of those little blackjack optimal-strategy cards in an airport; they fit on a notecard. They've done the math to determine what every possible combination of your cards and the dealer's face card means for your action. It's then pretty easy to learn the math on how card counting changes those odds. It's just simple math repeated for each possible combination.
Same in chess: you can tell a computer to simulate all possible future moves from any given board position, and it can calculate the exact optimal course just by brute-force checking every possibility. There isn't intuition; it's just number chugging.
Solvers are doing the same thing, just for A LOT more combinations and variables in poker, so there is more compute needed.
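To make the "brute force every possibility" idea concrete, here is a toy, fully searchable game (Nim: take 1-3 stones, whoever takes the last stone wins) solved the same way. This is a sketch of my own rather than anything from the thread; chess and poker solvers work on the same principle, just over astronomically more states and with heavy pruning and abstraction.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def mover_wins(stones: int) -> bool:
    """Exhaustively check whether the player to move can force a win."""
    if stones == 0:
        return False  # no stones left: the previous player took the last one and won
    return any(not mover_wins(stones - take)
               for take in (1, 2, 3) if take <= stones)

print(mover_wins(20))  # False -- multiples of 4 are lost for the player to move
print(mover_wins(21))  # True
```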
The underlying algorithms are completely different. You need some kind of master program that recognizes that the LLM is unsuited for this task and turns to a completely different program (that plays chess, poker, go, whatever). Good luck with that.
I thought solvers just brute force various algorithms and calculate the expected values of the various steps - that doesn't seem like something I would call artificial intelligence.
I’m not Nate, but personally I thought that the pre-2022 use of the term was mostly cope by researchers who wanted to be able to use fancy terminology to describe their research, even if the research was really just a glorified calculator.
There's also been a convenient retreat on the part of AI researchers/proponents in the sense of "Maybe human consciousness is just correlation".
Nate Silver,
I have been to Norway many times, but the only time post-COVID was to take a cruise to Svalbard, so some of my information might be out of date.
I recommend that you fly to Bergen and take the Flåm Railway and the fjord cruise on Nærøyfjord; both the rail journey and the cruise offer beautiful fjord views. (When I did this, there was a bus from Bergen to the top of the Flåm Railway and a rail option that, I think, involved a very short bus transfer.) Now some possibilities to consider:
1. Look into extensive Sognefjord boat tours. (Sognefjord is the longest fjord in Norway; Nærøyfjord is a small, particularly dramatic branch of Sognefjord.) There are also bus and driving options to overlooks and some towns in the fjords.
2. I would check out the Hurtigruten coastal voyages. They run pretty much daily from Bergen, and you can economically purchase a limited trip, getting off at ANY port. (Note it wasn't easy to do this online, and I used a travel agent who had worked with Hurtigruten before.) One day takes you to Ålesund and 4 days to Tromsø. (I mention Ålesund and Tromsø because Ålesund can be a jumping-off point for fjord tours, and Tromsø is beautifully situated on two islands that look like a single island split by a fjord; also, the trip from Bergen to Tromsø includes the best fjord views on the coastal voyage.) On the trip from Bergen to Tromsø, Hurtigruten offers some tours that you can purchase; they do vary in quality, and the advertising for them can be misleading. Also, you get to see some of the Lofoten Islands, which are beautiful.
3. If you need to keep to the approximately 5 days, you can fly back from Tromsø.
4. If you want to travel by rental car, there are now bridges and tunnels that let you drive to the Lofoten Islands. This is not entirely a blessing, as the development there has made visiting them less interesting than it used to be.
5. Flights to Bergen on Icelandic airlines allow free stopovers in Iceland; the one-day Golden Circle tours are very worthwhile.
So if it were me, I had not previously seen a lot of the fjords (but somehow knew what I learned on the previous trips), and my time was severely limited: I would fly to Bergen (probably landing in the late afternoon or early evening) and stay at a place on or very near the old harbor. Then the next day I would take the Flåm Railway and the fjord cruise on Nærøyfjord, and that evening board a Hurtigruten coastal voyage ship headed to Tromsø; I would consider taking one or two excursions from the ship. Then, getting off in Tromsø, I would spend a day or part of a day exploring the area around Tromsø and fly back to Bergen to catch my flight home.
Scott
N. Scott Cardell
Pullman WA
(I am leery about putting my email in a public forum. But I subscribe to the Silver Bulletin and you can find my email there.)
Very interesting. Nate's work on this issue may help refine the problems with inaccuracies and hallucinations of the AI boiling all around us. I'm not at all sure if this is a monster coming at us to destroy us or whether it's vaporware, like fusion energy or nanotech, neither of which went anywhere after a huge build-up. Or, maybe just a really, really good search engine.
I'm in love with the Perplexity AI (which, annoyingly, is privately held ---) and use it daily and even pay the monthly fee for more in-depth searches. Not sure if that is necessary: you get five deep searches free, they want people using it. However, it does have a percentage of inaccuracies. I asked if a certain Latinist is still alive (P.L. Chambers) and Perplexity disclaimed all knowledge of the existence of this person. I own three of her books right now so I rephrased the question a little, and up came all the info, and the AI even researched death records, it claimed. (Didn't find any for her so Perplexity thinks she's still with us.) I find that rewording the question or adding sentences to the query makes a lot of difference. It seems to be important to already know something about the subject ---- so that I'll know if the answer is wrong entirely. The AIs like elaborate queries and mine are sometimes several sentences.
I tend to think experiments like this show why LLMs will be insufficient to get to AGI. The results are even more hysterical when you try to use ChatGPT to play chess. It not only isn't a very good chess player, it can't even stop itself from making obviously illegal moves.
LLMs have some really impressive capabilities but they're really bad at building a "mental model" of a system that has to function in a rigid way, where small details being wrong makes the whole thing non-sensical. And that's a really important part of intelligence.
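The illegal-move problem is also exactly the kind of thing that is trivial to catch with ordinary code, which underlines the gap. For example (a sketch assuming the third-party python-chess library), you can have a rules engine reject whatever the model proposes instead of trusting it to track the board:

```python
import chess  # pip install python-chess

def try_llm_move(board: chess.Board, move_san: str) -> bool:
    """Apply the model's proposed move if legal; otherwise reject it."""
    try:
        board.push_san(move_san)   # parses SAN and raises on illegal moves
        return True
    except ValueError:
        print(f"Illegal or unparseable move: {move_san}")
        return False

board = chess.Board()
print(try_llm_move(board, "e4"))    # True
print(try_llm_move(board, "Ke2"))   # False -- illegal from the starting position
```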
I'm curious whether ChatGPT is any better at sports gambling than it is at poker. A lot of sites offer various profit boosts in different scenarios, and I would like to know when these are plus-EV bets assuming I select the outcome at random. This is fairly easy on a -110/-110 bet but gets a lot more complicated with the same-game parlays these boosts are typically offered on. Using 4o, I have not had any confidence in the results.
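For the simple case mentioned above, the -110/-110 math looks something like this (a sketch with a hypothetical 30% profit boost and a coin-flip pick; the same-game parlay version really is much messier because the legs are correlated):

```python
def boosted_ev(american_odds: int, boost: float, win_prob: float) -> float:
    """Expected profit per $1 staked on a negative-American-odds bet with a profit boost."""
    base_profit = 100 / abs(american_odds)   # -110 pays $100 profit per $110 risked
    return win_prob * base_profit * (1 + boost) - (1 - win_prob)

print(f"{boosted_ev(-110, 0.00, 0.5):+.3f}")  # -0.045: a plain -110 coin flip is -EV
print(f"{boosted_ev(-110, 0.30, 0.5):+.3f}")  # +0.091: a 30% profit boost flips it to +EV
```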