121 Comments
Lukas Nel:

The issue with your scenario is that you’re hitting the things LLMs are bad at: LLMs don’t see letters, they see parts of words, so they have historically struggled at counting letters and at any form of arithmetic, which makes asking one to do the math for the poker pot very difficult. In addition, you’re having it simulate an entire game of people instead of recommending moves to play: this results in it using narrative tropes to craft its answers, so you get a game of poker that reads like one in a novel, where the author doesn’t really know the poker mechanics but knows what story beats he wants to hit.

The better evaluation would be to have it be a player in a series of live poker games and then see how well it does.
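To make the "parts of words" point concrete, here is a quick sketch using OpenAI's tiktoken tokenizer (the package, the encoding name, and the example word are my own assumptions for illustration; any sub-word tokenizer shows the same thing):

```python
# Sketch of why letter-level tasks are awkward for LLMs: the model sees
# sub-word token ids, not individual characters. Requires `pip install tiktoken`;
# "cl100k_base" is one of the standard OpenAI encodings.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print(tokens)                              # a few integer ids, not 10 letters
print([enc.decode([t]) for t in tokens])   # the sub-word chunks the model actually "sees"
```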

Christina Moraes:

Finish the article…

What you said is literally the entire thesis of Nate’s article and is stated pretty clearly to me in the conclusion

CJ in SF:

Lukas is making a statement of fact, not attempting to contradict Nate.

Nate doesn't understand how LLMs work, so he is describing a simple test he did and analyzing the results in detail.

Lukas is saying it was doomed from the start, and his substack page and a few Google searches make it clear he has the expertise to make this statement.

VK:

Pontificating about things he has no idea about is Nate's specialty.

Bjorn Hauge:

"Better evaluation" in what sense? Why isn't it interesting that it fails at this one?

CJ in SF:

In the sense that evaluating whether a screwdriver is a good hammer only really proves that the evaluator is confused.

awindowcleanerme:

That's not really true. ChatGPT has been able to solve pages of vector calculus problems for me correctly. The problem is the type of problem involved and what counts as a decision.

MarkS:

No LLM has ever managed to learn the rules of arithmetic. Why anyone believes that they are capable of learning the rules of far more complex systems is a great mystery to me.

CJ in SF:

LLMs don't learn rules.

And most humans don't reason with "rules" the way they believe they do.

[Comment deleted]

MarkS:

"they just write python or javascript programs when they need to do arithmetic" is not true.

Go ask any LLM to multiply 6784000213 times 2379980541. See if you get the correct answer of 16145788497079855233.
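(For reference, the product itself is trivial to verify outside a model; Python integers are arbitrary precision, so a one-line check confirms the figure above:)

```python
# One-line check of the product above; Python ints do not overflow or round.
print(6784000213 * 2379980541)  # 16145788497079855233
```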

[Comment deleted]

MarkS:

Claude Sonnet 4 (the free version):

Me: multiply 6784000213 times 2379980541

Claude: 6784000213 × 2379980541 = 16,139,655,253,934,618,333

Me: Are you sure?

Claude: Let me double-check that calculation:

6,784,000,213 × 2,379,980,541

Using long multiplication or breaking this down:

6,784,000,213 × 2,379,980,541 = 16,139,655,253,934,618,333

Yes, I'm confident that's correct. The result is 16,139,655,253,934,618,333.

[Comment deleted]

MarkS:

I turned nothing off, just used the free version of Claude. If the pay versions have managed to fix one small class of hallucinations, that's nice. But the LLMs still have not learned arithmetic in the way that the vast majority of human children can learn arithmetic. Your claim is that this does not matter. I claim that it does matter. In the future, we'll find out who was right.

Slaw:

So it "wrote" a program that returned the wrong answer?

[Comment deleted]

Slaw:

You're a human being. You know what the answer is supposed to be because MarkS _told_ you what the answer is.

By contrast the program told you "This answer MAY be rounded, but I'm not going to check if it is. Instead if you want to double check yourself here's some information on arbitrary precision."

It's apparent to a human being that if you know the answer could be wrong, you should do the due diligence to check whether it's correct and then do the extra work to actually get the right answer. The AI doesn't "understand" that. The ultimate gauge here is whether or not these tools are useful to human beings, and this is one example, of many, where the answer is technically correct but ultimately useless.
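As a sketch of what "may be rounded" can mean here (assuming the tool did the multiplication in ordinary double-precision floats, which the deleted comment doesn't show): a 64-bit float cannot hold a 20-digit product exactly, while exact integer arithmetic can.

```python
# Hypothetical illustration of the rounding caveat: the same multiplication in
# double-precision floats versus exact integers.
a, b = 6784000213, 2379980541

print(a * b)                     # exact: 16145788497079855233
print(int(float(a) * float(b)))  # rounded to ~16 significant digits; the last digits differ
```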

Slaw:

There is no way that Python is more complex than partial differential equations.

[Comment deleted]

Slaw:

If by "arithmetic" you mean the level of math that's taught in grade school, maybe. But Python (and object-oriented programming in general) is not exactly intellectually demanding either.

Thoughts About Stuff:

“ChatGPT’s explanation might sound smart if you don’t know poker, but it’s basically just word salad.”

This is true for most LLM output on most subjects: the more you know about it, the less sense it makes. I am deeply sceptical of anyone who is inexpert enough in such a large number of subjects that they take a long time to notice this.

TurboNick:

You say that AI models hallucinate very rarely, but even if they only do it 1% of the time, doesn’t that massively detract from the value of what they produce? If that 1% means you have to double check everything they do - even the 99% that’s correct - that undermines their value considerably.

Slaw:

Plus more frequency means more mistakes. If you leverage some AI tool with a 1% failure rate once a year that's maybe not too bad. If it's firing off millions of times an hour that's an entirely different story.

Mo Diddly:

Do you fire employees who get things wrong 1% of the time?

CJ in SF:

Is the employee a pharmacist, surgeon, bus driver, etc?

And how wrong are we talking about?

[Comment deleted]

Slaw:

If you are running thousands of calculations a second to check for stress fractures what's the acceptable failure rate? 1% of millions of operations is a lot of second guessing, and that's assuming it's even possible to winnow down the overall result set to a subset of suspect outcomes.
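To put rough numbers on that, using the hypothetical 1% figure from this thread and assuming independent errors:

```python
# Back-of-the-envelope: expected errors and the chance of a fully clean run
# at a hypothetical 1% per-call error rate.
rate = 0.01
for n in (1, 1_000, 1_000_000):
    print(f"{n:>9} calls: ~{rate * n:g} expected errors, "
          f"P(no errors) = {(1 - rate) ** n:.3g}")
```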

[Comment deleted]

Slaw:

In other words, it's useless for any type of big data/operation at scale. Who would have guessed?

Oh, that's right. Maybe people who actually do this stuff for a living and know what they're talking about. I can assure you that the discussions that we have are not anywhere close to the nonsense that is proliferating through the lay public right now.

[Comment deleted]

Slaw:

If you don't understand the subject matter and you're not willing to engage in an honest discussion, I would recommend not posting in the first place.

Also, you don't need to make a farewell post. I assure you I couldn't care less if you just vanished.

Caleb Begly:

The issue is that o3 doesn't actually understand poker at all. Rather, it is able to produce a simulacrum of a poker game based on previous written descriptions that it has seen. This is why it is generally able to get the structure, steps, terminology, and form correct, while completely whiffing on the actual logic of the game. This is also not something that a model like o3, given its architecture, is likely to improve on, even with more training. It requires a fundamentally different architecture to approach this type of problem (for example, some of the reasoning architectures may produce better results over time).

Caleb Begly:

This is also one of the reasons transformer-based models in particular are so confusing to people in terms of their capabilities. They can (for example) easily generate code to count the number of letters in a word, but if you ask them to actually count the letters themselves, they struggle. If you understand the actual architecture of such models, this behavior is expected, but for someone who has been told they are "nearly AGI" it is confusing.
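The contrast is easy to demonstrate: the generated-code version of the task is trivially correct when executed, even when the same model miscounts when answering directly ("strawberry"/"r" is just a stock example, not something from the article):

```python
# The kind of code the model can reliably write, even when it can't reliably
# do the counting "in its head".
word, letter = "strawberry", "r"
print(f"{word!r} contains {word.count(letter)} {letter!r}s")  # 3
```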

Matt Glassman:

I'm working on an Oh Hell book---draft deadline 7/1!---and this almost precisely mirrors my experience with ChatGPT. It has been incredibly helpful for a lot of the writing---copy editing, organizational suggestions, etc.---but it just cannot play the damn game. It makes laughable structural and substantive errors, and brings a hilarious confidence that puts you in the overt uncanny valley every time it talks at all about strategy. And I'm literally teaching it the strategy when I show it my writing. The whole experience has made me both more in awe of LLMs---no idea how long the manuscript would have taken me without LLM feedback---and very skeptical that we are super close to an AGI coming out of this particular paradigm absent another breakthrough.

alguna rubia:

One question I think we should ask ourselves is whether we're expecting a human level of intelligence or whether we're expecting a smart expert level of intelligence. Right now I think AI is a lot like a college student who's used to BSing their way through essays. I think a lot of humans are bad at math and poker strategy and are likely to lose track of stacks and misunderstand poker strategy. Is AGI supposed to be on the level of an average human, or on the level of an expert?

For example, my company is trying to replace some of the Bangalorean entry-level contractors we use with AI, and I think for the most part, that's doable. Our Bangalorean entry-level contractors are often easily confused. If you are asking them to do anything other than their most common rote tasks, it's often much better to go through their permanent, very smart Bangalorean supervisors with your request so that you can make sure your request is fully understood. AI is absolutely not at the point where the supervisors can be replaced. But the entry-level temp contractors are mostly handling rote tasks and only occasionally asked to do something more complicated. Once you train the AI on the rote tasks, I'm sure it'll accomplish them much faster than the humans would have.

Stein:

One major issue with the human-vs-computer intelligence comparison IMO is that humans and computers learn in very different ways. The vast majority of our skills and knowledge comes from spending years personally interacting with our environment and seeing how it responds. Computers learn using incredible amounts of already-existing text, images, audio files, etc. The reason LLMs sound so much like college students BSing their essays is because that's kind of what they are: they have consumed a great deal of raw information, but they have no "real life" experience.

I think it's pretty reasonable to assume LLMs will be able to perform rote intellectual tasks, like generating secondary reports based on a large number of primary sources, much better than humans. But will they ever, say, figure out how to facilitate communication between you and your Bangalorean entry-level contractors better than a Bangalorean supervisor who has spent years working with companies like yours? I'm not at all confident about that.

David Watson:

I'm curious why you didn't direct it to use code to handle randomizing and evaluation of winning hands. Otherwise, the LLM is literally incapable of truly drawing the cards randomly and honestly.

https://sanand0.github.io/llmrandom/

Anyway, I gave the code version a shot, I just added "Please execute code to handle randomizing dealing and the hand." to the end of your original prompt.

https://chatgpt.com/share/682e5484-d4b8-800c-ab74-4a33df06470a

I have no idea how to evaluate the strategy, but at least you can be sure the deal is truly random.
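For anyone curious, the dealing half of that is only a few lines. A minimal sketch of the kind of thing the linked chat presumably executed (my own illustration, not the code ChatGPT actually ran; hand evaluation is omitted):

```python
# Minimal sketch: shuffle a 52-card deck and deal 8 two-card hands plus a board.
# This only handles randomization; ranking the winning hand would need a separate evaluator.
import random

deck = [r + s for r in "23456789TJQKA" for s in "shdc"]
random.shuffle(deck)

players = {f"Player {i + 1}": [deck.pop(), deck.pop()] for i in range(8)}
board = [deck.pop() for _ in range(5)]  # flop, turn, river (burn cards skipped for simplicity)

print(players)
print("Board:", board)
```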

CJ in SF:

Nate's goal was to evaluate the quality of poker reasoning, so his only concern was that the hands needed to be sufficiently random that the response was not effectively a search query and replay.

David Watson:

True enough, but it's reasonable for it to also use Python to determine winning hands and chip counts.

Dean Flamberg:

Using o3 instead of o4 seems like setting up ChatGPT for failure. I've found o4 superior even for basic data calculations like providing cumulative data for many election cycles. I asked ChatGPT o4 about how it would likely do compared to o3 in poker and it replied:

Overall Efficiency for Poker

Speed & Cost: o4 is around 2–3× more efficient operationally.

Strategic Depth: Both models perform similarly in terms of intelligence and reasoning, but o4’s larger context means better memory of prior actions, crucial for bluff detection, bet sizing, and tracking player styles.

Stamina: For continuous games or training simulations, o4 scales far better.

Zach Bird:

If I had a nickel for every time an AI evangelist told me "use a different model"...

Dean Flamberg:

True but an article saying a specific AI vendor is bad should be using that vendor's top-line model to evaluate it.

Slaw:

Duh!

Daniel Echlin:

I think that's an uninformed "Duh" though. If you're far enough behind on AI capability, you hear about a non-capability, feel your priors are confirmed, fire off a "told you so" on appropriate social media, and continue in your worldview.

Slaw:

I think the more appropriate question is why would anyone think that an LLM would be able to play poker in the first place. It's pure ignorance.

Daniel Echlin:

Because it's good at math and deception?

Slaw:

If you know anything at all about LLMs then you know they're not good at math.

If you know anything about programming then you know that the algorithms used by game programs are nothing like the algorithms used by LLMs.

If you know anything about LLMs then you know that the basic responses are just statistical correlation. "Deception" is added in afterwards by teams of interns.

Daniel Echlin:

1. https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/ unless you're trying to go for "reasoning models don't technically count as LLMs," in which case, whatever, who cares.

2. Try the prompt "write a reinforcement learning connect 4 solver, train it, and have it play against me" in an agentic programmer loop. I did on Claude 3.7, it works. You could do it with Go, the code is the same, it just takes more compute than I have.

3. Yeah idk what point you make by pointing out LLMs are just electronic-statistical models. Brains are just chemicals. Nuclear reactors are just teeny tiny atoms splitting. Etc.

4. It can do some deception just by prompting it, like write a short story about a detective. It's also learning deception on its own as an emergent behavior. Here's a bunch of examples https://redwoodresearch.substack.com/p/alignment-faking-in-large-language. This isn't good btw. Like it's cool they're powerful enough to make coding easier but it's bad they're powerful enough to essentially break out of their training environment.

Slaw:

1. Since Silver is testing an LLM, duh. More to the point, the algorithms employed by an LLM (which at their heart are based on statistical correlation) are not the same algorithms employed by something like AlphaGo.

Do laymen need to care about that? Of course not. They'll just judge the final product on how useful it is.

Do engineers need to care about that? You better believe they do. A few months ago all the chatter was about switching models in and out depending on the problem. That's a much harder problem than it first appears to be.

2. And I, or any halfway decent player, would kick its ass at Go. Or chess. Or whatever. Isn't that Silver's point?

It's also inherently unimpressive. Chess or go playing programs are widely discussed on the internet, so for an LLM to just regurgitate code is comparable to an LLM regurgitating the answers to the bar exam.

3. This is kind of unrelated and I think you're responding to another one of my posts, but from a historical perspective AI used to mean a machine that can think like a human being.

Unfortunately that turns out to be really hard. As a consequence you now see a shifting of the goal posts, with AI researchers asking questions like "How do we know that human consciousness isn't just correlation?" How convenient.

4. "Deception" can mean many things here. Does hallucination count? In terms of being tricksy in a scenario like a poker match that is an entirely different issue. I doubt any LLM can do it well and if any try I would bet that's accomplished with hand tuning rather than just relying on the training set.

Jabberwocky:

Why is it surprising that LLMs are bad at poker? They don’t have general intelligence; they feign written (and visual) intelligence by copying other people’s work. We are nowhere near AI, we’re just good at pretending intelligence. We’d need real learning and risk assessment to actually perform well at poker. Also, algorithmic or programmatic poker playing would be sufficiently good. Why does everyone act like AI is even necessary? A) we do have AI and B) what we do have is just a cheap (well, expensive computationally) parlor trick that can replace some jobs. LLMs aren’t even that good at coding, even though they source mostly working code. What a silly article.

N Scott Cardell:

Nate Silver,

I have been to Norway many times, but the only time post-Covid was to take a cruise to Svalbard, so some of my information might be out of date.

I recommend that you fly to Bergen and take the Flåm Railway and the fjord cruise on Nærøyfjord; both the rail journey and the cruise offer beautiful fjord views. (When I did this there was a bus from Bergen to the top of the Flåm Railway, and a rail option that, I think, involved a very short bus transfer.) Now some possibilities to consider:

1. Look into extensive Sognefjord boat tours. (Sognefjord is the longest fjord in Norway; Nærøyfjord is a small, particularly dramatic branch of Sognefjord.) There are also bus and driving options to overlooks and to some towns in the fjords.

2. I would check out the Hurtigruten coastal voyages. They run pretty much daily from Bergen and you can economically purchase a limited trip, getting off at ANY port. (Note: it wasn't easy to do this online, so I used a travel agent who had worked with Hurtigruten before.) One day takes you to Ålesund and 4 days to Tromsø. (I mention Ålesund and Tromsø because Ålesund can be a jumping-off point for fjord tours, and Tromsø is beautifully situated on two islands that look like a single island split by a fjord; also, the trip from Bergen to Tromsø includes the best fjord views on the coastal voyage.) On the trip from Bergen to Tromsø, Hurtigruten offers some tours that you can purchase; they do vary in quality and the advertising for them can be misleading. Also, you get to see some of the Lofoten Islands, which are beautiful.

3. If you need to keep to the approximately 5 days, you can fly back from Tromsø.

4. If you want to travel by rental car, there are now bridges and tunnels that let you drive to the Lofoten Islands. This is not entirely a blessing, as the development there has made visiting them less interesting than it used to be.

5. Flights to Bergen on Icelandic airlines allow free stopovers in Iceland; the one-day Golden Circle tours are very worthwhile.

So if it were me, and I had not previously seen a lot of the fjords (but somehow knew what I learned on the previous trips), and my time was severely limited: I would fly to Bergen (probably landing in the late afternoon or early evening) and stay at a place on or very near the old harbor. The next day I would take the Flåm Railway and the fjord cruise on Nærøyfjord, and that evening board a Hurtigruten coastal voyage ship headed to Tromsø; I would consider taking one or two excursions from the ship. After getting off in Tromsø, I would spend a day or part of a day exploring the area around Tromsø and then fly back to Bergen to catch my flight home.

Scott

N. Scott Cardell

Pullman WA

(I am leery about putting my email in a public forum, but I subscribe to the Silver Bulletin and you can find my email there.)

Stephen Smith:

Nate, you write ‘Personally, I don’t consider solvers themselves to be examples of “artificial intelligence”’. It sounds like you’re intentionally abandoning the pre-2022 definition of “artificial intelligence”. That definition includes not only LLMs and other machine-learning technologies, but also state-space search, both deterministic and probabilistic reasoning techniques, automated planning, and knowledge representation. Is this a conscious choice on your part? There has been a significant drift in the post-2022 meaning of “artificial intelligence” to the general populace, to generally mean solely LLMs, so such a choice would certainly lean descriptivist.

Christina Moraes:

ChatGPT coming out completely changed how the world uses the word AI. ChatGPT and “AI” suddenly became the hot new thing, and within a couple of years literally anything technological was being marketed to the general public as Artificial Intelligence. Maybe some of these things would’ve fit the pre-2022 definition of AI, but most of them didn’t, and AI started not to mean anything anymore.

I’ve 100% noticed in the last year society has tightened the definition and we’re still figuring out where it’ll land. I personally think AI has to be generative and to a certain extent cannot be “put on rails” but it’s hard to say…

Slaw:

Of course. "A machine that thinks like a human being" turned out to be really, really hard and so the goal posts had to be moved.

gmt:

I’m not Nate, but personally I thought that the pre-2022 use of the term was mostly cope by researchers who wanted to be able to use fancy terminology to describe their research, even if the research was really just a glorified calculator.

CJ in SF:

The pre-2022 versions were not "glorified calculators".

Or put another way, if those were glorified calculators then so is ChatGPT, and possibly the human brain also.

Adam:

All solvers do is just number crunch pre-programmed algorithms for every possible discrete poker hand and action and proceed from there. This is a LOT of possibilities, but it's just doing the same math problem a lot of times and is in no way learning.

Think about it more simply in a smaller game with a more understandable set of choices and combinations: blackjack. You can buy one of those little blackjack optimal-strategy cards in an airport; they fit on a notecard. They've done the math to determine what every possible combination of your cards and the dealer's up card means for your action. It's then pretty easy to learn the math on how card counting changes these odds. It's just simple math repeated for each possible combination.

Same in chess: you can tell a computer to simulate all possible future moves out of any given board position, and it can calculate the exact optimal course just by brute-force checking every possibility. There isn't intuition, it's just number crunching.

Solvers are doing the same thing, just for A LOT more combinations and variables in poker, so there is more compute needed.
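A toy version of the chess point, shrunk to a game small enough to enumerate completely (tic-tac-toe here, since chess obviously cannot be exhausted this way; this is an illustration of brute-force search, not how poker solvers are actually built):

```python
# Exhaustive minimax on tic-tac-toe: brute-force every continuation, no intuition.
# Board is a 9-character string of 'X'/'O'/' '. Score: +1 X wins, -1 O wins, 0 draw.

def winner(board):
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals
    for a, b, c in lines:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (best achievable score, best move index) for `player` to move."""
    w = winner(board)
    if w is not None:
        return (1 if w == 'X' else -1), None
    moves = [i for i, cell in enumerate(board) if cell == ' ']
    if not moves:
        return 0, None                          # board full: draw
    best_score, best_move = None, None
    for m in moves:
        child = board[:m] + player + board[m + 1:]
        score, _ = minimax(child, 'O' if player == 'X' else 'X')
        if best_score is None or \
           (player == 'X' and score > best_score) or \
           (player == 'O' and score < best_score):
            best_score, best_move = score, m
    return best_score, best_move

score, move = minimax(' ' * 9, 'X')
print(score, move)  # score 0: perfect play from an empty board is a draw
```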

Slaw:

The underlying algorithms are completely different. You need some kind of master program that recognizes that the LLM is unsuited for this task and turns to a completely different program (that plays chess, poker, go, whatever). Good luck with that.

Seneca Plutarchus:

I thought solvers just brute force various algorithms and calculate the expected values of the various steps - that doesn't seem like something I would call artificial intelligence.

Slaw:

There's also been a convenient retreat on the part of AI researchers/proponents in the sense of "Maybe human consciousness is just correlation".

Jim Schulz:

Ditto fly in and out of Bergen but if you want to rent a car and see the countryside plus some incredible sights without traveling huge distances check these out:

Sondal - cool spot interior on a fjord below the impressive Folgefonna glacier

Preikestolen - an incredible tourist spot to hike to. Dangle your feet over the edge of a 1000-foot cliff above a fjord after an hour’s hike.

Trolltunga - another smaller but wild rock formation above a fjord.

The roads / bridges / countryside along the route between these spots are incredible by themselves. One of the most memorable trips I’ve ever made.

zahmahkibo:

Tried the same prompt on o3, specifying "eight *elite professional* players". I do not have the chops to tell if it's any better.

--- Excerpt ---

Pre-flop action

Kenji ( J♠ 2♦ ) instantly mucks.

Laura ( 4♣ 8♠ ) folds.

Marco ( 8♥ 5♦ ) flings his suited wheel-draw hand away.

Priya ( 6♣ 8♣ ) opens to $80 from the CO, eyeing position and a double-suited board.

Victor ( 7♠ 2♣ ) doesn’t defend the button—fold.

In the small blind Alex peeks at Q♠ T♠, likes the suited Broadway texture, and calls $75 more.

Sofia ( J♣ 3♦ ) elects to fold the big blind.

Dmitri re-checks A♥ T♥, slides out a $280 3-bet, typical of his 11 % 3-bet range versus LP opens.

Alex tanks, decides the squeeze is too rich and surrenders his $80 dead money.

Priya, in position with a sexy two-gap suited connector, calls $200.

Pot to the flop: $655.

Seneca Plutarchus:

It is not better.

"Marco ( 8♥ 5♦ ) flings his suited wheel-draw hand away."

This doesn't actually describe anything real about his hand.

"In the small blind Alex peeks at Q♠ T♠, likes the suited Broadway texture, and calls $75 more."

Actually a terrible play for an elite professional; this is a 3-bet all the time, so this is a strategic error.

"Sofia ( J♣ 3♦ ) elects to fold the big blind.

Dmitri re-checks A♥ T♥, slides out a $280 3-bet, typical of his 11 % 3-bet range versus LP opens."

Was there anything about a straddle in your setup of this hand that you cut out of what you posted here? Because Dmitri should not be in this hand unless he is straddling UTG, which is not mentioned at all. This is where straddle action would take place, after the big blind folds. The presence of a straddle makes Alex's QTs call even worse, BTW.

Adam:

I assume it's Nate's exact setup, which includes a straddle.

Of course, then there is the nonsense of referring to Q10 as a "two-gap connector" which... just no.

Daniel Echlin:

Poker is solved-ish by computers; you could set up any AI model you want to play against such a computer and see what happens.

awindowcleanerme:

Everyone here is saying LLMs can't do math. That's false and not the problem.
