The issue with your scenario is that you're hitting the things LLMs are bad at: LLMs don't see letters, they see parts of words, so they have historically struggled with counting letters and with any form of arithmetic, and asking one to do the math for the poker pot is therefore very difficult. In addition, you're having it simulate an entire table of players instead of recommending moves to play: this pushes it toward narrative tropes when crafting its answers, so you get a game of poker that reads like one from a novel, where the author doesn't really know the poker mechanics but knows what story beats he wants to hit.
The better evaluation would be to have it be a player in a series of live poker games and then see how well it does.
Finish the article…
What you said is literally the entire thesis of Nate’s article and is stated pretty clearly to me in the conclusion
Lukas is making a statement of fact, not attempting to contradict Nate.
Nate doesn't understand how LLMs work, so he is describing a simple test he did and analyzing the results in detail.
Lukas is saying it was doomed from the start, and his substack page and a few Google searches make it clear he has the expertise to make this statement.
Pontificating about things he has no idea about is Nate's specialty.
"Better evaluation" in what sense? Why isn't it interesting that it fails at this one?
In the sense that evaluating whether a screwdriver is a good hammer only really proves that the evaluator is confused.
No LLM has ever managed to learn the rules of arithmetic. Why anyone believes that they are capable of learning the rules of far more complex systems is a great mystery to me.
Literally, they just write Python or JavaScript programs when they need to do arithmetic and run those programs.
As an aside, most people would call Python more complex than arithmetic.
"they just write python or javascript programs when they need to do arithmetic" is not true.
Go ask any LLM to multiply 6784000213 times 2379980541. See if you get the correct answer of 16145788497079855233.
"multiply 6784000213 times 2379980541"
Claude 4 writes a program and then
"The result of 6,784,000,213 × 2,379,980,541 is 16,145,788,497,079,855,000.
Note that this number is so large (about 1.61 × 10¹⁹) that it exceeds JavaScript's safe integer range, so there may be some precision limitations in the final digits. For the most precise result with such large numbers, you'd want to use a big integer library or mathematical software designed for arbitrary precision arithmetic."
So it "wrote" a program that returned the wrong answer?
Huh? It did it in floating point, told me the answer rounded, said the answer is rounded, and said what you need to do to get an exact answer.
Do you really want me to type the words "Can you do that?" and hit enter? Okay, here it is:
Perfect! Using BigInt for exact precision, the result is:
16,145,788,497,079,855,233
This is the completely accurate result. As you can see, the last few digits are different from the previous calculation (233 instead of 000) because BigInt maintains full precision for arbitrarily large integers, while regular JavaScript numbers lose precision beyond about 15-16 digits.
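For anyone who wants to see the precision point concretely, here is a minimal Python sketch (my own check, not the code Claude ran): Python's built-in integers are arbitrary precision, so the exact product is free, while 64-bit floats round off exactly the way the first answer did.

```python
# Checking the two answers quoted above: exact integer math vs. 64-bit floats.
a, b = 6_784_000_213, 2_379_980_541

exact = a * b                      # Python ints are arbitrary precision
as_float = float(a) * float(b)     # doubles keep only ~15-16 significant digits

print(exact)              # 16145788497079855233
print(f"{as_float:.0f}")  # 16145788497079855104 -- precision already lost
```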
You're a human being. You know what the answer is supposed to be because MarkS _told_ you what the answer is.
By contrast the program told you "This answer MAY be rounded, but I'm not going to check if it is. Instead if you want to double check yourself here's some information on arbitrary precision."
It's apparent to a human being that if you know the answer could be wrong, you should do the due diligence to check whether it's correct and then do the extra work to actually get the right answer. The AI doesn't "understand" that. The ultimate gauge here is whether or not these tools are useful to human beings, and this is one example, of many, where the answer is technically correct but ultimately useless.
There is no way that Python is more complex than partial differential equations.
OK? That seems unrelated.
If by "arithmetic" you mean the level of math that's taught in grade school, maybe. But Python (and object-oriented programming in general) is not exactly intellectually demanding either.
LLMs don't learn rules.
And most humans don't reason with "rules" the way they believe they do.
You say that AI models hallucinate very rarely, but even if they only do it 1% of the time, doesn’t that massively detract from the value of what they produce? If that 1% means you have to double check everything they do - even the 99% that’s correct - that undermines their value considerably.
Plus more frequency means more mistakes. If you leverage some AI tool with a 1% failure rate once a year that's maybe not too bad. If it's firing off millions of times an hour that's an entirely different story.
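A quick back-of-the-envelope sketch of that frequency point (hypothetical numbers, purely to illustrate the compounding): with a 1% per-call error rate, errors are nearly guaranteed once the tool runs at any volume.

```python
# Probability of at least one error, and expected error count, at a 1% failure rate.
error_rate = 0.01
for calls in (100, 10_000, 1_000_000):
    p_any = 1 - (1 - error_rate) ** calls
    print(f"{calls:>9,} calls: ~{calls * error_rate:,.0f} expected errors, "
          f"P(at least one) = {p_any:.3f}")
```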
Do you fire employees who get things wrong 1% of the time?
Is the employee a pharmacist, surgeon, bus driver, etc?
And how wrong are we talking about?
Those are the jobs safer from wholesale elimination by AI, yes.
Well, no, for a few reasons.
1. Yeah just in practice verifying info is far easier than constructing it.
2. You can use Perplexity or a similar service which also dumps the necessary news/books citations on you.
3. "Individual contributors" in any knowledge work make mistakes way more than 1% of the time. So the question is if the senior engineer or the editor would like to automate away the IC. That is certainly happening to some hard-to-discern degree and the bottom rung of the career ladder is endangered.
4. In a lot of domains AI also has the ability to check itself. Programming/software engineering has very sophisticated layers of checking and testing, and AI can just check the results of those tests itself.
5. The scope of those checks in (4) is increasing rapidly as AI learns how to verify its work as well as produce it.
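On point 1, a tiny illustration (my example, not from the thread) of why verifying is so much cheaper than constructing: finding the factors of a large number takes real search, but checking a proposed factorization is a single multiplication. That same asymmetry is what the automated test suites in point 4 exploit.

```python
# Verifying a proposed answer is one multiply; producing it is the hard part.
def verify_factorization(n: int, factors: list[int]) -> bool:
    product = 1
    for f in factors:
        product *= f
    return product == n

print(verify_factorization(16145788497079855233, [6784000213, 2379980541]))  # True
print(verify_factorization(16145788497079855233, [6784000213, 2379980542]))  # False
```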
I'm working on an Oh Hell book---draft deadline 7/1!---and this almost precisely mirrors my experience with ChatGPT. It has been incredibly helpful for a lot of the writing---copy editing, organizational suggestions, etc.---but it just cannot play the damn game. It makes laughable structural and substantive errors, and brings a hilarious confidence that puts you in the overt uncanny valley every time it talks at all about strategy. And I'm literally teaching it the strategy when I show it my writing. The whole experience has made me both more in awe of LLMs---no idea how long the manuscript would have taken me without LLM feedback---and very skeptical that we are super close to an AGI coming out of this particular paradigm absent another breakthrough.
One question I think we should ask ourselves is whether we're expecting a human level of intelligence or whether we're expecting a smart expert level of intelligence. Right now I think AI is a lot like a college student who's used to BSing their way through essays. I think a lot of humans are bad at math and poker strategy and are likely to lose track of stacks and misunderstand poker strategy. Is AGI supposed to be on the level of an average human, or on the level of an expert?
For example, my company is trying to replace some of the Bangalorean entry-level contractors we use with AI, and I think for the most part, that's doable. Our Bangalorean entry-level contractors are often easily confused. If you are asking them to do anything other than their most common rote tasks, it's often much better to go through their permanent, very smart Bangalorean supervisors with your request so that you can make sure your request is fully understood. AI is absolutely not at the point where the supervisors can be replaced. But the entry-level temp contractors are mostly handling rote tasks and only occasionally asked to do something more complicated. Once you train the AI on the rote tasks, I'm sure it'll accomplish them much faster than the humans would have.
One major issue with the human-vs-computer intelligence comparison IMO is that humans and computers learn in very different ways. The vast majority of our skills and knowledge comes from spending years personally interacting with our environment and seeing how it responds. Computers learn using incredible amounts of already-existing text, images, audio files, etc. The reason LLMs sound so much like college students BSing their essays is because that's kind of what they are: they have consumed a great deal of raw information, but they have no "real life" experience.
I think it's pretty reasonable to assume LLMs will be able to perform rote intellectual tasks, like generating secondary reports based on a large number of primary sources, much better than humans. But will they ever, say, figure out how to facilitate communication between you and your Bangalorean entry-level contractors better than a Bangalorean supervisor who has spent years working with companies like yours? I'm not at all confident about that.
That's not really true. Like an hour ago I gave it a "debug this thing" problem which was sort of rote, but frankly I couldn't do it, and I'm not going to semantics away the fact that it could do it and I could not. Turns out a certain database system's docker image wasn't built for Mac by default, who knew.
I've figured out how to use AI to get over writer's block. For me personally -- "write this thing" or "correct my writing" doesn't work. What does work is to write my own draft and ask for feedback, which I find easy to respond to since it's very good at it.
I also frequently just dump in a PDF, and it's clear I want a summary, which it gives. I don't really need to explain to it how to summarize.
And this is generalist... I don't have to re-train a contractor each time.
The issue is that o3 doesn't actually understand poker at all. Rather, it is able to produce a simulacrum of a poker game based on previous written descriptions it has seen. This is why it is generally able to get the structure, steps, terminology, and form correct while completely whiffing on the actual logic of the game. It's also not something a model like o3 is likely to improve on, given its architecture, even with more training. Approaching this type of problem requires a fundamentally different architecture (for example, some of the reasoning architectures may produce better results over time).
This is also one of the reasons transformer-based models in particular are so confusing to people in terms of their capabilities. They can, for example, easily generate code to count the number of letters in a word, but if you ask them to actually count the number of letters in a word, they struggle. If you understand the actual architecture of such models, this behavior is expected, but for someone who has been told the model is "nearly AGI," it is confusing.
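For concreteness, this is the sort of trivial program such a model can generate on request, even though doing the same count "in its head" trips it up because it sees tokens rather than characters (a minimal sketch of my own, not from the original comment):

```python
# Counting occurrences of a letter in a word -- easy to write, hard for an
# LLM to do directly because it never sees the individual characters.
def count_letter(word: str, letter: str) -> int:
    return sum(1 for ch in word.lower() if ch == letter.lower())

print(count_letter("strawberry", "r"))  # 3
```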
Using o3 instead of o4 seems like setting up ChatGPT for failure. I've found o4 superior even for basic data calculations like providing cumulative data for many election cycles. I asked ChatGPT o4 how it would likely do compared to o3 in poker, and it replied:
Overall Efficiency for Poker
Speed & Cost: o4 is around 2–3× more efficient operationally.
Strategic Depth: Both models perform similarly in terms of intelligence and reasoning, but o4’s larger context means better memory of prior actions, crucial for bluff detection, bet sizing, and tracking player styles.
Stamina: For continuous games or training simulations, o4 scales far better.
If I had a nickel for every time an AI evangelist told me "use a different model"...
Duh!
I think that's an uninformed "Duh" though. If you're far enough behind on AI capability, you hear about a non-capability, feel your priors are confirmed, fire off a "told you so" on appropriate social media, and continue in your worldview.
I think the more appropriate question is why would anyone think that an LLM would be able to play poker in the first place. It's pure ignorance.
Because it's good at math and deception?
If you know anything at all about LLMs, then you know they're not good at math.
If you know anything about programming, then you know that the algorithms used by game programs are nothing like the algorithms used by LLMs.
If you know anything about LLMs, then you know that the basic responses are just statistical correlation. "Deception" is added in afterwards by teams of interns.
1. https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/ unless you're trying to go for "reasoning models don't technically count as LLMs," in case whatever who cares.
2. Try the prompt "write a reinforcement learning connect 4 solver, train it, and have it play against me" in an agentic programmer loop. I did it on Claude 3.7; it works. You could do it with Go, the code is the same, it just takes more compute than I have.
3. Yeah, idk what point you're making by pointing out that LLMs are just electronic-statistical models. Brains are just chemicals. Nuclear reactors are just teeny tiny atoms splitting. Etc.
4. It can do some deception just by prompting it, like write a short story about a detective. It's also learning deception on its own as an emergent behavior. Here's a bunch of examples https://redwoodresearch.substack.com/p/alignment-faking-in-large-language. This isn't good btw. Like it's cool they're powerful enough to make coding easier but it's bad they're powerful enough to essentially break out of their training environment.
1. Since Silver is testing an LLM, duh. More to the point, the algorithms employed by an LLM (which at their heart are based on statistical correlation) are not the same algorithms employed by something like AlphaGo.
Do laymen need to care about that? Of course not. They'll just judge the final product on how useful it is.
Do engineers need to care about that? You better believe they do. A few months ago all the chatter was about switching models in and out depending on the problem. That's a much harder problem than it first appears to be.
2. And I, or any halfway decent player, would kick its ass at Go. Or chess. Or whatever. Isn't that Silver's point?
It's also inherently unimpressive. Chess or go playing programs are widely discussed on the internet, so for an LLM to just regurgitate code is comparable to an LLM regurgitating the answers to the bar exam.
3. This is kind of unrelated and I think you're responding to another one of my posts, but from a historical perspective AI used to mean a machine that can think like a human being.
Unfortunately that turns out to be really hard. As a consequence you now see a shifting of the goal posts, with AI researchers asking questions like "How do we know that human consciousness isn't just correlation?" How convenient.
4. "Deception" can mean many things here. Does hallucination count? In terms of being tricksy in a scenario like a poker match that is an entirely different issue. I doubt any LLM can do it well and if any try I would bet that's accomplished with hand tuning rather than just relying on the training set.
I'm curious why you didn't direct it to use code to handle randomizing and evaluation of winning hands. Otherwise, the LLM is literally incapable of truly drawing the cards randomly and honestly.
https://sanand0.github.io/llmrandom/
Anyway, I gave the code version a shot, I just added "Please execute code to handle randomizing dealing and the hand." to the end of your original prompt.
https://chatgpt.com/share/682e5484-d4b8-800c-ab74-4a33df06470a
I have no idea how to evaluate the strategy, but at least you can be sure the deal is truly random
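For reference, a minimal sketch of what "execute code to handle the randomizing and dealing" buys you (my own illustration, not the code in the linked chat): a genuinely uniform shuffle of the full deck, instead of the model picking cards that merely look random.

```python
import random

RANKS = "23456789TJQKA"
SUITS = "shdc"
DECK = [r + s for r in RANKS for s in SUITS]   # 52 cards

def deal(num_players: int = 8, seed: int | None = None):
    rng = random.Random(seed)
    deck = DECK[:]
    rng.shuffle(deck)                           # uniform shuffle of the whole deck
    hands = [deck[2 * i: 2 * i + 2] for i in range(num_players)]
    board = deck[2 * num_players: 2 * num_players + 5]
    return hands, board

hands, board = deal(seed=42)
print(hands[0], board)
```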
“ChatGPT’s explanation might sound smart if you don’t know poker, but it’s basically just word salad.”
This is true for most LLM output on most subjects: the more you know about the subject, the less sense the output makes. I am deeply sceptical of anyone who is inexpert in such a large number of subjects that they take a long time to notice this.
Tried the same prompt on o3, specifying "eight *elite professional* players." I do not have the chops to tell if it's any better.
--- Excerpt ---
Pre-flop action
Kenji ( J♠ 2♦ ) instantly mucks.
Laura ( 4♣ 8♠ ) folds.
Marco ( 8♥ 5♦ ) flings his suited wheel-draw hand away.
Priya ( 6♣ 8♣ ) opens to $80 from the CO, eyeing position and a double-suited board.
Victor ( 7♠ 2♣ ) doesn’t defend the button—fold.
In the small blind Alex peeks at Q♠ T♠, likes the suited Broadway texture, and calls $75 more.
Sofia ( J♣ 3♦ ) elects to fold the big blind.
Dmitri re-checks A♥ T♥, slides out a $280 3-bet, typical of his 11 % 3-bet range versus LP opens.
Alex tanks, decides the squeeze is too rich and surrenders his $80 dead money.
Priya, in position with a sexy two-gap suited connector, calls $200.
Pot to the flop: $655.
It is not better.
"Marco ( 8♥ 5♦ ) flings his suited wheel-draw hand away."
This doesn't actually describe anything real about his hand.
"In the small blind Alex peeks at Q♠ T♠, likes the suited Broadway texture, and calls $75 more."
Actually a terrible play for an elite professional; this is a 3-bet all the time, so it's a strategic error.
"Sofia ( J♣ 3♦ ) elects to fold the big blind.
Dmitri re-checks A♥ T♥, slides out a $280 3-bet, typical of his 11 % 3-bet range versus LP opens."
Was there anything about a straddle in your setup of this hand that you cut out of what you posted here? Because Dmitri should not be in this hand unless he is straddling UTG, which is not mentioned at all. This is where straddle action would take place, after the big blind folds. The presence of a straddle makes Alex's QTs call even worse, BTW.
I assume it's Nate's exact setup, which includes a straddle.
Of course, then there is the nonsense of referring to Q10 as a "two-gap connector" which... just no.
Here's the full convo https://chatgpt.com/share/682e5c28-cefc-800c-8657-f4e833ef3bd0
Poker is solved-ish by computer, you could set up any AI model you want to play against such a computer and see what happens.
Nate, you write ‘Personally, I don’t consider solvers themselves to be examples of “artificial intelligence”’. It sounds like you’re intentionally abandoning the pre-2022 definition of “artificial intelligence”. That definition includes not only LLMs and other machine-learning technologies, but also state-space search, both deterministic and probabilistic reasoning techniques, automated planning, and knowledge representation. Is this a conscious choice on your part? There has been a significant drift in the post-2022 meaning of “artificial intelligence” to the general populace, to generally mean solely LLMs, so such a choice would certainly lean descriptivist.
ChatGPT coming out completely changed how the world uses the word AI. ChatGPT and "AI" suddenly became the hot new thing, and within a couple of years literally anything technological was being marketed to the general public as Artificial Intelligence. Maybe some of these things would've fit the pre-2022 definition of AI, but most of them didn't, and AI started not to mean anything anymore.
I’ve 100% noticed in the last year society has tightened the definition and we’re still figuring out where it’ll land. I personally think AI has to be generative and to a certain extent cannot be “put on rails” but it’s hard to say…
Of course. "A machine that thinks like a human being" turned out to be really, really hard and so the goal posts had to be moved.
All solvers do is just number crunch pre-programmed algorithms for every possible discrete poker hand and action and proceed from there. This is a LOT of possibilities, but it's just doing the same math problem a lot of times and is in no way learning.
Think about it more simply in a smaller game with a more understandable set of choices and combinations: blackjack. You can buy one of those little blackjack optimal-strategy cards in an airport; they fit on a notecard. They've done the math to determine what every possible combination of your cards and the dealer's face card means for your action. It's then pretty easy to learn the math on how card counting changes those odds. It's just simple math repeated for each possible combination.
Same in chess: you can tell a computer to simulate all possible future moves from any given board position, and it can calculate the exact optimal course just by brute-force checking every possibility. There isn't intuition; it's just number chugging.
Solvers are doing the same thing, just for A LOT more combinations and variables in poker, so there is more compute needed.
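To make the "brute force every possibility" idea concrete, here is a toy, fully searchable game (Nim: take 1-3 stones, whoever takes the last stone wins) solved the same way. This is a sketch of my own rather than anything from the thread; chess and poker solvers work on the same principle, just over astronomically more states and with heavy pruning and abstraction.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def mover_wins(stones: int) -> bool:
    """Exhaustively check whether the player to move can force a win."""
    if stones == 0:
        return False  # no stones left: the previous player took the last one and won
    return any(not mover_wins(stones - take)
               for take in (1, 2, 3) if take <= stones)

print(mover_wins(20))  # False -- multiples of 4 are lost for the player to move
print(mover_wins(21))  # True
```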
The underlying algorithms are completely different. You need some kind of master program that recognizes that the LLM is unsuited for this task and turns to a completely different program (that plays chess, poker, go, whatever). Good luck with that.
I thought solvers just brute force various algorithms and calculate the expected values of the various steps - that doesn't seem like something I would call artificial intelligence.
I’m not Nate, but personally I thought that the pre-2022 use of the term was mostly cope by researchers who wanted to be able to use fancy terminology to describe their research, even if the research was really just a glorified calculator.
There's also been a convenient retreat on the part of AI researchers/proponents in the sense of "Maybe human consciousness is just correlation".
Nate Silver,
I have been to Norway many times, but the only time post-COVID was to take a cruise to Svalbard, so some of my information might be out of date.
I recommend that you fly to Bergen and take the Flåm Railway and the fjord cruise on Nærøyfjord; both the rail journey and the cruise offer beautiful fjord views. (When I did this, there was a bus from Bergen to the top of the Flåm Railway and a rail option that, I think, involved a very short bus transfer.) Now some possibilities to consider:
1. Look into extensive Sognefjord boat tours. (Sognefjord is the longest fjord in Norway; Nærøyfjord is a small, particularly dramatic branch of Sognefjord.) There are also bus and driving options to overlooks and some towns in the fjords.
2. I would check out the Hurtigruten coastal voyages. They run pretty much daily from Bergen, and you can economically purchase a limited trip, getting off at ANY port. (Note it wasn't easy to do this online, and I used a travel agent who had worked with Hurtigruten before.) One day takes you to Ålesund and 4 days to Tromsø. (I mention Ålesund and Tromsø because Ålesund can be a jumping-off point for fjord tours, and Tromsø is beautifully situated on two islands that look like a single island split by a fjord; also, the trip from Bergen to Tromsø includes the best fjord views on the coastal voyage.) On the trip from Bergen to Tromsø, Hurtigruten offers some tours that you can purchase; they do vary in quality, and the advertising for them can be misleading. Also, you get to see some of the Lofoten Islands, which are beautiful.
3. If you need to keep to the approximately 5 days, you can fly back from Tromsø.
4. If you want to travel by rental car, there are now bridges and tunnels that let you drive to the Lofoten Islands. This is not entirely a blessing, as the development there has made visiting them less interesting than it used to be.
5. Flights to Bergen on Icelandic airlines allow free stopovers in Iceland; the one-day Golden Circle tours are very worthwhile.
So if it were me, I had not previously seen a lot of the fjords (but somehow knew what I learned on the previous trips), and my time was severely limited: I would fly to Bergen (probably landing in the late afternoon or early evening) and stay at a place on or very near the old harbor. Then the next day I would take the Flåm Railway and the fjord cruise on Nærøyfjord, and that evening board a Hurtigruten coastal voyage ship headed to Tromsø; I would consider taking one or two excursions from the ship. Then, getting off in Tromsø, I would spend a day or part of a day exploring the area around Tromsø and fly back to Bergen to catch my flight home.
Scott
N. Scott Cardell
Pullman WA
(I am leery about putting my email in a public forum. But I subscribe to the Silver Bulletin and you can find my email there.)
Very interesting. Nate's work on this issue may help refine the problems with inaccuracies and hallucinations of the AI boiling all around us. I'm not at all sure if this is a monster coming at us to destroy us or whether it's vaporware, like fusion energy or nanotech, neither of which went anywhere after a huge build-up. Or, maybe just a really, really good search engine.
I'm in love with the Perplexity AI (which, annoyingly, is privately held ---) and use it daily and even pay the monthly fee for more in-depth searches. Not sure if that is necessary: you get five deep searches free, they want people using it. However, it does have a percentage of inaccuracies. I asked if a certain Latinist is still alive (P.L. Chambers) and Perplexity disclaimed all knowledge of the existence of this person. I own three of her books right now so I rephrased the question a little, and up came all the info, and the AI even researched death records, it claimed. (Didn't find any for her so Perplexity thinks she's still with us.) I find that rewording the question or adding sentences to the query makes a lot of difference. It seems to be important to already know something about the subject ---- so that I'll know if the answer is wrong entirely. The AIs like elaborate queries and mine are sometimes several sentences.
I tend to think experiments like this show why LLMs will be insufficient to get to AGI. The results are even more hysterical when you try to use ChatGPT to play chess. It not only isn't a very good chess player, it can't even stop itself from making obviously illegal moves.
LLMs have some really impressive capabilities but they're really bad at building a "mental model" of a system that has to function in a rigid way, where small details being wrong makes the whole thing non-sensical. And that's a really important part of intelligence.
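The illegal-move problem is also exactly the kind of thing that is trivial to catch with ordinary code, which underlines the gap. For example (a sketch assuming the third-party python-chess library), you can have a rules engine reject whatever the model proposes instead of trusting it to track the board:

```python
import chess  # pip install python-chess

def try_llm_move(board: chess.Board, move_san: str) -> bool:
    """Apply the model's proposed move if legal; otherwise reject it."""
    try:
        board.push_san(move_san)   # parses SAN and raises on illegal moves
        return True
    except ValueError:
        print(f"Illegal or unparseable move: {move_san}")
        return False

board = chess.Board()
print(try_llm_move(board, "e4"))    # True
print(try_llm_move(board, "Ke2"))   # False -- illegal from the starting position
```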
I'm curious whether ChatGPT is any better at sports gambling than it is at poker. A lot of sites offer various profit boosts in different scenarios, and I would like to know when these are plus-EV bets assuming I select the outcome at random. This is fairly easy on a -110/-110 bet but gets a lot more complicated with the same-game parlays these boosts are typically offered on. Using 4o, I have not had any confidence in the results.
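For the simple case mentioned above, the -110/-110 math looks something like this (a sketch with a hypothetical 30% profit boost and a coin-flip pick; the same-game parlay version really is much messier because the legs are correlated):

```python
def boosted_ev(american_odds: int, boost: float, win_prob: float) -> float:
    """Expected profit per $1 staked on a negative-American-odds bet with a profit boost."""
    base_profit = 100 / abs(american_odds)   # -110 pays $100 profit per $110 risked
    return win_prob * base_profit * (1 + boost) - (1 - win_prob)

print(f"{boosted_ev(-110, 0.00, 0.5):+.3f}")  # -0.045: a plain -110 coin flip is -EV
print(f"{boosted_ev(-110, 0.30, 0.5):+.3f}")  # +0.091: a 30% profit boost flips it to +EV
```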