
A few weeks after Donald Trump’s second presidential win, I took the train up from London (where I was living at the time) to Oxford to attend a conference on polls and forecasts of the 2024 election. Most of the attendees were pollsters or academics, but I also watched presentations from Aaru and Electric Twin, two companies that do what is interchangeably called synthetic sampling, silicon sampling, or creating synthetic audiences. Sans startup jargon, that means they use large language models (LLMs) to simulate responses to public opinion polls by having AI agents take on the role of survey respondents.
I had already heard of Aaru thanks to some articles with eye-catching headlines like “No people, no problem: AI chatbots predict elections better than humans” in the months leading up to Election Day. The guys behind the company were making some big, some might even say far-fetched claims, such as: “within two years, we will simulate the entire globe — from the way crops are grown in Ukraine to how that impacts production of oil in Iraq, trade through the strait of Malacca, and elections for the mayor of Baltimore.” When Semafor asked Aaru’s cofounders — Cameron Fink and Ned Koh — about my boss, they said “we respect all those who came before us.” Nate (as he so often does) shared his thoughts on Twitter:
Fink and Koh were relatively good-natured about this back-and-forth when we spoke at Oxford. They even offered to mail me one of the t-shirts featuring Nate’s quote they apparently had made. I never took them up on the offer, which I now somewhat regret.
These synthetic sampling companies fell off my radar for a while, but they do still exist. In fact, Aaru recently received a $1 billion valuation. Is what they’re doing anywhere close to the most important frontier in AI development? Not by a long shot, especially when Anthropic just developed a model so adept at exploiting software vulnerabilities that it’s only being released to 40 companies.
Still, silicon sampling is increasingly finding its way into public polling. Axios reported in March that “a majority of people trust their own doctors and nurses” based on findings from Aaru — without mentioning that the “people” in that sentence were actually LLMs. Around the same time, the Public Sentiment Institute “boosted” their online sample of 373 real survey respondents with 114 AI agents.1 (Spoiler alert: even the co-founder of Electric Twin doesn’t think that’s a particularly defensible approach.) Polling companies like Qualtrics and Ipsos are also developing synthetic data panels.
So, what should we make of these … “polls”? Let’s get one thing out of the way: whatever they are, they’re not polls in the way that term is usually defined.
You can’t replace polls with AI
On one hand, using LLMs to essentially make up fake survey respondents sort of sounds like the dumbest idea ever, one that will only imperfectly replicate real polls while introducing all sorts of biases. On the other hand, with LLMs improving at a remarkable, perhaps even alarming rate, maybe I’m a dinosaur at the ripe old age of 24 because I still want to rely on polls that talk to actual people.
I’m not going to argue that synthetic samples are completely useless. In fact, as I’ll return to later, there is evidence that some techniques can replicate topline survey results quickly and cheaply. But the marketing from certain companies can be slightly optimistic. “No traditional poll will exist by the time the next general election occurs,” said Fink in 2024. We’re just 206 days away from the midterms, and based on the fact that I still have to collect a bunch of polls every day, I’d say he should have run that prediction by a sample of AI agents before the interview.2
To see why synthetic samples can’t replace polls, here’s a quick primer on how they work. The simplest version of these models involves taking an LLM (like ChatGPT or Claude), giving it a demographic profile (e.g., a white, college-educated woman who lives in Utah and makes $70k a year), and then asking it to respond to a survey question. You repeat that process a few thousand times using different demographic profiles and end up with a sample of synthetic survey responses.
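To make that loop concrete, here’s a minimal sketch of the basic version in Python. The demographic fields, the question wording, and the use of OpenAI’s chat API are all illustrative assumptions on my part, not a description of any particular company’s pipeline.

```python
# Minimal "silicon sampling" loop: invent a demographic profile, ask an LLM
# to answer a survey question in that persona, and repeat many times.
import random
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

QUESTION = ("Do you approve or disapprove of the president's job performance? "
            "Answer with a single word: approve, disapprove, or unsure.")

def random_profile() -> str:
    """Draw a crude demographic profile; real systems use far richer ones."""
    race = random.choice(["white", "Black", "Hispanic", "Asian"])
    education = random.choice(["college-educated", "non-college"])
    gender = random.choice(["man", "woman"])
    state = random.choice(["Utah", "Pennsylvania", "Georgia", "California"])
    income = random.choice(["$30k", "$70k", "$150k"])
    return f"a {race}, {education} {gender} who lives in {state} and makes {income} a year"

def ask_agent(profile: str) -> str:
    """Have the LLM answer the question in the voice of one synthetic respondent."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": f"You are {profile}. Answer survey questions as that person would."},
            {"role": "user", "content": QUESTION},
        ],
    )
    return response.choices[0].message.content.strip().lower()

# A few thousand synthetic "respondents" later, you have a topline.
topline = Counter(ask_agent(random_profile()) for _ in range(2000))
print(topline.most_common())
```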
The actual models used by private companies are more sophisticated than this, usually because they incorporate more demographic characteristics for each agent and provide them with extra information. Aaru, for example, feeds agents a diet of news and information they’d be likely to consume, while Electric Twin incorporates their customers’ proprietary data about the audience they’re trying to replicate. The way Ben Warner, the co-founder of Electric Twin, explained it to me was “we have a large amount of data on… for instance, 5,000 people. Can we make an accurate prediction of how they would respond to another question?”
Still, even setting aside cost, speed, and accuracy, it should be obvious why synthetic samples can’t replace polls. Polling is fundamentally a data collection process. We might use surveys to make predictions by feeding them into election forecasts, but the main purpose of a poll isn’t prediction; it’s gathering new data about what people think and how they feel. Silicon sampling, on the other hand, produces no new data. It’s simply a model: you input LLM training data, demographic prompts, and a bunch of other information, and it spits out a prediction of what a poll would say.
We love models here, but models aren’t polls. That difference is an important philosophical sticking point for most pollsters I talk to. “I think politics should stay away from [synthetic sampling], because we’re trying to… represent the voice of the people,” said Natalie Jackson, a vice president at GQR Insights. Democratic pollster John Hagner told me “I think I’m just incredibly skeptical of this idea. I don’t think it’s research. At that point, you’re asking the machine to tell you what you already believe.” Hagner has seen some presentations of early synthetic sampling experiments, but so far, “if it’s being used in a campaign, people are keeping it incredibly quiet.”3
But Eli, I hear you saying, aren’t polls themselves increasingly governed by modeling decisions? Indeed they are: pollsters’ choices about which sampling method to use, how to define their likely voter models, and how to weight their samples can and do lead to dramatic differences in the results they publish. Aaru even referenced these limitations in the methodology statement included with that maternal mortality “poll” Axios covered — although I’m using the term “methodology statement” loosely here, because it doesn’t really explain how the model works at all.
We can ignore the (frankly preposterous) implication that synthetic sampling isn’t subject to a separate set of biases. The important point is that there’s still a meaningful difference between using weighting and other statistical techniques on actual polling data and using a model to predict what a poll would say. The latter is far closer to election forecasts or techniques like MRP — potentially useful models, but not a replacement for polls.4
To be fair, other synthetic sampling companies are perfectly happy with the distinction between polls and models. Warner compared polling and synthetic sampling to different tools in a toolbox. “The mistake I think we make is we think that these new tools should either work in exactly the same way or somehow replace these old tools,” he said. “Rather than thinking of it as, okay, so we’ve always had the hammer, we’ve always had the screwdriver, now we’ve got a saw. But don’t use a saw to try [to] do the job of a hammer.”
A quick comment from Nate
Eli didn’t ask me for a comment — rather rude of him, don’t you think? But since I’m editing this story, I figured I’d add a few quick thoughts rather than putting words in his mouth.
Beyond the frequently misleading marketing, what bothers me most about the AI “poll” hype is this: as AI tools make statistical inference cheaper and/or better (note that these are not synonyms), the comparative value of collecting original data actually increases. You might be able to train a model to make a reasonable estimate of what some hard-to-reach poll respondent would say — say, a young Black man who voted for Trump. (Such a person checks a number of boxes for a voter who is usually hard to reach in surveys.) Indeed, this is closely related to what models like the Silver Bulletin forecast already do. They essentially smooth out the kinks in noisy survey data by making inferences based on past voting patterns, national polls, or surveys of other states.
But you don’t actually know what these voters think unless you’re reaching them directly. If there’s a shift in opinion among this subgroup, you’re not going to detect it. So if I were running a campaign, I’d invest more in going the extra mile to find a representative sample of these voters. And then I’d hire some smart quants — perhaps with help from Claude et al. — to figure out the implications for campaign strategy based on that proprietary data that my competitors didn’t have access to. -Nate Silver
Are these models any good?
If synthetic surveys are just a new type of model, the next obvious question is whether the models are at least accurate. The answer to that question depends very much on who you ask.
On one end of the spectrum, you have the maximalist argument that synthetic sampling is better and more accurate than actual polls. “It’s an incredibly challenging problem to go to someone and say ‘hey, we’re going to be more accurate at predicting human behavior than you, even when you talk to your customers directly,’” Koh recently told CNBC. In his view, synthetic sampling isn’t a saw to polling’s hammer, it’s “magic.”
There’s certainly evidence that synthetic samples can replicate certain survey toplines. But if Aaru does have any examples of their approach outperforming the polls, they’re keeping those to themselves.5 Aaru’s 2024 election model, for example, had Kamala Harris leading in Michigan, Nevada, Pennsylvania, and Wisconsin on November 4th. And although they’ve since taken down their forecast page, they gave Harris a 50.5 percent chance of winning the race on November 2nd.6
After the election, Fink told Semafor he was happy enough with those results because they were “within margin of error,” a term that is meaningless when applied to a “sample” of AI agents. And of course, Aaru says their models have improved since 2024, so supposedly now they’d be more accurate than the polls? Still, their stronger argument is on cost: “We are significantly faster and cheaper than traditional polling, and still more accurate,” said Fink. The first two claims there are undeniably true, but the third brings us to the opposite end of the spectrum.
Both Jackson and Hagner are skeptical that these models are reliable for anything much beyond replicating common survey toplines. “I just… don’t think the machines are what we want when we’re looking for nuanced views. My example on this is people in Arizona and Nevada in 2024 who voted for Trump and voted for expanding abortion in their states on ballot initiatives,” said Jackson. Hagner identified a similar issue: “the reports that have come through at the meetings that I’ve been at are that the early experiments on this, they cannot get respondents to be as racist or sexist or, frankly, as negative as human respondents.”
Academic research mostly agrees on this point. While some papers show promising results when using LLMs to replicate polling data, most find that LLMs suffer from various quirks: they produce too few “don’t know” responses, they can seriously overpredict the favorability of politicians like Donald Trump and Kamala Harris, and they produce too little variation between demographic subgroups, so the difference in predicted opinion between Democrats and Republicans, for example, comes out too small.
When I asked Warner about these studies, his response was that just because academics can’t get synthetic sampling to work doesn’t mean the technique doesn’t work in general. “Actually, the argument is, okay, yours does not [work]. That does not mean […] for this complex set of machinery, which uses a lot of investment, a lot of time, a lot of money, you can’t get it to work.”
Cards on the table, I’m somewhat sympathetic to this argument because academics aren’t exactly great at making election forecasts. Usually, the people with skin in the game are the most accurate. Warner’s argument is that the approach Electric Twin takes — making multiple predictions for each synthetic respondent using different models and prompts, then averaging those predictions, a sort of ensemble forecast — produces better results than the simpler academic models.
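For illustration, here’s a rough sketch of what that kind of ensembling might look like, building on the loop sketched earlier. The specific models, prompt framings, and majority-vote aggregation are my own guesses; Electric Twin hasn’t published the details of its actual method.

```python
# Hypothetical ensemble version of ask_agent(): query several model/prompt
# combinations per synthetic respondent and aggregate the answers, rather
# than trusting any single completion.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

MODELS = ["gpt-4o-mini", "gpt-4o"]  # illustrative; a vendor would choose its own
FRAMINGS = [
    "Answer survey questions as that person would.",
    "Answer as that person would if a pollster reached them by phone tonight.",
]

def ensemble_response(profile: str, question: str) -> str:
    """Majority vote over every model/prompt combination for one synthetic respondent."""
    votes = Counter()
    for model in MODELS:
        for framing in FRAMINGS:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": f"You are {profile}. {framing}"},
                    {"role": "user", "content": question},
                ],
            )
            votes[response.choices[0].message.content.strip().lower()] += 1
    # "Averaging" here is just a majority vote across the ensemble's answers.
    return votes.most_common(1)[0][0]
```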
Warner shared with me a comparison between his method and the method from a recent academic paper, and Electric Twin was indeed able to replicate the survey results more accurately. Even so, he acknowledged that synthetic sampling “is not a crystal ball.” “If you asked me, do I think using other data sources will be more accurate than asking somebody who they will vote for, I would probably say no. But if you asked me ‘would your system be useful for our turnout modeling today?’ I would say yes.”
But for better or worse, it looks like the method is already getting more popular in the market research world. Most of the clients Aaru talks about these days are businesses like EY and McDonald’s.
That’s not to say AI won’t pop up in other parts of the political polling process. Pollsters are already using it to code open-ended survey responses, and some firms, like YouGov, are testing the use of LLMs to ask survey respondents questions.
More worryingly, one danger to actual polls is that AI agents can be used to infiltrate online surveys. Most online polls use various checks to prevent that from happening, but there’s conflicting evidence on how effective those filters are and how prevalent AI agents currently are in online panels. If those agents ever become impossible to detect, it might spell the end of online polling, but the solution isn’t to replace all of your respondents with ChatGPT.
That particular poll obviously doesn’t meet Silver Bulletin standards for aggregation. In any case, we exclude all Public Sentiment Institute polls from our averages because we classify them as an amateur polling firm.
You could argue that Fink meant the next presidential election, but (a) I’m also confident we’ll still have real polls in 2028 and (b) in that case he should have asked an LLM to define “general election.”
Quick caveat: that’s reporting from a Democratic pollster. It’s possible that Republicans are more willing to use AI in political campaigns.
Indeed, Silver Bulletin does not include “polls” produced by MRP in our forecasts or averages, and we think it’s extremely misleading when their practitioners describe them in a way that suggests original data has been collected across a large number of states or Congressional districts.
A recent report from Aaru and EY did show two examples of a synthetic estimate being closer than a survey to a real-world benchmark — but I’d take those findings with a grain of salt because the report reads more like an ad and didn’t involve any sort of prediction being made ahead of time.
For comparison, our odds for Harris on the same day were 48.2 percent.

Back in the day, I used to lead some research efforts on neural network speech recognition. One approach that looked promising (mostly done in labs other than ours) was to train a large model on many hours of speech, then use that model to generate synthetic data for training a much smaller model. While the very "deep" and large models were better at learning, the smaller ones were just about as good, could be pretty shallow, and required much less computation for inference. In our lab we also used such methods to improve our speech recognition models with the same networks. So I can see how this could be helpful. But there is no question that having more real data, especially in something as time-varying as the political landscape, would be better than synthetic data, which probably just fills in a few holes in the models.
One way that these methodologies could exceed the performance of traditional polling not addressed here is in handling non-response bias. For example, while many demographics may be very hard to reach in a traditional poll, they may still have an observable footprint in other ways such as social media interaction, purchasing decisions, and news consumption. I find it quite plausible that a highly semantically aware model like an LLM could semi-reliably glean how people in the demographic might vote based on that data. It definitely won’t be perfect but I don’t find it hard to believe that it will be better than noise or no data at all.
So while I agree that claiming these models know people better than they know themselves is quite preposterous, and there will certainly be significant sources of error, I think it is entirely reasonable that these approaches can work with far more data than traditional polling, which could result in superior performance. What are your thoughts on this?