
Real talk on models, moderation, and the misuse of academic authority

It’s time to stop being polite. Here are 36 takes on how moderation does or doesn't help candidates win — and why I trust outsiders to build election models more than academics.

It’s been a long time since we’ve had occasion to use the Model Talk banner at Silver Bulletin. But I think this post calls for it. Usually model talks are paid content, but this one is free. We do very much appreciate your subscriptions, though.


Earlier today, I conducted a Substack Live with Lakshya Jain of the excellent elections analysis site Split Ticket, which is also now partnering with the new Substack newsletter, The Argument.

Recently, Jain has been engaged in an argument over how much moderate candidates, as opposed to progressives, overperform at the ballot box. Split Ticket publishes a calculation called WAR1, which has found that moderates, controlling for other factors — mainly incumbency and the partisan makeup of the district in other races — indeed usually do a few points better. These relationships are noisy and, for reasons I’ll get into later, probably becoming less important as political partisanship devours everything in its wake. Still, I agree with Jain that there is almost certainly some signal there. Indeed, various measures of ideology or partisanship have been incorporated into our congressional forecasts for many years, and they meaningfully contribute to predictive accuracy.
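
To make the “controlling for other factors” idea concrete, here is a minimal sketch of a WAR-style residual calculation. The data, variable names, and coefficients below are all simulated and hypothetical; this is not Split Ticket’s actual formula. The core move: regress candidate margins on the fundamentals, and treat whatever is left over as overperformance.

```python
# A minimal sketch of a WAR-style residual, NOT Split Ticket's actual
# formula. All data here is simulated and the coefficients are made up.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 400
pres_margin = rng.normal(0, 15, n)  # district presidential margin (partisan baseline)
incumbent = rng.integers(0, 2, n)   # 1 if the candidate is an incumbent
moderation = rng.normal(0, 1, n)    # hypothetical candidate moderation score

# Simulate House margins in which moderation is worth ~2 points of margin
house_margin = (0.9 * pres_margin + 3.0 * incumbent
                + 2.0 * moderation + rng.normal(0, 4, n))

# Step 1: fit the fundamentals (partisanship + incumbency) only
X = sm.add_constant(np.column_stack([pres_margin, incumbent]))
fit = sm.OLS(house_margin, X).fit()

# Step 2: the residual is overperformance relative to the fundamentals
war = house_margin - fit.fittedvalues

# If moderation genuinely helps, it correlates with the residual
print(np.corrcoef(moderation, war)[0, 1])
```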

However, WAR has triggered some pushback, particularly from a pair of political scientists, Adam Bonica (Stanford) and Jake Grumbach (Berkeley), who recently wrote a post critiquing Split Ticket’s work. As you’ll see, I have some fairly nuanced opinions about the moderation question itself. But as someone who’s put as much time into researching this exact question as just about anyone, I don’t take such an equivocal view of the Bonica and Grumbach post. I think they’re unfair to Jain and Split Ticket, are rhetorically manipulative at almost every turn, and show a lot of evidence of sloppy thinking and poor methodological instincts. (G. Elliott Morris has also published his own version of WAR — you probably know that I don’t think highly of Morris — but with just one or two exceptions, I’m going to focus on Jain/Split Ticket versus Bonica/Grumbach.)

So what you’ll find below are the quite extensive notes, rewritten for clarity, that I prepared for myself in advance of the conversation with Jain. They’re divided into two sections: first, more general comments about how much moderation really matters, and then some more specific comments about Bonica and Grumbach, sometimes referred to as “B&G”.

My views on the WAR wars

  1. The median voter theorem (MVT) offers what should be an uncontroversial proposition: that in popularity contests — in other words, elections — taking more popular positions helps you win.

  2. To the extent voter preferences exist along an ideological spectrum, this usually implies taking positions somewhere toward the middle of it.

  3. With that said, everyone in this debate can be squishy about the distinction between “moderate” vs. “popular” vs. “taking positions that happen to resonate well with the local electorate”. There may be issues on which the platforms offered by both major parties are well to the left or the right of the median voter, perhaps erring too far away from the direction of populism relative to what would maximize their chances of winning. Conversely, centrist elites tend to be “fiscally conservative but socially liberal.” But I’m not sure you’d win a lot of elections by, say, pledging to substantially increase immigration levels while cutting social spending to lower deficits.2

  4. Still, if you study elections for any length of time, MVT is one of those fundamental forces you have to account for. Sometimes it’s highly relevant to your calculations. Sometimes it can be ignored or can be outweighed by other factors. But it’s always there, lurking in the background.

  5. If you don’t like that analogy — and indeed, you should be suspicious of so-called iron laws in the social sciences — maybe the better comparison is to something like the proposition: “Exercise is good for your health”. You can object that certain types of exercise are better than others. Some types involving repeated stress might even be harmful if you do them too much. Or that other factors — diet, genetic luck — are more important. There are also some notoriously complex confounders in the data (people who are fastidious about exercising are often conscientious about other things too). But it’s a strong enough regularity that if your model finds that exercise hardly matters at all, it should be regarded with suspicion. This is basic Bayesian reasoning.

  6. It’s not just at the district-by-district level where we find evidence of this. It’s also at the macro level. In recent elections, voters have felt like the Democratic Party has shifted too far left away from their beliefs — yes, even as compared to Trump! — and that’s corresponded with a rough era for the party electorally.

  7. Multidimensional models of voting, in which voter preferences are distributed along some sort of Cartesian plane (e.g. with “cultural issues” on the x-axis and “economic issues” on the y-axis), are theoretically a significant improvement on a one-dimensional model. With that said, more issues have gotten compressed onto a single left/liberal/Democratic vs. right/conservative/Republican spectrum lately, which arguably makes MVT more relevant than ever.

  8. On the other hand, voter preferences have grown more bimodal. If 45 percent of voters take a liberal position, 10 percent take a moderate stance, and the remaining 45 percent take the conservative view, it’s lonely in the middle and what MVT predicts about the benefits of centrism is less clear.

  9. There are lots of things apart from policy that it’s perfectly rational for voters to take into account. Efficacy in getting things done, lack of scandals/corruption, intelligence to adapt to new circumstances, and so on. In New York, where I’ll have a vote in the mayoral election this November, Zohran Mamdani isn’t the closest candidate to me on some sort of ISideWith quiz. Still, I might vote for him anyway for those other reasons.

  10. If you want to claim that the Democratic establishment right now is particularly lame, and therefore there might be a premium on outsider views of any kind, I’m sympathetic, even though this reeks of special pleading. (There are always exceptions to every rule, but it’s easy to claim that everything is an exception.) But the relationship between “establishment-ness” and centrism is messy. Outsider centrists, like Dan Osborn in Nebraska, could theoretically do as well as outsider leftists like Zohran. I also don’t agree with those who say that Kamala Harris ran to the center: there was too much left-coded baggage from her past campaigns, and she wasn’t very persuasive about explaining these shifts; voters are rightly skeptical of Etch-a-Sketch candidates.

  11. This is one of the most critical points: there’s strong evidence even in my own work — we implemented a change to our congressional model before 2022 to reflect this — that the importance of moderation has lessened. But that’s also true for all “candidate quality” factors such as incumbency or experience. Herschel Walker lost, for instance, but not by all that much. Everything has increasingly been swallowed by partisanship.

  12. And/but, marginal effects matter more if we can be more confident that control of the country will come down to a couple of points in a few key states or races. A 2-point moderation “bonus” might not matter much if incumbents usually win by 10 points, but if everything is a nail-biter, it matters more.

  13. As Matt Yglesias writes, there have been a lot of efforts to define whether moderation is “overrated” or “underrated”, but that’s inherently somewhat in the eye of the beholder. Morris recently wrote that “strategic moderation in 2024 could have increased a Democrat’s vote share by 1-1.5 points”, implying that this is trivial. This is already a little confusing because Morris is referring to vote share (“Trump got 56 percent of the vote in Texas”), and not the more intuitive concept of vote margin (“Trump won Texas by 14 points”). Vote margin is roughly twice as high because in a two-way race, basically any vote that’s not for your guy is for the other guy. In other words, it’s not 1 to 1.5 points of vote margin, it’s 2 to 3 points; see the sketch after this list for the arithmetic. Well, that’s huge! About as much as the assassination attempt against Trump or Biden’s disastrous debate moved the polls. Morris also writes that this might increase the chance of a candidate winning by “just” 10 percent. That’s huge, too, especially if multiplied across dozens of races.

  14. This should probably go in the “meta” section below, but you can tell who has or hasn’t had any skin in the game when you see a statement like this. I can’t tell you how many late nights over the years I’ve spent trying to eke out an extra 1 percent or 0.1 percent or 0.01 percent of accuracy from my models. Just last night, for instance, I stayed up late to look up the altitudes of dozens of historical NFL stadiums because it might make our estimates of home-field advantage ever so slightly better for the NFL metric we’re building.

  15. The extreme partisanship we see today, especially in party primaries, also weeds out a lot of high-leverage data points. We don’t have that many examples of how a moderate Republican or conservative Democrat would perform, because the primary electorates even in “opposite-colored” states (e.g. Democratic primaries in red states) often select more “extreme” candidates. When the Larry Hogans of the world get through to a general election, they seem to overperform, but it’s an increasingly small sample.

  16. But I’m not sure you can read these overwhelming levels of partisanship as refuting the MVT. As the parties have become more ideologically “coherent”, voters rationally infer more from the party labels and perhaps correctly assume even Hogan or Joe Manchin types will often do the bidding of their party leaders.

  17. Elections aren’t everything; the whole point of winning them is to steer the direction of the country by implementing your agenda. Generally, people in politics — not to mention every other walk of life — are too reluctant to admit to trade-offs. Part of what I object to is a “have your cake and eat it too” attitude. Thermostatic effects have increased in recent years; it’s hard for the president to be popular, and his ideas usually become less popular once he’s in office. Vibe-shifts are short-lived. But this coincides with parties having been ambitious about implementing their agendas. Biden actually got a lot done, especially relative to his narrow majorities, but Democrats paid a price for it. And certainly, Trump is going well beyond his narrow mandate and will probably be punished by the electorate for it, too. There being a penalty for fucking around and finding out is broadly consistent with the MVT.
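
Before moving on, here is the back-of-envelope arithmetic promised in point 13. The numbers are assumptions for illustration (a 1.5-point gain in two-way vote share and a 5-point standard deviation on the final margin), not estimates from any model:

```python
# Vote share vs. vote margin, plus why marginal effects matter most in
# close races. All numbers here are illustrative assumptions.
from scipy.stats import norm

share_gain = 1.5              # points of two-way vote share
margin_gain = 2 * share_gain  # every vote you gain is one your opponent loses
print(margin_gain)            # 3.0 points of margin, not 1.5

# Suppose the final margin is normally distributed around its expected
# value with a 5-point standard deviation (an assumption).
sigma = 5.0
for expected_margin in (0.0, 10.0):
    before = norm.cdf(expected_margin / sigma)
    after = norm.cdf((expected_margin + margin_gain) / sigma)
    print(f"expected margin {expected_margin:+.0f}: win prob {before:.2f} -> {after:.2f}")
# In a tied race, win probability jumps from .50 to about .73; in a
# 10-point race it barely moves. Nail-biters are where moderation pays.
```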

Bonica, Grumbach and how to detect academic BS

Brandolini's law suggests that “the amount of energy needed to refute bullshit is an order of magnitude bigger than that needed to produce it.” So to level the playing field, it’s sometimes necessary to look for “tells” that one side of an argument is engaging in bullshit. (Bullshit in the Harry G. Frankfurt sense: not necessarily lying, but spinning to the point of callous indifference toward the underlying facts; more generally, arguing in bad faith.) Everybody, including me, is guilty of this sometimes. But when you see a cluster of the tells I’ll describe below, you can start to assume that rhetorical tricks, appeals to authority and other logical fallacies are propping up an argument.

  1. Let’s start with B&G’s opening line. In boldface, they write: “The authors are political scientists.” Yep, that’s literally the first thing you see. Credentials are nice, but the first rule of credentials is that you don’t talk about your credentials. Not unless they’re specifically relevant because your opponent is making a rookie mistake. And B&G have no right to pull rank on Split Ticket, which has been every bit as much in the weeds on analyzing elections as they have. They’re like a guy who shows up in a Harvard sweatshirt to a community board meeting about the construction of a new swimming pool, who keeps reminding you at every opportunity that he went to the Department of Architecture at the Harvard Graduate School of Design when critiquing an unobjectionable design from a well-regarded team of local architects.3 The correct reaction to this person is: “Who does this guy think he is?”, probably with a couple of f-bombs thrown in.

  2. Specifically in the field of election forecasting — and the debates over WAR are really debates about forecasting, because the upshot is the potential performance of various types of candidates in future elections — most of the advancements have been made by self-taught tinkerer types, often with some background in statistics but not PhDs. In contrast, forecasts published by academics have a mostly bad track record. While you can find some exceptions in both buckets, I don’t think this is a coincidence. The outsiders who gain a following usually have some “street smarts” about modeling, and modeling is mostly about “street smarts” and domain knowledge and less about the book smarts you’ll gain in another year of a PhD program.

  3. Statistical inference — knowing your way around a complicated dataset until you develop a rigorous understanding of the underlying dynamics of the system — isn’t really something that’s selected for in academia. It’s improved by having a lot of reps, but the pace of the academic publication pipeline is too slow — and unless you draw Andrew Gelman as Reviewer #2, your reviewers are unlikely to be specialists in inference either. The replication crisis suggests there’s a lot of sloppy work making it into even the top journals.

  4. Those modeling skills are also improved by having skin in the game: by suffering a reputational loss because you published forecasts that were wrong or buggy — or even suffering a financial loss because you have experience making bets for a living. Split Ticket has this: they actually publish race ratings and probabilistic forecasts. Bonica and Grumbach do not.

  5. Here I’m being speculative, but I imagine there is also increasing selection out of academia among smart quants with a Riverian mindset, the sort of people who have a real intuition for modeling and are competitive about it. This is because of the rise of finance and tech — most recently AI — which can offer considerably more financial upside for quants in an environment free of academic political bullshit. (It’s noteworthy that, in contrast to many previous technologies, the frontiers of AI have mostly been pushed by private companies, not academia or government.) If a smart young quant today asked me whether to pursue a career in academia or go into industry, my recommendation would be industry in 4 out of 5 cases; the exception would be risk-averse people who had a very good chance of getting tenured at a very good program and really liked the academic lifestyle.

  6. Jain covered this well in his own piece. But Bonica and Grumbach are totally out of line — after priming their readers by pointing out that Jain presented his findings at a conference of centrist Democrats — by writing that Split Ticket WAR is a “biased metric” because moderates have better WAR scores. Imagine that I analyzed how well NFL quarterbacks perform based on their college stats. It turns out that high draft picks perform better than is predicted by their college numbers alone.4 In B&G’s way of thinking, this would imply that my metric is “biased” toward high draft picks. But really, just the opposite is true; my metric is biased against high picks because there’s useful information in their draft position that I otherwise haven’t accounted for. (For a toy simulation of why the bias runs in that direction, see the first sketch following this list.)

  7. There’s a lot of slipperiness in B&G’s argument because Split Ticket WAR is deliberately intended to capture the residual — what’s left over after you account for the presidential vote in each district, incumbency, and some other “fundamental” factors — whereas their model includes these factors in the regression terms. The Split Ticket residual — performance above what the “fundamentals” would suggest — tends to be higher for moderates, which means that moderation is one measure of higher “candidate quality”. This is a feature of their system, in other words, and B&G are trying to spin it as a bug.

  8. But if you want to talk about bias, fine — let’s talk about publication bias in the academic literature. Academics are overwhelmingly more likely to be registered as Democrats than Republicans; this has always been true, but it’s especially true in an era of high educational polarization. Other things being equal, findings are more likely to be published when they imply good news for progressives. Also, you’re more likely to have a result published when it upends the conventional wisdom, and the consensus in political science for many years was that the MVT has a lot going for it and that moderation probably helps. A finding that says the existing literature is wrong in a way that implies good news for progressives is going to be catnip to journal reviewers, Substack subscribers, and others.

  9. Track record has to count for something, too. In April, Grumbach, Bonica, and two other researchers published an analysis suggesting that, in contrast to the emerging conventional wisdom, it’s not true that “non-voters are Republicans and that Trump would have won by more if everyone had voted”. They based that claim on a single survey, the CES. But that survey has since been revised based on verification against actual voter files and now finds the nonvoters leaned toward Trump after all. What’s especially telling is that Grumbach tweeted that “all of the publicly available data” (emphasis mine) says the assumption about non-voters being Trump-friendly was wrong. This is just a complete misrepresentation of the literature. In fact, as Nate Cohn wrote about in June, all studies except the CES had found non-voters to be GOP-leaning, and now CES does too.

  10. To bolster their claim that Split Ticket is biased, B&G cite an example involving incumbency that is at best extremely confused — and probably strengthens Split Ticket’s case. “Let's say two things are true at the same time,” Bonica and Grumbach write.5 “Incumbents, on average, get a 3-point electoral bump just for being incumbents [and] incumbents also happen to be, on average, more moderate than challengers.” The passive-voiced “also happen to be” language is a tell here. How do you get to be an incumbent? By winning elections!6 So if this supposition is true — incumbents are relatively moderate — it’s further evidence that moderation is helpful electorally.

  11. Bonica and Grumbach give away the game in one of the footnotes they write to their Substack post. “Our supplementary analyses using various model specifications consistently find effects [of moderation] under 3 percentage points, with most showing no statistically significant effect at all,” they say. Let’s unpack this. First of all, like Morris, they’re referring to vote share and not vote margin. So in terms of vote margin, the effects of moderation are up to 6 points in some of the models they tried out. That’s huge! Moreover, moderation was a statistically significant predictor in some of their models — although not most, they say. So they ran a whole bunch of model specifications and happened to go with one that showed little effect of moderation, even though their other models did. 🤔

  12. By the way, it’s totally fine to run your model in a whole bunch of different ways. This is a sign of street smarts, actually. You don’t need to pretend that the answer was handed down to you on a tablet from Moses. You want to spend a lot of time familiarizing yourself with the data inside and out. What street smart model-builders do is develop a sense for which factors are most robust once they’ve tested out a variety of specifications. This is where a lot of the “art” of statistical inference comes in.

  13. If you’re finding that a factor makes a large difference in some versions of your model but little or none in others, sometimes the issue is that your underlying data is dubious and you need to go back and QC the data, collect more of it, or go back to the lab and develop a more rigorous algorithm to process it.

  14. Although the “composite ideology score” B&G cite in their blog post is opaque, Bonica has put a lot of good work into a project called CFScores, which can be used to impute ideology based on donor giving. For instance, if a donor has a record of giving to AOC and Bernie, the other candidates she gives money to are probably progressives, too. This is a clever approach (a toy sketch of the general idea appears after this list). But the nature of fundraising has changed, with far more out-of-state giving and disproportionate contributions to candidates who are stars on social media. In fact, we used to use CFScores in FiveThirtyEight’s congressional forecasts, but stopped doing so before a big model refresh in 2018 because too many of these scores didn’t pass the smell test. Some Democrats, in particular, came out as much more left-wing in CFScores than you’d expect based on other measurements. Instead, we now rely solely on a measure based on Congressional voting, and find that candidates who break with their party more often usually overperform.

  15. Domain knowledge — for instance, making forecasts of every Congressional district, as Split Ticket does — can help greatly for vetting different model specifications. In our video chat, Jain cited examples of candidates whose ideology seems to be misspecified by B&G. While you can’t control for every one of these, there were enough to suggest that CFScores and the other measures B&G use to measure ideology introduce a lot of noise.

  16. Yet another tell: B&G brag about how they “used a machine learning method called a glmnet” to fit their estimates. But machine learning — ironically enough, what Jain does in his day job — is a totally inappropriate tool in a case like this. The textbook will tell you to use machine learning rather than classical inferential statistics when the following conditions hold:

    1. You have an extremely large data set — millions of observations or more7 — not the dozens or hundreds or at best thousands of observations that you’re usually limited to in election analysis;

    2. The relationships between the variables are ambiguous, such as in how human beings put different words together to form sentences — again, not like in politics, where the “rules” of the system are more orderly and there’s a finite list of well-specified factors (incumbency, moderation, the presidential vote, etc.) that could plausibly provide signal;

    3. You care more about predictive accuracy than causal explanation. Even the OpenAI researchers I interviewed for my book referred to GPT as a “big bag of numbers”. It works — at least some of the time — and they don’t necessarily care how it works: the proof is that it works. But in this case, B&G do care about explanation, or you’d think they would. They repeatedly accuse Split Ticket of lacking transparency, while using a method that’s a black box.

  17. Bonica and Grumbach then go on to brag about how well their model fits the past data: “Judged by the standard of predictive power, our BG-WAR model is vastly superior. Our transparent, out-of-sample model explains 92% of the variation in congressional vote shares.” This isn’t necessarily as impressive as it sounds, because a great deal of the variation in Congressional vote shares can be explained by the overall partisanship of the district. But to the extent it’s what their model spits out, they may just be overfitting the data, as ML methods often do when inappropriately applied to sparse data sets (see the last sketch following this list). If you claim to have “predicted” every past election with implausible accuracy, that’s usually a sign that you’ve tortured the past data until it fit. This is another negative tell, the Allan Lichtman problem.

  18. But at least Lichtman is actually putting himself out there and making (sometimes wrong!) predictions, which is more than you can say for Bonica and Grumbach. There’s a lot of jargon in their post about “in-sample” versus “out-of-sample” prediction. I think this discourse is usually confused. Here’s a simpler way to tell whether something is actually a prediction: it’s a claim about an event that hasn’t happened yet. It’s really as simple as that. There’s no technique that can overcome the “original sin” of knowing what the results were before you started designing your model.8

  19. To be fair, though, it’s a stretch — possibly one designed to sound impressive to peer-reviewers — to describe the technique that B&G applied as “machine learning”. Instead, it’s a form of lasso regression, which, when applied to a small dataset like this, is a kissing cousin of the much-derided stepwise regression. Basically, they’re telling their statistical software to use an algorithm to drop some of the variables from their equation; the last sketch below shows this kind of automated variable selection in miniature. But these algorithms aren’t particularly sophisticated, aren’t really intended for use cases like this, and will generally be inferior to the choices made by an experienced researcher with more domain knowledge. On top of that, B&G didn’t really trust their ML algorithm — as their footnote makes clear, they tried out a whole bunch of specifications before settling on this one.
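
Three sketches to close out the list above, each a toy illustration rather than anyone’s actual model. First, the draft-pick analogy from point 6: if draft position carries real signal that a college-stats-only metric omits, players with good draft position systematically beat the metric, which means the metric is biased against them, not toward them. All numbers are made up.

```python
# Toy simulation of the draft-pick analogy in point 6. Everything here
# is simulated; the 0.5 weight on draft capital is an assumption.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
college_stats = rng.normal(0, 1, n)
draft_capital = rng.normal(0, 1, n)   # real signal the metric ignores
noise = rng.normal(0, 1, n)

true_performance = college_stats + 0.5 * draft_capital + noise
projection = college_stats            # the metric uses college stats only
residual = true_performance - projection

# High picks beat their projections on average: the positive
# correlation means the metric UNDERRATES them.
print(np.corrcoef(draft_capital, residual)[0, 1])  # roughly +0.45
```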
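
Second, the donor-based ideology idea from point 14. This is a toy illustration of the general approach of inferring positions from who gives to whom; it is not Bonica’s actual CFScores estimator, which is far more sophisticated and runs on millions of real contribution records.

```python
# Toy version of inferring candidate ideology from shared donors.
# Donors, candidates, and the giving matrix are all invented.
import numpy as np

candidates = ["ProgA", "ProgB", "ModC", "ModD", "ConsE"]
# Rows = donors, columns = candidates; 1 means the donor gave.
gifts = np.array([
    [1, 1, 0, 0, 0],  # gives only to progressives
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],  # gives only to moderates
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],  # gives only to conservatives
], dtype=float)

# Center the columns and take the first right singular vector:
# candidates who share donors land near each other on one recovered
# dimension (the sign of the axis is arbitrary).
centered = gifts - gifts.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
for name, score in sorted(zip(candidates, vt[0]), key=lambda t: t[1]):
    print(f"{name}: {score:+.2f}")
```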
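
Finally, the overfitting and variable-selection worry from points 17 and 19. With a few dozen observations and many candidate predictors, an unpenalized regression can “explain” nearly all of the in-sample variance while flopping out of sample; a lasso (the glmnet family) zeroes out weak predictors, which on data this sparse amounts to automated variable selection. Toy data, not B&G’s specification.

```python
# Overfitting demo: 40 observations, 30 predictors, only one of which
# carries real signal. Toy data; not B&G's model or data.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(2)
n_train, n_test, p = 40, 200, 30
X = rng.normal(size=(n_train + n_test, p))
y = 2.0 * X[:, 0] + rng.normal(size=n_train + n_test)  # x0 is the only signal

X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]

# Unpenalized OLS: near-perfect in-sample R^2, much worse out of sample
ols = LinearRegression().fit(X_tr, y_tr)
print("OLS   in-sample R^2:", round(ols.score(X_tr, y_tr), 2),
      "| out-of-sample:", round(ols.score(X_te, y_te), 2))

# The lasso shrinks most coefficients to exactly zero: automated
# variable selection, akin in spirit to stepwise regression here
lasso = Lasso(alpha=0.1).fit(X_tr, y_tr)
print("Lasso in-sample R^2:", round(lasso.score(X_tr, y_tr), 2),
      "| out-of-sample:", round(lasso.score(X_te, y_te), 2))
print("Predictors the lasso kept:", int((lasso.coef_ != 0).sum()), "of", p)
```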

I understand that academics are having a difficult time under Trump, and I’m sympathetic. But how academics present themselves to the public in cases like this matters. Maybe three or four times a year, I’ll give a talk at a university or be drafted to teach a friend’s class. Basically without exception, these are great experiences, from students to faculty to staff. It’s much different than the impression you’d get of academics from social media and Substack. I don’t think it’s one of the primary factors — but perceptions of academia have profoundly declined in the social media era, perhaps because it’s clear how much academics tend to foreground their personal politics. So other academics ought to be more willing to call out examples like the Bonica and Grumbach post, where academics use the imprimatur of “science” to weigh in on politically sensitive questions with highly questionable methods.


2

I sometimes get branded as a centrist — I think that’s somewhat incorrect and really I’m just a liberal as that term was originally defined. But, in any event, I’m not claiming the Democratic Party would do better if it implemented a more Nate-approved agenda (federal subsidies for the construction of new poker rooms?).

3

You might imagine, if you like, that the proposed site of the pool is down the block from him. He’s not a neutral party. Maybe his real concern is the effect it will have on his property value or his commute to the Big City.

4

This actually isn’t as clear as you’d think, in part because high draft picks are more likely to get opportunities to fail — see Ryan Leaf — but we’ll get into that in our NFL coverage.

5

Though oddly, they also write that “this specific example doesn’t apply to the Split Ticket model”.

6

In our modeling, in fact, we find that appointed incumbents — like interim senators who are tapped to fill a seat when the elected senator dies or resigns — don’t really get any of the benefits that elected incumbents do.

7

The GPT models are trained on trillions of tokens from text collected on the Internet, for instance.

8

Even if a researcher “holds out” some of his data, his approach to the problem — which model specifications he finds to be more plausible, which variables he’s thinking of evaluating — is going to be affected by knowing the outcome.
