It’s time to stop being polite. Here are 36 takes on how moderation does or doesn't help candidates win — and why I trust outsiders to build election models more than academics.
Interesting, I work in bioinformatics (so need data science + biology domain knowledge). We deal with the same issues. I wish there was more understanding that even in technical fields, analysis is never just mechanical number-crunching—it’s shaped by human discernment: which variables to include, how to frame the question, what assumptions to privilege. At its core, data science is as much an art of judgment as it is a science of measurement. Unlike the physical sciences, where controlled experiments and natural laws provide a stable foundation, data science deals with messy, contingent, and socially constructed data. In practice, this means models are always provisional, interpretive, and deeply dependent on human choices. In theory, too, the field was never purely positivist: it emerged as a hybrid discipline precisely because it had to combine rigorous statistical methods with the softer skills of inference, domain knowledge, and interpretive judgment. That’s what makes it powerful—but also what demands humility, a quality often missing in political data analysis.
These are great points. Now that I've actually gone back and read B&G's critique as well as Jain's response, I definitely agree with Jain and Nate that the tone of B&G's piece is off-putting and feels like they are attempting some credential-bullying -- humility is not the term that comes to mind reading it, and the implication that Split Ticket was _politically_ biased seemed mean-spirited and unfounded.
That said, I think the obnoxious presentation masked some legitimate technical points that got misrepresented in the video here. Fair warning: this comment will probably wind up being almost article-length, but I don't have my own Substack to post it in, so here it is.
At the risk of committing the cardinal sin of credentials, I'll lay out mine here -- not to bully anyone into accepting my claims uncritically, but to convince readers that the EV of reading this comment (and critically evaluating it on the merits!) is worth the effort, to flag which aspects I'll have the most to say about, and perhaps, on the flip side, to out my possible "tribal biases." I have a Ph.D., not in political science but in statistics, and spent several years working as a statistics professor, though I'm one of those folks who exited academia for industry due in part to academia's perverse incentives. As someone who consciously rejected the academic track after trying it out, I don't think I have a reflexive instinct to defend academics -- if anything, perhaps the opposite -- but I also know how easily the nuances of statistical models can get lost, both in other academic disciplines that use statistics and outside academia.
I also need to acknowledge that, since I don't have the explicit equations or source code of these models, I'm making some educated guesses about what precisely they're doing. But there are at least two big things B&G point out that I think got a bit unfairly dismissed here, unless they were misrepresenting what the Split Ticket folks actually did.
The first is the question of how incumbency and other similar factors were controlled for. I think this is a subtler point than it was made out to be. Specifically, regressing an outcome on a known predictor and then applying a constant adjustment based on that analysis to a subsequent model of the residuals _is_ different from including both predictors in a multiple regression model. And here I'm not 100% sure how Split Ticket went about constructing their WAR stat, but if in fact they _did_ take this sequential approach, then I do think it's a problem.
The issue is that when you regress a dependent variable on a single predictor, the coefficient you get picks up not only the variance attributable to that one predictor, but also any variance attributable to other predictors that are correlated with it (i.e., confounding variables). Anyone who has done multiple regression will be familiar with the fact that including a new predictor in a model alters the coefficients associated with the previous predictors. This is because the coefficient in a multiple regression model is _not_ (as is often mistakenly assumed) the "effect of the variable with the others held constant"; it is an _adjustment_ to the prediction associated with the extent to which the predictor differs from the value you'd expect it to take based on its relationship with the other predictors.
In other words, if I fit a model of electoral margin with incumbency, ideology, and fundraising as predictors, say, the coefficient for ideology would be telling me about the relationship between the _residual_ electoral margin from a model that includes incumbency and fundraising and the _residual_ ideology from a model that has ideology as the dependent variable, with incumbency and fundraising as predictors. That is, how much nonredundant information is there in learning a candidate's ideology that improves the prediction of electoral margin relative to how well you could do without knowing their ideology?
In contrast, suppose you fit a model with only incumbency and fundraising as predictors and find that incumbents perform 3 points better than non-incumbents after "controlling for" fundraising. Some of that 3 points is _already_ attributable to ideology insofar as ideology is partially predictable from the other two predictors. If you now apply that 3 points as a constant adjustment in a subsequent model that regresses the residuals from the first model onto an ideology predictor, you're no longer looking at the relationship between _residual_ ideology and residual margin; you're looking at the relationship between _absolute_ ideology and residual margin, some of which was already factored into the constant adjustment you made.
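To make that concrete, here's a minimal simulation of the difference, with made-up variable names and coefficients (to be clear, this is not Split Ticket's actual model or data, just an illustration of sequential residual adjustment versus a joint multiple regression):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Made-up predictors: ideology is correlated with incumbency and fundraising,
# so the three carry overlapping information.
incumbency = rng.binomial(1, 0.5, n).astype(float)
fundraising = 0.8 * incumbency + rng.normal(0, 1, n)
ideology = 0.6 * incumbency + 0.4 * fundraising + rng.normal(0, 1, n)
margin = 3.0 * incumbency + 1.5 * fundraising + 2.0 * ideology + rng.normal(0, 2, n)

def ols(X, y):
    """Least squares with an intercept; returns [intercept, coefficients...]."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

# (a) Joint multiple regression: recovers the true ideology coefficient (~2.0).
full = ols(np.column_stack([incumbency, fundraising, ideology]), margin)

# (b) Sequential approach: regress margin on incumbency + fundraising first,
#     then regress those residuals on *raw* (not residualized) ideology.
step1 = ols(np.column_stack([incumbency, fundraising]), margin)
resid = margin - np.column_stack([np.ones(n), incumbency, fundraising]) @ step1
step2 = ols(ideology.reshape(-1, 1), resid)

print("ideology coefficient, joint model:     ", round(full[3], 2))   # ~2.0
print("ideology coefficient, residuals-on-raw:", round(step2[1], 2))  # attenuated, ~1.5
```

The sequential version comes out attenuated because the step-1 adjustment has already absorbed the part of ideology that is predictable from incumbency and fundraising, while the downstream regression still uses raw ideology rather than residualized ideology.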
Again, I don't even know if B&G are making the right assumption about Split Ticket's methodology, and maybe Split Ticket actually _didn't_ make this mistake. But even if they did, unlike what B&G seemingly assume, I don't take this issue as a sign of incompetence or of political bias meant to buoy moderate candidates, but rather as a subtle nuance of multiple regression modeling that is _very easy_ to lose sight of.
So that's the first point. And I'll have to leave the others to subsequent comments to make the "article in a comment" thing a _little_ less ridiculous.
Interesting. One thing that opponents of Lakshya and Nate say that I’m sympathetic to is that there is a lot more uncertainty about these things. Even in biology, when we deal with raw but real genomic data, there is so much uncertainty that I would be very careful about making even correlational claims about genetic variations and disease phenotypes. So I can imagine that in political data analysis, where the data itself is socially constructed, people should be a LOT more caveated and uncertain about the claims they make.
Yeah. I think this attention to uncertainty is a strength of Nate's, and one of the reasons I follow him. In most of the disagreements I've seen him involved in, he's the one on the side of reminding people that politics and polling, etc. are a lot more complicated, with a lot more uncertainty, than the simpler pundit models imagine. That said, I wish he'd actually publish his model code for full transparency, as there's inevitably precision and transparency lost in the translation to a verbal explanation (though there's also transparency gained by writing up what you did in natural language). I also understand why he doesn't from a business standpoint, but the ideal "marketplace of ideas" doesn't have any room for proprietary techniques or analyses.
I saw Elliott and Lakshya argue recently at GD’s substack and it was revealing. I would recommend it as it clarified their disagreements and helped me think through what I think about these models/predictions.
I just checked out this video (or as much of it as I was allowed to watch as a non-paid-subscriber to GD's substack) -- thanks for pointing it out! Again, without diving into the model code or hearing the full debate (though apparently Morris's model is open source, so maybe I can check it out!) it sounds like Morris's model takes a more principled approach to monitoring uncertainty, and also a more "causal" approach in avoiding using noisy outcomes such as presidential vote as predictors of other concurrent noisy outcomes like congressional vote. I haven't tried to sit down and think through the implications of that choice, and it sounds like _not_ including that was Jain's principal critique of Morris's model (which he was getting into right as the video cut off), but there are bound to be some subtle implications.
Do you want me to send you a paid subscription? I can forward a “gift” subscription as a paid member lol
In 3 1/2 years’ time it’ll be: “did Republicans shift too far to the right?” There’ll be hot takes on how the GOP brand is toast. I think we’re in an electoral period like 1884-1900, and we’re in for a few more swings yet.
I agree with you, and the pendulum has already swung back and forth a few times.
"The GOP brand is toast" is exactly what people were saying in mid-2016 with all of the "Death of the Republican Party" think pieces saying how the nomination of someone as obviously unelectable as Donald Trump would doom the GOP to being a "regional party" for decades.
I'd elaborate on point 28, the comment, "with most showing no statistically significant effect at all." This is a common and annoying feature in academic debates, using failure to reject a null hypothesis as evidence the null hypothesis is true. In the most extreme form, one person will publish a result, and another person will claim to have refuted it with a study with a smaller sample size and sloppier work that fails to find a significant effect. Absence of evidence is not evidence of absence.
If many different model specifications agree on the direction of the effect, even when most are not individually statistically significant at the 5% level, that can give you more confidence in a result than a single specification with a very low p-value. Every directionally correct specification is evidence for the result, even if not very strong on its own.
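A toy version of that intuition, under the admittedly unrealistic assumption that the specifications are independent (they never are, since they're fit to overlapping data), is a simple sign test:

```python
from scipy.stats import binomtest

# Hypothetical: 9 of 10 model specifications estimate a positive effect,
# even though none individually clears p < 0.05. If there were truly no
# effect and the specifications were independent, each would come out
# positive with probability ~0.5, so 9-or-more-of-10 would be unlikely.
result = binomtest(k=9, n=10, p=0.5, alternative="greater")
print(f"P(>= 9 of 10 positive | no effect, independence) = {result.pvalue:.4f}")  # ~0.011
```

In practice you'd want to account for the correlation between specifications, which weakens this considerably, but the direction of the argument stands.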
Points 33 - 35 are more controversial than suggested in the post. A seminal paper from last year (https://economics.yale.edu/sites/default/files/2024-01/The%20Journal%20of%20Finance%20-%202023%20-%20KELLY%20-%20The%20Virtue%20of%20Complexity%20in%20Return%20Prediction%20%281%29.pdf), which has ignited a firestorm of controversy, claims that machine learning beats conventional parsimonious fitting even with small datasets, and that fitting all historical data exactly is a good thing.
The authors predict next month's stock market return using 12 monthly values of 15 indicators (stock market return, interest rates, market price-earnings ratio, etc.). That's 300 variables from which they fit 12,000 parameters by generating random Fourier features. Of course this allows them to fit the last 12 months' stock market returns perfectly. They also find it predicts next month's return better than conventional models.
This defies conventional wisdom, in which modelers try to focus on the signal and ignore the noise, selecting only a few key indicators with strong statistical evidence and plausible causal links and combining them only in simple models. Using too many variables and parameters in an overly complex model is supposed to lead to "overfitting": models that explain the past perfectly and the future not at all.
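For anyone curious what the random-Fourier-features recipe looks like mechanically, here's a stripped-down sketch on synthetic data (the dimensions and the pure-noise target are made up; this is not the paper's data or its exact estimator):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: T observations of d standardized predictors, target is pure noise.
T, d, n_features = 120, 300, 12_000
X = rng.normal(size=(T, d))
y = rng.normal(size=T)

# Random Fourier features: z(x) = sqrt(2/P) * cos(W'x + b), with random W and b.
W = rng.normal(size=(d, n_features))
b = rng.uniform(0, 2 * np.pi, size=n_features)
Z = np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Ridge regression in the overparameterized regime (P >> T), solved in its dual
# form so we only invert a T x T matrix. As the penalty shrinks toward zero this
# approaches the minimum-norm fit that interpolates the training data.
lam = 1e-6
alpha = np.linalg.solve(Z @ Z.T + lam * np.eye(T), y)
beta = Z.T @ alpha  # implied weights on the 12,000 random features

r2 = 1 - np.var(y - Z @ beta) / np.var(y)
print("in-sample R^2:", round(r2, 4))  # ~1.0: the past is fit essentially perfectly
```

Whether that kind of fit carries any signal out of sample is, of course, exactly what the controversy is about.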
"Of course this allows them to fit the last 12 months' stock market returns perfectly. They also find it predicts next month's return better than conventional models."
That paper is 1.5 years old. How has their model done at predicting stock returns over that time?
I'm going to guess: not as well.
It did fine through June 2025, I haven't seen the July 2025 results yet. But it hasn't done well enough to satisfy critics.
https://www.aqr.com/Insights/Research/Working-Paper/Understanding-The-Virtue-of-Complexity
However, the main debate is not whether the model works, but whether it is merely a computationally expensive way to mimic a simple momentum model.
I am also curious about the double descent phenomenon we are now seeing in a lot of different domains (fit to withheld data improves when the number of parameters in the model grows beyond what is necessary to fit the training data perfectly). I would be curious to hear if it is also seen in election modeling. I've read arguments that double descent should be found only when the data has little real sampling noise in it and there is a long tail of infrequent parameter combinations, which doesn't seem like a great description of polling data...
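For anyone who hasn't seen the phenomenon, here's a minimal synthetic demo (random ReLU features plus a minimum-norm least-squares fit; every number is made up and it has nothing to do with election data), where the test error typically spikes near the interpolation threshold and then falls again as the model grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy linear "teacher", random ReLU features, minimum-norm least squares.
d, n_train, n_test, noise = 20, 100, 2000, 0.1
w_star = rng.normal(size=d) / np.sqrt(d)
X_tr, X_te = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
y_tr = X_tr @ w_star + noise * rng.normal(size=n_train)
y_te = X_te @ w_star + noise * rng.normal(size=n_test)

def test_mse(P):
    V = rng.normal(size=(d, P)) / np.sqrt(d)                           # random projection
    phi_tr, phi_te = np.maximum(X_tr @ V, 0), np.maximum(X_te @ V, 0)  # ReLU features
    beta, *_ = np.linalg.lstsq(phi_tr, y_tr, rcond=None)               # minimum-norm solution
    return np.mean((phi_te @ beta - y_te) ** 2)

for P in [10, 50, 90, 100, 110, 200, 500, 2000, 10000]:
    print(f"P = {P:5d}   test MSE = {test_mse(P):.3f}")
# Typically: error falls, spikes around P ~ n_train = 100, then falls again.
```

Whether anything like that shows up with a few hundred noisy election results is a different question, which I suppose is your point.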
I've never seen double-descent or similar ideas in election predictions. I agree with your assessment. The techniques might be useful if you tried to predict election returns not from polls--which as you say seem to be the wrong kind of data--but from a large mass of objectively measured data like wages, prices, popular Netflix picks and performance of local sports teams.
I dimly recall a science fiction story, I think from the 1960s, about a future in which election prediction has improved to the point only one carefully selected person has to vote. He or she answers a few hours of seemingly random questions from the computer, after which the winner who would have won in a traditional election is announced. If we ever get to something like that world, it would probably use some kind of double-descent or kernel machine-learning routine.
I loved that short story! https://en.m.wikipedia.org/wiki/Franchise_(short_story)
Thanks for finding it for me (1950s, it turns out, not the 1960s I remembered, but I read it in the 1960s).
My favorite AI observation by Isaac Asimov is that in the 1930s he got tired of reading science fiction stories about computers that turn on their creators. He argued that anyone smart enough to build powerful AI computers and robots would certainly build basic morality into them at a base level that the AI could not override, however it reasoned. That became his famous Three Laws of Robotics.
Now today we see people smart enough to build powerful AI without even thinking about building in morality at a level beyond the AI's control. Self-driving cars are choosing whether to swerve to avoid a pedestrian at the risk of crashing a school bus with any moral weighing done by the AI software.
Arthur C. Clarke took a different view. (Asimov claimed he stormed out of the movie 2001: A Space Odyssey, saying, "He violated my three laws," and a friend replied, "So strike him down with lightning, Isaac.") Clarke thought the solution was manual controls--including a plug that could be pulled--that the AI did not know about or control. That safety feature has also been ignored.
What happens if you just turn off the power to the data centres that house the AIs? At the limit, you can just physically cut the cables. They are dependent on electricity to work.
This solution is explored in many science fiction stories. It has a few issues.
First, the huge data centers are used to train AI models. Some AI requires vast computing resources to run, but other AI can run on your phone, car or refrigerator.
Second, most people think intelligence is an emergent property. Human intelligence is thought to emerge from multiple brain systems that evolved for visual processing, fight-or-flight decisions, heart-rate setting and so on. It might be impossible to unplug all AI systems everywhere. Even if possible, it might mean reverting to the Stone Age until dumb replacement systems could be built.
Third, if AI controls the power grid, nuclear weapons, construction equipment, every car, medical implant, appliance, etc.--hears or reads every human communication--it could put up formidable resistance to being unplugged.
There's a basic risk management principle: think of potential disasters and ask what you would then wish you had done now. The idea is not that you can predict disasters; it's that preparing for things you can foresee--even if you think them utterly implausible--gives you the discipline to survive what actually happens.
In these science-fiction AI takeover scenarios, we would probably wish certain defenses and limits had been implemented in advance, and AI carefully screened from knowledge about and control of them.
Please create a Silver Bulletin podcast feed!
Too “left” for the centrist, and too mealy mouthed for the liberals. Kamala didn’t run too far to the left (she campaigned with Liz Cheney lmao), but she did try to be everything to everyone. Voters want someone with convictions, that’s why you have people who voted Trump-AOC on the same ticket.
Are there many of those people? Is there data on that? Interested to know. I do agree that at least part of it is that voters like people who seem to speak their mind without caring what others think (which is not quite the same as having convictions; Trump doesn’t really have convictions other than in court).
I think Zohran is a good use case here to show how being likeable is ultimately more important than reaching the median voter, as there is no such thing as an actual ‘median voter’.
I also attached some Wiki election data for AOC’s congressional district, where she over-performed Harris by ~5%. This is all to say: if you base your campaign around reaching the mythical median voter, you’re dooming yourself. Voters aren’t looking for milquetoast candidates.
AOC actually underperformed a generic democrat running for the same seat. Go watch the timestamp 20:16 for the video for this post.
While I personally strongly prefer moderate candidates and will usually reward the most moderate candidate with my vote, is it possible candidates that are successful with moderate positions are just more skilled politicians? Seems like in many districts, being extreme is just lazy politics where the R or D will win by default. I hope I’m wrong.
The River is full of its own brand of bullshit. Pick your poison.
Thinking about the observation that candidates who break with their party more often overperform electorally. I wonder to what extent this is picking up on "breaking with the party" as a causal strength of electoral performance, and to what extent it's picking up on, like, a selection effect for who's in office to have votes to look at in the first place. In other words, could there be a sort of Simpson's paradox thing happening, where the apparent correlation in one direction actually masks a causal relationship in the opposite direction because there are endogenous confounding variables correlated with both?
To be clear, I do find it plausible that breaking with the party on votes might be a strength, but imagine that it (or rather, heterodox positions that lead to heterodox votes) is a liability, and therefore candidates with those positions tend to lose primaries or the general election. But some candidates who are exceptionally good campaigners (or whatever) manage to win despite that. Those candidates are overperforming for other reasons, and because less skilled candidates with similar voting tendencies have lost, we don't have their voting records in the data. Therefore when we look at the group of candidates who break with their party, we're systematically looking at unusually talented politicians, who overperform _despite_ their heterodox voting tendencies.
For those not familiar with Simpson's paradox, here are two classic examples (with a small numeric sketch after them):
* A study of Berkeley graduate program admissions in 1973, which found that women were being admitted to graduate programs at lower rates than men overall, despite having slightly higher admission rates in most individual programs. This appears paradoxical until you notice that overall acceptance rates differed widely across programs, and the overall acceptance rate for women was pulled down by the fact that applicant pools for more competitive programs (which tended to be in the humanities) had higher proportions of women.
* In a study of Florida death penalty sentencing for defendants convicted of homicide in the 1980s, white defendants overall got the death penalty slightly more often than black defendants, but if you broke the data down by race of the victim, black defendants got the death penalty significantly more often _both_ in cases involving a white victim and in cases involving a black victim. This appears paradoxical until you notice that cases with a white victim were much more likely overall to result in the death penalty than cases with a black victim, and the defendant was more likely to be the same race as the victim, which meant white defendants were disproportionately represented in cases with white victims, pulling up their overall death penalty rate.
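Here's the Simpson's paradox structure in miniature, with made-up admissions numbers (not the actual Berkeley or Florida data), just to show how the reversal happens:

```python
# Made-up numbers: women have the higher admission rate within each department,
# yet the lower rate overall, because far more women apply to the harder department.
#                          (applicants, admitted)
data = {
    "Dept A (easy)": {"men": (800, 500), "women": (100, 70)},
    "Dept B (hard)": {"men": (200, 20),  "women": (900, 100)},
}

totals = {"men": [0, 0], "women": [0, 0]}
for dept, groups in data.items():
    for sex, (apps, adm) in groups.items():
        totals[sex][0] += apps
        totals[sex][1] += adm
        print(f"{dept:13s}  {sex:6s} rate = {adm / apps:.1%}")

for sex, (apps, adm) in totals.items():
    print(f"Overall        {sex:6s} rate = {adm / apps:.1%}")
```

Within each department women do better, but the aggregate flips because the two groups are distributed very differently across departments.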
Interesting debate. I do wonder, though, about survivorship bias with regard to the primaries. If primaries favor extreme candidates, as you say, the moderate candidates who survive a primary must be better candidates for other reasons. So it makes sense they would do better in the general. But a randomly chosen moderate may not be any better than a randomly chosen extremist.
Thank you for the glorious take down. Couldn't have happened to a nicer person!
How confident are you that you have the causality flowing the right way? Could it be that politicians who overperform electorally tend to subsequently buck their party more, vs politicians who buck their party going on to overperform electorally?
Wow. An hour video (thankfully with transcript) and a shotgun blast of random definitions, facts, factoids, and footnotes.
A couple of minor points.
1)
Incumbents may become more moderate in office because that is what is necessary to get legislation passed. Bernie is consistent and has had almost no actual impact on bills that have become law. AOC is learning how to play nice during recess and share toys, and in her short time in the House she has had more substantial legislative impact than Bernie in his entire career.
2)
The number of tokens (footnote 7) is not the big deal. If it were, then ChatGPT would only be capable of spitting out the empirically most probable next token, which is pretty boring (like a next-letter predictor for English that always suggests 'e').
The important thing in LLMs is that they are trained and model associations based on combinations of those tokens. So given a sequence of tokens and a small subset of the trillions-factorial permutations of tokens in the training set, it predicts the most likely next token in the sequence. You think a trillion tokens are a lot? Wrap your brain around a trillion factorial. There is a reason it is a Large Language Model, not a Large Token Model.
In practice, this would make your point even stronger. A small collection of relatively discrete election results is a poor candidate for LLM style methods.
Unfortunately for your point, even though there is an 'LM' in glmnet, it isn't an LLM. The 'glm' is "Generalized Linear Model". Calling it an ML tool is academically accurate, but confusing for people who think ML == AI == LLM == ChatGPT.
It sure looks like a reasonable package to analyze this kind of data, although the devil is in the details.
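For anyone more at home in Python than R, a rough analogue of the kind of fit glmnet performs (an elastic-net-penalized GLM) looks something like the sketch below; the data is synthetic and the penalty settings are arbitrary, just to show the shape of the tool rather than anything about the election analysis:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: 3 informative predictors out of 30, binary outcome.
n, p = 500, 30
X = rng.normal(size=(n, p))
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# l1_ratio mixes lasso (1.0) and ridge (0.0) penalties, roughly like glmnet's
# alpha; C is an inverse overall penalty strength, roughly like 1/lambda.
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=0.5, max_iter=5000)
model.fit(X, y)

# The penalty shrinks most of the noise coefficients toward (often exactly) zero.
kept = np.flatnonzero(model.coef_[0])
print("nonzero coefficient indices:", kept)
print("their values:", np.round(model.coef_[0][kept], 2))
```

Nothing about the devil-in-the-details point changes, but it's useful to see that the machinery is penalized regression, not anything LLM-like.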
To your first point, I was also wondering: if there does turn out to be a residual benefit to moderation, could that be in part because of institutional support from the party? This might seem to be at odds with a finding that candidates who buck the party line more often perform better, but those candidates also tend to be in more competitive seats, which has got to be correlated with how much energy the party puts into helping them get and keep their seats.
A subtler mathematical point which is related to this is that if you model vote margin (or vote share) using a _linear_ model rather than a sigmoid model, you're very likely to see greater elasticity near the middle: when dealing with percentages, especially percentages under a structural symmetry (i.e., two competing categories each near 50%, rather than, say, shares of some small category within a much larger pool) it's almost always the case that it takes a smaller nudge to move the needle when it's near the middle than when it's at an extreme. In a district where one party is heavily favored to start with, a stronger candidate has fewer "persuadables" to bring over, and similarly, a weaker candidate is more likely to lose votes to third parties than to the opposite major party (which is half as big of an impact on two party vote share). This is one reason (among several) why proportion data is typically modeled using logistic regression or similar rather than linear regression, because the model distinguishes the impact of a predictor (in terms of its coefficient size) from the result that impact has on the surface outcome, via the sigmoid link function. Seems technical and in the weeds, but it might really matter in this case.
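A quick numeric illustration of that elasticity point (the numbers are made up; the same 0.2 nudge on the logit scale is applied at different baseline vote shares):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

shift = 0.2  # the same "candidate quality" nudge, on the logit scale
for base_share in [0.50, 0.65, 0.80, 0.95]:
    z = np.log(base_share / (1 - base_share))   # logit of the baseline share
    new_share = sigmoid(z + shift)
    print(f"baseline {base_share:.0%} -> {new_share:.1%}"
          f"  (gain of {100 * (new_share - base_share):.1f} pts)")
```

The identical underlying effect shows up as roughly a five-point swing in a 50/50 district but less than a one-point swing in a 95/5 district, which is exactly the pattern a linear model on the margin scale can't express.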
Interesting point about close districts requiring more support. Of course that extends to collecting donations from the public also, and there is generally more consistent funding available from the center than the extremes.
Thanks for bringing up the linear vs sigmoid item also. I hadn't put my finger on it specifically, but that kind of thing was in the back of my mind for my "devil in the details" comment. Glmnet has some support for logistic models, but what was actually done matters.
I could have read the underlying analysis but "model wars" are most interesting seen from afar unless they are directly relevant to one's research.
"Conversely, centrist elites tend to be “fiscally conservative but socially liberal.” But I’m not sure you’d win a lot of elections by, say, pledging to substantially increase immigration levels while cutting social spending to lower deficits."
This depends on which issues are used to map people onto the 2D economic/social continuum. If instead of immigration and social spending, you used abortion rights and taxes, then you would find a lot of people in the "abortion should be legal and taxes should be lower" quadrant.
As thorough as humanly possible...
Great conversation!
Calling glmnet “machine learning”… I was at risk of an aneurysm until your final point there.
Anyway, my theory on how social media ruined the perception of academia is that it disproved, by counterexample, Plato’s view that cultivating intellectual excellence will entail improvement of the other virtues. Bonica and Grumbach’s behavior in this instance is a case in point.
Eh, I don't think it's so bad to label glmnet "machine learning". It's only been in the last 7-8 years or so that ML has come to be dominated by neural network models, and many foundational ML methods were developed in parallel across some combination of the statistics, computer science, and cognitive science communities. The original neural network (the "perceptron") was just logistic regression by another formulation, etc. I don't really think there's ever been much of a principled distinction between what's "statistics" and what's "machine learning", and the choice of terminology tends to say more about the background of the practitioner than it does about the methodology.