It’s time to stop being polite. Here are 36 takes on how moderation does or doesn't help candidates win — and why I trust outsiders to build election models more than academics.
In 3 1/2 years time it’ll be: “did republicans shift too far to the right.” There’ll be hot takes on how the GOP brand is toast. I think we’re in an electoral period like 1884-1900 and we’re in for a few more swings yet.
I agree with you, and the pendulum has already swung back and forth a few times.
"The GOP brand is toast" is exactly what people were saying in mid-2016 with all of the "Death of the Republican Party" think pieces saying how the nomination of someone as obviously unelectable as Donald Trump would doom the GOP to being a "regional party" for decades.
I'd elaborate on point 28, the comment, "with most showing no statistically significant effect at all." This is a common and annoying feature in academic debates, using failure to reject a null hypothesis as evidence the null hypothesis is true. In the most extreme form, one person will publish a result, and another person will claim to have refuted it with a study with a smaller sample size and sloppier work that fails to find a significant effect. Absence of evidence is not evidence of absence.
If many different model specifications agree on the direction of the effect, even if most are not individually statistically significant at the 5% level, that can give you more confidence in a result than a single specification with a very low p-value. Every directionally correct specification is evidence for the result, even if not very strong on its own.
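One rough way to put a number on that intuition is a sign test: if there were no effect, each specification's estimated direction would be a coin flip. The sketch below assumes the specifications are independent, which is rarely true when they share the same data, so treat the result as an optimistic bound. The counts (10 specifications, 9 agreeing) are hypothetical.

```python
from math import comb

def sign_test_p_value(n_specs: int, n_same_direction: int) -> float:
    """One-sided probability of at least n_same_direction agreeing signs
    out of n_specs, if each sign were an independent fair coin flip."""
    favorable = sum(comb(n_specs, k) for k in range(n_same_direction, n_specs + 1))
    return favorable / 2 ** n_specs

# Hypothetical example: 10 specifications, 9 point the same way.
p = sign_test_p_value(10, 9)
print(f"{p:.4f}")  # 11/1024, a little over 1%
```

None of the ten specifications needs to clear the 5% bar individually for the pattern as a whole to be unlikely under the null.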
Points 33 - 35 are more controversial than suggested in the post. A seminal paper (https://economics.yale.edu/sites/default/files/2024-01/The%20Journal%20of%20Finance%20-%202023%20-%20KELLY%20-%20The%20Virtue%20of%20Complexity%20in%20Return%20Prediction%20%281%29.pdf) last year that has ignited a firestorm of controversy claims that machine learning beats conventional parsimonious fitting even with small datasets, and that fitting all historical data exactly is a good thing.
The authors predict next month's stock market return using 12 monthly values of 15 indicators (stock market return, interest rates, market price-earnings ratio, etc.). That's 180 input values (12 × 15) from which they fit 12,000 parameters by generating random Fourier features. Of course this allows them to fit the last 12 months' stock market returns perfectly. They also find it predicts next month's return better than conventional models.
This defies conventional wisdom in which modelers try to focus on a signal and ignore the noise, selecting only a few key indicators with strong statistical evidence and plausible causal links, and combining them only in simple models. Using too many variables and parameters in an overly complex model is supposed to lead to "overfitting," models that explain the past perfectly and the future not at all.
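To see why the perfect in-sample fit is unsurprising, here is a minimal sketch of random Fourier features with a minimum-norm least-squares fit. It is not the paper's code: the toy sizes, the bandwidth of 0.5, the random seed, and the choice of a minimum-norm estimator are all assumptions made for illustration, and the "returns" are pure noise.

```python
import math
import random

def rff(x, omegas, phases):
    """Map an input vector to random Fourier features cos(w.x + b)."""
    scale = math.sqrt(2.0 / len(omegas))
    return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(omegas, phases)]

def solve(a, b):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

rng = random.Random(0)
n_obs, n_inputs, n_features = 12, 15, 2000  # toy sizes, smaller than the paper's

X = [[rng.gauss(0, 1) for _ in range(n_inputs)] for _ in range(n_obs)]
y = [rng.gauss(0, 1) for _ in range(n_obs)]  # pure noise standing in for returns

omegas = [[rng.gauss(0, 0.5) for _ in range(n_inputs)] for _ in range(n_features)]
phases = [rng.uniform(0, 2 * math.pi) for _ in range(n_features)]
Z = [rff(x, omegas, phases) for x in X]

# Minimum-norm interpolator: beta = Z^T (Z Z^T)^{-1} y
gram = [[sum(a * b for a, b in zip(zi, zj)) for zj in Z] for zi in Z]
alpha = solve(gram, y)
beta = [sum(alpha[i] * Z[i][j] for i in range(n_obs)) for j in range(n_features)]

fitted = [sum(b * z for b, z in zip(beta, zi)) for zi in Z]
max_error = max(abs(f - t) for f, t in zip(fitted, y))
print(max_error)  # tiny: 2,000 parameters can reproduce 12 noise values
```

With far more features than observations, the fit reproduces even random targets, so the in-sample fit says nothing by itself. The contested claim is about out-of-sample performance, which this sketch does not address.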
"Of course this allows them to fit the last 12 months' stock market returns perfectly. They also find it predicts next month's return better than conventional models."
That paper is 1.5 years old. How has their model done at predicting stock returns over that time?
I'm going to guess: not as well.
It did fine through June 2025; I haven't seen the July 2025 results yet. But it hasn't done well enough to satisfy critics.
https://www.aqr.com/Insights/Research/Working-Paper/Understanding-The-Virtue-of-Complexity
However, the main debate is not whether the model works, but whether it is merely a computationally expensive way to mimic a simple momentum model.
I am also curious about the double descent phenomenon we are now seeing in a lot of different domains (fit to withheld data improves when the number of parameters in the model grows beyond what is necessary to fit the data perfectly). I would be curious to hear if it is also seen in election modeling. I've read arguments that double descent should be found only when the data has little real sampling noise in it and there is a long tail of infrequent parameter combinations, which doesn't seem like a great description of the polling data...
I've never seen double-descent or similar ideas in election predictions. I agree with your assessment. The techniques might be useful if you tried to predict election returns not from polls--which as you say seem to be the wrong kind of data--but from a large mass of objectively measured data like wages, prices, popular Netflix picks and performance of local sports teams.
I dimly recall a science fiction story, I think from the 1960s, about a future in which election prediction has improved to the point only one carefully selected person has to vote. He or she answers a few hours of seemingly random questions from the computer, after which the winner who would have won in a traditional election is announced. If we ever get to something like that world, it would probably use some kind of double-descent or kernel machine-learning routine.
I loved that short story! https://en.m.wikipedia.org/wiki/Franchise_(short_story)
Thanks for finding it for me (1950s, it turns out, not the 1960s I remembered, but I read it in the 1960s).
My favorite AI observation by Isaac Asimov is that in the 1930s he got tired of reading science fiction stories about computers that turn on their creators. He argued that anyone smart enough to build powerful AI computers and robots would certainly build basic morality into them at a base level that the AI could not override however it reasoned. That became his famous three laws of robotics.
Now today we see people smart enough to build powerful AI without even thinking about building in morality at a level beyond the AI's control. Self-driving cars are choosing whether to swerve to avoid a pedestrian at the risk of crashing a school bus with any moral weighing done by the AI software.
Arthur C. Clarke took a different view (Asimov claimed he stormed out of the movie 2001: A Space Odyssey, saying, "He violated my three laws," and a friend replied, "So strike him down with lightning, Isaac.") Clarke thought the solution was manual controls--including a plug that could be pulled--that the AI did not know about or control. That safety feature has also been ignored.
What happens if you just turn off the power to the data centres that house the AIs? At the limit, you can just physically cut off the cables. They are dependent on electricity to work.
This solution is explored in many science fiction stories. It has a few issues.
First, the huge data centers are used to train AI models. Some AI requires vast computing resources to run, but other AI can run on your phone, car or refrigerator.
Second, most people think intelligence is an emergent property. Human intelligence is thought to emerge from multiple brain systems that evolved for visual processing, fight-or-flight decisions, heart-rate setting and so on. It might be impossible to unplug all AI systems everywhere. Even if possible, it might mean reverting to the Stone Age until dumb replacement systems could be built.
Third, if AI controls the power grid, nuclear weapons, construction equipment, every car, medical implant, appliance, etc.--hears or reads every human communication--it could put up formidable resistance to being unplugged.
There's a basic risk management principle: think of potential disasters and ask what you would then wish you had done now. The idea is not that you can predict disasters; it's that preparing for things you can foresee--even if you think them utterly implausible--gives you the discipline to survive what actually happens.
In these science-fiction AI takeover scenarios, we would probably wish certain defenses and limits had been implemented in advance, and AI carefully screened from knowledge about and control of them.
While I personally strongly prefer moderate candidates and will usually reward the most moderate candidate with my vote, is it possible candidates that are successful with moderate positions are just more skilled politicians? Seems like in many districts, being extreme is just lazy politics where the R or D will win by default. I hope I’m wrong.
Please create a Silver Bulletin podcast feed!
Thinking about the observation that candidates who break with their party more often overperform electorally. I wonder to what extent this is picking up on "breaking with the party" as a causal strength of electoral performance, and to what extent it's picking up on like, a selection effect for who's in office to have votes to look at in the first place. In other words, could there be a sort of Simpson's paradox thing happening where the apparent correlation in one direction can actually mask a causal relationship in the opposite direction because there are endogenous confounding variables correlated with both?
To be clear, I do find it plausible that breaking with the party on votes might be a strength, but imagine that it (or rather, heterodox positions that lead to heterodox votes) is a liability, and therefore candidates with those positions tend to lose primaries or the general election. But some candidates who are exceptionally good campaigners (or whatever) manage to win despite that. Those candidates are overperforming for other reasons, and because less skilled candidates with similar voting tendencies have lost, we don't have their voting records in the data. Therefore when we look at the group of candidates who break with their party, we're systematically looking at unusually talented politicians, who overperform _despite_ their heterodox voting tendencies.
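The selection story above can be sketched as a small simulation. Everything in it is hypothetical: the "skill" variable, the size of the penalty for heterodoxy, and the noise are made-up parameters chosen only to show the mechanism, not estimates from any election data.

```python
import random

rng = random.Random(1)
winners = []
for _ in range(20000):
    skill = rng.gauss(0, 1)               # hypothetical campaigning talent
    heterodox = rng.random() < 0.5        # breaks with the party or not
    penalty = 1.0 if heterodox else 0.0   # assumed: heterodoxy *hurts* the odds of taking office
    wins_office = skill - penalty + rng.gauss(0, 1) > 0
    if wins_office:
        winners.append((heterodox, skill))  # only winners leave a voting record

def mean_skill(flag):
    """Average skill among officeholders with the given heterodox flag."""
    values = [s for h, s in winners if h == flag]
    return sum(values) / len(values)

print("heterodox officeholders:", round(mean_skill(True), 2))
print("orthodox officeholders: ", round(mean_skill(False), 2))
```

Even though heterodoxy is a pure liability in this toy world, the heterodox candidates who make it into office are noticeably more skilled on average, because they had to clear a higher bar. Any later measure that rewards skill would make them look like overperformers.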
For those not familiar with Simpson's paradox, here are two classic examples:
* A study of Berkeley graduate program admissions in, I think, the 1970s, which found that women were being admitted to graduate programs at lower rates than men overall, despite having slightly higher admission rates in most individual programs. Which appears paradoxical, until you notice that overall acceptance rates differed widely across programs, and the overall acceptance rate for women was pulled down due to the fact that applicant pools to more competitive programs (which tended to be in the humanities) had higher proportions of women.
* In a study of Florida death penalty sentencing for defendants convicted of homicide in the 1980s, white defendants overall got the death penalty slightly more often than black defendants, but if you broke the data down by race of the victim, black defendants got the death penalty significantly more often _both_ in cases involving a white victim and in cases involving a black victim. Which appears paradoxical, until you notice that cases with a white victim were much more likely overall to result in the death penalty than cases with a black victim, and the defendant was more likely to be the same race as the victim, which meant white defendants were disproportionately represented in cases with white victims, pulling up their overall death penalty rate.
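The arithmetic behind the first example can be reproduced with made-up numbers. The counts below are hypothetical, chosen only to make the reversal easy to see; they are not the Berkeley data.

```python
# Hypothetical (applied, admitted) counts; not the Berkeley data.
applications = {
    "less competitive program": {"men": (800, 480), "women": (200, 130)},
    "more competitive program": {"men": (200, 40), "women": (800, 200)},
}

def rate(applied, admitted):
    return admitted / applied

# Within each program, women are admitted at a higher rate.
for program, groups in applications.items():
    print(program,
          "men:", rate(*groups["men"]),
          "women:", rate(*groups["women"]))

# Pooled across programs, the comparison flips.
overall = {}
for sex in ("men", "women"):
    applied = sum(groups[sex][0] for groups in applications.values())
    admitted = sum(groups[sex][1] for groups in applications.values())
    overall[sex] = rate(applied, admitted)

print("overall", overall)
```

Women lead 65% to 60% in the first program and 25% to 20% in the second, yet trail 33% to 52% overall, because most of their applications go to the program that rejects most applicants.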
The River is full of its own brand of bullshit. Pick your poison.
Interesting debate. I do wonder though about survivorship bias with regard to the primaries. If primaries favor extreme candidates, as you say, the moderate candidates that survive a primary must be better candidates for other reasons. So it makes sense they would do better in the general. But a randomly chosen moderate may not be any better than a randomly chosen extremist.
Thank you for the glorious take down. Couldn't have happened to a nicer person!
This argument is silly. Most House races are structured so one party wins no matter what and then the contest is over who's more pure. If you look at all House races you’ll see that progressives do as well as moderates because all they have to win is the primary where 10-20% of the voters vote. Once the party candidate is chosen they are virtually assured victory.
Look at governors and to a lesser extent senators to see where moderates do better because we haven’t yet figured out how to gerrymander states.
How confident are you that you have the causality flowing the right way? Could it be that politicians who overperform electorally tend to subsequently buck their party more, vs politicians who buck their party going on to overperform electorally?
Wow. An hour video (thankfully with transcript) and a shotgun blast of random definitions, facts, factoids, and footnotes.
A couple of minor points.
1)
Incumbents may become more moderate in office because that is what is necessary to get legislation to pass. Bernie is consistent, and has had almost no actual impact on bills that have become law. AOC is learning how to play nice during recess and share toys and in her short time in the House has more substantial legislative impact than Bernie in his entire career.
2)
The number of tokens (footnote 7) is not the big deal. If it were, then ChatGPT would only be capable of some raw probability of the empirical best next token, which is pretty boring (like a next letter predictor for English that always suggests 'e').
The important thing in LLMs is that they are trained and model associations based on combinations of those tokens. So given a sequence of tokens and a small subset of the trillions-factorial permutations of tokens in the training set, it predicts the most likely next token in the sequence. You think a trillion tokens are a lot? Wrap your brain around a trillion factorial. There is a reason it is a Large Language Model, not a Large Token Model.
In practice, this would make your point even stronger. A small collection of relatively discrete election results is a poor candidate for LLM style methods.
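The difference between counting tokens and modeling their combinations shows up even in a toy letter predictor. This is only an illustration with a made-up sample sentence; it conditions on a single preceding letter and has nothing like the context length or architecture of a real LLM.

```python
from collections import Counter, defaultdict

# Made-up sample text for illustration.
text = "the queen quietly questioned the quiet sheep"
letters = [ch for ch in text if ch != " "]

# Context-free predictor: always guess the single most frequent letter.
unigram_guess = Counter(letters).most_common(1)[0][0]

# Context-aware predictor: count which letter follows each letter within a word.
following = defaultdict(Counter)
for word in text.split():
    for prev, nxt in zip(word, word[1:]):
        following[prev][nxt] += 1

def predict_after(prev_letter):
    """Most frequent follower of prev_letter in the sample text."""
    return following[prev_letter].most_common(1)[0][0]

print("ignoring context:", unigram_guess)      # 'e'
print("after 'q':", predict_after("q"))        # 'u'
```

The context-free predictor says 'e' every time, which is the boring case described above. Conditioning on even one previous letter already changes the answer.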
Unfortunately for your point, even though there is a 'LM' in glmnet, it isn't an LLM. The 'glm' is "Generalized Linear Model". Calling it an ML tool is academically accurate, but confusing for people who think ML == AI == LLM == ChatGPT.
It sure looks like a reasonable package to analyze this kind of data, although the devil is in the details.
To your first point, I was also wondering, if there does turn out to be a residual benefit to moderation, could that be in part because of institutional support from the party? This might seem to be at odds with a finding that candidates who buck the party line more often perform better, but these candidates also tend to be in more competitive seats, which has got to be correlated with how much energy the party puts into helping them get and keep their seats.
A subtler mathematical point which is related to this is that if you model vote margin (or vote share) using a _linear_ model rather than a sigmoid model, you're very likely to see greater elasticity near the middle: when dealing with percentages, especially percentages under a structural symmetry (i.e., two competing categories each near 50%, rather than, say, shares of some small category within a much larger pool) it's almost always the case that it takes a smaller nudge to move the needle when it's near the middle than when it's at an extreme. In a district where one party is heavily favored to start with, a stronger candidate has fewer "persuadables" to bring over, and similarly, a weaker candidate is more likely to lose votes to third parties than to the opposite major party (which is half as big of an impact on two party vote share). This is one reason (among several) why proportion data is typically modeled using logistic regression or similar rather than linear regression, because the model distinguishes the impact of a predictor (in terms of its coefficient size) from the result that impact has on the surface outcome, via the sigmoid link function. Seems technical and in the weeds, but it might really matter in this case.
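The point about elasticity can be made concrete with the logistic function. In this sketch a candidate-quality "boost" of 0.2 on the logit scale is a hypothetical number, applied identically in three districts with different baseline vote shares.

```python
import math

def vote_share(logit):
    """Logistic (sigmoid) link: map a logit to a share between 0 and 1."""
    return 1 / (1 + math.exp(-logit))

def share_gain(baseline_share, boost=0.2):
    """Change in vote share from the same logit-scale boost at a given baseline."""
    logit = math.log(baseline_share / (1 - baseline_share))
    return vote_share(logit + boost) - baseline_share

for base in (0.50, 0.70, 0.90):
    print(f"baseline {base:.0%}: gain of {share_gain(base) * 100:.2f} points")
```

The same underlying boost moves the vote share by about 5 points in a 50-50 district, about 4 points at a 70% baseline, and under 2 points at 90%. A linear model fit to vote share would read that pattern as candidates "mattering more" in competitive seats, even though the effect on the logit scale is identical everywhere.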
Interesting point about close districts requiring more support. Of course that extends to collecting donations from the public also, and there is generally more consistent funding available from the center than the extremes.
Thanks for bringing up the linear vs sigmoid item also. I hadn't put my finger on it specifically but that kind of thing was in the back of my mind for my "devil in the details" comment. Glmnet has some support for logistic models, but what was actually done matters.
I could have read the underlying analysis but "model wars" are most interesting seen from afar unless they are directly relevant to one's research.
It would be great if the audio of these could be posted somewhere.
"Conversely, centrist elites tend to be “fiscally conservative but socially liberal.” But I’m not sure you’d win a lot of elections by, say, pledging to substantially increase immigration levels while cutting social spending to lower deficits."
This depends on which issues are used to map people onto the 2D economic/social continuum. If instead of immigration and social spending, you used abortion rights and taxes, then you would find a lot of people in the "abortion should be legal and taxes should be lower" quadrant.
As thorough as humanly possible...
Great conversation!
Calling glmnet “machine learning”… I was at risk of an aneurysm until your final point there.
Anyway, my theory on how social media ruined perception of academia is that it disproved by counter example Plato’s view, namely that cultivating intellectual excellence will entail improvement of other virtues. Bonica and Grumbach’s behavior in this instance is a case in point.
Eh, I don't think it's so bad to label a glmnet "machine learning". It's only been in the last 7-8 years or so that ML has come to be dominated by neural network models, and many foundational ML methods were developed multiple times in parallel in some combination of statistics, computer science, and cognitive science communities. The original neural network (the "perceptron") was just logistic regression by another formulation, etc. I don't really think there's ever been much of a principled distinction between what's "statistics" and what's "machine learning", and the choice of terminology tends to say more about the background of the practitioner than it does about the methodology.