The Hall of Fame is about more than WAR

A call for a pluralistic approach to Cooperstown.

Jan 06, 2025

In this interim post-election period — I say “interim” because once I get back from Asia next week, it’ll finally be time to focus more on developing our long-term plans — I’ve felt it’s been valuable to take mostly deeper dives. This is another of those: the first of a two-part series on the Baseball Hall of Fame. Despite being a lifelong baseball fan — and a former professional baseball analyst and writer — I’ve never taken a truly deep plunge into Cooperstown, so this is an exclusive feature for Silver Bulletin readers. Part I (this) is free, while Part II, analyzing all 28 players on the ballot, will be for paid subscribers.

At Baseball Prospectus from 2003 to 2008, I was part of the Moneyball wave of analysts when a more stats-centric, empirical approach began to dominate front offices. The Boston Red Sox’ World Championships in 2004 and 2007 — their first since 1918, with an assist from sabermetrics GOAT Bill James — ensured that there was no turning back. Now, every team employs statistical analysts, sometimes a dozen or more — with Baseball Prospectus alumni from my tenure running franchises and guiding them to World Series titles.

This is not an argument about the overall impact that analytics have had on the sport. That’s too much of a rabbit hole, not least because it’s part of a broader trend where everything has become more optimized and algorithmic. Instead, my complaint is more specific. Discussions about the Baseball Hall of Fame, which will announce its new class of inductees later this month, have become too driven by one statistic: Wins Above Replacement (WAR).

WAR is a tricky beast, essentially a blend between a “rate statistic” like batting average — how effective was the player on a per-at-bat or per-game basis? — and a “counting statistic” like home runs. It squares the circle with the somewhat abstract notion of a “replacement level player.” Basically, it measures how effective a player was relative to some baseline and for how long.

WAR's precision in measuring “effectiveness” is not beyond question, but stat geeks have been refining the approach for years.1 If there’s a dispute between WAR and the conventional wisdom about a player’s value, you should usually default to WAR, and the sabermetric movement should get credit for recognizing the value of overlooked greats like Bert Blyleven and Tim Raines, who now have a deserved place in Cooperstown.

“Relative to some baseline” — replacement level — introduces complications, though. Replacement level is derived from how much teams are willing to pay for free agents: a 0 WAR player can be had essentially for “free” (the league minimum salary). That context is highly useful when benchmarking player salaries, for instance. But it’s much less clear that WAR is a good measure of greatness, which presumably we should be concerned about for the Hall of Fame.

Let's take the example of two hypothetical players:

Behind Door #1, we have SuperMoyer. He's a leveled-up version of Jamie Moyer, who pitched in the majors until he was 49 but only once made an All-Star Team. SuperMoyer has incredible longevity and consistency. In fact, he pitches for 30 major league seasons, every year putting up precisely the same numbers: a 10-10 record, a 4.25 ERA, and 145 strikeouts in 180 innings pitched. Basically exactly league average.
And behind Door #2, we have Shohei Ohtani. OK, he's not hypothetical. I mean the actual Shohei Ohtani, who just became the only 50/50 player in baseball history and also (when his arm is healthy) is a really good pitcher. The modern day Babe Ruth. The MVP in 3 of his past 4 seasons. Except tomorrow, Ohtani decides to retire from the major leagues and head back to Japan.

Who do you want in your Hall of Fame? Well, SuperMoyer will be worth something like 2.5 WAR per season or 75 WAR over the course of his 30-year career, above the historical average for Hall of Fame pitchers. Ohtani has “only” 43.8 WAR so far, conversely. But I'd bet nearly all baseball fans would conclude that Ohtani has already done enough to merit a place in Cooperstown2 while few would want to walk past SuperMoyer’s plaque in the Hall. Or, to take a non-hypothetical example, Sandy Koufax has a lower career WAR than the actual Jamie Moyer, but it’s hard to imagine objecting to his presence in the Hall of Fame.

But my bigger gripe with relying so heavily on WAR is that it says right there in the Constitution that you shouldn't — err, sorry, not the Constitution but the Hall of Fame's voting rules:

5. Voting: Voting shall be based upon the player's record, playing ability, integrity, sportsmanship, character, and contributions to the team(s) on which the player played.

None of this is optional: voting for Cooperstown shall be based on all of these factors. And even if you cringe at the dreaded “character clause,” the Hall of Fame is asking you to consider a lot more than WAR. Let’s break this down:

“...the player's record…” You could treat “record” as concomitant with WAR — but I don’t think that’s really the whole intent. The Oxford English Dictionary defines “record” as: “the sum of the past achievements or actions of a person or organization.” So achievements like batting titles, Gold Gloves, statistical milestones (e.g., accumulating 3000 hits in your career or 50 home runs in a season), and World Series championships ought to be considered, along with WAR-type stats.
“...playing ability…” This is interesting because “ability” is implicitly contrasted with “record.” I see that as a call to account for a player’s talent level — perhaps evident in his peak performance as much as in his career statistics — as well as his distinctive characteristics. Was the player capable of doing things that are rare to see on a baseball diamond? Was he the best ever, or among the best, at a particular aspect of the sport?
“...integrity, sportsmanship, character…” This is the “character clause,” infamous since it is frequently cited in disputes about PEDs. With few exceptions — you could look at Roberto Clemente Award winners — it’s unavoidably subjective. Still, the Hall says that you have to at least consider this factor.
“...and contributions to the team(s) on which the player played.” WAR does measure contributions to a player’s team — but incompletely. It leaves a lot out. One huge category is postseason contributions. Although this is less true in baseball than, say, the NBA, the sport is increasingly a “ringz culture”: World Series titles or at least deep postseason runs count for a lot and — based on my own research in Baseball Between the Numbers — also contribute substantially to franchise values and revenues. (Indeed, the league has devalued the regular season by adding more and more playoff rounds.) This also has implications for how we evaluate the regular season: contributions above replacement level but below the league average may raise a team’s floor but can make it harder for them to reach the playoffs. I also don’t think there’s anything wrong with considering a player’s popularity with his fan base: beloved players sell jerseys and put butts in the seats. It’s called the Hall of Fame, and Cooperstown celebrates Major League Baseball, a commercial enterprise.

My Hall of Fame framework

After lots of trial and error, here’s how I decided to operationalize this. To be clear, I don’t want you to think of this as some “magic formula” so much as a way to organize my thought process. I'd argue this is true with nearly every sort of algorithm, actually: there's an interplay between your intuition on the one hand (maybe, for instance, Chase Utley “feels like” a Hall of Famer) and the necessity for internal consistency on the other hand. So I can use my gut, but I can’t be totally ad hoc. If I design a reasonable set of rules such that Utley makes the Hall, what implications does that have for other candidates?

I rated each candidate’s fitness for the Hall of Fame from 0 to 10 in eight categories, where 0 represents an average long-career major leaguer (say, SuperMoyer but for 10 or 15 seasons rather than 30), 5 represents a typical3 Hall of Famer, and 10 represents GOAT (Greatest of All Time) performance in the category. I give greater weight to the categories that are more easily objectively quantified — but nothing is strictly algorithmic. Even WAR scores differ from source to source and present particular complications for catchers and relief pitchers.

Career value added (3x multiplier). This is the most straightforward category. Indeed, I’m mainly going by career regular-season WAR, although with some mental adjustments for the aforementioned relief pitchers and catchers — and for Ichiro Suzuki, who played his first nine professional seasons in Japan. I also made some small adjustments based on clutch performance in both this category and the next one.

Peak value and value above average (3x multiplier). I’ll look at two main metrics here. First, how much WAR a player delivered in a concentrated period, defined more precisely as his 5 best years in any window of 7 consecutive seasons. I’m a huge admirer of Jay Jaffe (also a former Baseball Prospectus colleague) and his JAWS system for evaluating Hall of Fame candidates, which also includes both a peak and career component. (I highly recommend reading Jaffe’s extensive review of the ballot at FanGraphs.) JAWS’s peak score is based on a player’s seven best seasons by WAR without regard to the timing — they could be, say, a few years in the player’s early 20s and then some others in his mid-30s. This is my alternative spin on that: I’d argue that focusing on a concentrated period of time is slightly more revealing of a player’s peak talent level and more in line with the Hall’s decree to consider an athlete’s “playing ability.”
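For readers who want to tinker with the idea, here is a minimal Python sketch of the "5 best years in any window of 7 consecutive seasons" calculation, next to a JAWS-style "best seasons regardless of timing" peak for comparison. The function names are hypothetical, the sample career is invented, and the sketch assumes one WAR value per consecutive season (a missed season entered as 0.0). It is an illustration of the definition above, not the actual spreadsheet behind these ratings.

```python
from typing import Sequence


def peak_5_of_7(season_war: Sequence[float], window: int = 7, best: int = 5) -> float:
    """Largest sum of the `best` highest-WAR seasons inside any run of
    `window` consecutive seasons. Assumes seasons are in chronological order."""
    if not season_war:
        return 0.0
    top = float("-inf")
    # Slide the window across the career; a career shorter than the window
    # is treated as a single (shorter) window.
    for start in range(max(1, len(season_war) - window + 1)):
        chunk = season_war[start:start + window]
        top = max(top, sum(sorted(chunk, reverse=True)[:best]))
    return top


def best_n_seasons(season_war: Sequence[float], n: int = 7) -> float:
    """JAWS-style peak as described above: the n best seasons, timing ignored."""
    return sum(sorted(season_war, reverse=True)[:n])


# Invented career: three great seasons spread seven years apart.
scattered = [6.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
             6.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 6.0]
print(peak_5_of_7(scattered))     # 10.0 (no 7-year window holds two big seasons)
print(best_n_seasons(scattered))  # 22.0 (all three big seasons count)
```

The invented career shows why the two measures can disagree: great seasons that are far apart all count toward a timing-agnostic peak, but only one of them can land in any single seven-year window.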

The second metric in this bucket is how many wins above average a player contributed, ignoring any below-average seasons — I call this WAA+. I think this strikes a nice balance. It doesn’t punish a player for “sticking around too long” like, say, Felix Hernandez. But it also doesn’t cap him if he can contribute above-average value — the sort of value that contributes to pennant runs — over an extended period.
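As a rough sketch of WAA+ as defined here (wins above average, with below-average seasons ignored rather than subtracted), the calculation is just a filtered sum. The function name and the sample decline-phase career are hypothetical; the SuperMoyer line follows from the earlier premise that he is exactly league average every year.

```python
from typing import Sequence


def waa_plus(season_waa: Sequence[float]) -> float:
    """Sum of wins above average, counting only above-average seasons."""
    return sum(w for w in season_waa if w > 0)


# SuperMoyer: 30 seasons of exactly league-average pitching.
super_moyer = [0.0] * 30
print(waa_plus(super_moyer))  # 0 (nothing above average to count)

# Invented player who hangs on for two below-average seasons at the end.
hung_around = [3.0, 4.5, 2.0, -1.0, -1.5]
print(sum(hung_around))       # 7.0 (plain WAA total penalizes the decline)
print(waa_plus(hung_around))  # 9.5 (WAA+ ignores the decline years)
```

This is where SuperMoyer's case falls apart: 30 years of replacement-beating innings pile up career WAR, but they add nothing to a measure built on above-average value.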

This has been a lot of text — so let me pause here to give you some data you can pick through:

bWAR is Baseball-Reference’s version of career WAR;
fWAR is the FanGraphs version;
Peak 5/7 is what I just mentioned: a player’s 5 best WAR seasons in any consecutive 7-year window;
WAA+ is wins above average, ignoring negative seasons4;
Finally, WPA is win probability added (from FanGraphs), measured relative to a baseline where the average is 0. WPA has some advantages — namely, it considers clutch performance, which is useful, especially for relief pitchers who pitch mostly in high-leverage situations. However, it doesn’t account for defense or positional value.

The clutch scores can be difficult to tease out from a surface glance — and can be slightly counterintuitive5, so I wound up cross-referencing against another metric. Players who got some extra credit for clutch performances include Russell Martin, Bobby Abreu, and Omar Vizquel, while I deducted points from Troy Tulowitzki, Carlos Gonzalez, and Hanley Ramirez. In other cases, clutch performance served as a tiebreaker when I was undecided between two ratings.

Career landmarks and “eye test” (2x multiplier). Now, we get into three categories in the middle of the spectrum between objective and subjective. Bill James’s Hall of Fame Career Standards Test is specifically designed for this sort of thing, awarding points based on achieving various round-numbered milestones. But I also made an effort to consider contextual factors. It’s easier for a player to hit 500 home runs than it once was, but harder for a pitcher to win 300 games.

Season landmarks and “eye test” (2x multiplier). Two other James metrics are helpful here. His Hall of Fame Monitor awards points for accomplishments in individual seasons — including hitting statistical thresholds, winning awards of various kinds (e.g., All-Star Games, MVPs), and, to some extent, pennants or World Series. Then there’s James’s Black Ink Test, which assigns points based on leading the league in key statistical categories. Again, though, there’s also something of a subjective component. How much do a player’s best seasons “jump off the page” when you’re looking at his b-ref numbers?

Postseason contributions (2x multiplier). So far, every category we’ve talked about (except to a small extent James’s Hall of Fame Monitor Test) pertains strictly to the regular season. But I think it’s a considerable oversight to treat the postseason as merely a “rounding error.” With far more teams making the postseason in the years when players on this year's ballot competed, it’s at least a little bit of a black mark on a player’s resume if their teams rarely or never reached the playoffs.

You can get quite a long way in this category with objective data: I’ve even made an effort to quantify each player’s lifetime postseason WAR. But if a player had a particularly strong postseason in a year when a team won the World Series — say, CC Sabathia in 2009 — I think that deserves extra credit.

Here's the data. I’ve calculated a crude version of postseason WAR for each player based on their wRC+ or ERA+ (basically, how good they were relative to league average) and how many career postseason plate appearances or batters faced they had. My WAR estimates include a positional adjustment, but I don’t specifically account for player defense. You can see three real standouts — Andy Pettitte, Manny Ramirez, and Carlos Beltran — who contributed excellent postseason performance over a large sample. On the other end of the spectrum, Felix Hernandez never appeared in the playoffs at all, and a couple of players “contributed” negative value.67
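The exact formula behind these postseason estimates is not spelled out, so the following is only one plausible way to turn wRC+ and plate appearances into a crude WAR figure for a hitter. Every constant is an assumption chosen for illustration: roughly 0.12 league-average runs per plate appearance, a replacement level about 20 runs below average per 600 plate appearances, and about 10 runs per win. The function and parameter names are hypothetical, and, like the estimates described above, the sketch ignores defense.

```python
# All three constants are ASSUMED rules of thumb, not values from the article.
RUNS_PER_PA = 0.12                # assumed league-average runs per plate appearance
REPLACEMENT_RUNS_PER_600 = 20.0   # assumed average-minus-replacement gap per 600 PA
RUNS_PER_WIN = 10.0               # assumed runs-to-wins conversion


def crude_postseason_war(wrc_plus: float,
                         plate_appearances: int,
                         positional_runs_per_600: float = 0.0) -> float:
    """Rough hitter WAR from wRC+ (100 = league average) and playing time."""
    playing_time = plate_appearances / 600.0
    # Runs above (or below) average implied by wRC+.
    batting_runs = (wrc_plus / 100.0 - 1.0) * RUNS_PER_PA * plate_appearances
    # Credit for simply being better than a freely available replacement.
    replacement_runs = REPLACEMENT_RUNS_PER_600 * playing_time
    # Optional positional adjustment, scaled by playing time.
    positional_runs = positional_runs_per_600 * playing_time
    return (batting_runs + replacement_runs + positional_runs) / RUNS_PER_WIN


# Invented inputs: a 150 wRC+ hitter over 300 postseason plate appearances.
print(round(crude_postseason_war(150, 300), 2))
# A weak hitter in a small sample can come out below zero.
print(round(crude_postseason_war(40, 100), 2))
```

Under these assumptions a player with no postseason plate appearances scores exactly zero, and a poor hitter in limited chances can land in negative territory, which matches the shape of the results described above.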

Unique talents and abilities (1x multiplier). The final three categories are mostly subjective — almost entirely subjective — so get the least weight. This first category gives more credit to players (say, Suzuki) who played the game unlike anyone else or who excelled at particular aspects of the sport (say, Andruw Jones’s center field defense) — and less to well-rounded players like Bobby Abreu who weren't outstanding in any one area. Looking at Bill James’s similarity scores can help a little — the lower the statistical similarity to other players, the more unique a player is — but misses some nuances. “Style points” even matter at the margin here: Sabathia’s distinctive presence as a hulking left-handed power pitcher is more historically unique than Mark Buehrle’s lefty finesse approach, for instance.8

Franchise icon (1x multiplier). Was a player beloved by his fan base — or even by more than one fan base?

Believe it or not, there is actually a quasi-objective way to measure this — a way that wouldn’t have existed a few years ago — although some of you won’t like it since the “way” involves ChatGPT.

I asked ChatGPT9 to make a list of each franchise's 20 most popular players over the past 50 seasons. There are some debatable choices, and of course you have to make some mental adjustments for the quality of the competition: it's easier to crack the Marlins’ Top 20 list than the Yankees’. But after some prompt engineering, I think the lists are, for the most part, pretty good. If you tried to survey superfans or journalists who had covered the team for a long time, you'd do better, but maybe not that much better. The most conspicuous omission, for instance, is Abreu, who appears nowhere on the Phillies’ top 20 but was “considered a fan punching bag” during his tenure in Philly. Here are the lists in detail, and here are the rankings for the players on this year’s Hall of Fame ballot:

Integrity, sportsmanship, character (1x multiplier). As I’ve said, the Hall of Fame requires its voters to consider this factor. (I’d say it’s just one factor and not three — think of some compound German word like integritysportsmanshipcharacter. It would be hard to develop separate ratings for each subcategory.) But I don’t see anything in there that demands how much to weigh it.

For me, it's best treated only as a tiebreaker in otherwise marginal cases. The more time I spent considering the ballot, the less comfortable I became playing “morality police.” Take a player like Andruw Jones, who was arrested for domestic violence following his playing career. That may make him an awful human being, but there are lots of awful human beings in the Hall of Fame, including various types of racists, scoundrels, and cheaters. Or what about a player like Curt Schilling (no longer on the ballot), whose main crime was just having some, uh, inelegantly stated right-wing political opinions?

As a strong decoupler, I suppose I think we mostly ought to separate the art from the artist — the baseball player from the human being. Then again, the Hall of Fame asks voters to consider integrity, sportsmanship and character. Should it be limited strictly to things that had an effect on the diamond or in the clubhouse? So, say, Beltran’s role in the Astros trash can scandal would count — and PED use would matter — but you wouldn’t hold anything against Jones or Schilling?

I think that’s a tenable position, but the Hall doesn’t provide any guidance on it either way. I’m also not sure what you’d do with a case like Omar Vizquel's alleged sexual harassment of a batboy as a minor league manager after his playing days were over. I do know I'm far more morally alarmed by Vizquel's (alleged) behavior than by, say, Beltran’s, or really even than the PED users.

In the end, I sort of punted on this category. I outsourced my ratings to ChatGPT and Claude, basically hoping they could summarize the overall “vibe” about a player based on everything they’ve read about him on a 0-to-10 scale. Since large language models like ChatGPT incorporate some degree of randomness, I asked each AI model for ratings three times, mixing up the order of the players on the list. I excluded each player's top and bottom ratings and then averaged the others, rounding to the nearest half-point:

The one exception, as you’ll see once we get to Part II — this change is not reflected in the table above — is that I overrode ChatGPT to assign a 0 to Alex Rodriguez and Ramirez, the two players on this year’s ballot to actually have been suspended for PED use once Major League Baseball stopped turning a blind eye toward it.
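The aggregation described above (pool the ratings from both models, drop each player's highest and lowest, average the rest, round to the nearest half-point) plus the manual override can be sketched in a few lines of Python. The function names and all sample ratings are hypothetical, and rounding exact ties upward is an assumption, since the text does not say how ties were handled.

```python
import math
from typing import Sequence

# Manual overrides named in the text: both players are set to 0.
PED_SUSPENSION_OVERRIDES = {"Alex Rodriguez": 0.0, "Manny Ramirez": 0.0}


def round_to_half(x: float) -> float:
    """Round to the nearest 0.5, with exact ties going up (an assumption).
    Python's built-in round() would send ties to the nearest even number."""
    return math.floor(x * 2 + 0.5) / 2


def trimmed_rating(ratings: Sequence[float]) -> float:
    """Drop the single highest and lowest rating, average the rest."""
    if len(ratings) < 3:
        raise ValueError("need at least three ratings to trim both ends")
    kept = sorted(ratings)[1:-1]
    return round_to_half(sum(kept) / len(kept))


def character_rating(player: str, ratings: Sequence[float]) -> float:
    """Apply the manual override if one exists, else the trimmed average."""
    if player in PED_SUSPENSION_OVERRIDES:
        return PED_SUSPENSION_OVERRIDES[player]
    return trimmed_rating(ratings)


# Invented ratings: three runs from each of two models, six in total.
print(character_rating("Hypothetical Player", [4, 5, 5, 6, 6, 9]))  # 5.5
print(character_rating("Alex Rodriguez", [7, 8, 6, 9, 7, 5]))       # 0.0
```

Trimming both ends keeps one unusually harsh or generous run from moving a player's score much, which is the point of asking several times.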

But I’m “compartmentalizing” here: just zeroing out this category for them, in the same way that Hernandez gets a low score in the postseason category for never once appearing in the playoffs. I’m not willing to treat it as strictly disqualifying.
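To see what "compartmentalizing" means numerically, here is a minimal sketch that combines the eight 0-to-10 category ratings using the multipliers listed above. Treating the result as a simple weighted sum is an assumption, since the framework is explicitly not a magic formula, and the category keys and sample scores are hypothetical.

```python
from typing import Mapping

# Multipliers as listed for each category in this piece.
WEIGHTS = {
    "career_value": 3,
    "peak_value": 3,
    "career_landmarks": 2,
    "season_landmarks": 2,
    "postseason": 2,
    "unique_talents": 1,
    "franchise_icon": 1,
    "character": 1,
}


def weighted_total(scores: Mapping[str, float]) -> float:
    """Weighted sum of category ratings (ASSUMED way of combining them)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(WEIGHTS[cat] * scores[cat] for cat in WEIGHTS)


# Invented candidate rated 5 (a "typical" Hall of Famer) across the board.
typical = {cat: 5.0 for cat in WEIGHTS}
print(weighted_total(typical))  # 75.0 out of a possible 150

# Zeroing the character category costs 5 points here, not the whole case.
zeroed = dict(typical, character=0.0)
print(weighted_total(zeroed))   # 70.0
```

Because the character category carries a 1x multiplier, zeroing it can cost at most 10 of a possible 150 points under this assumption, so it behaves like a penalty rather than a disqualification.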

I know some Hall of Fame voters want to draw a line in the sand here. Several players who were merely reported to have used steroids (such as David Ortiz) but never suspended for it are already in Cooperstown, conversely. You could say that anyone from the “anything goes” era gets a pass. Although, in principle, that courtesy should also be extended to Roger Clemens, Barry Bonds, and Mark McGwire, which it clearly hasn’t been.

But I think the moral boundaries between what Rodriguez did and an admitted but never suspended PED user like Andy Pettitte aren’t so clear. And that’s really how the rest of the sport treats these players too. Ramirez and Rodriguez are not exactly pariahs. A-Rod has rehabilitated himself as a popular broadcaster. Ramirez is still remembered warmly by fans in Boston and Cleveland and was recently honored as part of the Red Sox’ celebration of the 2004 championship team.

I'd also ask this. From a sportsmanship standpoint, which is worse: to cheat when you risk getting caught or when you don’t? When there are consequences for it, or when there aren’t?

In poker, “angleshooters” who play within the ambiguities of the rules, but in a bad-faith way that’s against the spirit of the game, are often detested as much as outright cheaters. Indeed, once there is a penalty for PED use, using steroids is sort of an expected value calculation (performance benefit less penalty times risk of getting caught). A-Rod and Ramirez did pay a price in a way that Pettitte or Ortiz did not: they were suspended without pay, costing them millions of dollars, and they get punished in the historical record because their career totals are lower than they would have been without their suspensions. (Absent his season-long suspension, Rodriguez might have stuck around to challenge Bonds’ home run record, for instance.)
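The expected value framing in parentheses above (performance benefit less penalty times risk of getting caught) is simple enough to write down directly. This sketch uses hypothetical function names and invented numbers in arbitrary units; nothing here is an estimate of any real player's incentives.

```python
def expected_gain(benefit: float, penalty: float, p_caught: float) -> float:
    """Benefit of using, minus the penalty weighted by the chance of detection."""
    if not 0.0 <= p_caught <= 1.0:
        raise ValueError("p_caught must be a probability between 0 and 1")
    return benefit - p_caught * penalty


def break_even_probability(benefit: float, penalty: float) -> float:
    """Detection probability at which using stops paying off (capped at 1)."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    return min(1.0, benefit / penalty)


# Invented numbers: a benefit worth 15 units against a 40-unit penalty.
print(expected_gain(15.0, 40.0, 0.25))     # 5.0  (still pays at a 25% risk)
print(expected_gain(15.0, 40.0, 0.50))     # -5.0 (does not pay at a 50% risk)
print(expected_gain(15.0, 40.0, 0.0))      # 15.0 (no testing, no downside)
print(break_even_probability(15.0, 40.0))  # 0.375
```

With no chance of detection the penalty term vanishes entirely, which is the "anything goes" situation taken up in the next paragraph.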

In the “anything goes” era, refraining from PED use was more of a prisoner’s dilemma: if everyone else was cheating, you had an incentive to cheat, too, even if it left the collective worse off. (There were some negative consequences: PED use, if later detected, can undermine the popularity of the sport, and it can cause health problems.) So avoiding PED use was arguably more a matter of sportsmanship back then: doing the right thing even when it was probably against your narrow self-interest.
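For anyone unfamiliar with the prisoner's dilemma structure invoked here, a toy payoff table makes the logic concrete. The payoff numbers are invented purely to have the standard shape (using is better for the individual whatever rivals do, yet everyone using is worse than everyone abstaining), and the names are hypothetical.

```python
# Payoff to one player, keyed by (his choice, his rivals' choice).
# The values are INVENTED; only their ordering matters.
PAYOFF = {
    ("abstain", "abstain"): 3,  # clean sport, level playing field
    ("abstain", "use"): 0,      # he plays clean while rivals gain an edge
    ("use", "abstain"): 5,      # he gains an edge over clean rivals
    ("use", "use"): 1,          # level again, but with the collective costs
}

CHOICES = ("abstain", "use")


def best_response(rivals_choice: str) -> str:
    """The choice that maximizes a player's own payoff given what rivals do."""
    return max(CHOICES, key=lambda mine: PAYOFF[(mine, rivals_choice)])


for rivals in CHOICES:
    print(f"rivals {rivals}: best response is {best_response(rivals)}")

# Everyone abstaining beats everyone using, even though "use" always wins
# from the individual's point of view.
print(PAYOFF[("abstain", "abstain")] > PAYOFF[("use", "use")])  # True
```

In this toy setup, abstaining is never the self-interested choice, which is why staying clean in that era reads as sportsmanship rather than prudence.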

So, spoiler alert: A-Rod and Manny will make my hypothetical10 ballot. In fact, there’s not really any player for whom “integrity, sportsmanship, character” was a decisive factor, although there are one or two guys I might have given a longer look if not for some off-the-field considerations. That means we’ll mostly just get to focus on the fun part — baseball — in Part II.

My guess is that WAR captures something like 90 to 95 percent of a player’s “true” value for hitting, 80 to 85 percent for pitching (distributing credit between pitchers and fielders is somewhat tricky), and 70 to 75 percent for position player defense (perhaps less for catchers).

True, they’d have to bend the requirement to have played 10 major league seasons.

I’m deliberately using the ambiguous term “typical”; I suppose I think of this as a blend between the mean and the median.

Both Peak 5/7 and WAA+ are based on the Baseball Reference version of WAR though you’re welcome to adjust up or down if you prefer fWAR more.

The issue is that WPA is very sensitive to extremely high-leverage plate appearances (e.g., bases loaded in a close game), but these constitute a tiny sample size. So I also compared a player’s OPS in what Baseball Reference defines as “high-leverage situations”, which sets a lower bar, to his overall OPS.

CORRECTION: This table originally duplicated the totals for Manny Ramirez for Carlos Beltran; the error has been fixed.

CORRECTION: Ben Zobrist was originally listed as having been on one World Series championship team when he was on two, winning titles back-to-back with the Royals in 2015 and the Cubs in 2016.

Although Buehrle-type pitchers are becoming more endangered these days.

Namely, its o1 model, which I’ve generally found more useful than its predecessors for all sorts of tasks.

I’m not a BBWAA member and don’t have a vote.

