
Welcome to another edition of Saber rattling. We had a family emergency over the weekend, which kept me from releasing this column sooner. Nevertheless, here we go!
Typically, I use this space to share my affinity for advanced baseball statistics. Today, I’m here to tell you that sometimes sabermetrics gets it wrong. In the case of Wins Above Replacement, the elders of baseball math are entirely wrong, and I have the data to prove it.
From time to time, the numbers guys in baseball will bicker about the efficacy of a particular statistic. That nasty Wins Above Replacement (WAR) is no different, sparking heated debates about why Player A was more deserving of an award than Player B. The problem comes when commentators rely on anecdotal evidence to support a perceived statistical relationship. So, what is WAR’s statistical relationship with runs scored?
Before diving into the data, I need to establish the (il)logical platform that WAR rests on. First, any measure of player value (something I refer to in my metrics as effectiveness) necessarily requires a direct and substantial relationship to the product that determines wins and losses. Of course, I am referring to runs scored. Runs are scored in a variety of ways. An increase in the specific game events involved in scoring runs leads to a corresponding fractional increase in player value (effectiveness). Naturally, a decrease in those same events reduces a player’s fractional value. The only snag in this otherwise wonderful tapestry is that WAR has no meaningful relationship to runs scored.
The faint of heart should skip the next paragraph.
In math, it is always more difficult to prove a negative; this is referred to as a proof of impossibility. The process involves selecting from a handful of methods to show whether something exists. Wins Above Replacement appears to be built on a proof by contradiction (a kind of non-constructive proof). Using a constructive proof, I will show you that the relationship WAR claims to capture doesn’t exist. Don’t worry. This is the extent of our foray into theoretical math. I wanted you to have heard the concepts, even if you have no idea what they mean. It’s a fancy way of saying we can prove or disprove something using actual numbers.
Welcome back if you skipped ahead.
So, WAR, huh?
According to Baseball-Reference, WAR measures how many runs a player produced versus the number of runs expected from a replacement player. BR is not alone in designing its own version of the metric. Scholarly studies have tried to legitimize the statistic. Regardless of the proprietary version, each WAR formula attempts to measure player (or pitcher) runs above replacement. The Baseball-Reference formula incorporates a combination of batting runs, baserunning runs, runs added or lost due to grounding into double play situations, fielding runs, positional adjustment runs, and replacement-level runs. Let’s look at what the data says about each of these categories. Keep in mind that each should have a significant relationship (correlation) to runs scored, or we’re just tilting at windmills.
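If you would like to see the bookkeeping behind that description, here is a minimal Python sketch of how a WAR-style total gets assembled from those six run components and converted into wins. The component values and the roughly ten-runs-per-win divisor are illustrative assumptions on my part, not Baseball-Reference’s actual proprietary inputs.

    # A WAR-style total: sum six run components, then convert runs to wins.
    # The values below and the ~10 runs-per-win divisor are illustrative
    # assumptions, not Baseball-Reference's proprietary inputs.
    def war_like_total(batting, baserunning, double_play, fielding,
                       positional, replacement, runs_per_win=10.0):
        runs_above_replacement = (batting + baserunning + double_play
                                  + fielding + positional + replacement)
        return runs_above_replacement / runs_per_win

    # A made-up stat line: 25 batting runs, a little baserunning value, etc.
    print(war_like_total(25.0, 1.5, 0.5, -3.0, -7.5, 20.0))  # -> 3.65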
Batting Runs
The batting runs component uses linear weights to credit hitters for specific game events. It’s based on a batter’s weighted on-base average (wOBA). If you’re a regular reader, you already know wOBA has a measly 21% relationship with runs scored. This tells us that wOBA doesn’t remotely measure the very thing it was designed to approximate. Not to worry; the Baseball-Reference version of batting runs uses a proprietary metric called weighted runs above average (wRAA). wRAA is a zero-sum statistic (nothing wrong with that) where the average is zero. It means the average batter produces exactly zero runs above or below average.
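For the curious, here is a rough Python sketch of the linear-weights idea. The event weights and the wOBA scale below follow one commonly published formulation and are only illustrative; the actual coefficients are recalibrated every season and differ between the proprietary versions.

    # Illustrative linear weights (ballpark published values); the real
    # coefficients change year to year and vary by WAR implementation.
    def woba(ubb, hbp, singles, doubles, triples, hr, ab, bb, ibb, sf):
        numerator = (0.69 * ubb + 0.72 * hbp + 0.89 * singles
                     + 1.27 * doubles + 1.62 * triples + 2.10 * hr)
        return numerator / (ab + bb - ibb + sf + hbp)

    # wRAA centers a hitter on the league average: an exactly average hitter
    # gets zero. The 1.15 wOBA scale is an assumed, illustrative value.
    def wraa(player_woba, league_woba, pa, woba_scale=1.15):
        return (player_woba - league_woba) / woba_scale * pa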
To be fair, I use zero-sum statistics. My runs produced (or prevented) above average (RVAL) is my version of a zero-sum measurement of batters and pitchers. The difference is that I leave my zero-sum metric as its own end product and refuse to incorporate it into another statistic. This is an important distinction. I can show (globally, across every particular game event in professional baseball history) the direct correlations to runs scored among the components I use to calculate RVAL. What cannot be shown is a global RVAL correlation to runs scored, because a zero-sum statistic cannot have a measurable relationship to the entire population of any counting statistic; one is a gross (total) number and the other is an average. It would be like comparing how often you were late to work versus the average employee against the total number of minutes you were late this year. It’s a horrendous apples-to-oranges comparison.
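To see the zero-sum property in miniature, here is a small sketch with synthetic numbers. Every player’s figure is measured against the league average, so the league-wide total is zero by construction.

    import random

    random.seed(1)
    raw = [random.gauss(70, 15) for _ in range(300)]   # synthetic per-player run totals
    league_avg = sum(raw) / len(raw)
    centered = [x - league_avg for x in raw]           # an RVAL/wRAA-style zero-sum figure

    print(round(sum(centered), 6))   # ~0.0 by construction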
At its best, batting runs can only point to weighted on-base average to establish a relationship with runs scored. A 0.2106 correlation pales in comparison with all but one derivative measure (two or more raw statistics combined into a new metric, like batting average or stolen base percentage) when it comes to runs scored. Simply put, wOBA—and, by extension, batting runs—is a terrible approximation of runs scored. Keep in mind that zero-sum statistics—like wRAA—cannot be globally correlated to a counting statistic, and you are left with batting runs relying on a strange relationship-by-marriage (wOBA) in an attempt to lay claim to legitimacy.
Baserunning Runs
Stolen bases are fun. They do have a part to play in determining runs scored. To be precise, swiping a bag has a 46% relationship with runs scored (0.4595 correlation). Having said that, a player’s stolen base percentage has almost no impact on their overall value. Since WAR does not have any data relationship with runs scored, we must evaluate the potential strength of a player’s stolen base percentage through an algorithm strongly correlated with runs scored.
If we add stolen base percentage to a player’s effectiveness score (EFT), the correlation between player effectiveness and runs scored increases from 0.7385 to 0.7414. This tells us that a player’s stolen base percentage has, by itself, a 0.29% positive relationship with scoring runs. At the same time, we have to look at a player’s caught stealing percentage to include both aspects of baserunning runs as they appear in WAR calculations.
A player’s caught stealing percentage would theoretically reduce the correlation between EFT and runs scored from 0.7385 to 0.7357. This demonstrates a 0.28% negative relationship with runs scored, which is the expected opposite of the stolen base percentage. The dilemma comes when both a player’s stolen base percentage and his caught stealing percentage are considered as mutual hypothetical possibilities. Incorporating both aspects of baserunning runs into a “new and improved” EFT calculation moves the needle only 0.01%.
To be clear, that number reads zero-point-zero-one percent. We are supposed to believe that baserunning runs are a crucial ingredient in determining Wins Above Replacement, yet the actual increase in correlation would be 0.0001. It’s tough to argue the importance of any statistic that possesses a 0.01% correlation; it’s even more farcical when that statistic is supposedly one of the six critical factors in determining a player’s value.
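If you want to replicate this kind of before-and-after test yourself, the mechanics look like the Python sketch below: compute the Pearson correlation between an effectiveness score and runs scored, fold one more component into the score, and compare. The data, the blending weight, and the stand-in EFT values here are all hypothetical; the 0.7385 and 0.7414 figures in this column come from my own dataset and formula.

    from statistics import correlation   # Python 3.10+

    # Hypothetical team-season data, for illustration only.
    runs_scored = [810, 765, 692, 744, 701, 655, 723, 688]
    eft         = [0.62, 0.58, 0.51, 0.56, 0.52, 0.47, 0.54, 0.50]
    sb_pct      = [0.81, 0.74, 0.70, 0.77, 0.68, 0.72, 0.75, 0.69]

    base = correlation(eft, runs_scored)
    eft_plus_sb = [e + 0.05 * s for e, s in zip(eft, sb_pct)]   # arbitrary small weight
    with_sb = correlation(eft_plus_sb, runs_scored)

    print(round(base, 4), round(with_sb, 4), round(with_sb - base, 4))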
Runs Added or Lost Due to Grounding into Double Play Situations
Including runs from grounding into double play situations fares slightly better than baserunning runs. In fact, adding this measure into a player’s hypothetical EFT score increases the correlation between player effectiveness and runs scored by 0.0003. This translates into a boost of 0.03%. The combined value from adding baserunning runs and runs from double play situations is still less than half of one-tenth of one percent. Halfway through the WAR formula, better minds than you or me are batting 4 for 1,000 in their combined correlation with runs scored. Yikes.
Fielding Runs
If I’m honest with you, I’ve already shared the good news about the components of WAR. Things are about to get ugly. If you’re a fan of Wins Above Replacement—shame, shame—you may want to take an antacid.
Fielding runs have no raw measurable statistic except errors. Defensive zone ratings don’t exist in the raw data. Defensive runs saved don’t exist either. You’d have better luck landing a triple axel and coaxing a perfect score out of a French ice skating judge. All of these are purely subjective pursuits.
The shortstop should have gotten to that hot grounder in the hole. A better centerfielder would have run down that deep fly. We’ve heard them all before. It doesn’t change the personal nature of judging what a fielder should be able to do. Examining subjectivity another way: from 2008 to 2013, professional umpires missed 13% of pitches thrown in the strike zone, calling them balls instead of strikes. If major league umpires are wrong more than ten percent of the time, why would we expect any better from the folks scoring defensive zones? It reeks of unstructured interpretation in a discipline (statistics) that measures relationships to the decimal point.
Enough of that, though. Let’s put some numbers to the one defensive statistic we can measure: fielding and throwing errors. By itself, any error has a correlation of -0.0392 with scoring a run. It tells us that the actual, measurable flub of the ball corresponds to a negative 4% relationship with the only thing a defense is trying to prevent—the offense scoring runs. I’ve also written previously that the worst defensive team in any given year commits, on average, one more error every 2.5 games than the best defensive team. This trend has been in place for decades, telling us that while unforced errors are detrimental, they are not nearly as significant over the course of a season as we’ve been led to believe.
If we consider errors as a percentage of plate appearances (E/PA), the new statistic has a much stronger correlation with runs scored at -0.6426. On its face, this would seem to reinforce the belief that fielding runs DO matter. However, when error percentage is plugged into our hypothetical player effectiveness formula, it decreases the correlation between EFT and runs scored by 0.47%. The result of this decrease tells us that valuing players on their fielding runs is even more unfounded in the data than the other components of WAR we’ve already discussed.
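Here is the same kind of sketch for the error-rate idea: turn raw errors into a rate stat (E/PA) and check its correlation with runs scored. The team totals below are hypothetical stand-ins; the -0.0392 and -0.6426 figures quoted above come from my full dataset.

    from statistics import correlation   # Python 3.10+

    # Hypothetical team-season totals, for illustration only.
    errors      = [98, 112, 87, 120, 105, 93, 76, 110]
    plate_app   = [6150, 6080, 6210, 6020, 6100, 6175, 6240, 6055]
    runs_scored = [745, 690, 780, 655, 700, 760, 800, 672]

    e_per_pa = [e / pa for e, pa in zip(errors, plate_app)]
    print(round(correlation(e_per_pa, runs_scored), 4))   # negative with this sample data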
If you’re keeping score at home, the measurable components of WAR have actually weakened the correlation between player effectiveness and runs scored by 0.0042.
Positional Adjustment Runs
We are at the point of the conversation where we have exhausted all evidence-based claims supporting WAR. The data dive wasn’t kind. The logical claims are equally underwhelming.
The first logical claim in the WAR formula is that players from different positions deserve varying amounts of credit based on the relative difficulty of their particular position. If you believe this claim has merit, well, I must break you. By extension, the underlying logic of this claim implies that winning a race—say, a 100-meter dash—should be judged on some relative difficulty among the participants.
You would also have to favor calculating batting averages based on the varying sizes of the strike zone for players of different heights. Shorter players have an unnatural advantage. We mustn’t let them be overvalued for their diminutive size. For that matter, bigger, stronger batters probably don’t deserve credit for all those home runs and extra-base hits because they are more physically able. The bottom line is positional adjustment runs represent a non sequitur to how raw baseball data is collected.
Replacement-level Runs
Everything I’ve shared thus far compares major league ballplayers to other major league ballplayers. As unfounded as those comparisons are in the data and logic, they at least represent two real things being examined side by side. Replacement-level runs, on the other hand, takes a bad idea—purely fictional replacement players—and compounds it by assigning a statistical value not found anywhere in the data as the yardstick for actual players. I cannot overemphasize the utter lack of logic and evidence this sort of claim rests on. Since there isn’t even agreement on what replacement means, it is a non-starter to assign a hypothetical run value to hypothetically-abled players.
Let’s use a real-world example to illustrate the absurdity of assigning value to replacement professionals. Imagine you booked a flight on your preferred commercial airline. You get to the ticket counter, and the attendant tells you the pilot went down with an injury but not to worry, they’ve called up a replacement pilot. Being a replacement, the new pilot isn’t as good as the scheduled pilot. In fact, the convoluted statistical formula used to evaluate pilots on take-offs and landings suggests that replacement pilots successfully take off and land planes exactly zero percent of the time compared to major league pilots. After all, that is the basis for how major league pilots earn their big bucks. Would you still get on the plane?
Perhaps you need to see a cardiac surgeon. On the day of the surgery, the nurse is prepping you and mentions the regularly scheduled surgeon won’t make it in, but a replacement surgeon was just brought in from the teaching hospital. Like the pilots, surgeons in this example are measured by how many patients above replacement survive the surgery. Will you go through with the surgery?
Okay. Both of those are life and death. I get it; it’s extreme. Let’s confine our examples to less drastic occupations. We’ll look at financial advisors instead. Money matters to everyone I know. I imagine it matters to you too. Who doesn’t want to grow their wealth?
Let’s map out what it means to be a replacement financial advisor. How do we want to measure the value of financial advisors? If we follow the WAR model, we need six components. I don’t know about you, but I only care about the money. I didn’t hire my advisor because I needed another friend; I hired them to grow my retirement account. I can put up with a know-it-all jerk who makes me money. I’ll probably fire the person who only makes replacement-level financial gains; then again, we’d all fire that person. That still doesn’t imply I want to choose my financial advisor based on how they compare to the person we’d all fire. I want to compare them to the best. That’s the only way I know if I’m getting as close to a sure bet as possible in such a risky business. Yet, Wins Above Replacement folks want us to believe that Mike Trout and Joe DiMaggio should be compared to some nameless, faceless, fictional ballplayer rather than being compared directly to one another.
Wow. I really got on my soapbox there. If you can’t tell, I’m passionate when it comes to denouncing WAR. Not only do I contend that Wins Above Replacement has no evidentiary foundation in the data, but it is also a specious statistic that looks like science. Because it looks authentic, WAR is perpetuated as a legitimate sabermetric stat even though we can clearly see it is not. In closing, I’ll leave you with a question. If WAR can effectively measure player value, why isn’t there a single, standardized formula to do so?
I hope you’ve enjoyed this weekly column. I want to challenge your thinking about baseball statistics. Someday, my own research on the game will become outdated. Please feel free to spar with me about the ideas I’ve presented here—I enjoy the discussion because it challenges my thinking. I can be reached here on Baseball Almanac, via email at chriswrites@schristophermichaels.com, and on social media (Facebook, Twitter). As always, this has been the World According to Chris. Thanks for tuning in.