The purpose of this article is twofold. First, we’ll take a quick look at how much it costs teams, on average, to let a guy reach Super Two status. For an explanation of Super Two, you can go read this. Second, we’ll take a detailed look at how much teams pay for certain types of statistics during salary arbitration, and we’ll build a model to predict a player’s raise based on our findings.

To start, I’ll describe the sample I’m using. Basically, I pulled the data for every hitter between 2012 and 2015 who went through some sort of salary arbitration (Super Two, Arb1, Arb2, or Arb3). That’s a pretty simple sample, I’d say! This sample included 299 player-seasons from 174 players.

Let’s look at Super Two guys first.  In my sample, there are 41 players who entered salary arbitration via Super Two status.   Here’s a quick look at the average salary and raise received by players who were Super Two players to begin, and through the end of the sample’s arbitration years.  (NOTE: I set all players’ salaries at $500K for their first three (or two) years of team control to save time.  Adding the extra $5,000 to $30,000 wasn’t going to materially change the results and it probably saved me an hour of digging!)

super2_3

I thought it was pretty interesting that each year’s average raise was almost identical, except for the anomalous Arb2 year.  Big raises for Brandon Belt, Lucas Duda, Trevor Plouffe, and Brandon Moss brought that average up quite a bit.

In the total row you can see the average player who reached Super Two status earned about $14.3 million over his 4 arbitration years.

So, how does this compare to non-Super Two guys?  This would likely be beneficial information to have if you were a team deciding whether or not to play service time shenanigans with a guy, or just let him reach Super Two status and help the team early.   Behold:

super2_2

And now in graphical form!

super2_1

On average, by the end of arbitration, players’ salaries are about the same, but the Super Two player earns about $3.4 million more over a four-year arbitration period, or about $850,000 per season.

With that said, it’s probably a good idea to try and keep your non-superstar players from reaching Super Two status.

(Note: A certain guy who reached Super Two and got the largest raise of the entire sample, also signed a long-term deal the next year, giving us a sort of an outlier, which I omitted from all of the Super Two stuff above.  We refer to this guy as “Buster Posey,” who went from league minimum to $8,000,000 as a Super Two player.  He won an MVP, I hear.)

So, now to the second part (the meat!) of the article: a look at regular salary arbitration and what sort of statistics teams pay for.

The stats we’ll be looking at are as follows: age, games played, plate appearances, batting average, on-base percentage, slugging percentage, isolated power, home runs, stolen bases, FanGraphs Positional Defensive Runs Above Average (Def), hits, times on base, and times on base excluding home runs. From these stats and the analysis thereof, we’ll attempt to create a predictive model for salary arbitration hearings based on what teams have actually paid guys in the past.

The methodology for this was a bit tricky at first. First, you need to get all the data. This was a bit more time-consuming than I thought. There isn’t a place that has service time, salary data, and all the statistics in a single exportable database. So, every piece of data had to be entered into my own worksheets by hand! (I had this article planned for April and it became far too much work!) Second, you need to get the details of each salary arbitration hearing, and third, you need to find a meaningful way to compare individual seasons to the delta between what a player previously earned and what he was awarded in that year’s salary arbitration.

In order to get an idea of how well certain stats correlate to pay raises, I used a familiar method (to readers of my articles and posts) known as linear regression. This method gives you a value called R-squared [R2], which ranges from 0 to 1. The closer to 1 an R2 value is, the more of the variation in one variable is explained by the other.
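For anyone who wants to try this at home, here is a minimal sketch of how an R2 value for a single stat could be computed in Python with NumPy. The function name `r_squared` and the numbers below are made up for illustration; they are not taken from my sample, and this is not necessarily how my worksheets did the calculation.

```python
import numpy as np

def r_squared(x, y):
    """R-squared of a one-variable least-squares line fit of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Fit y = slope * x + intercept by ordinary least squares
    slope, intercept = np.polyfit(x, y, 1)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)   # variation the line fails to explain
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation in y
    return 1.0 - ss_res / ss_tot

# Hypothetical players: plate appearances and the raise each one received
plate_appearances = [150, 320, 410, 520, 600, 680]
raise_dollars = [200_000, 900_000, 1_100_000, 2_400_000, 2_900_000, 4_100_000]

print(round(r_squared(plate_appearances, raise_dollars), 3))
```

Swap in any other stat for `plate_appearances` to see how well it lines up with the raises.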

Visually, we can look at graphs to see what I mean.   Below are a few choice stats with their graphs, showing the relationship between the stat and the raise each player received.

arb3

def

arb4

arb2

As you can see, some things correlate well (times on base, home runs) and some things don’t correlate well (OBP and defense).

Something you may have picked up on is that certain stats, OBP for example, don’t take playing time into account. This causes a low R2, because a .400 OBP over a two-week call-up from AAA doesn’t really provide all that much accumulated value, but a .380 OBP over a full season would provide tons of value.

With that in mind, here’s a list of all the variables I looked at, along with their R2 values for a single-variable linear regression.

arb7

Interesting to me was how important plate appearances are in calculating pay raises. An R2 of nearly 0.6 is very high. Intuitively, this makes sense, because a player who receives a large number of plate appearances is probably doing something right.

Also interesting, and the thing which spawned the title of this article, is that batting average is more predictive of raises than OBP.  This essentially means walks and hit-by-pitches are “free” for a team when negotiating in salary arbitration (given what happened in my sample).

So, now that we have some ideas of what might be important, we need to move on to another method: multivariate linear regression. This lets us use multiple variables to make a model more accurate. This method has a few well-documented issues, namely overfitting (i.e., “kitchen sink regression”), where the R2 between the independent variables and the dependent variable never decreases when you add more independent variables, even if they are gibberish. Because of this, we use what is known as “Adjusted R-Squared,” which adjusts the R2 value downward based on how many independent variables were used.
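For the curious, the adjustment is a simple formula. Here is a small sketch of the standard textbook version; I am assuming it matches what most regression tools report, and the function name `adjusted_r_squared` and the 0.75 input are just for illustration.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Standard adjusted R-squared: penalizes R2 for each extra predictor."""
    dof = n_obs - n_predictors - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof

# Same illustrative R2 and a 299-row sample, with different numbers of predictors
for k in (1, 3, 10):
    print(k, round(adjusted_r_squared(0.75, 299, k), 4))
```

With the plain R2 held fixed, the adjusted value shrinks as predictors are added, so a new variable has to explain enough of the leftover variance to earn its spot.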

My first effort was to look at AVG, OBP, and SLG… the good ol’ triple slash… to see how all of those variables together worked out. These three netted an Adjusted R2 of 0.2364, which is lower than SLG by itself. Basically, this means adding AVG and OBP to SLG did nothing to strengthen this particular model.

Next, I decided to look at only things that involve some sort of playing time component, since those had high individual correlations from above.

I decided to look at getting on base (OBP), power (ISO), stolen bases, defense, and plate appearances.  Those stats should give a nice, well-rounded view of what type of player we’re looking at.  The Adjusted R2 for this venture? 0.6633.  Much better than our first attempt, but still only slightly higher than “times on base” by itself.  Let’s keep trying.

Next, I simply tried using the highest two individual variables (HR and PA) to see where we get.  How about an Adjusted R2 of .7227?  That’s nice!  I think we’re on to something.

Next, I started adding individual variables to the PA-HR-combo to see if I got better results.  After testing everything, the best I could find was the triumvirate of PA, HR, and TOB-HR.  The reason I use TOB-HR is to avoid double counting home runs.  This gave an Adjusted R2 of 0.7500.   Since TOB counts singles, doubles, triples, walks, and HBP the same, I tried adding SLG back into the mix… this actually lowered the Adjusted R2 to 0.7498.  Basically, SLG didn’t explain enough of the remaining variance to justify the noise it created by being included.
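If you want to run this kind of variable search yourself, here is a rough sketch using NumPy’s least-squares solver. To be loud about it: the data below is synthetic, generated from a rule I invented for the example, so the printed numbers will not match my results. The helper `fit_ols` is a hypothetical name, not something from my worksheets.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns (coefs, R2, adjusted R2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_obs, n_predictors = X.shape
    design = np.column_stack([np.ones(n_obs), X])  # prepend intercept column
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    r2 = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
    adj_r2 = 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)
    return coefs, r2, adj_r2

# Synthetic stand-in data (NOT the article's sample): 299 made-up player-seasons
rng = np.random.default_rng(0)
n_players = 299
pa = rng.integers(50, 700, size=n_players)
hr = rng.binomial(pa, 0.03)
tob_minus_hr = rng.binomial(pa, 0.30)
raise_dollars = (80_000 * hr + 16_000 * tob_minus_hr - 200 * pa
                 + rng.normal(0, 500_000, size=n_players))

# Compare candidate sets of predictors by adjusted R2
candidates = {
    "PA + HR": np.column_stack([pa, hr]),
    "PA + HR + (TOB-HR)": np.column_stack([pa, hr, tob_minus_hr]),
}
results = {}
for label, X in candidates.items():
    coefs, r2, adj_r2 = fit_ols(X, raise_dollars)
    results[label] = (r2, adj_r2)
    print(f"{label}: R2={r2:.4f}, adjusted R2={adj_r2:.4f}")
```

The plain R2 can only go up (or stay flat) when a column is added, which is why the comparison that matters is the adjusted one.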

So, here’s a chart to summarize what I just long-windedly typed:

arb6

There you have it… nearly three-quarters of what determines your arbitration salary raise comes from how often you went to bat, how many home runs you hit, and how often you got on base outside of your home runs.  Definitely not what I was expecting, but hey!

So, from that data we can create a model using the outputs of the multivariate linear regression.  We get the following model:

Salary Raise = ($78,537 * HR) + ($16,233 * (TOB - HR)) - ($218 * PA) - $391,280

As a player, you start out with a $391,280 hole, and it costs you $218 every time you step into the batter’s box. Hitting a homer will net you about $78k, while getting on base any other way will net you about $16k. Obviously, you aren’t going to get a salary cut, so the model is only meaningful for predicted raises of zero or more; any negative prediction is effectively a zero-dollar raise. Within the sample of 299 player-seasons, there were only two instances of a zero-dollar raise: Jurickson Profar, when he was injured for an entire season; and Drew Butera, who went 1-for-10 in 6 games for the Dodgers in 2013.
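To make the formula easier to play with, here it is as a small Python function. I’m reading TOB-HR as times on base minus home runs, and flooring negative results at zero is my own way of handling the “no salary cuts” point. The function name and the stat lines below are hypothetical.

```python
def predicted_raise(hr, tob, pa):
    """Predicted arbitration raise in dollars from the model's coefficients.

    hr:  home runs
    tob: times on base (including home runs)
    pa:  plate appearances
    """
    raw = 78_537 * hr + 16_233 * (tob - hr) - 218 * pa - 391_280
    # Assumption: a negative prediction is treated as a zero-dollar raise
    return max(0, raw)

# Hypothetical full-season regular: 25 HR, 200 times on base, 600 PA
print(predicted_raise(hr=25, tob=200, pa=600))  # 4282120

# Hypothetical cup of coffee: 0 HR, 1 time on base, 10 PA
print(predicted_raise(hr=0, tob=1, pa=10))      # 0
```

So the made-up regular above would be looking at a raise of roughly $4.3 million, while a guy with almost no playing time gets nothing.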

As the last step, let us graph every player’s expected raise (as calculated by the above model) versus his actual raise and see what it looks like!

arb1

The model itself carries an R2 of 0.7803, meaning approximately 78% of the variation in players’ raises can be explained by PA, HR, and TOB-HR.

On the graph, I labeled a few outliers. The reasons for each are pretty evident: Buster Posey won an MVP, Chris Davis hit 53 home runs, Giancarlo Stanton had already accumulated 117 career homers before hitting Arb1, and Matt Wieters just had a really good overall year as a catcher for Baltimore. Eliminating those outliers makes the R2 of the model increase to over 0.8.

So, loyal Nation readers… what do you think makes up the other 20%?  Agent? Team? Desire to sign the player later?  Let’s hear about it in the comments!

Also, try predicting what you think Peraza, Herrera, Duvall, and Winker’s stats will be in their first arbitration year, along with their predicted salaries!  I promise you’ll have fun… 😉