QSAO

Big Baller Data: A Basketball Analytics Guide by Anthony Turgelis

By: James Acres, Josh Antonucci, Michael Blumel, Cameron Raymond, Cody Smith, Hunter Smeaton

All current stats used are from basketballreference.com at time of article's publication.

As NBA fans, we are constantly bombarded with statistics. Every evening you look at your phone to see notifications from various apps: a triple-double for LeBron, 50 points and 10 rebounds from Anthony Davis, and so on. We are constantly exposed to these simple statistics; they form our opinions on players, and they are what we use to back up arguments when discussing NBA players with peers. Although these statistics are extremely valuable, it is important to acknowledge other analytical methods that can help build a more complete understanding of statistics in the NBA. Analytics certainly cannot paint the entire picture of a basketball game, but they are certainly a part of it, so there’s no sense in ignoring them any longer.

This guide will introduce you to many concepts that are prevalent in the basketball analytics community. They can be used for your own analysis, or to enhance your viewership of the game. Hopefully, there will be concepts throughout that will challenge the way you fundamentally think about the game of basketball.

Moreyball (Not a typo)

If you are a fan of sports, baseball or analytics, then you most likely have seen or heard of the movie/book “Moneyball”. Just like our baseball guide states, if you haven’t seen it, you should watch it as soon as possible. Bill James was the true pioneer behind bringing advanced statistics to the mainstream in sports and Daryl Morey is taking it to the next level in the NBA, introducing “Moreyball”.

Daryl Morey is the Houston Rockets GM. Morey was not an athlete and had no basketball experience whatsoever. He earned a bachelor’s degree in computer science from Northwestern University and an MBA from MIT. Morey is a stats junkie who has built the modern Houston Rockets on heavy use of analytics.

On the other end of this spectrum is Charles Barkley. Barkley, a Hall of Famer and 11-time All-Star, argues that “analytics is crap,” that the NBA is talent-based, and that Morey is “one of those idiots.” He went as far as saying analytics is "a bunch of guys who ain't never played the game [and] they never got the girls in high school." Watch the rant in the YouTube video below:

TNT commentator Charles Barkley rants about analytics in the NBA and Houston Rockets GM Daryl Morey.

That was two years ago, when Houston finished with a 55-27 record. Today, Houston boasts the NBA’s best record to date and Moreyball is in full effect, relying on two basic tenets.

  • 3 > 2
  • It’s much easier to dunk the ball than to shoot it

The idea is that the most efficient shots in basketball are layups/dunks and 3-pointers. The former makes perfect sense: you’re less likely to miss a shot if you are extremely close to the rim. However, it wasn’t until somewhat recently that teams began looking closer at the 3-point shot. Morey’s key observation was that someone who takes 100 3-point shots and makes a third of them produces the same number of points as someone who takes 100 2-point shots and makes half of them. 33% from 3-point range is below league average, but 50% on all 2-pointers is extremely impressive, unless the majority of your shots come at the rim.
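Morey’s observation comes down to simple expected-value arithmetic. The sketch below is only an illustration of that arithmetic; the function name is made up for this example and the two shooters are the hypothetical ones from the paragraph above.

```python
def expected_points_per_shot(make_rate: float, shot_value: int) -> float:
    """Average points produced by one attempt of a given shot type."""
    return make_rate * shot_value


# The two hypothetical shooters from the text.
three_point = expected_points_per_shot(1 / 3, 3)  # makes a third of his threes
two_point = expected_points_per_shot(0.50, 2)     # makes half of his twos

print(f"3PT shooter: {three_point:.2f} points per shot")
print(f"2PT shooter: {two_point:.2f} points per shot")
print(f"Points on 100 attempts: {100 * three_point:.0f} vs {100 * two_point:.0f}")
```

Both shooters come out at one point per attempt, which is why a below-average 3-point shooter can match a very good 2-point shooter.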

This analytical approach is largely based on advanced stats like True Shooting percentage. This adjusts for the fact that a 3-point shot is worth 50% more than a 2-point shot, and that free throws are a part of an efficient offensive performance as well. Morey’s conclusion was that instead of taking a mid-range shot, in most cases, you are better off taking a few steps back and shooting a 3-pointer.
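The article does not spell out the True Shooting formula, so the sketch below uses the commonly published definition, TS% = PTS / (2 * (FGA + 0.44 * FTA)). The function name and the sample stat line are assumptions made for illustration only.

```python
def true_shooting_pct(points: float, fga: float, fta: float) -> float:
    """True Shooting percentage: PTS / (2 * (FGA + 0.44 * FTA)).

    The 0.44 factor is the usual estimate of how many free-throw
    attempts amount to one shooting possession.
    """
    shooting_possessions = fga + 0.44 * fta
    if shooting_possessions == 0:
        raise ValueError("player has no shooting attempts")
    return points / (2 * shooting_possessions)


# Hypothetical stat line: 30 points on 20 field goal attempts and 10 free throws.
print(f"TS%: {true_shooting_pct(30, 20, 10):.3f}")
```

With no free throws, a player who makes half of his twos and a player who makes a third of his threes both land on a TS% of 0.500, which matches the Moreyball comparison above.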

This is shown perfectly in Game 1 of last year’s Western Conference Semi-Finals, where the Rockets bested the San Antonio Spurs 126-99. Below is a visualization of all the shots that the Houston Rockets took that night.

(Credit /u/BradGroux, Reddit.com)

In this win the Rockets were able to produce 27 more points, while only taking 3 more shots than the Spurs.

However, this brings us to the limitations of Moreyball. The Spurs were able to adjust throughout the series to better defend the James Harden-led squad, and moved on to the Western Conference Finals after 6 games.

The fate of Moreyball remains to be seen. Without a Houston championship, it will be hard to convince the old guard of basketball that analytics can win championships. However, with the Rockets currently sitting on the best record in the league, and the philosophy’s poster boy James Harden looking primed to win the MVP award, they seem confident. We encourage you to join us in the future as we follow the journey of Moreyball, especially come playoff time, when defense strengthens and every move will be analyzed under a microscope.

Intro to Advanced Basketball Analytics Metrics

Effective Field Goal Percentage (eFG%): Effective Field Goal percentage is a metric that you may have occasionally encountered. eFG% is a pretty easy concept to understand, as it simply takes into account the fact that three-point shots are worth 50% more than two-point shots. Looking at this numerically, shooting 33.33% from three is equal to shooting 50% from two (remind you of Moreyball?). This is an important statistic to consider alongside a given player’s field goal percentage, as it will give you a better understanding of their true efficiency in scoring the basketball. An example of this is shown when looking at DeMar DeRozan and James Harden. This season, DeRozan’s field goal percentage (46.1%) is higher than Harden’s (45.1%), but his effective field goal percentage is lower: DeRozan is at 49.4% while Harden’s eFG% is 54.6%. This can be attributed to the fact that Harden takes (and makes) a lot more three-point shots than DeRozan does, resulting in a higher eFG%.
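As a rough sketch, eFG% can be computed with the commonly published formula (FG + 0.5 * 3P) / FGA. The formula is not written out in this guide, and the function name and the two hypothetical shooters below are assumptions for illustration.

```python
def effective_fg_pct(fg_made: float, three_made: float, fg_attempts: float) -> float:
    """eFG% = (FG + 0.5 * 3P) / FGA, so each made three counts as 1.5 makes."""
    if fg_attempts == 0:
        raise ValueError("player has no field goal attempts")
    return (fg_made + 0.5 * three_made) / fg_attempts


# Hypothetical shooter A: 300 attempts, all threes, 100 made (33.3% FG).
# Hypothetical shooter B: 300 attempts, all twos, 150 made (50.0% FG).
shooter_a = effective_fg_pct(fg_made=100, three_made=100, fg_attempts=300)
shooter_b = effective_fg_pct(fg_made=150, three_made=0, fg_attempts=300)
print(f"Shooter A eFG%: {shooter_a:.3f}")
print(f"Shooter B eFG%: {shooter_b:.3f}")
```

The two shooters have very different raw field goal percentages but the same eFG%, which is the same effect that separates DeRozan and Harden above.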

Value Added (VA) = (Minutes * (PER - PRL)) / 67. This is the estimated number of points a player adds to a team’s season total above what a ‘replacement player’ (for instance, the 12th man on the roster) would produce. More on PER later (it needs its own section), so circle back here. The PRL (Position Replacement Level) = 11.5 for power forwards, 11.0 for point guards, 10.6 for centers, 10.5 for shooting guards and small forwards.

Estimated Wins Added = Value Added (VA)/30
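The two formulas above translate directly into code. This is a minimal sketch using the replacement levels listed in the text; the function names, position codes, and the sample player are hypothetical.

```python
# Position Replacement Levels as given in the text.
POSITION_REPLACEMENT_LEVEL = {
    "PF": 11.5,
    "PG": 11.0,
    "C": 10.6,
    "SG": 10.5,
    "SF": 10.5,
}


def value_added(minutes: float, per: float, position: str) -> float:
    """VA = (Minutes * (PER - PRL)) / 67."""
    prl = POSITION_REPLACEMENT_LEVEL[position]
    return minutes * (per - prl) / 67


def estimated_wins_added(va: float) -> float:
    """EWA = VA / 30."""
    return va / 30


# Hypothetical point guard: 2,680 minutes played with a PER of 21.0.
va = value_added(minutes=2680, per=21.0, position="PG")
print(f"Value Added: {va:.1f} points")
print(f"Estimated Wins Added: {estimated_wins_added(va):.1f}")
```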

Usage Rate (USG) = [(FGA + (FTA * 0.44) + (Assists * 0.33) + TO) * 40 * League Pace] / (Minutes * Team Pace). Don't worry, someone else does all of the calculations. What all these calculations lead to is the number of possessions a player uses per 40 minutes.

This statistic aims to identify the players a team relies on most often to create something on offence. In the 2016-17 season, Russell Westbrook broke the single-season record for triple-doubles. His workload shows up numerically in his usage rate of 40.8%, the highest in the NBA. This means that for almost half the game the team relied on him to create scoring: he averaged 31 points per game and 10 assists (roughly 25 more points per game), for a grand total of around 56 points of production per game. The team as a whole averaged 106.6 PPG. To say he was heavily relied on would be an understatement.
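For readers who do want to see the calculation, here is a minimal sketch of the usage rate formula exactly as written above. The function name, parameter names, and the sample totals are hypothetical and are not real player data.

```python
def usage_rate(
    fga: float,
    fta: float,
    assists: float,
    turnovers: float,
    minutes: float,
    team_pace: float,
    league_pace: float,
) -> float:
    """Possessions used per 40 minutes, adjusted for team pace."""
    possessions_used = fga + (fta * 0.44) + (assists * 0.33) + turnovers
    return (possessions_used * 40 * league_pace) / (minutes * team_pace)


# Hypothetical single-game line: 20 FGA, 10 FTA, 10 assists, 5 turnovers in
# 35 minutes, on a team that plays at exactly league-average pace.
usg = usage_rate(fga=20, fta=10, assists=10, turnovers=5,
                 minutes=35, team_pace=100.0, league_pace=100.0)
print(f"Possessions used per 40 minutes: {usg:.1f}")
```

A player on a faster-than-average team gets a lower figure for the same raw totals, since the `league_pace / team_pace` ratio scales his possessions down.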

Player Efficiency Rating (PER): 

The most popular advanced metric in basketball today is player efficiency rating, or PER for short. If you are familiar with baseball statistics, PER is comparable to WAR as a way to gauge a player’s efficiency relative to others. This metric involves one of the most complex formulas in the analytics of any major sport.

What PER tries to accomplish is evaluating how productive a player is on a per-minute basis. It adds up the positive contributions a player makes on the court while subtracting negative contributions in a statistical point value system. Things like points, rebounds, and assists would obviously be positive additions, while turnovers would be negative. This stat is adjusted for pace and playing time, which makes it easy to compare player to player.

The shortcoming of this stat is that there are not many stats in basketball that capture how efficient a player is on defense. Sure, there are blocks and steals, but these only tell so much and can be mostly a result of good team defense rather than individual defense. This deficiency becomes truly evident in 2013, when Paul George, one of the NBA’s best two-way players, had a lower PER than Jamal Crawford and J.R. Smith. For those of you who don’t know much about J.R. Smith, he is one of the best bad shot takers and makers in the NBA. Take a look at the video below and you’ll get a good idea of why his shot selection should rank him much lower.

Some analysts are obsessed with this stat, and others aren’t. Like all advanced statistics, you must view the whole picture before determining whether a player is performing well or not. This season, in Cleveland’s struggles with Isaiah Thomas, LeBron was close to averaging a triple-double yet constantly had a negative PER. A triple-double (10+ in any 3 categories) is one of the most impressive things a player can do, so even if you are not familiar with basketball you can quickly realize that PER is not the be-all and end-all stat. Typically, though, it can give you a quick snapshot of who the most productive players on the court are, and it generally includes the NBA elite.

How it's Calculated (You don't have to follow the whole thing, but it's good to view the inputs):

The calculation is the overall rating of a player’s per-minute statistical production and is widely applied by the largest sports corporations to distinguish players between one another. The league average is 15.00 every season.

The formula begins with calculating the unadjusted PER (uPER):

(Image: uPER formula, parts 1 and 2)

Where:

(Image: supporting definitions for the uPER formula)

With:

tm, a prefix indicating a team total rather than a player total;

lg, a prefix indicating a league total rather than a player total;

min for number of minutes played;

3P for number of three-point field goals made;

FG for number of field goals made;

FT for number of free throws made;

VOP for value of possession (in reference to the league, in this instance);

RB for number of rebounds: ORB for offensive, DRB for defensive, TRB for total (combined), RBP for percentage of offensive or defensive.

Got all that? Good.

Once uPER is calculated, it must be adjusted to team pace and normalized for the league to become PER.
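The adjustment step can be sketched as follows, assuming the commonly published form PER = uPER * (lgPace / tmPace) * (15 / lguPER). That exact form is not shown in this guide, and the function name and sample numbers are hypothetical.

```python
def per_from_uper(
    uper: float,
    team_pace: float,
    league_pace: float,
    league_uper: float,
) -> float:
    """Pace-adjust an unadjusted PER and scale it so the league average is 15."""
    pace_adjusted = uper * (league_pace / team_pace)
    return pace_adjusted * (15 / league_uper)


# Hypothetical inputs: a player with uPER 0.45 on a fast team (pace 102 vs a
# league pace of 97), in a league whose average pace-adjusted uPER is 0.30.
player_per = per_from_uper(uper=0.45, team_pace=102.0,
                           league_pace=97.0, league_uper=0.30)
print(f"PER: {player_per:.1f}")
```

By construction, a player whose pace-adjusted uPER equals the league average comes out at exactly 15, which is the league-average anchor mentioned above.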

This final step takes away the advantage enjoyed by players on teams that play an uptempo style, since the adjustment puts production on a per-possession basis and makes players easier to compare. Looking at ESPN’s top 10 list for the NBA, a trend across all of the players is that they seem to create shots and momentum on offense effortlessly.

(Image: PER leaders)

Intro to Match-Up Based Statistical Analysis

In sports, everyone is trying to find a new way to predict performance based on statistical analysis. With basketball being a match-up based sport, a match-up based analysis style is the most effective tool for predicting performance. Match-Up Based analysis deals with assessing habits of players, how efficient they are in certain areas on the floor, both offensively and defensively, and comparing this to their likely opponent in a given, upcoming game.

Here is a basic hypothetical example of match-up based analysis for a Toronto Raptors vs. Houston Rockets game. To keep this short, I will focus exclusively on the Raptors’ point guard, Kyle Lowry. To help predict Kyle’s performance, we must first look at his basic offensive statistics: FGA, AST, REB, etc. I will then break down each of these statistics into 14 distinct zones, viewable on the graphic below. This enables us to assess where Kyle’s tendencies for shooting, passing, driving, etc. come from. We then assess how efficient he is in these areas using more advanced statistics (eFG%, AST%, REB%, etc.). This information is critical, as it allows us to predict where Kyle will be situated on offensive possessions and how efficient he is in those areas. We do the same analysis on the defensive side, then move on to the player who will be battling Kyle for the majority of the game. With Houston as the example, he will be matched up with Chris Paul. After running the same statistical analysis for Chris Paul, we compare both point guards’ offensive and defensive results against one another. The point of this (the thing here though, Skip) is to find out which point guard is better on any given night. Once we’ve analyzed these players and their behaviours on either end of the floor, the result will be the foremost indicator of how they’ll perform in any given matchup.

(Image: the 14 court zones used in match-up based analysis)

Given that this is a preliminary analysis, there are many external factors that could bias the measurement. Some questions to consider further: What happens if teams double-team a player? What if a bench player is used more defensively to cover a starter? To answer these effectively, a more in-depth statistical analysis is needed. Once it is conducted, I will be able to analyze, with a degree of certainty, why a player is chosen to guard an opposing player on any given night, and the ripple effects of that choice. By accumulating vast quantities of data, applying this analysis strategy, and breaking each player down into one number, we are able to produce a result that takes everything into account. We'll be looking at this further throughout the year.

Intro to Defensive Statistics

Most people interested in basketball are familiar with the common box score defensive stats such as steals, blocks, and defensive rebounds, to name a few. Basing a player’s defensive strength on these metrics is not ideal in today’s game, and that leads us to look at more advanced statistics.

As a brief intro to these statistics, we will discuss defensive rating as well as defensive real plus-minus. Defensive rating (DRTG) measures the number of points per possession (it can also be measured per 100 possessions) that the opponent’s offense scores while a certain player is on the floor. As an example, if a player has a DRTG of 102, opponents tend to score 1.02 points on each possession. Only points that are scored as a result of the individual player’s defensive breakdowns are counted against him. Measuring per possession also removes factors like pace of play and minutes played per game. So in this case, the lower the number, the better. The main downside to this statistic is that it is difficult to determine why the defense was so good when 5 players were on the floor. For example, if players A and B play all of their minutes together and player B is the superior defender, player A will also look like a great defender. As a result, it’s very hard to isolate the defensive value of a single player on the court.
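To make the per-possession idea concrete, here is a simplified sketch of an on-floor defensive rating. It only divides opponent points by opponent possessions while a player is on the court, and it deliberately ignores any adjustment for which defender was at fault. The function names and numbers are hypothetical.

```python
def defensive_rating(opp_points: float, opp_possessions: float) -> float:
    """Opponent points allowed per 100 possessions while the player is on the floor."""
    if opp_possessions == 0:
        raise ValueError("no opponent possessions recorded")
    return 100 * opp_points / opp_possessions


def points_per_possession(drtg: float) -> float:
    """Convert a per-100-possession rating into points per single possession."""
    return drtg / 100


# Hypothetical on-floor totals: 1,020 opponent points over 1,000 possessions.
drtg = defensive_rating(opp_points=1020, opp_possessions=1000)
print(f"DRTG: {drtg:.0f}")
print(f"Points allowed per possession: {points_per_possession(drtg):.2f}")
```

Because the inputs are on-floor totals, two players who always share the court get identical ratings from this calculation, which is the limitation described above.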

The next defensive statistic is defensive real plus-minus (DRPM). It measures value in points per 100 possessions, much the same as DRTG, but it compares a player against an average player. A DRPM of +1.5 means you are worth 1.5 points per 100 possessions more than an average player in the league. Additionally, it uses models to strip out fluctuating variables like home court advantage in order to level the possession scoring information. Something DRPM does that DRTG does not is make good guesses at which of the 5 players deserve credit for good defense on each possession. Since there isn’t a lot to go on early in the season, DRPM takes time to accurately guess which players are good at defense, and it corrects itself as the season goes on. Ultimately, there isn’t an exact way to determine which player on the court is the best at defense, but DRPM uses some fancy math to make the most accurate guesses it can.

Statistics in sports, especially basketball, have become increasingly popular, and newer, better models will be introduced in the foreseeable future. These are just a few different measures of defensive statistics that teams are using more and more in order to pick lineups that match up better defensively against certain opponents.


Keep up to date with the Queen's Sports Analytics Organization. Like us on Facebook. Follow us on Twitter. For any questions or if you want to get in contact with us, email qsao@clubs.queensu.ca, or send us a message on Facebook.

Advanced Baseball Stats for Casual Baseball Fans by Anthony Turgelis

By Anthony Turgelis

We’ve all seen Moneyball. If you haven’t seen Moneyball, go see Moneyball, it’s on Netflix. The ‘Moneyball Revolution’ within baseball has shaken up the game, and changed the way that executives in baseball are looking at the game.

This will be an intro to some of the stats, metrics, and concepts that these executives are looking at. The goal here isn’t just to define what these things are, but rather to show how they can be used as tools of evaluation, to confirm the eye-test, or to just enhance the experience of the game. You might even end up sounding smart in front of your friends. When writing this article, I tried to include everything I wish I knew when first diving into the world of baseball analytics.

To avoid boring you with the history of how this Moneyball Revolution came to be, I’ll only drop one name that you should be familiar with - Bill James. Bill can be credited for being the pioneer of statistical analysis within baseball, as in the 1970s he was one of the first to publish this type of work that would be seen by a wide audience. Many people found his work fascinating, and attempted to replicate it, and - to make a long story short - after 30 years of this, the MLB finally took notice and the Moneyball Revolution began.

Concepts/Terms to Know:

The majority of these terms and concepts have been taken from Fangraphs, which is a site to find many advanced baseball stats and analysis. Links on where to find these concepts/stats will be provided.

Fielding Independent Pitching (FIP) - FIP is an adjusted Earned Run Average (ERA, or runs allowed by a pitcher excluding errors) metric that attempts to quantify what a pitcher’s value would be if the defense component of the game were stripped out. FIP assumes that all balls hit into play get league-average results on whether they fall for a hit or not. This way, a pitcher is not penalized for having a bad defense behind him, which certainly would affect his pitching results, and his ERA as a result. FIP is considered predictive as it has higher correlations across seasons than ERA, which makes sense considering it measures things that the pitcher can control and not things like defense, which can fluctuate by game and by season. It is adjusted so that the league-average FIP is the same as the league-average ERA. This is done so that it can be easily compared to a player’s ERA to see if they are over/under-performing their FIP, and whether the player may be due for any regression. There are cases of players who can consistently outperform their FIP numbers, such as Marco Estrada, who in 2015-16 was elite at inducing weak contact (which can be considered a skill), so FIP’s assumption of league-average results on balls in play would likely paint him as less effective than he actually is. On the other hand, his ERA did balloon to 4.98 in 2017 after significantly outperforming his FIP the previous two years, so the regression bug may have actually hit him as well.

FIP can be found on Fangraphs pitcher pages, such as Marco Estrada’s, next to ERA, where you will find his 2017 FIP to be 4.61.
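For the curious, FIP can be sketched with the formula Fangraphs publishes: ((13 * HR) + (3 * (BB + HBP)) - (2 * K)) / IP, plus a constant that lines league FIP up with league ERA. The formula is not written out in this article, and the function names and all of the totals below are hypothetical.

```python
def raw_fip(hr: float, bb: float, hbp: float, k: float, ip: float) -> float:
    """The part of FIP built only from homers, walks, hit batters and strikeouts."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip


def fip_constant(lg_era: float, lg_hr: float, lg_bb: float,
                 lg_hbp: float, lg_k: float, lg_ip: float) -> float:
    """Constant that makes league-average FIP equal league-average ERA."""
    return lg_era - raw_fip(lg_hr, lg_bb, lg_hbp, lg_k, lg_ip)


def fip(hr: float, bb: float, hbp: float, k: float, ip: float,
        constant: float) -> float:
    """Fielding Independent Pitching. Pass true fractional innings for ip."""
    return raw_fip(hr, bb, hbp, k, ip) + constant


# Hypothetical league totals and a hypothetical pitcher.
constant = fip_constant(lg_era=4.00, lg_hr=1000, lg_bb=3000,
                        lg_hbp=300, lg_k=8000, lg_ip=9000)
pitcher_fip = fip(hr=20, bb=50, hbp=5, k=200, ip=200, constant=constant)
print(f"FIP constant: {constant:.2f}")
print(f"Pitcher FIP: {pitcher_fip:.2f}")
```

Feeding the league totals back through `fip` returns the league ERA, which is the scaling property described above.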

Batting Average on Balls in Play (BABIP) - BABIP is a player’s batting average on only balls that were put into play, and the average is roughly .300 for both hitters and pitchers. The reason why this is a very important stat is that it tends to stabilize after 800 balls in play. This means that if a player is having a stretch of months (or even a whole year) where they are achieving a much higher/lower BABIP than league average and their career average, they are likely due for some regression, as they have likely been getting lucky/unlucky on the results of the balls they have put into play. It’s worth noting that better hitters will likely have higher BABIPs, and vice-versa, and some players are able to sustain high BABIPs throughout their career without regression. The Toronto Blue Jays hitters ranked dead last in the entire MLB in BABIP in 2017, which can be seen as a source of optimism that they may achieve better results on their balls in play in 2018.

BABIP can be found on Fangraphs pitcher/batter pages, such as fringe prospect Dwight Smith Jr’s. He rode a .588 BABIP in 2017 to a .370 batting average, which looks less impressive and likely luck-driven given that ridiculous BABIP. He still earned a demotion and will likely not get an early look to crack the 2018 team.
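BABIP itself is a short calculation. The sketch below uses the commonly published formula (H - HR) / (AB - K - HR + SF), which this article does not write out; the function name and the sample season line are hypothetical.

```python
def babip(hits: int, home_runs: int, at_bats: int,
          strikeouts: int, sac_flies: int) -> float:
    """Batting average on balls in play: (H - HR) / (AB - K - HR + SF)."""
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    if balls_in_play <= 0:
        raise ValueError("no balls in play")
    return (hits - home_runs) / balls_in_play


# Hypothetical season line: 150 hits, 20 home runs, 550 at-bats,
# 120 strikeouts and 5 sacrifice flies.
season_babip = babip(hits=150, home_runs=20, at_bats=550,
                     strikeouts=120, sac_flies=5)
print(f"BABIP: {season_babip:.3f}")
```

Home runs and strikeouts are removed because neither gives the defense a chance to make a play.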

Hit Probability - To temporarily stray from Fangraphs, Hit Probability is a metric that was introduced by Statcast at the beginning of the 2017 season to estimate the likelihood that a ball in play will be a hit, based on its launch angle and exit velocity compared to similarly hit balls in the past. Similarly to FIP, it attempts to negate the effects of defense and the ballpark on players who may have high-percentage hits robbed by star outfielders making unlikely plays, or who may be getting credit for many weak hits that likely would not be repeated. I did an analysis of how the 2017 Blue Jays were being affected by luck based on their hit probabilities, and throughout the season I saw players regress to what their averages were expected to be based on their Hit Probability numbers. The most extreme case was Devon Travis, who had a cold start despite high aggregated Hit Probability numbers and who, as the season progressed, climbed back to the expected level. The quarter season report can be found here, and the mid-season report can be found here.

Hit Probability statistics can be found on Baseball-Savant here, where you can select any game and see the hit probabilities for all balls in play for that game.

Weighted Runs Created + (wRC+) - wRC+ is an attempt to quantify a player’s total offensive output in one stat, based on the value of their contributions, after park adjustments. It uses the concept of Weighted On Base Average (wOBA), which simply gives the run value of each plate outcome. For example, it finds that a triple contributes roughly twice as much to run scoring as a single, so a triple would be worth double the value of a single in this calculation. After doing this, you can find out the value of runs created by each player’s offensive outputs. wRC+ is a rate statistic, so it can easily be used even in smaller samples to see how a hitter has been performing. It is one of the best tools to use when evaluating a hitter’s offensive abilities. The league average wRC+ is 100, and each point above 100 represents one percentage point above league average.

It can be found on the batter pages on Fangraphs, such as Mike Trout’s, who was the 2017 leader at 181 wRC+, beating Aaron Judge by 8 points even with 19 fewer home runs.

Park Adjustments - No Two Parks are The Same:

To state the obvious, no two MLB ballparks are the same. The most noticeable difference is the dimensions, but there are many other factors at play, such as weather and other environmental factors. As a result, there tend to be plenty of differences in player performance at different parks, and adjustments are calculated to reduce the effects of these parks as best as possible. They are typically calculated separately for left and right-handed batters, since parks are not always symmetrical and may favour batters from one side over the other.

Colorado’s Coors Field is regarded as the extreme case of a ‘Hitter’s Ballpark’: hitters generally perform well there due to the high altitude and large outfield, so batters can expect more balls in the outfield to fall for hits. Conversely, AT&T Park in San Francisco is regarded as the most extreme case of a ‘Pitcher’s Ballpark’ due to its high walls and damp air. Rogers Centre in Toronto is ranked as the 8th best ballpark for hitters. Four out of five ballparks in the AL East are considered to favour the hitter over the pitcher, so that could be one of the reasons why a team based in Toronto fails to attract premium free agent pitchers.

The War on WAR:

If you only have time to learn about one advanced stat in baseball, Wins Above Replacement (WAR) is the one to go with. WAR is an attempt to quantify the overall value of a player’s contributions into one easy number. It simply could be put as: The number of wins that you can expect your team to add while employing the player, compared to a different player that would be easily acquired from the minor-leagues or a team’s bench.

WAR is a counting stat and is based on what happened, rather than what will happen in the future. If an MVP-calibre player only played 20 games, they may have a lower WAR than many inferior players, due simply to the fact that they didn’t play enough games to accumulate a high WAR total.

Fangraphs goes into more detail on what exactly goes into the WAR stat for hitters, but essentially it is the total value of runs that a batter contributes to the team in the areas of hitting, baserunning, and fielding, divided by the number of runs it takes to add one win (Runs/Win generally fluctuates by year but is ~10). It is then adjusted by position (For example: CF is much harder to play than 1B, so they are credited accordingly - more here), adjusted by ballpark, and adjusted to consider the ‘Replacement Level’ player and how much more/less valuable the real player is than this imaginary player.
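As a very simplified sketch of that structure, the run components can be summed and divided by runs per win. This assumes each component is already expressed in runs, that batting runs are already park-adjusted, and that runs per win is the rough value of 10 mentioned above. The function name, parameter names, and sample values are hypothetical, and the real Fangraphs calculation has more moving parts.

```python
def position_player_war(
    batting_runs: float,
    baserunning_runs: float,
    fielding_runs: float,
    positional_runs: float,
    replacement_runs: float,
    runs_per_win: float = 10.0,
) -> float:
    """Sum a hitter's run contributions and convert the total into wins."""
    total_runs = (
        batting_runs
        + baserunning_runs
        + fielding_runs
        + positional_runs     # credit or debit for how hard the position is
        + replacement_runs    # gap between average and replacement level
    )
    return total_runs / runs_per_win


# Hypothetical full season from a strong all-around centre fielder.
war = position_player_war(batting_runs=30, baserunning_runs=3,
                          fielding_runs=5, positional_runs=2,
                          replacement_runs=20)
print(f"WAR: {war:.1f}")
```

Since every input accumulates with playing time, a player who appears in only 20 games cannot pile up many runs, which is why WAR behaves as a counting stat.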

For pitchers, it is much more complicated, so it’s best to outline the two different WAR stats that are most commonly referenced. First, there’s Fangraphs WAR, commonly referred to as fWAR. fWAR uses Fielding Independent Pitching (FIP) in its calculations, instead of ERA. Recall that FIP is generally regarded as a more predictive stat than ERA, so fWAR could be better used as a tool to project future pitching performance. Conversely, Baseball Reference uses ERA when calculating its bWAR stat. ERA is based on what has actually happened, and could be influenced by team defense among other external effects. These effects vary by game and are out of the pitcher's control, so bWAR should be seen as more of a ‘what happened in the past?’ stat, rather than a ‘what should I expect in the future?’ stat.

Conclusion

I hope that this article has given you an introduction to some tools to enhance your viewership of baseball. These tools were selected as stats that may challenge how the game is traditionally viewed. Players are often over/undervalued by fans, since traditional metrics such as batting average will never paint the full picture of their contributions. Hopefully the concepts learned today will allow you to form more complete opinions on players and teams while enjoying the games.

Keep up to date with the Queen's Sports Analytics Organization. Like us on Facebook. Follow us on Twitter. For any questions or if you want to get in contact with us, email qsao@clubs.queensu.ca, or send us a message on Facebook.