A look at each of the three playoff-bound Canadian NHL teams’ performances this season, and what their chances are heading into the playoffs.
By: Ryan Reid
How early is too early to get excited about a player’s or team’s success in a new season? Projecting Mikko Rantanen to score 130 points based on his pace through 20 games seems a bit ridiculous now (he is currently on pace for just over 100), but a 20-game sample for teams as a whole is often very predictive of whether or not they will ultimately make the playoffs. In fact, over the past 5 seasons, 77.5% of teams that found themselves in a playoff position at American Thanksgiving went on to make the playoffs.
Given how predictive holding a playoff spot at Thanksgiving already is, I believed that analyzing other statistics collected at American Thanksgiving each year would provide an even greater ability to predict which teams make the playoffs.
With the help of machine learning, I hoped to create a model that could out-predict the strategy of simply picking the teams currently in a playoff position.
In creating a machine learning model, I wanted to classify each team as a playoff team or not, given a variety of statistics collected at Thanksgiving. To do so, I used logistic regression to assign each team a binary label: 1 for a playoff team and 0 for a non-playoff team. By examining the past 11 years of team data from Thanksgiving (minus the lockout-shortened season, for obvious reasons) and classifying each team, I hoped to train my model to accurately classify playoff teams.
In Python, I used numpy, pandas, pickle, and several features of sklearn, including RFE (Recursive Feature Elimination) and Logistic Regression, to create the model. Pandas was used to import and read spreadsheets from Excel. Pickle was used to save my finalized model. Numpy was used in certain fit calculations. RFE was used to eliminate features and to assign coefficients describing the impact each criterion had on whether a team made the playoffs. Finally, Logistic Regression was used to fit the model itself.
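To make the division of labour between these libraries concrete, here is a minimal sketch of how RFE, Logistic Regression and pickle can fit together. It is not the code behind this article: the data is randomly generated, the feature names are hypothetical, and keeping eight features simply mirrors the final count reported in this article.

```python
import pickle

import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the Thanksgiving data: 300 team-seasons, 12 stats.
rng = np.random.default_rng(0)
feature_names = [f"stat_{i}" for i in range(12)]  # hypothetical names
X = rng.normal(size=(300, 12))

# Hypothetical label: 1 = playoff team, 0 = non-playoff team.
signal = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)
y = (signal > 0).astype(int)

# Recursively drop the weakest feature until eight remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8)
selector.fit(X, y)

kept = [name for name, keep in zip(feature_names, selector.support_) if keep]
coefficients = dict(zip(kept, selector.estimator_.coef_[0]))

# Round-trip through pickle, the same mechanism used to save a final model.
restored = pickle.loads(pickle.dumps(selector))
predictions = restored.predict(X)
```

The sign of each entry in `coefficients` shows whether a higher value of that stat pushes a team toward or away from the playoff class, which is the kind of reading applied to shooting percentage and save percentage further down.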
Starting with all the statistics I could collect for teams at Thanksgiving, I weeded out the less predictive variables until I landed on a group of 8. Using Recursive Feature Elimination (RFE), I was able to run the model repeatedly and see which variables were deemed most predictive and should be included. The factors listed below were deemed most predictive, in order of importance to the model.
While point percentage is the most predictive, other statistics like shooting percentage, save percentage, or goals for percentage provide a bigger picture perspective that allows for a better predictive capability for the machine learning model.
The model determined that higher shots for, shooting percentage, and save percentage each have a negative effect on whether or not a team ends up making the playoffs. For shooting percentage and save percentage, this is likely because the model has identified a PDO-like correlation in which teams with a lower save percentage and shooting percentage can be classified as “unlucky” and will eventually regress towards the norm. Additionally, the number of shots a team takes relative to the other team has a negative correlation with making the playoffs. This could be due to score effects that cause losing teams to typically generate more shots that are of lower quality. As the model shows, it is primarily high danger chances that are predictive of making the playoffs, not just any shot.
Running the model, 13 of the 16 teams in a playoff spot as of March 1st (81.25%) were correctly classified as playoff teams. Furthermore, an additional 2 teams (Columbus and Colorado) sat only 1 point back of a playoff spot. In contrast, picking the teams in a playoff position at Thanksgiving would have resulted in only a 68.75% success rate, or 11 out of 16 teams. Furthermore, 3 teams that were in a playoff position at Thanksgiving are no longer in the playoff race, compared with only 1 team (Buffalo) predicted by the model.
Particularly interesting decisions made by the machine learning model include the decision to not pick the Rangers to make the playoffs, despite leading the Metro at Thanksgiving, and the choice to select Vegas to make the playoffs despite a slow start.
One reason behind this choice could have been New York’s low number of ROW. With a mere 8 ROW in 22 games, the New York Rangers sat atop the Metropolitan Division due mainly to their 4-0 record in shootouts. Seeing that the New York Rangers were playing so many close games, the model likely discounted the strength of the Rangers. Additionally, the New York Rangers had the 4th-lowest Corsi for %, the 6th-lowest shots for %, and the 9th-lowest scoring chance for %. As for points for %, the Rangers were ranked at an underwhelming 13th in the league, but led the Metro since the Metro was a weak division and the Rangers had more games played. Given the Rangers’ low valuation across all these supporting criteria, the model predicted that they would not make the playoffs despite their stronger points for % at Thanksgiving.
As for the Golden Knights, despite holding the 29th-best point % in the league, Vegas was among the top 4 in the league in shots for %, Corsi for % and scoring chances for %. Additionally, Vegas had the league’s lowest PDO (SH% + SV%) at 95.66. With all of this considered, the model likely believed it was only a matter of time before the Vegas Golden Knights began winning.
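For readers unfamiliar with PDO, the calculation is simple enough to sketch. The function below follows the SH% + SV% definition given above on the 100-point scale; the goal and shot totals are invented for illustration and are not real team data.

```python
def pdo(goals_for, shots_for, goals_against, shots_against):
    """Return PDO on the 100-point scale: shooting % plus save %."""
    shooting_pct = 100 * goals_for / shots_for
    save_pct = 100 * (1 - goals_against / shots_against)
    return shooting_pct + save_pct


# Hypothetical totals: a team shooting 6% while its goalies stop about 88.9%.
cold_team = pdo(goals_for=30, shots_for=500, goals_against=50, shots_against=450)

# A team with league-neutral luck lands at exactly 100.
neutral_team = pdo(goals_for=40, shots_for=500, goals_against=40, shots_against=500)
```

A value well below 100, like the one for `cold_team`, is what marks a team as “unlucky” and due for regression.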
Flaws in the Model
While my machine learning model appears able to out-predict the strategy of picking all playoff teams at Thanksgiving, it has two main limitations: the inability of the machine to pick teams based on the given playoff format, and the lack of data at various game states.
Unaware of the NHL’s current playoff format, the model picked 9 Eastern Conference teams, and only 7 Western Conference teams. Without a grasp on the alignment of divisions within the league, the model is at a disadvantage when picking teams, particularly when specific divisions or conferences are more “stacked” than others. Therefore, there is the potential of the model picking an otherwise impossible selection of teams to make the playoffs.
Furthermore, data collected to be fed into the model was only even-strength data. While this provides a decent picture of a team’s capability, certain teams that rely on their power play, as the Penguins traditionally have, may be disadvantaged and discounted. Finding a way to incorporate this data into the model would likely provide a fuller picture and a more accurate prediction.
While the model I have created is by no means perfect, it provides a unique perspective into not only the importance of the first 20 or so games of the season, but also which statistics beyond wins are important in attempting to classify a playoff team. While the model appears to out-predict the strategy of selecting all playoff teams at Thanksgiving, it will be interesting to see in years to come whether it can continue to classify playoff teams given Thanksgiving stats.
***All statistics gathered from Natural Stat Trick
Sometimes when I watch hockey on television, the broadcast will display a stat that makes me cringe. One of my (least) favourites is a stat like the one displayed just under the score in the screenshot below:
Most of us have noticed these stats on broadcasts before. I imagine they are common because they match the game state (i.e. the Leafs are leading after the first period), so broadcasters probably believe we find them insightful. However, we are all smart enough to understand that teams should theoretically have a better record in games that saw them outscore their opponents in the first period. In this case, whatever amount of insight the broadcasters believe they are providing us with is merely an illusion. Perhaps they also saw value in the fact that the Leafs were undefeated in those 13 games, but that is not what I want to focus on today.
More generally, my primary objective for this post is to shed light on the context behind this type of stat, mostly because broadcasts rarely provide it for us. Ultimately, I will examine 11 seasons’ worth of data to understand how the outcome of a specific period affects the number of standings points a team should expect to earn in that game. Yes, this means there will be binning*. And yes, I acknowledge that binning is almost always an inappropriate approach in any meaningful statistical analysis. The catch here is that broadcasters continue to display these binned stats without any context, and I believe it is important to understand the context of a stat we see on television many times each season.
* Binning is essentially dividing a continuous variable into subgroups of arbitrary size called “bins.” In this case, we are dividing a 60-minute hockey game into three 20-minute periods.
A particular team wins a period by scoring more goals than their opponent. I looked at which teams won, lost, or tied each period by running some Python code through a data set provided by moneypuck.com. The data includes 13057 regular season games between the 2007-2008 and 2017-2018 seasons, inclusive. (Full disclosure: I’m pretty sure four games are missing here. My attempts to figure out why were unsuccessful, but I went ahead with this article because the rest of my code is correct, and 4 games out of over 13K is virtually insignificant anyways). The table below displays our sample sizes over those eleven seasons:
Remember that when the home team loses, the away team wins, so the table with our results will be twice as large as the table above. I split the data into home and away teams because of home-ice advantage: home teams win more games than the visitors, which suggests that home teams win specific periods more often too. We can see this is true in the table shown above. In period 1, for example, the home team won 4585 times and lost only 3822 times. The remaining 4650 games saw first periods that ended in ties.
We want to know the average number of standings points the home team earned in games after winning, tying, or losing period 1. This will give us three values: one average for each outcome of the first period. We also want to find the same information for the away team, giving us a total of six different values for period 1. (This step is not redundant because of the “Pity Point” system, which awards one point to the losing team if they lost in overtime or the shootout. The implication is that some games result in two standings points but others end in three, so knowing which team won the game still does not tell us exactly how many points the losing team earned). Repeating this process for periods 2 and 3 brings our total to 18 different values. The results are shown below:
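The grouping described above can be sketched in a few lines of Python. This is an illustration only, not the script that was run on the moneypuck.com data: the game records, field names and helper functions are all hypothetical, and the three games are made up.

```python
from collections import defaultdict


def period_outcome(goals_for, goals_against):
    """Label one period from a single team's point of view."""
    if goals_for > goals_against:
        return "win"
    if goals_for < goals_against:
        return "loss"
    return "tie"


def standings_points(won_game, extra_time):
    """2 for any win, 1 for an overtime or shootout loss, 0 otherwise."""
    if won_game:
        return 2
    return 1 if extra_time else 0


def average_points(games, period):
    """Average standings points keyed by (side, outcome of the given period)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [points earned, games played]
    for game in games:
        home_goals, away_goals = game["period_goals"][period - 1]
        views = (
            ("home", home_goals, away_goals, game["home_won"]),
            ("away", away_goals, home_goals, not game["home_won"]),
        )
        for side, goals_for, goals_against, won in views:
            key = (side, period_outcome(goals_for, goals_against))
            totals[key][0] += standings_points(won, game["extra_time"])
            totals[key][1] += 1
    return {key: points / count for key, (points, count) in totals.items()}


# Three made-up games; period_goals holds (home, away) goals for periods 1-3.
games = [
    {"period_goals": [(1, 0), (0, 0), (1, 1)], "home_won": True, "extra_time": False},
    {"period_goals": [(2, 1), (0, 1), (0, 0)], "home_won": False, "extra_time": True},
    {"period_goals": [(0, 1), (1, 0), (0, 0)], "home_won": True, "extra_time": True},
]
first_period = average_points(games, period=1)
```

Calling `average_points` once per period yields the six values per period, and 18 in total, described above. Note how the second game gives the home team a point despite the loss, which is exactly why the home and away averages have to be computed separately.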
The first entry in the table (i.e. the top left cell) tells us that when home teams win period 1, they end up earning an average of 1.65 points in the standings. We saw earlier that the home team has won the first period 4585 times, and now we know that they typically earn 1.65 points in the standings from those specific games. But if we ignore the outcome of each period, and focus instead on the outcomes of all 13057 games in our sample, we find that the average team earns 1.21 points in the standings when playing at home. (This number is from the sentence below the table; the two values there suggest the average NHL team finishes an 82-game season with around 91.43 points, which makes sense). So, we know that home teams win an average of 1.21 points in general, but if they win the first period they typically earn 1.65 points. In other words, they jumped from an expected points percentage of 60.5% to 82.5%. That is a significant increase.
However, in those 4585 games, the away team lost the first period because they were outscored by the home team. It is safe to say that the away team experienced a similar change, but in the opposite direction. Indeed, their expected gain decreased from 1.02 points (a general away game) to 0.54 points (the condition of losing period 1 on the road). Every time your favourite team is playing a road game and loses period 1, they are on track to earn 0.48 fewer standings points than when the game started; that is equivalent to dropping from a points percentage of 51% to 27%. Losing period 1 on the road is quite damaging, indeed.
Another point of interest in these results, albeit an unsurprising one, is the presence of home-ice advantage in all scenarios. Regardless of how a specific period unfolds, the home team is always better off than the away team would be in the same situation.
I also illustrated these results in Tableau for those of you who are visual learners. The data is exactly the same as in the results table, but now it’s illustrated relative to the appropriate benchmark (1.21 points for home teams and 1.02 points for away teams).
Now, let’s reconsider the original stat for a moment. We know that when the Leafs won the first period, they won all 13 of those games. Clearly, they earned 26 points in the standings from those games alone. How many points would the average team have earned under the same conditions? While the broadcast did not specify which games were home or away, let’s assume just for fun that 7 of them were at home, and 6 were on the road. So, if the average team won 7 home games and 6 away games, and also happened to win the first period every time, they would have: 7(1.65) + 6(1.53) = 20.73 standings points. Considering that the Leafs earned 26, we can see they are about 5 points ahead of the average team in this regard. Alternatively, we can be nice and allow our theoretical “average team” to have home-ice advantage in all 13 games. This would bump them up to 13(1.65) = 21.45 points, which is still a fair amount below the Leafs’ 26 points.
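The benchmark arithmetic above generalizes to any home/away split. The small helper below is a sketch for readers who want to try other splits; the function name is my own, and the 1.65 and 1.53 averages are the values quoted in the paragraph above.

```python
def benchmark_points(home_games, away_games, home_avg, away_avg):
    """Expected standings points for an average team over a set of games."""
    return home_games * home_avg + away_games * away_avg


# Average points after winning period 1, as quoted in the article.
HOME_AFTER_P1_WIN = 1.65
AWAY_AFTER_P1_WIN = 1.53

assumed_split = benchmark_points(7, 6, HOME_AFTER_P1_WIN, AWAY_AFTER_P1_WIN)
all_at_home = benchmark_points(13, 0, HOME_AFTER_P1_WIN, AWAY_AFTER_P1_WIN)

leafs_points = 26
gap_vs_average = leafs_points - assumed_split  # a little over 5 points
```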
One issue with this approach is that weighted averages like the ones I found do not effectively illustrate the distribution of possible outcomes. All of us know it is impossible to earn precisely 1.65 points in the standings; the outcome is either 0, 1, or 2. An alternative approach involves measuring the likelihood of a team coming away with 2 points, 13 times in a row, given that all 13 games were played at home and that they won the first period every time. We know the average is 13(1.65) = 21.45 standings points, but how likely is that? It took a little extra work, but I calculated that the average team would have only a 3.86% chance to earn all 26 points available in those games. (I did this by finding the conditional probability of winning a specific game after winning the first period at home, and then raising that number to the 13th power). Although the probability for the Leafs is a touch lower than this, since there is a good chance a bunch of those 13 games were not played at home, you should not allow such a low probability to shock you; 13 games is a small sample, especially for measuring goals. There is definitely lots of luck mixed in there.
This brings us back to my original anecdote about cringing whenever I encounter this type of stat. Even if we acknowledge its fundamental flaw (scoring goals leads to wins, no matter when those goals occur in a game), the stat is virtually meaningless in a small sample. Goals are simply too rare to provide us with much insight in a sample of 13 games. Nevertheless, broadcasters will continue displaying these numbers without context. This article will not change that. So, the next time it happens, you can now compare that team to league average over the past eleven seasons. Even if the stat is not shown on television, all you need to know is the outcome of a specific period to find out how the average team has historically performed under the same condition. At the very least, we have a piece of context that we did not have before.
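The streak calculation is a product of independent game probabilities, which can be sketched as follows. The per-game win probability is not stated in this article, so the sketch works backwards from the 3.86% figure; the implied value of roughly 0.78 is my own inference, and treating the 13 games as independent is an assumption.

```python
def streak_probability(win_probability, games):
    """Chance of winning every game, treating the games as independent."""
    return win_probability ** games


# Work backwards from the 3.86% quoted above to the per-game probability
# it implies. This implied value is an inference, not a reported number.
reported_streak = 0.0386
implied_single_game = reported_streak ** (1 / 13)

check = streak_probability(implied_single_game, 13)
```

Even a team that wins about 78% of the time in this situation would be expected to complete a 13-game sweep fewer than one time in twenty-five.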
By: Owen Kewell
The following is a detailed explanation of the work done to produce my public player comparison data visualization tool. If you wish to see the visualization in action it can be found at the following link, but I wholeheartedly encourage you to continue reading to understand exactly what you’re looking at:
NHL players are in direct competition with hundreds of their peers. The game-after-game grind of professional hockey tests these individuals on their ability to both generate and suppress offense. As a player, it’s almost guaranteed that some of your competitors will be better than you on one or both sides of the puck. Similarly, you’re likely to be better than plenty of others. It’s also likely that there are a handful of players league-wide whose talent levels are right around your own.
The NHL is a big league. In the 2017-18 season, 759 different skaters suited up for at least 10 games, including 492 forwards and 267 defensemen. In such a deep league, each player should be statistically similar to at least a handful of their peers. But how to find these league-wide comparables?
Enter a bit of helpful data science. Thanks to something called Euclidean distance, we can systematically identify a player’s closest comparables around the league. Let’s start with a look at Anze Kopitar.
The above graphic is a screenshot of my visualization tool.
With the single input of a player’s name, the tool displays the NHL players who represent the five closest offensive and defensive comparables. It also shows an estimate of the strength of this relationship in the form of a similarity percentage.
The visualization is intuitive to read. Kopitar’s closest offensive comparable is Voracek, followed by Backstrom, Kane, Granlund and Bailey. His closest defensive comparables are Couturier, Frolik, Backlund, Wheeler, and Jordan Staal. All relevant similarity percentages are included as well.
The skeptics among you might be asking where these results come from. Great question.
A Brief Word on Distance
The idea of distance, specifically Euclidean distance, is crucial to the analysis that I’ve done. Euclidean distance is a fancy name for the length of the straight line that connects two different points of data. You may not have known it, but it’s possible that you used Euclidean distance during high school math to find the distance between two points in (X,Y) Cartesian space.
Now think of any two points existing in three-dimensional space. If we know the details of these points then we’re able to calculate the length of the theoretical line that would connect them, or their Euclidean distance. Essentially, we can measure how close the data points are to each other.
Thanks to the power of mathematics, we’re not constrained to using data points with three or fewer dimensions. Despite being unable to picture the higher dimensions, we've developed techniques for measuring distance even as we increase the complexity of the input data.
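For the curious, the calculation looks the same no matter how many dimensions are involved: square the difference along each dimension, add the squares up, and take the square root. The function below is a minimal sketch of that idea, not the code behind the tool.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points with matching dimensions."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# The familiar high-school case: a 3-4-5 right triangle in (X, Y) space.
flat = euclidean_distance((0, 0), (3, 4))

# The same function handles four dimensions, which we cannot picture.
higher = euclidean_distance((0, 0, 0, 0), (1, 1, 1, 1))
```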
Applying Distance to Hockey
Hockey is excellent at producing complex data points. Each NHL game produces an abundance of data for all players involved. This data can, in turn, be used to construct a robust statistical profile for each player.
As you might have guessed, we can calculate the distance between any two of these players. A relatively short distance between a pair would tell us that the players are similar, while a relatively long distance would indicate that they are not similar at all. We can use these distance measures to identify meaningful player comparables, thereby answering our original question.
I set out to do this for the NHL in its current state.
First, I had to determine which player statistics to include in my analysis. Fortunately, the excellent Rob Vollman publishes a data set on his website that features hundreds of statistics combed from multiple sources, including Corsica Hockey (http://corsica.hockey/), Natural Stat Trick (https://naturalstattrick.com) and NHL.com. The downloadable data set can be found here: http://www.hockeyabstract.com/testimonials. From this set, I identified the statistics that I considered to be most important in measuring a player’s offensive and defensive impacts. Let’s talk about offense first.
I decided to base offensive similarity on the above 27 statistics. I’ve grouped them into five categories for illustrative purposes. The profile includes 15 even-strength stats, 7 power-play stats, and 3 short-handed stats, plus 2 qualifiers. This 15-7-3 distribution across game states reflects my view of the relative importance of each state in assessing offensive competence. Thanks to the scope of these statistical measures, we can construct a sophisticated profile for each player detailing exactly how they produce offense. I consider this offensive sophistication to be a strength of the model.
While most of the above statistics should be self-explanatory, some clarification is needed for others. ‘Pass’ is an estimate of a player’s passes that lead to a teammate’s shot attempt. ‘IPP%’ is short for ‘Individual Points Percentage’, which refers to the proportion of a team’s goals scored with a player on the ice where that player registers a point. Most stats are expressed as /60 rates to provide more meaningful comparisons.
You might have noticed that I double-counted production at even-strength by including both raw scoring counts and their /60 equivalent. This was done intentionally to give more weight to offensive production, as I believe these metrics to be more important than most, if not all, of the other statistics that I included. I wanted my model to reflect this belief. Double-counting provides a practical way to accomplish this without skewing the model’s results too heavily, as production statistics still represent less than 40% of the model’s input data.
Now, let's look at defense.
Defensive statistical profiles were built using the above 19 statistics. This includes 15 even-strength stats, 2 short-handed stats, and the same 2 qualifiers. Once again, even-strength defensive results are given greater weight than their special teams equivalents.
Sadly, hockey remains limited in its ability to produce statistical measurements of individual defensive talent. It’s hard to quantify events that don’t happen, and even harder to properly identify the individuals responsible for the lack of these events. Despite this, we still have access to a number of useful statistics. We can measure the rates at which opposing players record offensive events, such as shot attempts and scoring chances. We can also examine expected goals against, which gives us a sense of a player’s ability to suppress quality scoring chances. Additionally, we can measure the rates at which a player records defense-focused micro-events like shot blocks and giveaways. The defensive profile built by combining these stats is less sophisticated than its offensive counterpart due to the limited scope of its components, but the profile remains at least somewhat useful for comparison purposes.
For every NHLer to play 10 or more games in 2017-18, I took a weighted average of their statistics across the past two seasons. I decided to weight the 2017-18 season at 60% and the 2016-17 season at 40%. If the player did not play in 2016-17, then their 2017-18 statistics were given a weight of 100%. These weights represent a subjective choice made to increase the relative importance of the data set’s more recent season.
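The 60/40 blend is straightforward to express in code. The sketch below assumes each season is stored as a dictionary of stat names to values; that storage format and the stat name are hypothetical, while the weights are the ones described above.

```python
def blend_seasons(current, previous=None, current_weight=0.6):
    """Weight the newer season at 60% and the older at 40%.

    If the player has no previous season, the current one counts in full.
    """
    if previous is None:
        return dict(current)
    return {
        stat: current_weight * value + (1 - current_weight) * previous[stat]
        for stat, value in current.items()
    }


# Hypothetical values for a single stat.
veteran = blend_seasons({"points_per_60": 2.0}, {"points_per_60": 1.0})
rookie = blend_seasons({"points_per_60": 2.0})
```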
Having taken this weighted average, I constructed two data sets: one for offense and the other for defense. I imported these spreadsheets into Pandas, which is a Python package designed to perform data science tasks. I then faced a dilemma. Distance is a raw quantitative measure and is therefore sensitive to its data’s magnitude. For example, the number of ‘Games Played’ ranges from 10-82, but Individual Points Percentage (IPP%) maxes out at 1. This magnitude issue would skew distance calculations unless properly accounted for.
To solve this problem, I proportionally scaled all data to range from 0 to 1. 0 would be given to the player who achieved the stat’s lowest rate league-wide, and 1 to the player who achieved the highest. A player whose stat was exactly halfway between the two extremes would be given 0.5, and so on. This exercise in standardization resulted in the model giving equal consideration to each of its input statistics, which was the desired outcome.
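This kind of scaling is often called min-max scaling. A minimal sketch of it for a single stat is shown below; the handling of a stat where every player has the same value is my own assumption, since that case is not discussed above.

```python
def min_max_scale(values):
    """Rescale a list of numbers so the lowest maps to 0 and the highest to 1."""
    low, high = min(values), max(values)
    if high == low:
        # Assumption: a stat with no spread carries no information, so use 0.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# Games played for three hypothetical players: fewest, midpoint, most.
scaled_games_played = min_max_scale([10, 46, 82])
```

After scaling, ‘Games Played’ and IPP% both live on the same 0-to-1 range, so neither dominates the distance calculation.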
I then wrote and executed code that calculated the distance between a given player and all others around the league who share their position. This distance list was then sorted to identify the other players who were closest, and therefore most comparable, to the original input player. This was done for both offensive and defensive similarity, and then repeated for all NHL players.
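The search itself can be sketched as a sort over distances. The version below is an illustration under assumptions rather than the code I ran: the player names, two-stat profiles and dictionary layout are all hypothetical, and it relies on `math.dist` from the Python standard library (3.8 or newer) for the Euclidean distance.

```python
import math


def closest_comparables(target, profiles, positions, count=5):
    """Return the `count` nearest same-position players as (distance, name)."""
    distances = []
    for name, profile in profiles.items():
        if name == target or positions[name] != positions[target]:
            continue  # skip the player themselves and other positions
        distances.append((math.dist(profiles[target], profile), name))
    return sorted(distances)[:count]


# Hypothetical scaled profiles with only two stats each.
profiles = {
    "Forward A": (0.90, 0.80),
    "Forward B": (0.85, 0.75),
    "Forward C": (0.20, 0.10),
    "Defenseman D": (0.90, 0.80),
}
positions = {
    "Forward A": "F",
    "Forward B": "F",
    "Forward C": "F",
    "Defenseman D": "D",
}
nearest = closest_comparables("Forward A", profiles, positions, count=2)
```

Here `Defenseman D` has an identical profile to `Forward A` but is excluded because the comparison only runs within a position group.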
This process generated a list of offensive and defensive comparables for every player in the league. I consider these lists to be the true value, and certainly the main attraction, of my visualization tool.
Not satisfied with simply displaying the list of comparable players, I wanted to contextualize the distance calculations by transforming them into a measure that was more intuitively meaningful and easier to communicate. To do this, I created a similarity percent measure with a simple formula.
In the above formula, A is the input player, B is their comparable that we’re examining, and C is the player least similar to A league-wide. For example, if A->B were to have a distance of 1 and A->C a distance of 5, then the A->B similarity would be 1 - (1/5), or 80%. Similarity percentages in the final visualization were calculated using this methodology and provide an estimate of the degree to which two players are comparable.
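Reading the worked example back into a formula, similarity appears to be 1 minus the ratio of the A-to-B distance over the A-to-C distance. The sketch below encodes that reading; the function and argument names are my own.

```python
def similarity(distance_to_comparable, distance_to_least_similar):
    """Similarity share: 1 minus the ratio of the two distances."""
    return 1 - distance_to_comparable / distance_to_least_similar


# The worked example above: A->B distance of 1, A->C distance of 5.
example = similarity(1, 5)

# By construction, the least similar player scores 0 and an identical one 1.
floor = similarity(5, 5)
ceiling = similarity(0, 5)
```

One consequence of this design is that each percentage is relative to the input player’s own least similar peer, so the same raw distance can translate into different similarity percentages for different input players.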
While I wholeheartedly believe that this tool is useful, it is far from perfect. Due to a lack of statistics that measure individual defensive events, the accuracy of defensive comparisons remains the largest limitation. I hope that the arrival of tracking data facilitates our ability to measure pass interceptions, gap control, lane coverage, forced errors, and other individual defensive micro-events. Until we have this data, however, we must rely on rates that track on-ice suppression of the opposing team’s offense. On-ice statistics tend to be similar for players who play together often, which causes the model to overstate defensive similarity between common linemates. For example, Josh Bailey rates as John Tavares’ closest defensive comparable, which doesn’t really pass the sniff test. For this reason, I believe that the offensive comparisons are more relevant and meaningful than their defensive counterparts.
This tool’s primary use is to provide a league-wide talent barometer. Personally, I enjoy using the visualization tool to assess relative value of players involved in trades and contract signings around the league. Lists of comparable players give us a common frame through which we can inform our understanding of an individual's hockey abilities. Plus, they’re fun. Everyone loves comparables.
The results are not meant to advise, but rather to entertain. The visualization represents little more than a point-in-time snapshot of a player’s standing around the league. As soon as the 2018-19 season begins, the tool will lose relevance until I re-run the model with data from the new season. Additionally, I should explicitly mention that the tool does not have any known predictive properties.
If you have any questions or comments about this or any of my other work, please feel free to reach out to me. Twitter (@owenkewell) will be my primary platform for releasing all future analytics and visualization work, and so I encourage you to stay up to date with me through this medium.
NHL seasons follow a formulaic plotline.
Entering training camp, teams share a common goal: win the Stanley Cup. The gruelling 82-game regular season separates those with legitimate title hopes from those whose rosters are insufficient, leaving only the sixteen most eligible teams. The attrition of playoff hockey gradually whittles down this number until a single champion emerges victorious, battle-tested from the path they took to win hockey’s top prize. Two months off, then we do it all again.
Teams that have won the Stanley Cup share certain traits. Anecdotally, it’s been helpful to have a dominant 1st line centre akin to Sidney Crosby, Jonathan Toews or Anze Kopitar. Elite puck-moving defensemen don’t hurt either, nor does a hot goalie. Delving deeper, though, what do championship teams have in common?
I decided to answer this question systematically with the help of some machine learning.
Some Background on Classification
Classification is a popular branch of supervised machine learning where one attempts to create a model capable of making predictions on new data points. We do this by building up, or ‘training’, the model using historical data, explicitly telling the model whether each past data point achieved the target class that we’re trying to predict. In the context of hockey, this data point could be some number of team statistics produced by the 2015 Chicago Blackhawks. The target here would be whether they won the Stanley Cup, which they did.
Sufficiently robust classification models can identify a number of statistical trends that underpin the phenomenon that they’re observing. The models can then learn from these trends to make reasonably intelligent predictions on the outcome of future data points by comparing them to the data that the classifier has already seen.
Building a Hockey Classifier
We can apply these techniques to hockey. We have the tools to train a model to learn which team statistics are most predictive of playoff success. To do this, we must first decide which stats to include in our dataset. To create the most intelligent classifier, we decided to include as many meaningful team statistics as possible. Here’s what we came up with:
It’s worth noting that we engineered the ‘Div Avg Point’ feature by calculating the average number of points contained by all teams in a given team’s division. The remaining statistics were sourced from Corsica and Natural Stat Trick. An explanation of each of these stats can be found on the glossaries for the two websites.
Our dataset included 210 data points: 30 teams per season over the 7 seasons between 2010-11 and 2016-17. Each data point included team name, the above 53 team stats, and a binary variable to indicate whether the team in question won the Cup. Using this data, we trained nine different models to recognize the statistical commonalities between the 7 teams whose seasons ended with a Stanley Cup championship. The best-performing model was a Logistic Regression model trained on even-strength data, and so all further analysis was conducted using this model.
Results: Team Stats that Matter Most
To evaluate which team stats were most strongly linked to winning a Cup, we created a z-score standardized version of our team data. We then calculated the estimated coefficients that our logistic regression model assigned to each team stat. The size of these coefficients indicates the relative importance of different team stats in predicting Stanley Cup champions. The five highest-ranking team stats can be seen below:
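The two steps in this section, standardizing and then ranking coefficients, can be sketched with the standard library alone. The coefficient values below are invented purely for illustration, and ranking by absolute value is an assumption about how “size” was measured.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(value - mu) / sigma for value in values]


def rank_by_magnitude(coefficients):
    """Order stat names from largest to smallest absolute coefficient."""
    return sorted(coefficients, key=lambda stat: abs(coefficients[stat]), reverse=True)


# Hypothetical GF/60 values for three teams.
standardized = z_scores([2.0, 2.5, 3.0])

# Hypothetical coefficients, not the fitted values from our model.
ranking = rank_by_magnitude({"GF/60": 0.9, "GA/60": -0.1, "HDCF": 0.7})
```

Standardizing first matters because it puts every stat on the same scale, which is what makes the coefficients comparable to one another.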
Of all team statistics, ‘Goals For Per 60 Minutes’, or GF/60, is most predictive of winning a Stanley Cup. Of the 7 champions in the dataset, 4 ranked within the top 5 league-wide in GF/60 in their respective season, with 2016-17 Pittsburgh most notably leading the league in the statistic. Impressive results in ‘High Danger Chances For’ and ‘Team Wins’ both strongly correlate to playoff success, while ‘Scoring Chance For Percentage’ and ‘Shots on Goal For Percentage’ round out the top 5.
What Does It Mean?
Generating a list of commonalities among past champions allows us to comment on what factors impact a team’s likelihood of going all the way. Most apparent is the importance of offense. It is more important to generate goals and high-danger chances than it is to prevent them, as GA/60 and HDCA rank 36th and 13th among all statistics, respectively, while their offensive counterparts, GF/60 and HDCF, rank 1st and 2nd. In the playoffs, the best team offense tends to trump the best team defense, which we saw anecdotally in last year’s Pittsburgh vs. Nashville Final. If you want to win a Stanley Cup, the best defense is a good offense.
We can see that a team’s ability to generate scoring chances, both high-danger and otherwise, is more predictive of playoff success than their ability to generate shots. Although hockey analytics pioneers championed the use of shot metrics as a proxy for puck possession, recent industry sentiment has shifted towards the belief that shot quality matters more than shot volume. The thinking here, which is supported by the above results, is that not all shots have an equal chance of beating a goalie, and so it is more important to generate a shot with a high chance of going in than it is to generate a shot of any kind. Between a team who can consistently out-chance opponents and a team who can consistently out-shoot opponents, the former is more likely to win a hockey game, and therefore playoff series.
Application: The 2017-18 Season
A predictive model isn’t very helpful unless it can make predictions. So let’s make some predictions.
By feeding our model the team stats produced by the recently-completed 2017-18 regular season, we can output predictions of each team’s likelihood of winning the 2018 Stanley Cup. Since this is the fun part, let’s get right to the probability estimates for all 31 NHL teams:
The rankings above essentially indicate how similar each team’s season was to the regular seasons of teams that went on to win it all. In doing so, they aim to identify the teams most likely to replicate this success. The model favours the Boston Bruins to win the 2018 Stanley Cup, predicting a victory over the Nashville Predators in the Final.
The above data highlights a few curiosities. Notably, we can see that some non-playoff teams had 5-on-5 numbers that were relatively comparable to past Cup champions. Specifically, the Blues, Stars, and Flames played 5-on-5 hockey well enough this season to qualify for the playoffs. The Blues and Flames can attribute their disappointingly long off-seasons to the 30th and 29th-ranked power plays, respectively. The Stars’ implosion is more of a statistical anomaly, and while conducting an autopsy would be interesting it would be better served as a subject for another article.
The lowest-ranked teams to have made the playoffs in the real world are the New Jersey Devils and the Washington Capitals. While their offensive star power might have been enough to get these squads to the dance, the model predicts a quick exit for them both.
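For readers curious how a fitted logistic regression turns a season’s stats into probability estimates like those discussed in this section, here is a minimal Python sketch. The weights, intercept, feature names, and team values are all hypothetical; none of them come from the actual model.

```python
import math

def win_probability(features, weights, intercept):
    """Logistic-regression probability: the sigmoid of the weighted feature sum."""
    score = intercept + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

def rank_teams(team_features, weights, intercept):
    """Return (team, probability) pairs sorted from most to least likely."""
    probs = {
        team: win_probability(feats, weights, intercept)
        for team, feats in team_features.items()
    }
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical standardized stats and model parameters, for illustration only.
weights = {"gf60_z": 1.0, "hdcf_z": 0.5}
intercept = -3.0
teams = {
    "Team A": {"gf60_z": 1.0, "hdcf_z": 1.0},
    "Team B": {"gf60_z": -1.0, "hdcf_z": 0.0},
}
for team, p in rank_teams(teams, weights, intercept):
    print(team, round(p, 3))
```

Because each team is scored independently, raw probabilities produced this way will not generally sum to 1 across the league. They are best read as a ranking of similarity to past champions, or rescaled if a true league-wide distribution is wanted.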
A Computer-Generated Bracket:
For fun, I’ve filled out the above bracket using the class probability rankings generated by our model. Of the 8 teams who have won or are winning their first-round playoff series, the model picked 7 of them as the winner, with Philadelphia being the exception. While it’s far too early to comment on the model’s accuracy, as only a single playoff series has been completed, it’s an encouraging start.
Limitations of the Analysis
The above results must be considered in the appropriate context. The model was trained and tested using only 5-on-5 data, which would explain the lack of love for teams with strong special teams like Pittsburgh and Toronto. The model is also blind to the NHL’s playoff format. Due to the NHL’s decision to have teams play against their divisional foes during the first two playoff rounds, teams in strong divisions have a much harder road to winning a Cup. Consider that Minnesota’s path to the conference final would likely involve Winnipeg and Nashville in the first two rounds, who finished 2nd and 1st in NHL standings in the regular season. Divisional difficulty is not reflected in the probabilities listed above, though incorporating divisional difficulty either probabilistically or through a strength of schedule modifier could be areas of further analysis.
A final limitation of the model is that it is trained using only 7 champions. In an ideal world, we would have access to dozens or hundreds of Stanley Cup positive instances, but due to the nature of the game there can only be one champion per year. We considered extending the dataset backwards past 2011 but ultimately decided against doing so. The NHL is different today than it was in the past. Training a model on a champion from 2000 tells us little about what it takes to have success in 2018. Using 2010-11 onwards represented a happy medium in the trade-off between data relevance and quantity.
Winning a Stanley Cup remains an inexact science. While it’s valuable to identify trends among past winners, there is no guarantee that what’s worked before will work again. It’s a game of educated guesses.
I believe that the most legitimate way to build a Stanley Cup winner is a combination of the past and the future. Analyzing historical data to identify team traits that are predictive of a championship is half the battle. The rest is anticipating what the future of the NHL will look like. The champions of the next few years will be led by managers who are best able to identify what it’ll take to win in the modern NHL. While the above framework approaches the first half in a systematic way, the latter remains much harder to crystallize.
In the meantime, let’s turn to what’s in front of our eyes. The playoffs have been tremendously entertaining thus far, and that’ll only pick up as teams are threatened by elimination. Let’s enjoy some playoff hockey. Let’s see which playing styles, tactics, and matchups seem to work. Let’s learn.
Even if your team gets eliminated, just remember that this season’s playoffs are just a couple months away from being data points to train next season’s model.
Then we do it all again.
By: Kurt Schulthies
Monday May 13, 2013:
The city of Toronto was electric. Competing in the Stanley Cup Playoffs for the first time in 12 seasons, the Toronto Maple Leafs inched their way to Game 7 against the heavily favoured Boston Bruins, continuing an improbable run led by Phil Kessel, Nazem Kadri, James Van Riemsdyk, Cody Franson, Dion Phaneuf, and James Reimer.
I was with a dozen of my closest friends, sitting at the head of the table in a Shoeless Joe’s party room. Every detail of that night is vivid in my mind -- for what was about to come can only be described as demoralizing. The Leafs held a 3 goal lead with less than 11 minutes to go in regulation time.
The lead evaporated. The Bruins’ eventual overtime winner became an inevitability.
Without a word, I immediately got up from my seat and stormed out of the bar. I glanced over at the patrons -- and to this day, I have never seen so many people simultaneously unsure how to react.
Toronto is a dramatically different team. Now led by their sophomore phenom Auston Matthews, the Leafs look for revenge against the team that crushed the hopes of an entire fanbase five years ago.
Taking an analytics-focused view, let’s see how Toronto and Boston compare now.
All data used is courtesy of Corsica and NaturalStatTrick
The Leafs are superior to the Bruins in every major offensive category. Toronto is one of the highest paced teams in the league, relying on their high-end offensive talent to best opponents. Boston had a similarly strong offensive season, but failed to generate a significant amount of high danger scoring chances per 60 minutes of play. This can likely be attributed to the Bruins' slower paced style of play.
The visuals above show the league rank of each forward in 5v5 primary points per 60 minutes. This metric is highly repeatable year over year, and gives a somewhat accurate depiction of a player’s offensive prowess. However, numbers are somewhat skewed by factors such as the quality of their linemates and the quality of competition faced.
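For reference, primary points per 60 is a simple rate stat: goals plus first assists, scaled to 60 minutes of ice time. A minimal Python sketch follows, with made-up inputs and a function name that is ours rather than anything taken from Corsica or Natural Stat Trick.

```python
def primary_points_per_60(goals, first_assists, toi_minutes):
    """Primary points (goals + first assists) per 60 minutes of ice time."""
    if toi_minutes <= 0:
        raise ValueError("time on ice must be positive")
    return 60 * (goals + first_assists) / toi_minutes

# Made-up season line: 20 goals and 15 first assists in 1,200 minutes at 5v5.
print(primary_points_per_60(20, 15, 1200))
```

Dividing by ice time is what lets players with very different deployment be compared on the same scale.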
The first thing that stands out about the Leafs’ chart is Auston Matthews. He ranks first league-wide in 5v5 P1/60. Fans can expect him to be a constant threat, and the biggest ‘X-factor’ player in the series. Boston is led by what is likely the league’s most dominant first line. It is one of the few lines capable of outplaying the formidable combination of Auston Matthews and William Nylander.
Toronto is incredible at generating high danger scoring chances. This metric is much more predictive of goal scoring than stats such as ‘shots’. In contrast, Boston is far below league average at generating scoring chances right in front of the net, but remains a threat in the high slot. Toronto outperforms metrics such as Corsi For and scoring chances due to their admirable scoring talent and high number of odd-man rushes per game. Boston has slightly above average shot quality, meaning they likely score near their expected results according to Corsi and scoring chances.
Zdeno Chara - Charlie McAvoy
Torey Krug - Kevan Miller
Matt Grzelcyk - Adam McQuaid
Morgan Rielly - Ron Hainsey
Jake Gardiner - Nikita Zaitsev
Travis Dermott - Roman Polak
Boston has been an excellent defensive team this season, beating Toronto in every major defensive category. The Bruins are one of the best shot suppression teams in the NHL, forcing teams to shoot from unfavourable scoring positions. In contrast, the Leafs allow a high concentration of dangerous scoring chances from the slot, leading to a much worse defensive performance. Shots against location heat maps for each team can be seen below:
Toronto gives up a lot of high danger chances, leading to a higher expected goals against per game. It also means the team underperforms metrics such as Corsi and scoring chances. Boston, in contrast, is excellent at shot suppression. This leads to outperforming metrics such as Corsi and scoring chances, and results in a very low expected goals against per game.
Both the Leafs and Bruins boast top-tier goaltenders in Frederik Andersen and Tuukka Rask. Using a goalie comparison tool created by Tyler Kelley (@DocKelley41), we are able to compare each goalie by key metrics:
For more on what each metric means, read here. The values on the x-axis of the graph are the percentile ranks of each goalie’s stats. Frederik Andersen is near the top of the charts in Goals Saved Above Average. This is unsurprising considering the aforementioned shaky Leafs defense and the great play of Andersen so far this year. The stat suggests that if an average goalie were placed in the Leafs net in place of Andersen, the team would be expected to concede a lot more goals. By this metric among others, it appears Andersen has a small edge over Tuukka Rask this season.
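Goals Saved Above Average is commonly defined as the number of goals a league-average goalie would have allowed on the same number of shots, minus the goals the goalie actually allowed. The Python sketch below uses that common definition; the comparison tool referenced above may compute it differently, and the numbers are made up.

```python
def goals_saved_above_average(shots_against, goals_against, league_save_pct):
    """GSAA: goals an average goalie would allow on these shots, minus actual goals."""
    expected_goals_against = shots_against * (1.0 - league_save_pct)
    return expected_goals_against - goals_against

# Made-up example: 2,000 shots faced, 160 goals allowed, league average save % of .915.
print(round(goals_saved_above_average(2000, 160, 0.915), 2))
```

A positive value means the goalie allowed fewer goals than an average goalie would have on the same workload.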
The team statistics would suggest the Boston Bruins are the favourites in this series. However, in head-to-head matchups the Toronto Maple Leafs have been the better team, with a 7-1-0 record in 8 games over the past 2 seasons. This series should be a war, and one of the most likely first-round matchups to go to 7 games. With that being said, my final prediction is Leafs in 7 games.
The calm before the storm.
The brackets have been set up, the matchup strategies developed, and the razors hidden away. For the first time since June, playoff hockey is here. We are mere hours from the puck drop that’ll kick off the 2017-18 Stanley Cup Playoffs, the starting pistol for a two-month-long marathon where only one team can cross the finish line. In anticipation of this, we at the Queen’s Sports Analytics Organization decided to tee up the matchups featuring Canadian teams. We start with the Winnipeg Jets, who will play host to the Minnesota Wild on Wednesday night.
The first-round playoff series between the Central Division rivals Winnipeg Jets (2nd, 52-20-10) and Minnesota Wild (3rd, 45-26-11) is an exciting matchup that is sure to feature a high level of speed, talent, and physicality from both sides. Both squads have enjoyed productive seasons, with the Jets posting the best record of any Canadian team, finishing with 114 points.
Winnipeg enters the series with the reputation of having one of the most lethal forward groups in the league. Led by a rejuvenated Blake Wheeler (91 points) and 44 goals from sophomore winger Patrik Laine, the Jets possess high-end offensive firepower that has torched the league for the better part of the season. Minnesota, meanwhile, enjoyed strong seasons from Eric Staal (76 points), Mikael Granlund (67 points) and Jason Zucker (64 points). Let’s take a quick look at some summary statistics from the regular season.
The Jets scored 23 more goals than the Wild over the season, though much of this can be explained by their superior power play. Jets skaters had a higher shooting percentage, though the difference is too small to reasonably infer superior shooting ability. The Jets outperformed the Wild at generating shot attempts and scoring chances, though the Wild were able to create more high-danger scoring chances. While individual point totals suggest Winnipeg has more high-end forwards, we can examine depth charts to clarify the picture.
The graphic above shows the current depth charts (courtesy of Daily Faceoff) and each player’s rank among NHL forwards in even-strength primary points per 60 minutes. Here we confirm our belief that Winnipeg’s forward group is much deeper than Minnesota’s, as we can see that six Jets produced at a top-line rate compared to just three Wild players. To understand how the above results were achieved, we turn to heat maps.
The red areas indicate locations where a team shoots more frequently than league average, while blue is the inverse. In these maps we can see two teams who have a very different approach to generating offence. The Jets set up a triangle of attack, which results in a high volume of shots coming from the points and the mid-high slot. Being able to attack the slot with such regularity doubtlessly contributed to the success that the Jets experienced this season. The Wild, meanwhile, seem to play more on the perimeter with the goal of funneling pucks towards the crease. This explains why Minnesota produced more high-danger chances than the Jets despite generating fewer total scoring chances.
The offence matchup clearly favours Winnipeg. The Jets have the top-end firepower and the depth to roll scoring threats on every line. Throw in a dangerous power play, and the Jets are dangerous enough to make life miserable for anyone attempting to contain them.
Josh Morrissey – Jacob Trouba
Joe Morrow – Dustin Byfuglien
Ben Chiarot – Tyler Myers
Jonas Brodin – Matthew Dumba
Carson Soucy – Jared Spurgeon
Nick Seeler – Nate Prosser
The Winnipeg Jets allowed 216 goals in 2017/18, with 144 coming at even strength, while Minnesota allowed 229 goals (144 at 5v5). Winnipeg gave up an average of 31.9 shots per game, while Minnesota surrendered 31.3 on average. In terms of possession metrics, Winnipeg controlled 51.42% of shot attempts over the course of the 2017/18 season, good for 10th in the league, while Minnesota sits 29th with only 47.17% of shot attempts.
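The possession figures quoted above are shares of shot attempts: a team’s attempts divided by all attempts taken while it was on the ice. As a minimal Python sketch, with made-up counts rather than either team’s actual totals:

```python
def shot_attempt_share(attempts_for, attempts_against):
    """Percentage of all shot attempts that were taken by the team."""
    total = attempts_for + attempts_against
    if total == 0:
        raise ValueError("no shot attempts recorded")
    return 100.0 * attempts_for / total

# Made-up counts: 55 attempts for and 45 against.
print(shot_attempt_share(55, 45))
```

A share above 50% means the team out-attempted its opponents over the sample.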
Comparing the top pairing defencemen for both teams using HERO charts:
The Minnesota Wild’s defence corps has taken a significant blow going into the postseason with the loss of number 1 defenseman Ryan Suter, who logged an average of 26:46 minutes of ice time per game before suffering a season-ending ankle injury on March 31. Veteran defender Jared Spurgeon remains a game-time decision due to an injured hamstring. The burden to cover these minutes will fall squarely on the shoulders of young defensemen Jonas Brodin and Matt Dumba, who will be counted on in key defensive situations. The Winnipeg Jets boast a tough lineup of physical defencemen, including Dustin Byfuglien and Tyler Myers, who will look to shut down the Wild’s top offensive lines. The Winnipeg Jets have the edge when it comes to top-tier defencemen, as well as much stronger depth on the blueline overall.
Finally, let’s compare the heat maps for both Winnipeg and Minnesota in their own defensive zones.
Taking a look at these maps, both teams are effectively limiting the number of scoring chances from high-danger scoring areas around the net (<25 feet) and in the slot. Minnesota’s heat map clearly indicates that the majority of chances are coming from the point (>40 feet out from the net) and down the right side, a potential weakness that Winnipeg’s quick wingers will look to exploit. Winnipeg’s defence is managing to limit almost all chances from high-scoring areas directly in front of their net, keeping the majority of shot attempts to the outside perimeter of the rink.
We close our positional matchups by considering goaltending. Winnipeg will rely on Connor Hellebuyck, who broke out this year to post the winningest season ever by an American goalie. The young upstart will go toe to toe with Devan Dubnyk, the waiver-wire reclamation project that Minnesota has turned into a competent starter. Dubnyk has the qualitative advantage of playoff experience, but let’s see how the numbers stack up.
Unless otherwise specified, the above percentages reflect even-strength play. We see that Hellebuyck and Dubnyk performed similarly at even strength, as their save percentages for low, medium and high danger shots are all within a single percentage point. Where we see a difference, however, is on the special teams. While these stats are influenced by the quality of special team units, we see that Hellebuyck has significantly outperformed Dubnyk on both power plays and penalty kills. We also see that Hellebuyck saved about 2 goals more than expected given the quality of the shots being faced, whereas Dubnyk was over 7 goals in the hole on this metric.
If there had to be a choice between the two to start a Game 7, Connor Hellebuyck would be a safe choice. Despite his inexperience, his exceptional season played a huge role in Winnipeg’s ascension to 2nd place in the NHL’s overall standings. He’s shown to be better than Dubnyk at stopping the puck, and for that reason, he gives his team a better chance to win.
In summary, the numbers indicate that Winnipeg has the advantage in terms of offense, defense, and goaltending. The Jets enter the playoffs on an absolute tear, having won 11 of their last 12 games. They are 3-1-0 vs. the Wild in their season series. We are predicting that the Winnipeg Jets will be victorious in their first-round series against the Minnesota Wild, likely in 5 or 6 games.
How the Queen's Men's Hockey Team is Using Analytics - Interviewing Director of Analytics, Miles Hoaken
If you've ever thought that sports analytics could only be implemented in national leagues, where there is plenty of publicly available data, then it's time to think again. Miles Hoaken is a first-year Queen's University student in the Commerce program who is the creator and director of the analytics department for the Queen's Men's Hockey team. In Miles' first year alongside the coaching staff, the team broke the school's record for most wins in a season (19) and finished second in the OUA Eastern conference. I sat down with Miles to talk about how he uses analytics to help make the team even better, and for tips on how other students can start getting into hockey analytics.
Thanks for coming today and agreeing to do this interview. I’m sure many students who support the Queen’s Men’s Hockey Team aren’t aware that there is an analytics department for the team, let alone that it’s run by a Queen’s student. Tell us a bit about who you are and what you do.
My name’s Miles Hoaken and I’m from Toronto. I started getting into hockey analytics when I was 13 years old. Basically, the Leafs lost game 7, blowing a 4-1 lead (as I’m sure a lot of you are aware), which made me realize that there might be another layer that myself and Leafs management weren’t paying attention to, and since they’re my childhood team I tend to follow them more. I started a blog when I was 13 years old, writing down some ideas that I had that were based on some hockey analytics, but not a lot since I was only 13 and I didn’t have the math background at the time to understand what some of the stats were. In 2014, the summer of analytics, I saw tons of people getting hired and realized it was realistic for analysts to get hired based on the work they produced on their blogs or Twitter, so I decided to get further into analytics, started writing more on my blog, and then in Grade 12 I got an analytics position with my high school team. I did statistical consulting on their play, mainly analyzing zone entries but varied depending on what the coach wanted from me on that day. From there, I parlayed that into my role with Queen’s, which is essentially running all their analytics and statistical operations. I basically serve as a coach on their coaching staff, so I’m right there in the office helping make decisions, advising the coach on certain strategy items, giving presentations to the players on occasion, and running that whole operation. We take a variety of stats, mainly pertaining to offensive output since that was the area coach was most interested in.
So you’re 18 years old and working with the coaches for the Varsity Hockey Team here at Queen’s for players who are often 3-5 years older than you. Cool to think about. How do you get the data that you use?
I get all the data live at the games, and it’s all tracked by hand. I print out templates before the game that have everything that I’m going to fill out. For example, for an entry chart I’ll have categories to see who entered the zone, what type of entry it was (controlled or uncontrolled), what general location it was, and then some counting stats. To get the shot locations, I simply mark them down on a piece of paper and fill in the numbers in my spreadsheet after the game. This works well for us since we are trying to do them all live. I unfortunately don’t have the luxury of time to go through all the games for many different stats and many different viewings, because I would probably fail all my school courses if I did. So it has to all be live, and has to all be fast, so the best way to do that right now is by hand. Next year, we have five other people helping me track stats, which should allow us to have more data to work with, but the long-term goal is to automate these parts of the job so that when I graduate, the analytics department could be run by one person at the click of a button.
Are you looking for more students to help out?
Right now we’ve filled all of our data-tracker positions for the upcoming year. We’re always looking for coders who can help out on some of the stuff on the presentation side since building a portal for the coaches is something that I’m trying to do. At my current level of coding I don’t think I could do it, but eventually with some help I think that we can get there. Keep watching our Facebook page; after next year we’ll be looking for more data-trackers.
How has the coaching staff responded to your work with them?
The whole staff has been very receptive to analytics. Sometimes I come in with crazy ideas, but they really bear with me and take into account what I’m saying. Credit goes to Brett Gibson, when I walked into his office in the first meeting, I was a bit of an unknown and we were going to use an iPad app to track stats. I was able to convince him that the iPad app wasn’t that good and would be a waste of his money and the program’s money and that they should instead trust me and my templates. Maybe it takes a little bit of logic and a bit of crazy to trust an 18 year-old that he had never met before, but he put his faith in me and gave me this role and I will forever be grateful to him for that. He’s done a great job of incorporating me into the decisions and making sure my voice is heard. It’s something he didn’t have to do but I’m really glad he did. It’s been a great situation with the coaches, and coach Gibson has brought the program from a point where we only had 4 former CHLers when he started, to 21 CHLers now so that speaks to his work ethic and commitment to the program for sure.
What’s your relationship like with the players? Do you think they’ve bought in to your recommendations?
I’ve presented to them once so far. It was interesting to read the room because it seemed like the people at the top of the list for the stats I was presenting had a quicker buy-in to what I was talking about. The players at the bottom of the list seemed to look a little bit more confused by it, but what I found was that the players near the bottom of the list actually had a larger increase in these stats than those near the top of the list, which made me think they were responding well to it. They also get access to my reports after every game.
Do you do any coding as part of what you do?
I would say that half of my job is in the rink doing the tracking and recommendations, and the other half is during the week, coding and making programs. The report I give to the coach after every game contains some offensive statistics, which are all presented as graphs generated in R. I set it up so that I can simply change the game number and it will generate the report for that game. It’s a big part of what I do. If anyone is looking to get into hockey analytics, I would say the first thing to learn is coding because it will just make everything a lot easier. I also use coding to generate statistics on the league. I have a web scraper that takes the raw data from the U-Sports website and then turns it into ‘fancy stats’ – Goals for %, Shots for %, some I’m even able to get for 5v5 play through the data that they give us. So coding is a big part of it. I use R personally, but there is a big debate in the hockey analytics community between R and Python – you really can’t go wrong with either. I’m learning Python as part of a coding course at Queen’s next year (CISC121), but R is what I started with and the one I feel the most comfortable with.
This year you spoke about what you do for the Queen's Men's Hockey Team at VANHAC (Vancouver Hockey Analytics Conference). The presentation link is here. Tell us about your experience at VANHAC. Would you recommend it for those who are interested in working or learning about hockey analytics?
VANHAC was a really great experience for me. I went as a high school student, it was sort of like my grad trip. Some people go on S-Trip, I went to a hockey analytics conference which I think tells you all you need to know about my personality and my passion for this. *laughs* VANHAC is really awesome, it’s probably the best conference in North America, in terms of your value and hockey analytics specifically. Sloan (MIT) is the big one for sports analytics in general which I hope to go to someday. Really though, if you want to meet people from NHL teams, see some of the best research that’s come out recently, you have to go to VANHAC. It’s great because you don’t necessarily need to be an expert to go, some people were there with no experience whatsoever, didn’t know what Corsi was and ended up really enjoying it so it’s a really fun environment. The hockey analytics community is one of the most welcoming communities ever. When I was presenting there this year I didn’t feel nervous at all, so I definitely recommend it to any hockey analytics fan or even someone just trying to get into it.
Do you think we’ll ever see analytics at the forefront of U-Sports hockey? I feel like if more students knew that what you do is possible, there might be more focus towards it leading to each team having their own student-led analytics department.
At VANHAC, Brad Mills (@MillsBradley11), who’s the COO of Hockey Data (@HockeyDataInc), approached me after my presentation, and since he played in the NHL, we started talking about how the game has changed with the advances in analytics since he played. He said that given the number of teams in U-Sports, and given all the statistics I was using, it would cost ~$11,000 to do what I do for every single team in the regular season. I was surprised at how little it was, but at the same time, I mentioned “That $11,000 is only worth it if we have all the data and nobody else does” since that’s what gives us our competitive advantage. That was actually one of the questions I received after my presentation, which was “Do you do any analysis on players from other teams?” and the answer to that is no, because the public information I can get is points, and I have no idea where these points are coming from necessarily, or if they’re skilled in any other way that a micro-stat could capture but I don’t have access to. There are definitely people like me at other Universities, maybe not to the same extent or scale since we’re becoming one of the more advanced ones, especially given the number of trackers we’ll have next year. I know Western and UOttawa have an analytics person as well but some teams don’t even have that voice in the room, and without that sometimes you can get into groupthink.
You’re active on Twitter (@SmoakinHoaken). How has Twitter been a learning tool for you?
Twitter has been huge for me, I got Twitter when I was 13, which you can probably tell from my handle (@SmoakinHoaken) (Hannah Montana reference). It’s been really key, people post their research on Twitter first, and people have gotten hired not because of Twitter, but because of the work they’ve put out on Twitter. It’s great for questions too, if you’re new to hockey analytics, you can use the hashtag #HockeyHelper and Alex Novet or someone from @HockeyGraphs will reply to you really fast with some advice.
Aside from @QSAOqueens, what are 5 Twitter accounts that you recommend hockey analytics enthusiasts to follow?
@IneffectiveMath – Micah Blake McCurdy (www.hockeyviz.com) – I got to meet him at VANHAC and he posts a lot of cool visuals and has a patreon with premium content which allows him to make even more graphs. His theme is that numbers are tired, and pictures are wired, which I really like. We’re actually trying to incorporate more pictures and visuals with Queen’s next year.
@SteveBurtch – I think he said he tweets a thousand times a month or something like that, so you get a lot of content that’s interesting. As he’s joked about himself, he has a surprisingly low “Bad-takes/60 tweets”, so you should definitely follow him.
@nnstats - Super Bowl champion, someone I really look up to for advice on coding and life etc. She will be the first female GM for sure.
@MannyElk – If you’re looking for hockey twitter, but also salads and interesting takes on pop culture, you should definitely follow Manny.
Manny is certainly a fun follow, and the others are great as well. How would you recommend hockey enthusiasts to learn more about the analytics behind the game, aside from reading all of the great content on www.qsao-queens.com and attending QSAO events?
I’d say read a lot. That’s what I did throughout my highschool years, I would just read and read and read until I finally felt comfortable presenting these ideas to a coach to volunteer. If you’re not comfortable learning coding just yet, learn everything you can about Excel or other data visualization software. Also learn to effectively communicate your ideas. I know that if I present my idea to the coaching staff as a bunch of numbers, they will not care. If I explain how the idea could be implemented and show them that it works in some setting, then it’s way more likely to be accepted. I think that’s a big problem with some analysts, they can be a little cantankerous or have a high and mighty attitude at times where it’s ‘them-vs-the-world,’ but that mentality won’t serve them well in life. So it’s really important to communicate these ideas effectively, in my opinion.
All the VANHAC talks are on YouTube (link to the playlist here), so watching all of those would be great. If you’re still in high school, or even university, find a way to work with a team in any capacity. I started by running a Twitter account in Grade 11 for the Don Mills Flyers. From there, I met plenty of interesting people in the industry that I still keep in contact with. That gig helped me get work with my high school team, which ended up being in analytics.
Thanks for doing this Miles, looking forward to working with you in the future to help make the Queen's Men's Hockey Team even better.
Player comparison is a popular topic of debate among armchair general managers: which guy is better? Would you rather have Player A or Player B? In the wake of a big 1:1 trade, which team won? While in the past we were left to bias, favouritism, and the infamous eye test, today we have some visualization tools to help compare players across useful metrics.
One of the best and most intuitive of these tools is the HERO Chart, as pioneered by Domenic Galamini Jr. (@MimicoHero). These charts, which are within the realm of descriptive statistics, can be found at the following website: http://ownthepuck.blogspot.ca/
Below we can see Alex Ovechkin’s HERO Chart:
What Stats Are Measured?
HERO charts show performance across five stats: ICETIME, GOALS, FIRSTA, SHOTGEN, and SHOTSUP. ICETIME refers to all-situation (even strength, power-play, or short-handed) minutes per game. GOALS measures 5-on-5 goals per 60 minutes, while FIRSTA measures 5-on-5 first assists per 60 minutes. SHOTGEN is 5-on-5 shots generated per 60 minutes and SHOTSUP is 5-on-5 shots suppressed per 60 minutes, both relative to average. These stats are measured across the most recent three seasons, with weightings of 44%-33%-22% respectively to ultimately reach a single measure.
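As a rough sketch of the three-season weighting described above, the Python snippet below blends three per-season rates into a single number. The function name and the sample rates are made up for illustration. Because the quoted weights add up to 99% rather than 100%, the sketch divides by their total; that normalisation is an assumption on my part, not something the HERO chart methodology confirms.

```python
def weighted_three_season_rate(rates, weights=(0.44, 0.33, 0.22)):
    """Blend per-season rates into one number, most recent season first."""
    if len(rates) != len(weights):
        raise ValueError("need exactly one rate per weight")
    total_weight = sum(weights)  # 0.99 with the quoted weights
    # Assumption: rescale so the weights behave as if they summed to 1
    return sum(r * w for r, w in zip(rates, weights)) / total_weight


# Hypothetical 5-on-5 goals per 60 for the last three seasons
example_rates = [1.2, 1.0, 0.8]
blended = weighted_three_season_rate(example_rates)
print(round(blended, 3))
```

The most recent season counts twice as much as the season from two years ago, so a player who is improving will see the blended rate sit closer to his latest results.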
It’s important to note some key features of these metrics. Aside from ICETIME, the other four stats are measured only at even-strength and per 60 minutes of playing time. This serves to level the playing field, and accounts for the situation and frequency with which different players are deployed. Making these adjustments gives us a better sense of a player’s true performance, though we must consider HERO chart results in an appropriate context. Logging massive minutes and special teams scoring remain hugely important parts of the game, so they should not be disregarded when evaluating a player’s usefulness even if they are not reflected in a player’s HERO chart.
What Do the Numbers Mean?
Each of the numbers you see represents a standardized rating from 0 to 10. Ratings are normally distributed: a rating of 5 represents league-average performance at a skater’s position, and each standard deviation is worth 2 points. A rating above 5 shows performance above league average, and vice versa. For example, as we can see, Alex Ovechkin is league average at first assists compared to eligible wingers, considerably above league average at generating shots, and somewhat below league average at suppressing shots.
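One plausible way to read this scale is as a rescaled z-score: league average maps to 5, each standard deviation is worth 2 points, and the result is capped to stay between 0 and 10. The Python sketch below works under that assumption. The function name, the capping step, and the peer values are all hypothetical, and the actual HERO chart calculation may differ.

```python
from statistics import mean, pstdev


def standardized_rating(value, peer_values):
    """Map a raw stat onto a 0-10 scale centred on 5 with 2 points per SD."""
    mu = mean(peer_values)
    sigma = pstdev(peer_values)
    if sigma == 0:
        return 5.0  # every peer is identical, so everyone is average
    z = (value - mu) / sigma
    # Assumption: ratings are clipped so they never leave the 0-10 range
    return max(0.0, min(10.0, 5.0 + 2.0 * z))


# Hypothetical first assists per 60 for a group of wingers
peers = [0.4, 0.6, 0.8, 1.0, 1.2]
print(standardized_rating(0.8, peers))  # a player exactly at the peer average
```

Under this reading, a rating of 7 sits one standard deviation above the positional average and a rating of 3 sits one standard deviation below it.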
Can I See Someone’s Stats Over Time?
Yes you can! Just under the HERO chart you’ll find a chart showing how the player has performed over recent years. The dark blue line represents primary points per hour, and the light blue line represents shot impact per hour. Here is Ovechkin’s. We can see a slow decline, though Ovechkin remains a strong performer in both metrics.
How Do I Compare Players Directly?
HERO charts were largely built to perform direct comparison, so when you enter Domenic’s website you’ll see two charts beside each other. You can select players of your choice from the dropdown menu for either chart and see a direct comparison. Let’s compare two elite centres: Sidney Crosby and Connor McDavid.
Scanning the charts, we can see that Crosby ranks higher in goals and shot generation, while McDavid ranks higher in first assists and shot suppression. Both players are fantastic across the board.
What Else Can I Do?
In addition to comparing players to other players, we can compare players to positional archetypes. For example, we could see how Max Pacioretty stacks up compared to the average first-line winger, or how Morgan Rielly performs relative to an average #1 defenceman. Below we can see Pacioretty’s chart:
If you’re interested in learning more about how the archetypes are calculated, there’s a section labelled ‘Chart Guide’ on the website containing an explanation of the methodology. Personally, I (Owen) enjoy using archetype comparisons to evaluate acquisitions that my favourite team makes, as it gives a high-level indication of where a player could fit into a lineup. It’s also useful for convincing your friends that the young guy you’re bullish on has legitimate upside, and that your team is going to go all the way because of it.
I Have Unanswered Questions - Where Do I Go?
That’s a quick and dirty explanation of what HERO charts are and how to use them. If you have any burning questions that are unaddressed, I encourage you to read through the HERO chart FAQs that Domenic published. The link can be found here: https://ownthepuck.wordpress.com/2017/01/21/hero-charts-frequently-asked-questions/.
All-3-Zone Player Comparison Tools:
Eric Tulsky once said "the magic of analytics is in recording all of the small things lost to memory that add up to something significant.” The easiest events to remember after you watch a hockey game are the big events: the goals, and sometimes even the shots. What you probably don’t remember, though, are the small plays that led up to those events, and the small plays that led to nothing at all. Tulsky worked with people like Corey Sznajder (@Shutdownline) to study the events in the neutral zone that drive offense. Although Tulsky now works for the Hurricanes, Sznajder runs a massive tracking project whose numbers are brought to life by CJ Turtoro's (@CJTDevil) All-3-Zones Player Comparison Tools. Before we learn about these tools, it is important to note that Sznajder literally watches every game to collect these stats, as opposed to the data from HERO charts which are released by the league and then displayed as you saw earlier. The sample sizes in these visuals are smaller as a result, but we will see in a moment how they capture some important ways that players create value for their teams.
There are two sets of visuals, which can be found at the links below:
First, we will discuss the set of visuals you can find by clicking that first link above. Below, you will see a screenshot of one of the four visuals available at that link:
The stats displayed on this page quantify what happens when a player tries to enter the offensive zone with the puck. He can either carry it in (carry-ins/60), dump it into the zone and then chase after it (dump-ins/60), pass it off to a teammate (Entry passes/60) or fail in his attempt (fails/60).
We care about these numbers because entering the offensive zone with control of the puck is a reliable way to create offense. It is one way to quantify a small thing lost to memory that gives rise to something significant. As you can probably see from the leaderboard above, players who succeed at entering with control are better at creating offence than those who struggle to bypass opposing defenders. This is why the players here are sorted by possession entries (carry-ins + entry passes per 60 minutes).
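To make the "per 60" rate concrete, here is a minimal Python sketch of how possession entries per 60 minutes could be computed from raw counts. The function names and the sample numbers are invented for illustration and are not taken from the tracking project.

```python
def per_60(count, minutes):
    """Convert a raw event count into a rate per 60 minutes of ice time."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return count * 60.0 / minutes


def possession_entries_per_60(carry_ins, entry_passes, minutes):
    """Possession entries are carry-ins plus entry passes, as a per-60 rate."""
    return per_60(carry_ins + entry_passes, minutes)


# Hypothetical player: 45 carry-ins and 15 entry passes in 300 tracked minutes
rate = possession_entries_per_60(carry_ins=45, entry_passes=15, minutes=300)
print(rate)
```

Expressing the count as a rate is what lets a third-liner with limited ice time be compared fairly with a first-liner who plays far more minutes.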
While tracking carry-ins is a way to quantify the creation of offence, we can also use these numbers to quantify defence. Whenever a player tries to carry the puck into the offensive zone, the opposing defenders want to stop them. The best defenders in these metrics allow the fewest possession entries. The worst ones allow attackers to create offence with ease. It should not surprise you, then, that attackers try to target the defenders who struggle to defend the blue line. Defenders who allow possession entries 90% of the time they are targeted by opposing teams are obviously quite poor at defending the blue line. Below, you will see which defenders allow the fewest possession entries as a percentage of the number of times they were targeted:
Some of the best defenders in the league show up in this leaderboard, which is further validation that what we are studying is actually important. It is always a good sign when the numbers are validated by the eye test and by years of research.
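The "entries allowed per target" idea can be sketched in a few lines of Python. The defender names and counts below are invented, and the function name is hypothetical; the point is only to show the percentage and the ranking it produces.

```python
def poss_entry_pct_allowed(entries_allowed, times_targeted):
    """Share of targeted entry attempts that ended in a possession entry."""
    if times_targeted == 0:
        return None  # no attempts tracked against this defender
    return 100.0 * entries_allowed / times_targeted


# Hypothetical data: (possession entries allowed, times targeted)
defenders = {
    "Defender A": (18, 40),
    "Defender B": (27, 30),
    "Defender C": (20, 50),
}

# Lower is better, so the best blue-line defenders come first
ranked = sorted(defenders, key=lambda name: poss_entry_pct_allowed(*defenders[name]))
for name in ranked:
    print(name, poss_entry_pct_allowed(*defenders[name]))
```

In this made-up sample, Defender B allows a possession entry on 90% of the attempts against him, which is the kind of profile opposing attackers would target.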
The best defensive teams either prevent zone entries altogether, or they remove the puck from the defensive zone as soon as possible. Indeed, zone exits are another way to measure defensive contributions in hockey, for both forwards and defensemen. The screenshot below shows which players succeed at removing the puck from their zone:
Again, positive contributions are measured by Possession Exits/60. Exiting with possession of the puck occurs when a player carries the puck out of the defensive zone (carries/60), or when they make a successful pass to a teammate (Exit passes/60). If a player fails to exit the zone with the puck, it is obviously a failed attempt (Fails/60). If he dumps it, clears it, or ices the puck, he is merely giving the other team another chance to create offence, which is why Possession Exits/60 ignores Dumps/60, Clears/60, and Icings/60. Exiting the defensive zone with possession of the puck is obviously better than not.
So far, we have learned how to quantify the ways players transition from the defensive zone to the neutral zone, and then into the offensive zone. All of these numbers have one underlying theme: Puck possession leads to shots. But how do we measure which players create the most shots? While the obvious answer is to count the number of shots a player takes, the tracking project takes this one step further, and counts up to three passes before each shot is taken. In the same way that points are counted as goals and assists at the player level, the tracking project keeps track of shots and the passes that precede them. The visual below illustrates how each player contributes to shots by shooting or passing:
This leaderboard ranks players by their Total Shot Contributions per 60 minutes. A player contributes to a shot if he is the shooter (Shots/60), or if he made at least one of three passes before the shot was taken. Assisting on a shot is the same as assisting on a goal, except Shot Contributions consider up to three passes before a shot while points only consider two passes. If a player made a pass immediately before the shot was taken it is called a Primary Shot Assist (sA1/60), if he made the second pass before the shot it is a Secondary Shot Assist (sA2/60), and if he made the third pass it is a Tertiary Shot Assist (sA3/60). Altogether, shot contributions are an excellent and reliable way to measure which players are creating offence.
Now that we have explored this first set of 4 visualizations, we can move on to the second part: The Player Comparison Tool. As you will see below, the Player Comparison Tool presents the numbers in a way that summarizes all of the stats we have learned about from the leaderboards. Take a look:
Most of the stats seen here should seem familiar, but this time they are aggregated to provide you with a more general snapshot of each player. For example, the Shot Contributions leaderboard we saw earlier broke down Shot Contributions into four stats: shots, primary shot assists, secondary shot assists, and tertiary shot assists. The Player Comparison Tool summarizes these numbers to measure shooting (Shots60), passing (ShotAssists60; sA1/60 + sA2/60 + sA3/60), and total contributions (ShotContr60; Shots60 + ShotAssists60).
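The aggregation described above is simple addition, which the short Python sketch below spells out. The class and field names are my own shorthand and the sample rates are invented; only the two sums come from the definitions in the text.

```python
from dataclasses import dataclass


@dataclass
class ShotRates:
    """Per-60 shot and shot-assist rates for one player (hypothetical layout)."""
    shots60: float
    sa1_60: float  # primary shot assists
    sa2_60: float  # secondary shot assists
    sa3_60: float  # tertiary shot assists

    @property
    def shot_assists60(self):
        # ShotAssists60 = sA1/60 + sA2/60 + sA3/60
        return self.sa1_60 + self.sa2_60 + self.sa3_60

    @property
    def shot_contr60(self):
        # ShotContr60 = Shots60 + ShotAssists60
        return self.shots60 + self.shot_assists60


# Invented numbers for illustration
player = ShotRates(shots60=8.0, sa1_60=4.5, sa2_60=2.0, sa3_60=1.0)
print(player.shot_assists60, player.shot_contr60)
```

Splitting the total this way shows at a glance whether a player drives offence mainly by shooting or mainly by passing.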
The zone entry leaderboard is summarized in the Entry section, using possession entries expressed as a rate stat (PossEntries60) and possession entries expressed as a percentage of total entry attempts (PossEntry%). Similarly, the zone exit leaderboard is summarized in the Exit section.
It is important to note that if you are viewing a forward using this tool, you will only see the first three sections. The fourth section, Entry Defence, is only available for defenders. This section summarizes the aforementioned Entry Defence per Target leaderboard. As discussed earlier, the best way to defend the blueline is to prevent attackers from entering the zone with control of the puck. A defender who breaks up a play at the blue line is credited with breaking up the play (Breakups60). Defenders who concede controlled zone entries less often are the ones who rank best in the second stat (PossEntriesAllowed60). This is also expressed as a percentage of the number of times the defender is the target of an attempted zone entry by the other team (PossEntry% Allowed).
You can view a player’s results in two 1-year windows and one 2-year window, covering the 2016-17 season and the 2017-18 season. This allows you to compare one player to himself (in consecutive seasons) or two players to each other (in the same single season or across both seasons simultaneously). As shown in the intro to analytics article, an example that motivates the study of the former is Nikita Zaitsev’s first two NHL seasons. If you are feeling extra fancy, you can also view two different players with the same name...
Although the most valid comparisons are those between players of the same position, which is obviously not the case for the two Sebastian Ahos, this example demonstrates one of the many ways you can be creative with these visuals once you start using them. With these tools at your disposal, you can answer silly questions like “Is Sebastian Aho better than Sebastian Aho?” along with more objective ones such as “Who contributes to offence the most often?” and “Which defenders are best at defending the blueline?” It would be impossible to answer any of these questions without the hard work of people like Sznajder, Turtoro, and Tulsky, and their mission to record mundane elements of the game that uncover hidden areas of player value.
Advanced statistics have picked up steam and shifted into the mainstream of hockey over the past decade. Many NHL teams now employ full-time analytics staff dedicated to breaking down the numbers behind the game. So, what makes analytics such a powerful tool? Aside from helping you dominate your next fantasy hockey pool, advanced statistics provide potent insights into what is really causing teams to win or lose.
Hockey is a sport that has long been misunderstood. Its gameplay is fundamentally volatile, spontaneous and difficult to follow. There are countless different factors that contribute to a team’s chances of scoring a goal or winning a game on a nightly basis. While many in Canada would beg to differ, ice hockey still firmly occupies last place in terms of revenue and fan support amongst the big four major North American sport leagues (NFL, MLB, NBA, & NHL). As such, hockey is on the whole overlooked and is often the last to implement certain changes that come about in professional sports. The idea of a set of advanced statistics that would offer better insights into the game arose as other major sports leagues, starting with Major League Baseball, began looking beyond superficial characteristics and searching for the underlying numbers influencing outcomes. Coaches, players, and fans alike have all been subjected over the years to an epidemic failure to truly understand what is happening out there on the ice. This is the motivation behind the hockey analytics movement: to use data analysis to enhance and develop our knowledge of ice hockey and inform decision-making for the benefit of all who wish to understand the sport better.
Another barrier to progress in the field of hockey analytics is the hesitance of the sport to embrace modern statistics. Most casual fans are familiar with basic stats such as goals, assists, PIM, and plus/minus. But do these stats really tell the full story? In fact, most of these are actually detrimental to the uninformed fan’s understanding of the game. For starters, there is usually no distinction between first and second assists in traditional stat-keeping. A player could have touched the puck thirty seconds earlier in the play or made an unbelievable pass to set up a goal, and either way it still counts as a single assist on the scoresheet. Looking only at goals and assists can be deceiving; we need more reliable, repeatable metrics to determine which players are most valuable to their teams. Advanced stats are all about looking beyond the surface and identifying what’s actually driving the play.
So, what are these so-called “advanced stats”? Let’s start with the basics.
PDO: PDO (it doesn't stand for anything) is defined as a team’s save percentage plus its shooting percentage (usually at 5v5). The league average is 100 when both are expressed as percentages, or 1.000 when they are expressed as decimals. If you only learn one concept, it's this one. It is usually regarded as a measure of a team or player’s luck, and can be a useful indicator of whether a player is under- or over-performing and whether a regression to the mean (back towards 100) is likely. This will not happen in every situation, of course, but watch for teams that have astronomical PDOs to hit a reality check sooner rather than later. Team PDO stats can be found on corsica.hockey’s team stats page.
Without trying to scare anyone, the Toronto Maple Leafs currently boast the 4th highest PDO at 101.85. To help ease your mind a bit, the Tampa Bay Lightning, who are considered the team to beat in the East, have the highest PDO at 102.35, with a decent gap between them and second place. They could also be getting results better than their true level of play; time will tell.
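For readers who want to see the arithmetic, here is a small Python sketch of PDO on the 100-point scale used for the team figures above. The function name and the goal and shot totals are invented for illustration; real PDO values are usually computed from 5v5 data only.

```python
def pdo(goals_for, shots_for, goals_against, shots_against):
    """PDO = shooting percentage + save percentage, both out of 100."""
    shooting_pct = 100.0 * goals_for / shots_for
    save_pct = 100.0 * (shots_against - goals_against) / shots_against
    return shooting_pct + save_pct


# Hypothetical team: scores on 10 of 100 shots, allows 8 goals on 100 shots
team_pdo = pdo(goals_for=10, shots_for=100, goals_against=8, shots_against=100)
print(team_pdo)
```

Every shot for one team is a shot against another, so across the whole league shooting percentage and save percentage always add up to 100. That is why values well above or below 100 are treated as candidates for regression.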
Corsi: You may have heard terms like Corsi and/or Fenwick being thrown around before. These are core concepts that are fundamental to understanding what drives the play during a game. Basically, Corsi is an approximation of puck possession that counts the total shot attempts for your team and against your team. It can be tracked for a team as a whole or for the time a specific player is on the ice.
A shot attempt is defined as any time the puck is directed at the goal, including shots on net, missed shots, and blocked shots. Anything above 50% possession is generally seen as being positive as you are generating more shot attempts than you are allowing.
Corsi stats are typically kept in the following ways: Corsi For (CF), Corsi Against (CA), +/-, and CF%. An example of how CF% can be useful is when evaluating offensive defensemen. Sometimes, these players are overvalued because of their noticeable offensive production, while failing to consider that their shaky defensive game offsets the offensive value they provide.
Fenwick: Fenwick is similar to Corsi, but excludes shot attempts that are blocked. Of course, with both of these stats, one should also take into account that a player’s possession score is influenced by both their linemates as well as the quality of competition (QoC). These stats can always be adjusted to reflect different game scenarios, like whether the team was up or down by a goal at the time, etc.
Measuring puck possession in hockey makes sense, because the team that has the puck on their stick more often controls the play. Granted, Corsi/Fenwick are far from perfect, and the team with the better possession metrics doesn’t always come out ahead. But at the very least, including all shot attempts offers a much larger sample size of data than traditional stats, and provides a solid foundation for further analysis.
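The difference between Corsi and Fenwick is easiest to see side by side. The Python sketch below counts shot attempts both ways and turns them into a percentage share (CF% and the Fenwick equivalent). The class name, helper name, and all counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ShotAttempts:
    """Shot attempts taken by one side while a team or player is on the ice."""
    on_goal: int
    missed: int
    blocked: int  # attempts by this side that the opponent blocked

    @property
    def corsi(self):
        # Corsi counts every attempt: on goal, missed, and blocked
        return self.on_goal + self.missed + self.blocked

    @property
    def fenwick(self):
        # Fenwick is the same count with blocked attempts left out
        return self.on_goal + self.missed


def share(attempts_for, attempts_against):
    """Percentage of all attempts that belonged to 'our' side."""
    return 100.0 * attempts_for / (attempts_for + attempts_against)


# Invented single-game totals
ours = ShotAttempts(on_goal=30, missed=12, blocked=13)
theirs = ShotAttempts(on_goal=25, missed=10, blocked=10)

cf_pct = share(ours.corsi, theirs.corsi)
ff_pct = share(ours.fenwick, theirs.fenwick)
print(round(cf_pct, 1), round(ff_pct, 1))
```

Both shares land above 50% here, which in the terms used above means this side generated more shot attempts than it allowed.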
Zone Starts (ZS%): this measures the proportion of the time that a player starts a shift in each area of the ice (offensive zone vs. defensive zone). A ZS% of greater than 50% tells us that the player is deployed in offensive situations more frequently than defensive situations. This is important because it gives us insight into a player’s usage, or in what scenarios he is normally deployed by his team’s coach. It also provides context for interpreting a player’s Corsi/Fenwick. Players who are more skilled offensively will tend to have a higher ZS% because they give the team a better chance to take advantage of the offensive zone faceoff and generate scoring opportunities. At the very least, ZS% can be used to get a glimpse at how a coach favors a player’s skillset.
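A minimal Python sketch of ZS% follows. It assumes the common convention of comparing only offensive-zone and defensive-zone starts and ignoring neutral-zone starts, which matches the "offensive zone vs. defensive zone" framing above but is still an assumption. The function name and counts are invented.

```python
def zone_start_pct(offensive_zone_starts, defensive_zone_starts):
    """Share of a player's non-neutral shift starts that came in the offensive zone."""
    total = offensive_zone_starts + defensive_zone_starts
    if total == 0:
        return None  # no zone starts recorded
    return 100.0 * offensive_zone_starts / total


# Hypothetical player: 60 offensive-zone starts and 40 defensive-zone starts
zs = zone_start_pct(60, 40)
print(zs)
```

A value above 50, as in this example, suggests the coach leans on the player in offensive situations, which is useful context when reading his Corsi or Fenwick numbers.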
Intro on 5v5 Isolated Stats and Repeatability
Oftentimes, you will see those who work with hockey analytics cite a player's stats solely while they are at even strength, or 5v5. Why? There are a few reasons.
First, 5v5 obviously takes up most of the hockey game. If a player is valuable to his team at 5v5, he will be valuable for more of the game, and this should be seen as a large positive. A player's power play contributions are certainly valuable to a team, but often over-valued. Next, the game is played very differently in different game states. It would be wildly unfair to penalty killers to have their penalty kill stats included in their overall line, as more goals are scored against on the penalty kill, even for the best penalty killers. Separating these statistics helps provide a more complete picture of the player's skillset and the value they have contributed to their team. Finally, 5v5 stats are generally regarded as the most repeatable, partially due to the larger sample. While players' PP and PK stats can vary widely from year to year, 5v5 stats typically remain relatively stable (read more at PPP here if you like).
In addition, primary points (goals and first assists) have been regarded as relatively repeatable stats, so be on the lookout for players who have many secondary assists, as their point totals may regress in the future (read more on this here).
Intro to Comparison Tools
One of the areas that has most benefited from hockey analytics is the domain of player comparison. One of the best and most intuitive tools is the HERO chart, as pioneered by Domenic Galamini Jr. (@MimicoHero). The HERO chart is a quick comparison of how players stack up across ice time, goal scoring, primary assists, shot generation and shot suppression. At a single glance, we can get a sense of the strengths and contributions of different players. Here we compare Sidney Crosby to Connor McDavid:
We can see that Crosby is better at goal-scoring and shot generation, while McDavid is better at primary assists and shot suppression.
To compare any two players of your choice, or to compare a player to a positional archetype like First-Line Centre or Second-Pair Defender, you can use Galamini’s website: http://ownthepuck.blogspot.ca/. These comparisons can be used to enhance understanding of a player’s skill set, inform debates, and evaluate moves made by NHL teams, among other uses.
All-3-Zone Data Visualizations:
While a HERO chart is an all-encompassing snapshot of a player’s contributions on the ice, the All-Three-Zones visuals are concerned with more specific aspects of the game. CJ Turtoro (@CJTDevil) created two sets of visuals using data from Corey Sznajder’s (@ShutdownLine) massive tracking project.
You can find both sets of visuals at the links below:
In the first set of visuals, you will find 4 leaderboards. Players are ranked in the 5v5 stats listed below.
5v5 Entries -- How often players enter the offensive zone by making a clean pass to a teammate (Entry passes/60) or by carrying the puck across the blue line themselves (Carry-ins/60).
Other notes: The best way to enter the zone is to enter with possession of the puck (Entry passes + Carry-ins, as discussed above). These types of entries are called Possession Entries. Although other types of attempts are included in the leaderboard as well, players are automatically sorted by Possession Entries/60 because these alternative attempts are less than ideal. If you decide to change this, use the “Sort By (Entries)” filter to rank the players in other ways.
5v5 Exits -- This is the same as 5v5 entries, except at the blue line separating the defensive zone from the neutral zone. Players are ranked based on how often they transition the puck from the defensive zone into the neutral zone either by carrying it (Carries/60) or by passing it to a teammate (Exit Passes/60).
Other notes: Like 5v5 entries, the best ways to exit the defensive zone are classified as Possession Exits. This is why players are automatically sorted by Possession Exits/60. Again, the “Sort By (Exits)” filter will let you change how the leaderboard is sorted.
5v5 Entries per Target (5v5 Entry Def %) -- This stat measures defence at the blue line. It answers the question: When a defender is in proximity to an attempted zone entry, how often does he stop the attempt?
Other Notes: It is important to note that a “defender” is any player on the team playing defence (i.e. the team without the puck). Forwards are included in this definition of defender, but the best way to use this leaderboard is to judge defensemen only. This is why forwards are automatically filtered out of the leaderboard, but you can always change this using the filter if you wish.
5v5 Shots and Passes -- Players are ranked based on how often they contribute to shots. Players contribute to shots by being the shooter or by making one of the three passes immediately before the shot, in the same way that they earn points by scoring a goal or by making one of the two passes immediately before the goal.
If you want a closer look at certain groups of players, the filters allow you to look at players who play certain positions (forwards/defencemen) and players who play on certain teams. In the screenshot below, for example, I filtered the 5v5 Entries leaderboard to see what it looks like for forwards on the Oilers:
You can use these leaderboards to judge offence (5v5 entries, 5v5 shot contributions), and defence (5v5 exits, 5v5 Entry Def %). Ultimately, these four leaderboards will help you identify the best and worst players in these areas.
In order to focus on one or two players, you should use the second set of visuals: The A3Z Player Comparison Tool. While HERO charts allow for player comparisons in stats collected by the NHL, this visualization was designed to help you judge players based on their performance in several stats from the tracking project. Instead of standard deviations, however, the measurement of choice in this comparison tool is percentiles. So keep in mind that “100” means the result is better than 100% of the other results. You can view a player’s results in two 1-year windows and one 2-year window, covering the 2016-17 season and the 2017-18 season. Here’s a two-year snapshot of how Erik Karlsson and Sidney Crosby rank in some of these key stats:
You probably noticed that the stats for forwards and defencemen are slightly different. The only difference is that defencemen have three extra categories, which measure their ability (or lack thereof) to defend their own blue line (i.e. their 5v5 Entries per Target, as discussed in the previous section). You may have also noticed some useful information hidden beneath each player’s name, including the numbers of games and minutes that have been tracked for the player. Although the numbers in the screenshot above are from two seasons, another thing to keep in mind is that you can also track a player’s development over two seasons by looking at their stats in one-year windows. To see what I mean, take a look at Nikita Zaitsev’s numbers in two consecutive seasons:
Visualizing the dramatic fall of Nikita Zaitsev in this way is an excellent starting point for further analysis. Likewise, you can also compare two different players in the same season or over two seasons. This is, after all, a Player Comparison Tool. Other common uses for both sets of A3Z visualizations are to identify strengths and weaknesses of certain players, to evaluate potential acquisitions, to design the optimal lineup for your favourite team, and many more.
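The percentile scale used by the Player Comparison Tool can be illustrated with a short Python sketch. This is one simple definition of a percentile (the share of other results that a value beats), chosen to match the reading that "100" means better than 100% of the other results. The tool's exact method may differ, and the function name and sample values are hypothetical.

```python
def percentile_rank(value, other_values):
    """Percentage of the other results that this value is strictly better than."""
    if not other_values:
        raise ValueError("need at least one other result to compare against")
    beaten = sum(1 for other in other_values if other < value)
    return 100.0 * beaten / len(other_values)


# Invented ShotContr60 values for four other players, plus our player at 15
others = [10, 12, 14, 16]
print(percentile_rank(15, others))
```

Unlike a standardized rating, a percentile says nothing about how large the gap between players is, only how many of them a player ranks ahead of.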
Of course, there are countless other useful terms and concepts to consider in analytics, like relative stats, shot quality, and expected goals (xG), which we’ll be touching upon more in-depth in future articles. If you’re interested in advanced stats and would like to learn more, we’ll be putting out more content on exciting topics in hockey analytics over the coming months, so stay tuned.