Major League Baseball Capacity Rates: Analyzing Fan Attendance Using Panel Data by Anthony Turgelis

By: Cody Smith and Patrick Mills

A note from the article’s editor (Anthony Turgelis): This article will read somewhat differently from the rest of our content, as this was a detailed final project for a 4th year Applied Econometrics course at Queen’s. I hope you enjoy the format, as it is more typical for analytics research projects.


Major League Baseball has been recognized as a staple of American culture since its inception in 1869. Over generations, the League has transformed from humble beginnings into an organization with 30 franchises across North America, each playing 162 games per season. Over this period, franchises have experienced different levels of success in attracting fans to games, and this variation has drawn the attention of many economists looking for an explanation. Ahn and Lee (2014) found that from 1904 to 1957, fans were drawn to the teams with the best winning records. This period was followed by a change in preference, which Ahn and Lee identified as a desire for competitive games (1958-2012). Now, in the modern era, sports fans have a variety of professional sports to watch, as well as the ability to follow and support teams from a distance thanks to increased media coverage and reduced costs of transportation. Just as retail companies rely on branding to differentiate themselves and attract customers, branding is likely to become increasingly important for MLB franchises as they compete with a growing number of substitute franchises across sports to fill seats at games. This paper will examine the effect of Brand Equity Value (BEV) on capacity rates across MLB franchises. Our computation of BEV will be explained in our description of data. If the relation between BEV and capacity rates is found to be statistically significant, it could have major implications, as it would provide compelling evidence of a new shift in fan preferences. Franchises would have to choose how to market themselves to stimulate interest among local and distant baseball fan bases alike. Evidence for a shift in fan preferences would also give the MLB considerations to weigh when creating policy to promote and protect the league’s growth and popularity.

The remainder of this paper is structured as follows. The second section provides a commentary on existing literature discussing factors of MLB attendance. The third section describes the sources of the data used in this paper and provides further explanation of BEV. The fourth section explains the empirical model used in our regression analysis, as well as why it was chosen. The fifth section presents our regression tables with a commentary on the results. Finally, the sixth section concludes with a summary of this paper’s findings and the importance and implications of these results.

Literature Review

Economists have published a collection of articles speculating over the key determinants of attendance for Major League Baseball franchises. Nesbit and King (2012) found that fans who play fantasy baseball are more likely to attend games, while Mittelhammer, Fort, et al. (2007) found that increasing proximity between teams had an adverse effect on attendance, in accordance with Hotelling’s model that consumers buy goods from the closest supplier. Ahn and Lee (2014) concluded that in the earlier years of baseball (1904-1957) fans had been drawn to teams with winning records (the better the record, the higher the turnout), before a new era of baseball (1958-2012) saw fans drawn to games based on uncertainty of outcome, the size and quality of the stadium, and the playing styles of the teams. Throughout the literature, however, one overarching determinant recurs: the maintenance of competitive balance in the league. Berri and Schmidt (2001) investigated this matter and concluded that as the league became more competitive, attendance could be expected to increase. Lemke, Leonard, et al. (2010), who attempted to establish a relation between attendance and promotions and giveaways in small and large markets, found competitive balance to be a significant factor of attendance. Ahn and Lee (2014) reached the same conclusion. Economists approaching the subject seem to agree that competitive balance is essential to the interest of fans as well as the financial health of the league. Lee (2016) showed that fans consider characteristics of both home and away teams when making attendance decisions.
In the same report, Lee also hypothesized that modern era baseball fans had less incentive to cheer for the home team, citing the development of media and greater access to information, mobility in residence and reduced transportation costs as factors that allow fans to choose the team they want to support, rather than a team in close proximity. While this is still consistent with competitive balance, if true, previous methods of incentivizing local fans to attend games would prove to be less effective.

Description of Data

The primary data used in our regression model were obtained from ESPN, Sports Reference, Statistics Canada, Statista and the United States Census Bureau. Supplementary information from Boston Globe Media Partners examined MLB ticket prices, and data presented by Forbes Media were used to establish franchise valuations. Data from ESPN and Sports Reference provided information on stadium capacity rates, strength of schedule, win rate over .500, estimated payroll, home runs per game, and pace of games. The United States Census Bureau and Statistics Canada data were used to establish the demographic parameters of population and median household income. Data from Statista were referenced to establish ticket prices. We also accounted for the arrival of professional sports teams in cities with MLB franchises by including a dummy variable. If a city gained a franchise in a big four league (NHL, NFL, NBA, MLB), the existing MLB franchise was given a value of 1, while MLB teams in cities that did not gain a professional team were given a value of 0. Those in cities that lost a franchise were given a value of -1. We also assigned a dummy variable to account for MLB franchises that moved into a new stadium over the period of examination. Teams that moved were assigned a value of 1, while teams that did not were assigned a value of 0. It is important to note that this is a potential source of error in our paper, as teams that moved from a larger stadium to a smaller stadium will have experienced an increase in stadium capacity rate without a real increase in game attendance. Data from these sources were used to collect relevant information for all 30 MLB teams in the years 2008 and 2018, for a total of 60 observations.

Our primary measure deals with the evaluation of brand equity and is represented by the variable BEV. To quantify this qualitative concept, we standardized four separate variables and used the average of the standardized values to compute each franchise’s BEV statistic. As we only took data from two separate years, it was only possible to determine the effect of BEV in 2018. The four variables used for our BEV statistic are as follows:


1. Market share – Presence in the market (2018 season)
   a. % of market share per team
   b. Individual team valuation / Sum of all MLB team valuations

2. Transaction value – Price offered for service (2018 season)
   a. Average ticket price per team

3. Success generation – Team performance change (2008-2018)
   a. (Win % 2018 season / Win % 2008 season) - 1

4. Growth rate – Team valuation change (2008-2018)
   a. (Team valuation 2018 season / Team valuation 2008 season) - 1
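The construction of BEV described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "standardized" is taken to mean z-scores computed across teams, the population standard deviation is used, and the field names and the three teams with their figures are made up for the example rather than taken from the paper's data.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; the paper does not say which
    return [(v - mu) / sigma for v in values]


def brand_equity_values(teams):
    """Return {team: BEV}, the mean of the four standardized components."""
    names = list(teams)
    league_value = sum(teams[n]["value_2018"] for n in names)
    components = [
        # 1. Market share: team valuation over the sum of all valuations
        [teams[n]["value_2018"] / league_value for n in names],
        # 2. Transaction value: average ticket price
        [teams[n]["ticket_2018"] for n in names],
        # 3. Success generation: relative change in win percentage
        [teams[n]["win_pct_2018"] / teams[n]["win_pct_2008"] - 1 for n in names],
        # 4. Growth rate: relative change in team valuation
        [teams[n]["value_2018"] / teams[n]["value_2008"] - 1 for n in names],
    ]
    standardized = [zscores(c) for c in components]
    return {n: mean(col[i] for col in standardized) for i, n in enumerate(names)}


# Hypothetical figures for three made-up teams (valuations in $ millions).
example_teams = {
    "Team A": {"value_2018": 3000, "value_2008": 1000, "ticket_2018": 50,
               "win_pct_2018": 0.600, "win_pct_2008": 0.500},
    "Team B": {"value_2018": 1500, "value_2008": 750, "ticket_2018": 30,
               "win_pct_2018": 0.500, "win_pct_2008": 0.500},
    "Team C": {"value_2018": 1000, "value_2008": 800, "ticket_2018": 20,
               "win_pct_2018": 0.400, "win_pct_2008": 0.500},
}
bev = brand_equity_values(example_teams)
```

Because every component is centered at zero before averaging, BEV values computed this way sum to zero across the league, so a positive BEV reads as "above the league average brand" and a negative one as below it.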

Table 1. Definition of Variables


Figure 2. Summary Statistics


Empirical Model

To estimate the determinants of capacity rate, we have chosen to use a panel data regression. To decide between a fixed effects and a random effects model, a Hausman test was conducted. As the test yielded Prob>Chi2 greater than 0.05, we could not reject the null hypothesis that the random effects estimator is consistent, and we chose to use a random effects GLS regression. This method allows us to compare common factors of short-run demand to determine capacity rates in our given seasons, 2008 and 2018.
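For readers unfamiliar with the Hausman test, its usual textbook form is given here as a reference. The notation is ours, not the paper's: the two estimators are the fixed effects and random effects coefficient vectors on the k regressors that both models can estimate.

```latex
H = \left(\hat{\beta}_{FE} - \hat{\beta}_{RE}\right)'
    \left[\widehat{\operatorname{Var}}\!\left(\hat{\beta}_{FE}\right)
        - \widehat{\operatorname{Var}}\!\left(\hat{\beta}_{RE}\right)\right]^{-1}
    \left(\hat{\beta}_{FE} - \hat{\beta}_{RE}\right)
    \;\sim\; \chi^2_{k} \quad \text{under } H_0
```

Under the null hypothesis the franchise-specific effects are uncorrelated with the regressors, so both estimators are consistent and random effects is the more efficient choice. A p-value above 0.05 means the null is not rejected at the 5% level.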

Equation (1) presents our basic empirical model of MLB game attendance:


Table 1, included in our description of data, provides an explanation for each variable included in our empirical model. Using a random effects model in this instance is useful, as the variation across franchises in our model is assumed to be random and uncorrelated with the independent variables included. Random effects assume that the franchise-specific error term is not correlated with the predictors, and under this assumption a random effects model will produce unbiased estimates of the coefficients, use all the data available, and produce the smallest standard errors. After running our random effects GLS regression, we ran a simple OLS regression to determine how much of the variation in capacity rates could be explained by our Brand Equity Value variable, BEV, which serves as our primary parameter of interest.
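As a point of reference, a random effects specification generally takes the form below. This is a generic sketch in our own notation, not a reconstruction of the paper's Equation (1): the vector of regressors stands in for the variables defined in Table 1, and the two error components are the franchise-specific effect and the idiosyncratic error.

```latex
\begin{aligned}
\text{CapacityRate}_{it} &= \alpha + \mathbf{x}_{it}'\boldsymbol{\beta} + u_i + \varepsilon_{it}, \\
\operatorname{Cov}(u_i, \mathbf{x}_{it}) &= 0, \qquad i = 1, \dots, 30, \quad t \in \{2008, 2018\}.
\end{aligned}
```

The second line is the assumption discussed above: the franchise effect is treated as random and uncorrelated with the regressors, which is what separates random effects from fixed effects.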

Results and Discussion

After running our random effects GLS regression, we found three variables to be statistically significant: estimated payroll for the season, home runs per game and franchise value. Our R-squared was 0.6407, indicating that the model accounts for roughly 64% of the variation in capacity percentage across MLB teams.


After running this regression, we tested our BEV statistic against the 2018 results to determine how much of the effect could be attributed to Brand Equity Value.


We found BEV to be statistically significant, with an R-squared value of 0.5196, meaning that about 52% of the variation in capacity percentage across MLB franchises can be accounted for by our constructed BEV variable. To the best of our knowledge, this paper is the first economic evaluation of the effects of brand equity on capacity percentage, which is a direct measure of fan demand for a franchise. Our findings from our random effects GLS regression were consistent with existing literature, as estimated payroll, home runs per game and franchise value have consistently been significant indicators of attendance.
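To make the R-squared interpretation concrete, here is a minimal sketch of a one-regressor OLS fit and its R-squared in plain Python. The BEV and capacity figures are hypothetical and chosen only for illustration; they are not the paper's data and do not reproduce the 0.5196 result.

```python
def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares; return (slope, intercept, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R-squared: share of the total variation in y that the fitted line explains
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical BEV scores and capacity rates for five made-up teams.
example_bev = [-1.0, -0.5, 0.0, 0.5, 1.0]
example_capacity = [0.55, 0.60, 0.70, 0.72, 0.85]
slope, intercept, r_squared = simple_ols(example_bev, example_capacity)
```

An R-squared of 0.52 from such a fit would mean that just over half of the team-to-team variation in capacity rate lines up with differences in BEV, with the remainder left to other factors and noise.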


The purpose of this paper was to establish an understanding of the impact of different variables on the capacity rate of MLB franchises. We took a collection of data from the 2008 and 2018 MLB seasons, along with corresponding demographic data, and ran a random effects GLS regression, followed by a linear regression to determine how much of the variation in capacity rates could be explained by our variable of interest, BEV. We found three variables in our random effects model to be statistically significant: estimated payroll, home runs per game, and franchise value. Our findings in this regression are consistent with the literature. The results of our linear regression supported our hypothesis that Brand Equity Value plays a statistically significant role in determining capacity rates for franchises across the MLB. These findings are important as they indicate that fan preferences may be changing again, an observation that has been made in the literature over different periods. A change in fan preferences will have implications for economic and policy decisions as franchises attempt to stimulate fan interest by providing different amenities and incentives to differentiate themselves from the competition. The success of these efforts has implications for the municipalities and regions that generate tax revenue from the operations of these franchises, and could affect future decisions regarding expansion and relocation of MLB franchises.

What Makes a Top 10 Pitcher? by Alex Craig

By: Josh Margles

In baseball statistics, earned run average (ERA) is the mean number of earned runs given up by a pitcher per nine innings pitched. I decided to take a deeper look at what goes into a pitcher’s ERA. In this study, I divided all the qualified pitchers from the last five years into two groups, top 10 in ERA and outside the top 10, as a means to determine what makes a top 10 ERA pitcher.
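The ERA definition above works out to a one-line formula. Here is a small sketch in Python; the function name and the sample numbers are ours, chosen only for illustration.

```python
def earned_run_average(earned_runs, innings_pitched):
    """ERA = 9 * earned runs / innings pitched.

    innings_pitched must be a true fraction of innings (e.g. 200 + 1/3),
    not the box-score shorthand "200.1".
    """
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return 9 * earned_runs / innings_pitched


# A hypothetical pitcher who allows 60 earned runs over 200 innings.
example_era = earned_run_average(60, 200)  # 2.70
```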

Using four indicators (strikeout percentage, walk percentage, left on base percentage, and BABIP, or batting average on balls in play), we can estimate the probability that a pitcher will finish in the top 10 in ERA. I ranked all the pitchers in each of the last five seasons by these categories and put their rankings into a large matrix. To indicate whether they finished in the top 10 in ERA, I entered a 1 for a top 10 finish and a 0 for a finish outside the top 10. I used each pitcher’s yearly rank instead of their actual numbers because each year’s top 10 is different. Therefore, it is important to compare numbers on a year-to-year basis.
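The ranking step described above can be sketched in plain Python. Several details are assumptions, since the article does not spell them out: the field names are invented, rank 1 is the best value in each category (high for strikeout and left on base percentage, low for walk percentage and BABIP), tied pitchers share the better rank, and pitcher names are unique within a season. The three pitchers are made up.

```python
from collections import defaultdict

# True means a higher value earns a better (lower-numbered) rank.
HIGHER_IS_BETTER = {"k_pct": True, "bb_pct": False, "lob_pct": True, "babip": False}


def rank_within_season(rows, stat, higher_is_better):
    """Return {pitcher: rank} for one season; rank 1 is best, ties share a rank."""
    ordered = sorted(rows, key=lambda r: r[stat], reverse=higher_is_better)
    ranks = {}
    for position, row in enumerate(ordered, start=1):
        previous = ordered[position - 2] if position > 1 else None
        if previous is not None and previous[stat] == row[stat]:
            ranks[row["pitcher"]] = ranks[previous["pitcher"]]  # tie: reuse the rank
        else:
            ranks[row["pitcher"]] = position
    return ranks


def build_rank_matrix(rows, top_n=10):
    """One output row per pitcher-season: four stat ranks plus a 0/1 top-ERA label."""
    by_season = defaultdict(list)
    for row in rows:
        by_season[row["season"]].append(row)
    matrix = []
    for season, season_rows in sorted(by_season.items()):
        stat_ranks = {stat: rank_within_season(season_rows, stat, hib)
                      for stat, hib in HIGHER_IS_BETTER.items()}
        era_ranks = rank_within_season(season_rows, "era", higher_is_better=False)
        for row in season_rows:
            name = row["pitcher"]
            entry = {"season": season, "pitcher": name}
            for stat in HIGHER_IS_BETTER:
                entry[stat + "_rank"] = stat_ranks[stat][name]
            entry["top_era"] = 1 if era_ranks[name] <= top_n else 0
            matrix.append(entry)
    return matrix


# Three made-up pitcher-seasons; top_n=1 keeps the toy example readable.
example_rows = [
    {"season": 2018, "pitcher": "P1", "k_pct": 30.0, "bb_pct": 5.0,
     "lob_pct": 80.0, "babip": 0.270, "era": 2.50},
    {"season": 2018, "pitcher": "P2", "k_pct": 25.0, "bb_pct": 7.0,
     "lob_pct": 75.0, "babip": 0.290, "era": 3.50},
    {"season": 2018, "pitcher": "P3", "k_pct": 20.0, "bb_pct": 6.0,
     "lob_pct": 70.0, "babip": 0.300, "era": 4.00},
]
example_matrix = build_rank_matrix(example_rows, top_n=1)
```

Ranking within each season, rather than across all five at once, is what keeps a pitcher from being rewarded or penalized for league-wide changes in strikeout or walk rates between years.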

Part of the matrix looks like this:

[Screenshot: sample rows of the ranking matrix]

To make predictions, I used the XGBoost package in R. XGBoost learns from the existing data and tests whether there is a pattern between where a pitcher ranked in each category and whether he finished in the top 10 in ERA that season. After running the numbers with different parameters in XGBoost, we can determine two things: which of the four stats is most indicative of a high ERA rank, and which pitchers were outliers (cases where the model’s prediction did not match the actual outcome).

First, let’s look at which stat is the most predictive in determining the rank. Surprisingly, LOB rank has the most impact on a pitcher’s ERA rank. Note that these aren’t percentages; rather, they show the relative importance of each stat in predicting ERA.

[Screenshot: relative importance of LOB rank, K rank, BB rank and BABIP rank]

This chart shows that where a pitcher finishes in LOB percentage is the best predictor. Interestingly enough, the pitcher with the highest LOB percentage (he left the highest percentage of runners on base) in each of the last five years finished in the top 10 in ERA. Also, of the pitchers who finished in the top five in LOB percentage, 20 out of 27 (there was one three-way tie) finished in the top 10 in ERA. The chart also shows that LOB rank and K rank are much more important than BB rank or BABIP rank.

Next, let’s look at the predictive aspect of the model. I ran the model using a number of different combinations of test and training data, and then had it predict on the pitchers. The model classified around 85 percent of the pitchers correctly. Now, let’s look at a few pitchers that the model predicted incorrectly and why those predictions missed.

[Screenshot: Garrett Richards, 2014]

Garrett Richards finished the 2014 season with a 2.61 ERA, which placed him 10th in MLB. However, the model predicted that Richards would finish outside of the top 10 with those ranks. One explanation for why Richards finished with a good ERA is his home run rate. He had a 0.27 HR/9 rate in 2014, which was the lowest of any qualified pitcher in the last five years. So, while he allowed a lot of baserunners, few of them scored because he kept the ball in the yard. Richards has been injured the last few years, but his success has been almost completely related to his home run rate.

[Screenshot: Marcus Stroman, 2017]

Marcus Stroman had an ERA of 3.09 in 2017, which placed him 9th. What Stroman lacked in strikeouts, he made up for in his ground ball to fly ball rate, as well as his ground ball percentage. This allowed Stroman to get easy outs without needing to strike everyone out. Since he got so many ground balls, most of the hits he gave up were singles, which limited the number of earned runs. He also induced the most double plays in 2017, which helped him get out of innings without allowing any earned runs.

[Screenshot: Max Scherzer, 2015]

One problem with this model is that it treats everyone outside the top 10 as equals. In 2015, Max Scherzer had a 2.79 ERA, which was the 11th best in MLB. Even though he finished with a great ERA, he missed the top 10 because of the number of home runs he allowed. He gave up 31 HR, which was the most in the NL. Even though he finished in the top 10 in these four stats, his home runs kept him out of the top 10 in ERA.

[Screenshot: Mike Fiers, 2018]

One of the more interesting results was that the model projected Mike Fiers in the top 10 even though he had a 3.56 ERA, finishing 24th in 2018. His LOB rank was very good, yet he still consistently gave up runs because he allowed the second-highest HR/9 of anyone in MLB. While the rest of his numbers look good, home runs prevented Fiers, like Scherzer, from having an elite ERA.


Stats from, and