Objective
The objective of this experiment is to use correlation analysis to determine which team batting statistic is the best predictor of a baseball team's run-scoring ability.
Introduction
Baseball is an interesting combination of individual and team effort. For example, there is the one-on-one duel of pitcher against batter. But once the batter reaches base, he needs his teammates to follow with hits (or "productive outs") in order to move him up the bases so that he can score. From the scientific side, an interesting aspect of baseball is the rich trove of statistics on nearly every aspect of the game.
In this project, you will learn about correlation analysis, a statistical method for quantifying the relationship between two variables. As an example, consider as our two variables the age and height of male students in an elementary school. In general, individuals in this age range grow taller every year. If we made a scatterplot with height as our y-axis and age as our x-axis, we would expect the data points to show a consistent upward trend, with height increasing steadily along with age. The graph below shows simulated data (based on average growth charts).
As a second example, suppose that we graph height as a function of birth month instead of age. Would you expect to find a correlation? Here is the same simulated height data, graphed now as a function of birth month (randomly assigned).
To convince you that it is the same data, here is the same graph, with the different age groups (shown by grade level, K–6) each assigned a different symbol. You can clearly see the difference in average height of the different grade levels.
The statistic that describes this relationship between two variables is the correlation coefficient, r (or, more formally, the "Pearson product-moment correlation coefficient"). It is a scale-independent measure of how two measures co-vary (change together). The correlation coefficient ranges between −1 and +1.
What do the values of the correlation coefficient mean? Well, the closer the correlation coefficient is to either +1 or −1, the more strongly the two variables are correlated. If the correlation coefficient is negative, the variables are inversely correlated (when one variable increases, the other decreases). If the correlation coefficient is positive, the variables are positively correlated (when one variable increases, the other increases also). How close to +1 or −1 does the correlation coefficient need to be in order for us to consider the correlation to be "strong"? A good method for deciding this is to calculate the square of the correlation coefficient (r 2) and then multiply by 100. This gives you the percent variance in common between the two variables (Rummel, 1976). Let's see what this means by calculating r 2 over the range from 0 to +1. (Note: for the corresponding values of r between 0 and −1, r 2 will be the same, since squaring a negative number results in a positive number.)
r | r 2 | % variance in common |
1.00 | 1.00 | 100 |
0.90 | 0.81 | 81 |
0.80 | 0.64 | 64 |
0.70 | 0.49 | 49 |
0.60 | 0.36 | 36 |
0.50 | 0.25 | 25 |
0.40 | 0.16 | 16 |
0.30 | 0.09 | 9 |
0.20 | 0.04 | 4 |
0.10 | 0.01 | 1 |
0.00 | 0.00 | 0 |
As you can see from the table, r 2 decreases much more rapidly than r. When r = 0.9, r 2 = 0.81, and the variables have 81% of their variance in common. When r = 0.7, that might seem like a fairly strong correlation, but r 2 has fallen to 0.49. The variables now have just less than half of their variance in common. By the time r 2 has fallen to 0.5, r 2 = 0.25, so the variables have only one-fourth of their variance in common.
For our simulated height data, the correlation coefficient for height vs. age was 0.88, indicating that age and height share 77% of their variance in common. In other words, 77% of the "spread" (variance) of the height data is shared with the "spread" of the age data. For height vs. birth month, the correlation coefficient was 0.03, so, to two decimal places, r 2 = 0.00. There is no correlation between the variables (as we suspected).
It is important to remember that correlation does not imply that one variable causes the other to vary. Correlation between two variables is a way of measuring the relationship between the variables, but correlation is silent about the cause of the relationship.
If the correlation coefficient is exactly ±1, then the two variables are perfectly correlated. This means that their relationship can be described by a linear equation, of the form:
You've probably seen this equation before, and you may remember that m is the slope of the line, and b is the y-intercept of the line (where the line crosses the y-axis). If two variables are strongly correlated, it is sometimes valuable to use the linear equation as a method for predicting the value of the independent variable when we know the value of the dependent variable. This method is called linear regression.
Let's look again at the scatterplot of simulated height vs. age for elementary school students. If we draw a "best fit" line through the points, our scatterplot looks like this:
A "best fit" line means the line that minimizes the distance between the line and all of the data points in the scatterplot. If you wanted to predict a boy's height, and all you knew was his age, using this line to make a prediction would be your best guess. A spreadsheet program (like Excel) can do this "best fit" calculation for you, and help you get started with making a graph of the data and the regression line. You can also make a graph of the "residuals," which shows the distance of each data point from the regression line. Here is an example of a residuals graph, again using our simulated height vs. age data:
The residuals plot essentially rotates the linear regression plot by 45°, making it easier to compare how the data points are distributed around the regression line. It is easier to make the comparisons when the regression line has a slope of zero. The vertical scale can also be expanded, since the data is now centered within the area of the graph. If you see patterns in the residuals plot, these are features of the data that are not explained by correlation between the two variables.
This project will use correlation analysis to determine which team batting statistic is the best predictor of a baseball team's run-scoring ability (Albert, 20003). In addition to standard batting statistics, you'll also use batter's runs average (BRA), total average (TA), and runs created (RC). Each of these is defined in the Experimental Procedure section, where you can learn how to program them in to a spreadsheet with a formula.
There are many possible variations to this project that could apply similar methods, or extend them further for a more advanced project. See the Variations section below for some ideas. No doubt you can also come up with your own. You can also check out the book on which this project is based, Teaching Statistics Using Baseball, by Jim Albert.
Terms, Concepts and Questions to Start Background Research
To do this project, you should do research that enables you to understand the following terms and concepts:
- baseball batting statistics:
- hits (H),
- doubles (2B),
- triples (3B),
- walks (BB),
- strikeouts (SO),
- batting average (BA),
- on-base percentage (OBP),
- slugging percentage (SLG),
- batter's runs average (BRA),
- total average (TA),
- runs created (RC).
- correlation coefficient (or Pearson product-moment correlation coefficient),
- linear regression.
Questions
- If you find a correlation between two variables in your data set, can you conclude that one of the variables causes the other to change in a predictable way?
Bibliography
- Batting statistics are defined here:
Forman, S.L., 2006. "Batting Glossary," Baseball-Reference.com - Major League Statistics and Information [accessed March 3, 2006] http://www.baseball-reference.com/about/bat_glossary.shtml. - Here are two starting points for your background research on statistics:
- Wikipedia contributors, 2006. "Correlation," Wikipedia, The Free Encyclopedia [accessed March 3, 2006] http://en.wikipedia.org/w/index.php?title=Correlation&oldid=42516704.
- Rummel, R.J., 1976. "Understanding Correlation, Chapter 4.3, Interpreting the Correlation: Correlation Squared" Department of Political Science, University of Hawaii [accessed March 6, 2006] http://www.mega.nu:8080/ampp/rummel/uc.htm#S4.3.
- The following sites are good sources for baseball statistics.
- This project uses annual team batting statistics from baseball-reference.com:
Forman, S.L., 2006. "League Index," Baseball-Reference.com - Major League Statistics and Information [accessed March 3, 2006] http://www.baseball-reference.com/leagues/. - Here is another site where you can download historical baseball statistics:
Lahman, S., 2006. "The Baseball Archive," [accessed March 3, 2006] http://www.baseball1.com/.
- This project uses annual team batting statistics from baseball-reference.com:
- Here is an Excel tutorial to get you started using a spreadsheet program:
James, B., date unknown. "Excel 101," University of South Dakota, [accessed March 3, 2006] http://www.usd.edu/trio/tut/excel/. - If you'd like more ideas for exploring baseball statistics, check out the book this project is based on:
Albert, Jim, 2003. Teaching Statistics Using Baseball. Washington, D.C.: The Mathematical Association of America.
Materials and Equipment
To do this experiment you will need the following materials and equipment:
- a computer with Internet access and a spreadsheet program (the example below uses Microsoft Excel (Office 2003 version), similar functionality is probably available in other spreadsheet programs),
- a printer.
Experimental Procedure
- Do your background research so that you are knowledgeable about the terms, concepts and questions above.
- If you are not familiar with using a spreadsheet program, be sure to take the time to go through the Excel tutorial listed in the Bibliography.
- Here is a short version of the data analysis steps you'll be following in order to find which batting statistic correlates best with run production. Use the links to jump down to the detailed sections, below. Use your browser's "back" button to return to this brief list:
- Download the historical data and import into Excel. (Downloading and Importing Data)
- Format the data for statistical analysis. (Formatting the Data)
- Add derived statistical measures (RC, TA, BRA or any others of your choosing) for testing. (Adding Derived Statistical Measures)
- Use the spreadsheet's correlation and linear regression analyses to see how the various batting statistics correlate with runs scored. (Running Correlation and Linear Regression Analysis)
- Compare the results. (Comparing the Results)
- If spreadsheets are something new for you, then the detailed explanations that follow should help. If you are already comfortable with using spreadsheets, then you should be in good shape on your own.
Downloading and Importing Data
- Download annual team battings statistics from Baseball-Reference.com: http://www.baseball-reference.com/leagues/.
- The data is arranged by year, and by league—National (NL) or American (AL).
- Choose the year and click on the league link (AL or NL) to get the team statistics for that league and year.
- Click and drag your mouse to highlight the batting statistics table, then copy and paste the table into Notepad.
- Repeat steps b and c for the other league. You can put both tables into the same text file, just include a blank line in between them.
- Save the tables as a text file (use the extension ".txt").
- The "Glossary" link (visible just above the table of team batting stats) has explanations of the batting statistics and their abbreviations.
- Import the saved batting data into your spreadsheet program. Here's how to do it in Excel:
- From the menu, select File/Open.... You'll see a dialog like the one below.
- At the bottom of the File Open dialog, under "Files of type:" use the drop-down list to select "Text Files (*.prn, *.txt, *.csv)".
- Navigate to the directory where you saved your batting data file, select it, and click "Open."
- Excel now takes you through the "Text Import Wizard," a series of three dialogs. The first dialog looks like this (team batting data for 2005 shown):
- Make sure "Fixed width" file type is selected (as above), then click "Next."
- The second Text Import Wizard dialog is used to set the field widths. It looks like this:
- The lines with arrows show where Excel will be breaking the data into columns. Check to make sure that each of the data columns has been recognized (use the horizontal and vertical scroll bars to view all of the columns).
- For the team batting data, you'll probably find that you need to add a column break for the third-from-last data column (the cursor points to the spot in the image below). The dialog box has instructions for adding, moving and deleting column breaks.
- When all of the data columns are set to your satisfaction, click "Next".
- The final step of the Text Import Wizard is to select data formats for each of the columns, as shown below.
- The default selection, "General", is what you want for any column with numerical data. For columns containing only text (like the team names), you can select "Text", but this is optional. Again you can use the horizontal and vertical scroll bars to examine the data columns and make sure that all the data types are set properly.
- When you are satisfied with your selections, click "Finish" to import the data.
- From the menu, select File/Open.... You'll see a dialog like the one below.
Here are some tips for getting your data organized before analyzing it. You'll learn how to remove unwanted rows (or columns), how to change the order of data columns, and how to freeze the column labels, so that they remain visible even when vertically scrolling the data table.
- Removing unwanted rows (or columns). Sometimes an imported data file contains extra rows or columns that you don't need for your analysis. For example, with the team batting statistics, the second row is a dividing line made with characters. Since some of the analysis features we'll be using later require the data to be in contiguous blocks, we'll want to remove these extra rows. This is really simple. Here's how to do delete rows (or columns) in Excel:
- Right-click on the number(s) of the row(s) (or letters of the columns) you want to delete.
- The entire row (or column) will be highlighted, indicating that it is selected, and you will see a popup menu, as shown below.
- Select "Delete" from the popup menu (as shown above), and you're done.
- You'll want to delete any row that does not contain team batting data (except for row 1, which contains the column labels).
- Rearranging columns. For the correlation and linear regression analysis we'll be running later, Excel requires the data of interest to be in contiguous blocks. We are interested in measuring the correlation between runs scored and batting statistics such as hits, doubles, triples, batting average, etc. The columns are arranged so that two data columns—games (G) and at-bats (AB)—separate runs per game (R/G) and runs (R) from the hitting stats of interest. If you want, you can simply delete the games and at-bat columns (as described above, 1a–d). Alternatively, you can move the columns. Here's how to rearrange data columns in Excel:
- First, you need to create two new columns to move the data into. We'll add them to the left of the runs/game column.
- Right-click on the column letters to select columns B and C.
- The entire columns will be highlighted, indicating that they are selected, and you will see a popup menu, as shown below.
- Select "Insert" from the popup menu (as shown above), and two empty columns are inserted to the left of your selection. (If you want to insert more columns, select more to start with.)
- Now select the games and at-bats columns, then cut-and-paste them to the new location.
- Finally, delete the now-empty columns and you're done.
- Freezing the header row (data labels). If your data table has too many rows to fit on your screen, it can be nice to have the column labels stay put when you scroll vertically. That way, you can still see what data is in each column. Here's how to freeze your column labels in Excel:
- Select the row below your column labels.
- From the menu, select "Window/Freeze Panes." That's it!
Adding Derived Statistical Measures
In this section you will be adding derived statistics—those that are calculated from other statistics in your table. In addition to the derived statistics below, you can include other measures that you found in your background research, or you can try to create your own derived statistic.
- First, insert at least three additional columns between the OPS+ and hmR/G columns. You'll be adding the three derived statistics mentioned in the Introduction: BRA, TA and RC.
- Batter's runs average (BRA), is the product of OBP and SLG. Here's how to enter it in Excel:
- All formulas in Excel start with the equals sign: "=".
- The arithmetic operators for formulas are: "+", "-", "*", and "/", for addition, subtraction, multiplication and division, respectively.
- On our example spreadsheet, OBP is in column "M" and SLG is in column "N".
- So to enter the formula for BRA for row 2, you would type: "=M2*N2", as shown below:
- Hit "Enter", and Excel calculates the value for you, as shown below.
- Next, copy and paste the formula into the rest of the column.
- Finally, you will want to change the formatting of the column so that only 3 decimal places are displayed (the same number of significant figures as in the operands, OBP and SLG). Right-click to select the entire BRA column, and select "Format cells..." from the popup menu.
- In the Format Cells dialog, Choose the "Number" tab, then select "Number" from the Category list, and change the "Decimal places" value to 3, as shown below:
- Click "OK", and Excel displays BRA to 3 decimal places.
- Total Average (TA), the ratio of the number of bases to the number of outs:
TA = (TB + BB)/(AB - H). There is no column for TB (total bases) in the tables from Baseball-Reference.com. You can easily get TB from SLG, because:SLG = TB/AB ,
so if we multiply both sides by AB, we can get TB:TB = SLG*AB . - Substituting SLG*AB for TB, our formula becomes:
TA = (SLG*AB + BB)/(AB - H). Find the corresponding data columns on your spreadsheet and enter the formula. As you did for BRA, above, copy and paste the TA formula to the rest of the column, and change the number display to 3 decimal places. - Runs Created, devised by Bill James. RC = (H + BB)TB/(AB + BB). After the first two examples, you should be able to do this one on your own.
Running Correlation and Linear Regression Analysis
- To run the correlation analysis, use the menu to select "Tools/Data Analysis...". [Note, if this choice is not available, select "Tools/Add-Ins...". Check the "Analysis Toolpak" box, and click "OK". The "Data Analysis..." choice should now be available on the "Tools" menu.]
- In the Data Analysis dialog, select "Correlation" from the list of Analysis Tools (as shown below) then press "OK".
- In the Correlation dialog, there are several pieces of information to fill in:
- You want to measure the correlations between runs scored (R) and each of the batting statistics (all of the columns from hits (H) to runs created (RC). You should include the first row, which contains the column labels. Excel will use the labels to identify the correlation data. Enter this range of columns in the "Input Range" field.
- Make sure that "Columns" and "Labels in First Row" are both checked.
- In the "Output Options" section, it's a good idea to put the output on a new worksheet, which you can also name.
- When you are satisfied with your selections (see the image below as an example) hit "OK" and Excel will add a new worksheet with the correlation analysis results.
- Excel calculates the correlation coefficients for each pair of data columns in the range you supplied, and displays the results in a matrix on the new sheet you selected. Here is an example, using data from the 2005 baseball season:
- The correlation matrix works like a mileage chart in a road atlas. To look up the correlation of batting average (BA) with runs (R), you look down the "R" column until you come to the value in the "BA" row (boxed coefficient in the image above).
- Next you'll do the linear regression analysis. This time you will need to run the analysis separately for each pair of variables (e.g., runs vs. OBP, runs vs. SLG, etc.) you want to test. You can use the results of the correlation analysis, above, to decide which pairs to explore further with linear regression.
- To run the linear regression analysis, use the menu to select "Tools/Data Analysis..." and then scroll down to choose "Regression" in the Data Analysis dialog, as shown below:
- In the Regression dialog, there are several pieces of information to fill in:
- For the "Input Y Range," enter the range of cells for R (runs). Be sure to include the first row, with the column label.
- For the "Input X Range," enter the range of cells for the variable of interest (here, we're going to plot the regression line for R vs. OBP).
- Make sure that the "Labels" box is checked.
- Under "Output options," select "New Worksheet Ply" for the results, and give it a name (here, we're calling the new worksheet "2005_RvsOBP_Regr").
- Under "Residuals," make sure that boxes are checked as shown, so that Excel will automatically create regression and residuals plots for you.
- When you are satisfied with your selections (see the image below as an example) hit "OK" and Excel will add a new worksheet with the linear regression analysis results.
- Excel does a lot for you automatically, but you will still need to tweak the formatting of the results. Here are some suggestions:
- Column widths. When Excel completes the Regression analysis, it will display the new worksheet containing the results, and all of the numerical results will be highlighted, indicating that they are selected. The first thing to do is to adjust the column widths so that you can read all of the headings. From the menu, select Format/Column/AutoFit Selection," and the widths will be set so that you can read everything.
- Next, you will need to reformat both the regression plot and the residuals plot. Think of Excel's automatic graphs as just a starting point (see below).
To really see what's going on, you will want to expand the size of the graphs, and adjust both the x- and y-axis scales (see below). You may want to make other tweaks as well.
- Here are some suggestions for formatting the graphs.
- To change the size of a graph window, click on the "Chart Area" (between the graph ("Plot Area") and the border around it. (If you hold the mouse still over the graph, a tooltip will pop up and tell you where you are.) The border of the Chart Area will be highlighted, and there will be "handles" to click and drag so that you can size the graph window to your liking.
- You can change the size of the "Plot Area" (the graph itself) in a similar manner.
- To make changes to the x- or y-axis, double-click on the axis labels in the graph. You should see a dialog like this one:
- On the "Patterns" tab (see above), change the "Major tick mark type" and "Minor tick mark type" to "Outside," so that the tick marks don't obscure your data points. (Note: when you are formatting the Residuals graph, under "Tick mark labels" select "Low.")
- On the "Scale" tab (see below), adjust the limits so that your data just fits within the graph area. Choose a value for the "Minimum" that is just below your lowest data point, and a value for the "Maximum" that is just above your highest data point. The "Major unit" is the step size for the major tick marks. This value determines the interval for labeling the axis with numbers. The "Minor unit" is the step size for the tick marks falling between numbered values.
- On the "Number" tab (see below), first select "Number" in the "Category" list. Then you can set the number of decimal places to display on the axis labels. Here the axis is OBP (on-base percentage), so we've chosen 3 decimal places, which is the customary way of displaying OBP.
- Push "OK" and you can see the updates on your graph. If you need to make more changes, just double-click on the axis label again. Make these changes for both the x- and y-axis.
- Finally, here are a few more tweaks you can make to the graph: refining the graph title and axis labels, removing the legend, and changing the regression line from a series of symbols to a solid line.
- Right-click on the graph and select "Chart Options...". You'll see a dialog like the one below:
- On the "Titles" tab (see above), you can name the graph, and change the labels for the x- and y-axis.
- On the "Legend" tab (see below), you can de-select "Show legend." Then push "OK."
- Next, point your mouse at one of the data points for the regression line. (If you hold the mouse still over a data point, a tooltip will pop up to tell you which data series the point is from.) Double-click on the point, and the "Format Data Series" dialog will pop up (see below).
- It's best to have the regression line appear as a simple line, not as data symbols. In the "Marker" section, select "None". In the "Line" section, select "Custom", choose a solid line style in black, and then push "OK."
- Here's the finished result (below). Now the data points are clearly visible, as is their relationship to the regression line.
- Right-click on the graph and select "Chart Options...". You'll see a dialog like the one below:
- Use your correlation analysis results to decide which batting statistics are more highly correlated with scoring runs.
- Make a table or graph of your results, using the r2 statistic to interpret the significance of the correlation.
- For each of the highly-correlated statistics, work through the linear regression analysis, and make a regression plot and a residuals plot. Compare the graphs and see which measure is best at predicting the number of runs scored.
Variations
Many variations of this project are possible. We're sure that you can think of more yourself, but here are a few ideas to get you started.
- Do you get the same results if you run this analysis for a different year? For a different baseball era? Can you think of reasons to explain any differences you find?
- Are there other derived statistics (besides RC, TA, and BRA) that might do a better job at predicting runs scored?
- You have to score at least one run to win a baseball game, so we expect teams that score more runs to win more games. However, you also have to keep the other team from scoring more runs than you do. So how well does a team's run-scoring ability correlate with winning percentage?
- Investigate correlations between team pitching statistics and winning percentage. Which pitching statistic is the best predictor of success?
Baseball Economics
- How well do player salaries correlate with offensive performance? In baseball it is generally expected that the three outfielders and the first and third basemen will produce runs for the team by being skilled with the bat. Assemble the individual batting and salary statistics for this group of players for a single season. How well does salary correlate with the various batting statistics used above? You can take this further by expanding your sample to multiple seasons.
- How well does team payroll correlate with winning percentage?
More Advanced Project Ideas
- Baseball and Athletic Longevity. History tells us that, over a human lifetime, the trajectory for most individual accomplishments is an arc. We all start off pretty much helpless as infants, grow in physical and mental skill through childhood, teenage years and young adulthood. If we are fortunate enough to live into old age, we also, inevitably, start to notice a decline in those same skills as the body and mind age. Baseball statistics provide a way to measure the trajectory of athletic ability for large numbers of individuals. There are many, many questions you could explore along these lines. What is the "average" age for peak performance? How much variance is there in this age? Does it differ for pitchers and batters? Which position has the greatest longevity? The shortest? Has peak performance age changed over time? Use year-by-year career statistics for individual players to identify their peak years by some measure that you devise. Compile and analyze tables of peak performance data for groups of players to answer one of the questions above, or a similar question that interests you.
- For more ideas, see Teaching Baseball Using Statistics, by Jim Albert (listed in the Bibliography).
Credits
Andrew Olson, Ph.D., Science Buddies
Sources
- Albert, Jim, 2003. Teaching Statistics Using Baseball. Washington, D.C.: The Mathematical Association of America.
- Rummel, R.J., 1976. "Understanding Correlation, Chapter 4.3, Interpreting the Correlation: Correlation Squared" Department of Political Science, University of Hawaii [accessed March 6, 2006] http://www.mega.nu:8080/ampp/rummel/uc.htm#S4.3.
0 comments:
Post a Comment