Predicting OPS: Simple Regression
During this past semester I took my first class with Dr. Somodi, who graciously let me sign up and join the class a week in. UNI recently added a certificate in statistical computing, or at least has started advertising it heavily. I was signed up for an FM exam prep course, but I'm doing the statistics route of the major, and a course in statistical computing seemed more prudent than a class covering material I had already seen in Math of Finance. This class is part of a two-class series that goes over R, a language I feel very comfortable in at this point.
I'm a fan of project-based classes, which this definitely is. Even our exams are really more like projects that we complete during class. For our final we were told to pick a partner, pick a data set, and perform some of the statistical techniques we had learned this semester. Surprisingly, I knew very few of the students in this course when it began, as many of them are not in my major. However, I got to know my partner Clara fairly well over the semester because we had many classes together, so when project time came we decided to work together.
First we had to pick a data set. In Regression class with Dr. Kirmani, I had a similar project where I had to perform regression on a data set with at least 10 variables and at least 100 observations. My group in Dr. Kirmani's class had chosen a baseball data set from MLB.com and developed a regression model to predict RBI. However, I didn't want to just replicate my SAS code in R, so we decided to predict OPS instead. Neither Clara nor I was very familiar with baseball, but we knew we could do something with this data set.
Then we had to do some exploratory data analysis. We knew that some measures are calculated from others, and that using those building blocks alongside the measures they produce would introduce multicollinearity, which would make our regression useless. We also realized that some variables were too highly correlated, such as number of games, times at bat, and number of balls, so we knew we couldn't use all of those at the same time. We determined that our problem would be to regress OPS on Hits, Hit by Pitch, Scores after Fly, Total Bases, Plate Appearances, Walks, and Games. We then decided to use a slightly different data set than the one I had used for the RBI project, using the 2016 season.
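As one illustration of a measure built from others, OPS itself is conventionally defined as on-base percentage plus slugging percentage, and both of those are computed from basic counting stats. The Python sketch below uses that standard definition. The function name and the sample numbers are made up for illustration and are not from our data set or our R code.

```python
def ops(hits, walks, hit_by_pitch, at_bats, sacrifice_flies, total_bases):
    """Return OPS = OBP + SLG using the conventional definitions."""
    # On-base percentage: times on base divided by the plate appearances that count
    obp = (hits + walks + hit_by_pitch) / (
        at_bats + walks + hit_by_pitch + sacrifice_flies
    )
    # Slugging percentage: total bases per at bat
    slg = total_bases / at_bats
    return obp + slg


# Hypothetical season line, purely for illustration
example_ops = ops(hits=150, walks=50, hit_by_pitch=5,
                  at_bats=500, sacrifice_flies=5, total_bases=250)
print(round(example_ops, 3))
```

Because the response is an exact function of counts like these, predictors that are themselves built from the same counts carry overlapping information.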
To begin, we decided to check our potential full model with a correlation matrix. We started with all of the variables and then trimmed the matrix down to the ones in our model. That left us with our starting model. There was still a bit of correlation that we might have to address, but this was a good starting point.
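For readers who want to see the idea in code, here is a minimal sketch of screening predictors with pairwise Pearson correlations. We did our work in R; this is plain Python only to show the logic. The column names, the toy values, and the 0.9 cutoff are all assumptions for illustration, not our actual data or threshold.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def high_corr_pairs(columns, threshold=0.9):
    """List the variable pairs whose |r| meets or exceeds the threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Made-up toy values, NOT the 2016 MLB data
toy = {
    "games": [150, 120, 160, 90, 140, 110],
    "plate_appearances": [640, 500, 690, 360, 600, 450],
    "hit_by_pitch": [3, 9, 2, 6, 8, 1],
}
flagged = high_corr_pairs(toy)
print(flagged)
```

Any pair that gets flagged is a candidate for keeping only one of the two variables in a given model.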
We knew that some of our variables were correlated, so we designed two models: one used games, and the other used times at bat. We then performed stepwise regression for each model and tested our assumptions. We then revised the models and tested again.
The assumptions for linear regression are fairly straightforward:
- Normality of residuals
- Little or no multicollinearity
- No autocorrelation (independence of observations)
- Homoscedasticity
We tested the normality of our residuals by looking at the QQ plot as well as performing the Shapiro test. For both models this was fine. We had already done some planning to avoid multicollinearity with our correlation matrix, but we also checked this with VIF. Our model with Total Bases had some higher values here, but was likely passable. We had selected our variables with the intention of avoiding issues with independence, so we were fine there. Finally, we checked homoscedasticity with the Residuals vs Fitted plot. The TB model had an S-like shape. After checking these we decided to use games instead of total bases.
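As a rough illustration of the VIF check, the sketch below relies on the fact that each predictor's variance inflation factor, 1 / (1 - R²) from regressing it on the other predictors, equals the matching diagonal entry of the inverse of the predictors' correlation matrix. This is plain Python rather than the R we used, and the helper names and toy values are made up for the example. A cutoff such as 5 or 10 is a common rule of thumb, not a hard rule.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity on the right
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Return {predictor name: VIF} from the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Made-up toy values, NOT the 2016 MLB data
toy = {
    "games": [150, 120, 160, 90, 140, 110],
    "plate_appearances": [640, 500, 690, 360, 600, 450],
    "hit_by_pitch": [3, 9, 2, 6, 8, 1],
}
vifs = vif(toy)
for name, value in vifs.items():
    print(name, round(value, 2))
```

In this toy example the two nearly interchangeable predictors get very large values while the third stays small, which is the pattern that would suggest dropping one of the pair.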
We performed regression again and got our results. We got an R-squared value of 0.89, which we felt was fairly decent. We then wrote our paper and gave a 7-minute presentation to the class. Overall I felt like I learned a lot in the class, and I think UNI should make it a required class for the statistics route of my major. Our slides are available here.