OPS Revisited: A Regression Example
This semester, I had my first and only class with my adviser, Dr. Ecker. The class, while interesting, was unfortunately taken out of order: it is really a sampler course of applied statistical methods, and I had already taken the classes it leads into. For example, this class was supposed to come before my stat computing courses as well as my regression class. As a result, I ended up doing more than was likely required for this project.
As I had a large semester project in each of my classes this semester, I decided to revisit the OPS prediction that Clara and I had done last semester. I already had the data set and much of the exploratory analysis done, and I was working with a section of the data set for my classification project with Levi. I also thought it would be interesting to update the project with new data from the latest season, new statistical techniques, and greater familiarity.
For this project we had to use three of the techniques we had covered in class: regression, t-tests, ANOVA, ANCOVA, clustering, and some basic classification. I chose to highlight regression and clustering, as well as ANOVA, because after my regression class and stat computing 1 and 2, I couldn't imagine doing a regression without ANOVA.
Remember that this data set includes variables that are created as the result of other variables in the data set. I divided my variables into three categories: Categorical, Building Block, and Composition variables. This time I also divided my data into a training set (70%) and a testing set (30%); a sketch of that split follows the variable table below.
| Category | Variables |
| --- | --- |
| Categorical | Player name, Team, Position |
| Building Blocks | G, AB, H, 2B, 3B, HR, BB, SO, SB, CS, IBB, HBP, SAC, SF, TB, GDP, GO, AO, NP, PA, RBI |
| Composition | AVG, OBP, SLG, OPS, GO_AO, XBH |
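Here is a minimal sketch of the 70/30 split in R, assuming the full data frame is called `batting` (a hypothetical name):

```r
set.seed(42)  # reproducible split (the seed value is arbitrary)
train_idx <- sample(nrow(batting), size = floor(0.7 * nrow(batting)))
train <- batting[train_idx, ]   # 70% training set
test  <- batting[-train_idx, ]  # 30% testing set
```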
As a reminder, the assumptions of linear regression are:
- Normality of residuals
- Little or no multicollinearity
- No autocorrelation and independence of observations
- Homoscedasticity
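For reference, one way to check each of these in R (a sketch, not necessarily the exact methods I used, assuming `fit` is an lm() model fit on the training set):

```r
library(car)     # vif(), durbinWatsonTest()
library(lmtest)  # bptest()

shapiro.test(residuals(fit))                    # normality of residuals
qqnorm(residuals(fit)); qqline(residuals(fit))  # visual normality check
vif(fit)                       # variance inflation factors for multicollinearity
durbinWatsonTest(fit)          # autocorrelation in the residuals
bptest(fit)                    # Breusch-Pagan test for homoscedasticity
```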
After selecting my variables to ensure independence and to avoid multicollinearity right off the bat (no pun intended), I performed three kinds of model selection: forward, backward, and stepwise. I then checked my assumptions with methods like those sketched above and compared the models. I found that backward and stepwise selection produced the same model, with a better adjusted R-squared value. This is lower than the R-squared from my previous OPS model, though not by much, and it is the adjusted R-squared, computed on a training set with more current data. I still find it interesting how close the two are despite the differing models.
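A minimal sketch of how the three selection procedures can be run in R, assuming `train` is the 70% split and `candidates` is the set of predictors kept after the multicollinearity screening (note that step() selects on AIC, while the table below compares adjusted R-squared):

```r
# Candidate predictors after screening (the forward model's pool, for illustration)
candidates <- OPS ~ R + `2B` + `3B` + RBI + SO + SB + CS + AVG +
  HBP + SAC + GDP + GO_AO + PA

null_model <- lm(OPS ~ 1, data = train)
full_model <- lm(candidates, data = train)

fwd  <- step(null_model, scope = candidates, direction = "forward")
bwd  <- step(full_model, direction = "backward")
both <- step(null_model, scope = candidates, direction = "both")

summary(bwd)$adj.r.squared   # compare adjusted R-squared across the fits
```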
| Model | Adj. R-squared | Selected? |
| --- | --- | --- |
| Forward: OPS ~ R + `2B` + `3B` + RBI + SO + SB + CS + AVG + HBP + SAC + GDP + GO_AO + PA | 0.7844 | No |
| Backward: OPS ~ R + `2B` + RBI + AVG + HBP + PA | 0.7929 | Yes |
| Stepwise: OPS ~ R + `2B` + RBI + AVG + HBP + PA | 0.7929 | Yes |
By checking against my testing data, I found an average difference of -0.02083527, which seemed acceptable to me, especially as this data set contained data from a season still in progress. I then produced the corresponding ANOVA table and included that as well.
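That check and the ANOVA table take only a few lines in R (a sketch, assuming `final_model` is the backward/stepwise fit and `test` is the 30% hold-out):

```r
# Average difference between predicted and observed OPS on the testing set
preds <- predict(final_model, newdata = test)
mean(preds - test$OPS)

# The corresponding ANOVA table for the fitted model
anova(final_model)
```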
For the final part of this project, I did some hierarchical clustering. Much to Dr. Ecker's dismay, I did this in R instead of S+. They're essentially the same, but R allows me to install packages, and I'm more comfortable in the R environment.
I used NbClust() to determine the number of clusters; the suggestion was 3, but Dr. Ecker encouraged me to go with 2 to demonstrate why it was ineffective, as I already had some experience with more advanced clustering methods. I used dimension reduction methods to make the following plot. The clusters do overlap quite a bit, both because of the dimension reduction and because the solution is one cluster short of the optimum.
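A minimal sketch of that workflow, assuming `num_vars` is a scaled matrix of numeric predictors (the column choice is hypothetical, and PCA stands in here for whichever dimension reduction was actually used):

```r
library(NbClust)

# Scale the numeric predictors first (hypothetical column choice)
num_vars <- scale(train[, c("AVG", "OBP", "SLG", "HR", "PA")])

# NbClust aggregates many indices to suggest a number of clusters
nb <- NbClust(num_vars, min.nc = 2, max.nc = 10, method = "ward.D2")

hc  <- hclust(dist(num_vars), method = "ward.D2")  # hierarchical clustering
grp <- cutree(hc, k = 2)                           # cut at k = 2 as in the write-up

# PCA as the dimension reduction for a 2-D view of the clusters
pc <- prcomp(num_vars)
plot(pc$x[, 1:2], col = grp, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Hierarchical clusters (k = 2)")
```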
Cluster 1 did seem to contain better players in general, while cluster 2 seemed to contain players who saw less playing time, usually brought in to pinch-hit or the like.
This was an interesting project that I think would have suited me much better before I took all the other courses in my track. However, due to a lack of sign-ups, the class wasn't offered until my senior year. The slides for my presentation are available here.