Baseball Classification
I know at this point it probably seems like I'm a huge baseball fan. And now I kind of am, but only because the sport keeps such excellent stats. Have you actually tried to watch a whole baseball game? I can't. But enough about my lack of inclination toward sports. Today I'm writing about my final project for Statistical Computing II, with Dr. Somodi.
For this project we worked on clustering and classification through various methods. The class had spent a good deal of time on both supervised and unsupervised learning methods in R, and since my data set already contained player data, we decided to classify the players and then perform a k-means clustering on the data set as well.
Overview
This class was probably about half as full as the first in the series, which surprised me. When it came time for the final project, though, I knew the other students a bit better and decided to pair up with Levi, who had switched from the actuarial track to the statistics track. He was a little more sports-inclined, and since I had already scraped the MLB website and was familiar with the data set, it was a natural fit for our project.
Classification
We performed three kinds of classification: Logistic Regression, Classical Decision Trees, and Random Forests. We had gotten a lot of experience with these methods and with comparing them using prediction tables, so they seemed like fairly straightforward ways to determine player type. However, we decided that having a tree for each player type might be cumbersome, unwieldy, and not all that useful, so we transformed the data to label each player as an infielder or an outfielder. We then split the data set into training and testing sets and got to work.
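For anyone curious, here is a rough sketch of that setup in R. The data frame name (players), the position column (POS), the infield/outfield mapping, the column names, and the 70/30 split are my assumptions here, not our exact code.

```r
# Sketch of the setup: derive an infielder/outfielder label from position
infield_pos <- c("C", "1B", "2B", "3B", "SS")
players$Infield <- factor(ifelse(players$POS %in% infield_pos, "In", "Out"))

# Keep only the label plus the batting stats used for modeling
# (column names like X3B for triples are guesses at how R named them)
model_dat <- players[, c("Infield", "AB", "R", "X3B", "SO", "SB", "SAC",
                         "GDP", "GO_AO")]

# Train/test split (the 70/30 proportion is also an assumption)
set.seed(1)
train_rows <- sample(nrow(model_dat), size = floor(0.7 * nrow(model_dat)))
train <- model_dat[train_rows, ]
test  <- model_dat[-train_rows, ]
```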
Logistic Regression
For logistic regression we found that AB, R, 3B, SO, SB, SAC, and GDP were all considered significant. The fit gave us no trouble, and we got a 70% correct classification rate on our testing data. Our table can be found below. We felt this was a fairly strong result.
      Actual
       FALSE  TRUE
  In      66    29
  Out     33    81

> (66 + 81) / 209
[1] 0.7033493
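For reference, a minimal sketch of how a fit like that looks in R, using the assumed setup from above (this is not our exact code):

```r
# Logistic regression for infielder vs. outfielder on the training set
log_fit <- glm(Infield ~ ., data = train, family = binomial)
summary(log_fit)   # AB, R, 3B, SO, SB, SAC, and GDP came out significant for us

# predict() gives P(Out), since "Out" is the second factor level
probs <- predict(log_fit, newdata = test, type = "response")
pred  <- ifelse(probs > 0.5, "Out", "In")

table(Predicted = pred, Actual = test$Infield)   # confusion table like the one above
mean(pred == test$Infield)                       # proportion correctly classified
```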
Classical Decision Trees
For this method we started with the first tree created, which was quite large. This tree had a 68% classification rate on our test data, which was alright but could be improved a bit. It was also far too complicated to have any real use.
We then made a CP plot to figure out where to prune the tree and found that a CP of 0.51 gave the best result, which left only two decisions in our tree. The pruned tree had a prediction rate of 70%, better than the full tree. The only two variables that seemed to matter were SB and GDP.
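A sketch of that tree workflow with the rpart package; the cp value is just the one mentioned above, and the rest is assumed rather than our exact code:

```r
library(rpart)

# Full classification tree first; ours was large and hard to read
tree_full <- rpart(Infield ~ ., data = train, method = "class")

# CP plot to decide where to prune, then prune back
plotcp(tree_full)
tree_pruned <- prune(tree_full, cp = 0.51)
plot(tree_pruned); text(tree_pruned)   # only SB and GDP remained as splits for us

# Prediction rate on the test set
pred_tree <- predict(tree_pruned, newdata = test, type = "class")
mean(pred_tree == test$Infield)
```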
Random Forests
Random forests are, of course, simply a collection of decision trees: many trees are grown on resampled versions of the data, their votes are combined, and a variable importance plot can be made from the ensemble. For our project we grew 500 trees for our forest. We found that SB was the most important variable, with GO_AO next, though there was a large gap between them. We got a prediction rate of 64%, the worst we had seen so far.
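And roughly how the forest side looked with the randomForest package (again a hedged sketch under the same assumed setup, not our exact code):

```r
library(randomForest)

# 500 trees, tracking variable importance
rf_fit <- randomForest(Infield ~ ., data = train, ntree = 500, importance = TRUE)
varImpPlot(rf_fit)   # SB was the clear leader, GO_AO a distant second for us

# Prediction rate on the test set
pred_rf <- predict(rf_fit, newdata = test)
mean(pred_rf == test$Infield)
```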
Comparing Results
When comparing the methods, we found that logistic regression and our pruned decision tree had the same accuracy. We opted to call the decision tree more practical because it was fairly short and left little room for interpretive error, but both methods are viable options.
Clustering
For the clustering part of our project we used K-means clustering. We did consider Kohonen networks, but that seemed a bit unnecessary on top of all the classification we were already performing. We determined the number of clusters should be 5 and then ran the K-means clustering algorithm.
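A minimal sketch of that step; scaling the stats first and the choice of columns are assumptions on my part:

```r
# K-means on the scaled batting stats with k = 5
stats <- scale(model_dat[, -1])   # drop the Infield label, keep the numeric stats
set.seed(1)
km <- kmeans(stats, centers = 5, nstart = 25)

# Cluster profiles: mean of each stat within each cluster
aggregate(model_dat[, -1], by = list(Cluster = km$cluster), FUN = mean)
```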
We created cluster profiles that actually matched what intuitively made sense. To avoid duplicating everything, and because this post has gotten longer than I expected, I won't show the cluster profiles here. However, our slides, which contain the profiles and some additional information, can be found here.