Data Mining: Wholesale Analysis

This past semester, I was enrolled in a data mining and predictive analytics course. This is a brief write-up of the semester project done by Mwalimu and me.


For our semester-end project, we were not given a large amount of direction. We were told to select a large data set, perform sufficient analysis, and employ most of the data mining tools we had covered over the past semester. Mwalimu and I decided to find a data set on the UCI Machine Learning Repository. We had found a few data sets to try, but nothing that really allowed us to do both classification and clustering as we wanted, until Mwalimu found this data set on wholesalers and their clients.

We knew that wholesaling is a very data-driven industry, and knowing nothing else about it, we wanted to see whether we could classify clients as retail stores or hotels, and what sort of clusters existed, which would allow more directed marketing and service for those clients.


We had the following variables:
  • Fresh: Annual spending on fresh products (Produce)
  • Milk: Annual spending on dairy products
  • Grocery: Annual spending on grocery products
  • Detergents_Paper: Annual spending on cleaning supplies
  • Frozen: Annual spending on frozen products 
  • Delicatessen: Annual spending on deli items
  • Channel: A factor indicating whether the client was a hotel or a retail location
  • Region: The region in which the client was located
The summary stats are in this table. 


Variable            Min    Max      Mean      Std. Deviation
Fresh               3      112151   12000.30  12647.329
Milk                55     73498    5796.27   7380.377
Grocery             3      92780    7951.28   9503.163
Frozen              25     60869    3071.93   4854.673
Detergents_Paper    3      40827    2881.49   4767.854
Delicatessen        3      47943    1524.87   2820.106
I think it's fairly clear that we had some large variances, and going back I would rather use the median over the mean as a measure of central tendency. However, at the time we had already started and stopped at least 2 projects and needed to just have something to work with.

Clustering: K-Means

We decided to use the k-means algorithm for our clustering. We did most of our work in R, with some data preprocessing in Excel. We used the NbClust package to decide on the number of clusters, and the results pointed to 2 clusters.
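Roughly, the R for this step looked like the sketch below. This is illustrative rather than our exact script: the column names are the ones in the UCI file (Delicassen is the repository's spelling of Delicatessen), and the choice of columns, the scaling, and the seed are assumptions.

# Sketch of the clustering step (illustrative, not our exact script)
library(NbClust)

wholesale <- read.csv("Wholesale customers data.csv")
spend <- scale(wholesale[, c("Fresh", "Milk", "Grocery", "Frozen",
                             "Detergents_Paper", "Delicassen")])

# Let NbClust vote on the number of clusters; it pointed us to 2
nb <- NbClust(spend, min.nc = 2, max.nc = 10, method = "kmeans")

set.seed(123)
km <- kmeans(spend, centers = 2, nstart = 25)
table(km$cluster, wholesale$Channel)   # compare clusters to the Channel factor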

We then wanted to visualize the clusters, but with 8 variables we needed PCA to reduce the dimensionality. We accomplished this with the fviz_cluster() function from the factoextra package. Our first principal component explained 42% of the variance and the second 20%.
Cluster 1 seemed to be mostly the first channel, which was mostly hotels. We also made boxplots for each cluster, which are in the slides I'm attaching at the bottom.
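The visualization call, continuing from the sketch above, was roughly the following; fviz_cluster() projects the clusters onto the first two principal components, and the geom and ellipse settings here are assumptions.

library(factoextra)

# Plot the k-means clusters on the first two principal components.
# fviz_cluster() runs PCA internally when there are more than two variables.
fviz_cluster(km, data = spend, geom = "point", ellipse.type = "convex")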

Cluster 2 was mostly channel 2: mostly retail locations, more closely grouped by region. The more a client spent in any area, the more likely it was to be in this cluster.

Classification

After doing the clustering we were relieved; we were fairly confident from how the data had separated that we could use our classification techniques to classify the kind of customer. We employed 4 classification methods: a classical decision tree, random forests, support vector machines, and logistic regression. We separated our data into a training and test set (176/264).
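A sketch of the split, plus the helper used to score each model below, is given here. Which side of the 176/264 split was the training set is an assumption, as is reading our "precision rate" as overall accuracy on the test set.

# Illustrative train/test split (the data set has 440 clients in total)
set.seed(42)
wholesale$Channel <- factor(wholesale$Channel)   # classification target
train_idx <- sample(nrow(wholesale), 264)        # assumed training size
train <- wholesale[train_idx, ]
test  <- wholesale[-train_idx, ]

# Helper reused for each model below (overall accuracy on the test set)
precision_rate <- function(pred, actual) mean(pred == actual)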

Classical Decision Tree

For this we used a minimum split of one (1) and let R determine the rest of the settings. This gave us a beautiful unpruned tree with a precision rate of 0.903.


It is, however, very deep, and we weren't satisfied leaving it unpruned. We used a Cp plot and determined that a cutoff of 0.015 was appropriate. From this we got a much smaller tree with only 3 decision points that had the exact same precision rate. By reducing complexity we didn't lose any precision.
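The rpart workflow was roughly the sketch below. The minsplit of 1 and the 0.015 cp cutoff are the values mentioned above; the formula and the cp = 0 setting used to grow the full tree are assumptions.

library(rpart)

# Grow the full, unpruned tree: minsplit = 1 and cp = 0 let it split freely
tree_full <- rpart(Channel ~ Fresh + Milk + Grocery + Frozen +
                     Detergents_Paper + Delicassen,
                   data = train, method = "class",
                   control = rpart.control(minsplit = 1, cp = 0))

plotcp(tree_full)                         # Cp plot used to choose the cutoff
tree_pruned <- prune(tree_full, cp = 0.015)

pred_tree <- predict(tree_pruned, test, type = "class")
precision_rate(pred_tree, test$Channel)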

Random Forest

This black-box method of classification is usually better than a single decision tree, as it generates many trees and identifies the most important factors.

We found that 
  • The amount spent on cleaning supplies was the most important
  • Then the amount spent on grocery items
  • Then amounts spent on dairy
  • Then the amount spent on frozen. 
The importance dropped off after that. Overall, this model had a precision rate of 0.926.
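A sketch of that step follows; leaving the number of trees at the randomForest default of 500, and the rest of the settings, are assumptions.

library(randomForest)

set.seed(42)
rf <- randomForest(Channel ~ Fresh + Milk + Grocery + Frozen +
                     Detergents_Paper + Delicassen,
                   data = train, importance = TRUE)

importance(rf)    # variable importance table (cleaning supplies ranked first for us)
varImpPlot(rf)    # quick plot of the same information

pred_rf <- predict(rf, test)
precision_rate(pred_rf, test$Channel)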

Support Vector Machines

These create a hyperplane that attempts to separate the classes. We didn't get any interesting images, nor particularly impressive results. We had an untuned model with a precision rate of 0.881 and a tuned model with a rate of 0.903.
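The untuned/tuned pair looked roughly like this; the e1071 package, the default radial kernel, and the tuning grid are all assumptions.

library(e1071)

# Untuned model with the default radial kernel
svm_base <- svm(Channel ~ Fresh + Milk + Grocery + Frozen +
                  Detergents_Paper + Delicassen,
                data = train)
precision_rate(predict(svm_base, test), test$Channel)

# Tune cost and gamma over a small grid, then score the best model
tuned <- tune(svm, Channel ~ Fresh + Milk + Grocery + Frozen +
                Detergents_Paper + Delicassen,
              data = train,
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1)))
precision_rate(predict(tuned$best.model, test), test$Channel)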

We also spent the least amount of time on this method, both in the class and in this project. 

Logistic Regression

Finally we reached logistic regression. When we trained our model, only 2 variables ended up being significant: paper products and grocery spending. As we worked through our selection process and dropped other insignificant variables, we found that Fresh was also significant at the 10% level.

For our final model we used Fresh, Grocery, and paper products. I believe we selected this model using stepwise selection. In the end, however, this model was our worst yet, with a precision rate of only 0.876.
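A sketch of the logistic fit and the selection step is below; using step() with its default AIC criterion is an assumption about how the stepwise selection was done, as is the 0.5 cutoff for turning probabilities into classes.

# Logistic regression on all spending variables, then stepwise selection
glm_full <- glm(Channel ~ Fresh + Milk + Grocery + Frozen +
                  Detergents_Paper + Delicassen,
                data = train, family = binomial)

glm_step <- step(glm_full, direction = "both")   # AIC-based stepwise selection
summary(glm_step)   # Fresh, Grocery, and Detergents_Paper remained for us

# Score on the test set using a 0.5 probability cutoff
pred_prob <- predict(glm_step, test, type = "response")
pred_glm  <- ifelse(pred_prob > 0.5, levels(test$Channel)[2], levels(test$Channel)[1])
precision_rate(pred_glm, test$Channel)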


In conclusion, we learned that paper products consistently seemed to be the most important metric, as did grocery. Using information like that can help group different businesses together so that the wholesaler can provide better service and cater to the differing groups.


The link to the slides is here.
