Graph Spectral Clustering for the Mushrooms Dataset

Can we use graph spectral clustering (GSC) to cluster the Mushrooms dataset from the UCI Machine Learning Repository?

In [8]:
from lib.spectral_clustering import spectral_clustering
from lib.categorical_similarity_functions import categorical_preprocessing_csv
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA  # used below to project the data into 2D
import numpy as np

First things first: we need to take our CSV file and turn it into something we can work with.

In [ ]:
numOfAtts, dataA = categorical_preprocessing_csv("scripts/data/mushrooms.csv")

Now, let's take that data and cluster it!
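The first-pass clustering cell isn't shown here, but judging from the call we make later, it presumably looked something like this (a sketch, assuming we feed in the full preprocessed data, label column and all):

In [ ]:
# hypothetical first pass: cluster ALL columns (the label column included)
# into 2 groups using the random-walk ("rw") Laplacian
assns, SandU = spectral_clustering(dataA, 2, "rw", with_eigen=True, numOfAtts=numOfAtts)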

Hooray! We have our assignments! Now we need to compare these assignments to the actual labels (we want to see how well we clustered the data into poisonous and edible).

Now that we have taken the first column and reshaped the data, we can convert our 'p's to 0's and 'e's to 1's so we can compare this vector to our assignments.
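As a side note, NumPy can do this conversion in one vectorized line (a sketch, assuming the label column has been pulled out into an array, as we do below with verify):

In [ ]:
# vectorized alternative to the loop below: 'e' -> 1, 'p' -> 0
bindata = (np.array(verify) == 'e').astype(int)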

Now, we'll compute the error by taking the norm of the difference between the cluster-assignment vector and the actual labels of the mushrooms, divided by the number of mushrooms.
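Concretely, each entry of the difference vector is $0$ on a match and $\pm 1$ on a mismatch, so this error works out to $\mathrm{err} = \frac{\lVert \mathrm{assns} - \mathrm{bindata} \rVert_2}{n} = \frac{\sqrt{\#\,\mathrm{mismatches}}}{n}$, where $n$ is the number of mushrooms.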

This does pretty well! But it's worth noting that we used the column that tells us whether a mushroom is poisonous to form our clusters. To get a more honest measure of how well our clustering algorithm works, we should ignore this column when we cluster and then make the same comparison as we did here.

In [ ]:
data = dataA.T[1:]  # drop the first column of dataA (the poisonous/edible label)
data = data.T
numOfAtts = numOfAtts[1:]  # drop that column's attribute count too
print(data.shape)
print(numOfAtts.shape)
In [ ]:
# cluster into 2 groups with the random-walk ("rw") Laplacian;
# with_eigen=True also returns the eigenvalues and eigenvectors (S and U)
assns, SandU = spectral_clustering(data, 2, "rw", with_eigen=True, numOfAtts=numOfAtts)
print(assns)
In [ ]:
dataT = np.array(dataA.T)
verify = dataT[0]  # the first column of dataA: the true p/e labels
bindata = []
for i in range(len(verify)):
    if verify[i] == 'p':
        bindata.append(0)
    elif verify[i] == 'e':
        bindata.append(1)
bindata = np.array(bindata)
errvec = assns - bindata  # entries are 0 on a match, +/-1 on a mismatch
n = errvec.shape[0]
err = np.linalg.norm(errvec)/n
print(err)

# to check in how many spots the assignments match reality
count = 0
for i in range(len(errvec)):
    if errvec[i] == 0:
        count += 1
print(count)
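One caveat before we read too much into count: the 0/1 labels that spectral clustering returns are arbitrary, so our clusters may simply be named the opposite way from bindata. A quick hedge (a sketch, reusing errvec and count from above) is to also check the flipped labeling:

In [ ]:
# every nonzero entry of errvec is a mismatch under the current labeling,
# hence a match under the flipped one, so take the better of the two
matches = max(count, len(errvec) - count)
print(matches / len(errvec))  # fraction of mushrooms in the "right" cluster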

Seeing the error is nice and all, but we all know that visuals are where it's at. The only problem is that our data is HUGE. So, why don't we reduce the dimension and project our data into $\mathbb{R}^2$?

In [ ]:
print(verify[0:30])   # true labels as characters
print(bindata[0:30])  # true labels as 0/1
print(assns[0:30])    # our cluster assignments
S, U = SandU          # eigenvalues and eigenvectors from spectral clustering
print(U.shape)

To represent our clusters, we use PCA to find the k dominant directions of our 22-dimensional data. To do this, we first center our data matrix so each feature has mean 0 (and, if we want every feature on a common scale, also rescale each to variance 1). Then we find the k eigenvectors corresponding to the largest k eigenvalues of the covariance matrix $\frac{1}{n}X^TX$. These k eigenvectors are what we call the principal components, and projecting onto them retains the most variance.

In [ ]:
# each row is a mushroom, each column is one of the 22 features
rows, columns = np.shape(data)
means_of_columns = np.mean(data, axis=0)  # vector of the 22 column means
# center each column of the data at 0
centered_data = data - means_of_columns  # (number of mushrooms) x 22 matrix

pca = PCA(n_components=2)  # we want 2 components
pca.fit(centered_data)
components = pca.components_  # 2 vectors spanning the principal axes

# project each data point onto the 2 components
projection_matrix = np.array(components).T  # columns are the components
projected_data = centered_data @ projection_matrix  # best representation of the 22-dim data in 2-dim
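With the projection in hand, we can finally draw the picture. Here is a minimal sketch of the scatter plot (colormap and marker size are just one choice), coloring each projected mushroom by its true label from bindata (0 = poisonous, 1 = edible):

In [ ]:
# scatter the 2-dim projection, colored by the true poisonous/edible labels
plt.figure(figsize=(8, 6))
plt.scatter(projected_data[:, 0], projected_data[:, 1], c=bindata, cmap="coolwarm", s=5)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.title("Mushrooms projected onto the top 2 principal components")
plt.show()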