Python and K Means Clustering

K Means Clustering using Python

1. Create clusters

1.1. As K Means Clustering is an unsupervised learning model, there is no concept of a train/test split

1.1.1. We simply choose the number of clusters, K, fit our data to a new model, and read off the predicted cluster for each point

1.2. from sklearn.cluster import KMeans

1.3. kmeans = KMeans(n_clusters=4)

1.3.1. We choose 4 clusters here only because we created our data artificially and already know that we specified 4 clusters

1.3.1.1. In the real world, we would use the elbow method to choose an appropriate K value

1.4. kmeans.fit(data[0])

1.4.1. Here we simply fit the model to the feature data

1.4.1.1. Remember that unsupervised learning does not involve the use of labels during the learning process
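
1.5. Putting these steps together, a minimal sketch (assuming data was created with make_blobs as described in the data profiling section below):

     from sklearn.cluster import KMeans

     # data[0] holds the feature array; the labels in data[1] are ignored,
     # since unsupervised learning does not use labels during fitting
     kmeans = KMeans(n_clusters=4)
     predicted = kmeans.fit_predict(data[0])  # fit the model and return a cluster index per sample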

2. Review model metrics

2.1. After the data has been fitted to the model, we can see the resulting predictions

2.1.1. We can see the predicted cluster centroids

2.1.1.1. kmeans.cluster_centers_

2.1.2. We can also see the predicted labels, showing the predicted K cluster value for every sample record

2.1.2.1. kmeans.labels_
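
2.2. As a quick illustration (not part of the course code), the fitted attributes can be inspected like this:

     print(kmeans.cluster_centers_)   # (4, 2) array: one centroid per cluster, one column per feature
     print(kmeans.labels_[:10])       # predicted cluster index (0 to 3) for the first 10 samples
     print(kmeans.inertia_)           # sum of squared distances from each sample to its nearest centroid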

3. Compare original vs K Means

3.1. This is not a normal step, but it is made possible by the fact that we created the data artificially and therefore have labels for it

3.2. import seaborn as sns
     import matplotlib.pyplot as plt
     %matplotlib inline

3.3. f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10,6))
     ax1.set_title('K Means')
     ax1.scatter(data[0][:,0], data[0][:,1], c=kmeans.labels_, cmap='rainbow')
     ax2.set_title("Original")
     ax2.scatter(data[0][:,0], data[0][:,1], c=data[1], cmap='rainbow')

3.3.1. Here's the resulting plot

3.3.1.1. Note: the colors do not correspond between the two plots, because K Means numbers its clusters arbitrarily
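
3.4. Because the cluster numbers are arbitrary, a permutation-independent metric such as the adjusted Rand index is one way to compare the two labelings numerically; a sketch (not part of the course code):

     from sklearn.metrics import adjusted_rand_score

     # 1.0 means the two labelings agree perfectly (up to renaming of the clusters);
     # a value near 0.0 is what random labeling would score
     print(adjusted_rand_score(data[1], kmeans.labels_))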

4. Finding K using the elbow method

4.1. Here is some code we could have used to find an appropriate value for K if we did not already know that the data splits into 4 groups

5. from sklearn.cluster import KMeans
   from sklearn import metrics
   from scipy.spatial.distance import cdist
   import numpy as np
   import matplotlib.pyplot as plt

   # create new plot and data
   plt.plot()
   X = data[0]
   colors = ['b', 'g', 'r']
   markers = ['o', 'v', 's']

   # k means determine k
   distortions = []
   K = range(1, 10)
   for k in K:
       kmeanModel = KMeans(n_clusters=k)
       kmeanModel.fit(X)
       distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])

   # Plot the elbow
   plt.plot(K, distortions, 'bx-')
   plt.xlabel('k')
   plt.ylabel('Distortion')
   plt.title('The Elbow Method showing the optimal k')
   plt.show()

5.1. We can see the bend of the elbow is most prominent when K = 4

6. Theory

6.1. K Means Clustering is an unsupervised machine learning model that uses unlabelled training data to group data points into clusters

6.1.1. Typical use cases include:

6.1.1.1. Grouping together similar documents

6.1.1.2. Grouping customers based on features

6.1.1.3. Market segmentation

6.2. The overall goal is to label every data point to a distinct cluster, where the number of clusters is represented by K

6.2.1. Here we see the transformation of unlabelled training data into 3 distinct K clusters

6.3. The process of clustering the data is as follows (a from-scratch sketch of this loop appears at the end of this branch):

6.3.1. 1. Choose a value for K

6.3.2. 2. Randomly assign each data point to one of the K clusters

6.3.3. 3. Repeat the following sequence until no more data point re-assignments occur:

6.3.3.1. a) Calculate the centroid for each cluster

6.3.3.1.1. This is the mean vector of points in the cluster

6.3.3.2. b) Re-assign each data point to the K cluster with the nearest centroid

6.3.4. We can visualize it like this

6.3.4.1. In step 1, we see the random assignment, and in step 2a, we see the initial centroid calculation

6.3.4.1.1. As the initial assignment is random, the centroids generally begin very close together, often overlapping

6.3.4.2. In step 2b we see the 1st iteration re-assignment based on the centroids, and then in the next iteration, the centroids shift and we repeat the process again

6.3.4.2.1. The final result (reached when the centroids no longer change and no further re-assignments occur) typically comes after around 10 iterations or so
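
6.3.5. A minimal from-scratch sketch of this loop, using NumPy and following the steps above (empty clusters are not handled, to keep the sketch short):

     import numpy as np

     def kmeans_from_scratch(X, k, max_iters=100, seed=101):
         rng = np.random.default_rng(seed)
         # Step 2: randomly assign each data point to one of the K clusters
         assignments = rng.integers(0, k, size=X.shape[0])
         for _ in range(max_iters):
             # Step 3a: calculate the centroid (mean vector) of each cluster
             centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
             # Step 3b: re-assign each point to the cluster with the nearest centroid
             distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
             new_assignments = distances.argmin(axis=1)
             if np.array_equal(new_assignments, assignments):  # no more re-assignments
                 break
             assignments = new_assignments
         return centroids, assignments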

6.4. Choosing a K value is not intuitive

6.4.1. Here we can visualize the use of different K values, but which one will deliver the best results?

6.4.1.1. One way to decide on the K value is to use the elbow method

6.4.1.1.1. Using the elbow method, we calculate the sum of squared errors (SSE), i.e. the sum of squared distances between each point and its cluster centroid, for a range of K values

6.4.1.1.2. The idea is to identify the K value at which the decrease in SSE slows abruptly, producing the visible "elbow" in the plot
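
6.4.2. In scikit-learn the SSE of a fitted model is available as its inertia_ attribute, so the elbow curve can also be produced like this (a sketch equivalent to the distortion-based code in the elbow method branch above):

     from sklearn.cluster import KMeans
     import matplotlib.pyplot as plt

     sse = []
     K = range(1, 10)
     for k in K:
         model = KMeans(n_clusters=k)
         model.fit(data[0])
         sse.append(model.inertia_)  # sum of squared distances to the nearest centroid

     plt.plot(K, sse, 'bx-')
     plt.xlabel('k')
     plt.ylabel('SSE')
     plt.show()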

7. Data profiling

7.1. In the Udemy course lecture we used the sklearn library to create some artificial data

7.1.1. from sklearn.datasets import make_blobs

7.1.2. data = make_blobs(n_samples=200, n_features=2, centers=4, cluster_std=1.8, random_state=101)

7.1.2.1. This creates a dataset of 200 samples with 2 features each, belonging to 4 groups

7.1.3. The make_blobs() function actually returns a tuple of 2 elements:

7.1.3.1. The first element at index 0 is a numpy array of shape (200,2)

7.1.3.1.1. The 200 represents the samples and the 2 represents the randomly generated features per sample

7.1.3.2. The second element at index 1 is also a numpy array, but this one with a shape of (200,)

7.1.3.2.1. These 200 entries hold values from 0 to 3, representing the 4 different groups

7.1.3.2.2. These values are labels, something you will not normally have in your training data when using the K Means Clustering model
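
7.1.4. A quick way to confirm this structure (a sketch, not from the course):

     import numpy as np

     X, y = data            # unpack the (features, labels) tuple returned by make_blobs
     print(X.shape)         # (200, 2): 200 samples, 2 features each
     print(y.shape)         # (200,): one group label per sample
     print(np.unique(y))    # [0 1 2 3]: the 4 groups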

7.2. See Python and logistic regression map for more info on data profiling in general

8. Exploratory Data Analysis (EDA)

8.1. In the Udemy course lecture we took advantage of the labels available in the artificial dataset to visualize it via a scatterplot

8.1.1. This is not something we would do with real data because it wouldn't have labels

8.1.2. plt.scatter(data[0][:,0], data[0][:,1], c=data[1], cmap='rainbow')

8.1.2.1. Here's the scatterplot
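
8.1.3. With real, unlabelled data the same plot would simply drop the color argument, for example:

     import matplotlib.pyplot as plt

     plt.scatter(data[0][:,0], data[0][:,1])  # no c=... because real data has no labels to color by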

8.2. See Python and logistic regression map for more info on EDA in general

9. Data cleaning

9.1. In the Udemy course lecture, we didn't do any data cleaning for K Means Clustering

9.1.1. See Python and logistic regression map for more info on data cleaning in general