Cluster Analysis
Clustering means finding similarities between data objects on the basis of the characteristics found in the data and grouping similar objects into clusters. It is an unsupervised learning technique (there is no dependent variable).
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer base, and then use this knowledge to develop targeted marketing programs.
Insurance: Identify groups of motor insurance policy holders with interesting characteristics.
Games: Identify player groups on the basis of their age group, location and the types of games they have shown interest in previously.
Internet: Cluster web pages based on their content.
Quality of Clustering
A good clustering method produces high-quality clusters with minimum within-cluster distance (high similarity) and maximum between-cluster distance (low similarity).
Ways to Measure Distance
Euclidean distance: sqrt((x2-x1)^2 + (y2-y1)^2)
Manhattan distance: |x2-x1| + |y2-y1|
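As a quick illustration (the two points below are made-up values, not from any dataset in this tutorial), base R's dist() function computes both measures:
# Two example points (hypothetical values)
pts <- rbind(c(1, 2), c(4, 6))
# Euclidean distance: sqrt((4-1)^2 + (6-2)^2) = 5
dist(pts, method = "euclidean")
# Manhattan distance: |4-1| + |6-2| = 7
dist(pts, method = "manhattan")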
Data Preparation
- Adequate Sample Size
- Standardize Continuous Variables
- Remove outliers
- Variable type: continuous or binary variables
- Check Multicollinearity (a short preparation sketch follows this list)
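A minimal preparation sketch (using the built-in iris data, the same dataset used in the script below; the 3-standard-deviation outlier cut-off and the 0.9 correlation threshold are illustrative assumptions, not fixed rules):
# Keep only the continuous variables from iris
prep <- iris[, -5]
# Standardize continuous variables so that no single variable dominates the distance
prep <- scale(prep)
# Flag potential outliers: observations more than 3 standard deviations from the mean on any variable
prep <- prep[!apply(abs(prep) > 3, 1, any), ]
# Check multicollinearity: look for highly correlated variable pairs (e.g., |r| > 0.9)
round(cor(prep), 2)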
Assess clustering tendency (clusterability)
It is important to assess clustering tendency (i.e., to determine whether the data contain meaningful clusters) before running any clustering algorithm. In unsupervised learning, clustering methods return clusters even if the data do not contain any clusters. In other words, if you blindly apply cluster analysis to a dataset, it will divide the data into clusters because that is what it is supposed to do.
Hopkins statistic is used to assess the clustering tendency of a dataset by measuring the probability that a given dataset is generated by a uniform data distribution (i.e. no meaningful clusters).
The null and the alternative hypotheses are defined as follow:
- Null hypothesis: the dataset is uniformly distributed (i.e., no meaningful clusters)
- Alternative hypothesis: the dataset is not uniformly distributed (i.e., contains meaningful clusters)
If the value of the Hopkins statistic is close to 0, the data is highly clusterable. If the value is close to 0.5, the data contains no meaningful clusters.
Determine the optimal number of clusters
In R, the "NbClust" package provides 30 indices for determining the optimal number of clusters. Two important methods among them are as follows:
- Look for a bend or elbow in the sum of squared errors (SSE) scree plot; the location of the elbow suggests a suitable number of clusters for k-means.
- Average silhouette width measures how well each observation fits its assigned cluster relative to the next closest cluster; the number of clusters that maximizes the average silhouette width is preferred.
# Loading data (iris without the Species column)
data <- iris[, -c(5)]
# Standardize the variables
data <- scale(data)
# Assessing cluster tendency
if(!require(clustertend)) install.packages("clustertend")
library(clustertend)
# Compute Hopkins statistic for the dataset
set.seed(123)
hopkins(data, n = nrow(data)-1)
# Since the H value = 0.1815, far below the 0.5 threshold, the data is highly clusterable
###########################################################################
####################### K Means clustering ################################
###########################################################################
# K-means - Determining the optimal number of clusters
# NbClust Package : 30 indices to determine the number of clusters in a dataset
# If index = 'all' - run 30 indices to determine the optimal no. of clusters
# If index = "silhouette" - It is a measure to estimate the dissimilarity between clusters.
# A higher silhouette width is preferred to determine the optimal number of clusters
if(!require(NbClust)) install.packages("NbClust")
nb <- NbClust(data, distance = "euclidean", min.nc=2, max.nc=15, method = "kmeans",
index = "silhouette")
nb$All.index
nb$Best.nc
# Method II : Same silhouette width analysis with the fpc package
if(!require(fpc)) install.packages("fpc")
library(fpc)
pamkClus <- pamk(data, krange = 2:15, criterion="multiasw", ns=2, critout=TRUE)
pamkClus$nc
cat("number of clusters estimated by optimum average silhouette width:", pamkClus$nc, "\n")
# Method III : Scree plot to determine the number of clusters
# Total within-group sum of squares when all observations form a single cluster (k = 1)
wss <- (nrow(data)-1)*sum(apply(data, 2, var))
# Total within-group sum of squares for k = 2 to 15
for (i in 2:15) {
  wss[i] <- sum(kmeans(data, centers = i)$withinss)
}
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# K-Means Cluster Analysis
fit <- kmeans(data,pamkClus$nc)
# get cluster means
aggregate(data,by=list(fit$cluster),FUN=mean)
# append cluster assignment to a separate data frame, so that `data` itself stays
# unchanged for the hierarchical clustering below
kmeans_result <- data.frame(data, clusterid = fit$cluster)
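# Optional check (a minimal sketch): visualize the k-means solution with clusplot() from
# the cluster package (shipped with R). The plot projects the data onto its first two
# principal components, so it is only an approximate 2-D view of the clusters.
library(cluster)
clusplot(data, fit$cluster, color = TRUE, shade = TRUE, labels = 0, lines = 0)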
###########################################################################
####################### Hierarchical clustering ##########################
###########################################################################
# Hierarchical clustering - Determining optimal number of clusters
library(NbClust)
res<-NbClust(data, diss=NULL, distance = "euclidean", min.nc=2, max.nc=6,
method = "ward.D2", index = "kl")
res$All.index
res$Best.nc
# Ward Hierarchical Clustering
d <- dist(data, method = "euclidean")
fit <- hclust(d, method="ward.D2")
plot(fit) # display dendrogram
# cluster assignment (members)
groups <- cutree(fit, k=2)
data <- cbind(data, groups)
# draw dendrogram with red borders around the 2 clusters
rect.hclust(fit, k=2, border="red")
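# Quick sanity check (a sketch, assuming the rows of `data` still line up one-to-one with
# the original iris observations): cross-tabulate the hierarchical groups against the
# known species labels to see how well the clusters recover them.
table(groups, iris$Species)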