Cluster Analysis
Clustering means finding similarities between data objects on the basis of the characteristics found in the data and grouping similar objects into clusters. It is an unsupervised learning technique (there is no dependent variable).
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer base, and then use this knowledge to develop targeted marketing programs.
Insurance: Identify groups of motor insurance policy holders with interesting characteristics.
Games: Identify player groups on the basis of their age group, location and the types of games they have shown interest in previously.
Internet: Cluster web pages based on their content.
Quality of Clustering
A good clustering method produces high-quality clusters with minimum within-cluster distance (high similarity) and maximum between-cluster distance (low similarity).
Ways to Measure Distance
Euclidean distance: sqrt((x2-x1)^2 + (y2-y1)^2)
Manhattan distance: |x2-x1| + |y2-y1|
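As a quick illustration (the two points below are made-up values, not from any dataset in this tutorial), base R's dist() function computes both measures:
# Two example points (hypothetical values)
pts <- rbind(c(1, 2), c(4, 6))
# Euclidean distance: sqrt((4-1)^2 + (6-2)^2) = 5
dist(pts, method = "euclidean")
# Manhattan distance: |4-1| + |6-2| = 7
dist(pts, method = "manhattan")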
Data Preparation
- Adequate Sample Size
- Standardize Continuous Variables
- Remove outliers
- Variable type: continuous or binary variables
- Check Multicollinearity (a short preparation sketch follows this list)
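A minimal preparation sketch (using the built-in iris data, the same dataset used in the script below; the 3-standard-deviation outlier cut-off and the 0.9 correlation threshold are illustrative assumptions, not fixed rules):
# Keep only the continuous variables from iris
prep <- iris[, -5]
# Standardize continuous variables so that no single variable dominates the distance
prep <- scale(prep)
# Flag potential outliers: observations more than 3 standard deviations from the mean on any variable
prep <- prep[!apply(abs(prep) > 3, 1, any), ]
# Check multicollinearity: look for highly correlated variable pairs (e.g., |r| > 0.9)
round(cor(prep), 2)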
Assess clustering tendency (clusterability)
It is important to assess clustering tendency (i.e., to determine whether the data contain meaningful clusters) before running any clustering algorithm. In unsupervised learning, clustering methods return clusters even if the data do not contain any clusters. In other words, if you blindly apply cluster analysis to a dataset, it will divide the data into clusters because that is what it is supposed to do.
Hopkins statistic is used to assess the clustering tendency of a dataset by measuring the probability that a given dataset is generated by a uniform data distribution (i.e. no meaningful clusters).
The null and the alternative hypotheses are defined as follow:
- Null hypothesis: the dataset is uniformly distributed (i.e., no meaningful clusters)
- Alternative hypothesis: the dataset is not uniformly distributed (i.e., contains meaningful clusters)
If the value of the Hopkins statistic is close to 0, the data is highly clusterable. If the value is close to 0.5, the data contains no meaningful clusters.
Determine the optimal number of clusters
In R, the "NbClust" package provides 30 indices for determining the optimal number of clusters. Two important methods among them are as follows:
- Look for a bend or elbow in the sum of squared errors (SSE) scree plot; the location of the elbow suggests a suitable number of clusters for k-means.
- Average silhouette width measures how well each observation fits its assigned cluster relative to the next closest cluster; the number of clusters that maximizes the average silhouette width is preferred.
# Loading data (iris without the Species column)
data <- iris[, -c(5)]
# Standardize the variables
data <- scale(data)
# Assessing cluster tendency
if(!require(clustertend)) install.packages("clustertend")
library(clustertend)
# Compute Hopkins statistic for the dataset
set.seed(123)
hopkins(data, n = nrow(data)-1)
# Since the H value = 0.1815, far below the 0.5 threshold, the data is highly clusterable
###########################################################################
####################### K Means clustering ################################
###########################################################################
# K-means - Determining the optimal number of clusters
# NbClust Package : 30 indices to determine the number of clusters in a dataset
# If index = 'all' - run 30 indices to determine the optimal no. of clusters
# If index = "silhouette" - It is a measure to estimate the dissimilarity between clusters.
# A higher silhouette width is preferred to determine the optimal number of clusters
if(!require(NbClust)) install.packages("NbClust")
nb <- NbClust(data, distance = "euclidean", min.nc=2, max.nc=15, method = "kmeans",
index = "silhouette")
nb$All.index
nb$Best.nc
# Method II : Same silhouette width analysis with the fpc package
if(!require(fpc)) install.packages("fpc")
library(fpc)
pamkClus <- pamk(data, krange = 2:15, criterion="multiasw", ns=2, critout=TRUE)
pamkClus$nc
cat("number of clusters estimated by optimum average silhouette width:", pamkClus$nc, "\n")
# Method III : Scree plot to determine the number of clusters
# Total within-group sum of squares when all observations form a single cluster (k = 1)
wss <- (nrow(data)-1)*sum(apply(data, 2, var))
# Total within-group sum of squares for k = 2 to 15
for (i in 2:15) {
  wss[i] <- sum(kmeans(data, centers = i)$withinss)
}
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# K-Means Cluster Analysis
fit <- kmeans(data,pamkClus$nc)
# get cluster means
aggregate(data,by=list(fit$cluster),FUN=mean)
# append cluster assignment to a separate data frame, so that `data` itself stays
# unchanged for the hierarchical clustering below
kmeans_result <- data.frame(data, clusterid = fit$cluster)
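# Optional check (a minimal sketch): visualize the k-means solution with clusplot() from
# the cluster package (shipped with R). The plot projects the data onto its first two
# principal components, so it is only an approximate 2-D view of the clusters.
library(cluster)
clusplot(data, fit$cluster, color = TRUE, shade = TRUE, labels = 0, lines = 0)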
###########################################################################
####################### Hierarchical clustering ##########################
###########################################################################
# Hierarchical clustering - Determining optimal number of clusters
library(NbClust)
res<-NbClust(data, diss=NULL, distance = "euclidean", min.nc=2, max.nc=6,
method = "ward.D2", index = "kl")
res$All.index
res$Best.nc
# Ward Hierarchical Clustering
d <- dist(data, method = "euclidean")
fit <- hclust(d, method="ward.D2")
plot(fit) # display dendrogram
# cluster assignment (members)
groups <- cutree(fit, k=2)
data <- cbind(data, groups)
# draw dendrogram with red borders around the 2 clusters
rect.hclust(fit, k=2, border="red")
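# Quick sanity check (a sketch, assuming the rows of `data` still line up one-to-one with
# the original iris observations): cross-tabulate the hierarchical groups against the
# known species labels to see how well the clusters recover them.
table(groups, iris$Species)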