

Cluster Analysis using R

Hello Data Experts,
Before I explain what Cluster Analysis (CA) is, think about how we as individuals help perform Cluster Analysis every day. Let me help you surface how we contribute to CA for the Retail or Hospitality industry. We are all charmed by reward points, and why not, if they bring additional benefits over time for staying in a hotel or for shopping. Both retailers and hoteliers have built mechanisms to collect structured data that benefits their business. They analyze the patterns and trends to decipher individual or group demographics and behavioral styles. Based on these interpretations, customized promotions are launched to increase their bottom line. Coming back to the question: we contribute to cluster analysis by unknowingly allowing business entities to collect data (into data lakes) with every action associated with reward points. While reading this article, keep correlating the details with the reward-points analogy.
What is Cluster Analysis?
Cluster in simple English means a group, so cluster analysis means executing a structured algorithm to group like-minded objects with similar properties. Objects with similar properties can form different types of groups, and the grouping cannot be generalized; an arbitrary grouping is not an acceptable outcome. Transaction data plays an important role in defining the clusters, but the choice of variables and the number of observations are key to making the result relevant to the domain. The more we explore, the higher the similarities observed within each cluster, so it is more of an exploratory approach. Clustering is an unsupervised learning approach where Y (the output) is unknown.
Why do we need Cluster Analysis?
Clustering helps label data having the same attributes for future actions, like promotions by retailers or hoteliers. If retailers are unaware of who their segmented customers are, they might not succeed in launching the right campaign. Cluster Analysis algorithms are based on proximity rather than correlation. The outcome of CA is classified data, which is much more manageable than a dataset containing only raw transactions.
Let us wear a statistician's hat now and change gears to better understand how, as Data Scientists, we should proceed with Cluster Analysis.
A few key points to keep in mind while working on Cluster Analysis:
  • There are different techniques or approaches for clustering:
    • Hierarchical: Dendrogram clustering
    • Non-Hierarchical: k-means clustering
    • Hybrid: a mix of Hierarchical and Non-Hierarchical clustering
  • The cluster count must be defined upfront in k-means, whereas the optimal number of clusters emerges from the Dendrogram.
  • Both techniques compute clusters using standardized data, i.e., Z-scores with mean 0 and standard deviation 1. Please refer to my blog on the Z distribution to know more about the Z score.
  • Graphical representation of clustering helps visualize similar and dissimilar groups. A Dendrogram depicts a tree structure, whereas k-means depicts a cluster structure.
  • Outliers are easy to identify: they form a single-record cluster, or a very small cluster in the case of a set of outliers.
  • Distance is the measure used to define the proximity of objects; the smaller the distance, the higher the similarity between the objects. Distance can be measured using:
    • Euclidean distance (we will use this in this session)
    • Manhattan distance
    • Mahalanobis distance
  • Distance is the key measure, but which distances get compared is defined by the link between two points, called linkage. Which two points to pick when computing the distance between clusters is decided by the linkage type (see the short R sketch after this list):
    • Single Linkage – nearest neighbor
    • Complete Linkage – farthest neighbor
    • Average Linkage – average over all pairs of data points
    • Centroid Linkage – distance between cluster centers.
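Here is a minimal sketch, on made-up numeric data, of how these choices map to R: the method argument of dist picks the distance measure, and the method argument of hclust picks the linkage.
# Toy numeric matrix purely for illustration
set.seed(1)
m <- matrix(rnorm(20), nrow = 5)
d_euclidean <- dist(m, method = "euclidean")  # the default distance
d_manhattan <- dist(m, method = "manhattan")
hc_single   <- hclust(d_euclidean, method = "single")    # nearest neighbor
hc_complete <- hclust(d_euclidean, method = "complete")  # farthest neighbor (hclust's default)
hc_average  <- hclust(d_euclidean, method = "average")
hc_centroid <- hclust(d_euclidean, method = "centroid")  # per ?hclust, centroid linkage is meant for squared Euclidean distances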
Hierarchical clustering (H-Cluster):
  1. Dendrogram is the technique for Hierarchical clustering, where the outcome is reflected as a hierarchy.
  2. This technique is used for small to medium population sizes, typically anything less than 100 for easy visualization; however, it can run on sizes of 1000 as well without constraints.
  3. There are 2 approaches to compute an H-Cluster: either top to bottom (one cluster breaks into records or smaller clusters, i.e., a 1-to-n approach) or bottom to top (records merge into groups to form a cluster, i.e., an n-to-1 approach). These are called the Divisive and Agglomerative approaches respectively.
  4. Quick and easy to apply; the flip side is the negative impact of outliers, so such observations might have to be removed from the dataset.
Non-Hierarchical Cluster (K-means):
  1. K-means is the technique for non-hierarchical clustering, where the visualization is more in the form of clusters.
  2. It is less influenced by outliers, and it still assigns such observations to one of the clusters.
  3. It is used for large datasets, and it works by fitting the model against a defined cluster count.
  4. This approach refines the clustering over time; an optimal size can be reached by incrementing the cluster count by 1 from the starting seed point.
  5. It requires iterations to get the optimal cluster count; the starting point can be chosen using a scree plot, leveraging the elbow point. A scree plot is a graphical representation of the variance within clusters as the cluster count grows. It shows a steep section, an elbow, and a nearly horizontal section; pick the cluster count at the elbow to get the optimal number of clusters.
Time for us to get to R programming to create clusters using the Hierarchical and Non-Hierarchical approaches. Let me first pick up the Hierarchical approach using a Dendrogram.
Let us take an example from the Retail industry. Nowadays we all shop online, irrespective of the tier of the city we live in and the time of the day, hence it will be easy for us to group users to understand patterns. E-retailer LALARA is in the business of selling mobile accessories. They have a loyalty program associated with their site, called MoRew. Over the last 1 week, 500 customers shopped, and the information below was collected from those transactions; it is in the Excel file “UP.xlsx”.
#   Username   Total Purchase   City   Purchase Mode   # of Items   Time of Purchase
1   A          2387             1      CC              3            2300
2   B          1272             1      CC              2            2000
3   C          1468             3      COD             2            1300
Based on the above table data, assume there are 500 such transactions for the last 1 week. The objective is to cluster users' buying patterns so that promotions can be launched for the right audience.
Let us use the Dendrogram approach for this; CDS.csv has all the transaction details listed in the above table.
Step: 1, Load csv file having all these data points
# Load the transaction data; replace <FilePath> with the folder holding CDS.csv
CDS <- read.csv("<FilePath>/CDS.csv")
View(CDS)
Step: 2, Execute the plot command to understand at a high level how the points are scattered. Note that plot on a data frame works cleanly only for numeric columns, so we plot just those.
# Scatterplot matrix of the numeric columns (character columns such as Username would break plot in recent R versions)
plot(CDS[, sapply(CDS, is.numeric)])
Step: 3, As we can notice, “Total Purchase” and “# of Items” have a very high disparity in terms of magnitude. So that the analysis does not get skewed by the high values of Total Purchase, let us first standardize the data. Standardizing data means normalizing it; the formula to standardize a data point is Z = (X - Mean)/Standard Deviation, i.e., Z = (x - μ)/σ. To get normalized data, we could first calculate the Mean and Standard Deviation and then apply the formula to convert each value; however, in R the scale command gets it done in a single run.
Execute the command below. Note that scale works only on numeric columns, so we drop the character columns (Username, Purchase Mode) first.
SCDS <- scale(CDS[, sapply(CDS, is.numeric)])
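As an optional sanity check (a minimal sketch; the column name Total.Purchase is an assumption based on how read.csv converts the “Total Purchase” header), scale's output should match the manual Z = (x - μ)/σ calculation:
# Manual standardization of one column; the column name Total.Purchase is assumed from the CSV header
manual <- (CDS$Total.Purchase - mean(CDS$Total.Purchase)) / sd(CDS$Total.Purchase)
all.equal(as.numeric(SCDS[, "Total.Purchase"]), manual)  # expect TRUE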
Step: 4, Once the data is standardized, let us calculate the distance between each pair of points. By default, R uses the Euclidean distance when we use the “dist” function. This function computes and returns the distance matrix, computed using the specified distance measure, between the rows of a data matrix.
SCDSdistance <- dist(SCDS)
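To peek at the result, the dist object can be converted to a full symmetric matrix; for 500 transactions it is 500 x 500, so here is a small slice (a sketch):
# Pairwise Euclidean distances between the first 3 transactions
round(as.matrix(SCDSdistance)[1:3, 1:3], 2)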
Step: 5, Once we have the distance between each pair of transaction points, we are ready to draw the Hierarchical Cluster Dendrogram. We will use the “hclust” command to build it. By default, Complete Linkage is used for the calculations.
HCLUSTSCDS <- hclust(SCDSdistance)
Once we have the hclust output, it is time for us to plot the Dendrogram.
# To get raw Dendrogram execute plot command.
plot(HCLUSTSCDS)
# To draw a symmetrical Dendrogram, add hang = -1 to the command.
plot(HCLUSTSCDS, hang = -1)
Since the Dendrogram plot for all 500 retail transactions would be huge, shown below is a sample Dendrogram.
It is always good to execute Hierarchical clustering using various linkage approaches. By default, it works on “Complete” linkage; if we need to change it to Average, add the attribute method = "average" (R expects the lowercase name). The Dendrograms show how the clustering changes: Complete linkage above and Average linkage below.
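A minimal sketch of the Average-linkage run (object names follow the ones used above):
HCLUSTSCDSavg <- hclust(SCDSdistance, method = "average")  # average linkage instead of the default complete
plot(HCLUSTSCDSavg, hang = -1)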
Visually it is difficult to read the clusters, so R allows us to draw borders based on the number of clusters we would like to see. If we want to look at 3 clusters, execute the command below. Ideally, we should first determine the optimal cluster count and then execute this command.
rect.hclust(HCLUSTSCDS, k = 3, border = "red")
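Beyond drawing borders, the cutree function extracts the actual cluster membership for each transaction, which is what you would join back to the customer data for targeting. A minimal sketch:
# Assign each of the 500 transactions to one of the 3 clusters
clusters <- cutree(HCLUSTSCDS, k = 3)
table(clusters)  # size of each cluster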

 
Let us draw a scree plot to understand the optimal cluster count:
set.seed(42)  # kmeans uses random starting centers; fixing the seed keeps the plot reproducible
wss <- (nrow(SCDS) - 1) * sum(apply(SCDS, 2, var))  # total within-cluster sum of squares for k = 1
for (i in 2:5) wss[i] <- sum(kmeans(SCDS, centers = i)$withinss)
plot(1:5, wss, type = "b", xlab = "Number of clusters", ylab = "Within-cluster sum of squares")
After drawing the scree plot, identify the elbow point to get the optimal cluster count.
 
Let us use the k-means approach for the same dataset CDS.csv, which has all the transaction details.
Step: 1, Load csv file having all these data points
CDS <- read.csv("<FilePath>/CDS.csv")
View(CDS)
Step: 2, Execute the plot command to understand at a high level how the points are scattered (again restricted to the numeric columns).
plot(CDS[, sapply(CDS, is.numeric)])
Step: 3, Let us run the kmeans command to build the k-means clusters, where centers = 4 reflects the number of clusters we would like to have; the right cluster count should be identified from the scree plot above. As with hclust, kmeans needs standardized numeric data.
SCDS <- scale(CDS[, sapply(CDS, is.numeric)])  # same standardization as before
KMCDS <- kmeans(SCDS, centers = 4)
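The kmeans result object carries the pieces needed to act on the clusters. A minimal sketch of inspecting it (KMCDS is the object created above):
KMCDS$size     # number of transactions in each cluster
KMCDS$centers  # cluster centers, in standardized units
# Attach each transaction's cluster label back to the original data for targeted promotions
CDS$cluster <- KMCDS$cluster
head(CDS)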
Let me conclude this session here; I hope this blog helped you build a good understanding of cluster analysis.
