Continuation from part 1:

Saudi Arabia

SAData <- retailData[which(retailData$Country == "Saudi Arabia"),]
SAData$CustomerID <- as.factor(SAData$CustomerID)
SAData$StockCode <- as.factor(as.character(SAData$StockCode))
str(SAData)
## 'data.frame':    10 obs. of  8 variables:
##  $ InvoiceNo  : Factor w/ 25900 levels "536365","536366",..: 3903 3903 3903 3903 3903 3903 3903 3903 3903 22901
##  $ StockCode  : Factor w/ 9 levels "20781","22361",..: 8 4 3 2 5 6 7 1 9 2
##  $ Description: Factor w/ 4224 levels ""," 4 PURPLE FLOCK DINNER CANDLES",..: 255 1569 1570 1565 2775 2777 2774 1590 1807 1565
##  $ Quantity   : int  12 6 6 6 12 12 12 2 12 -5
##  $ InvoiceDate: Factor w/ 23260 levels "1/11/11 10:01",..: 12543 12543 12543 12543 12543 12543 12543 12543 12543 16539
##  $ UnitPrice  : num  0.42 2.95 2.95 2.95 1.65 1.65 1.65 5.49 1.45 2.95
##  $ CustomerID : Factor w/ 1 level "12565": 1 1 1 1 1 1 1 1 1 1
##  $ Country    : Factor w/ 38 levels "Australia","Austria",..: 30 30 30 30 30 30 30 30 30 30
matrixSA <- xtabs(Quantity*UnitPrice ~ CustomerID + StockCode, data = SAData, addNA = TRUE, sparse = TRUE)
str(matrixSA)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:9] 0 0 0 0 0 0 0 0 0
##   ..@ p       : int [1:10] 0 1 2 3 4 5 6 7 8 9
##   ..@ Dim     : int [1:2] 1 9
##   ..@ Dimnames:List of 2
##   .. ..$ CustomerID: chr "12565"
##   .. ..$ StockCode : chr [1:9] "20781" "22361" "22362" "22363" ...
##   ..@ x       : num [1:9] 10.98 2.95 17.7 17.7 19.8 ...
##   ..@ factors : list()
SAData$StockCode
##  [1] 22915 22363 22362 22361 22553 22555 22556 20781 22969 22361
## Levels: 20781 22361 22362 22363 22553 22555 22556 22915 22969
emFit <- Mclust(matrixSA)
summary(emFit)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust V (univariate, unequal variance) model with 2 components: 
## 
##  log-likelihood n df       BIC      ICL
##       -22.76437 9  5 -56.51487 -56.5231
## 
## Clustering table:
## 1 2 
## 3 6
emFit$classification
## 20781 22361 22362 22363 22553 22555 22556 22915 22969 
##     1     1     2     2     2     2     2     1     2
ggplot(SAData, aes(x = reorder(Description, Quantity*UnitPrice),
                   y = Quantity*UnitPrice, fill = InvoiceNo)) +
  geom_bar(stat = "identity") +
  labs(x = "Product", y = "Total Amount Spent") +
  geom_text(aes(label = UnitPrice*Quantity), position = position_stack(0.5)) +
  ggtitle("Top Products bought in Saudi Arabia") +
  theme(legend.position = "none") +
  coord_flip()


Observation

Saudi Arabia consists of only 1 customer, who bought a total of 9 products. However, the clustering result shows 2 clusters. Upon further inspection, since the clusters are based on the total amount spent, one cluster groups the products on which relatively little was spent: "Assorted Bottle Top Magnets", "Gold Ear Muffin" and "Glass Jar Daisy Cotton Wool". For the other cluster, the amount spent is much higher; it includes products such as "Glass Jar Marmalade", "Peacock Bath Salts" and different varieties of "Plasters". This explains why the products are separated into two different clusters, showing how the customer's behaviour affects the clustering results. In the plot, the two invoices are indicated by the different colours. One cluster should consist of the first six products in the diagram, since each has a total amount spent of more than $11, whereas the remaining products form the other cluster. In addition, "Glass Jar Fresh Cotton Wool" has a negative invoice value, which suggests a cancelled invoice. Upon further inspection, we realised that it is indeed a cancelled invoice, as its invoice number contains the character "C"! Hence, its final amount comes to $2.95, which definitely belongs to the second cluster.
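The cancelled-invoice check described above can be sketched in a few lines. The tiny data frame below is a hypothetical stand-in for the retail data (the invoice numbers are made up), not the actual SAData:

```r
# Hypothetical stand-in for the retail data (invoice numbers made up)
toyData <- data.frame(
  InvoiceNo = c("553573", "553573", "C570867"),
  StockCode = c("22361", "22915", "22361"),
  Quantity  = c(6, 12, -5),
  UnitPrice = c(2.95, 0.42, 2.95),
  stringsAsFactors = FALSE
)

# Cancelled invoices are marked by a leading "C" in InvoiceNo
toyData$Cancelled <- grepl("^C", toyData$InvoiceNo)

# Netting purchases against cancellations gives the final amount per product,
# e.g. 6 * 2.95 - 5 * 2.95 = 2.95 for stock code 22361
aggregate(Amount ~ StockCode,
          data = transform(toyData, Amount = Quantity * UnitPrice),
          FUN = sum)
```

This is the same netting that xtabs performs when it sums Quantity*UnitPrice per customer and stock code.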


Moving on, we will perform EM clustering for both Bahrain and the Czech Republic.

Bahrain

BahrainData <- retailData[which(retailData$Country == "Bahrain"),]
BahrainData$CustomerID <- as.factor(BahrainData$CustomerID)
BahrainData$StockCode <- as.factor(as.character(BahrainData$StockCode))
str(BahrainData)
## 'data.frame':    17 obs. of  8 variables:
##  $ InvoiceNo  : Factor w/ 25900 levels "536365","536366",..: 7751 7751 7751 7751 7751 7751 7751 7751 7751 7751 ...
##  $ StockCode  : Factor w/ 16 levels "22423","22649",..: 3 8 9 7 2 1 16 6 4 5 ...
##  $ Description: Factor w/ 4224 levels ""," 4 PURPLE FLOCK DINNER CANDLES",..: 1671 1830 1155 2344 3650 2984 3128 3085 1651 2730 ...
##  $ Quantity   : int  24 96 60 2 8 2 12 6 6 6 ...
##  $ InvoiceDate: Factor w/ 23260 levels "1/11/11 10:01",..: 22981 22981 22981 22981 22981 22981 22981 22981 22981 22981 ...
##  $ UnitPrice  : num  1.25 1.25 1.25 9.95 4.95 ...
##  $ CustomerID : Factor w/ 2 levels "12353","12355": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Country    : Factor w/ 38 levels "Australia","Austria",..: 3 3 3 3 3 3 3 3 3 3 ...
matrixBahrain <- xtabs(Quantity*UnitPrice ~ CustomerID + StockCode, data = BahrainData, addNA = TRUE, sparse = TRUE)
str(matrixBahrain)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:17] 1 1 1 1 1 1 0 1 1 1 ...
##   ..@ p       : int [1:17] 0 1 2 3 4 5 6 8 9 10 ...
##   ..@ Dim     : int [1:2] 2 16
##   ..@ Dimnames:List of 2
##   .. ..$ CustomerID: chr [1:2] "12353" "12355"
##   .. ..$ StockCode : chr [1:16] "22423" "22649" "22693" "22697" ...
##   ..@ x       : num [1:17] 25.5 39.6 30 17.7 17.7 17.7 39.8 19.9 120 75 ...
##   ..@ factors : list()
BahrainData$StockCode
##  [1] 22693  23076  23077  22890  22649  22423  85040A 22699  22697  22698 
## [11] 72802A 72802B 72802C 37449  37446  22890  37450 
## 16 Levels: 22423 22649 22693 22697 22698 22699 22890 23076 23077 ... 85040A
emFit <- Mclust(matrixBahrain)
summary(emFit)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust XXI (diagonal multivariate normal) model with 1 component: 
## 
##  log-likelihood n df       BIC       ICL
##       -127.3295 2 32 -276.8398 -276.8398
## 
## Clustering table:
## 1 
## 2
emFit$classification
## [1] 1 1

Czech Republic

CZRData <- retailData[which(retailData$Country == "Czech Republic"),]
CZRData$CustomerID <- as.factor(CZRData$CustomerID)
CZRData$StockCode <- as.factor(as.character(CZRData$StockCode))
str(CZRData)
## 'data.frame':    30 obs. of  8 variables:
##  $ InvoiceNo  : Factor w/ 25900 levels "536365","536366",..: 4030 4030 4030 4030 4030 4030 4030 4030 4030 4030 ...
##  $ StockCode  : Factor w/ 25 levels "20972","20974",..: 17 23 8 7 9 11 22 1 12 6 ...
##  $ Description: Factor w/ 4224 levels ""," 4 PURPLE FLOCK DINNER CANDLES",..: 308 891 3700 3892 1895 1023 2703 2640 3098 3429 ...
##  $ Quantity   : int  18 48 24 12 36 32 24 24 24 12 ...
##  $ InvoiceDate: Factor w/ 23260 levels "1/11/11 10:01",..: 15237 15237 15237 15237 15237 15237 15237 15237 15237 15237 ...
##  $ UnitPrice  : num  2.55 0.65 0.85 1.25 1.45 0.85 1.49 1.25 2.95 4.25 ...
##  $ CustomerID : Factor w/ 1 level "12781": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Country    : Factor w/ 38 levels "Australia","Austria",..: 9 9 9 9 9 9 9 9 9 9 ...
matrixCZR <- xtabs(Quantity*UnitPrice ~ CustomerID + StockCode, data = CZRData, addNA = TRUE, sparse = TRUE)
str(matrixCZR)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:25] 0 0 0 0 0 0 0 0 0 0 ...
##   ..@ p       : int [1:26] 0 1 2 3 4 5 6 7 8 9 ...
##   ..@ Dim     : int [1:2] 1 25
##   ..@ Dimnames:List of 2
##   .. ..$ CustomerID: chr "12781"
##   .. ..$ StockCode : chr [1:25] "20972" "20974" "20975" "21253" ...
##   ..@ x       : num [1:25] 30 15.6 15.6 35.4 18 51 15 20.4 8.7 23.4 ...
##   ..@ factors : list()
CZRData$StockCode
##  [1] 22930  84755  22216  21791  22231  22250  84459A 20972  22326  21428 
## [11] 22587  47594B 85206A 22244  22505  22231  84459A 47421  23271  22579 
## [21] 22578  21253  21373  20974  20975  84347  POST   84459A 22231  POST  
## 25 Levels: 20972 20974 20975 21253 21373 21428 21791 22216 22231 ... POST
emFit <- Mclust(matrixCZR)
summary(emFit)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust X (univariate normal) model with 1 component: 
## 
##  log-likelihood  n df       BIC       ICL
##       -110.6767 25  2 -227.7911 -227.7911
## 
## Clustering table:
##  1 
## 25
emFit$classification
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Observation

Upon running the clustering for both Bahrain and the Czech Republic, both produce only 1 cluster. We noticed a similarity between these two countries: running emFit$classification does not print out the Stock IDs of the cluster members. This may be because only 1 cluster is present after the clustering process, which in turn could be due to the small number of records for these countries, most likely consisting of only 1 or 2 customers. Thus, there is a high chance that their customer behaviour is similar and does not vary much, which places everything under the same cluster.
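The small-country effect above can be checked directly by counting distinct customers per country; the toy data frame here is hypothetical, standing in for retailData:

```r
# Hypothetical stand-in for retailData
toyRetail <- data.frame(
  Country    = c("Bahrain", "Bahrain", "Czech Republic",
                 "Czech Republic", "France", "France"),
  CustomerID = c("12353", "12355", "12781", "12781", "12413", "12437")
)

# Countries with only 1 or 2 distinct customers leave Mclust almost
# nothing to separate, so a single cluster is the expected outcome
custPerCountry <- tapply(toyRetail$CustomerID, toyRetail$Country,
                         function(ids) length(unique(ids)))
custPerCountry
```

Running the same count on the real retailData would flag, in advance, which countries are too small to cluster meaningfully.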


Based on countries with the most purchases

Moving on, we decided to work on the top countries: France, Germany and Ireland. The distance between these countries and the UK is small, as they are considered the UK's neighbours, hence a majority of the customers come from those countries. Initially, we did try to run clustering on the UK, but due to the huge amount of data, performing the EM fit was not possible. Hence, we performed k-means clustering on Germany and Ireland, which produced 9 and 2 clusters respectively. Upon further inspection of the data, we realised that for Germany there are a number of single-member clusters, which might mean they are anomalies. Ireland, on the other hand, only has 3 customers who made purchases, hence the number of clusters was only 2. Therefore, we will not be performing further extraction of the clusters for the UK, Ireland and Germany.

Instead, we will look into France, first creating a customer-product matrix for that country before moving on to the possible clustering algorithms. Let's hope we come up with promising clusters and insights!


Building a Customer-Product matrix based on countries before clustering using EM

We start off by extracting the France data and taking a quick look at its structure, then build a Customer-Product matrix from FranceData. For this section, we will try an alternative to starting with k-means: we begin with hierarchical clustering before applying the other clustering algorithms, k-means and then EM. The difference with hierarchical clustering is that it does not require us to specify the number of clusters in advance, and it produces a tree-based dendrogram. Having results from the different clustering algorithms will help us compare them and further interpret the customer behaviour of each cluster.

FranceData <- retailData[which(retailData$Country == "France"),]
FranceData$CustomerID <- as.factor(FranceData$CustomerID)
FranceData$StockCode <- as.factor(as.character(FranceData$StockCode))
str(FranceData)
## 'data.frame':    8491 obs. of  8 variables:
##  $ InvoiceNo  : Factor w/ 25900 levels "536365","536366",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ StockCode  : Factor w/ 1523 levels "10002","10120",..: 812 811 810 314 353 1 330 93 551 757 ...
##  $ Description: Factor w/ 4224 levels ""," 4 PURPLE FLOCK DINNER CANDLES",..: 172 173 169 2485 3633 1841 3892 3377 3098 3605 ...
##  $ Quantity   : int  24 24 12 12 24 48 24 18 24 24 ...
##  $ InvoiceDate: Factor w/ 23260 levels "1/11/11 10:01",..: 202 202 202 202 202 202 202 202 202 202 ...
##  $ UnitPrice  : num  3.75 3.75 3.75 0.85 0.65 0.85 1.25 2.95 2.95 1.95 ...
##  $ CustomerID : Factor w/ 87 levels "12413","12437",..: 27 27 27 27 27 27 27 27 27 27 ...
##  $ Country    : Factor w/ 38 levels "Australia","Austria",..: 14 14 14 14 14 14 14 14 14 14 ...
matrixFrance <- xtabs(Quantity*UnitPrice ~ CustomerID + StockCode, data = FranceData, addNA = TRUE, sparse = TRUE)
str(matrixFrance)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:5641] 26 38 55 56 80 24 80 83 3 20 ...
##   ..@ p       : int [1:1524] 0 5 6 8 9 10 12 13 14 21 ...
##   ..@ Dim     : int [1:2] 87 1523
##   ..@ Dimnames:List of 2
##   .. ..$ CustomerID: chr [1:87] "12413" "12437" "12441" "12488" ...
##   .. ..$ StockCode : chr [1:1523] "10002" "10120" "10125" "10135" ...
##   ..@ x       : num [1:5641] 40.8 10.2 10.2 20.4 234.6 ...
##   ..@ factors : list()

Performing hierarchical clustering on France data

hiercFit <- hclust(dist(matrixFrance, method = "euclidean"),
                   method="ward.D")
plot(hiercFit)

hiercFit <- hclust(dist(matrixFrance, method = "maximum"), 
                   method="ward.D")

Observation

Initially, for the hierarchical clustering, we used "euclidean" as the distance method and the clustering result was fairly good. However, after changing it to "maximum", the clusters are much better separated, with greater heights than under "euclidean". From the plot, the dendrogram explicitly shows the hierarchy of the clusters, and by cutting the dendrogram, there are a total of 3 clusters. The ones on the far left most probably form a cluster that can also be deemed an anomaly. From here, we can proceed to set k to 3 and use the rect.hclust() function to distinctly separate the clusters.

plot(hiercFit)
K <- 3
rect.hclust(hiercFit, k = K)

cutree(hiercFit, k = 3) #want to show the size of each cluster
## 12413 12437 12441 12488 12489 12490 12491 12493 12494 12506 12508 12509 
##     1     2     1     1     1     2     1     1     1     1     1     1 
## 12513 12523 12532 12535 12536 12553 12562 12564 12567 12571 12573 12574 
##     1     1     1     1     1     2     2     1     2     1     1     1 
## 12577 12579 12583 12589 12593 12598 12599 12602 12604 12615 12616 12617 
##     1     1     2     1     1     2     1     1     1     2     1     1 
## 12620 12624 12637 12640 12643 12650 12651 12652 12656 12657 12659 12660 
##     1     1     2     2     2     1     1     1     2     2     1     1 
## 12669 12670 12672 12674 12678 12679 12680 12681 12682 12683 12684 12685 
##     1     2     1     2     3     2     1     2     3     2     2     2 
## 12686 12689 12690 12691 12694 12695 12700 12707 12714 12716 12718 12719 
##     1     1     1     2     1     1     2     1     2     1     1     1 
## 12721 12722 12723 12724 12726 12727 12728 12729 12731 12732 12734 12735 
##     2     1     1     1     2     2     1     1     3     1     1     1 
## 12736 12740 14277 
##     1     1     1

Observation

As mentioned above, the 3 different clusters are identified by the distinct red boxes that separate them. The far-left cluster has the fewest data points, so there is a possibility that it is an anomaly, and we can actually ignore it. The second cluster, the middle red box, seems to have the highest number of data points, while the third cluster has a fairly decent number of data points too. We will now perform k-means and EM to check whether the number of clusters matches the result of this algorithm, before extracting and looking into each of the clusters.


Performing k-means clustering on France data

In this section, we will be finding the optimal number of clusters by using the elbow method.

kMin <- 1
kMax <- 10
withinSS <- double(kMax - kMin + 1)
betweenSS <- double(kMax - kMin + 1)

for (K in kMin:kMax) {
  kMeansFit <- kmeans(matrixFrance, centers = K, nstart = 20)
  withinSS[K] <- sum(kMeansFit$withinss)
  betweenSS[K] <- kMeansFit$betweenss
}

plot(kMin:kMax, withinSS, pch=19, type="b", col="red",
     xlab = "Value of K", ylab = "Sum of Squares (Within and Between)")
points(kMin:kMax, betweenSS, pch=19, type="b", col="green")

Observation

From the output, the "elbow" of the curve indicates the best value of k. The aim of this method is to obtain a small Sum of Squared Error (SSE): the SSE tends to decrease towards 0 as k increases, so we pick the point where the decrease levels off. In this case, looking at the plot, the optimal k seems to be 3. With that, we set k to 3 and perform the optimal k-means clustering for France.

K <- 3
kMeansFit.optimalFrance <- kmeans(matrixFrance, centers = K, nstart = 20)
kMeansFit.optimalFrance$size
## [1] 83  3  1
kMeansFit.optimalFrance$withinss
## [1] 10820922  5253204        0
(kMeansFit.optimalFrance$betweenss / kMeansFit.optimalFrance$totss) * 100
## [1] 56.85748

Observation

Both the hierarchical clustering and the elbow method for k-means suggest 3 clusters. However, from the 3 clusters, the sizes are 1, 83 and 3, with within-cluster sums of squares of 0, 10820922 and 5253204 respectively. This is rather imbalanced and strengthens our assumption that the clusters might not be spherical in shape. Hence, to address this, we proceed to perform EM.

emFit <- Mclust(matrixFrance)
summary(emFit)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VII (spherical, varying volume) model with 2 components: 
## 
##  log-likelihood  n   df       BIC       ICL
##       -438736.7 87 3049 -891089.9 -891089.9
## 
## Clustering table:
##  1  2 
## 60 27

Observation

Oh, what an observation! Indeed, EM returns 2 clusters. This suggests that our assumption from the earlier hierarchical clustering was right: the cluster with the fewest points is an anomaly, and there is a chance that EM ignored that cluster. From the results obtained, we will get going with the extraction of the individual clusters.


Looking into each cluster

Similar to the extraction of individual clusters done on the whole data previously, we will perform the following on the individual clusters of France too:

  • Finding the most popular products bought in each cluster

To perform further clustering and analysis, we look for the top products bought in each cluster. Since the Customer-Product matrix is built on the total amount spent by each customer per product, we can also perform a market basket analysis and look into the most frequent items bought in each cluster.
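As a minimal base-R sketch of the market-basket idea (the baskets below are hypothetical, not the actual cluster data), the frequency of each product across invoices can be computed as:

```r
# Hypothetical invoice baskets: each element holds the stock codes on one invoice
baskets <- list(
  inv1 = c("21212", "22959", "21086"),
  inv2 = c("21212", "21086"),
  inv3 = c("21212", "22959")
)

# Share of baskets containing each product -- the "support" of single items,
# i.e. the most frequently bought products within a cluster
itemFreq <- sort(table(unlist(baskets)) / length(baskets), decreasing = TRUE)
itemFreq
```

For full association rules (which products are bought together), a package such as arules could be applied to the same basket list.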


Cluster 1

cluster1France <- emFit$classification[emFit$classification == "1"]
cluster2France <- emFit$classification[emFit$classification == "2"]
cluster1France.names <- names(cluster1France)
cluster2France.names <- names(cluster2France)
cluster1France.data <- FranceData[which(FranceData$CustomerID %in% cluster1France.names),]
cluster1France.agg <- aggregate(list(Quantity=cluster1France.data$Quantity), by=list(StockCode=cluster1France.data$StockCode), FUN=sum)
cluster1France.topQty <- head(cluster1France.agg[order(cluster1France.agg$Quantity, decreasing = TRUE),], n=10)
cluster1France.topProd <- distinct(merge(cluster1France.topQty, cluster1France.data[,2:3]), StockCode, .keep_all = TRUE)
cluster1France.topProd[order(cluster1France.topProd$Quantity, decreasing = TRUE),]
##    StockCode Quantity                         Description
## 5      21212      432     PACK OF 72 RETROSPOT CAKE CASES
## 9      22959      300              WRAP CHRISTMAS VILLAGE
## 6      21232      259      STRAWBERRY CERAMIC TRINKET BOX
## 3      21086      204         SET/6 RED SPOTTY PAPER CUPS
## 7      21731      199       RED TOADSTOOL LED NIGHT LIGHT
## 2      21080      192 SET/20 RED RETROSPOT PAPER NAPKINS 
## 4      21094      192       SET/6 RED SPOTTY PAPER PLATES
## 8      22554      180    PLASTERS IN TIN WOODLAND ANIMALS
## 10      POST      169                             POSTAGE
## 1      20979      160       36 PENCILS TUBE RED RETROSPOT
ggplot(cluster1France.topProd, aes(x = reorder(Description, Quantity),
                                   y = Quantity, fill = Description)) +
  geom_bar(stat = "identity") +
  labs(x = "Product", y = "Quantity bought", fill = "Product descriptions") +
  geom_text(aes(label = Quantity), position = position_stack(0.5)) +
  ggtitle("Top Products bought in Cluster 1") +
  coord_flip() +
  theme(legend.position = "none")

cluster2France.data <- FranceData[which(FranceData$CustomerID %in% cluster2France.names),]
cluster2France.agg <- aggregate(list(Quantity=cluster2France.data$Quantity), by=list(StockCode=cluster2France.data$StockCode), FUN=sum)
cluster2France.topQty <- head(cluster2France.agg[order(cluster2France.agg$Quantity, decreasing = TRUE),], n=10)
cluster2France.topProd <- distinct(merge(cluster2France.topQty, cluster2France.data[,2:3]), StockCode, .keep_all = TRUE)
cluster2France.topProd[order(cluster2France.topProd$Quantity, decreasing = TRUE),]
##    StockCode Quantity                      Description
## 9      23084     3845               RABBIT NIGHT LIGHT
## 5      22492     2052          MINI PAINT SET VINTAGE 
## 10     84879     1192    ASSORTED COLOUR BIRD ORNAMENT
## 4      21731     1091    RED TOADSTOOL LED NIGHT LIGHT
## 2      21086     1068      SET/6 RED SPOTTY PAPER CUPS
## 8      22556     1019   PLASTERS IN TIN CIRCUS PARADE 
## 7      22554      964 PLASTERS IN TIN WOODLAND ANIMALS
## 3      21094      924    SET/6 RED SPOTTY PAPER PLATES
## 6      22551      891         PLASTERS IN TIN SPACEBOY
## 1      20725      831          LUNCH BAG RED RETROSPOT
ggplot(cluster2France.topProd, aes(x = reorder(Description, Quantity),
                                   y = Quantity, fill = Description)) +
  geom_bar(stat = "identity") +
  labs(x = "Product", y = "Quantity bought", fill = "Product descriptions") +
  geom_text(aes(label = Quantity), position = position_stack(0.5)) +
  ggtitle("Top Products bought in Cluster 2") +
  coord_flip() +
  theme(legend.position = "none")


Interpretation

Cluster with “Pack of 72 Retrospot Cake Cases” being the top bought product

From the plot shown, there is a possibility that the customers in this cluster are partyware and packaging suppliers. This can be seen from the products bought in this cluster, such as “Paper Plates”, “Paper Cups” and “Paper Napkins”. The top-selling product, with 432 cake cases bought, strengthens this claim. With “Wrap Christmas Village” being the second most bought product in this cluster, there is a high possibility that this shop is trying to restock before the festive season kicks in.

Cluster with “Rabbit Night Light” being the top bought product

From the plot shown, the customers in this cluster are most probably business owners selling children’s accessories. The majority of the products in cluster 2 are items such as a “Night Light” for kids to decorate their room, and 3 of the products are “Plasters in Tin” in different designs, which shows that these products are most likely very suitable for kids. “Rabbit Night Light” is the most distinguished product, as the quantity bought amounts to twice that of the second most popular product, “Mini Paint Set Vintage”. There is a possibility that it is a popular cartoon character among kids. Apart from that, one of the products is a lunch bag, which fits our assumption.


Insights for this assignment

In conclusion, the following are some insights and learning points which we had discovered throughout this team assignment.

  1. On a huge dataset it is nearly impossible to cluster the whole data with EM, as the fitting process got stuck at 11%. Hierarchical clustering is a definite no, as the dendrogram would be far too large to plot. We then realised that only k-means is a suitable clustering algorithm here. However, the downside of k-means is that it does not take ellipsoidal-shaped clusters into consideration, which will affect the clustering. Hence, in this assignment, only k-means was run on the whole dataset. We found clara (Clustering Large Applications, from the cluster package), which could help with clustering huge datasets, but since the algorithm works on sample points drawn from the whole dataset, it may also not be a feasible way to cluster this data. Picking sample points may improve the speed of the clustering; however, accuracy is the main focus of this assignment, so we did not want to risk the information loss.

  2. We were unable to perform k-means on the smaller countries because there was a lack of customer purchasing behaviour. For example, when we set the number of clusters to 2 for Bahrain or Ireland, this error message was prompted: Error: number of cluster centres must lie between 1 and nrow(x).

  3. The EM fit makes use of the mclust package, and when we clustered by country, we realised that different models were used during the EM clustering for multivariate mixtures, univariate mixtures and single components.

  4. Another way to approach this assignment is RFM Analysis, a marketing technique that helps determine which customers are the best ones according to:
  • How recently did a customer purchase?
  • How often do they purchase?
  • How much do they spend?
  5. In this assignment, we did not look into the timestamps of the invoices. However, if we were to consider the time component, we could obtain better analysis by looking into the similarity of products bought and the amounts spent across the different months. Apart from that, we only removed the Customer IDs which contained NAs; more preprocessing could be done to reduce the size of the dataset, such as dimensionality reduction using PCA.
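The RFM idea from point 4 could be sketched as follows; the transaction table, customer IDs, dates and amounts below are made up for illustration:

```r
# Hypothetical transaction table
tx <- data.frame(
  CustomerID  = c("12413", "12413", "12437", "12488", "12488", "12488"),
  InvoiceDate = as.Date(c("2011-11-01", "2011-12-05", "2011-10-20",
                          "2011-09-01", "2011-11-15", "2011-12-01")),
  Amount      = c(50, 20, 200, 10, 15, 30)
)
asOf <- as.Date("2011-12-10")  # reference date for computing recency

# Recency: days since last purchase; Frequency: number of transactions;
# Monetary: total amount spent
byCust <- split(tx, tx$CustomerID)
rfm <- data.frame(
  CustomerID = names(byCust),
  Recency    = sapply(byCust, function(d) as.numeric(asOf - max(d$InvoiceDate))),
  Frequency  = sapply(byCust, nrow),
  Monetary   = sapply(byCust, function(d) sum(d$Amount))
)
rfm
```

A customer-level table like this could then be fed into the same clustering algorithms used above, instead of the customer-product matrix.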