SAData <- retailData[which(retailData$Country == "Saudi Arabia"),]
SAData$CustomerID <- as.factor(SAData$CustomerID)
SAData$StockCode <- as.factor(as.character(SAData$StockCode))
str(SAData)
## 'data.frame': 10 obs. of 8 variables:
## $ InvoiceNo : Factor w/ 25900 levels "536365","536366",..: 3903 3903 3903 3903 3903 3903 3903 3903 3903 22901
## $ StockCode : Factor w/ 9 levels "20781","22361",..: 8 4 3 2 5 6 7 1 9 2
## $ Description: Factor w/ 4224 levels ""," 4 PURPLE FLOCK DINNER CANDLES",..: 255 1569 1570 1565 2775 2777 2774 1590 1807 1565
## $ Quantity : int 12 6 6 6 12 12 12 2 12 -5
## $ InvoiceDate: Factor w/ 23260 levels "1/11/11 10:01",..: 12543 12543 12543 12543 12543 12543 12543 12543 12543 16539
## $ UnitPrice : num 0.42 2.95 2.95 2.95 1.65 1.65 1.65 5.49 1.45 2.95
## $ CustomerID : Factor w/ 1 level "12565": 1 1 1 1 1 1 1 1 1 1
## $ Country : Factor w/ 38 levels "Australia","Austria",..: 30 30 30 30 30 30 30 30 30 30
matrixSA <- xtabs(Quantity*UnitPrice ~ CustomerID + StockCode, data = SAData, addNA = TRUE, sparse = TRUE)
str(matrixSA)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:9] 0 0 0 0 0 0 0 0 0
## ..@ p : int [1:10] 0 1 2 3 4 5 6 7 8 9
## ..@ Dim : int [1:2] 1 9
## ..@ Dimnames:List of 2
## .. ..$ CustomerID: chr "12565"
## .. ..$ StockCode : chr [1:9] "20781" "22361" "22362" "22363" ...
## ..@ x : num [1:9] 10.98 2.95 17.7 17.7 19.8 ...
## ..@ factors : list()
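To make the construction above concrete, here is a minimal sketch of how xtabs() builds such a customer-product matrix, on made-up data (not from the retail set):

```r
# Toy sketch (made-up data): xtabs() sums Quantity * UnitPrice into
# one cell per (CustomerID, StockCode) pair, as in matrixSA above.
toy <- data.frame(
  CustomerID = factor(c("A", "A", "B")),
  StockCode  = factor(c("X", "Y", "X")),
  Quantity   = c(2, 1, 3),
  UnitPrice  = c(1.50, 4.00, 1.50)
)
toyMatrix <- xtabs(Quantity * UnitPrice ~ CustomerID + StockCode, data = toy)
toyMatrix["A", "X"]  # 2 * 1.50 = 3
toyMatrix["B", "Y"]  # 0: customer B never bought product Y
```

Cells for customer/product pairs that never occur are simply 0, which is why the sparse representation above is a natural fit.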
SAData$StockCode
## [1] 22915 22363 22362 22361 22553 22555 22556 20781 22969 22361
## Levels: 20781 22361 22362 22363 22553 22555 22556 22915 22969
emFit <- Mclust(matrixSA)
summary(emFit)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust V (univariate, unequal variance) model with 2 components:
##
## log-likelihood n df BIC ICL
## -22.76437 9 5 -56.51487 -56.5231
##
## Clustering table:
## 1 2
## 3 6
emFit$classification
## 20781 22361 22362 22363 22553 22555 22556 22915 22969
## 1 1 2 2 2 2 2 1 2
ggplot(SAData, aes(x = reorder(Description, Quantity * UnitPrice), y = Quantity * UnitPrice, fill = InvoiceNo)) +
  geom_bar(stat = "identity") +
  labs(x = "Product", y = "Total Amount Spent") +
  geom_text(aes(label = Quantity * UnitPrice), position = position_stack(0.5)) +
  ggtitle("Top Products Bought in Saudi Arabia") +
  theme(legend.position = "none") +
  coord_flip()
Saudi Arabia consists of only 1 customer, who bought a total of 9 products. Nevertheless, the clustering result shows 2 clusters. Since the clusters are based on the total purchase amount, one cluster groups the products on which comparatively little was spent: “Assorted Bottle Top Magnets”, “Gold Ear Muffin” and “Glass Jar Daisy Cotton Wool”. For the other cluster, the amount spent is much higher; it includes products such as “Glass Jar Marmalade”, “Peacock Bath Salts” and different varieties of “Plasters”. This explains why the purchases separate into two distinct clusters, showing how customer behaviour affects the clustering results. In the plot, the two invoices are indicated by different colours. We would expect one cluster to consist of the first six products in the diagram, since each has a total amount spent above $11, while the remaining products form the other cluster. In addition, “Glass Jar Fresh Cotton Wool” carries a negative invoice value, which suggests a cancelled invoice. Upon further inspection, we realised that it is indeed a cancelled invoice, as its invoice number contains the character “C”! Hence its final amount comes to $2.95, which definitely belongs to the second cluster.
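The cancellation check described above can be scripted; here is a minimal sketch on hypothetical invoice numbers (cancelled invoices in this dataset start with “C” and carry negative quantities):

```r
# Sketch (hypothetical invoice numbers): flag cancellations by the
# leading "C", then net the amounts across purchase and cancellation.
invoices <- data.frame(
  InvoiceNo = c("541234", "C545678"),
  Quantity  = c(6, -5),
  UnitPrice = c(2.95, 2.95),
  stringsAsFactors = FALSE
)
invoices$Cancelled <- grepl("^C", invoices$InvoiceNo)
netAmount <- sum(invoices$Quantity * invoices$UnitPrice)
netAmount  # 6 * 2.95 - 5 * 2.95 = 2.95
```

The same grepl() test can be applied to the InvoiceNo column of the retail data to separate cancellations before clustering.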
Moving on, we will perform EM clustering for both Bahrain and the Czech Republic.
BahrainData <- retailData[which(retailData$Country == "Bahrain"),]
BahrainData$CustomerID <- as.factor(BahrainData$CustomerID)
BahrainData$StockCode <- as.factor(as.character(BahrainData$StockCode))
str(BahrainData)
## 'data.frame': 17 obs. of 8 variables:
## $ InvoiceNo : Factor w/ 25900 levels "536365","536366",..: 7751 7751 7751 7751 7751 7751 7751 7751 7751 7751 ...
## $ StockCode : Factor w/ 16 levels "22423","22649",..: 3 8 9 7 2 1 16 6 4 5 ...
## $ Description: Factor w/ 4224 levels ""," 4 PURPLE FLOCK DINNER CANDLES",..: 1671 1830 1155 2344 3650 2984 3128 3085 1651 2730 ...
## $ Quantity : int 24 96 60 2 8 2 12 6 6 6 ...
## $ InvoiceDate: Factor w/ 23260 levels "1/11/11 10:01",..: 22981 22981 22981 22981 22981 22981 22981 22981 22981 22981 ...
## $ UnitPrice : num 1.25 1.25 1.25 9.95 4.95 ...
## $ CustomerID : Factor w/ 2 levels "12353","12355": 2 2 2 2 2 2 2 2 2 2 ...
## $ Country : Factor w/ 38 levels "Australia","Austria",..: 3 3 3 3 3 3 3 3 3 3 ...
matrixBahrain <- xtabs(Quantity*UnitPrice ~ CustomerID + StockCode, data = BahrainData, addNA = TRUE, sparse = TRUE)
str(matrixBahrain)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:17] 1 1 1 1 1 1 0 1 1 1 ...
## ..@ p : int [1:17] 0 1 2 3 4 5 6 8 9 10 ...
## ..@ Dim : int [1:2] 2 16
## ..@ Dimnames:List of 2
## .. ..$ CustomerID: chr [1:2] "12353" "12355"
## .. ..$ StockCode : chr [1:16] "22423" "22649" "22693" "22697" ...
## ..@ x : num [1:17] 25.5 39.6 30 17.7 17.7 17.7 39.8 19.9 120 75 ...
## ..@ factors : list()
BahrainData$StockCode
## [1] 22693 23076 23077 22890 22649 22423 85040A 22699 22697 22698
## [11] 72802A 72802B 72802C 37449 37446 22890 37450
## 16 Levels: 22423 22649 22693 22697 22698 22699 22890 23076 23077 ... 85040A
emFit <- Mclust(matrixBahrain)
summary(emFit)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust XXI (diagonal multivariate normal) model with 1 component:
##
## log-likelihood n df BIC ICL
## -127.3295 2 32 -276.8398 -276.8398
##
## Clustering table:
## 1
## 2
emFit$classification
## [1] 1 1
CZRData <- retailData[which(retailData$Country == "Czech Republic"),]
CZRData$CustomerID <- as.factor(CZRData$CustomerID)
CZRData$StockCode <- as.factor(as.character(CZRData$StockCode))
str(CZRData)
## 'data.frame': 30 obs. of 8 variables:
## $ InvoiceNo : Factor w/ 25900 levels "536365","536366",..: 4030 4030 4030 4030 4030 4030 4030 4030 4030 4030 ...
## $ StockCode : Factor w/ 25 levels "20972","20974",..: 17 23 8 7 9 11 22 1 12 6 ...
## $ Description: Factor w/ 4224 levels ""," 4 PURPLE FLOCK DINNER CANDLES",..: 308 891 3700 3892 1895 1023 2703 2640 3098 3429 ...
## $ Quantity : int 18 48 24 12 36 32 24 24 24 12 ...
## $ InvoiceDate: Factor w/ 23260 levels "1/11/11 10:01",..: 15237 15237 15237 15237 15237 15237 15237 15237 15237 15237 ...
## $ UnitPrice : num 2.55 0.65 0.85 1.25 1.45 0.85 1.49 1.25 2.95 4.25 ...
## $ CustomerID : Factor w/ 1 level "12781": 1 1 1 1 1 1 1 1 1 1 ...
## $ Country : Factor w/ 38 levels "Australia","Austria",..: 9 9 9 9 9 9 9 9 9 9 ...
matrixCZR <- xtabs(Quantity*UnitPrice ~ CustomerID + StockCode, data = CZRData, addNA = TRUE, sparse = TRUE)
str(matrixCZR)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:25] 0 0 0 0 0 0 0 0 0 0 ...
## ..@ p : int [1:26] 0 1 2 3 4 5 6 7 8 9 ...
## ..@ Dim : int [1:2] 1 25
## ..@ Dimnames:List of 2
## .. ..$ CustomerID: chr "12781"
## .. ..$ StockCode : chr [1:25] "20972" "20974" "20975" "21253" ...
## ..@ x : num [1:25] 30 15.6 15.6 35.4 18 51 15 20.4 8.7 23.4 ...
## ..@ factors : list()
CZRData$StockCode
## [1] 22930 84755 22216 21791 22231 22250 84459A 20972 22326 21428
## [11] 22587 47594B 85206A 22244 22505 22231 84459A 47421 23271 22579
## [21] 22578 21253 21373 20974 20975 84347 POST 84459A 22231 POST
## 25 Levels: 20972 20974 20975 21253 21373 21428 21791 22216 22231 ... POST
emFit <- Mclust(matrixCZR)
summary(emFit)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust X (univariate normal) model with 1 component:
##
## log-likelihood n df BIC ICL
## -110.6767 25 2 -227.7911 -227.7911
##
## Clustering table:
## 1
## 25
emFit$classification
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Upon running the clustering for both Bahrain and the Czech Republic, each produces only 1 cluster. We noticed a similarity between these two countries: neither prints the Stock IDs in the cluster upon running emFit$classification. This may be because only 1 cluster is present after the clustering process, which in turn could be due to the small number of records for these countries; each most likely consists of only 1 or 2 customers. There is thus a high chance that their customer behaviour is similar and does not vary much, which makes them fall under the same cluster.
Moving on, we decided to work on the top countries such as France, Germany and Ireland. These countries are very close to the UK and can be considered its neighbours, so a large share of the customers definitely come from them. Initially, we did try to cluster the UK itself, but due to the huge amount of data, performing the EM fit was not possible. Hence, we performed k-means clustering on Germany and Ireland, which produced 9 and 2 clusters respectively. Upon further inspection of the data, we realised that Germany has a number of singleton clusters, which might mean they are anomalies. Ireland, on the other hand, only has 3 customers making purchases, hence only 2 clusters. Therefore, we will not perform further extraction of the clusters for the UK, Ireland and Germany.
Instead, we will look into France, first creating a customer-product matrix for that country before moving on to the possible clustering algorithms. Let’s hope we come out with promising clusters and insights!
We start off by extracting the France data and taking a quick look at its structure. We then build a Customer-Product matrix from the FranceData. For this section, we will try an alternative to starting with k-means: we begin with hierarchical clustering before applying the other clustering algorithms, k-means and then the EM fit. The difference with hierarchical clustering is that it does not require us to specify the number of clusters in advance, and it produces a tree-based dendrogram. Having several clustering algorithms lets us compare their results, helping us to further interpret the customer behaviour in each cluster.
FranceData <- retailData[which(retailData$Country == "France"),]
FranceData$CustomerID <- as.factor(FranceData$CustomerID)
FranceData$StockCode <- as.factor(as.character(FranceData$StockCode))
str(FranceData)
## 'data.frame': 8491 obs. of 8 variables:
## $ InvoiceNo : Factor w/ 25900 levels "536365","536366",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ StockCode : Factor w/ 1523 levels "10002","10120",..: 812 811 810 314 353 1 330 93 551 757 ...
## $ Description: Factor w/ 4224 levels ""," 4 PURPLE FLOCK DINNER CANDLES",..: 172 173 169 2485 3633 1841 3892 3377 3098 3605 ...
## $ Quantity : int 24 24 12 12 24 48 24 18 24 24 ...
## $ InvoiceDate: Factor w/ 23260 levels "1/11/11 10:01",..: 202 202 202 202 202 202 202 202 202 202 ...
## $ UnitPrice : num 3.75 3.75 3.75 0.85 0.65 0.85 1.25 2.95 2.95 1.95 ...
## $ CustomerID : Factor w/ 87 levels "12413","12437",..: 27 27 27 27 27 27 27 27 27 27 ...
## $ Country : Factor w/ 38 levels "Australia","Austria",..: 14 14 14 14 14 14 14 14 14 14 ...
matrixFrance <- xtabs(Quantity*UnitPrice ~ CustomerID + StockCode, data = FranceData, addNA = TRUE, sparse = TRUE)
str(matrixFrance)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:5641] 26 38 55 56 80 24 80 83 3 20 ...
## ..@ p : int [1:1524] 0 5 6 8 9 10 12 13 14 21 ...
## ..@ Dim : int [1:2] 87 1523
## ..@ Dimnames:List of 2
## .. ..$ CustomerID: chr [1:87] "12413" "12437" "12441" "12488" ...
## .. ..$ StockCode : chr [1:1523] "10002" "10120" "10125" "10135" ...
## ..@ x : num [1:5641] 40.8 10.2 10.2 20.4 234.6 ...
## ..@ factors : list()
hiercFit <- hclust(dist(matrixFrance, method = "euclidean"),
method="ward.D")
plot(hiercFit)
hiercFit <- hclust(dist(matrixFrance, method = "maximum"),
method="ward.D")
Initially, for the hierarchical clustering, we used “euclidean” as the distance method and the clustering result was fairly good. However, after changing it to “maximum”, the clusters became much clearer: the merge heights are greater than with “euclidean”. The dendrogram in the plot shows the hierarchy of the clusters explicitly, and cutting it gives a total of 3 clusters. The ones on the far left most probably form a cluster that can also be deemed an anomaly. From here, we can set k to 3 and use the rect.hclust() function to separate the clusters distinctly.
plot(hiercFit)
K <- 3
rect.hclust(hiercFit, k = K)
cutree(hiercFit, k = 3) # cluster assignment for each customer
## 12413 12437 12441 12488 12489 12490 12491 12493 12494 12506 12508 12509
## 1 2 1 1 1 2 1 1 1 1 1 1
## 12513 12523 12532 12535 12536 12553 12562 12564 12567 12571 12573 12574
## 1 1 1 1 1 2 2 1 2 1 1 1
## 12577 12579 12583 12589 12593 12598 12599 12602 12604 12615 12616 12617
## 1 1 2 1 1 2 1 1 1 2 1 1
## 12620 12624 12637 12640 12643 12650 12651 12652 12656 12657 12659 12660
## 1 1 2 2 2 1 1 1 2 2 1 1
## 12669 12670 12672 12674 12678 12679 12680 12681 12682 12683 12684 12685
## 1 2 1 2 3 2 1 2 3 2 2 2
## 12686 12689 12690 12691 12694 12695 12700 12707 12714 12716 12718 12719
## 1 1 1 2 1 1 2 1 2 1 1 1
## 12721 12722 12723 12724 12726 12727 12728 12729 12731 12732 12734 12735
## 2 1 1 1 2 2 1 1 3 1 1 1
## 12736 12740 14277
## 1 1 1
As mentioned above, the 3 identified clusters are separated by the distinct red boxes. The far-left cluster has the fewest data points, so it might well be an anomaly, and we can actually ignore it. The second cluster, the middle red box, seems to have the most data points, while the third cluster has a fairly decent amount too. We will now perform k-means and the EM fit to check whether the number of clusters matches this result, before extracting and looking into each of the clusters.
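As an aside, the size of each cluster can be read off directly by tabulating cutree()’s output, rather than scanning the printed assignments; a sketch on synthetic 2-D data:

```r
# Sketch (synthetic data): table() over cutree() gives cluster sizes.
set.seed(1)
toyPoints <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                   matrix(rnorm(20, mean = 5), ncol = 2))
toyFit <- hclust(dist(toyPoints, method = "euclidean"), method = "ward.D")
clusterSizes <- table(cutree(toyFit, k = 2))
clusterSizes  # two well-separated groups of 10 points each
```

Applied to hiercFit, table(cutree(hiercFit, k = 3)) gives the three cluster sizes in one line.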
In this section, we will be finding the optimal number of clusters by using the elbow method.
kMin <- 1
kMax <- 10
withinSS <- double(kMax - kMin + 1)
betweenSS <- double(kMax - kMin + 1)
for (K in kMin:kMax) {
kMeansFit <- kmeans(matrixFrance, centers = K, nstart = 20)
withinSS[K] <- sum(kMeansFit$withinss)
betweenSS[K] <- kMeansFit$betweenss
}
plot(kMin:kMax, withinSS, pch=19, type="b", col="red",
xlab = "Value of K", ylab = "Sum of Squares (Within and Between)")
points(kMin:kMax, betweenSS, pch=19, type="b", col="green")
From the output shown, the “elbow” of the curve marks the best value of k: the within-cluster Sum of Squared Error keeps decreasing towards 0 as k increases, so we pick the point after which it stops dropping sharply rather than the minimum. In this case, looking at the plot, the optimal k seems to be 3. With that, we set k to 3 and perform the optimal k-means clustering for France.
K <- 3
kMeansFit.optimalFrance <- kmeans(matrixFrance, centers = K, nstart = 20)
kMeansFit.optimalFrance$size
## [1] 83 3 1
kMeansFit.optimalFrance$withinss
## [1] 10820922 5253204 0
(kMeansFit.optimalFrance$betweenss / kMeansFit.optimalFrance$totss) * 100
## [1] 56.85748
The hierarchical clustering and the elbow method for k-means both point to 3 clusters. However, the sizes of the 3 clusters turn out to be 83, 3 and 1, with withinSS of 10820922, 5253204 and 0 respectively. This is rather imbalanced, which strengthens our suspicion that the clusters might not be spherical in shape. Hence, to address this, we go on to perform EM.
emFit <- Mclust(matrixFrance)
summary(emFit)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VII (spherical, varying volume) model with 2 components:
##
## log-likelihood n df BIC ICL
## -438736.7 87 3049 -891089.9 -891089.9
##
## Clustering table:
## 1 2
## 60 27
Oh, what an observation! Indeed, EM returns 2 clusters. This supports our assumption from the hierarchical clustering that the cluster with the fewest points was an anomaly, so there is a chance that EM absorbed or ignored it. With these results, we can get going with the extraction of the individual clusters.
Similar to the extraction of individual clusters done on the whole data previously, we will do the same for the individual clusters of France. To guide further clustering and analysis, we look for the top products bought in each cluster. Since the Customer-Product matrix is built on the total amount each customer spent per product, we can perform a market-basket-style analysis and look into the most frequently bought items in each cluster.
cluster1France <- emFit$classification[emFit$classification == "1"]
cluster2France <- emFit$classification[emFit$classification == "2"]
cluster1France.names <- names(cluster1France)
cluster2France.names <- names(cluster2France)
cluster1France.data <- FranceData[which(FranceData$CustomerID %in% cluster1France.names),]
cluster1France.agg <- aggregate(list(Quantity=cluster1France.data$Quantity), by=list(StockCode=cluster1France.data$StockCode), FUN=sum)
cluster1France.topQty <- head(cluster1France.agg[order(cluster1France.agg$Quantity, decreasing = TRUE),], n=10)
cluster1France.topProd <- distinct(merge(cluster1France.topQty, cluster1France.data[,2:3]), StockCode, .keep_all = TRUE)
cluster1France.topProd[order(cluster1France.topProd$Quantity, decreasing = TRUE),]
## StockCode Quantity Description
## 5 21212 432 PACK OF 72 RETROSPOT CAKE CASES
## 9 22959 300 WRAP CHRISTMAS VILLAGE
## 6 21232 259 STRAWBERRY CERAMIC TRINKET BOX
## 3 21086 204 SET/6 RED SPOTTY PAPER CUPS
## 7 21731 199 RED TOADSTOOL LED NIGHT LIGHT
## 2 21080 192 SET/20 RED RETROSPOT PAPER NAPKINS
## 4 21094 192 SET/6 RED SPOTTY PAPER PLATES
## 8 22554 180 PLASTERS IN TIN WOODLAND ANIMALS
## 10 POST 169 POSTAGE
## 1 20979 160 36 PENCILS TUBE RED RETROSPOT
ggplot(cluster1France.topProd, aes(x = reorder(Description, Quantity), y = Quantity, fill = Description)) +
  geom_bar(stat = "identity") +
  labs(x = "Product", y = "Quantity bought", fill = "Product descriptions") +
  geom_text(aes(label = Quantity), position = position_stack(0.5)) +
  ggtitle("Top Products bought in Cluster 1") +
  coord_flip() +
  theme(legend.position = "none")
cluster2France.data <- FranceData[which(FranceData$CustomerID %in% cluster2France.names),]
cluster2France.agg <- aggregate(list(Quantity=cluster2France.data$Quantity), by=list(StockCode=cluster2France.data$StockCode), FUN=sum)
cluster2France.topQty <- head(cluster2France.agg[order(cluster2France.agg$Quantity, decreasing = TRUE),], n=10)
cluster2France.topProd <- distinct(merge(cluster2France.topQty, cluster2France.data[,2:3]), StockCode, .keep_all = TRUE)
cluster2France.topProd[order(cluster2France.topProd$Quantity, decreasing = TRUE),]
## StockCode Quantity Description
## 9 23084 3845 RABBIT NIGHT LIGHT
## 5 22492 2052 MINI PAINT SET VINTAGE
## 10 84879 1192 ASSORTED COLOUR BIRD ORNAMENT
## 4 21731 1091 RED TOADSTOOL LED NIGHT LIGHT
## 2 21086 1068 SET/6 RED SPOTTY PAPER CUPS
## 8 22556 1019 PLASTERS IN TIN CIRCUS PARADE
## 7 22554 964 PLASTERS IN TIN WOODLAND ANIMALS
## 3 21094 924 SET/6 RED SPOTTY PAPER PLATES
## 6 22551 891 PLASTERS IN TIN SPACEBOY
## 1 20725 831 LUNCH BAG RED RETROSPOT
ggplot(cluster2France.topProd, aes(x = reorder(Description, Quantity), y = Quantity, fill = Description)) +
  geom_bar(stat = "identity") +
  labs(x = "Product", y = "Quantity bought", fill = "Product descriptions") +
  geom_text(aes(label = Quantity), position = position_stack(0.5)) +
  ggtitle("Top Products bought in Cluster 2") +
  coord_flip() +
  theme(legend.position = "none")
From the plot shown, the customers in this cluster might well be partyware and packaging suppliers, judging by products such as “Paper Plates”, “Paper Cups” and “Paper Napkins”. The top-selling product, 432 cake cases, strengthens this claim. With “Wrap Christmas Village” as the second most bought product in the cluster, there is a high possibility that these shops are trying to restock before the festive season kicks in.
From the plot shown, the customers in this cluster are most probably business owners selling children’s accessories. The majority of the products in cluster 2 are clearly suited to kids: the “Night Light” for decorating their rooms, and three “Plasters in Tin” variants that differ only in design. “Rabbit Night Light” stands out most, with a quantity bought nearly twice that of the second most popular product, “Mini Paint Set Vintage”; it may well feature a cartoon character popular among kids. Apart from that, one of the products is a lunch bag, which also fits our assumption.
In conclusion, the following are some insights and learning points we discovered throughout this team assignment.
Working with a huge dataset: clustering the whole data with the EM fit was practically impossible, as the fitting process got stuck at 11%. Hierarchical clustering is a definite no, as the dendrogram would be far too large to plot. We then realised that only k-means is a suitable clustering algorithm here. However, the downside of k-means is that it does not take ellipsoidal clusters into consideration, which affects the clustering. Hence, in this assignment, the whole dataset was only clustered with k-means. We also found clara (from the cluster package), which could help with clustering huge datasets, but since the algorithm works by selecting sample points from the whole dataset, it may not be a feasible way to cluster huge data either. Picking sample points improves the speed of the clustering, but accuracy is the main focus of this assignment, so we did not want to risk the information loss.
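For reference, here is a minimal sketch of clara() on synthetic data (not the retail matrix); the cluster package ships with R as a recommended package:

```r
library(cluster)

# Sketch (synthetic data): clara() runs PAM on repeated random
# samples, which is what makes it scale to large datasets.
set.seed(42)
bigData <- rbind(matrix(rnorm(1000, mean = 0), ncol = 2),
                 matrix(rnorm(1000, mean = 6), ncol = 2))
claraFit <- clara(bigData, k = 2, samples = 10)
table(claraFit$clustering)  # sizes of the two recovered clusters
```

The samples argument controls how many random subsets are drawn, which is exactly the sampling trade-off discussed above.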
We were unable to perform k-means on the smaller countries because there was a lack of customer purchasing behaviour. For example, when we set the number of clusters to 2 for Bahrain or Ireland, this error message was prompted: Error: number of cluster centres must lie between 1 and nrow(x).
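That error is easy to reproduce: kmeans() refuses more centres than rows. A minimal sketch with a hypothetical 2-customer matrix:

```r
# Sketch: a country with only 2 customers yields a 2-row matrix,
# so asking kmeans() for 3 centres fails just as described above.
twoCustomers <- matrix(c(10, 20, 30, 40), nrow = 2)
res <- try(kmeans(twoCustomers, centers = 3), silent = TRUE)
inherits(res, "try-error")  # TRUE
```

Wrapping the call in try() lets a script skip such countries instead of aborting.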
The EM fit makes use of the mclust package, and when we clustered country by country we realised that different models were chosen during the EM clustering depending on the data: multivariate mixture, univariate mixture, or single component.