User:Timothee Flutre/Notebook/Postdoc/2011/11/04: Difference between revisions
From OpenWetWare
Revision as of 08:10, 21 November 2012

==K-means has the bad tendency to build clusters of similar size==
* '''R''' is very efficient for sketching an analysis, but it '''usually cannot handle very large datasets''' (matrices with <math>>10^6</math> rows), so I often need to find other tools.
 for k in {1..8}; do vcluster matrix.txt $k -clabelfile=colnames.txt -plotclusters=plot_k${k}.ps -clustercolumns > stdout_k${k}; done

vcluster works well and still finishes on a large, "real" dataset. However, should I trust the results? It is well known that kmeans tends to build clusters of similar size and, as the simulated example below shows, it can give bad results...
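For reference, here is the standard k-means objective, the within-cluster sum of squares. It is not stated in the original entry, and the symbols (<math>C_j</math> for cluster <math>j</math>, <math>\mu_j</math> for its centroid, <math>x_i</math> for a data point) are notation introduced here only for illustration.

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Each point is assigned to its nearest centroid, so the boundary between two clusters lies halfway between their centroids, regardless of how many points each cluster holds. This is one common explanation of why a large group may end up split while small groups get merged.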
 # Simulate 1000 points in 3 dimensions, drawn from four patterns of unequal size
 low.mean <- 0
 high.mean <- 2
 mysd <- 0.1
 mult <- 1000
 mydata.all <- rbind(
   # pattern "000" (70% of the points): low mean in all three columns
   matrix(rnorm(0.7*mult*3, mean=low.mean, sd=mysd), ncol=3, byrow=TRUE),
   # pattern "111" (20%): high mean in all three columns
   matrix(rnorm(0.2*mult*3, mean=high.mean, sd=mysd), ncol=3, byrow=TRUE),
   # pattern "011" (5%): low mean in column 1, high mean in columns 2 and 3
   matrix(c(rnorm(0.05*mult, mean=low.mean, sd=mysd),
            rnorm(0.1*mult, mean=high.mean, sd=mysd)), ncol=3, byrow=FALSE),
   # pattern "100" (5%): high mean in column 1, low mean in columns 2 and 3
   matrix(c(rnorm(0.05*mult, mean=high.mean, sd=mysd),
            rnorm(0.1*mult, mean=low.mean, sd=mysd)), ncol=3, byrow=FALSE))
 # Append the true pattern as a fourth column (this turns the matrix into character)
 mydata.all <- cbind(mydata.all,
                     c(rep("000", 0.7*mult), rep("111", 0.2*mult),
                       rep("011", 0.05*mult), rep("100", 0.05*mult)))
 colnames(mydata.all) <- c("F", "L", "T", "truth")
 head(mydata.all)

Now, let's use kmeans and plot the results:

 # Shuffle the rows and convert the three data columns back to numeric
 mydata <- matrix(as.numeric(mydata.all[sample(nrow(mydata.all)), 1:3]), ncol=3, byrow=FALSE)
 colnames(mydata) <- c("F","L","T")
 head(mydata)
 res.km <- kmeans(mydata, 4)
 # Mean of each column per cluster, then cluster sizes
 aggregate(mydata, by=list(res.km$cluster), FUN=mean)
 table(res.km$cluster)
 library(scatterplot3d)
 scatterplot3d(mydata[,"F"], mydata[,"L"], mydata[,"T"], color=res.km$cluster, main="kmeans")

It's pretty wrong, isn't it?

And as a bonus, here is how to plot the corresponding heatmap (it took me some time to find the proper way to do it):

 # Sort the rows by cluster so that each cluster appears as a contiguous band
 mydata.sort <- cbind(mydata, res.km$cluster)[order(res.km$cluster),]
 heatmap(mydata.sort[,1:3], Rowv=NA, Colv=NA, labRow=NA, scale="none", col=heat.colors(10))
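To judge how wrong the clustering is, it may help to keep the simulated labels aligned with the shuffled rows and cross-tabulate them against the clusters. The sketch below is a self-contained variant of the simulation above, not the original code: the seeds, the value <code>nstart=25</code> and the names <code>sizes</code>, <code>pattern.means</code>, <code>truth</code>, <code>confusion</code>, <code>res.single</code> and <code>res.multi</code> are assumptions made for illustration.

```r
# Assumed seed, used only to make the sketch reproducible
set.seed(1)
low.mean <- 0
high.mean <- 2
mysd <- 0.1

# Number of points per true pattern (same proportions as in the simulation above)
sizes <- c("000"=700, "111"=200, "011"=50, "100"=50)

# One row of means per pattern: digit "0" -> low.mean, digit "1" -> high.mean
pattern.means <- t(sapply(names(sizes), function(p) {
  ifelse(strsplit(p, "")[[1]] == "1", high.mean, low.mean)
}))

# True label of each point, then the data: pattern mean plus Gaussian noise
truth <- rep(names(sizes), times=sizes)
x <- pattern.means[truth, ] + matrix(rnorm(length(truth) * 3, sd=mysd), ncol=3)
rownames(x) <- NULL
colnames(x) <- c("F", "L", "T")

# Shuffle rows and labels together so that the truth stays aligned with the data
idx <- sample(nrow(x))
x <- x[idx, ]
truth <- truth[idx]

# A single random start, as in the entry above
res.single <- kmeans(x, centers=4, nstart=1)
confusion <- table(truth=truth, cluster=res.single$cluster)
print(confusion)

# Several random starts (assumed value: 25); kmeans keeps the best of them
res.multi <- kmeans(x, centers=4, nstart=25)
print(table(truth=truth, cluster=res.multi$cluster))

# Total within-cluster sum of squares of both runs, for comparison
print(c(single=res.single$tot.withinss, multi=res.multi$tot.withinss))
```

In a good solution, each row of the table has almost all of its counts in a single column, and each column in a single row. Comparing <code>tot.withinss</code> between the two runs is one way to check whether a poor result comes from an unlucky random start rather than from the k-means objective itself.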