# User:Timothee Flutre/Notebook/Postdoc/2011/11/04

### From OpenWetWare

(Difference between revisions)

(Autocreate 2011/11/04 Entry for User:Timothee_Flutre/Notebook/Postdoc) |
(→Entry title: in R, kmeans, scatterplot3d and heatmap + try cluto) |
||

Line 7: | Line 7: | ||

<!-- ##### DO NOT edit above this line unless you know what you are doing. ##### --> | <!-- ##### DO NOT edit above this line unless you know what you are doing. ##### --> | ||

==Entry title== | ==Entry title== | ||

- | |||

+ | * '''R''' is very efficient for sketching an analysis, but it '''usually cannot handle very large datasets''' (a matrix with <math>>10^6</math> rows), so I often need to find other tools. | ||

+ | |||

+ | * As a first try to replace [http://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html kmeans] in R, I launched the [http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview CLUTO] clustering program on a simulated dataset (file ''matrix.txt''): | ||

+ | |||

+ | for k in {1..8}; do vcluster matrix.txt $k -clabelfile=colnames.txt -plotclusters=plot_k${k}.ps -clustercolumns > stdout_k${k}; done | ||

+ | |||

+ | It works well, and it still finishes on a large, "real" dataset. However, should I trust the results? It is well known that kmeans tends to build clusters of similar size and, as shown by the figure below, it can give bad results. | ||

+ | |||

+ | * To show this, let's '''simulate a dataset''', realistic enough for what I am analyzing. There are 1000 items, each with 3 dimensions (x, y and z, named F, L and T in the code). The data belong to 4 clusters, each labelled by whether its mean is low (0) or high (1) in each dimension: the first of size 700 around 000, the second of size 200 around 111, the third of size 50 around 011, and the fourth of size 50 around 100. In the code, the low mean is 0 and the high mean is 2. Here is the R code: | ||

+ | |||

+ | low.mean <- 0 | ||

+ | high.mean <- 2 | ||

+ | mysd <- 0.1 | ||

+ | mult <- 1000 | ||

+ | mydata.all <- rbind(matrix(rnorm(0.7*mult*3, mean=low.mean, sd=mysd), ncol=3, byrow=TRUE), | ||

+ | matrix(rnorm(0.2*mult*3, mean=high.mean, sd=mysd), ncol=3, byrow=TRUE), | ||

+ | matrix(c(rnorm(0.05*mult, mean=low.mean, sd=mysd), rnorm(0.1*mult, mean=high.mean, sd=mysd)), ncol=3, byrow=FALSE), | ||

+ | matrix(c(rnorm(0.05*mult, mean=high.mean, sd=mysd), rnorm(0.1*mult, mean=low.mean, sd=mysd)), ncol=3, byrow=FALSE)) | ||

+ | mydata.all <- cbind(mydata.all, c(rep("000", 0.7*mult), rep("111", 0.2*mult), rep("011", 0.05*mult), rep("100", 0.05*mult))) | ||

+ | colnames(mydata.all) <- c("F", "L", "T", "truth") | ||

+ | head(mydata.all) | ||

+ | |||

+ | Now, let's '''use kmeans and plot the results''': | ||

+ | |||

+ | mydata <- matrix(as.numeric(mydata.all[sample(nrow(mydata.all)), 1:3]), ncol=3, byrow=FALSE) | ||

+ | colnames(mydata) <- c("F","L","T") | ||

+ | head(mydata) | ||

+ | res.km <- kmeans(mydata, 4) | ||

+ | aggregate(mydata, by=list(res.km$cluster), FUN=mean) | ||

+ | table(res.km$cluster) | ||

+ | library(scatterplot3d) | ||

+ | scatterplot3d(mydata[,"F"], mydata[,"L"], mydata[,"T"], color=res.km$cluster, main="kmeans") | ||

+ | |||

+ | [[Image:Kmeans unequal-clusters bad-results.png|400px]] | ||

+ | |||

+ | It's pretty wrong, isn't it? | ||

+ | |||

+ | And as a bonus, here is how to plot the corresponding [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/heatmap.html heatmap] (it took me some time to find the proper way to do it): | ||

+ | |||

+ | mydata.sort <- cbind(mydata, res.km$cluster)[order(res.km$cluster),] | ||

+ | heatmap(mydata.sort[,1:3], Rowv=NA, Colv=NA, labRow=NA, scale="none", col=heat.colors(10)) | ||

<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### --> | <!-- ##### DO NOT edit below this line unless you know what you are doing. ##### --> |

## Revision as of 21:53, 4 November 2011


## Entry title

- **R** is very efficient for sketching an analysis, but it **usually cannot handle very large datasets** (a matrix with > 10^6 rows), so I often need to find other tools.
- As a first try to replace kmeans in R, I launched the CLUTO clustering program on a simulated dataset (file *matrix.txt*):

      for k in {1..8}; do vcluster matrix.txt $k -clabelfile=colnames.txt -plotclusters=plot_k${k}.ps -clustercolumns > stdout_k${k}; done

  It works well, and it still finishes on a large, "real" dataset. However, should I trust the results? It is well known that kmeans tends to build clusters of similar size and, as shown by the figure below, it can give bad results.
- To show this, let's **simulate a dataset**, realistic enough for what I am analyzing. There are 1000 items, each with 3 dimensions (x, y and z, named F, L and T in the code). The data belong to 4 clusters, each labelled by whether its mean is low (0) or high (1) in each dimension: the first of size 700 around 000, the second of size 200 around 111, the third of size 50 around 011, and the fourth of size 50 around 100. In the code, the low mean is 0 and the high mean is 2. Here is the R code:

      low.mean <- 0
      high.mean <- 2
      mysd <- 0.1
      mult <- 1000
      # stack the four clusters; the 011 and 100 blocks are filled column by column
      mydata.all <- rbind(matrix(rnorm(0.7*mult*3, mean=low.mean, sd=mysd), ncol=3, byrow=TRUE),
                          matrix(rnorm(0.2*mult*3, mean=high.mean, sd=mysd), ncol=3, byrow=TRUE),
                          matrix(c(rnorm(0.05*mult, mean=low.mean, sd=mysd), rnorm(0.1*mult, mean=high.mean, sd=mysd)), ncol=3, byrow=FALSE),
                          matrix(c(rnorm(0.05*mult, mean=high.mean, sd=mysd), rnorm(0.1*mult, mean=low.mean, sd=mysd)), ncol=3, byrow=FALSE))
      # appending the labels turns the whole matrix into character
      mydata.all <- cbind(mydata.all, c(rep("000", 0.7*mult), rep("111", 0.2*mult), rep("011", 0.05*mult), rep("100", 0.05*mult)))
      colnames(mydata.all) <- c("F", "L", "T", "truth")
      head(mydata.all)

  Now, let's **use kmeans and plot the results**:

      # shuffle the rows and convert the three data columns back to numeric
      mydata <- matrix(as.numeric(mydata.all[sample(nrow(mydata.all)), 1:3]), ncol=3, byrow=FALSE)
      colnames(mydata) <- c("F","L","T")
      head(mydata)
      res.km <- kmeans(mydata, 4)
      # mean of each dimension and number of items per cluster
      aggregate(mydata, by=list(res.km$cluster), FUN=mean)
      table(res.km$cluster)
      library(scatterplot3d)
      scatterplot3d(mydata[,"F"], mydata[,"L"], mydata[,"T"], color=res.km$cluster, main="kmeans")

  [Figure: Kmeans unequal-clusters bad-results.png]

  It's pretty wrong, isn't it?

  And as a bonus, here is how to plot the corresponding heatmap (it took me some time to find the proper way to do it):

      # sort the rows by cluster so that each cluster forms a contiguous block
      mydata.sort <- cbind(mydata, res.km$cluster)[order(res.km$cluster),]
      heatmap(mydata.sort[,1:3], Rowv=NA, Colv=NA, labRow=NA, scale="none", col=heat.colors(10))
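The plot alone does not say how the kmeans clusters map onto the simulated ones, because the shuffling step discards the `truth` column. The sketch below is one way to check this, not part of the original analysis: it re-simulates the same design while keeping the labels aligned, then cross-tabulates the truth against three kmeans runs. The names `centers.true`, `sim`, `res.ref`, `res.1` and `res.25`, the seed, and the choice of 25 restarts are all illustrative assumptions. The run started at the true centers is only a reference point, and the idea that random initialisation contributes to the bad result is a hypothesis to test, not something the entry establishes.

```r
# Minimal sketch (assumed names and seed): compare kmeans partitions with the
# simulated truth, keeping the labels aligned with the shuffled rows.
set.seed(1)  # assumption: arbitrary seed, only for reproducibility

# Same design as in the entry: 4 clusters of sizes 700/200/50/50, sd = 0.1
centers.true <- matrix(c(0,0,0,  2,2,2,  0,2,2,  2,0,0), ncol=3, byrow=TRUE,
                       dimnames=list(c("000", "111", "011", "100"),
                                     c("F", "L", "T")))
sizes <- c(700, 200, 50, 50)
truth <- rep(rownames(centers.true), times=sizes)
sim <- centers.true[truth, ] + matrix(rnorm(length(truth) * 3, sd=0.1), ncol=3)
rownames(sim) <- NULL

# Shuffle data and labels with the same permutation
perm <- sample(nrow(sim))
sim <- sim[perm, ]
truth <- truth[perm]

# Reference run started at the true centers, then runs with 1 and 25 random starts
res.ref <- kmeans(sim, centers=centers.true)
res.1 <- kmeans(sim, centers=4, nstart=1)
res.25 <- kmeans(sim, centers=4, nstart=25)

# A partition matches the truth when each row has a single non-zero cell
print(table(truth, res.ref$cluster))
print(table(truth, res.1$cluster))
print(table(truth, res.25$cluster))

# A total within-cluster sum of squares well above the reference value
# suggests that the run stopped in a poor local optimum
print(c(ref=res.ref$tot.withinss, nstart1=res.1$tot.withinss,
        nstart25=res.25$tot.withinss))
```

If the run with several restarts matches the reference while the single-start run does not, initialisation is a likely culprit; if both differ from the reference, the problem lies elsewhere.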