Eigencluster: Meeting with Grant 10/13

Grant thinks that the strength of the EigenCluster algorithm lies in the mathematical proof that the resulting clustering is within a proven bound of the "ideal" clustering. Imagine the search for the optimal clustering as being analogous to the search for a global minimum on a corrugated (hyper-)surface. The popular algorithms used now (mainly k-means) are very likely to get stuck in a local minimum, i.e. they converge to a non-ideal clustering of the data. In practice this is observed as a dependence of the clustering result on the initial conditions. Consequently, if EigenCluster gives you a bad clustering result, it means that there are no "good" clusters; with other algorithms a bad result might still just be down to your starting conditions. So EigenCluster transforms clustering from some kind of data-set-dependent "black art" into more of a "black-box" application (this is my simplification of his explanations).
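
As a quick illustration of that dependence on starting conditions (my own sketch, not something Grant showed; it assumes numpy and scikit-learn are available), running k-means several times with different random initializations can end up in different local minima with different final costs:

 # k-means run with different random starts can converge to different local
 # minima, i.e. to different clusterings with different final costs
 import numpy as np
 from sklearn.cluster import KMeans
 
 rng = np.random.default_rng(0)
 # three well-separated blobs of 50 points each
 data = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                   for c in ([0, 0], [5, 0], [0, 5])])
 
 for seed in range(5):
     km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(data)
     print(f"seed={seed}  final cost (inertia) = {km.inertia_:.2f}")
 # if the printed costs differ, k-means got stuck in different local minima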

The second strength of the algorithm lies in a novel criterion for how to optimally cut a chunk of data into two clusters, the so-called conductance. Using this criterion, the whole data set is, in a first step, recursively split in two until one arrives at the individual elements; the splits form a tree structure.
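
To make that divide step concrete, here is a rough sketch (mine, not the authors' code): the conductance of a two-way cut of a similarity graph, and a recursive split that scans the sweep cuts along the second eigenvector of the degree-normalized similarity matrix and keeps the one with the lowest conductance. The names W, conductance and divide are my own; W is assumed to be a symmetric, nonnegative similarity matrix, and this is a simple spectral heuristic in the spirit of the divide step, not the exact published algorithm.

 # rough sketch of the divide step (not the paper's implementation)
 import numpy as np
 
 def conductance(W, S):
     """Conductance of the cut (S, complement of S) in the similarity graph W."""
     S = np.asarray(S, dtype=bool)
     cross = W[S][:, ~S].sum()          # weight crossing the cut
     vol_S = W[S].sum()                 # total weight incident to S
     vol_rest = W[~S].sum()
     return cross / min(vol_S, vol_rest)
 
 def divide(W, indices, tree):
     """Recursively split the data points in `indices` into a binary tree of index sets."""
     if len(indices) <= 1:
         return
     sub = W[np.ix_(indices, indices)]
     d = sub.sum(axis=1)
     # second eigenvector of the degree-normalized similarity matrix
     M = sub / d[:, None]
     vals, vecs = np.linalg.eig(M)
     v = np.real(vecs[:, np.argsort(-np.real(vals))[1]])
     # try all sweep cuts along v and keep the one with the lowest conductance
     order = np.argsort(v)
     best, best_cut = np.inf, None
     for i in range(1, len(indices)):
         S = np.zeros(len(indices), dtype=bool)
         S[order[:i]] = True
         phi = conductance(sub, S)
         if phi < best:
             best, best_cut = phi, S
     left, right = indices[best_cut], indices[~best_cut]
     tree.append((list(left), list(right)))
     divide(W, left, tree)
     divide(W, right, tree)
 
 # usage: tree = []; divide(W, np.arange(len(W)), tree) collects the sequence of splits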

In the second step, going back up from the individual elements of the tree, clusters are formed which optimize a given objective function, such as the maximum diameter of the clusters, the mean square displacement from the cluster center, etc. Starting out from the whole data set (without the "preconditioning" into the tree), this problem is NP-hard, which means it is at least as hard as any problem that a nondeterministic Turing machine can solve in polynomial time (= a tough-ass problem ;-)). However, restricted to clusterings that respect the tree structure, the problem can be solved exactly.
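
Here is a hedged sketch of how that second step can work (my reconstruction of the idea, not the published implementation): for an additive objective, here the sum of squared distances to the cluster mean, dynamic programming over the binary tree finds the cheapest clustering into at most k clusters among those that respect the tree. best_clustering and cluster_cost are hypothetical names; the tree is encoded as nested pairs with leaf indices at the bottom.

 # sketch of the merge step: exact dynamic programming over the divide-step tree
 import numpy as np
 
 def cluster_cost(points):
     """Sum of squared distances to the mean, for one cluster."""
     pts = np.asarray(points, dtype=float)
     return ((pts - pts.mean(axis=0)) ** 2).sum()
 
 def best_clustering(node, data, k):
     """node is either a leaf index or a pair (left_subtree, right_subtree).
     Returns a dict: number of clusters -> (cost, list of clusters)."""
     if isinstance(node, int):                  # leaf: a single data point
         return {1: (0.0, [[node]])}
     left, right = node
     L = best_clustering(left, data, k)
     R = best_clustering(right, data, k)
     # option 1: keep all leaves below this node as one single cluster
     leaves = [i for cl in L[1][1] + R[1][1] for i in cl]
     table = {1: (cluster_cost(data[leaves]), [leaves])}
     # option 2: split the budget of clusters between the two subtrees
     for i in L:
         for j in R:
             if i + j <= k:
                 cost = L[i][0] + R[j][0]
                 if i + j not in table or cost < table[i + j][0]:
                     table[i + j] = (cost, L[i][1] + R[j][1])
     return table
 
 # usage with a hypothetical tree over 4 points X: best_clustering(((0, 1), (2, 3)), X, k=2)[2]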

Another nice feature is that the EigenCluster algorithm has a bounded runtime, which allows the user to predict when the code will be finished. Other algorithms solve the problem iteratively, i.e. they start from a guess and try to find better and better solutions until consecutive solutions no longer differ markedly from each other. This so-called convergence may be reached quickly or very slowly. The average runtime of the algorithms is comparable, however.
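
For what it's worth, the iterative schemes mentioned above all share roughly this shape (an illustration of the convergence criterion only, not any particular algorithm):

 # keep improving a guess until consecutive solutions barely differ; how many
 # rounds that takes is not known in advance
 def iterate_until_converged(step, guess, tol=1e-6, max_iter=10_000):
     for n in range(1, max_iter + 1):
         new = step(guess)
         if abs(new - guess) < tol:      # consecutive solutions barely differ
             return new, n               # converged after n iterations
         guess = new
     return guess, max_iter              # gave up: convergence was too slow
 
 # e.g. Heron's method for sqrt(2): iterate_until_converged(lambda x: (x + 2 / x) / 2, 1.0)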

Concerning the "lifetime" of the algorithm (i.e. whether it is likely to be replaced soon by a better one): the field is academically very active and there are lots of algorithms out there, but they either lack a proof of being close to the optimum or they have not been evaluated yet (and have thus not shown that they perform at least as well as the traditional ones). EigenCluster can therefore be considered the best-proven method so far.

In terms of numerical comparison: they use datasets where the optimal clustering is known and then calculate the overlap between the optimal clustering and the result of different clustering algorithms (compare e.g. Table 2 in the EigenCluster publication of Vempala). So if EigenCluster has 93% overlap (as in the first line) vs. 89% overlap for another algorithm, the comparison is between the misclassified fractions, 7% vs. 11%: the other algorithm misclassifies 11/7 ≈ 1.6 times as much, i.e. EigenCluster performs 1.6 times better. With that definition one cannot simply claim that an algorithm is "10x better". Especially for medical applications we could argue that EigenCluster misses about 40% less data than the other one. But Grant stressed that the important thing is the proof; this should be enough even if the clustering in these examples is similarly good.
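
For the record, the arithmetic behind the 1.6x figure (my own check of the numbers quoted above):

 # overlaps quoted above: 93% for EigenCluster vs. 89% for the other algorithm
 eigen_overlap, other_overlap = 0.93, 0.89
 eigen_error, other_error = 1 - eigen_overlap, 1 - other_overlap       # 7% vs. 11% misclassified
 print(other_error / eigen_error)      # ~1.6: the other algorithm misclassifies ~1.6x as much
 print(1 - eigen_error / other_error)  # ~0.36: EigenCluster misses roughly 40% less data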