Hi,

I have opened a couple of threads asking about a k-means performance problem in Spark. I think I have made a little progress.

Previously I used the simplest form of the call, KMeans.train(rdd, k, maxIterations). It uses the "kmeans||" initialization algorithm, which is supposed to be a faster version of kmeans++ and to give better results in general.

But I observed that if k is very large, the initialization step takes a long time. From the CPU utilization chart, it looks like only one thread is working. Please see https://stackoverflow.com/questions/29326433/cpu-gap-when-doing-k-means-with-spark.

I read the paper, http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, and it points out that the kmeans++ initialization algorithm will suffer if k is large. That is why the paper contributed the kmeans|| algorithm.
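The sequential bottleneck that paper describes can be sketched in plain Python (a toy one-dimensional illustration of the algorithm, not Spark's code): kmeans++ picks centers one at a time, and every pick needs a fresh pass over the data to update the distance weights, so seeding alone costs about k full passes that cannot run in parallel with each other.

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """Toy kmeans++ seeding: k sequential, distance-weighted draws.

    Each draw needs a full pass over `points` to refresh the
    distance-to-nearest-center weights, so the loop runs k - 1
    times and the iterations cannot be parallelized.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    passes = 0  # counts full scans of the data set
    while len(centers) < k:
        # One full pass: squared distance to the nearest chosen center.
        weights = [min((p - c) ** 2 for c in centers) for p in points]
        passes += 1
        # Weighted draw: far-away points are more likely to be picked.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers, passes

points = [float(i) for i in range(1000)]
centers, passes = kmeans_pp_init(points, k=50)
print(len(centers), passes)  # 50 centers after 49 sequential data passes
```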

If I invoke KMeans.train with the random initialization algorithm, I do not observe this problem, even with very large k, like k=5000. This makes me suspect that kmeans|| in Spark is not properly implemented and does not use a parallel implementation.
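For comparison, here is a toy sketch of the kmeans|| idea from that paper (plain Python, one-dimensional, not Spark's implementation): the number of data passes is a fixed round count, independent of k, because each round samples many candidates at once; the candidates are then reduced to k centers in a final step. As far as I understand, Spark performs that final reduction with a local kmeans++ run on the driver, which could explain a single-threaded phase when k is large. The function name, round count, and the l = 2k oversampling choice here are mine, for illustration only:

```python
import random

def kmeans_parallel_init(points, k, rounds=5, oversample=None, seed=0):
    """Toy kmeans|| seeding (after Bahmani et al., VLDB 2012).

    Instead of k sequential draws, run a fixed number of rounds;
    each round keeps ~oversample candidates in a single pass, so
    the number of data passes stays at `rounds` no matter how
    large k is.
    """
    rng = random.Random(seed)
    oversample = oversample or 2 * k  # l = 2k, a common choice
    candidates = [rng.choice(points)]
    passes = 0
    for _ in range(rounds):
        # One pass: cost of each point = squared distance to nearest candidate.
        costs = [min((p - c) ** 2 for c in candidates) for p in points]
        total = sum(costs) or 1.0
        passes += 1
        # Each point is kept independently, which is the parallelizable step.
        for p, cost in zip(points, costs):
            if rng.random() < oversample * cost / total:
                candidates.append(p)
    # Final step: reduce the ~rounds*l candidates down to k centers.
    # (Spark does this reduction with a local kmeans++ run on the
    # driver, as far as I can tell, which is single-threaded.)
    centers = rng.sample(candidates, min(k, len(candidates)))
    return centers, passes

points = [float(i) for i in range(1000)]
centers, passes = kmeans_parallel_init(points, k=50)
print(passes)  # 5 passes over the data, regardless of k
```

In the MLlib API, switching to random initialization should just be a matter of passing initializationMode="random" to KMeans.train (assuming the overload that also takes runs and initializationMode).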

I have also tested my code and data set with Spark 1.3.0, and I still observe this problem. I quickly checked the PR covering the KMeans changes from 1.2.0 to 1.3.0; it seems to be only code improvement and polish, not a change to the algorithm itself.

I originally worked in a Windows 64-bit environment, and I have also tested in a Linux 64-bit environment. I can provide the code and data set if anyone wants to reproduce the problem.

I hope a Spark developer can comment on this problem and help identify whether it is a bug.