k-means algorithm

The R approach to k-means
If I had to suggest a starter dish for a data-mining students' dinner, the k-means algorithm would, without a doubt, be an excellent option.

Its recipe is extremely simple, and its good performance at producing clusters makes this technique an outstanding tool to teach "how data mining works".

Easy to guess possible applications, easy to calculate and easy to deploy: these are the three musts an algorithm should satisfy in order to help data mining analysts in their job.

It is also worth mentioning that, although k-means is unbeatable for education purposes, Support Vector Machine (SVM) techniques are much more powerful at separating complex cluster shapes; at the same time they are much heavier computationally, and their clustering results are harder to interpret.

To understand one basic limitation of the k-means algorithm, let's look at the following image:
  1. The first slide shows two clusters. Something easily achievable by k-means.
  2. The second slide also shows two clusters, but this time one inside the other. Something not achievable by k-means, since it cannot generate concentric clusters.
  3. The third slide shows how SVM can do this job with a workaround strategy. The point is to use a kernel function ( a cone generator in our example ) to lift the inside cluster, so that a simple plane can easily separate it. In fact, what this kernel function does is turn a non-linear problem into a linear one.

This strategy raises an interesting question:

Why not use kernel functions in k-means, to empower its possibilities?
... Yes ..., this is just the kind of question that a data mining analyst should face, and should answer as well ...

But let's focus on k-means in this post. We will develop an example to clarify how a clustering algorithm like k-means works.

.ılılıll|    1. How k-means works    |llılılı.
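The recipe behind this header is short: choose k initial centers, assign each point to its nearest center, move every center to the mean of its assigned points, and repeat. A minimal, illustrative R sketch of that loop (in practice you would use R's built-in kmeans(), which also handles empty clusters and multiple restarts):

```r
# Illustrative k-means loop (Lloyd's algorithm). 'x' is a numeric matrix,
# 'k' the number of clusters. Not production code: empty clusters are
# simply left where they are, and we run a fixed number of iterations.
my_kmeans <- function(x, k, iters = 10) {
  # 1. start from k randomly chosen points as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(iters)) {
    # 2. assign each point to its nearest center (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k, drop = FALSE]
    cluster <- apply(d, 1, which.min)
    # 3. move each center to the mean of its assigned points
    for (j in seq_len(k)) {
      pts <- x[cluster == j, , drop = FALSE]
      if (nrow(pts) > 0) centers[j, ] <- colMeans(pts)
    }
  }
  list(cluster = cluster, centers = centers)
}
```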

.ılılıll|    2. Building our data set    |llılılı.
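The post works with a dataset called OutL2. Since its construction is not shown, here is a hypothetical stand-in: a toy 2-D dataset with two well-separated Gaussian blobs, which is enough to exercise the algorithm:

```r
# Toy stand-in for the post's OutL2 dataset: 100 points in two 2-D blobs.
set.seed(1)
OutL2 <- rbind(matrix(rnorm(100),           ncol = 2),  # 50 points near (0, 0)
               matrix(rnorm(100, mean = 5), ncol = 2))  # 50 points near (5, 5)
colnames(OutL2) <- c("x", "y")
plot(OutL2, pch = 19)   # two clearly separated clouds
```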

.ılılıll|    3. Running k-means algorithm    |llılılı.
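Running k-means in R is a single call to kmeans(). Assuming OutL2 is a numeric matrix of the points to cluster (a toy stand-in is rebuilt below so the snippet runs on its own):

```r
# Rebuild a toy stand-in for OutL2 so this snippet is self-contained.
set.seed(1)
OutL2 <- rbind(matrix(rnorm(100),           ncol = 2),
               matrix(rnorm(100, mean = 5), ncol = 2))

km <- kmeans(OutL2, centers = 2)   # ask for k = 2 clusters
km$size      # how many points fell in each cluster
km$centers   # the fitted cluster centers
plot(OutL2, col = km$cluster, pch = 19)
points(km$centers, col = 1:2, pch = 8, cex = 3)
```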

.ılılıll|    4. k-means returning ratios    |llılılı.

In R, the kmeans() function returns a number of measures such as $betweenss, $withinss, $tot.withinss and $totss. Well, the question we face in this section is to work out the meaning of each one.
To start off, we could say that "ss" stands for sum of squares, since that is the metric k-means uses to know how compact a cluster is and how different several clusters are among themselves.

The basic ones
$betweenss: is the between-cluster sum of squares. It measures how far the cluster centers are from the global centroid: the sum, over the clusters, of the squared distance from each center to the global centroid, weighted by the cluster size. One expects this ratio to be as high as possible, since we would like to have heterogeneous, well-separated clusters.

∑k  nk · | Ck − C̄ |²        ( Ck = center of cluster k,  nk = its size,  C̄ = global centroid )

$withinss: is the within-cluster sum of squares, so it comes back as a vector with one value per cluster. For cluster k it adds up the squared distances from each of its points to its center. One expects these values to be as low as possible, since we would like to have homogeneity within the clusters.

∑m | Xm − Ck |²        ( summing over the points Xm assigned to cluster k )

Some equalities may help to understand:
$tot.withinss = sum ( $withinss )
$totss = $tot.withinss + $betweenss      ( $totss is the total sum of squares of the dataset: ∑ | Xi − C̄ |² )

Some situations may help to understand:
If we just have 1 cluster, then ... $betweenss = 0  and  $tot.withinss = $withinss  and  $totss = $tot.withinss
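Both the equalities and the single-cluster case are easy to verify on any kmeans() fit, for instance on a small random dataset:

```r
# Check the sum-of-squares identities on a toy fit.
set.seed(2)
x  <- matrix(rnorm(60), ncol = 2)   # 30 random 2-D points
km <- kmeans(x, centers = 3)

all.equal(km$tot.withinss, sum(km$withinss))          # TRUE
all.equal(km$totss, km$tot.withinss + km$betweenss)   # TRUE

# And the single-cluster case:
km1 <- kmeans(x, centers = 1)
km1$betweenss                                         # ~ 0
all.equal(km1$totss, km1$tot.withinss)                # TRUE
```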

.ılılıll|    5. How many k clusters to work with?    |llılılı.

The main weak point of k-means is that the number of clusters to be identified is an input parameter. This is quite annoying, since many times the dataset does not give any clue about its structure. This raises the question of how many clusters k best fit a given dataset.

Combining ratios $withinss and $betweenss with an appropriate algorithm may help to choose the right number of k clusters.

A simple procedure to get the right number of clusters k:

The point is to run the algorithm for k = 1, 2, 3, 4, ... candidate clusters and, for each k, save the total $withinss.
The k value that produces the largest drop in $withinss is the best choice, because we expect the within-cluster sum of squares to be as low as possible.

The following code may help:

# OutL2 is the dataset we are working with

# k = 1: total within-cluster sum of squares
> wss <- sum(kmeans(OutL2, centers=1)$withinss)

# k = 2 to 15 (nstart=20 reruns each k from 20 random starts, which stabilises the curve)
> for (i in 2:15) wss[i] <- sum(kmeans(OutL2, centers=i, nstart=20)$withinss)

# Plot the 15 withinss values, one for each k
> plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

Choose the k value at which the withinss curve stops decreasing significantly: the "elbow" of the plot.

 How Nature uses Maths 

The sphere protects. Having no edges, it is the most difficult shape to catch or bite; additionally, it is a democratic shape, since all inside points are equally protected from the outside.

Bees know that hexagonal bricks can tile the whole plane.
     __    __
  __/  \__/  \
 /  \__/  \__/
 \__/  \__/  \
    \__/  \__/
  __/  \__/  \
 /  \__/  \__/
 \__/  \__/  \
    \__/  \__/

The spiral packs: it is indeed an excellent strategy for growing.

The helix fastens. The more twists it has, the better it holds.

 /=/   \=\
 |=|   |=|
 \=\   /=/
  `.\ /,'
  ,'/ \`.
 /=/   \=\
 |=|   |=|
 \=\   /=/
  `.\ /,'
  ,'/ \`.
 /=/   \=\
 |=|   |=|
 \=\   /=/
  `.\ /,'

Fractals reach everywhere: the best way to cover every nook and cranny of a surface.
                  \      /
          __/\__/     \__/\__
          \                      /
          /__                __\
                \           /
/\__        __/           \__ 
    /        \                 / 
    \__/\__/                 \__

Waves move. They are a harmonic way to shift and to communicate.
   .-.-.   .-.-.   
  / / \ \ / / \ \ 
 `-'   `-`-'   `-`

The angle punctures. It locks and penetrates like a predator's teeth.

Nature uses mathematics to solve problems in an astonishingly simple and effective way. Observing this is always a good strategy when facing new issues.