
Field Clustering

Abstract

As big data approaches gain more and more influence in various fields of science and economics, clustering algorithms, which try to answer the question of which data points of a data set belong together or show some kind of similarity, have become crucial. Alongside several established approaches, such as hierarchical clustering, means based clustering, distribution based clustering, density based clustering that develops paths through neighboring nodes, and neural network methods, the author suggests a new clustering methodology, based on a field spanned by the data points and on relations determined by a simulated movement along that field.

Introduction

We have N points P(i), i = 0, ..., N-1, in an M dimensional space. The following two questions are important for us:

  1. Which of the data points belong together, i.e. form clusters?
  2. Does a newly arriving data point belong to one of these clusters?

While the dimension M can be any number, for the rest of this paper we assume that M equals 2. However, the algorithm developed here can in principle be applied to higher dimensions as well.

The Field

For a data set like the one shown in the figure below, the human eye can immediately see that there are maybe four clusters in the lower right part of the coordinate system. It simply notices regions where more points gather than elsewhere. The computer has a harder job finding that out. Why is that? It cannot tell because it does not see the whole picture the way the human brain does. The brain has a way of noticing cloud densities at once. While we will not dive further into an explanation for that, we want to describe a new method which goes in a similar direction. We want a density field D(x,y) as well, and a possible way to obtain one is suggested by an analogy from physics. Consider N charged particles spread throughout some region of a two-dimensional space. Physicists say they span a continuous electrical field everywhere in the space. This field can likewise be described by force vectors F(x,y) everywhere in the plane, or by a potential field P(x,y) everywhere (the potential describes the work needed to move a charged particle from infinity to a certain point). Force vectors in a plane are two-dimensional as well, but the potential is just a scalar. This makes it somewhat easier to handle, so we concentrate on the potential.

The formula for the potential of an electrical field given by charges at positions \( C(i) = (c_x,c_y)_i \), evaluated at any point r = (x,y), reads:

\[ PE(r) = const. \cdot \sum_{i} \frac{1}{ \sqrt{(c_{x,i} - x)^2 + (c_{y,i} - y)^2} } \]

For our cluster analysis purpose, we alter this formula and instead take

\[ P(r) = \sum_{i} \frac{1}{\beta + \| c_{i} - (x,y) \|^\alpha } \: , \: \beta > 0 \]
where the Euclidean norm \( \| (x,y) \| = \sqrt{x^2+y^2} \) is just one possibility. Other norms work as well and might serve better depending on the circumstances, but for the rest of this article we will use the Euclidean one. We introduced β here for two reasons: first, to avoid problems when the square root evaluates to zero; second, to control smoothing (explained below). The parameter α controls how strongly each data point's influence falls off with distance. Thus the formula we will be using reads:

\[ P(r) = \sum_{i} \frac{1}{\beta + \left( \sqrt{ \left(c_{x,i} - x\right)^2 + \left(c_{y,i} - y\right)^2 } \right) ^\alpha } \: , \: \beta > 0 \]
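Translated directly into code, the formula might look like the following Python sketch; the function and array names are our choices, and the defaults for alpha and beta match the parameters used for the figures below (points is an (N, 2) NumPy array):

    import numpy as np

    def potential(r, points, alpha=1.0, beta=0.02):
        """P(r): summed influence of all data points at position r = (x, y)."""
        # Euclidean distance from r to every data point, then the altered
        # 1 / (beta + d^alpha) kernel, summed over all points.
        dists = np.sqrt(np.sum((points - np.asarray(r)) ** 2, axis=1))
        return np.sum(1.0 / (beta + dists ** alpha))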

For the data set above the field looks like:

Here, more intensely red parts correspond to higher potential values. For this image, all coordinates have been scaled and shifted to lie inside [0;0]-[1;1], with α = 1.0 and β = 0.02.
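The scaling and shifting mentioned here is an ordinary min-max normalization; a minimal sketch:

    def normalize(points):
        """Scale and shift all coordinates into the unit square [0, 1] x [0, 1]."""
        mins, maxs = points.min(axis=0), points.max(axis=0)
        return (points - mins) / (maxs - mins)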

As expected, high potential values correspond to clusters. This can now be seen, but not yet calculated.

The Algorithm Explained

The suggested algorithm to actually determine the clusters goes as follows (a code sketch of the whole procedure is given after the list):

  1. Give each data point C(i) a unique ID (the index i can be used for that). Rationale: Because we will be gathering points into clusters, we need the points' IDs so we can navigate from clusters to their associated points.
  2. For each data point with a potential above a certain threshold, generate a cluster point CL(i). Each cluster point has a 1:n association to data points (their IDs are saved inside a list); at the beginning, there is only one associated data point. For each cluster point, calculate and save its potential P according to the formula above. Rationale: The threshold must be applied to the definition of the initial cluster points both to avoid having too many clusters at the beginning, and to rule out points which are far away from a potential peak and thus should not be considered to belong to a cluster. Since the algorithm works with the potential at both data points and cluster points, we calculate and save the cluster's potential as well.
  3. Loop over all cluster points. Try to move each cluster point by a displacement D(x,y), calculated in either of two ways:
    1. Move a step of length L in a random direction.
    2. Move a step of length L along the gradient of P(x,y).
    Only if CL(i) + D(x,y) leads to a point with a higher potential do we actually perform the move. Otherwise, do not move, and increment a "not moved" counter inside the cluster point. If moved, update the cluster point's potential. Rationale: Moving the cluster points towards nearby potential peaks is one of the central points of the algorithm. Because small clusters in the vicinity of a potential peak (or of the center of a bigger cluster) all move towards the same point (the peak), we can gather the smaller clusters as they approach each other. Note: the data points do not move, only the cluster points! To determine where to move, we either choose random directions and check whether they bring us closer to a potential peak, or we move along the potential gradient (which takes more calculations). If we cannot move because we are at or very close to a potential peak, we are done with that cluster point's movement (an exception is when cluster points combine, see below). Since random moves fail from time to time even when an uphill direction exists, we need to count non-moves and assume, after a certain number of invalid tries, that no further movement is possible.
  4. Loop over all pairs of cluster points. If the cluster points of a pair approach each other and get closer than some constant R, combine them into one. To combine two cluster points CL1 and CL2 into a new one, CLNEW, take the union of the associated data point IDs of CL1 and CL2 for the new association. For the new position, first take the weights w1 and w2, which are just the numbers of associated data points of each cluster point. The new position then reads: CLNEW = ( w1 ∙ CL1 + w2 ∙ CL2 ) / (w1 + w2). Rationale: This is another central point of the algorithm. If a cluster point gets close to another cluster point, we combine both into one new cluster point. This effectively reduces the number of clusters while the algorithm runs.
  5. If no cluster point got moved for ATTEMPT attempts, and no cluster combination took place, the algorithm is finished. Rationale: If we can neither move nor combine any further, the cluster points have reached potential peaks and define the outcome of the analysis.

We end up with a list of cluster points, their positions, and the associated data points. To discard clusters which are too small, a filter can be applied that rules out cluster points whose weight (number of associated data points) lies below a certain minimum.
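Putting the five steps together, a minimal Python sketch of the whole procedure could look as follows. All concrete values (threshold, step, radius, max_attempts, the latter corresponding to ATTEMPT above) are illustrative choices, and the random-direction variant of step 3 is used; the gradient variant would replace the random trial step:

    import numpy as np

    def field_clustering(points, alpha=1.0, beta=0.02, threshold=5.0,
                         step=0.01, radius=0.02, max_attempts=50, rng=None):
        """Sketch of the algorithm; returns a list of (position, member IDs).

        points is an (N, 2) array, assumed normalized to the unit square;
        threshold, step, radius and max_attempts are illustrative values.
        """
        rng = rng or np.random.default_rng()

        def potential(r):
            dists = np.sqrt(np.sum((points - r) ** 2, axis=1))
            return np.sum(1.0 / (beta + dists ** alpha))

        # Steps 1 and 2: one cluster point per data point above the threshold;
        # each cluster point stores its member IDs, potential and miss counter.
        clusters = []
        for i, p in enumerate(points):
            pot = potential(p)
            if pot >= threshold:
                clusters.append({"pos": p.copy(), "ids": [i],
                                 "pot": pot, "misses": 0})

        while True:
            # Step 3 (random-direction variant): try to move each cluster
            # point uphill; failed attempts are counted in "misses".
            for cl in clusters:
                if cl["misses"] >= max_attempts:
                    continue
                angle = rng.uniform(0.0, 2.0 * np.pi)
                trial = cl["pos"] + step * np.array([np.cos(angle),
                                                     np.sin(angle)])
                trial_pot = potential(trial)
                if trial_pot > cl["pot"]:
                    cl["pos"], cl["pot"], cl["misses"] = trial, trial_pot, 0
                else:
                    cl["misses"] += 1

            # Step 4: combine cluster points closer than `radius`, placing
            # the new cluster point at the weighted average position.
            combined, merged = False, []
            while clusters:
                cl1 = clusters.pop()
                for cl2 in clusters:
                    if np.linalg.norm(cl1["pos"] - cl2["pos"]) < radius:
                        w1, w2 = len(cl1["ids"]), len(cl2["ids"])
                        cl2["pos"] = (w1 * cl1["pos"] + w2 * cl2["pos"]) / (w1 + w2)
                        cl2["ids"] += cl1["ids"]
                        cl2["pot"] = potential(cl2["pos"])
                        cl2["misses"] = 0
                        combined = True
                        break
                else:
                    merged.append(cl1)
            clusters = merged

            # Step 5: done when nothing can move and nothing was combined.
            if not combined and all(c["misses"] >= max_attempts
                                    for c in clusters):
                break

        return [(cl["pos"], cl["ids"]) for cl in clusters]

Filtering out too-small clusters afterwards is then just a matter of keeping entries whose ID list is long enough, e.g. [c for c in result if len(c[1]) >= min_weight].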

The Algorithm Applied

Applying the algorithm just described to the data set shown above, and looking for the four biggest clusters, we arrive at:

The algorithm is obviously able to identify the four clusters which are visible to the human eye by just looking at the data set. The yellow cluster, however, seems to gather too many points from the top area of the data set. There is a remedy for that: by changing α and/or β, we can adjust the algorithm to give better results. More precisely, changing β from 0.02 to 0.01 and α from 1.0 to 0.8, we instead get:

Using the very same parameter set for totally different data gives results like:

Data Point Categorization

In order to find out whether, for a finished cluster analysis at hand, a newly arriving data point CNEW does or does not belong to one of the existing clusters, we have two options: either repeat the whole process from scratch with the new data point added, which is of course time consuming, or use the following simplified algorithm (a code sketch follows after the list):

  1. Calculate the potential P(CNEW). If it is below the threshold, discard it as a cluster candidate and we are done.
  2. Let the point move as described for the main algorithm above, but without any combining. If it hits one of the clusters selected by the main algorithm (those with enough weight to be considered clusters), it is said to belong to that cluster and we are done. Otherwise the point is considered not to belong to any cluster.
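A sketch of this simplified algorithm, under the same illustrative assumptions as before; it reuses potential() from the field section and the (position, member IDs) output format of field_clustering():

    def categorize(new_point, points, final_clusters, alpha=1.0, beta=0.02,
                   threshold=5.0, step=0.01, radius=0.02, max_attempts=50,
                   rng=None):
        """Assign a new data point to one of the finished clusters, or to none.

        final_clusters is the filtered (position, member IDs) list produced
        by field_clustering() above.
        """
        rng = rng or np.random.default_rng()
        pos = np.asarray(new_point, dtype=float)
        pot = potential(pos, points, alpha, beta)
        if pot < threshold:
            return None                 # step 1: not a cluster candidate

        misses = 0
        while misses < max_attempts:
            # step 2: the same uphill movement as before, without combining
            angle = rng.uniform(0.0, 2.0 * np.pi)
            trial = pos + step * np.array([np.cos(angle), np.sin(angle)])
            trial_pot = potential(trial, points, alpha, beta)
            if trial_pot > pot:
                pos, pot, misses = trial, trial_pot, 0
            else:
                misses += 1
            for cpos, ids in final_clusters:
                if np.linalg.norm(pos - cpos) < radius:
                    return cpos, ids    # the point belongs to this cluster
        return None                     # no cluster was hit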

Possible Modifications

With many points, the part of the algorithm which calculates the potential will get slow: it has to scan over all data points each time the potential is needed. As a performance optimization, the points can be pushed into an R-tree, and the potential calculation can be changed to sum only over a certain neighborhood.
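A sketch of this idea using the Python rtree package (the box half-width half_width is an assumed tuning parameter; any spatial index offering range queries would serve equally well):

    import numpy as np
    from rtree import index   # the Python `rtree` package

    def build_index(points):
        """Insert every data point into an R-tree, keyed by its ID."""
        idx = index.Index()
        for i, (x, y) in enumerate(points):
            idx.insert(i, (x, y, x, y))    # a point is a degenerate rectangle
        return idx

    def local_potential(r, points, idx, half_width=0.2, alpha=1.0, beta=0.02):
        """Approximate P(r) by summing only over a box-shaped neighborhood."""
        x, y = r
        near = list(idx.intersection((x - half_width, y - half_width,
                                      x + half_width, y + half_width)))
        dists = np.sqrt(np.sum((points[near] - np.asarray(r)) ** 2, axis=1))
        return np.sum(1.0 / (beta + dists ** alpha))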

Another performance improvement for the data point categorization above is to change the potential calculation to use the cluster points (including their weights) instead of the data points. This will potentially introduce more errors, but it speeds up the calculation considerably.
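A sketch of that variant, with each cluster's weight, i.e. its number of member points, standing in for the individual data points:

    def cluster_potential(r, final_clusters, alpha=1.0, beta=0.02):
        """Approximate P(r) from the weighted cluster points alone."""
        total = 0.0
        for cpos, ids in final_clusters:
            dist = np.linalg.norm(np.asarray(r) - cpos)
            total += len(ids) / (beta + dist ** alpha)  # weight = member count
        return total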