With big data approaches gaining more and more influence in various fields of science and economics, clustering algorithms, which try to answer the question of which data points of a data set belong together or show some kind of similarity, have become crucial. Several algorithms have already been developed, such as hierarchical clustering, means-based clustering, distribution-based clustering, density-based clustering with paths grown through neighboring nodes, and neural-network methods, among others. The author suggests a new clustering methodology based on a field spanned by the data points, with relations determined by a simulated movement along that field.

We have N points P(i) with i = 0, ..., N-1 in an M-dimensional space. The following two questions are important for us:

- Do we have clusters, that is, regions where more points gather than elsewhere?
- Does a given point P(i) belong to a cluster, and if yes, to which one?

While the dimension M can be any number, for the rest of this paper we assume that M equals 2. However, the algorithm developed here can in principle be applied to higher dimensions as well.

For a data set like the one shown in

the figure, the human eye can immediately see that there are, say, four clusters in the lower right part of the coordinate system. It simply notices that there are regions where more points gather than elsewhere. The computer has a harder job, because it does not see the whole picture the way the human brain does; the brain has a way of perceiving cloud densities at once. While we do not dive further into an explanation for that, we still want to describe a new method which goes in a similar direction. We want a density field D(x,y) as well, and an analogy from physics shows a possible way. Consider N charged particles spread throughout some region of a two-dimensional space. Physicists say they span a continuous electrical field *everywhere* in the space. This field can likewise be described by force vectors F(x,y) everywhere in the plane, or by a potential field P(x,y) everywhere (the potential describes the work needed to move a charged particle from infinity to a certain point). Force vectors on a plane are two-dimensional as well, but the potential is just a scalar. This makes it somewhat easier to handle, and we concentrate on it.

The formula for the potential of an electrical field generated by charges at positions \( C(i) = (c_x,c_y)_i \), evaluated at any point r = (x,y), reads:

\[
PE(r) = const. \cdot \sum_{i} \frac{1}{ \sqrt{(c_{x,i} - x)^2 + (c_{y,i} - y)^2} }
\]

For our cluster analysis purpose, we alter this formula and instead take

\[
P(r) = \sum_{i} \frac{1}{\beta + \| c_{i} - (x,y) \|^\alpha } \: , \: \beta > 0
\]

where the Euclidean norm \( \| (x,y) \| = \sqrt{x^2+y^2} \) is just one possibility. Other norms work as well and might serve better depending on circumstances, but for the rest of this article we will use the Euclidean norm. We introduced β here to, first, avoid problems when the square root evaluates to zero and, second, to control smoothing (explained below). The parameter α controls how quickly each point's influence falls off with distance. Thus the formula we will be using reads:
\[
P(r) = \sum_{i} \frac{1}{\beta + \left( \sqrt{ \left(c_{x,i} - x\right)^2 + \left(c_{y,i} - y\right)^2 } \right) ^\alpha } \: , \: \beta > 0
\]
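As a minimal sketch, this potential can be evaluated in Python with NumPy; the defaults for α and β are illustrative values taken from the figures below, not tuned settings:

```python
import numpy as np

def potential(points, r, alpha=1.0, beta=0.02):
    """Potential P(r) at position r, summed over all data points.

    points : (N, 2) array of data point coordinates
    r      : (2,) array, the position where the potential is evaluated
    alpha, beta : illustrative defaults
    """
    d = np.linalg.norm(points - r, axis=1)  # Euclidean distances to all points
    return np.sum(1.0 / (beta + d ** alpha))
```

Note that β = 0.02 bounds each term by 1/β = 50, so a point sitting exactly on a charge contributes a large but finite value.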

For the data set above the field looks like:

Here, more intensely red parts correspond to higher potential values. For this image, all coordinates have been scaled and shifted to lie inside the square [0;0]-[1;1], and β = 0.02.

As expected, high potential values correspond to clusters. This can be *seen*, but still not yet calculated.

The suggested algorithm to actually determine the clusters now goes as follows:

- Give each data point C(i) a unique ID (the index i can be used for that). Rationale: because we will be gathering points into clusters, we need the points' IDs so we can navigate from clusters to their associated points.
- For each data point with a potential above a certain threshold, generate a cluster point CL(i). Each cluster point has a 1:n association to data points (their IDs are saved inside a list); at the beginning, there is only one associated data point. For each cluster point, calculate and save its potential P according to the formula above. Rationale: the threshold is applied when defining the initial cluster points both to avoid having too many clusters at the beginning and to rule out points which are far away from any potential peak and thus should not be considered to belong to a cluster. Since the algorithm works with the potential at data points as well as at cluster points, we calculate and save each cluster point's potential too.
- Loop over all cluster points. Move each cluster point by a displacement D(x,y), calculated by either of:
  - a step of length L in a random direction, or
  - a step of length L along the gradient of P(x,y).

- Loop over all pairs of cluster points. If the cluster points of a pair approach each other and get closer than some constant R, combine them into one. To combine two cluster points CL_{1} and CL_{2} into a new one CL_{NEW}, the association to data points is just the union of the associated data IDs of CL_{1} and CL_{2}. For its position, first take the weights w_{1} and w_{2}, which are simply the numbers of data points associated with each cluster point. The new position then reads: CL_{NEW} = (w_{1} ∙ CL_{1} + w_{2} ∙ CL_{2}) / (w_{1} + w_{2}). Rationale: this is another central point of the algorithm. If cluster points are close to other cluster points, we combine them into one new cluster point, which effectively reduces the number of clusters while the algorithm runs.
- If no cluster point has moved for ATTEMPT attempts and no cluster combination has taken place, the algorithm is finished. Rationale: if we can neither move nor combine further, the cluster points have reached potential peaks and define the analysis outcome.
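The steps above can be sketched in Python (NumPy). The numeric defaults, the dictionary representation of a cluster point, and the rule of accepting a random move only if it increases the potential are illustrative assumptions; the random-direction move variant is used, and `max_iters` is a practical safeguard against very long runs:

```python
import numpy as np

def cluster(points, alpha=1.0, beta=0.02, threshold=5.0, step=0.01,
            radius=0.02, attempts=50, max_iters=2000, rng=None):
    """Field-based clustering sketch using the random-direction move."""
    rng = np.random.default_rng() if rng is None else rng

    def potential(r):
        d = np.linalg.norm(points - r, axis=1)
        return np.sum(1.0 / (beta + d ** alpha))

    # Initial cluster points: one per data point above the threshold, each
    # carrying its position, its potential, and the associated data IDs.
    clusters = [{"pos": p.astype(float).copy(), "pot": potential(p), "ids": {i}}
                for i, p in enumerate(points) if potential(p) > threshold]

    stalled = 0
    for _ in range(max_iters):
        changed = False
        # Move phase: try a random-direction step for every cluster point;
        # accept it only if the potential increases (assumption).
        for c in clusters:
            direction = rng.normal(size=2)
            cand = c["pos"] + step * direction / np.linalg.norm(direction)
            cand_pot = potential(cand)
            if cand_pot > c["pot"]:
                c["pos"], c["pot"] = cand, cand_pot
                changed = True
        # Combine phase: merge pairs that got closer than `radius`,
        # using the weighted-average position from the text.
        merged = True
        while merged:
            merged = False
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    a, b = clusters[i], clusters[j]
                    if np.linalg.norm(a["pos"] - b["pos"]) < radius:
                        w1, w2 = len(a["ids"]), len(b["ids"])
                        a["pos"] = (w1 * a["pos"] + w2 * b["pos"]) / (w1 + w2)
                        a["ids"] |= b["ids"]
                        a["pot"] = potential(a["pos"])
                        del clusters[j]
                        changed = merged = True
                        break
                if merged:
                    break
        # Stop after `attempts` passes with no movement and no combination.
        stalled = 0 if changed else stalled + 1
        if stalled >= attempts:
            break
    return clusters
```

Every data point's ID survives in exactly one cluster point, so the result can be navigated from clusters back to points as described.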

We end up with a list of cluster points, their positions, and the associated data points. To discard clusters which are too small, one can apply a filter ruling out cluster points whose weight (number of associated data points) is below a certain minimum.

Applying the algorithm just depicted to the data set shown above, and looking for the four biggest clusters, we arrive at:

The algorithm is obviously able to identify the four clusters which are visible to the human eye when just looking at the data set. The yellow cluster, however, seems to gather too many points from the top area of the data set. There is a remedy for that: by changing α and/or β we can tune the algorithm to give better results. More precisely, changing β from 0.02 to 0.01 and α from 1.0 to 0.8, we instead get:

Using the very same parameter set for totally different data gives results like:

To find out whether, given a finished cluster analysis, a newly arriving data point C_{NEW} does or does not belong to one of the existing clusters, we have two options: either repeat the whole process from scratch with the new data point added, which is of course time consuming, or use the following simplified algorithm:

- Calculate the potential P(C_{NEW}). If it is below the threshold, discard the point as a cluster candidate and we are done.
- Otherwise, let the point move as described for the main algorithm above, but do not combine it with anything. If it hits one of the clusters selected by the main algorithm (those with enough weight to be considered clusters), it is said to belong to that cluster and we are done. Otherwise the point is considered to not belong to any cluster.
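A sketch of this simplified categorization in Python (NumPy); the parameter values, the uphill-only acceptance of random moves, and the cluster-point representation (a dict holding a position and a set of associated IDs) are illustrative assumptions:

```python
import numpy as np

def classify(new_point, points, clusters, alpha=1.0, beta=0.02,
             threshold=5.0, step=0.01, radius=0.02, max_steps=1000, rng=None):
    """Categorize a new point against existing clusters.

    Returns the index of the cluster the point hits, or None if it is
    below the threshold or never reaches a cluster within `max_steps`.
    """
    rng = np.random.default_rng() if rng is None else rng

    def potential(r):
        d = np.linalg.norm(points - r, axis=1)
        return np.sum(1.0 / (beta + d ** alpha))

    pos = np.asarray(new_point, dtype=float)
    pot = potential(pos)
    if pot <= threshold:
        return None  # below the threshold: not a cluster candidate
    for _ in range(max_steps):
        # A hit: the moving point got closer than `radius` to a cluster.
        for i, c in enumerate(clusters):
            if np.linalg.norm(pos - c["pos"]) < radius:
                return i
        # Random-direction move, accepted only if the potential increases.
        direction = rng.normal(size=2)
        cand = pos + step * direction / np.linalg.norm(direction)
        cand_pot = potential(cand)
        if cand_pot > pot:
            pos, pot = cand, cand_pot
    return None
```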

With many data points, the part of the algorithm which calculates the potential gets slow: it has to scan over all data points every time the potential is needed. As a performance optimization, the points can be put into an R-tree, and the calculation of the potential can be changed to sum only over a certain neighborhood.
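To sketch the idea without a spatial-index library, a simple uniform grid can stand in for the suggested R-tree (the cell size and ring count are assumed parameters); the potential is then summed only over points in nearby cells:

```python
import numpy as np
from collections import defaultdict

def build_grid(points, cell=0.05):
    """Bucket point indices into a uniform grid; a simple stand-in for
    the suggested R-tree (the cell size is an assumed parameter)."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(int(x // cell), int(y // cell))].append(i)
    return grid

def local_potential(points, grid, r, alpha=1.0, beta=0.02,
                    cell=0.05, rings=2):
    """Potential summed only over points in the grid cells within
    `rings` cells of r's cell, instead of over all data points."""
    cx, cy = int(r[0] // cell), int(r[1] // cell)
    idx = [i
           for dx in range(-rings, rings + 1)
           for dy in range(-rings, rings + 1)
           for i in grid.get((cx + dx, cy + dy), [])]
    if not idx:
        return 0.0
    d = np.linalg.norm(points[idx] - np.asarray(r), axis=1)
    return np.sum(1.0 / (beta + d ** alpha))
```

Far-away points contribute at most 1/distance each, so truncating the sum to a neighborhood introduces only a small, bounded error.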

Another performance improvement for the data point categorization above is to change the potential calculation to use the cluster points (including their weights) instead of the data points. This will potentially give more errors, but it speeds up the calculation considerably.
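A sketch of that variant, assuming a cluster point is represented as a dict holding a position and a set of associated data IDs (an illustrative representation, with illustrative parameter defaults):

```python
import numpy as np

def cluster_potential(clusters, r, alpha=1.0, beta=0.02):
    """Approximate potential using only the weighted cluster points.

    Coarser than summing over all data points, but the sum runs over a
    handful of cluster points instead of the whole data set.
    """
    r = np.asarray(r, dtype=float)
    total = 0.0
    for c in clusters:
        d = np.linalg.norm(c["pos"] - r)
        total += len(c["ids"]) / (beta + d ** alpha)  # weight = #points
    return total
```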