Cluster analyses are often conducted with the goal of characterizing an
underlying probability density, for which the data-point density serves as
an estimate. Here, we test and benchmark the
common nearest neighbor (CNN) cluster algorithm. This algorithm assigns a
spherical neighborhood of radius R to each data point and estimates the
data-point density between two data points as the number of data points N
in the overlapping region of their neighborhoods (step 1). The main
principle of the CNN cluster algorithm is cluster growing: the algorithm
grows the clusters by
sequentially adding data points and thereby effectively positions the
border of the clusters along an iso-surface of the underlying probability
density. This yields a strict partitioning with outliers, in which the
clusters represent peaks in the underlying probability density, termed core
sets (step 2). The removal of the outliers on the basis of a threshold
criterion is optional (step 3). The benchmark datasets address a series of
typical challenges, including datasets with a very high-dimensional state
space and datasets in which the cluster centroids are aligned along an
underlying structure (Birch sets). The performance of the CNN algorithm is
evaluated with respect to these challenges. The results indicate that the
CNN cluster algorithm can be useful in a wide range of settings. Cluster
algorithms are particularly important for the analysis of molecular
dynamics (MD) simulations. We demonstrate how the CNN cluster results can
be used as a discretization of the molecular state space for the
construction of a core-set model of the MD, which improves the accuracy
compared to conventional full-partitioning models. The software for the CNN
clustering is available on GitHub.
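
The three steps described above can be illustrated with a minimal Python sketch. It is not the software referred to above: the function name `cnn_cluster` and the parameters `radius`, `min_common`, and `min_size` are hypothetical, the brute-force distance matrix is only practical for small datasets, and the requirement that two linked points also lie in each other's neighborhood is an assumption of this sketch.

```python
import numpy as np

def cnn_cluster(points, radius, min_common, min_size=2):
    """Label points by common-nearest-neighbor cluster growing.

    Returns one integer label per point; -1 marks outliers.
    All names and defaults here are illustrative assumptions.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)

    # Step 1: spherical neighborhoods (a point is not its own neighbor).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbors = dist <= radius
    np.fill_diagonal(neighbors, False)
    # shared[i, j] = number of points lying in both neighborhoods.
    counts = neighbors.astype(int)
    shared = counts @ counts.T
    # Assumption: linked points must also be neighbors of each other.
    linked = neighbors & (shared >= min_common)

    # Step 2: grow each cluster by sequentially adding linked points.
    labels = np.full(n, -1)
    current = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(linked[i] & (labels == -1)):
                labels[j] = current
                stack.append(j)
        current += 1

    # Step 3 (optional): clusters smaller than min_size become outliers.
    # Remaining labels are not renumbered, so they may have gaps.
    sizes = np.bincount(labels, minlength=current)
    labels[sizes[labels] < min_size] = -1
    return labels
```

Calling `cnn_cluster(data, radius=1.5, min_common=2)` on an array of shape `(n_points, n_dimensions)` returns the label array. In this sketch, larger `min_common` or smaller `radius` values correspond to a higher density threshold and therefore to tighter clusters and more outliers.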