Elsevier

Applied Soft Computing

Volume 85, December 2019, 105867

Density-based clustering using approximate natural neighbours

https://doi.org/10.1016/j.asoc.2019.105867

Highlights

  • We propose a computationally efficient natural neighbour based metric.

  • The metric is based on discrete Choquet integral and a special fuzzy measure.

  • The time complexity of the metric calculation is the same as using Euclidean distance.

  • The proposed metric improves three density-based clustering algorithms.

Abstract

We propose a computationally efficient natural neighbour based metric for discovering clusters of arbitrary shape based on fuzzy measures. The approximate natural neighbours are found with the help of the Choquet integral with respect to a specially designed two-additive fuzzy measure. A fuzzy betweenness relation is used to construct such a measure and helps determine the natural neighbours of a query point. The natural neighbours of a datum allow the computation of a point density estimate, which in turn defines a density-based metric suitable for clustering. The proposed method overcomes the exponential computational complexity of the Delaunay triangulation traditionally used to identify the natural neighbours. The run-time of density estimation with this metric follows the same quadratic trend with respect to data size as Euclidean-distance-based estimates. Empirical evaluation on 20 synthetic and real-world datasets shows that this metric yields higher clustering accuracy with existing state-of-the-art density-based clustering algorithms, such as DBSCAN, SNN and DP. Furthermore, the proposed metric is easily combined with these algorithms, and the enhanced clustering algorithms inherit positive features such as resistance to noise and imbalanced data.

Introduction

Clustering has been widely used for partitioning data such that similar data instances join the same groups called clusters [1]. It is the most important unsupervised learning technique for automatic data-labelling in various areas, such as information retrieval, image segmentation, and pattern recognition [2]. Based on specific assumptions and models, there are different kinds of clustering algorithms including partitioning clustering, density-based clustering, hierarchical clustering, and graph clustering [3].

Partitioning clustering methods are the simplest and most fundamental clustering methods. They are relatively fast, and easy to understand and implement. They organise the data points into c (the number of clusters) non-overlapping (possibly fuzzy) partitions where each partition represents a cluster and each point only belongs to one cluster [1]. However, traditional partitioning methods usually cannot find clusters with arbitrary shapes [4].

Clustering results usually depend on the measure of similarity (dissimilarity) used in the algorithm. The common dissimilarity measure is expressed as a distance function, which is known as, or referred to as, a metric, even though not all the axioms of a metric are always satisfied. Traditional metrics on the d-dimensional Euclidean space R^d, such as the Euclidean distance, l_p-norms and the Mahalanobis distance, are commonly used in partitioning algorithms [5], [6] to help identify well-separated and convex clusters.

Many methods have been proposed to discover clusters of non-convex shapes, such as C-shells, c-variety and c-mixed prototypes methods [5]. Given different geometrical prototypes, they use special distance functions to calculate distances to objects, such as planes, rectangles and ellipses. However, the information about such prototypes, which must be specified a priori, is mostly unavailable. In addition, there are some graph-based clustering algorithms [7], [8], which are able to detect non-spherical clusters. However, they still need the number of clusters as one of the inputs for clustering.

In contrast, density-based clustering algorithms, such as DBSCAN [9] and DENCLUE [10], can find clusters with arbitrary sizes and shapes while effectively filtering out noise. Density-based clustering defines clusters as high-density regions that are separated by low-density regions [9], [10], [11]. As a result, density-based clustering has attracted considerable research interest recently [12].

Many density estimation methods have been proposed using Voronoi diagrams and Delaunay triangulations, e.g., [13], [14], [15], [16]. Voronoi diagrams and their duals, Delaunay triangulations [17], [18], are classical constructions in mathematics and computational geometry. The main issue with Delaunay tessellation is its complexity: the number of Delaunay cells grows exponentially with the dimension d of the space, a manifestation of the curse of dimensionality. Therefore, Delaunay tessellation methods are not practical for d > 8.

In this paper, we address the issue of the computational complexity of the natural neighbours scheme and present a method of computing the natural neighbours without an expensive Delaunay tessellation. We propose a new similarity measure using fuzzy betweenness relation based on the nearest neighbours graph. In this method, we take a sufficiently large value of k in the kNN density estimate, but ensure that only the neighbours located all around a query point are counted. That is, we combine the kNN method with the natural neighbours approach, but without performing computationally expensive Delaunay tessellation. Instead, we apply a soft computing approach based on fuzzy measures. The point estimate of the density is computed using the notion of the discrete Choquet integral with respect to a specially constructed fuzzy measure. This takes into account spatial correlations between the neighbours of a datum.

Thus, the main objectives and contributions of this paper are: (a) to design a new similarity measure that accounts for redundancy of data located in the same direction from a given point, (b) to employ Choquet integration in order to approximate the natural neighbours of a point, (c) to design a natural neighbours based clustering algorithm without an expensive Delaunay tessellation, and (d) to validate and benchmark the proposed method against state-of-the-art alternative density-based clustering methods using 20 synthetic and real-world datasets.

The rest of the paper is organised as follows. Section 2 provides the preliminaries and Section 3 describes the problem of density estimators and density-based clustering. Section 4 presents the approximation to the natural neighbour graphs and proposes the fuzzy betweenness relation for density estimation. An overview of density-based clustering and its related work are provided in Section 5. Section 6 presents the empirical evaluation results of the proposed fuzzy betweenness relation using DBSCAN, SNN and DP on synthetic and real-world datasets. Conclusions are provided in the last section.

Section snippets

Preliminaries

This section presents the background of the problem of density estimation and density-based clustering.

Let D = {x_j} = {(x_{1j}, …, x_{dj})}, i = 1, …, d, j = 1, …, n, x_j ∈ R^d, denote a d-dimensional dataset of n points, each uniformly sampled from a distribution with probability density ρ: R^d → [0, 1], where i indicates the element of the vector of an instance and j indicates an instance. The goal of density estimation is to recover an approximation to ρ, denoted ρ̂, i.e., find a density estimate approximating the
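As a concrete illustration of this setup, the classical k-nearest-neighbour (kNN) density estimate, which the paper later takes as its basis, can be sketched as follows. This is a minimal generic version, not the authors' code; the dataset and the choice of k in the usage comment are arbitrary examples.

```python
import math

def knn_density(data, x, k):
    """Classical kNN density estimate at query point x:
    rho_hat(x) = k / (n * V_d * r_k(x)^d),
    where r_k(x) is the distance from x to its k-th nearest neighbour
    in `data` and V_d is the volume of the d-dimensional unit ball.
    Illustrative sketch only."""
    n, d = len(data), len(x)
    r_k = sorted(math.dist(x, y) for y in data)[k - 1]  # k-th NN distance
    v_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)    # unit-ball volume
    return k / (n * v_d * r_k ** d)

# A query point inside a tight group scores much higher than one
# near an isolated point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
```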

Density estimators

There are various methods for density estimation [19]. Kernel based estimates by Parzen and Rosenblatt [19], [20] are used in DBSCAN [9] and DENCLUE [10] clustering methods. A point density estimate is constructed by averaging the values of a kernel function of the distances between a fixed point and the data. One problem with kernel density estimates is the bandwidth selection, which includes the smoothing parameter in this process. The values of the bandwidth parameter which are too small
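The Parzen-Rosenblatt kernel estimate mentioned above can be sketched in one dimension with a Gaussian kernel; the bandwidth h is the smoothing parameter whose selection the text discusses. The data and bandwidth values are illustrative.

```python
import math

def kde(data, x, h):
    """Parzen-Rosenblatt kernel density estimate at x (one-dimensional
    sketch): the average over the data of Gaussian kernels of bandwidth
    h centred at each datum. Small h gives a spiky estimate; large h
    oversmooths."""
    norm = h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) / norm
               for xi in data) / len(data)
```

With a moderate bandwidth the estimate clearly separates a dense group from an empty region; with a very large bandwidth that contrast all but disappears, illustrating the oversmoothing problem.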

Density-based metric

We propose a method of density estimation based on an approximation to the natural neighbours graph, aiming to reduce the computational complexity of the Delaunay tessellation and to avoid the connectivity and oversmoothing issues of the kNN method. In our method we take the kNN estimate as the basis, with a sufficiently large value of k. To ensure that only the neighbours located all around a query point are counted we use a re-weighting scheme based on the notion of the discrete Choquet
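The discrete Choquet integral with respect to a two-additive fuzzy measure, which the re-weighting scheme relies on, can be sketched as below. The measure is supplied in Möbius representation (singleton weights plus pairwise interaction weights); the paper derives these weights from a fuzzy betweenness relation, which is not reproduced here, so the weights in the example are purely illustrative. A negative pair weight models redundancy: two neighbours lying in the same direction jointly contribute less than the sum of their individual weights.

```python
import itertools

def choquet(values, singletons, pairs):
    """Discrete Choquet integral of `values` with respect to a
    2-additive fuzzy measure given in Moebius representation:
    mu(A) = sum of singleton weights over A plus interaction weights
    of pairs contained in A. Illustrative sketch only."""

    def mu(subset):
        s = sum(singletons[i] for i in subset)
        s += sum(pairs.get(frozenset(p), 0.0)
                 for p in itertools.combinations(sorted(subset), 2))
        return s

    # Sort inputs ascending; accumulate increments times the measure
    # of the coalition of inputs at or above the current level.
    order = sorted(range(len(values)), key=lambda i: values[i])
    total, prev = 0.0, 0.0
    for rank, i in enumerate(order):
        coalition = frozenset(order[rank:])
        total += (values[i] - prev) * mu(coalition)
        prev = values[i]
    return total
```

With no pair interactions and singleton weights summing to one, the integral reduces to a weighted mean; a negative interaction weight between two equal inputs discounts their joint contribution.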

Density-based clustering

The classic density-based clustering algorithms, e.g., DBSCAN [11] and DENCLUE [37], link neighbouring high-density points together to identify arbitrarily-shaped clusters. DBSCAN estimates the density of a point as the number of data points from the dataset that lie in its ϵ-neighbourhood, defined as ρ̂_ϵ(x) = |N_ϵ(x)| / (n · Vol(N_ϵ(x))), where N_ϵ(x) = {y ∈ D : dis(x, y) ≤ ϵ} is the ϵ-neighbourhood of x, and DBSCAN uses the Euclidean distance as the dissimilarity function.

Then the density-based cluster is defined by
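The ϵ-neighbourhood density estimate used by DBSCAN can be sketched as follows; the dataset and radius in the test are illustrative, and the Euclidean distance is used as in plain DBSCAN.

```python
import math

def eps_density(data, x, eps):
    """DBSCAN-style density at x: the number of points of `data` inside
    the eps-neighbourhood of x, normalised by n times the volume of the
    d-dimensional ball of radius eps (Euclidean distance).
    Illustrative sketch only."""
    n, d = len(data), len(x)
    count = sum(1 for y in data if math.dist(x, y) <= eps)
    vol = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * eps ** d
    return count / (n * vol)
```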

Empirical evaluation

Here, we present experiments designed to evaluate the effectiveness of the proposed natural neighbour based metric. We compare it with the Euclidean distance using three density-based clustering algorithms (DBSCAN [11], SNN [27] and DP [12]) in terms of the best F-measure obtained through a systematic parameter search: given a clustering result, for each cluster Ci we calculate the precision Pi and the recall Ri based on the confusion matrix, i.e., we use the Hungarian algorithm to search the optimal
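The best-match F-measure evaluation described above can be sketched as follows. For small numbers of clusters a brute-force search over permutations stands in for the Hungarian algorithm (which solves the same assignment problem in polynomial time); the label sequences in the test are illustrative, not the paper's data.

```python
from itertools import permutations

def best_f_measure(labels_true, labels_pred):
    """Average per-class F-measure under the best one-to-one matching
    of predicted clusters to ground-truth classes. Brute-force search
    over cluster permutations; suitable only for a handful of clusters."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))

    def f1(c, k):  # F-measure when cluster k is matched to class c
        tp = sum(1 for t, p in zip(labels_true, labels_pred)
                 if t == c and p == k)
        if tp == 0:
            return 0.0
        prec = tp / sum(1 for p in labels_pred if p == k)
        rec = tp / sum(1 for t in labels_true if t == c)
        return 2 * prec * rec / (prec + rec)

    return max(sum(f1(c, k) for c, k in zip(classes, perm)) / len(classes)
               for perm in permutations(clusters))
```

A perfect clustering scores 1.0 even when the cluster labels are a permutation of the class labels, since the matching is optimised over all assignments.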

Conclusion

In this paper, we proposed a new point estimate of the density of the data, computed by a discrete Choquet integral with respect to a specially constructed fuzzy measure. The Choquet integral redistributes the contribution of the nearest neighbours according to their mutual spatial positions and thus accounts for input redundancies. We present a construction of the fuzzy measure based on the fuzzy betweenness relationship between the data. It allows accounting for spatial correlations between the

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2019.105867.

References (46)

  • C.C. Aggarwal et al., Data Clustering: Algorithms and Applications (2013)

  • F. Höppner et al., Fuzzy Cluster Analysis: Methods for Classification, Data Analysis, and Image Recognition (1999)

  • H. Samet, Foundations of Multidimensional and Metric Data Structures (2006)

  • M. Ester et al., Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications, Data Min. Knowl. Discov. (1998)

  • A. Hinneburg, D. Keim, An efficient approach to clustering in large multimedia databases with noise, in: Knowledge...

  • M. Ester et al., A density-based algorithm for discovering clusters in large spatial databases with noise

  • A. Rodriguez et al., Clustering by fast search and find of density peaks, Science (2014)

  • F.I. Pelupessy et al., Density estimators in particle hydrodynamics: DTFE versus regular SPH, Astron. Astrophys. (2003)

  • E.G. Miller, A new class of entropy estimators for multi-dimensional densities

  • S.N.d. Melo et al., Voronoi diagrams and spatial analysis of crime, Prof. Geogr. (2017)

  • F. Aurenhammer, Voronoi diagrams - a survey of a fundamental data structure, ACM Comput. Surv. (1991)

  • A. Okabe et al., Spatial Tessellations: Concepts and Applications of Voronoi Diagrams (2000)

  • D. Scott, Multivariate Density Estimation (2015)