Density-based clustering using approximate natural neighbours
Introduction
Clustering has been widely used for partitioning data such that similar data instances fall into the same groups, called clusters [1]. It is one of the most important unsupervised learning techniques for automatic data-labelling in various areas, such as information retrieval, image segmentation, and pattern recognition [2]. Based on specific assumptions and models, there are different kinds of clustering algorithms, including partitioning clustering, density-based clustering, hierarchical clustering, and graph clustering [3].
Partitioning clustering methods are the simplest and most fundamental clustering methods. They are relatively fast, and easy to understand and implement. They organise the data points into $k$ (the number of clusters) non-overlapping (possibly fuzzy) partitions, where each partition represents a cluster and each point belongs to only one cluster [1]. However, traditional partitioning methods usually cannot find clusters with arbitrary shapes [4].
Clustering results usually depend on the measure of similarity (dissimilarity) used in the algorithm. The dissimilarity measure is commonly expressed as a distance function, often referred to as a metric even though not all the axioms of a metric are always satisfied. Traditional metrics on $d$-dimensional Euclidean space $\mathbb{R}^d$, such as the Euclidean distance, $p$-norms and Mahalanobis distance, are commonly used in partitioning algorithms [5], [6] to help identify well-separated and convex clusters.
Many methods have been proposed to discover clusters of non-convex shapes, such as C-shells, c-variety and c-mixed prototypes methods [5]. Given different geometrical prototypes, they use special distance functions to calculate distances to objects, such as planes, rectangles and ellipses. However, the information about such prototypes, which must be specified a priori, is mostly unavailable. In addition, there are some graph-based clustering algorithms [7], [8], which are able to detect non-spherical clusters. However, they still need the number of clusters as one of the inputs for clustering.
In contrast, density-based clustering algorithms, such as DBSCAN [9] and DENCLUE [10], can find clusters with arbitrary sizes and shapes while effectively filtering out noise. Density-based clustering defines clusters as high-density regions that are separated by low-density regions [9], [10], [11]. As a result, density-based clustering has attracted considerable research interest recently [12].
Many density estimation methods have been proposed using Voronoi diagrams and Delaunay triangulations, e.g., [13], [14], [15], [16]. Voronoi diagrams and their duals, Delaunay triangulations [17], [18], are classical constructions in mathematics and computational geometry. The main issue with Delaunay tessellation is its complexity: the number of Delaunay cells grows exponentially with the dimension of the space, a manifestation of the curse of dimensionality. Therefore, Delaunay tessellation methods are not practical in high dimensions.
In this paper, we address the issue of the computational complexity of the natural neighbours scheme and present a method of computing the natural neighbours without an expensive Delaunay tessellation. We propose a new similarity measure using a fuzzy betweenness relation based on the nearest neighbours graph. In this method, we take a sufficiently large value of $k$ in the kNN density estimate, but ensure that only the neighbours located all around a query point are counted. That is, we combine the kNN method with the natural neighbours approach, but without performing a computationally expensive Delaunay tessellation. Instead, we apply a soft computing approach based on fuzzy measures. The point estimate of the density is computed using the notion of the discrete Choquet integral with respect to a specially constructed fuzzy measure. This takes into account spatial correlations between the neighbours of a datum.
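To make the notion of the discrete Choquet integral concrete, the following is a minimal sketch of its standard definition: the inputs are sorted in increasing order and each increment is weighted by the measure of the set of inputs that are at least that large. The function names `choquet_integral` and `additive_measure`, and the representation of a fuzzy measure as a dictionary over index subsets, are assumptions made for illustration. The specially constructed fuzzy measure of this paper is not reproduced here.

```python
from itertools import combinations


def choquet_integral(values, measure):
    """Discrete Choquet integral of non-negative `values` w.r.t. `measure`.

    `measure` maps every frozenset of input indices to a number in [0, 1].
    It is assumed to be monotone, with measure 0 for the empty set and
    1 for the full index set (hypothetical representation).
    """
    # Indices ordered so that the corresponding values are non-decreasing.
    order = sorted(range(len(values)), key=lambda i: values[i])
    total = 0.0
    previous = 0.0
    for pos, idx in enumerate(order):
        # Set of inputs whose value is at least values[idx].
        upper_set = frozenset(order[pos:])
        total += (values[idx] - previous) * measure[upper_set]
        previous = values[idx]
    return total


def additive_measure(weights):
    """Additive measure built from per-input weights that sum to one.

    With an additive measure the Choquet integral reduces to a weighted
    arithmetic mean, i.e. no interaction between inputs is modelled.
    """
    n = len(weights)
    return {
        frozenset(subset): sum(weights[i] for i in subset)
        for size in range(n + 1)
        for subset in combinations(range(n), size)
    }
```

With an additive measure the integral is an ordinary weighted mean. A non-additive measure is what allows redundant inputs (here, neighbours lying in the same direction) to be discounted. For example, a measure equal to 1 on every non-empty subset turns the integral into the maximum of the inputs.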
Thus, the main objectives and contributions of this paper are: (a) to design a new similarity measure that accounts for redundancy of data located in the same direction from a given point, (b) to employ Choquet integration in order to approximate the natural neighbours of a point, (c) to design a natural-neighbours-based clustering algorithm without an expensive Delaunay tessellation, and (d) to validate and benchmark the proposed method against state-of-the-art alternative density-based clustering methods using 20 synthetic and real-world datasets.
The rest of the paper is organised as follows. Section 2 provides the preliminaries and Section 3 describes the problem of density estimators and density-based clustering. Section 4 presents the approximation to the natural neighbour graphs and proposes the fuzzy betweenness relation for density estimation. An overview of density-based clustering and its related work are provided in Section 5. Section 6 presents the empirical evaluation results of the proposed fuzzy betweenness relation using DBSCAN, SNN and DP on synthetic and real-world datasets. Conclusions are provided in the last section.
Section snippets
Preliminaries
This section presents the background of the problem of density estimation and density-based clustering.
Let $D = \{x_1, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, denote a $d$-dimensional dataset of $n$ points, each sampled from a distribution with probability density $f$, where $x_i^j$ indicates the $j$-th element of the vector of an instance and $x_i$ indicates an instance. The goal of density estimation is to recover an approximation to $f$, denoted $\hat{f}$, i.e., find a density estimate approximating the
Density estimators
There are various methods for density estimation [19]. Kernel-based estimates by Parzen and Rosenblatt [19], [20] are used in the DBSCAN [9] and DENCLUE [10] clustering methods. A point density estimate is constructed by averaging the values of a kernel function of the distances between a fixed point and the data. One problem with kernel density estimates is bandwidth selection, i.e., the choice of the smoothing parameter. The values of the bandwidth parameter which are too small
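As an illustration of the kernel-based point estimate described in this section, the sketch below averages an isotropic Gaussian kernel of the distances between a query point and the data. The Gaussian kernel, the function name `gaussian_kde` and the single scalar `bandwidth` are assumptions chosen for simplicity, not necessarily the choices made in the cited methods.

```python
import math


def gaussian_kde(query, data, bandwidth):
    """Parzen-Rosenblatt density estimate at `query` (Gaussian kernel).

    `query` is a tuple of coordinates, `data` a non-empty list of such
    tuples, and `bandwidth` the smoothing parameter (assumed positive).
    """
    dim = len(query)
    # Normalising constant of a d-dimensional isotropic Gaussian.
    norm = (2.0 * math.pi) ** (dim / 2.0) * bandwidth ** dim
    total = 0.0
    for point in data:
        squared_dist = sum((q - p) ** 2 for q, p in zip(query, point))
        total += math.exp(-0.5 * squared_dist / bandwidth ** 2)
    return total / (len(data) * norm)
```

Evaluating the estimate at a data point with decreasing `bandwidth` gives increasingly sharp peaks, which is one way to see why the choice of the smoothing parameter matters.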
Density-based metric
We propose a method of density estimation based on an approximation to the natural neighbours graph, aiming to reduce the computational complexity of the Delaunay tessellation and to avoid the connectivity and oversmoothing issues of the kNN method. In our method we take the kNN estimate as the basis, with a sufficiently large value of $k$. To ensure that only the neighbours located all around a query point are counted we use a re-weighting scheme based on the notion of the discrete Choquet
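For reference, the plain kNN density estimate that serves as the basis can be sketched as follows: the estimate at a query point is $k$ divided by the number of data points times the volume of the ball that reaches the $k$-th nearest neighbour. This is the textbook estimator only. The re-weighting of neighbours proposed in this paper is not included, and the function name `knn_density` is an assumption.

```python
import math


def knn_density(query, data, k):
    """Textbook kNN density estimate at `query`.

    `data` is a list of coordinate tuples that does not contain `query`
    itself; `k` must satisfy 1 <= k <= len(data).
    """
    dim = len(query)
    distances = sorted(math.dist(query, point) for point in data)
    radius = distances[k - 1]  # distance to the k-th nearest neighbour
    if radius == 0.0:
        raise ValueError("k-th neighbour coincides with the query point")
    # Volume of the unit ball in `dim` dimensions.
    unit_ball = math.pi ** (dim / 2.0) / math.gamma(dim / 2.0 + 1.0)
    return k / (len(data) * unit_ball * radius ** dim)
```

Note that this estimator counts the $k$ nearest points regardless of their direction from the query point, which is the behaviour the re-weighting scheme is meant to correct.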
Density-based clustering
The classic density-based clustering algorithms, e.g., DBSCAN [11] and DENCLUE [37], link neighbouring high-density points together to identify arbitrarily-shaped clusters. DBSCAN estimates the density of a point $x$ as the number of data points from the dataset $D$ that lie in its $\epsilon$-neighbourhood, defined as $\rho_\epsilon(x) = |N_\epsilon(x)|$, where $N_\epsilon(x) = \{y \in D : dis(x, y) \le \epsilon\}$ is the $\epsilon$-neighbourhood of $x$, and DBSCAN uses the Euclidean distance as the dissimilarity function $dis$.
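The neighbourhood count used by DBSCAN as a density estimate can be sketched in a few lines. The function names are illustrative, and the Euclidean distance is hard-coded here. Replacing `math.dist` with another dissimilarity function is where an alternative measure, such as the one proposed in this paper, would enter.

```python
import math


def epsilon_neighbourhood(x, data, eps):
    """Points of `data` within Euclidean distance `eps` of `x`."""
    return [point for point in data if math.dist(x, point) <= eps]


def dbscan_density(x, data, eps):
    """DBSCAN-style density of `x`: size of its eps-neighbourhood.

    If `x` is itself a member of `data`, it is counted as well.
    """
    return len(epsilon_neighbourhood(x, data, eps))


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
    print(dbscan_density((0.0, 0.0), points, 1.5))
```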
Then the density-based cluster is defined by
Empirical evaluation
Here, we present experiments designed to evaluate the effectiveness of the proposed natural neighbour based metric. We compare it with the Euclidean distance using three density-based clustering algorithms (DBSCAN [11], SNN [27] and DP [12]) in terms of the best F-measure found through a systematic parameter search: given a clustering result, for each cluster we calculate the precision and the recall based on the confusion matrix, i.e., we use the Hungarian algorithm to search the optimal
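A matching-based F-measure of this kind can be sketched as below. Two assumptions are made because the description above is cut short: the score is the average over ground-truth classes of the F-measure of the cluster matched to each class, with unmatched classes scoring zero, and the optimal one-to-one matching is found by brute force over permutations rather than by the Hungarian algorithm. Brute force gives the same optimum but is only practical for a handful of clusters. For larger problems a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` could be used instead. All function names are illustrative.

```python
from itertools import permutations


def f_score(overlap, cluster_size, class_size):
    """F-measure of one (class, cluster) pair from confusion-matrix counts."""
    if overlap == 0:
        return 0.0
    precision = overlap / cluster_size
    recall = overlap / class_size
    return 2.0 * precision * recall / (precision + recall)


def best_match_f_measure(true_labels, cluster_labels):
    """Average F-measure under the best one-to-one class/cluster matching."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    # Confusion matrix stored as a dictionary of pair counts.
    counts = {}
    for cls, clu in zip(true_labels, cluster_labels):
        counts[(cls, clu)] = counts.get((cls, clu), 0) + 1
    class_size = {c: true_labels.count(c) for c in classes}
    cluster_size = {k: cluster_labels.count(k) for k in clusters}
    scores = {
        (c, k): f_score(counts.get((c, k), 0), cluster_size[k], class_size[c])
        for c in classes
        for k in clusters
    }
    # Pad with dummy clusters (None) when there are fewer clusters than classes.
    padded = clusters + [None] * max(0, len(classes) - len(clusters))
    best = 0.0
    for assignment in permutations(padded, len(classes)):
        total = sum(scores.get((c, k), 0.0) for c, k in zip(classes, assignment))
        best = max(best, total)
    return best / len(classes)
```

How noise points reported by the clustering algorithm are treated is not specified in the text above. In this sketch they would simply form one more cluster label.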
Conclusion
In this paper, we proposed a new point estimate of the density of the data computed by a discrete Choquet integral with respect to a specially constructed fuzzy measure. The Choquet integral redistributes the contribution of the nearest neighbours according to their mutual spatial positions and thus accounts for input redundancies. We present a construction of the fuzzy measure based on the fuzzy betweenness relationship between the data. It allows accounting for spatial correlations between the
Declaration of Competing Interest
No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2019.105867.
References (46)
- et al. Spectral clustering and semi-supervised learning using evolving similarity graphs. Appl. Soft Comput. (2015)
- et al. Graph clustering using k-neighbourhood attribute structural similarity. Appl. Soft Comput. (2016)
- et al. Delaunay triangulation-based pit density estimation for the classification of polyps in high-magnification chromo-colonoscopy. Comput. Methods Programs Biomed. (2012)
- k-order additive discrete fuzzy measures and their representation. Fuzzy Sets and Systems (1997)
- et al. Nonadditivity index and capacity identification method in the context of multicriteria decision making. Inform. Sci. (2018)
- et al. Nonmodularity index for capacity identifying with multiple criteria preference information. Inform. Sci. (2019)
- et al. Density-ratio based clustering for discovering clusters with varying densities. Pattern Recognit. (2016)
- et al. Data Mining: Concepts and Techniques (2011)
- et al. Finding Groups in Data: An Introduction to Cluster Analysis (1990)
- et al. Data Mining and Analysis: Fundamental Concepts and Algorithms (2014)
- Data Clustering: Algorithms and Applications
- Fuzzy Cluster Analysis: Methods for Classification, Data Analysis, and Image Recognition
- Foundations of Multidimensional and Metric Data Structures
- Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Min. Knowl. Discov.
- A density-based algorithm for discovering clusters in large spatial databases with noise
- Clustering by fast search and find of density peaks. Science
- Density estimators in particle hydrodynamics: DTFE versus regular SPH. Astron. Astrophys.
- A new class of entropy estimators for multi-dimensional densities
- Voronoi diagrams and spatial analysis of crime. Prof. Geogr.
- Voronoi diagrams - a survey of a fundamental data structure. ACM Comput. Surv.
- Spatial Tessellations: Concepts and Applications of Voronoi Diagrams
- Multivariate Density Estimation
Cited by (6)
Adaptive core fusion-based density peak clustering for complex data with arbitrary shapes and densities
2020, Pattern Recognition
Citation excerpt: To solve with this problem, Du et al. [38] utilized K-Nearest Neighbors (KNN) for improving the density estimation to create more accurate clustering results. Maia et al. [39] used the Choquet integral-based approximate natural neighbors to realize improved fuzzy density estimation. Ruhui et al. [40] adopted semi-supervised constraints in CSDP to improve the clustering performance influenced by the density estimation results.
Density peaks clustering based on k-nearest neighbors and self-recommendation
2021, International Journal of Machine Learning and Cybernetics
Density estimates on the unit simplex and calculation of the mode of a sample
2020, International Journal of Intelligent Systems
Constraint Rules and Matching Micro-clusters Based Affinity Propagation Clustering Algorithm
2020, Studies in Informatics and Control