Grouping points by shared subspaces for effective subspace clustering
Introduction
Clustering detects groups of similar points while separating dissimilar points into different groups [1]. It is a fundamental data analysis technique for data mining and knowledge discovery, and has been applied in fields such as engineering, the medical and biological sciences, the social sciences and economics; many clustering algorithms have been studied in the research community over the past few decades.
Traditional “full-space clustering” algorithms become ineffective when clusters exist in different subspaces, as many irrelevant attributes may cause the similarity measurements used by the algorithms to become unreliable [1]. To tackle this problem, the similarity of points should be assessed within subspaces (over a subset of relevant attributes).
Subspace clustering aims to discover clusters which exist in different subspaces [2]. It works by combining subspace search and clustering procedures. The number of possible axis-parallel subspaces is exponential in the dimensionality and the number of possible arbitrarily oriented (non axis-parallel) subspaces is infinite. To deal with the huge number of possible subspaces, subspace clustering algorithms usually rely on heuristics for subspace search, e.g., top-down search [3] and bottom-up search [4]. All these existing subspace clustering algorithms perform clustering by measuring the similarity between points in the given feature space, and the subspace selection and clustering processes are tightly coupled.
In this paper, we focus on clustering in axis-parallel subspaces. We contribute a new subspace clustering framework, named CSSub (Clustering by Shared Subspaces), which has the following three unique features:
- 1.
CSSub groups points by their shared subspaces. It performs clustering by measuring the similarity between points based on the number of subspaces they share. This enables CSSub to detect non-redundant/non-overlapping subspace clusters directly by running a clustering method only once. In contrast, many existing subspace clustering algorithms need to run a chosen clustering method for each subspace; and this must be repeated for an exponentially large number of subspaces. As a consequence, they produce many redundant subspace clusters.
- 2.
CSSub decouples the candidate subspace selection process from the clustering process. By explicitly splitting them into independent processes, it enables candidate subspaces to be selected independent of the clustering process—eliminating the need to repeat the clustering step a large number of times. In contrast, many existing subspace clustering algorithms which have tightly-coupled processes must rely on an anti-monotonicity property to prune the search space.
- 3.
The decoupling approach has an added advantage that allows different types of cluster definitions to be employed easily. The time cost of the entire process is dominated by the subspace scoring function. We show that changing the cluster definition so that subspaces can be evaluated with a linear-time scoring function reduces the runtime of CSSub from quadratic to linear. A similar change is difficult, if not impossible, for existing algorithms because of their tightly coupled processes.
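The core idea behind feature 1 can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: assume a binary membership matrix `M` (points × candidate subspaces) where `M[i, s] = 1` if point `i` falls in a dense region of subspace `s`; the pairwise similarity is then simply the number of shared subspaces, and clustering operates on this similarity rather than on the raw attribute values.

```python
import numpy as np

def shared_subspace_similarity(M):
    """Return the n x n matrix of shared-subspace counts.

    Entry (i, j) is the number of candidate subspaces in which
    both point i and point j appear.
    """
    M = np.asarray(M, dtype=int)
    return M @ M.T  # inner product of binary rows = shared-subspace count

# Toy example: 4 points, 4 candidate subspaces.
M = np.array([
    [1, 1, 0, 0],  # point 0 occurs in subspaces 0 and 1
    [1, 1, 0, 0],  # point 1 shares both subspaces with point 0
    [0, 0, 1, 1],  # points 2 and 3 form a second group
    [0, 0, 1, 1],
])
S = shared_subspace_similarity(M)
# S[0, 1] == 2 (two shared subspaces), S[0, 2] == 0 (none shared)
```

Any similarity-based clustering method (e.g. k-medoids on the induced dissimilarity) can then be applied once to `S`, which is why no per-subspace clustering pass is needed.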
We present an extensive empirical evaluation on synthetic and real-world datasets to demonstrate the effectiveness of CSSub. The experiments show that CSSub discovers subspace clusters with arbitrary shapes in noisy data, and that it significantly outperforms existing state-of-the-art subspace clustering algorithms. In addition, CSSub has only one parameter, k (the number of clusters), which needs to be manually tuned; the other parameters can be set automatically based on a heuristic method.
The rest of this paper is organised as follows. We provide an overview of subspace clustering algorithms and related work in Section 2. Section 3 discusses key weaknesses of existing bottom-up subspace clustering algorithms. Section 4 details the subspace clustering framework based on the shared subspaces and presents a density-based approach for subspace scoring. Section 5 presents the algorithms for CSSub. In Section 6, we empirically evaluate the performance of the proposed algorithms on different datasets. Discussion and the conclusions are provided in the last two sections.
Related work
The key task of clustering in subspaces is to develop appropriate subspace search heuristics [2]. There are two basic techniques for subspace search, namely top-down search [3] and bottom-up search [4]. Different subspace clustering algorithms have been proposed based on these two search directions [5], [6], [7], [8], [9]. Search strategies can be further subdivided into systematic and non-systematic search techniques as discussed below.
Key weaknesses of existing bottom-up subspace clustering algorithms
The majority of density-based subspace clustering algorithms rely on a bottom-up search strategy and use anti-monotonicity of the density as a means to reduce the search space. This approach has two weaknesses:
- 1.
The approach tightly couples the candidate subspace selection process with the clustering process, i.e., it must run a clustering method for each subspace; and there are an exponentially large number of subspaces. As a result, many redundant subspace clusters are produced during the search.
Clustering by shared subspaces
In this section, we present an effective subspace clustering framework in order to overcome the weaknesses of existing bottom-up subspace clustering algorithms.
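Section 4 presents a density-based approach for subspace scoring. As a minimal sketch of how one such score could work (the paper's exact scoring function is not reproduced here; the contrast-based score below is an assumption for illustration): project the data onto a candidate subspace, estimate each point's local density from its distance to the k-th nearest neighbour, and score the subspace by the contrast of those densities. Subspaces that contain clusters show a large density contrast between cluster cores and the background, while purely noisy subspaces do not.

```python
import numpy as np

def density_score(X, subspace, k=5):
    """Hypothetical density-contrast score for an axis-parallel subspace.

    Projects X onto `subspace`, computes each point's k-th nearest
    neighbour distance, converts it to an inverse-distance density,
    and returns the relative spread (std / mean) of the densities.
    """
    P = np.asarray(X, dtype=float)[:, list(subspace)]   # project the data
    # Pairwise Euclidean distances in the projected space.
    D = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1))
    knn = np.sort(D, axis=1)[:, k]        # k-th NN distance (col 0 is self)
    dens = 1.0 / (knn + 1e-12)            # simple inverse-distance density
    return dens.std() / (dens.mean() + 1e-12)
```

With this score, a subspace containing a tight cluster embedded in noise scores higher than a subspace whose attributes are uniformly distributed, which is the behaviour a candidate-subspace filter needs.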
Algorithms in CSSub
In this section, we provide the algorithms for the proposed CSSub framework shown in Fig. 3. The stages shown in Algorithm 1 correspond to the stages in Fig. 3.
The first stage in Algorithm 1 generates the set of candidate subspaces by enumerating subspaces up to the largest dimensionality for which the number of candidate subspaces remains less than the data size; this cap on the number of candidates is what keeps the time complexity quadratic. Different ways to generate the initial set of subspaces may be used instead.
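The enumeration stage can be sketched as follows. This is a hedged reading of the description above, in which the cap is applied to the cumulative number of candidates (the function name and the exact stopping rule are assumptions for illustration):

```python
from itertools import combinations

def enumerate_candidates(d, n):
    """Enumerate axis-parallel candidate subspaces over d attributes.

    Adds whole dimensionality levels (all 1-D subspaces, then all 2-D
    subspaces, ...) while the cumulative number of candidates stays
    below the data size n, which bounds later per-subspace work.
    """
    candidates = []
    for dim in range(1, d + 1):
        level = [frozenset(c) for c in combinations(range(d), dim)]
        if len(candidates) + len(level) >= n:
            break  # adding this level would reach the data size
        candidates.extend(level)
    return candidates

subs = enumerate_candidates(d=10, n=100)
# The 1-D level (10) and 2-D level (45) fit under 100;
# the 3-D level (120 subspaces) would exceed the cap, so len(subs) == 55.
```

Each surviving candidate is then passed to the subspace scoring function, so capping the enumeration directly caps the total scoring cost.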
Empirical evaluation
This section presents experiments designed to evaluate the performance of CSSub. We selected seven state-of-the-art non-overlapping subspace clustering algorithms for comparison: a top-down subspace clustering algorithm, PROCLUS [3]; a bottom-up subspace clustering algorithm, P3C [13]; and five soft subspace clustering algorithms, LAC [25], EWKM [26], FSC [27], ESSC [28] and FG-k-means [29]. In addition, we include a full-space clustering algorithm, k-medoids [17], in order to judge the effectiveness of subspace clustering relative to full-space clustering.
Discussion
CSSub is effective on low-to-medium dimensional datasets having subspace clusters with different irrelevant attributes. The experiment using the three artificial datasets, 2T, S1500 and D50, indicates that CSSub is particularly good at identifying subspace clusters of non-globular shape. The robustness analysis shows that it is tolerant to both noise points and noise attributes.
The CSSub framework provides a flexible platform for developing new subspace clustering algorithms which group points by their shared subspaces.
Conclusions
To the best of our knowledge, grouping points by their shared subspaces is a clustering approach that has not been attempted before. The approach exhibits three unique features:
- 1.
The clustering is performed by measuring the similarity between points based on the number of subspaces they share, without examining the attribute values in the given dataset. This enables the subspace clustering to be conducted only once to produce non-redundant/non-overlapping subspace clusters. In contrast, existing algorithms must run a clustering method in each candidate subspace, producing many redundant subspace clusters.
Acknowledgments
Most of this work was done when Ye Zhu was a Ph.D. student at Monash University, Australia. The anonymous reviewers have provided many helpful suggestions to improve this paper.
Ye Zhu received his Ph.D. from Monash University, Australia. He has been awarded a Mollie Holman Medal for the faculty's best doctoral thesis of the year. He is now a research fellow in complex system data analytics at Deakin University, Australia. His research interests are in the areas of clustering and anomaly detection.
References (42)
- et al., A survey on soft subspace clustering, Inf. Sci. (2016)
- et al., Enhanced soft subspace clustering integrating within-cluster and between-cluster information, Pattern Recognit. (2010)
- et al., Distance metric learning for soft subspace clustering in composite kernel space, Pattern Recognit. (2016)
- et al., Discovering outlying aspects in large datasets, Data Mining Knowl. Disc. (2016)
- et al., DUSC: dimensionality unbiased subspace clustering, in: Proceedings of the Seventh IEEE International Conference on Data Mining, ICDM 2007 (2007)
- et al., Data Mining: Concepts and Techniques (2011)
- et al., Data Clustering: Algorithms and Applications (2013)
- et al., Fast algorithms for projected clustering, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD ’99 (1999)
- et al., Automatic subspace clustering of high dimensional data for data mining applications, in: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD ’98 (1998)
- et al., A survey on enhanced subspace clustering, Data Mining Knowl. Disc. (2013)
- Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering, ACM Trans. Knowl. Disc. Data (TKDD)
- Subspace clustering for high dimensional data: a review, ACM SIGKDD Explor. Newsletter
- Evaluating clustering in subspace projections of high dimensional data, Proc. VLDB Endow.
- The blind men and the elephant: on meeting the problem of multiple truths in data from clustering and pattern mining perspectives, Mach. Learn.
- MAFIA: efficient and scalable subspace clustering for very large data sets, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
- Entropy-based subspace clustering for mining numerical data, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
- Scalable density-based subspace clustering, in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM ’11
- Robust projected clustering, Knowl. Inf. Syst.
- Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. (Stat. Methodol.)
- Density-connected subspace clustering for high-dimensional data, in: Proceedings of the 2004 SIAM International Conference on Data Mining
- A generic framework for efficient subspace clustering of high-dimensional data, in: Proceedings of the Fifth IEEE International Conference on Data Mining, ICDM ’05
Kai Ming Ting received his Ph.D. from the University of Sydney, Australia. He is a Professor in the Faculty of Science and Technology at Federation University, Australia. His current research interests are in the areas of mass estimation and mass-based approaches, ensemble approaches and data stream data mining.
Mark Carman received his research doctorate from the University of Trento in Italy before doing a postdoc at the University of Lugano, Switzerland. He is now a Senior Lecturer in the Faculty of Information Technology at Monash University, Australia. His research interests include information retrieval and web mining.