Grouping points by shared subspaces for effective subspace clustering
Introduction
Clustering detects groups of similar points while separating dissimilar points into different groups [1]. It is a fundamental data analysis technique for data mining and knowledge discovery, and has been applied in fields such as engineering, the medical and biological sciences, the social sciences and economics; many clustering algorithms have been studied in the research community over the past few decades.
Traditional “full-space clustering” algorithms become ineffective when clusters exist in different subspaces, as many irrelevant attributes may cause the similarity measurements used by the algorithms to become unreliable [1]. To tackle this problem, the similarity of points should be assessed within subspaces (over a subset of relevant attributes).
Subspace clustering aims to discover clusters which exist in different subspaces [2]. It works by combining subspace search and clustering procedures. The number of possible axis-parallel subspaces is exponential in the dimensionality and the number of possible arbitrarily oriented (non axis-parallel) subspaces is infinite. To deal with the huge number of possible subspaces, subspace clustering algorithms usually rely on heuristics for subspace search, e.g., top-down search [3] and bottom-up search [4]. All these existing subspace clustering algorithms perform clustering by measuring the similarity between points in the given feature space, and the subspace selection and clustering processes are tightly coupled.
In this paper, we focus on clustering in axis-parallel subspaces. We contribute a new subspace clustering framework, named CSSub (Clustering by Shared Subspaces), which has the following three unique features:
- 1.
CSSub groups points by their shared subspaces. It performs clustering by measuring the similarity between points based on the number of subspaces they share. This enables CSSub to detect non-redundant/non-overlapping subspace clusters directly by running a clustering method only once. In contrast, many existing subspace clustering algorithms need to run a chosen clustering method for each subspace; and this must be repeated for an exponentially large number of subspaces. As a consequence, they produce many redundant subspace clusters.
- 2.
CSSub decouples the candidate subspace selection process from the clustering process. By explicitly splitting them into independent processes, it enables candidate subspaces to be selected independent of the clustering process—eliminating the need to repeat the clustering step a large number of times. In contrast, many existing subspace clustering algorithms which have tightly-coupled processes must rely on an anti-monotonicity property to prune the search space.
- 3.
The decoupling approach has an added advantage that allows different types of cluster definitions to be employed easily. The time cost of the entire process is dominated by the subspace scoring function. We show that changing the cluster definition so that subspaces can be evaluated with a linear-time scoring function reduces the runtime of CSSub from quadratic to linear. A similar change is difficult, if not impossible, for existing algorithms because of their tightly coupled processes.
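The core idea behind feature 1 can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: assume a binary membership matrix `M` (points × candidate subspaces) where `M[i, s] = 1` if point `i` falls in a dense region of subspace `s`; the pairwise similarity is then simply the number of shared subspaces, and clustering operates on this similarity rather than on the raw attribute values.

```python
import numpy as np

def shared_subspace_similarity(M):
    """Return the n x n matrix of shared-subspace counts.

    Entry (i, j) is the number of candidate subspaces in which
    both point i and point j appear.
    """
    M = np.asarray(M, dtype=int)
    return M @ M.T  # inner product of binary rows = shared-subspace count

# Toy example: 4 points, 4 candidate subspaces.
M = np.array([
    [1, 1, 0, 0],  # point 0 occurs in subspaces 0 and 1
    [1, 1, 0, 0],  # point 1 shares both subspaces with point 0
    [0, 0, 1, 1],  # points 2 and 3 form a second group
    [0, 0, 1, 1],
])
S = shared_subspace_similarity(M)
# S[0, 1] == 2 (two shared subspaces), S[0, 2] == 0 (none shared)
```

Any similarity-based clustering method (e.g. k-medoids on the induced dissimilarity) can then be applied once to `S`, which is why no per-subspace clustering pass is needed.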
We present an extensive empirical evaluation on synthetic and real-world datasets to demonstrate the effectiveness of CSSub. The experiments show that CSSub discovers subspace clusters with arbitrary shapes in noisy data, and that it significantly outperforms existing state-of-the-art subspace clustering algorithms. In addition, CSSub has only one parameter, k (the number of clusters), which needs to be manually tuned; the other parameters can be set automatically based on a heuristic method.
The rest of this paper is organised as follows. We provide an overview of subspace clustering algorithms and related work in Section 2. Section 3 discusses key weaknesses of existing bottom-up subspace clustering algorithms. Section 4 details the subspace clustering framework based on the shared subspaces and presents a density-based approach for subspace scoring. Section 5 presents the algorithms for CSSub. In Section 6, we empirically evaluate the performance of the proposed algorithms on different datasets. Discussion and the conclusions are provided in the last two sections.
Related work
The key task of clustering in subspaces is to develop appropriate subspace search heuristics [2]. There are two basic techniques for subspace search, namely top-down search [3] and bottom-up search [4]. Different subspace clustering algorithms have been proposed based on these two search directions [5], [6], [7], [8], [9]. Search strategies can be further subdivided into systematic and non-systematic search techniques as discussed below.
Key weaknesses of existing bottom-up subspace clustering algorithms
The majority of density-based subspace clustering algorithms rely on a bottom-up search strategy and use anti-monotonicity of the density as a means to reduce the search space. This approach has two weaknesses:
- 1.
The approach tightly couples the candidate subspace selection process with the clustering process, i.e., it must run a clustering method for each subspace; and there are an exponentially large number of subspaces. As a result, many redundant subspace clusters are produced during the search.
Clustering by shared subspaces
In this section, we present an effective subspace clustering framework in order to overcome the weaknesses of existing bottom-up subspace clustering algorithms.
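Section 4 presents a density-based approach for subspace scoring. As a minimal sketch of how one such score could work (the paper's exact scoring function is not reproduced here; the contrast-based score below is an assumption for illustration): project the data onto a candidate subspace, estimate each point's local density from its distance to the k-th nearest neighbour, and score the subspace by the contrast of those densities. Subspaces that contain clusters show a large density contrast between cluster cores and the background, while purely noisy subspaces do not.

```python
import numpy as np

def density_score(X, subspace, k=5):
    """Hypothetical density-contrast score for an axis-parallel subspace.

    Projects X onto `subspace`, computes each point's k-th nearest
    neighbour distance, converts it to an inverse-distance density,
    and returns the relative spread (std / mean) of the densities.
    """
    P = np.asarray(X, dtype=float)[:, list(subspace)]   # project the data
    # Pairwise Euclidean distances in the projected space.
    D = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1))
    knn = np.sort(D, axis=1)[:, k]        # k-th NN distance (col 0 is self)
    dens = 1.0 / (knn + 1e-12)            # simple inverse-distance density
    return dens.std() / (dens.mean() + 1e-12)
```

With this score, a subspace containing a tight cluster embedded in noise scores higher than a subspace whose attributes are uniformly distributed, which is the behaviour a candidate-subspace filter needs.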
Algorithms in CSSub
In this section, we provide the algorithms for the proposed CSSub framework shown in Fig. 3. The stages shown in Algorithm 1 correspond to the stages in Fig. 3.
The first stage in Algorithm 1 generates the set of candidate subspaces by enumerating subspaces up to the largest dimensionality for which the number of candidate subspaces remains less than the data size; this cap on the number of candidates is what keeps the time complexity quadratic. Different ways to generate the initial set of subspaces may be used instead.
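The enumeration stage can be sketched as follows. This is a hedged reading of the description above, in which the cap is applied to the cumulative number of candidates (the function name and the exact stopping rule are assumptions for illustration):

```python
from itertools import combinations

def enumerate_candidates(d, n):
    """Enumerate axis-parallel candidate subspaces over d attributes.

    Adds whole dimensionality levels (all 1-D subspaces, then all 2-D
    subspaces, ...) while the cumulative number of candidates stays
    below the data size n, which bounds later per-subspace work.
    """
    candidates = []
    for dim in range(1, d + 1):
        level = [frozenset(c) for c in combinations(range(d), dim)]
        if len(candidates) + len(level) >= n:
            break  # adding this level would reach the data size
        candidates.extend(level)
    return candidates

subs = enumerate_candidates(d=10, n=100)
# The 1-D level (10) and 2-D level (45) fit under 100;
# the 3-D level (120 subspaces) would exceed the cap, so len(subs) == 55.
```

Each surviving candidate is then passed to the subspace scoring function, so capping the enumeration directly caps the total scoring cost.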
Empirical evaluation
This section presents experiments designed to evaluate the performance of CSSub. We selected seven state-of-the-art non-overlapping subspace clustering algorithms for comparison: a top-down subspace clustering algorithm, PROCLUS [3]; a bottom-up subspace clustering algorithm, P3C [13]; and five soft subspace clustering algorithms, LAC [25], EWKM [26], FSC [27], ESSC [28] and FG-k-means [29]. In addition, we include a full-space clustering algorithm, k-medoids [17], in order to judge the effectiveness of subspace clustering relative to full-space clustering.
Discussion
CSSub is effective on low-to-medium dimensional datasets having subspace clusters with different irrelevant attributes. The experiment using the three artificial datasets, 2T, S1500 and D50, indicates that CSSub is particularly good at identifying subspace clusters of non-globular shape. The robustness analysis shows that it is tolerant to both noise points and noise attributes.
The CSSub framework provides a flexible platform for developing new subspace clustering algorithms which group points by their shared subspaces.
Conclusions
To the best of our knowledge, grouping points by their shared subspaces is a clustering approach that has not been attempted before. The approach exhibits three unique features:
- 1.
The clustering is performed by measuring the similarity between points based on the number of subspaces they share, without examining the attribute values in the given dataset. This enables the subspace clustering to be conducted only once to produce non-redundant/non-overlapping subspace clusters. In contrast, existing algorithms must run a clustering method in each candidate subspace, producing many redundant subspace clusters.
Acknowledgments
Most of this work was done when Ye Zhu was a Ph.D. student at Monash University, Australia. The anonymous reviewers have provided many helpful suggestions to improve this paper.
Ye Zhu received his Ph.D. from Monash University, Australia. He has been awarded a Mollie Holman Medal for the faculty's best doctoral thesis of the year. He is now a research fellow in complex system data analytics at Deakin University, Australia. His research interests are in the areas of clustering and anomaly detection.
References (42)
- et al., A survey on soft subspace clustering, Inf. Sci. (2016)
- et al., Enhanced soft subspace clustering integrating within-cluster and between-cluster information, Pattern Recognit. (2010)
- et al., Distance metric learning for soft subspace clustering in composite kernel space, Pattern Recognit. (2016)
- et al., Discovering outlying aspects in large datasets, Data Mining Knowl. Disc. (2016)
- et al., DUSC: dimensionality unbiased subspace clustering, in: Proceedings of the Seventh IEEE International Conference on Data Mining, ICDM 2007 (2007)
- et al., Data Mining: Concepts and Techniques (2011)
- et al., Data Clustering: Algorithms and Applications (2013)
- et al., Fast algorithms for projected clustering, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD ’99 (1999)
- et al., Automatic subspace clustering of high dimensional data for data mining applications, in: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD ’98 (1998)
- et al., A survey on enhanced subspace clustering, Data Mining Knowl. Disc. (2013)
- Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering, ACM Trans. Knowl. Disc. Data (TKDD)
- Subspace clustering for high dimensional data: a review, ACM SIGKDD Explor. Newsletter
- Evaluating clustering in subspace projections of high dimensional data, Proc. VLDB Endow.
- The blind men and the elephant: on meeting the problem of multiple truths in data from clustering and pattern mining perspectives, Mach. Learn.
- MAFIA: efficient and scalable subspace clustering for very large data sets, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
- Entropy-based subspace clustering for mining numerical data, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
- Scalable density-based subspace clustering, in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM ’11
- Robust projected clustering, Knowl. Inf. Syst.
- Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. (Stat. Methodol.)
- Density-connected subspace clustering for high-dimensional data, in: Proceedings of the 2004 SIAM International Conference on Data Mining
- A generic framework for efficient subspace clustering of high-dimensional data, in: Proceedings of the Fifth IEEE International Conference on Data Mining, ICDM ’05
Kai Ming Ting received his Ph.D. from the University of Sydney, Australia. He is a Professor in the Faculty of Science and Technology at Federation University, Australia. His current research interests are in the areas of mass estimation and mass-based approaches, ensemble approaches and data stream data mining.
Mark Carman received his research doctorate from the University of Trento in Italy before doing a postdoc at the University of Lugano, Switzerland. He is now a Senior Lecturer in the Faculty of Information Technology at Monash University, Australia. His research interests include information retrieval and web mining.