Computing, Artificial Intelligence and Information Technology
A new nonsmooth optimization algorithm for minimum sum-of-squares clustering problems

https://doi.org/10.1016/j.ejor.2004.06.014

Abstract

The minimum sum-of-squares clustering problem is formulated as a problem of nonsmooth, nonconvex optimization, and an algorithm based on nonsmooth optimization techniques is developed for its solution. The issue of applying this algorithm to large data sets is discussed. Results of numerical experiments are presented which demonstrate the effectiveness of the proposed algorithm.

Introduction

Clustering is the unsupervised classification of patterns. Cluster analysis deals with the problem of organizing a collection of patterns into clusters based on similarity. It has found many applications, including information retrieval, document extraction and image segmentation.

In cluster analysis we assume that we are given a set $X$ of a finite number of points in the $d$-dimensional space $\mathbb{R}^d$, that is

$$X=\{x_1,\ldots,x_n\},\quad\text{where } x_i\in\mathbb{R}^d,\ i=1,\ldots,n.$$

The subject of cluster analysis is the partition of the set $X$ into a given number $q$ of overlapping or disjoint subsets $C_i$, $i=1,\ldots,q$, with respect to predefined criteria, such that

$$X=\bigcup_{i=1}^{q}C_i.$$

The sets $C_i$, $i=1,\ldots,q$, are called clusters. The clustering problem is called hard clustering if every data point belongs to one and only one cluster. Unlike hard clustering, in the fuzzy clustering problem the clusters are allowed to overlap and instances have degrees of membership in each cluster. In this paper we consider exclusively the hard unconstrained clustering problem; that is, we additionally assume that

$$C_i\cap C_k=\emptyset,\quad i,k=1,\ldots,q,\ i\neq k,$$

and no constraints are imposed on the clusters $C_i$, $i=1,\ldots,q$. Thus every point $x\in X$ is contained in exactly one set $C_i$.

Each cluster $C_i$ can be identified by its center (or centroid). Then the clustering problem can be reduced to the following optimization problem (see [7], [8], [38]):

$$\text{minimize}\ \ \phi(C,a)=\frac{1}{n}\sum_{i=1}^{q}\sum_{x\in C_i}\|a_i-x\|^2\quad\text{subject to}\ \ C\in\bar{C},\ a=(a_1,\ldots,a_q)\in\mathbb{R}^{d\times q},\qquad(1)$$

where $\|\cdot\|$ denotes the Euclidean norm, $C=\{C_1,\ldots,C_q\}$ is a set of clusters, $\bar{C}$ is the set of all possible $q$-partitions of the set $X$, $a_i$ is the center of the cluster $C_i$, $i=1,\ldots,q$,

$$a_i=\frac{1}{|C_i|}\sum_{x\in C_i}x,$$

and $|C_i|$ is the cardinality of the set $C_i$, $i=1,\ldots,q$. Problem (1) is also known as minimum sum-of-squares clustering. The combinatorial formulation (1) is not suitable for the direct application of mathematical programming techniques. Problem (1) can be rewritten as the following mathematical programming problem:

$$\text{minimize}\ \ \psi(a,w)=\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{q}w_{ij}\|a_j-x_i\|^2\qquad(2)$$

subject to

$$\sum_{j=1}^{q}w_{ij}=1,\quad i=1,\ldots,n,\qquad w_{ij}\in\{0,1\},\quad i=1,\ldots,n,\ j=1,\ldots,q.$$

Here

$$a_j=\frac{\sum_{i=1}^{n}w_{ij}x_i}{\sum_{i=1}^{n}w_{ij}},\quad j=1,\ldots,q,$$

and $w_{ij}$ is the association weight of pattern $x_i$ with cluster $j$ (to be found), given by

$$w_{ij}=\begin{cases}1 & \text{if pattern } x_i \text{ is allocated to cluster } j,\\ 0 & \text{otherwise,}\end{cases}$$

and $w$ is an $n\times q$ matrix.
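To make formulation (1) concrete, the sketch below (illustrative code with made-up names and toy data, not the authors' implementation) evaluates the sum-of-squares objective $\phi$ for a given hard partition, computing each center $a_i$ as the mean of its cluster:

```python
import numpy as np

def mssc_objective(X, labels, q):
    """Evaluate the objective of problem (1) for a hard partition of X
    encoded by integer labels in {0, ..., q-1}."""
    n = X.shape[0]
    total = 0.0
    for j in range(q):
        members = X[labels == j]
        if members.size == 0:
            continue                      # an empty cluster contributes nothing
        centroid = members.mean(axis=0)   # a_j = (1/|C_j|) * sum of its members
        total += ((members - centroid) ** 2).sum()
    return total / n

# Toy usage (data values are made up): two well-separated groups
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
print(mssc_objective(X, labels=np.array([0, 0, 1, 1]), q=2))
```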

There exist different approaches to clustering, including agglomerative and divisive hierarchical clustering algorithms as well as algorithms based on mathematical programming techniques. Descriptions of many of these algorithms can be found, for example, in [14], [23], [38]. An excellent up-to-date survey of existing approaches, together with a comprehensive list of references on clustering algorithms, is provided in [24].

Problem (2) is a global optimization problem, so different algorithms of mathematical programming can be applied to it. A review of such algorithms, among them dynamic programming, branch and bound, cutting plane and k-means methods, can be found in [18]. The dynamic programming approach can be applied effectively only when the number of instances $n\leq 20$, which means that it is not effective for real-world problems (see [25]). However, in the one-dimensional case ($d=1$) the clustering problem can be solved exactly by dynamic programming, in polynomial time [38].
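The one-dimensional result rests on the fact that, for sorted 1-d data, the clusters of an optimal partition are contiguous intervals, so the optimal cost satisfies a simple recursion. A minimal sketch of this textbook dynamic programme (our illustration, not code from [38]) runs in $O(qn^2)$ time using prefix sums:

```python
import numpy as np

def mssc_1d_dp(x, q):
    """Exact minimum sum-of-squares clustering of 1-d data: D[k][j] is the
    optimal cost of grouping the first j sorted points into k clusters."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    s = np.concatenate(([0.0], np.cumsum(x)))        # prefix sums
    s2 = np.concatenate(([0.0], np.cumsum(x ** 2)))  # prefix sums of squares

    def sse(i, j):
        # sum of squared deviations of x[i..j] from its mean, in O(1)
        cnt, tot = j - i + 1, s[j + 1] - s[i]
        return (s2[j + 1] - s2[i]) - tot * tot / cnt

    D = np.full((q + 1, n + 1), np.inf)
    D[0][0] = 0.0
    for k in range(1, q + 1):
        for j in range(k, n + 1):
            # the k-th cluster is the interval x[m..j-1] for some split m
            D[k][j] = min(D[k - 1][m] + sse(m, j - 1) for m in range(k - 1, j))
    return D[q][n]    # unscaled optimal sum of squares
```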

Branch and bound algorithms are effective only when the database contains just a few hundred records and the number of clusters is small (less than 5) (see [13], [17], [18], [27]). For these methods the solution of clustering problems with $n\geq 1000$ and $q\geq 10$ is out of reach. Different heuristics can be used for solving large clustering problems, and k-means is one such algorithm. Different versions of this algorithm have been studied by many authors (see [38]). It is a very fast algorithm and is suitable for clustering large data sets. k-means gives good results when there are few clusters but deteriorates when there are many [18]. The algorithm achieves a local minimum of problem (1) (see [36]); however, results of numerical experiments presented, for example, in [21] show that the best clustering found with k-means may be more than 50% worse than the best known one.
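For reference, the classical Lloyd-style k-means iteration discussed above can be sketched in a few lines (a minimal illustration, not any of the cited implementations); its speed comes from alternating a cheap assignment step with a centroid step, and its local-minimum behaviour from the dependence on the random start:

```python
import numpy as np

def lloyd_kmeans(X, q, iters=100, seed=0):
    """Minimal k-means: alternate nearest-center assignment and centroid
    updates until the centers stop moving (a local minimum of (1))."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=q, replace=False)]   # random start
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                           # assignment step
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(q)])  # centroid step
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```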

Much better results have been obtained with metaheuristics such as simulated annealing, tabu search and genetic algorithms [34]. Simulated annealing approaches to clustering have been studied, for example, in [9], [37], [39]. The application of tabu search methods to the clustering problem is studied in [1]. Genetic algorithms for clustering are described in [34]. The results of numerical experiments presented in [2] show that even for small cluster analysis problems, where the number of entities $n\leq 100$ and the number of clusters $q\leq 5$, these algorithms take 500–700 (sometimes several thousand) times more CPU time than the k-means algorithm. For relatively large databases one can expect this difference to increase, which makes metaheuristic global optimization algorithms ineffective for many clustering problems. However, these algorithms can be applied to large clustering problems if combined with decomposition (see [20]).

An approach to cluster analysis problems based on bilinear programming techniques is described in [29]. The paper [5] describes a global optimization approach to clustering and demonstrates how the supervised data classification problem can be solved via clustering. The objective function in this problem is both nonsmooth and nonconvex, and it has a large number of local minimizers. Problems of this type are quite challenging for general-purpose global optimization techniques: due to the large number of variables and the complexity of the objective function, such techniques, as a rule, fail to solve them.

In [15] an interior point method for the minimum sum-of-squares clustering problem is developed. The papers [20], [30] develop variable neighborhood search algorithms, and the paper [19] presents the j-means algorithm, which extends k-means by adding a jump move. The global k-means heuristic, an incremental approach to the minimum sum-of-squares clustering problem, is developed in [28]. The incremental approach is also studied in [21]. The reported numerical experiments show the high effectiveness of these algorithms on many clustering problems.

As mentioned above, problem (2) is a global optimization problem and its objective function has many local minima. However, global optimization techniques are highly time-consuming when applied to many clustering problems. It is therefore very important to develop clustering algorithms based on optimization techniques that compute "deep" local minimizers of the objective function. The clustering algorithm proposed and studied in this paper is of this type and is based on nonsmooth optimization techniques. The algorithm computes clusters step by step, gradually increasing the number of clusters until a termination condition is met; that is, it allows one to calculate as many clusters as a data set contains with respect to some tolerance.

The paper is organized as follows: the nonsmooth optimization approach to clustering is presented in Section 2. Section 3 describes an algorithm for solving clustering problems. An algorithm for solving the underlying optimization problems is discussed in Section 4. Issues of complexity reduction for clustering in large data sets are discussed in Section 5, while Section 6 presents and discusses the results of the numerical experiments. Section 7 concludes the paper.

Section snippets

The nonsmooth optimization approach to minimum sum-of-squares clustering

In this section we present a formulation of the clustering problem in terms of nonsmooth, nonconvex optimization.

The problems (1), (2) can be reformulated as the following mathematical programming problem (see [5], [6], [7], [8]):

$$\text{minimize}\ \ f(a_1,\ldots,a_q)\quad\text{subject to}\ \ a=(a_1,\ldots,a_q)\in\mathbb{R}^{d\times q},\qquad(3)$$

where

$$f(a_1,\ldots,a_q)=\frac{1}{n}\sum_{i=1}^{n}\min_{j=1,\ldots,q}\|a_j-x_i\|^2.$$

It is shown in [7] that problems (1), (2), (3) are equivalent. The number of variables in problem (2) is $(n+d)\times q$, whereas in problem (3) this number is only $d\times q$ and the number of
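The objective of problem (3) is cheap to evaluate, and a short sketch (illustrative, not the authors' code) makes the reduction in problem size concrete: only the $d\times q$ center coordinates are variables, while the pointwise minimum over $j$ is exactly what makes $f$ nonsmooth and nonconvex:

```python
import numpy as np

def f(centers, X):
    """Objective of problem (3): (1/n) * sum_i min_j ||a_j - x_i||^2.
    centers: (q, d) array of cluster centers; X: (n, d) data array."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, q)
    return d2.min(axis=1).mean()   # the min over j is the nonsmooth part
```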

An optimization clustering algorithm

In this section we will describe a clustering algorithm.

Algorithm 1

An algorithm for solving a cluster analysis problem.

  • Step 1.

    (Initialization). Select a tolerance $\epsilon>0$. Select a starting point $a^0=(a^0_1,\ldots,a^0_d)\in\mathbb{R}^d$ and solve the minimization problem (3) with $q=1$. Let $a^1\in\mathbb{R}^d$ be a solution to this problem and $f_1^*$ the corresponding objective function value. Set $k=1$.

  • Step 2.

    (Computation of the next cluster center). Select a point $y^0\in\mathbb{R}^d$ and solve the following minimization problem:
$$\text{minimize}\ \ \bar f_k(y)\quad\text{subject to}\ \ y\in\mathbb{R}^d,$$
where
$$\bar f_k(y)=\sum_{i=1}^{n}\cdots$$
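Since the snippet of Algorithm 1 above is truncated, the following sketch reconstructs the overall step-by-step scheme under explicit assumptions: the auxiliary objective $\bar f_k$ is taken in the form standard for incremental minimum sum-of-squares clustering (for each point, the minimum of its current squared distance to its nearest center and its squared distance to the candidate new center $y$); the starting point $y^0$ and the relative-decrease stopping test are our illustrative choices; and SciPy's derivative-free Nelder–Mead solver stands in for the authors' nonsmooth optimization method of Section 4:

```python
import numpy as np
from scipy.optimize import minimize

def incremental_clustering(X, eps=1e-2, q_max=10):
    """Hedged sketch of the incremental scheme: grow the set of centers
    one at a time, refining all of them after each addition."""
    n, d = X.shape

    def f(a):  # objective of problem (3) for a (q, d) array of centers
        d2 = ((X[:, None, :] - a[None, :, :]) ** 2).sum(axis=2)
        return d2.min(axis=1).mean()

    centers = X.mean(axis=0)[None, :]   # Step 1: for q = 1 the centroid is optimal
    f_prev = f(centers)
    while centers.shape[0] < q_max:
        # squared distance of every point to its nearest current center
        d2_near = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Step 2: assumed form of the auxiliary objective f_bar_k
        f_bar = lambda y: np.minimum(d2_near, ((X - y) ** 2).sum(axis=1)).mean()
        y0 = X[np.argmax(d2_near)]      # illustrative start: the worst-served point
        y = minimize(f_bar, y0, method="Nelder-Mead").x
        # refine all k+1 centers together on the full objective (3)
        a0 = np.vstack([centers, y])
        res = minimize(lambda a: f(a.reshape(-1, d)), a0.ravel(), method="Nelder-Mead")
        # illustrative stopping test on the relative decrease of the objective
        if (f_prev - res.fun) / max(f_prev, 1e-12) < eps:
            break
        centers = res.x.reshape(-1, d)
        f_prev = res.fun
    return centers, f_prev
```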

Solving optimization problems

In this section we discuss an algorithm for solving problems (5), (6) arising in the clustering algorithm. Both are nonsmooth optimization problems; therefore we first recall some definitions from nonsmooth analysis.

Let $\Phi$ be a function defined on $\mathbb{R}^p$. This function is said to be locally Lipschitz continuous on $\mathbb{R}^p$ if for any bounded subset $S\subset\mathbb{R}^p$ there exists a constant $L>0$ such that
$$|\Phi(y)-\Phi(u)|\leq L\|y-u\|\quad\forall y,u\in S.$$
Such a function is differentiable almost everywhere and one can define for it
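To make these notions concrete for the clustering function of problem (3): at any point where every $x_i$ has a unique nearest center, the function is differentiable and its gradient with respect to center $a_j$ is $(2/n)\sum_{i\in C_j}(a_j-x_i)$; where ties occur the function is nonsmooth and this vector is only one element of the generalized (Clarke) subdifferential. The sketch below (our illustration, not the authors' nonsmooth solver) computes this piecewise gradient:

```python
import numpy as np

def clustering_gradient(centers, X):
    """One (sub)gradient of f from problem (3): the exact gradient wherever
    each point's nearest center is unique, one Clarke subgradient at ties."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)            # breaks ties by lowest index
    g = np.zeros_like(centers)
    for j in range(centers.shape[0]):
        g[j] = 2.0 * (centers[j] - X[nearest == j]).sum(axis=0) / n
    return g
```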

Complexity reduction for large-scale data sets

Due to the highly combinatorial nature of clustering problems, two characteristics of a given data set can severely affect the performance of a clustering tool: the number of data records (instances) and the number of data attributes (features). In many cases the development of effective tools requires reducing both the number of features and the number of instances without loss of knowledge-generating ability. In this section we will consider one scheme for reducing the number of

Results and discussion

To verify the effectiveness of the clustering algorithm, a number of numerical experiments with small and medium-sized data sets were carried out on a Pentium 4 1.7 GHz PC.

First we consider three standard test problems to compare our algorithm with the following heuristics and metaheuristics: the k-means algorithm (K-M), the tabu search (TS) method, a genetic algorithm (GA) and the simulated annealing (SA) method. Then we use four other test data sets to compare Algorithm 1 with the

Conclusions

In this paper a nonsmooth, nonconvex optimization-based algorithm for solving cluster analysis problems has been proposed. As this algorithm calculates clusters step by step, it allows the decision maker to vary the number of clusters easily, according to criteria suggested by the nature of the decision-making situation, without incurring the obvious costs of an increasingly complex solution procedure. The suggested approach utilizes a stopping criterion that prevents the appearance of

Acknowledgements

The authors would like to thank the three anonymous referees whose very detailed comments have considerably improved this paper.

This research was supported by the Australian Research Council.

References (39)

  • A.M. Bagirov, Minimization methods for one class of nonsmooth functions and calculation of semi-equilibrium prices.
  • A.M. Bagirov, A method for minimization of quasidifferentiable functions, Optimization Methods and Software (2002).
  • A.M. Bagirov et al., Using global optimization to improve classification for medical diagnosis and prognosis, Topics in Health Information Management (2001).
  • A.M. Bagirov et al., Global optimization approach to classification, Optimization and Engineering (2001).
  • H.H. Bock, Automatische Klassifikation (1974).
  • H.H. Bock, Clustering and neural networks.
  • F.H. Clarke, Optimization and Nonsmooth Analysis (1983).
  • V.F. Demyanov et al., Constructive Nonsmooth Analysis (1995).
  • I.S. Dhillon et al., Efficient clustering of very large document collections.