Nonsmooth nonconvex optimization approach to clusterwise linear regression problems
Introduction
Unsupervised classification, or clustering, is an important task in data mining: it consists in finding subsets of similar points in a data set in order to reveal patterns in the data. Regression analysis consists in fitting a function (often linear) to the data to discover how one or more variables vary as a function of another.
The aim of clusterwise regression is to combine both of these techniques to discover trends within data when more than one trend is likely to exist. Clusterwise regression has applications, for instance, in market segmentation, where it allows one to gather information on customer behaviors for several unknown groups of customers [1], [7]. It has also been applied to investigate stock-exchange data [20] and the so-called benefit segmentation [28]. The presence of nonlinear relationships, heterogeneous subjects, or time series in these applications necessitates the use of two or more regression functions to best summarize the underlying structure of the data.
The simplest case of clusterwise regression is the use of two or more linear regression functions to investigate the structure of the data. Such an approach is called clusterwise linear regression, and it is more widely used and better studied than other approaches. This problem can be formulated as an optimization problem; mixed integer nonlinear programming formulations can be found in [9], [12]. Such problems may have a very large number of variables even for moderately large data sets, so their exact global optimization is very challenging and out of reach of existing algorithms [9]. The most popular approaches to clusterwise linear regression are generalizations of classical clustering algorithms such as k-means [23], [24] or EM [14]. In [8] the authors base their approach on variable neighborhood search.
In [20], clusterwise linear regression is studied for the case where the set of predictor variables forms an L2-continuous stochastic process. For each cluster, the estimators of the regression coefficients are given by partial least squares regression, and the number of clusters is treated as unknown. The paper [15] extends the so-called TCLUST methodology to perform robust clusterwise linear regression and also proposes a feasible algorithm for its practical implementation.
The paper [12] presents a conditional mixture, maximum likelihood methodology for performing clusterwise linear regression. This methodology simultaneously estimates separate regression functions and membership in K > 0 clusters or groups; the EM algorithm is used for parameter estimation.
Existing clusterwise linear regression algorithms suffer from the same drawbacks as their clustering counterparts: they are very sensitive to the choice of an initial solution and may lead to sub-optimal solutions [31]. Furthermore, most of these algorithms assume the number of clusters to be known a priori. Most of them try to separate the data into subsets of observations and use one regression function for each subset.
There have been several attempts to simultaneously find all regression functions to approximate a data set and to estimate the number of subsets. The paper [13] presents a methodology which simultaneously clusters observations into a preset number of groups and estimates the corresponding regression functions’ coefficients, all to optimize a common objective function. Then a simulated annealing-based methodology is described to accommodate overlapping or nonoverlapping clustering. In the paper [16], the authors show that the estimation of the clusterwise regression model is equivalent to solving a nonlinear mixed integer programming model.
An information-based criterion for determining the number of clusters in the clusterwise regression problem is proposed in [22]. It is shown that, under a probabilistically structured population, the proposed criterion selects the true number of regression hyperplanes with probability one among all class-growing sequences of classifications, when the number of observations from the population increases to infinity. The paper [21] studies the problem of estimating the number of clusters in the context of logistic regression clustering. The classification likelihood approach is employed to tackle this problem. A model-selection based criterion for selecting the number of logistic curves is proposed and its asymptotic property is also considered.
In this paper, a new approach for solving clusterwise linear regression problems is proposed based on a nonsmooth nonconvex formulation. This approach starts with one regression function and summarizes the underlying structure of the data by dynamically adding one hyperplane at each iteration. A special procedure is introduced to generate good starting points for the global optimization problems solved at each iteration of the incremental algorithm. Such an approach allows one to find a global or near-global solution to the problem when the data set is sufficiently dense.
Several incremental algorithms have been proposed to solve the sum of squares clustering problems. The global k-means algorithm and its variations [2], [17] are based on constructing the clusters incrementally, starting from finding the center for the whole data set and then adding a cluster at a time and refining the new set of clusters by applying k-means. In this paper we propose to apply a similar scheme in order to solve the clusterwise linear regression problem. Instead of classical centers, we propose to use affine functions as representatives of clusters.
The proposed algorithm is numerically tested on twenty small and seven medium-sized and large publicly available regression data sets, and it is compared with the multi-start Späth algorithm for clusterwise linear regression. Additionally, we study the efficiency of the proposed algorithm as the numbers of points, features, and clusters vary, using randomly generated data sets.
The structure of the paper is as follows. In Section 2 the clusterwise linear regression problem is introduced. We briefly explain the Späth algorithm for solving the clusterwise linear regression problem in Section 3 and describe the incremental algorithm in Section 4. Computation of initial solutions is discussed in Section 5. Section 6 contains computational results and Section 7 concludes the paper.
Clusterwise linear regression problem
In this section we present the nonsmooth nonconvex optimization formulation of the clusterwise linear regression problem. Given a data set $A = \{(a_i, b_i) \in \mathbb{R}^n \times \mathbb{R} : i = 1, \ldots, m\}$, the aim of clusterwise linear regression is to find simultaneously an optimal partition of the data into $k$ clusters and regression coefficients $\{x^j, y^j\}$, $j = 1, \ldots, k$, within clusters in order to minimize the overall fit. Let $A^j$, $j = 1, \ldots, k$, be clusters such that $A^j \cap A^l = \emptyset$ for $j \neq l$ and $A = \bigcup_{j=1}^{k} A^j$, and let $\{x^j, y^j\}$, with $x^j \in \mathbb{R}^n$ and $y^j \in \mathbb{R}$, be the linear regression coefficients computed for cluster $A^j$.
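Because the snippet truncates before the displayed formulas, the following reconstruction, consistent with the fragments above and with the general exponent $p \geq 1$ used in Section 3, states the fit measure and the nonsmooth objective behind Problem (2). The regression error of a hyperplane $(x^j, y^j)$ at a point $(a, b)$ is

$$ h(x^j, y^j, a, b) = \bigl| \langle x^j, a \rangle + y^j - b \bigr|^p, \qquad p \geq 1, $$

and minimizing the overall fit amounts to the nonsmooth nonconvex problem

$$ \min_{x, y} \; F_k(x, y) = \sum_{i=1}^{m} \min_{j = 1, \ldots, k} h(x^j, y^j, a_i, b_i), $$

where the inner minimum implicitly assigns each point to its best-fitting hyperplane; it is this pointwise minimum of smooth terms that makes the objective nonsmooth and nonconvex.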
An algorithm for clusterwise linear regression
In this section we recall the algorithm from [24] for solving Problem (2), which is based on the well-known k-means algorithm. In [24] the algorithm was described for $p = 2$; in our description below we present it for general $p \geq 1$.

Algorithm 1 (Späth algorithm).
Step 1 (Initialization): Select mutually disjoint clusters $A^1, \ldots, A^k$ such that $\bigcup_{j=1}^{k} A^j = A$.
Step 2: For $j = 1, \ldots, k$, solve the following linear regression problem: $\min \sum_{(a, b) \in A^j} h(x^j, y^j, a, b)$ and obtain regression coefficients $\{x^j, y^j\}$.
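To make the alternating structure concrete, here is a minimal Python sketch of this scheme for $p = 2$ (least squares within clusters). It is a sketch under stated assumptions rather than the implementation of [24]: the names fit_cluster and spath are illustrative, and clusters that become empty during reassignment simply keep a dummy zero hyperplane, whereas a careful implementation would handle them explicitly.

    import numpy as np

    def fit_cluster(A, b):
        """Least squares fit of b ~ <x, a> + y on one cluster (p = 2)."""
        X = np.column_stack([A, np.ones(len(A))])     # append intercept column
        coef, *_ = np.linalg.lstsq(X, b, rcond=None)
        return coef[:-1], coef[-1]                    # (x, y)

    def spath(A, b, labels, max_iter=100):
        """Alternate between per-cluster regression and point reassignment."""
        k = int(labels.max()) + 1
        for _ in range(max_iter):
            # Step 2: solve a linear regression problem within each cluster
            coefs = [fit_cluster(A[labels == j], b[labels == j])
                     if (labels == j).any()
                     else (np.zeros(A.shape[1]), 0.0)  # empty cluster: dummy
                     for j in range(k)]
            # Step 3: reassign each point to its best-fitting hyperplane
            errors = np.column_stack([(A @ x + y - b) ** 2 for x, y in coefs])
            new_labels = errors.argmin(axis=1)
            if np.array_equal(new_labels, labels):    # Step 4: partition stable
                break
            labels = new_labels
        return coefs, labels

Usage is simply spath(A, b, labels) for an (m, n) matrix A of predictors, an m-vector b of responses, and any initial integer labeling of the points into k clusters.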
An incremental algorithm
The global optimization Problem (2) may have a large number of local solutions, among which only global or near-global ones are of interest. However, conventional global optimization techniques cannot be directly applied to this problem due to its size, while efficient local methods only reach local solutions whose quality depends on the starting points. It is therefore crucial to develop a procedure for finding good starting points. In this section we propose to incorporate local methods into an incremental scheme for constructing the regression hyperplanes.
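As an illustration only, the incremental strategy can be sketched in Python by reusing the spath and fit_cluster helpers from the previous listing. The helper initial_hyperplane below is a crude stand-in, fitting a hyperplane to the worst-approximated fraction of the points, for the starting-point procedure actually developed in Section 5.

    def initial_hyperplane(A, b, coefs, frac=0.25):
        """Stand-in for the Section 5 procedure: fit a hyperplane to the
        fraction of points worst approximated by the current hyperplanes."""
        residual = np.column_stack([(A @ x + y - b) ** 2
                                    for x, y in coefs]).min(axis=1)
        worst = residual.argsort()[-max(2, int(frac * len(b))):]
        return fit_cluster(A[worst], b[worst])

    def incremental_clr(A, b, k_max):
        """Grow the number of regression hyperplanes from 1 up to k_max."""
        labels = np.zeros(len(b), dtype=int)          # start with one cluster
        coefs, labels = spath(A, b, labels)           # plain linear regression
        for k in range(2, k_max + 1):
            u, v = initial_hyperplane(A, b, coefs)    # start for k-th hyperplane
            # points better explained by the new hyperplane seed cluster k - 1
            new_err = (A @ u + v - b) ** 2
            old_err = np.column_stack([(A @ x + y - b) ** 2
                                       for x, y in coefs]).min(axis=1)
            labels = np.where(new_err < old_err, k - 1, labels)
            coefs, labels = spath(A, b, labels)       # refine all k hyperplanes
        return coefs, labels

The design mirrors the global k-means idea recalled in the Introduction: each outer iteration adds one representative (here an affine function instead of a center) and then refines the whole configuration with the local alternating method.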
Computation of initial solutions
In this section we design an algorithm for solving Problem (6).
We denote a hyperplane by a pair $(u, v)$, where $u \in \mathbb{R}^n$ and $v \in \mathbb{R}$. Consider the set $C_k$ of all hyperplanes which do not attract any point from the set $A$. It is clear that over this set the auxiliary function is constant and reaches its global maximum value (5). Therefore any hyperplane from the set $C_k$ is a stationary point of the auxiliary function. This means that if a local method is started from a hyperplane in $C_k$, it cannot leave this set and no descent is achieved.
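The displayed formulas are missing from the snippet; the following reconstruction is consistent with the surrounding text (the symbols $\bar{f}_k$ and $r_{k-1}^i$ are assumed here). Let $r_{k-1}^i = \min_{j=1,\ldots,k-1} h(x^j, y^j, a_i, b_i)$ denote the error of the point $(a_i, b_i)$ with respect to the $k-1$ hyperplanes already constructed. Then the auxiliary function of Problem (6) can be written as

$$ \bar{f}_k(u, v) = \sum_{i=1}^{m} \min\bigl\{ r_{k-1}^i, \; h(u, v, a_i, b_i) \bigr\}, $$

and $C_k = \{(u, v) : h(u, v, a_i, b_i) \geq r_{k-1}^i, \ i = 1, \ldots, m\}$. Over $C_k$ each term of the sum attains its cap $r_{k-1}^i$, so $\bar{f}_k$ is constant there and equal to its global maximum $\sum_{i=1}^{m} r_{k-1}^i$.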
Numerical results and discussions
In this section we present results of numerical experiments obtained by applying the proposed algorithm to real-world and randomly generated regression data sets. First we give some illustrative examples using three small data sets, then report results on data sets with known solutions, and finally present results on large data sets. We also use these numerical results to compare the proposed algorithm with the multi-start Späth algorithm.
Conclusions
In this paper we developed an incremental algorithm for solving the clusterwise linear regression problem. This algorithm gradually finds clusters and linear regression functions within these clusters while minimizing the overall fit function. We also proposed an algorithm that constructs initial solutions at each iteration of the incremental algorithm using results obtained at the previous iteration. This algorithm allows one to find significantly more accurate solutions considerably faster than the multi-start Späth algorithm.
Acknowledgement
The authors are grateful to two anonymous referees for their valuable comments which significantly improved the presentation of the paper.
References

- Amalgamation of partitions from multiple segmentation bases: a comparison of non-model-based and model-based methods, European Journal of Operational Research (2010)
- Modified global k-means algorithm for minimum sum-of-squares clustering problems, Pattern Recognition (2008)
- Fast modified global k-means algorithm for incremental cluster construction, Pattern Recognition (2011)
- A new nonsmooth optimization algorithm for minimum sum-of-squares clustering problems, European Journal of Operational Research (2006)
- A combined approach for segment-specific market basket analysis, European Journal of Operational Research (2008)
- Globally optimal clusterwise regression by mixed logical-quadratic programming, European Journal of Operational Research (2011)
- Modeling wine preferences by data mining from physicochemical properties, Decision Support Systems (2009)
- Robust clusterwise linear regression through trimming, Computational Statistics and Data Analysis (2010)
- A mathematical programming approach to clusterwise regression model and its extensions, European Journal of Operational Research (1999)
- The global k-means clustering algorithm, Pattern Recognition (2003)
- Clusterwise PLS regression on a stochastic process, Computational Statistics & Data Analysis
- A consistent procedure for determining the number of clusters in regression clustering, Journal of Statistical Planning and Inference
- Consumer benefit segmentation using clusterwise linear regression, International Journal of Research in Marketing
- Modeling of strength of high performance concrete using artificial neural networks, Cement and Concrete Research
- Modeling slump flow of concrete using second-order regressions and artificial neural networks, Cement and Concrete Composites