
1 Introduction

With the rapid growth of the national economy, competition in almost every industry has become fierce. In this competitive environment, it is increasingly important for enterprises to analyze and understand the needs and expectations of their customers. Accordingly, most enterprises establish CRM systems that accumulate massive amounts of customer data, which can be analyzed and applied to targeting and predicting potential customers effectively.

Customer segmentation [1] uses clustering to discover intrinsic patterns of customer behaviour from transaction data. K-means has been widely used in customer segmentation because of its simplicity and fast convergence, but it is prone to local optima. The Self-Organizing Map (SOM) can map a high-dimensional input space onto a low-dimensional topology and display the data structure intuitively. [2] used SOM to segment customers based on the RFM model, which alleviated the problems of staggered characteristic parameters and nonlinear distributions. However, it is difficult to analyze performance indicators accurately with SOM. The genetic K-means algorithm (GKA) [3] combines the global optimization ability of the GA with the local search ability of K-means to find the global optimum. [4] added greedy selection to GKA to solve clustering problems.

Previous research indicates that GKA obtains more accurate results by overcoming the tendency of K-means to fall into local optima. However, an uncertain number of clusters can still cause the algorithm to converge to an immature result. Moreover, most existing models treat customer segmentation purely as a clustering problem. Consequently, this paper proposes the AP-GKAs model to provide a complete pipeline for analyzing transaction data.

2 Related Work

The AP algorithm [5] treats every data point as a potential cluster center, called an exemplar. The similarity s(i,k) indicates how well data point k is suited to be the exemplar for data point i; each similarity is set to a negative squared error: for points \(x_i\) and \(x_k\), \(s(i,k)=-\Vert x_i-x_k\Vert ^2\). The preference p indicates the tendency of a data point to be chosen as an exemplar. Without prior knowledge, it is recommended that all preferences be set to the median of the similarities, denoted \(s_m\) [6].
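For concreteness, a minimal numpy sketch of this similarity construction is given below; the function name and the choice of taking the median over the off-diagonal similarities are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def similarity_matrix(X):
    """Build the AP similarity matrix s(i, k) = -||x_i - x_k||^2 and set
    every preference p (the diagonal) to the median similarity s_m."""
    diff = X[:, None, :] - X[None, :, :]          # pairwise differences
    S = -np.sum(diff ** 2, axis=2)                # negative squared errors
    off_diag = S[~np.eye(len(X), dtype=bool)]     # similarities with i != k
    np.fill_diagonal(S, np.median(off_diag))      # p = s_m for all points
    return S
```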

The responsibility r(i,k), sent from data point i to candidate exemplar k, reflects the accumulated evidence for how well suited point k is to serve as the exemplar for point i, taking into account other potential exemplars for point i. At the start, the availabilities are initialized to zero.

The availability a(i,k), sent from candidate exemplar k to point i, reflects the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points. The availability update gathers evidence from the data points on whether the candidate would make a good exemplar.

When updating the two messages, it is important to apply a damping factor to avoid possible oscillations. Each message is set to \(\gamma \) times its value in the previous iteration plus \(1-\gamma \) times its prescribed updated value.
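Following Frey and Dueck's update rules, one damped message-passing iteration can be sketched as follows; the vectorization and the function signature are assumptions of this sketch rather than the authors' code.

```python
import numpy as np

def ap_iteration(S, R, A, gamma=0.5):
    """One damped iteration of affinity propagation.
    S: similarity matrix with preferences on the diagonal;
    R, A: current responsibility and availability matrices."""
    n = S.shape[0]

    # Responsibilities: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
    AS = A + S
    idx = AS.argmax(axis=1)
    first = AS[np.arange(n), idx]
    AS[np.arange(n), idx] = -np.inf
    second = AS.max(axis=1)
    R_new = S - first[:, None]
    R_new[np.arange(n), idx] = S[np.arange(n), idx] - second

    # Availabilities: a(i,k) = min{0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k))}
    Rp = np.maximum(R_new, 0)
    np.fill_diagonal(Rp, R_new.diagonal())        # keep r(k,k) itself
    A_new = Rp.sum(axis=0)[None, :] - Rp
    diag = A_new.diagonal().copy()                # a(k,k) = sum_{i' != k} max(0, r(i',k))
    A_new = np.minimum(A_new, 0)
    np.fill_diagonal(A_new, diag)

    # Damping: gamma * previous value + (1 - gamma) * updated value
    return gamma * R + (1 - gamma) * R_new, gamma * A + (1 - gamma) * A_new
```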

3 AP-GKAs

Based on the AP algorithm and an improved GKA (GKAs), we propose the customer segmentation model AP-GKAs. The steps are: (1) Extract features. Factor analysis constructs the feature attributes and weights each feature of the normalized multi-indicator RFM model; (2) Determine the number of clusters. The AP algorithm clusters the data obtained in step (1) to give a range of cluster numbers, and two evaluation indicators are then used to determine the number of clusters; (3) Cluster. GKAs clusters the transaction data from step (1) with the number of clusters obtained in step (2).

3.1 Extract Feature

Feature extraction first establishes the multi-indicator RFM model, and then uses factor analysis to construct the implied features and the weight of each feature from the normalized multi-indicator RFM model.

Multi-indicator RFM Model. The RFM model describes a customer's overall trading behaviour through recency (R), frequency (F) and monetary value (M). However, customer behaviour is a complex phenomenon, and three indicators alone have drawbacks. For example, two transactions with the same R value cannot reveal whether a customer is new or long-standing. AP-GKAs therefore uses the multi-indicator RFM model [7] to reflect customers' transaction information.

As shown in Table 1, the multi-indicator RFM model uses ten indicators to describe the transaction information comprehensively. Because the indicators differ greatly in magnitude, they cannot be measured on a unified scale directly; therefore, the ten indicators are normalized to improve the model's accuracy.

Table 1. Multi-indicator RFM system.
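As a small illustration of the normalization step, min-max scaling is one common choice; the paper does not specify the exact scaling, and the column names below are placeholders rather than the indicators of Table 1.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# One row per customer, ten multi-indicator RFM columns (names are placeholders).
indicators = pd.read_csv("rfm_indicators.csv")
normalized = pd.DataFrame(MinMaxScaler().fit_transform(indicators),
                          columns=indicators.columns,
                          index=indicators.index)
```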

Factor Analysis. In empirical analysis, each indicator carries a different weight. A widely used weighting method is the analytic hierarchy process, but it is better suited to planning decisions than to weight determination. Accordingly, we use factor analysis to find the implied features and the weight of each feature.

$$\begin{aligned} X=\left( \begin{array}{ccc} a_{11} &{} ... &{} a_{1n}\\ \vdots &{} &{} \vdots \\ a_{m1} &{} ... &{} a_{mn} \end{array} \right) \bullet factor \end{aligned}$$
(1)

where X is the matrix of multi-indicator RFM indexes and factor is the vector of common factors.
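One possible realization of this step uses scikit-learn's FactorAnalysis, as sketched below; the number of common factors and the use of the loading matrix as feature weights are assumptions, since the paper does not fix them.

```python
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

def extract_features(normalized_rfm, n_factors=3):
    """Project the normalized multi-indicator RFM matrix X onto a small
    number of common factors, in the spirit of Eq. (1)."""
    X = StandardScaler().fit_transform(normalized_rfm)
    fa = FactorAnalysis(n_components=n_factors, random_state=0)
    scores = fa.fit_transform(X)      # factor scores, one row per customer
    loadings = fa.components_         # (n_factors, n_indicators) loading matrix
    return scores, loadings
```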

3.2 Determine the Clustering Quantity

Traditional approaches use K-means to determine the number of clusters, but the cluster centers are initialized randomly for each candidate number, so the validity indices are poorly comparable and the resulting K is inaccurate. Most researchers use the rule \(K_{max}\le \sqrt{n}\) to set \(K_{max}\), but this rule is derived under a particular assumption that is not a sufficient condition [8]. The AP algorithm yields a tighter cluster range, so the upper limit of K is set to the number of exemplars returned by AP.

Sum of Squared Errors: The sum of squared errors (SSE) reflects the similarity of the data points within a cluster. The higher the similarity of points in the same cluster, the smaller the SSE.

$$\begin{aligned} SSE=\sum _{i=1}^k\sum _{x\in C_i}\Vert x-\mu _i\Vert ^2 \end{aligned}$$
(2)

where k is the number of clusters and \(\mu _i\) is the center of cluster i.

Silhouette Coefficient: The silhouette coefficient combines cohesion and separation, and can be used to compare different algorithms on the same data or to assess the influence of different settings on the clustering results. Larger average Sil values indicate better clustering.

$$\begin{aligned} Sil(i)=\frac{b(i)-a(i)}{\max \left\{ a(i),b(i)\right\} } \end{aligned}$$
(3)

where a(i) is the mean distance between point i and the other points in the same cluster, and \(b(i)=\min \left\{ b_{i1},b_{i2},...,b_{ik} \right\} \), where \(b_{ij}\) is the mean distance from point i to the points of cluster j (for clusters other than its own). Sil(i) ranges from \(-1\) to 1.
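Putting the two indicators together, the cluster-number search of this subsection can be sketched with scikit-learn as below; the repeated K-means runs mirror the averaging used in Sect. 4, and AffinityPropagation's default preference is the median similarity, matching \(p = s_m\). The function name and run counts are illustrative.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.metrics import silhouette_score

def scan_cluster_number(X, damping=0.5, n_runs=50, seed=0):
    """Bound K with AP, then score K = 2..K_max by averaged SSE and Sil."""
    ap = AffinityPropagation(damping=damping, random_state=seed).fit(X)
    k_max = len(ap.cluster_centers_indices_)      # upper limit for K

    rng = np.random.RandomState(seed)
    results = {}
    for k in range(2, k_max + 1):
        sse, sil = [], []
        for _ in range(n_runs):
            km = KMeans(n_clusters=k, n_init=1,
                        random_state=rng.randint(1 << 30)).fit(X)
            sse.append(km.inertia_)                        # Eq. (2)
            sil.append(silhouette_score(X, km.labels_))    # average of Eq. (3)
        results[k] = (np.mean(sse), np.mean(sil))
    return results
```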

3.3 Cluster

The GKA [3] overcomes the tendency of K-means to fall into local optima and enlarges the search scope of the GA. However, it still suffers from premature convergence. AP-GKAs uses an adaptive mutation rate to address this problem. The steps are:

Coding: Real-number encoding based on the cluster centers.

Initialization: The initial population is generated randomly.

Selection: Elitist strategy combined with roulette-wheel selection. The fitness function is:

$$\begin{aligned} F(s_i)=\frac{Between(s_i)}{1+SSE(s_i)} \end{aligned}$$
(4)

where \(Between(s_i)\) is the average-linkage distance between the clusters encoded by \(s_i\), and \(SSE(s_i)\) is given in Eq. (2).
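A sketch of the fitness computation in Eq. (4) follows; interpreting the average-linkage "Between" term as the mean pairwise distance between points of two different clusters, averaged over all cluster pairs, is an assumption of this sketch.

```python
import numpy as np
from itertools import combinations

def fitness(X, labels, centers):
    """Fitness of Eq. (4): between-cluster separation over (1 + SSE)."""
    # SSE of Eq. (2): squared distances of points to their cluster centers
    sse = sum(np.sum((X[labels == i] - c) ** 2) for i, c in enumerate(centers))

    # Between: average-linkage distance, averaged over all cluster pairs
    pair_dists = []
    for i, j in combinations(range(len(centers)), 2):
        Xi, Xj = X[labels == i], X[labels == j]
        if len(Xi) == 0 or len(Xj) == 0:
            continue
        d = np.linalg.norm(Xi[:, None, :] - Xj[None, :, :], axis=2)
        pair_dists.append(d.mean())
    return np.mean(pair_dists) / (1.0 + sse)
```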

Mutation: Replace the original gene with a random value.

It has been shown that the standard genetic algorithm may fail to converge to the global optimum [9]. Therefore, AP-GKAs uses an adaptive mutation rate to avoid premature convergence.

$$\begin{aligned} P_m=\left\{ \begin{array}{ccl} p_{m1} &{} &{} f\le f_{avg}\\ p_{m1}-\frac{(p_{m1}-p_{m2})(f-f_{avg})}{f_{max}-f_{avg}} &{} &{} f>f_{avg} \end{array} \right. \end{aligned}$$
(5)

where \(p_{m1}\) and \(p_{m2}\) are the upper and lower mutation rates; f is the fitness value of the individual to be mutated; \(f_{avg}\) and \(f_{max}\) are the mean and maximum fitness values in the population. The mutation probability is larger (smaller) for individuals with smaller (larger) fitness, which helps AP-GKAs maintain population diversity while ensuring convergence.
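In code, the adaptive rate of Eq. (5) with the parameter values used in Sect. 4 reads as below; the guard for \(f_{max} = f_{avg}\) is an addition of this sketch.

```python
def adaptive_mutation_rate(f, f_avg, f_max, p_m1=0.01, p_m2=0.001):
    """Adaptive mutation probability of Eq. (5): below-average individuals
    mutate at the larger rate p_m1; above-average individuals mutate at a
    rate that decreases linearly towards p_m2 as fitness approaches f_max."""
    if f <= f_avg or f_max <= f_avg:
        return p_m1
    return p_m1 - (p_m1 - p_m2) * (f - f_avg) / (f_max - f_avg)
```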

K-means operation: (1) Take the best individual as the cluster centers and reassign each object to its nearest center; (2) compute the new cluster centers and use them to replace the worst individual, as sketched below.
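The sketch of this one-step K-means operation on the best individual: encoding an individual as an array of cluster centers follows the real-number coding above, and the empty-cluster fallback is an assumption.

```python
import numpy as np

def kmeans_step(X, best_centers):
    """(1) Assign each object to the nearest center of the best individual;
    (2) recompute the centers, which then replace the worst individual."""
    d = np.linalg.norm(X[:, None, :] - best_centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    return np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                     else best_centers[i]
                     for i in range(len(best_centers))])
```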

4 Experiments

The AP-GKAs model is suited to customer transaction data that mainly contains three attributes: transaction ID, transaction time and transaction amount. The experiments are conducted on two data sets. The first is an online retail data set from the UK with 8 attributes (ORDS) [10]; it contains 4371 customers with 541909 transaction records. The second is card transaction data with 11 attributes from a bank in China (CBDS); it contains 4500 bank customers with over one million transactions spanning five years.

In both experiments, the preference is \(p = s_m\) and the damping factor is \(\gamma = 0.5\) in AP. In AP-GKAs, the population size m is 20, the mutation probabilities are \(p_{m1}= 0.01\) and \(p_{m2} = 0.001\), and the maximum number of iterations is \(T = 100\). In SOM-GKA, SOM determines the initial cluster centers and the learning rate is 0.25. GKA uses the same parameters except for a fixed mutation rate \(p_m = 0.01\).

Fig. 1. SSE and silhouette coefficient values for different K on ORDS.

4.1 Results Based on ORDS

The maximum number of clusters obtained through AP is 35, which greatly reduces the computation compared with \(\sqrt{n}=66\). For each cluster number from 2 to 35, we run K-means 50 times and average the SSE and Sil values. In theory, clustering quality increases with Sil and decreases with SSE. Figure 1 shows that the SSE has converged to a minimum while the Sil value is large at \(K = 15\); hence the number of clusters is set to 15.

As can be seen from Table 2, AP-GKAs achieves a lower SSE and higher Sil, Between and F than the other algorithms. K-means has the lowest accuracy, and GKA [3] improves considerably on K-means. SOM-GKA is better than GKA because its initial cluster centers are determined by SOM [2]. Compared with GKA, AP-GKAs shows improvements of 9.4% in Sil and 2.5% in F.

Table 2. AP-GKAs results compared with other algorithms on ORDS.

4.2 Results Based on CBDS

The maximum number of clusters obtained through AP is 14. As shown in Fig. 2, the SSE value has converged to a minimum while the Sil value is large at \(K = 9\).

Fig. 2. SSE and silhouette coefficient values for different K on CBDS.

Table 3 shows improvements of 7.36% in Sil and 3.7% in F compared with the GKA algorithm. Moreover, AP-GKAs converges faster than GKA. Its convergence is somewhat slower than that of K-means, but its clustering quality is much better.

Table 3. AP-GKAs results compared with other algorithms on CBDS.

5 Conclusions

In this paper we have shown how to process customer transaction data end to end for customer segmentation. We used the multi-indicator RFM model to describe customer behaviour closely, because customer transactions are a complex phenomenon. Without expert knowledge, it is always difficult to determine the number of clusters for customer segmentation; to overcome this problem, we introduced the AP algorithm to obtain the range of cluster numbers. Moreover, we presented GKAs with an adaptive mutation rate and a one-step K-means operation to increase clustering accuracy. We validated our model on a standard data set and on real transaction data: AP-GKAs has the fastest convergence rate and higher accuracy than the other three algorithms.