Abstract

The k-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied for a long time by researchers in numerous fields. However, the value of the clustering number k in the k-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved k-means clustering algorithm called the covering k-means algorithm (C-k-means). The C-k-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of k-means. The first phase executes the CA. The CA self-organizes and recognizes the number of clusters based on the similarities in the data, and it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-k-means algorithm combines the advantages of the CA and k-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-k-means algorithm. This algorithm can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the C-k-means algorithm outperforms the existing algorithms in both accuracy and efficiency under both sequential and parallel conditions.

1. Introduction

The development of big data technologies, cloud computing, and the proliferation of data sources (social networks, Internet of Things, e-commerce, mobile apps, biological sequence databases, etc.) enables machines to handle more input data than human beings could. Due to this dramatic increase in data, business organizations and researchers have become aware of the tremendous value the data contain. Researchers in the field of information technology have also recognized the enormous challenges these data bring. New technologies to handle these data, called big data, are required. Therefore, it is vital for researchers to choose suitable approaches to deal with big data and obtain valuable information from them. Recognizing valuable information in data requires ideas from machine learning algorithms. Thus, big data analysis must combine the techniques of data mining with those of machine learning. Clustering is one such method that is used in both fields. Clustering is a classic data mining method, and its goal is to divide a dataset into multiple classes so as to maximize the similarity of the data points within each class and minimize the similarity between classes. The cluster analysis method has been widely used in many fields of science and technology, such as modern statistics, bioinformatics, and social media analytics [15]. For example, clustering algorithms can be applied to social events to analyze big data and determine people’s opinions, such as predicting the winner of an election.

Based on the characteristics of different fields, researchers have proposed a variety of clustering types, which can be divided into several general categories, including hierarchical clustering, density-based clustering, graph theory-based clustering, grid-based clustering, model-based clustering, and partitional clustering [1]. Each clustering type has its own style and optimization approaches. We focus on partitional clustering algorithms. The most popular algorithm is k-means [2, 3, 6, 7], which is one of the top ten clustering algorithms in data mining. The advantages of the k-means algorithm are that it is easy to implement and understand, whereas its disadvantages are that the number of clusters cannot be easily determined and the selection of the initial centers is easily disturbed by outliers, which has a significant impact on the final results [6]. Due to the simple iteration of the k-means algorithm, it has good scalability when dealing with big data and is easy to implement in parallel [8–10]. Researchers have proposed improved k-means algorithms to address these drawbacks, and most of the improvements optimize the selection of the initial k-means centers [11–13]. Good initial centers can significantly affect the performance of the Lloyd iterations in terms of quality and convergence and eventually help the k-means algorithm obtain nearly optimal clustering results.

However, k-means and its improved algorithms still need to ascertain the number of clusters in advance and then determine the best data partitioning based on this parameter, and the obtained results do not always represent the best data partitioning. To address these problems, this paper proposes a k-means clustering algorithm that is combined with an improved covering algorithm, which is called the C-k-means algorithm. Our improved covering-based initialization algorithm has a “blind” feature. Without determining the number of clusters in advance, the algorithm can automatically identify the number of clusters based on the characteristics of the data and is independent of the initial centers. The C-k-means algorithm combines the advantages of the CA and k-means algorithms; it has both the “blind” characteristic of the CA and the fast, efficient, and accurate clustering of high-dimensional data provided by the k-means algorithm. Moreover, the CA is easy to implement in parallel and has good scalability. We implemented the parallel C-k-means clustering algorithm and the baseline algorithms in the Spark environment. The experimental results show that the proposed algorithm is suitable for solving the problems of large-scale and high-dimensional data clustering.

In particular, the major contributions of this paper are as follows: (1) We propose a covering-based initialization algorithm based on the quotient space theory with a “blind” feature. The initialization algorithm requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. The CA determines the appropriate number of clusters k and the corresponding initial centers quickly and adaptively. (2) The number of Lloyd iterations required for convergence of the C-k-means clustering algorithm is much smaller than that of the baseline algorithms. (3) The parallel implementation of C-k-means is much faster than the parallel baseline algorithms. (4) Extensive experiments on real datasets show that the proposed C-k-means algorithm outperforms existing algorithms in both accuracy and efficiency under sequential and parallel conditions.

The remainder of this paper is organized as follows. Section 2 provides an overview of the related work. Section 3 introduces the baseline algorithms and details the C-k-means algorithm under both sequential and parallel conditions. Section 4 presents the experimental results and analysis, and Section 5 concludes the paper and identifies future work.

2. Related Work

As a classic clustering algorithm, the k-means algorithm is widely used in the fields of databases and data anomaly detection. Ordonez [14] implemented efficient k-means clustering algorithms on top of a relational database management system (DBMS) using efficient SQL. They also implemented an efficient disk-based k-means application that takes into account the requirements of a relational DBMS [15]. Efficient parallel clustering algorithms and implementation techniques are key to meeting the scalability and performance requirements of scientific data analysis. Therefore, other researchers have proposed parallel implementations and applications of the k-means algorithm. Dhillon and Modha [16] proposed a parallel k-means clustering algorithm based on a message passing model, which exploits the inherent data parallelism of the k-means algorithm; as the amount of data increases, the speedup and extendibility of the algorithm improve. Zhao et al. [8] implemented a k-means clustering algorithm based on MapReduce, which significantly improved the efficiency of the k-means algorithm. Jiang et al. [17] proposed a two-stage clustering algorithm to detect outliers. In the first stage, the algorithm uses improved k-means to cluster the data; in the second stage, it searches the clustering results of the first stage to identify the final outliers. Malkomes et al. [18] used the k-center clustering variant to handle noisy data, and the algorithms used are highly parallel. However, the selection of the initial center points of the k-means algorithm is easily disturbed by abnormal points, which has a significant impact on the final results, and efficient methods to address the influence of the initial centers on the k-means algorithm had not been proposed in these works.

Recently, scholars have focused on the issue that the selection of the initial centers of the k-means algorithm is easily disturbed by outlier points and have proposed several improved algorithms to help the k-means algorithm select the initial centers. The most classic improved algorithms are the k-means++ algorithm and the k-means|| algorithm. The k-means++ algorithm, which was proposed by Arthur and Vassilvitskii [12], helps the k-means algorithm obtain the initial centers prior to the Lloyd iteration. It randomly selects a data point as the first cluster center and then selects the remaining points of the initial center set one by one, where the probability of selecting each successive center point depends on the previously selected cluster centers. However, due to the inherent sequential execution characteristics of k-means++, obtaining the k cluster centers requires traversing the dataset k times, and the calculation of the current cluster center depends on all of the previously obtained cluster centers, which makes the k-means++ initialization algorithm difficult to implement in parallel. Inspired by the k-means++ algorithm, Bahmani et al. [13] proposed the k-means|| algorithm to improve the performance of the parallelization and initialization phases. The k-means|| initialization algorithm introduces an oversampling factor, obtains a set of initial centers that is much larger than k after a constant number of iterations, and assigns weights to the center points. It then reclusters these weighted center points using a known clustering algorithm to obtain the final initial center set containing k points. The k-means|| initialization has the advantages of the k-means++ algorithm and also addresses the drawback that k-means++ is difficult to scale. In follow-up research, researchers have proposed further improved k-means algorithms, and most are compared to these two classic improved algorithms. Cui et al. [10] proposed a new method of optimizing k-means based on MapReduce to process large-scale data, which eliminated the iterative dependence and reduced the computational complexity. Wei [19] improved the k-means++ algorithm by selecting the cluster centers using the sampling method of the k-means++ algorithm and then producing centers whose expected clustering cost is within an approximately constant factor of the best clustering result. Newling and Fleuret [20] used CLARANS to help k-means solve the problem of selecting initial centers.

However, the number of clusters k in the k-means algorithm and its variations must be known in advance, and the best data division is then defined based on this parameter. The data division defined in this way is actually based on an imaginary model; it is not necessarily the best division of the data. In addition, the final clustering result is based on clustering under a hypothetical parameter without considering the actual structural relationships in the data.

In response to the problems described above, this paper presents a novel clustering algorithm called C-k-means that has both the “blind” feature of the CA and the fast, efficient clustering advantage of the k-means algorithm. It can be applied to high-dimensional data clustering with strong scalability. We implement the parallelized C-k-means algorithm on the Spark cloud platform. Extensive experimental results show that the C-k-means clustering algorithm is more accurate and efficient than the baseline algorithms.

3. The Algorithms

In this section, we first introduce the k-means, k-means++, and k-means|| clustering algorithms. The motivation for using the CA as the initialization algorithm of the C-k-means clustering algorithm is then introduced, and the reason that the CA initialization can obtain clustering results that are approximately optimal is explained. Finally, we describe the parallel implementation of the C-k-means algorithm. Before addressing these points, we summarize the notations used throughout this paper in Table 1.

3.1. State-of-the-Art Algorithms
3.1.1. k-Means

The k-means algorithm is one of the most classic clustering algorithms; its simplicity and speed have led to its wide use. The description of the k-means algorithm is shown in Algorithm 1. First, we randomly select k data points from the original dataset as the initial cluster centers, and we then calculate the distance between each data point in the dataset and each of the initial centers. Each data point can independently determine which center is closest to it and is assigned to the cluster of that center. Then, the center of each cluster is updated, and each data point is repeatedly reassigned to the cluster of the nearest center until the difference between the new set of cluster centers and the former set of cluster centers is no larger than a threshold. This local search is called the Lloyd iteration. The simple iteration of the k-means algorithm gives it good flexibility, and it can work effectively even with today’s big data. Algorithm 1 presents the pseudocode for the k-means algorithm [6, 12, 13].

Input: dataset X, number of clusters k, threshold θ
Output: A set of clusters C1, C2, …, Ck
Begin
1: C ← sample k points uniformly at random from dataset X
2: C_new ← C, C_old ← ∅
3:   while ∣C_new − C_old∣ > θ do:
4:   C_old ← C_new
5:   calculate all of the distances between xi and C_oldj:
         get_distance(xi, C_oldj), xi ∈ X, C_oldj ∈ C_old
6:   assign xi to the nearest C_oldj
7:   calculate new centroids C_new:
          C_newj ← the mean of the points assigned to C_oldj
8:   end while
End
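
As a concrete illustration of Algorithm 1, the following is a minimal NumPy sketch of random seeding followed by Lloyd iterations. It is an illustrative reimplementation rather than the authors' code; the threshold theta and the helper names are assumptions.

import numpy as np

def kmeans(X, k, theta=1e-4, max_iter=100, seed=None):
    # Line 1: choose k initial centers uniformly at random
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Lines 5-6: assign every point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Line 7: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Line 3: stop once the centers move by no more than theta
        if np.linalg.norm(new_centers - centers) <= theta:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
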
3.1.2. k-Means++

Because the selection of the initial centers has a significant influence on the k-means clustering results, the k-means algorithm can only find a locally optimal solution. To approach the globally optimal solution, it may be necessary to select the initial centers several times and keep the best of the resulting clusterings.

To overcome the disadvantages of k-means, researchers have proposed improved methods to help k-means find suitable initialization centers. k-means++, which was proposed by Arthur and Vassilvitskii [12], is a typical representative algorithm (shown in Algorithm 2). The main idea of this algorithm is to select the initial centers one by one in a controlled way, where the calculation of the current cluster center depends on all of the previously obtained cluster centers. Intuitively, the initialization algorithm selects relatively decentralized initial center points for k-means clustering, and the k-means++ initialization algorithm prioritizes the data points away from the previously selected centers when selecting a new clustering center. However, from the scalability point of view, the main disadvantage of k-means++ initialization is its inherent sequential execution. The acquisition of the k centers must traverse the entire dataset k times, and the calculation of the current cluster center relies on all of the previously obtained clustering centers, which makes the algorithm not scalable in parallel and therefore greatly limits its application to large-scale datasets. Algorithm 2 presents the pseudocode for the k-means++ algorithm [12].

Input: dataset X, number of clusters k
Output: Initial center set C
Begin
1: C ← sample a point uniformly at random from dataset X
2:  while ∣C∣ < k do:
3:     sample x ∈ X with probability d(x, C)² / Σ_{x′∈X} d(x′, C)²
4:     C ← C ∪ {x}
5:  end while
End
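
The D²-weighted seeding of Algorithm 2 can be sketched as follows; this is a minimal illustrative version (the sampled seeds would then be passed to the Lloyd iterations of Algorithm 1), not the authors' implementation.

import numpy as np

def kmeanspp_init(X, k, seed=None):
    rng = np.random.default_rng(seed)
    # Line 1: the first center is drawn uniformly at random
    centers = [X[rng.integers(len(X))]]
    # d2[i] holds the squared distance from point i to its nearest chosen center
    d2 = np.linalg.norm(X - centers[0], axis=1) ** 2
    while len(centers) < k:                       # Line 2
        probs = d2 / d2.sum()                     # Line 3: D^2-weighted probabilities
        idx = rng.choice(len(X), p=probs)
        centers.append(X[idx])                    # Line 4: add the sampled point
        d2 = np.minimum(d2, np.linalg.norm(X - X[idx], axis=1) ** 2)
    return np.array(centers)
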
3.1.3. k-Means||

Based on the advantages and disadvantages of the two initialization algorithms described above, researchers proposed a new initialization algorithm called k-means|| [13] (see Algorithm 3 for details). The main idea of this algorithm is to change the sampling strategy during each traversal and to introduce an oversampling factor ℓ. In each pass, points are sampled in a nonuniform way, and the sampling process is repeated for approximately log ψ iterations, where ψ is the clustering cost of the initially selected center. We can then obtain the intermediate centers with this repeated sampling. The number of intermediate centers is larger than k and much smaller than the original data size. Line 7 of Algorithm 3 shows that the points in the intermediate center set are assigned weights, and these weighted center points are then reclustered in line 8 into k clusters to obtain the final k centers. Finally, these k points are fed into the Lloyd iteration as the initial centers. Algorithm 3 presents the pseudocode for the k-means|| algorithm [13].

Input: dataset X, number of clusters k, oversampling factor ℓ
Output: Initial center set C
Begin
1: C ← sample a point uniformly at random from dataset X
2: ψ ← initial clustering cost of C on X
3: for O(log ψ) times do:
4:      C′ ← sample each point x ∈ X independently with probability p_x = ℓ · d(x, C)² / Σ_{x′∈X} d(x′, C)²
5:   C ← C ∪ C′
6: end for
7: For x ∈ C, set w_x as the number of points in X that are closer to x than to any other
 point in C
8: Recluster the weighted points in C into k clusters
End
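
The oversampling strategy of Algorithm 3 can be sketched as follows. This is a simplified illustration, not the reference implementation: the fixed number of rounds stands in for the O(log ψ) passes, and the final weighted reclustering is reduced to a weighted sampling placeholder (in practice one would run a weighted k-means++ on the intermediate centers).

import numpy as np

def kmeans_parallel_init(X, k, oversampling=2.0, rounds=5, seed=None):
    rng = np.random.default_rng(seed)
    C = X[rng.integers(len(X))][None, :]                     # Line 1: one uniform seed
    for _ in range(rounds):                                  # Line 3
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = np.minimum(1.0, oversampling * k * d2 / d2.sum())   # Line 4
        C = np.vstack([C, X[rng.random(len(X)) < probs]])    # Line 5: add the sampled batch
    # Line 7: weight each intermediate center by the size of its Voronoi cell
    nearest = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    weights = np.bincount(nearest, minlength=len(C)).astype(float)
    # Line 8 (placeholder): recluster the weighted centers into k clusters
    keep = rng.choice(len(C), size=k, replace=False, p=weights / weights.sum())
    return C[keep]
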
3.2. Intuition behind the Proposed Algorithm

The traditional k-means random initialization method requires only one pass and selects k centers uniformly at random. The k-means++ initialization method improves on random selection by choosing the initial centers in a nonuniform way, but it requires k iterations, and only one data point is selected in each iteration to join the set of center points. Moreover, the selection of the current center point depends on the previously selected centers. This constantly updated nonuniform selection increases the accuracy of k-means++ over random initialization, but it makes the k-means++ algorithm difficult to scale to a big dataset. Therefore, researchers proposed the k-means|| algorithm to overcome the shortcomings of random initialization and k-means++ initialization and to choose the initial centers in a nonuniform manner with fewer iterations. However, both the k-means algorithm and its variant algorithms require the clustering parameter k to be given in advance and must define the best data partitioning for this parameter. The resulting division of the data is actually based on a hypothetical value of k and may not be the best division of the data, so the actual accuracy of the clustering results cannot be guaranteed.

Based on the geometric meaning of neural networks and the M-P neuron model, the covering algorithm was proposed by Zhang and Zhang [21]. It obtains a rule based on field covering and does not require the number of clusters or the initial centroids to be prespecified. However, in the traditional covering algorithm, some of the resulting clusters may contain too many data points, which leads to unreasonable clustering results. Therefore, based on the quotient space theory, we propose an improved covering algorithm called the CA. The concept of granularity was first proposed by Zadeh in the 1970s [22], and Zhang and Zhang proposed the theory of quotient space [23]. This theory provides a reasonable formal model for mankind’s ability to analyze and synthesize problems on a macroscopic and granular scale. Different granularities describe information at different levels. When the granularity is too fine, every data point forms its own group and the inner knowledge cannot be mined. When the granularity is too coarse, all of the data are aggregated into one cluster, so some properties of the problem are obscured. Granularity is therefore introduced to accomplish the covering clustering task scientifically and obtain the optimal clustering results.

The CA requires neither the number of clusters to be prespecified nor the initial centers to be manually selected, and it automatically finds a set of fields (coverings) that separate samples with low similarity and merge samples with high similarity. The centers of these fields constitute the initial clustering centers. Therefore, the CA has the beneficial feature of being “blind”: without knowing the number of clusters a priori, the CA can automatically identify the number of clusters based on the relationships in the data, has no dependence on initial clustering centers, and is computationally fast. The CA also has good scalability; it is easy to implement in parallel, which makes it suitable for data processing in a big data environment. Therefore, this paper uses the improved CA as the k-means initialization algorithm to obtain the set of initial center points.

3.3. Overview of the C-k-Means Algorithm

In this section, we introduce the realization of the C-k-means clustering algorithm in detail. Figure 1 depicts the entire process of the C-k-means algorithm. The C-k-means algorithm is divided into two main phases: phase 1 and phase 2. Phase 1 performs the CA initialization, and phase 2 performs the Lloyd iterations. Next, we describe both phases in detail.

3.3.1. Phase 1: Overall Procedure of the CA

Algorithm 4 presents the pseudocode for the CA initialization. Below, we introduce the implementation process of the CA in detail. (1) Find the center of gravity of all of the samples that have not yet been clustered (covered), and then take the point closest to this center of gravity, denoted center c, as the initial center of the first cluster; this corresponds to get_center(Cu) in Algorithm 4. (2) Compute the distance between center c and each data point that has not been clustered and obtain the sum of all of these distances. Next, we set a weight on the data points. Finally, we use the distance sum and the weights to calculate the covering radius r, which corresponds to get_weight_radius(c, Cu) in Algorithm 4. (3) Repeatedly recompute the centroid of the current sphere according to the obtained center and radius and obtain a new covering until the set of data points in the covering no longer changes; we then fix this sphere (covering, or cluster), which corresponds to get_covering(c, r, Cu) and lines 10 to 15 in Algorithm 4. (4) Repeat steps (1), (2), and (3) until all of the data points have been covered, which corresponds to lines 3 to 16 in Algorithm 4.

During the data clustering process, we can also automatically adjust the inner-class and interclass relationships based on the actual demand or on the relationships between the data in the dataset. For coverings with fewer sample points, the single linkage method (using the Euclidean distance) of hierarchical clustering [24, 25] is adopted to merge them into an ellipsoidal domain, which means combining the most similar pair of clusters into a new cluster. The similarities between the new cluster and the other clusters are then updated, and the two most similar clusters are again merged. Based on the relationships between the data in the dataset or on the actual demand, we can decide whether to continue merging the clusters with fewer data points or to split the spheres with more data points. Finally, we can obtain a reasonable covering division in which all of the similar data points are distributed in one area (spherical or ellipsoidal), which corresponds to lines 17 and 18 in Algorithm 4.

Input: dataset X
Output: Results of parallel covering with granularity analysis – a set of clusters
Begin
1: center c ← ∅
2: Set Cu ← X (the uncovered points)
3:  do
4:   center c ← get_center(Cu)
5:   radius r ← get_weight_radius(c, Cu)
6:   Covering Cform = get_covering(c, r, Cu)
7:   c ← get_centroid(Cform)
8:   r ← get_weight_radius(c, Cu)
9:     Covering Clast = get_covering(c, r, Cu)
10:    while ∣Clast.subtractByKey(Cform)∣ > 0
11:      Cform ← Clast
12:      c ← get_centroid(Cform)
13:      r ← get_radius_centroid(c, Cu)
14:      Clast = get_covering(c, r, Cu)
15:    end while
16: while (Cu ≠ ∅)
17:   Do Split Operation
18:   Do Merge Operation
19: return the set of clusters
End
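A simplified sequential sketch of the covering loop of Algorithm 4 (without the split and merge operations of lines 17 and 18) is given below. The radius rule, the mean distance from the current center to the uncovered points, follows the description above; the weighting scheme and other details are simplified assumptions.

import numpy as np

def ca_covering(X, max_refine=50):
    uncovered = np.ones(len(X), dtype=bool)
    centers, clusters = [], []
    while uncovered.any():                       # lines 3-16: repeat until every point is covered
        U = X[uncovered]
        # line 4: the uncovered point closest to the center of gravity becomes the center
        c = U[np.linalg.norm(U - U.mean(axis=0), axis=1).argmin()]
        # lines 5-6: radius = mean distance to the uncovered points, initial covering
        d = np.linalg.norm(U - c, axis=1)
        members = d <= d.mean()
        for _ in range(max_refine):              # lines 10-15: refine until the covering is stable
            c = U[members].mean(axis=0)          # recenter on the covering's centroid
            d = np.linalg.norm(U - c, axis=1)
            new_members = d <= d.mean()
            if np.array_equal(new_members, members):
                break
            members = new_members
        centers.append(c)
        clusters.append(U[members])
        idx = np.where(uncovered)[0]
        uncovered[idx[members]] = False          # mark the covered points
    return np.array(centers), clusters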

Figure 2 presents an illustrative example to intuitively demonstrate the clustering process of Algorithm 4. To cluster the data points, Algorithm 4 goes through five iterations to identify five clusters (coverings, or fields). We then compute the inner-class and interclass relationships and find that two of the clusters are very similar. Therefore, the sixth iteration merges them into one cluster and then updates the similarities between the clusters, where each covering is described by its center and its radius.

When we study a dataset, we can divide it in different ways. Each division is a quotient space of a different granularity, and we can observe and analyze the dataset at these different granularities. Based on the different granularities of observation and analysis, we can solve the problem in different granular worlds and can jump quickly from one granular world to another. This ability to handle different worlds of granularity is a powerful manifestation of human problem solving [26]. When we study the problem of reasonably clustering a dataset, we can place the problem in quotient spaces of different granularities for analysis and then obtain the solution to the clustering problem synthetically. In quotient spaces of different granularity, we can observe different properties of the dataset and then find the properties of interest to the user, which can be maintained in different granular worlds or preserved to a certain extent. However, not every arbitrary division can achieve this goal. Therefore, the division of the dataset and the choice of granularity must be studied; that is, we need to select an appropriate division of the dataset. Based on the above, we propose the split-operation and merge-operation mechanisms in the C-k-means algorithm to help determine the appropriate partitioning and granularity for a dataset. The C-k-means algorithm automatically adjusts the number of clusters during the iteration by merging similar clusters and splitting clusters with larger standard deviations. Finally, after a small constant number of iterations, C-k-means finds the appropriate number of clusters and initial centers for the dataset, and it then feeds the clustering centers into the Lloyd iteration to complete the final clustering process and determine a reasonable quotient space for the original dataset.

Adjustment Mechanism 1: Split Operation. First, for every cluster, we calculate the vector of the standard deviations of all of the samples in the cluster with respect to the center of the cluster: σi = (σi1, …, σid), i = 1, …, K, where K is the number of existing clusters and d is the dimension of the samples. We then find the maximum component σimax of the standard deviation vector of each cluster and determine a threshold value σs. For cluster Cu, we consider the following conditions: (1) the maximum component-wise standard deviation in the cluster exceeds the threshold, that is, σumax > σs; (2) the average distance between the samples in the cluster and its center is greater than the overall average distance, that is, the average inner-class distance of the cluster (the average distance from the samples to the centroid of the cluster) exceeds the overall average distance (the average distance of each sample to its inner-class center over the whole dataset); (3) the number of samples in the cluster is greater than a threshold that depends on the minimum number of samples allowed in each cluster (if a cluster contains fewer samples than this minimum, it cannot form a cluster on its own); and (4) the number of clusters is greater than the prescribed threshold. If all of these conditions are satisfied, then we split cluster Cu into two clusters with centers Cu+ and Cu− and delete the original cluster. The current number of clusters will increase by 1. The centers Cu+ and Cu− are obtained from the original center of Cu by adding and subtracting, respectively, an increment to the component corresponding to σumax, while the other components remain unchanged.

Adjustment Mechanism 2: Merge Operation. We sort the clusters that have been formed by the number of points they contain. For the clusters with fewer points, we calculate the similarity between each of them and every other cluster. We then sort the obtained similarity values and, according to the desired final number of clusters, merge the two clusters with the largest similarity value and update the center of the merged cluster. The current number of clusters will decrease by 1.
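The two adjustment mechanisms can be sketched as follows. The split threshold sigma_s, the minimum cluster size, and the use of the distance between cluster centers as the (inverse) similarity are illustrative assumptions rather than the paper's exact criteria.

import numpy as np

def try_split(points, center, sigma_s, min_size):
    # Mechanism 1: split a cluster along the dimension with the largest standard deviation
    std = points.std(axis=0)
    j = std.argmax()
    if std[j] > sigma_s and len(points) > 2 * min_size:
        delta = np.zeros_like(center)
        delta[j] = std[j]                        # perturb only the max-std component
        return center + delta, center - delta   # the two new centers Cu+ and Cu-
    return None

def merge_most_similar(centers, sizes):
    # Mechanism 2: merge the two closest (most similar) clusters and update the center
    best, pair = np.inf, None
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            d = np.linalg.norm(centers[i] - centers[j])
            if d < best:
                best, pair = d, (i, j)
    i, j = pair
    merged = (sizes[i] * centers[i] + sizes[j] * centers[j]) / (sizes[i] + sizes[j])
    new_centers = [c for t, c in enumerate(centers) if t not in pair] + [merged]
    new_sizes = [s for t, s in enumerate(sizes) if t not in pair] + [sizes[i] + sizes[j]]
    return new_centers, new_sizes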

3.3.2. Phase 2: Overall Procedure of Lloyd’s Iterations

Phase 1 determines the suitable value of k and the specific initial centers by performing the CA initialization. In phase 2, we assign each data point in the dataset to the cluster whose center is closest to it, according to the cluster centers obtained in phase 1. We then update the cluster centers and repeat until the convergence condition is satisfied. Once every data point is assigned to the cluster with the closest center, the Lloyd iteration of the k-means clustering algorithm is complete, and clustering results near the optimal clustering solution are obtained, which completes the proposed C-k-means algorithm. Our CA initialization and the final C-k-means algorithm can be easily parallelized, so the clustering operations can be completed rapidly.

3.4. Computational Complexity Analysis

This section discusses the computational complexity of the C-k-means algorithm in its two phases. First, we analyze the computational complexity of the covering phase of C-k-means (i.e., the CA initialization). In Algorithm 4, the computational complexity of line 4 is O(n) because dataset X contains a maximum of n points. Similarly, the computational complexities of lines 5 and 6 are also O(n), and those of lines 7, 8, and 9 are also O(n) because the number of clusters is smaller than n. Lines 10–15 are repeated until the data points in the cluster do not change. Lines 3–15 must also be repeated until all of the data points in X are covered, and the number of repetitions num_C is much smaller than n. The radius of a cluster is the average distance between the center of the cluster and all of the data points that are not covered by any cluster; on average, each newly created cluster covers half of the uncovered data points, so the total work of the covering loop remains on the order of O(n). In line 17, the computational complexity is bounded by the number of clusters because there is a maximum of num_C clusters after the initial covering process; similarly, the computational complexity of line 18 is bounded by the number of clusters. The number of clusters num_C is much smaller than n. Thus, the computational complexity of Algorithm 4 is O(n). We then consider the computational complexity of the second phase, the Lloyd iterations. The second phase performs num_iter iterations until the cluster centers do not change, so its computational complexity is O(k · n · num_iter). The number of clusters k and the number of iterations num_iter are much smaller than n. Therefore, the computational complexity of the C-k-means algorithm is linear in n.

3.5. A Parallel Implementation

In this section, we discuss the proposed CA initialization and the parallel implementation of the C-k-means algorithm on Spark.

Spark is the de facto distributed computing platform for large data processing and is particularly suitable for iterative calculations. A main component of Spark is the resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across multiple machines that can be rebuilt if a partition is lost. Users can explicitly cache an RDD in memory across multiple machines and reuse it in multiple parallel operations. The RDD is the main reason that Spark is able to process big data efficiently. Due to its in-memory computing, data locality, and transport optimization, Spark is particularly suitable for performing recursive operations on big data [27]. However, not all large-scale data can be efficiently processed simply via parallel implementation. Partitional clustering algorithms may require an exponential number of iterations in the worst case [28], and the accumulated job creation time and the time for large-scale data shuffling are difficult to accept, especially for large amounts of data, so mere parallelism is not sufficient. High performance can be reached only by eliminating the partitional clustering algorithm’s dependence on iteration.

The parallel implementation principle of the C-k-means clustering algorithm in Spark is illustrated in Figure 3. As demonstrated, C-k-means consists of three main stages. Stage 1 performs the parallel CA on Spark, and stage 2 analyzes the results of the initial covering clustering obtained from stage 1 and splits or merges the clustering results through self-organization to determine the number of clusters k and the specific initial center set. Together, stages 1 and 2 constitute the parallel CA initialization process. Stage 3 is the Lloyd iteration phase, in which the Lloyd iteration is conducted on the k initial centers to obtain the optimal clustering results.

The covering algorithm implemented on Spark is illustrated in stages 1 and 2 in Figure 3. The distributed files are read from the Hadoop Distributed File System (HDFS) and transformed into RDDs [29]. The parallel covering process in stage 1 consists of many covering steps, each of which comprises three operations that obtain the cluster center, the radius, and the cluster, respectively. Stage 1 describes the process for obtaining all of the clusters. We obtain the first cluster center through the reduce operation on Spark; this operation finds, in parallel, the data point that is nearest to the centroid of all of the data. Next, we obtain the radius of the cluster through the map and reduce operations on Spark. Specifically, an intermediate result is obtained through the map operation, which calculates the distance between the cluster center and each uncovered data point. The radius is then obtained through the reduce operation, which aggregates these distances in parallel. Finally, we obtain cluster_1 through the filter operation on Spark; the filter operation retains the data points whose distances to the center are smaller than the radius, where the radius and center are acquired as introduced above. The remaining clusters are obtained in a manner similar to the first cluster. These processes are repeated until no more uncovered data points can be identified, which indicates that all of the data points have been included in these clusters. This is the end of the covering process, which also indicates that stage 1 is complete. After the covering process, C-k-means performs the split and merge operations in stage 2 to obtain the final initialization centers. Through the CA initialization process, the initialization centers are adaptively obtained and then fed into Lloyd’s iteration in stage 3. As described earlier, Lloyd’s iteration can also be easily parallelized on Spark. Therefore, an efficient CA initialization and C-k-means algorithm can be implemented on the Spark platform.
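
One covering step of stage 1 can be expressed with Spark's RDD primitives roughly as follows. This is a PySpark sketch with an illustrative HDFS path; it is not the authors' implementation and omits the weighting and the stage 2 split/merge adjustments.

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="CA-covering-step")

# Read the distributed file from HDFS and transform it into an RDD of vectors
points = (sc.textFile("hdfs:///data/points.csv")
            .map(lambda line: np.array([float(v) for v in line.split(",")]))
            .cache())
uncovered = points

# (1) center: the uncovered point nearest to the centroid, via reduce
total, count = uncovered.map(lambda p: (p, 1)).reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
centroid = total / count
center = uncovered.reduce(
    lambda a, b: a if np.linalg.norm(a - centroid) <= np.linalg.norm(b - centroid) else b)

# (2) radius: average distance from the center to the uncovered points, via map + reduce
radius = uncovered.map(lambda p: float(np.linalg.norm(p - center))).reduce(lambda a, b: a + b) / count

# (3) cluster: keep the points inside the covering sphere, via filter
cluster_1 = uncovered.filter(lambda p: np.linalg.norm(p - center) <= radius)
uncovered = uncovered.filter(lambda p: np.linalg.norm(p - center) > radius)
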

4. Experimental Results

This section presents a detailed analysis and comparison of the experimental results, including sequential and parallel versions of the algorithms, to confirm the merits of our C-k-means algorithm, which include the following: (1) the C-k-means algorithm can adaptively determine the number of clusters and obtain a set of cluster center points according to the similarities between the data, which then allows the C-k-means algorithm to obtain high-precision clustering results; (2) the C-k-means algorithm can obtain a clustering result that is near the optimal value, outperforms k-means in terms of cost, and is very similar to k-means++ and k-means||; and (3) compared with k-means++ and k-means||, the number of Lloyd iterations in the C-k-means algorithm is relatively small, so it converges quickly while accuracy and cost are ensured, meaning that the proposed C-k-means algorithm is accurate and efficient under parallel conditions.

In this paper, the C-k-means clustering algorithm and its counterparts are implemented both sequentially and in parallel. The sequential implementations are evaluated on a stand-alone computer with a 6-core 3.60 GHz processor and 20 GB of memory. All of the parallel algorithms are implemented on a cluster running Spark 1.6 with Hadoop 2.6. The cluster has 16 nodes, each of which has an 8-core 3.60 GHz processor and 20 GB of memory.

4.1. Datasets

We used 7 datasets in our experiments to evaluate the performance of the C-k-means algorithm. The summary statistics and information about these 7 datasets are shown in Table 2.

The question marks in Table 2 indicate that the number of clusters in the dataset is unknown.

Some of the datasets, such as Gauss, are synthetic, and the others are from real-world settings and are publicly available from the University of California Irvine (UCI) machine learning repository [30]. The Iris dataset [31–33] is a well-known database for clustering algorithm comparisons. It consists of three types of Iris plants (setosa, versicolor, and virginica) with 50 instances each, each of which was measured with four features. The Wine dataset [31–33] is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. It contains 178 instances measured with 13 continuous features. The Abalone dataset [34, 35] contains physical measurements of abalone shellfish. It contains 4177 instances with 9 features each (1 cluster label and 8 numeric features; we apply the 8 numeric features), which are divided into 29 clusters. The age of an abalone can be determined by cutting the shell through the cone, staining it, and counting the number of rings with a microscope; in practice, the other measurements are used to estimate the age. The SPAM dataset [13] consists of 4601 instances with 57 dimensions and represents features available to an e-mail spam detection system. The Cloud dataset [12] consists of 1024 instances in 10 dimensions and represents the 1st cloud cover database. The individual household electric power consumption dataset [10] contains 2,049,280 instances with 9 features, 7 of which are applied in this paper because the other 2 are related to time and are not applicable.

To effectively evaluate the experimental performance of the algorithms, we normalized the datasets, and all of the algorithms use the normalized datasets. When the values in one dimension of a dataset are too large, they reduce the discrimination of the other dimensions with lower values during the clustering process. We normalized the datasets with the operation x′ij = (xij − minj)/(maxj − minj), where xij and x′ij represent the jth dimension value of the ith data point in the dataset before and after normalization, respectively, and maxj and minj are the maximum and minimum values of the jth dimension over all data points in the dataset, respectively.
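
A minimal sketch of this min-max normalization (assuming the data are held in a NumPy array with one row per instance):

import numpy as np

def min_max_normalize(X):
    # Scale every column (dimension) of X to the interval [0, 1]
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # guard against constant columns
    return (X - col_min) / span
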

4.2. Baselines

In the remainder of this paper, we assume that both the k-means++ and k-means|| initialization algorithms are implicitly followed by the Lloyd iteration process. The proposed C-k-means clustering algorithm is compared with the baseline algorithms described below: (i) Traditional k-means algorithm (or k-means algorithm): this algorithm is based on random initialization; it randomly selects k sample points as the initial centers for Lloyd’s iteration and completes the final clustering process accordingly (see Algorithm 1) [6]. (ii) k-means++ algorithm: this method selects k centers as the initial centers for Lloyd’s iteration through k iterative steps. Based on the probability assigned to each sample point, each iteration selects 1 sample point from the dataset to join the center set, and the final clustering process is then completed (see Algorithm 2) [12]. (iii) k-means|| algorithm: this method selects the initial centers for Lloyd’s iteration through a constant number of passes. Based on the probability assigned to each sample point, each pass selects a batch of sample points from the dataset to join the center set. It then reclusters the intermediate center set to obtain the final set of k centers and feeds these initial center points into Lloyd’s iteration. The final clustering process is then completed (see Algorithm 3) [13].

4.3. Evaluation Metrics

The effectiveness of clustering is evaluated by numerous factors that determine the optimal number of clusters and the granularity of checking the clustering results. The evaluation of clustering results is often referred to as cluster validation, and researchers have proposed many measures of cluster validity. In this paper, we choose six standard validity measures to examine the soundness of the clustering algorithms: the Davies-Bouldin index (DBI) [10, 35, 36], the Dunn validity index (DVI) [36, 37], normalized mutual information (NMI) [38–40], the clustering cost function (cost), the Silhouette index (SI) [41, 42], and the SD index (SDI) [42]. These measures are described below.
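
For reference, commonly used forms of four of these measures are restated below; the exact variants adopted in the cited works may differ slightly, and the notation (K clusters with centers c_i, average inner-cluster distance S_i, clustering Ω versus true classes 𝒞) matches the descriptions that follow.

\[
\mathrm{DBI} = \frac{1}{K}\sum_{i=1}^{K}\max_{j\neq i}\frac{S_i+S_j}{d(c_i,c_j)},\qquad
\mathrm{DVI} = \frac{\min_{i\neq j}\,\min_{x\in C_i,\,y\in C_j} d(x,y)}{\max_{1\le l\le K}\,\max_{x,y\in C_l} d(x,y)},
\]
\[
\mathrm{NMI}(\Omega,\mathcal{C}) = \frac{I(\Omega;\mathcal{C})}{\sqrt{H(\Omega)\,H(\mathcal{C})}},\qquad
\mathrm{cost} = \sum_{x\in X}\min_{c_i\in C} d(x,c_i).
\]
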

In the DBI validation measure, K denotes the number of clusters, Si denotes the average distance within the ith cluster, and d(ci, cj) denotes the distance between the ith cluster and the jth cluster. In the DVI validation measure, K denotes the number of clusters and d denotes the distance between two data points. In the NMI validation measure, Ω and 𝒞 denote the obtained clusters and the true classes, respectively, I(Ω; 𝒞) is the mutual information between Ω and 𝒞, and H(Ω) and H(𝒞) are the Shannon entropies of Ω and 𝒞, respectively. The variables in the cost function are described in Table 1. In the SI validation measure, a(i) denotes the average distance from the ith object to all of the objects in the same cluster, and b(i) denotes the minimum average distance from the ith object to all of the objects in any other cluster. The SDI validation measure is based on the average scattering of the clusters, which involves the variance of each cluster and the variance of the whole dataset, and on the total separation between the clusters, which involves the maximum and minimum distances between cluster centers and a weighting factor determined by the maximum number of input clusters. DBI is a function of the ratio of the sum of the inner-cluster dispersions to the intercluster separation. The lower the DBI value is, the better the clustering performance will be, because the distances within the clusters are small while the distances among the clusters are large. DVI is a function of the ratio of the intercluster separation to the inner-cluster dispersion. The larger the DVI value is, the better the clustering performance will be, because the distances among the clusters are large and the distances within the clusters are small. NMI indicates the agreement between the actual class labels of the original data and the labels computed by the clustering algorithm; therefore, the NMI validation measure requires that the actual classes and the computed clusters be defined over the same set of elements. The NMI values lie in the interval [0, 1], and a larger value means that the two partitions are very similar and thus indicates a better clustering result. The value of the cost function is the sum of the distances from each data point to its nearest cluster center; therefore, the lower the cost function is, the better the clustering performance will be. The purpose of SI is to calculate the average dissimilarity between points in the same cluster and in different clusters to describe the structure of the data. The SI values lie in the interval [−1, 1], and a larger SI value indicates a more appropriate number of clusters for the dataset. The SDI is based on the average scattering of the clusters and the total separation between the clusters, and the minimum SDI value indicates the optimal cluster number.

4.4. Determination of an Optimal Value of k in C-k-Means

The CA self-organizes and recognizes the number of clusters based on the similarities in the data without prior knowledge. By executing the CA, we can initially obtain an approximate number of clusters. Next, we conduct the split-operation and merge-operation mechanisms (see Section 3.3) to help determine the appropriate partitioning and granularity for the dataset. To evaluate the resultant clusters and find the optimal number of clusters, properties such as the cluster density, size, shape, and separability are typically examined with cluster validation indices such as DBI, DVI, SI, and SDI. The clustering validity approach uses internal criteria to evaluate the results with respect to the features and quantities inherited from the data, to determine how close the objects within the clusters are and how far apart the clusters are.

We first perform the CA on the Iris and Wine datasets, for which the numbers of clusters are known (see Table 2). We initially obtain approximate numbers of clusters of 6 for the Iris dataset and 7 for the Wine dataset. We then conduct the split-operation and merge-operation mechanisms to obtain several candidate numbers of clusters close to 6 for the Iris dataset and close to 7 for the Wine dataset, respectively. The numbers of split operations are between 1 and 5 for both the Iris and Wine datasets. The numbers of merge operations need not be given in advance because they are determined by the numbers of clusters and split operations. To further evaluate the results, we choose the Cloud and Gauss datasets, for which the numbers of clusters are unknown (see Table 2), to execute the CA. We initially obtain approximate numbers of clusters of 7 for the Cloud dataset and 13 for the Gauss dataset, respectively. Similarly, we then conduct the split-operation and merge-operation mechanisms to obtain several candidate numbers of clusters close to 7 for the Cloud dataset and close to 13 for the Gauss dataset, respectively. The numbers of split operations are between 0 and 6 for both the Cloud and Gauss datasets.

Table 3 shows a comparative analysis of the Iris and Wine datasets using four validity measures. Because the numbers of clusters in these datasets are known, we can intuitively check whether the final number obtained by our CA is the one at which most of the clustering indexes attain their optimal values. Table 3 shows that 3 clusters are optimal on both datasets, which exactly matches the actual numbers of clusters in the datasets. We used the cluster numbers produced by the CA to check the performance of C-k-means on the Cloud and Gauss datasets and evaluated them with the four validation indices. As shown in Table 4, the optimal validation indicators for the Cloud dataset are obtained with 10 clusters, so the optimal cluster number is 10. For the Gauss dataset, each index shows that the optimal value is 13. The CA combined with the split-operation and merge-operation mechanisms thus self-organizes and recognizes a reasonable number of clusters based on the similarities in the data for any dataset.

4.5. Clustering Validation

Clustering validation is generally concerned with determining the optimal number of clusters and checking the suitability of the clustering results [10]. The evaluation of the clustering results is commonly referred to as cluster validation [10, 35, 43]. The accuracies of the baseline approaches and the C-k-means algorithm are measured in terms of three standard validity measures, namely DBI, DVI, and NMI, on datasets of different sizes. Other than the individual household dataset, the datasets are small enough to be evaluated on a single machine. We compare the accuracies of C-k-means and the baseline approaches on the Iris, Wine, and Abalone datasets because the numbers of clusters and the labels of the data are known for those datasets. The value of k is kept constant to effectively compare the C-k-means algorithm and the baseline algorithms. Using the split- and merge-operation mechanisms, the number of clusters of C-k-means is adjusted to be consistent with the number of clusters in the baseline algorithms. Table 5 shows a comparative analysis of the different approaches on the three datasets and the three validity measures. For the Iris and Wine datasets, the numbers of split operations are both 1, and for the Abalone dataset, the number of split operations is 8. To further verify the performance of the algorithms, we also choose the Gauss, SPAM, and Cloud datasets, whose class categories are unknown, for the experiments. To examine the soundness of our clusters, we compare the DBI and DVI values of the baseline approaches on these three unlabeled datasets to those of C-k-means for moderate values of k. For the Gauss dataset with different values of k, the numbers of split operations are 32, 4, and 10, respectively. For the SPAM dataset, the numbers of split operations are 10, 30, and 50, respectively, and for the Cloud dataset, the numbers of split operations are 6, 8, and 8, respectively. We also used other values of k and obtained similar results. The clustering results for C-k-means and the baseline approaches are listed in Table 6 for the Gauss dataset, Table 7 for the SPAM dataset, and Table 8 for the Cloud dataset. The three tables show that the accuracy of the proposed C-k-means is better than that of the baseline approaches.

4.6. Cost

To evaluate the clustering cost of C-k-means, we compare it to the baseline approaches on the SPAM and Gauss datasets for moderate values of k. For the Gauss dataset with different values of k, the numbers of split operations are 4, 5, and 10, respectively. For the SPAM dataset, the numbers of split operations are 5, 4, and 4, respectively. The results for the Gauss and SPAM datasets are presented in Tables 9 and 10, respectively. For each algorithm, we list the cost of the solution at the end of the initialization step, before Lloyd’s iteration, as well as the final cost. In Tables 9 and 10, “seed” represents the cost after the initialization step and “final” represents the cost after the final Lloyd iteration. The initialization cost of C-k-means is similar to that of k-means|| and lower than that of k-means++. These results suggest that the centers produced by C-k-means, like those produced by k-means||, are able to avoid outliers. In addition, C-k-means guarantees high precision with high efficiency because the CA runs very fast.

4.7. Computational Time

The individual household dataset is sufficiently large for large values of k, so we now consider the parallel algorithms on this dataset. For the household dataset with the corresponding values of k, the numbers of split operations are 6, 9, and 7, respectively. C-k-means is faster than k-means, k-means++, and k-means|| when implemented in parallel. The running time of C-k-means consists of two components: the time required to generate the initial solution and the time required for Lloyd’s iteration to converge. The former grows with the size of the dataset; the latter is what we compare between C-k-means and the baseline approaches. Table 11 shows the total running time of the clustering algorithms. For some values of k, C-k-means runs much faster than k-means and k-means++, and it runs much faster than k-means|| for the smaller values of k. However, when k is 500, the total running time of C-k-means is similar to that of k-means|| because C-k-means needs to split and merge many times to obtain the number of clusters, which means that the initialization occupies a large proportion of the total running time.

Next, an expected advantage of C-k-means is demonstrated: the initial solution discovered by C-k-means contributes to a faster convergence of Lloyd’s iteration. Table 12 shows the number of iterations required for Lloyd’s iteration to converge on the Cloud dataset with different initializations. C-k-means typically requires fewer iterations than the baseline approaches to converge to a locally optimal solution. The convergence of the iteration for datasets of different dimensions is also evaluated, and the Gauss and SPAM datasets are selected to verify the performance of the proposed C-k-means algorithm. Graphical representations of the number of iterations required for Lloyd’s iteration to converge with different initializations are shown in Figure 4(a) for the Gauss dataset (3 dimensions) and Figure 4(b) for the SPAM dataset (57 dimensions).

5. Conclusions and Future Work

This paper presents a covering k-means algorithm (C-k-means) that uses an improved covering algorithm (CA). First, based on the similarities between the data, the C-k-means algorithm uses the CA initialization to determine the number of clusters and the specific cluster centers through self-organization. Because it is independent of the initial cluster centers and does not need k to be prespecified, the CA is characterized as being “blind.” The k-means algorithm is then used to perform Lloyd’s iteration on the initial cluster centers determined by the CA until the cluster centers do not change, at which point the C-k-means clustering is complete and the clustering results are close to optimal. In addition, a parallel implementation of C-k-means is provided on the Spark platform. Parallel computing is used to solve the large-scale data clustering problem and improve the efficiency of the C-k-means algorithm. A large number of experiments on real large-scale datasets demonstrate that the C-k-means algorithm significantly outperforms its counterparts under both sequential and parallel conditions. In future work, we will optimize C-k-means and focus on the parameters that increase its speed and parallelism.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Key Technology R&D Program (no. 2015BAK24B01), the Natural Science Foundation of Anhui Province of China (no. 1808085MF197), and a Key Project of Nature Science Research for Universities of Anhui Province of China (no. KJ2016A038).