
Deep gene selection method to select genes from microarray datasets for cancer classification

Abstract

Background

Microarray datasets consist of complex, high-dimensional samples and genes, and generally the number of samples is much smaller than the number of genes. Due to this data imbalance, gene selection is a demanding task for microarray expression data analysis.

Results

The gene set selected by DGS has shown superior performance in cancer classification. DGS has a high capability of reducing the number of genes in the original microarray datasets. Experimental comparisons with other representative and state-of-the-art gene selection methods also showed that DGS achieved the best performance in terms of the number of selected genes, classification accuracy, and computational cost.

Conclusions

We provide an efficient gene selection algorithm that can select relevant genes which are significantly sensitive to the samples’ classes. With few discriminative genes and at low time cost, the proposed algorithm achieved high prediction accuracy on several public microarray datasets, which in turn verifies the efficiency and effectiveness of the proposed gene selection method.

Background

Studying the correlation between microarray data and diseases such as cancer plays an important role in biomedical applications [1]. Microarray data contain gene expressions extracted from tissues (samples). We can obtain more information about disease pathology by comparing the gene expressions of normal tissues with those of diseased tissues [1]. Exploring the difference between the gene expression in tumor cells and that in normal tissues can reveal important information in microarray datasets, and on this basis a number of classification techniques have been used to classify tissues into cancerous/normal or into types/subtypes [2,3,4,5,6]. However, microarray data generally suffer from high dimensionality: a dataset usually contains thousands of genes/attributes but only a few samples. Moreover, most of these attributes are irrelevant to the classification problem. Reducing the attribute dimensionality while ensuring that the selected attributes still contain rich and relevant information could therefore address this data imbalance problem, although it remains a big challenge. In addition, the small sample set makes the problem much harder to solve, because Machine Learning (ML) algorithms do not have enough training examples to learn from, which increases the risk of overfitting. Microarray data are also highly complicated because most of the attributes (genes) are directly or indirectly correlated with each other [7]. Selecting a small relevant attribute subset can solve many problems related to microarray data [8, 9]. By removing irrelevant and redundant attributes, we can reduce the dimensionality of the data, simplify the learning model, speed up the learning process and increase the classification accuracy. Several studies have developed and validated novel gene expression signatures and used them as biomarkers to predict cancer in clinical trials [10, 11]. Cancer-associated microarray biomarkers allow less-invasive monitoring and can facilitate patient diagnosis, prognosis, monitoring, and treatment in the oncology field [12, 13].

Several gene selection methods have been developed to select the genes that are directly related to disease diagnosis, prognosis, and therapeutic targets [14]. In addition to statistical methods, data mining and machine learning solutions have recently been widely used in genomic data analysis [9, 15]. However, most existing gene selection approaches still suffer from problems such as stagnation in local optima and high computational cost [16,17,18]. Therefore, an efficient new selection approach is needed to solve these problems.

Evolutionary Algorithms (EAs) have recently played an important role in the gene selection field due to their global search ability [19]. Besides, many hybrid EAs have been proposed to improve the accuracy of classification methods [20,21,22,23]. Various evolutionary algorithms aim to find an optimal subset of features by using bio-inspired solutions, such as the Genetic Algorithm (GA) [24], Genetic Programming (GP) [25], Particle Swarm Optimization (PSO) [26], and Honey Bee [27]. These algorithms have shown good performance on various problems but depend on expert intervention to obtain the desired performance.

Recently, a new gene selection method called Gene Selection Programming (GSP) [28] was proposed, which showed good results in terms of accuracy, the number of selected genes and time cost. However, the problem of the large search space remains unsolved.

Gene Expression Programming (GEP) [29] is an evolutionary algorithm that has been widely used for classification and gene selection [30,31,32,33,34,35]. GEP has two merits: flexibility, which makes it easy to implement, and the capability of finding the best solution, inspired by the ideas of genotype and phenotype. In this paper, we use GEP to construct our algorithm.

The purpose (and contribution) of this paper is to present a simple and thus computationally efficient algorithm for attribute selection from microarray gene expression data. To this end, we explore how to extract the important features from massive datasets.

The rest of this paper is organized as follows: a brief background of GEP is presented in Gene expression program. The proposed gene selection algorithm DGS is presented in Results. Evaluation results and discussions, as well as statistical analysis, are presented in Discussion. Finally, Conclusion gives the conclusions.

Gene expression program

Gene Expression Programming (GEP) [36] is an evolutionary algorithm that creates a computer program/model from two parts. The first part, also known as the genotype, is the characteristic linear chromosome with a fixed length. Each chromosome consists of one or more genes, and each gene consists of a head (h) and a tail (t). The head consists of terminals (attributes) and functions, while the tail consists of attributes only; the head length and tail length follow the rule t = h(n − 1) + 1, where n is the maximum number of parameters required by the used functions. The second part is the expression tree (ET), also known as the phenotype. For example, suppose h = 5 and the chromosome has only one gene. If the function set is {+, Q, /}, where Q is the square root, and the terminal set (the attributes in the data) is coded as {a0, …, a6}, then an example chromosome could be:

+ / a4 Q a2 | a1 a5 a6 a3 a0 a3 (Genotype)

where the part before the bar (shown in bold in the original) represents the head and the rest represents the tail. Read breadth-first in Karva notation, this genotype corresponds to the expression tree

(√a1 / a2) + a4 (Phenotype)

The basic GEP algorithm consists of four steps: creating the chromosomes to initialise the population, evaluating the fitness of each individual/chromosome by using a predefined fitness function, checking suitable stop condition(s), and applying the genetic operations to modify the individuals for the next generation. GEP has been successfully applied on microarray data to find different biological characteristics [30, 37]. More details about the GEP algorithm and process can be found in [29, 36, 38].
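To make the genotype-to-phenotype mapping concrete, here is a minimal Java sketch (not part of GEP4J; the arity table and symbol names are our own illustration) that decodes a Karva-notation gene such as the example above by reading its symbols breadth-first:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Decodes a Karva-notation GEP gene into an infix expression string. */
public class KarvaDecoder {

    // Arities of the function set {+, /, Q}; terminals have arity 0.
    static final Map<String, Integer> ARITY = Map.of("+", 2, "/", 2, "Q", 1);

    static class Node {
        final String symbol;
        Node left, right;          // right is unused for the unary Q
        Node(String s) { symbol = s; }
    }

    /** Builds the expression tree level by level (breadth-first), as GEP does. */
    static Node buildTree(List<String> gene) {
        Node root = new Node(gene.get(0));
        Queue<Node> open = new ArrayDeque<>();
        open.add(root);
        int next = 1;
        while (!open.isEmpty()) {
            Node n = open.poll();
            int arity = ARITY.getOrDefault(n.symbol, 0);
            if (arity >= 1) { n.left = new Node(gene.get(next++)); open.add(n.left); }
            if (arity == 2) { n.right = new Node(gene.get(next++)); open.add(n.right); }
        }
        return root;               // symbols past 'next' are unused tail padding
    }

    static String toInfix(Node n) {
        int arity = ARITY.getOrDefault(n.symbol, 0);
        if (arity == 0) return n.symbol;
        if (arity == 1) return "sqrt(" + toInfix(n.left) + ")";
        return "(" + toInfix(n.left) + " " + n.symbol + " " + toInfix(n.right) + ")";
    }

    public static void main(String[] args) {
        // The example chromosome: head = + / a4 Q a2, tail = a1 a5 a6 a3 a0 a3
        List<String> gene = List.of("+", "/", "a4", "Q", "a2",
                                    "a1", "a5", "a6", "a3", "a0", "a3");
        System.out.println(toInfix(buildTree(gene)));
        // prints: ((sqrt(a1) / a2) + a4)
    }
}
```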

Results

Materials

In our experiments, we evaluated the performance of the DGS method on an integrated lung cancer microarray dataset downloaded from NCBI (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68465). The dataset contains 442 patients collected from 4 hospitals: Moffitt Cancer Center (MCC, 79 patients), Memorial Sloan-Kettering Cancer Center (MSKCC, 104 patients), University of Michigan Cancer Center (UMCC, 177 patients), and Dana Farber Cancer Centre (DFCC, 82 patients).

The data include various prognosis information; we used the recurrence information to predict lung cancer recurrence. To this end, we extracted only the samples labelled with recurrence or free survival (non-recurrence) and removed all unrelated samples, such as those of dead patients and disease-free patients. After this preparation, the total number of patients in the dataset was 362: 205 cancer recurrence patients and 157 free survival patients. The total number of attributes (probe sets) was 22,283. For training and testing, we used 10-fold cross-validation: nine folds were used for training DGS while the remaining fold was used for testing. For more reliability, we repeated the experiment ten times and took the average results of these experiments.

To make the evaluations more reliable, we validated the prediction model using another independent dataset with the same statistical measures. The validation dataset from South Korea (GSE8894) can be downloaded from NCBI. GSE8894 contains 138 NSCLC samples from Affymetrix Hu133-plus2 platform microarray chips, with an equal number of samples in the two classes: 69 samples labelled ‘recurrence’ and 69 labelled ‘non-recurrence’.

The best setting for the number of chromosome (CH) and the number of genes (N)

To find out the best settings for the number of chromosomes in each generation (CH) and the number of genes (N) in each chromosome, we conducted experiments with different values of CH and N. To show the effect of CH and N on the DGS classification performance, we selected nine different settings: three values for CH (100, 200 and 300) and, for each CH value, three values for N (1, 2 and 3). The CH values were increased in steps of 100 to make the effect of CH clear, especially when the effect of increasing CH is very slight. To make the experiments more reliable, we repeated each experiment 10 times and took the average as the final result. The parameters used in DGS, which is based on the gene expression programming (GEP) algorithm, are shown in Table 1.

Table 1 Parameters used in DGS

The average experimental results are presented in Table 2. ACavg, Iavg, Savg and TMavg represent the averages over ten runs of the classification accuracy, the number of iterations, the number of selected attributes and the CPU time respectively, while ACstd, Istd, Sstd and TMstd represent the corresponding standard deviations.

Table 2 The results of different settings for the number of genes (N) and the number of chromosomes (CH)

We observed from Table 2 that:

  1. Comparing CH with N: CH has less effect on the results than N.

  2. Regarding CH: CH has positive relationships with ACavg, TMavg and Savg; that is, when CH increased, ACavg, TMavg and Savg also increased. CH has negative relationships with ACstd, TMstd and Sstd; that is, when CH increased, ACstd, TMstd and Sstd decreased. The results became stable when CH was over 200.

  3. Regarding N: N has positive relationships with ACavg, TMavg and Savg, and negative relationships with ACstd, TMstd and Sstd. The results became stable after two genes.

  4. Increasing CH over 200 would increase the processing time while the AC and N results would not change significantly.

  5. The best results were achieved when CH was 200 and N was 2.

DGS evaluations

Evaluating DGS performance based on AC, SN, SP, PPV, NPV, S, TM and AUC

The performance of DGS was evaluated and measured for each test in terms of classification accuracy (AC), sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), the number of selected genes (S), and processing time (TM), with 95% confidence intervals (CI).
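These metrics follow the standard confusion-matrix definitions; the short Java sketch below, with hypothetical counts from one test fold, shows how each value is computed:

```java
/** Standard confusion-matrix metrics used to evaluate DGS. */
public class Metrics {
    public static void main(String[] args) {
        // Hypothetical counts from one test fold.
        double tp = 50, tn = 30, fp = 8, fn = 5;

        double ac  = (tp + tn) / (tp + tn + fp + fn); // accuracy
        double sn  = tp / (tp + fn);                  // sensitivity (recall)
        double sp  = tn / (tn + fp);                  // specificity
        double ppv = tp / (tp + fp);                  // positive predictive value
        double npv = tn / (tn + fn);                  // negative predictive value

        System.out.printf("AC=%.4f SN=%.4f SP=%.4f PPV=%.4f NPV=%.4f%n",
                ac, sn, sp, ppv, npv);
    }
}
```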

To make the evaluations more reliable, we compared DGS with five representative models on the integrated lung cancer dataset. These five gene selection algorithms were Correlation-based Feature Selection (CFS), Consistency Subset Feature Selection (CSFS), Wrapper Subset (WS) and Support Vector Machine (SVM), which were applied using WEKA with their default configurations, and Gene Expression Programming (GEP) using the GEP4J package. All values are the averages (avg) over ten runs of the models. Table 3 gives the performance evaluation values for all the prediction models.

Table 3 Comparison of DGS performance with different feature selection models in terms of AC, SN, SP, PPV, NPV, AUC, S and TM with 95% CI for each test

In terms of AC, the experimental results showed that the DGS method achieved the highest average accuracy (0.8749), while the average accuracies of the other methods were 0.8436, 0.8370, 0.8395, 0.8544 and 0.8577 for CFS, CSFS, WS, SVM and GEP respectively.

In terms of SN, the DGS method achieved the highest average sensitivity (0.9522), while the average sensitivities of the other methods were 0.8995, 0.8907, 0.8932, 0.9307 and 0.9278 for CFS, CSFS, WS, SVM and GEP respectively.

In terms of SP, the DGS method achieved the highest average specificity (0.7739), while the average specificities of the other methods were 0.7707, 0.7669, 0.7694, 0.7548 and 0.7662 for CFS, CSFS, WS, SVM and GEP respectively.

The DGS model achieved the highest average PPV (0.8462), while the average PPVs of the other models were 0.8373, 0.8332, 0.8351, 0.832 and 0.8382 for CFS, CSFS, WS, SVM and GEP respectively.

The highest average NPV was that of DGS (0.9253), while the average NPVs of the other models were 0.8550, 0.8434, 0.8468, 0.8931 and 0.8907 for CFS, CSFS, WS, SVM and GEP respectively.

DGS achieved the smallest number of selected genes (3.9), which is almost half the number of genes selected by the other comparison methods.

Regarding TM, DGS required the least processing time (218.85), while the average times of the other models were 600.12, 600.02, 600.01, 600.21 and 620.51 for CFS, CSFS, WS, SVM and GEP respectively.

Figure 1 shows the effectiveness of the DGS method in terms of AC, SN, SP, PPV, NPV, S, TM and AUC.

Fig. 1 Comparison of DGS performance with different feature selection models in terms of AC, SN, SP, PPV, NPV and AUC

For more reliability, we validated the prediction model using an independent dataset (GSE8894). The selected genes were used as biomarkers to classify the recurrence/non-recurrence patients. The evaluation results of DGS on the validation dataset in terms of AC, SN, SP, PPV, NPV and AUC are presented in Table 4; they show the effectiveness of the proposed gene selection algorithm DGS, which enabled the prediction model to achieve an accuracy of 87.68%.

Table 4 Validation results of DGS on the independent dataset GSE8894

Figure 2 shows that the selected genes are able to separate risk groups (recurrence/non-recurrence) characterized by differences in their gene expressions.

Fig. 2 The evaluation results for the selected genes. a The gene expression levels of the selected genes shown as a heatmap. b The prediction results using the selected genes

The biological meaning for the selected genes from DGS method

In this section we present the biological meanings of the selected genes, obtained from the Expression Atlas database of EMBL-EBI (http://www.ebi.ac.uk/gxa/). Table 5 shows the genes selected by the DGS method over the ten runs.

Table 5 The selected genes of each run

We used the OMIM, Expression Atlas and NCBI websites to find the biological meanings of the selected microarray probe IDs and list their corresponding genes. The specifications are shown in Table 6.

Table 6 The final selected genes from the gene selection method DGS

DGS comparison with up-to-date models

We also compared the DGS method with recently proposed models: IBPSO [39], IG-GA [40], IG-ISSO [41], EPSO [42], mABC [43] and IG-GEP [32]. The comparison was based on two criteria: the classification accuracy and the number of selected genes, regardless of the data processing methods.

We used the same datasets that were used by these up-to-date models to compare DGS results. A brief description of these datasets is presented in Table 7.

Table 7 Description of the experimental datasets

The comparison results are presented in Table 8. Across the ten datasets used in the comparison, DGS achieved the best results on seven (11_Tumors, 9_Tumors, Leukemia1, Leukemia2, Lung_Cancer, DLBCL and SRBCT), while mABC achieved better results on three (Prostate, Brain_Tumor1 and Brain_Tumor2). Moreover, in terms of the number of selected genes, DGS achieved the best results on all experimental datasets. The average accuracy (ACavg) and number of selected genes (Savg) for IBPSO, IG-GA, IG-ISSO, EPSO, mABC and IG-GEP are listed in Table 8.

Table 8 Comparison of the gene selection algorithms on ten selected datasets

Discussion

We improved the genetic operations, which can effectively improve the generation quality. The experimental results show that the proposed DGS can provide a small set of reliable genes and achieve higher classification accuracies in less processing time.

These superior achievements are due to the following DGS features:

  1. The ability of DGS to reduce the complexity in different ways:

     a. Narrowing the search space gradually. In each iteration DGS extracts a new terminal set by removing the genes that do not provide high fitness values (see DGS population generation).

     b. Reducing the generation size by applying Eq. 3 (see Generation size controlling).

  2. The ability to select the related genes. In each generation DGS removes the unrelated genes to increase the probability of choosing related genes for generating 200 chromosomes, and after several generations DGS can finally find the most related genes. Table 5 shows the gene selection process and results.

  3. DGS is faster than the comparative methods. This speed comes from DGS’s abilities:

  • narrowing the search space;

  • resizing the chromosomes in each iteration.

Table 9 shows the differences between DGS and the related methods GA and GEP.

Table 9 The differences between DGS, GA and GEP

Conclusion

In this paper, an innovative DGS algorithm is proposed for selecting informative and relevant genes from microarray datasets to improve cancer classification. The proposed method inherits the evolutionary process from GEP. DGS has the ability to reduce the size of the attribute space iteratively and achieve the optimal solution. We applied this method to an integrated dataset and selected 4 genes which can achieve better classification results.

Method

Proposed method

A novel evolutionary method named Deep Gene Selection (DGS), based on the gene expression programming (GEP) algorithm, is presented in this section. DGS is developed to explore the subset of highly relevant genes. The proposed evolutionary method consists of several steps, as depicted in Fig. 3. In Fig. 3, the attributes/genes are coded as a0, …, am, where m represents the number of attributes in the dataset. T is the size of the terminal set which is used to create a population of chromosomes. In the first generation, T = m.

Fig. 3 DGS flowchart

The length of each chromosome (L) is defined based on the dimensionality of the dataset, and a minimum length of L can also be defined. Next, the population is evaluated using a fitness function that employs a classifier and the number of attributes. After fitness values are assigned, all chromosomes of the population are sorted to find the best individuals, i.e. those with the highest fitness values. The improved genetic operators are then applied to the selected population individuals, and accordingly the top individuals (those with the highest fitness values) are selected to generate the next generation. Then a new attribute subset with a new T is extracted from these best individuals of the new generation. In other words, the output (new attribute set) of the previous generation is the input of the next generation. After several generations, the attribute set will contain the minimum set of genes that can achieve the highest fitness values, because in each generation only the attributes that achieve the highest fitness values are selected. One termination condition of this iterative process is that there is no change in the top fitness values, which means the selected genes (the attribute set) and the classification results stay the same. The other termination condition is that the number of generations reaches its maximum without the program reaching the ideal solution. The selection operation stops once either of these two termination conditions is met. The application of this algorithm to real datasets is presented in Materials. It is worth noting that the proposed method takes advantage of evolutionary algorithms and dynamic attribute extraction to reach the optimal solution in a very simple and effective way.

Overall, the proposed method focuses on searching for superior solutions with the smallest number of attributes, using the evolutionary structures to evaluate the best solution and the dynamic attribute extraction approach to narrow the search space. With the progress of iteration, the cost of search decreases and the quality of the solution increases, until the optimal solution (or a solution close to the optimal one) in the smallest space is achieved. DGS was implemented in Java. To implement the expression tree (ET), we used the GEP4J package [54]. The DGS flowchart is presented in Fig. 3.

The detailed descriptions of the proposed method, including chromosome representation, initial DGS population, DGS fitness function and improved genetic operations, are presented in the following sub-sections.

DGS population generation

DGS population is the base of the proposed method. The chromosome concept and representation of the DGS population are inherited from the gene expression programming (GEP) algorithm (see Gene expression program). The chromosomes are constructed from two sets: a terminal set (ts) and a function set (fs). The function set can be a set of any mathematical operators such as {−, +, /, *, sqr, log}. The terminal set in this paper represents the attribute set of the microarray dataset.

The first generation is generated from all attributes in the microarray dataset. Each individual (chromosome) of the generation is evaluated by the fitness function and assigned a fitness value. All the individuals are then sorted in descending order, from the individual with the highest fitness value to that with the lowest. Then the attributes of the first 50% of individuals are extracted to generate a new terminal set (ts) for generating the next generation. This means the attribute output of one iteration is the input of the next iteration for generating a new generation. This iterative population generation process continues until one of the program termination conditions is met. In this way, DGS reduces the dimension of the attribute search space by extracting the attributes that achieve high fitness values.

The details of this population generation process are outlined in Algorithm 1.

(Algorithm 1 is shown as an image in the original article.)
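Since Algorithm 1 is reproduced only as an image, the following Java sketch reconstructs the loop described above; the helpers randomChromosome and fitness are hypothetical placeholders for the real GEP machinery:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch of DGS population generation: each iteration keeps the attributes
 *  of the top 50% chromosomes as the terminal set of the next generation. */
public class PopulationGeneration {

    record Chromosome(List<String> symbols, double fitness) {}

    static List<String> nextTerminalSet(List<String> ts, int ch) {
        // 1. Generate CH chromosomes from the current terminal set.
        List<Chromosome> pop = new ArrayList<>();
        for (int i = 0; i < ch; i++) {
            List<String> symbols = randomChromosome(ts);        // hypothetical helper
            pop.add(new Chromosome(symbols, fitness(symbols))); // hypothetical helper
        }
        // 2. Sort in descending order of fitness.
        pop.sort(Comparator.comparingDouble(Chromosome::fitness).reversed());
        // 3. Extract the attributes of the top 50% as the new terminal set.
        Set<String> newTs = new LinkedHashSet<>();
        for (Chromosome c : pop.subList(0, ch / 2))
            for (String s : c.symbols())
                if (s.startsWith("a")) newTs.add(s); // keep terminals only
        return new ArrayList<>(newTs);
    }

    // Placeholders standing in for the real GEP chromosome builder and Eq. (4).
    static List<String> randomChromosome(List<String> ts) { return ts; }
    static double fitness(List<String> symbols) { return Math.random(); }

    public static void main(String[] args) {
        List<String> ts = List.of("a0", "a1", "a2", "a3", "a4", "a5");
        System.out.println(nextTerminalSet(ts, 4));
    }
}
```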

The following simulation example illustrates the generation of a DGS population.

Example 1

If we have a dataset that has 13 attributes, then

ts = {a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13}.

Let h = 3 and fs = {+, −, *, /, Q}; then n = 2, t = h(n − 1) + 1 = 4 and the gene length g = h + t = 7. Suppose each chromosome has only one gene. The population with 10 individuals/chromosomes, together with their fitness values, is listed below (the listing appears as a figure in the original article):

Take chromosome 0 as an example to show how the fitness function is calculated. In chromosome 0, +, −, a12 is the head and a9, a3, a11, a7 is the tail.

Reading this genotype breadth-first, the phenotype/ET of chromosome 0 corresponds to the expression (a9 − a3) + a12 (the tree diagram appears as a figure in the original article).

DGS will use the gene expression values of a12, a9 and a3 to calculate the fitness.

DGS sorts the individuals in descending order based on their fitness values, then selects the top 50% of them (the highlighted individuals in the original example). DGS then extracts the attributes from these selected individuals to form a new terminal set, which is {a3, a4, a5, a6, a7, a8, a9, a11, a12}.

DGS then uses this new terminal set, which is smaller than the original one, together with the function set to generate a new population. This process continues until the program reaches the best solution (e.g., accuracy = 100%) with no changes to the consecutive terminal sets, or the maximum number of generations is reached.

Generation size controlling

The generation size is determined by three values: the number of individuals/chromosomes (CH) in a generation, the length of each chromosome (L) and the size of the terminal set (T). The generation size must be properly defined: if it is too big, computational time increases, and if it is too small, the generation may not cover all attributes/terminals. In the original evolutionary algorithms, the number of chromosomes in each generation (i.e., the generation size) is fixed, so the values that suit the first generation also suit all other generations. However, in our method the first generation is generated from all attributes, and the number of attributes may be in the thousands for big datasets. The attributes used for generating the second generation are a subset of the attributes of the first generation, as we saw in Example 1. The number of attributes used for generating a generation is dynamic, i.e. it decreases (or at least does not increase) as the evolution progresses. Therefore, the values of CH and L that suit one generation may not suit the others. To ensure the generation size is properly defined, we define the following rule in Eq. (1) for these three values.

$$ L\ast CH= 2T $$
(1)

In fact, L*CH is the overall size of a generation in terms of attributes and functions. The constant 2 in Eq. (1) ensures that each attribute in the terminal set has roughly two chances to be selected into a generation.

Our previous experiments [32] showed that the value of L has more impact on classification results and computational time than CH. So usually we use a fixed CH value (200) for all generations and changeable values for L.

In fact, let N be the number of genes of a chromosome/individual, then

$$ \mathrm{L}=\mathrm{N}\ast \left(\mathrm{gene}\ \mathrm{length}\right)=\mathrm{N}\ast \left(\mathrm{h}+\mathrm{t}\right) $$

where h is the length of gene head and t is the length of gene tail, and

$$ t=h\ast \left(n-1\right)+1 $$
(2)

where n represents the maximum number of parameters needed in the function set.

From our experiments, we found that N = 2 can provide the best classification results from microarray data sets. If we choose N = 2, then

$$ L=2\left(n\ast h+1\right) $$

Considering Eq. (1), we have

$$ 2\left(n\ast h+1\right)\ast CH=2T $$
$$ h=\left(T/ CH-1\right)/n $$

Usually n = 2 for the commonly used functions, therefore h can be defined as the integer part of (T/CH − 1)/n, i.e.

$$ h=\mathrm{floor}\left[\left(T/ CH-1\right)/n\ \right] $$

On the other hand, it is necessary to set a minimum value of h (h = 3, a commonly used value) to guarantee that the genes of a chromosome contain enough information for evolution.

Based on the above rules and the minimum requirement, we can define the head size (h) of each gene in a chromosome as:

$$ h=\max \left(3,\ \mathrm{floor}\left[\left(T/ CH-1\right)/2\right]\right) $$
(3)

Since CH is fixed (e.g., 200) and the number of genes in a chromosome is set to 2, once the value of h is defined according to Eq. (3), the overall size of a generation is defined. The following simulation example shows different h values for different terminal set sizes (T).

Example 2

If a microarray dataset originally has 2200 attributes and we set CH = 150, the values of h and T are listed in Table 10.

Table 10 The results of example 2
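A minimal Java sketch of Eq. (3) and the resulting chromosome sizing; for the initial terminal set of Example 2 (T = 2200, CH = 150) it works out to h = 6:

```java
/** Computes the gene head size h from Eq. (3): h = max(3, floor((T/CH - 1)/2)). */
public class HeadSize {

    static int headSize(int t, int ch) {
        return Math.max(3, (int) Math.floor((t / (double) ch - 1) / 2));
    }

    public static void main(String[] args) {
        int ch = 150, n = 2;
        int t = 2200;                 // initial terminal-set size of Example 2
        int h = headSize(t, ch);      // -> 6
        int tail = h * (n - 1) + 1;   // Eq. (2) -> 7
        int len = 2 * (n * h + 1);    // chromosome length L for N = 2 genes -> 26
        System.out.printf("T=%d h=%d tail=%d L=%d%n", t, h, tail, len);
    }
}
```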

Fitness function

The purpose of gene selection methods is to obtain the smallest gene subset that can provide the best classification results. To this end, a new fitness function is proposed to enable DGS to select the best individuals/chromosomes. The fitness value of an individual i is calculated by the following equation:

$$ {f}_i=\left(1-r\right)\ast AC(i)+r\ast \frac{t-{s}_i}{t} $$
(4)

This fitness function consists of two parts. The first part is based on the classification accuracy AC(i) of the individual i. We use the support vector machine (SVM) as the classification method to calculate the accuracy of an individual/chromosome, because it is a powerful classification algorithm that is widely used for binary and multi-class classification problems [55, 56] and can achieve high classification accuracy. To calculate AC, we use Eq. (5), which is widely used in cancer classification.

$$ AC=\left( TP+ TN\right)/\left( TP+ FN+ TN+ FP\right) $$
(5)

where TP, TN, FP and FN represent true positives, true negatives, false positives and false negatives respectively. The second part is based on the number of selected genes: t is the total number of attributes in the terminal set, si is the number of attributes selected in individual/chromosome i, and r ∈ [0, 0.5) is a predefined weight controlling the relative importance of AC(i) and si.
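A minimal sketch of Eq. (4), assuming the SVM accuracy AC(i) has already been computed; the values of r and the counts below are illustrative only:

```java
/** Fitness of an individual per Eq. (4): f = (1 - r) * AC + r * (t - s) / t. */
public class Fitness {

    static double fitness(double ac, int selected, int totalAttrs, double r) {
        // r in [0, 0.5) balances accuracy against the number of selected genes.
        return (1 - r) * ac + r * (totalAttrs - selected) / (double) totalAttrs;
    }

    public static void main(String[] args) {
        double ac = 0.90;   // SVM accuracy of the chromosome, from Eq. (5)
        int s = 3, t = 100; // 3 attributes selected out of a 100-attribute terminal set
        System.out.println(fitness(ac, s, t, 0.2)); // 0.72 + 0.194 = 0.914
    }
}
```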

Improved genetic operations and DGS algorithm

The purpose of genetic operations is to improve the individuals towards the optimal solution. In this paper, we improve two genetic operations: mutation and recombination. The improved genetic operations rely on the weights of genes, as explained below.

Attribute weight

The weight (w) of each attribute (i) is calculated based on Eq. (6)

$$ {w}_i=\frac{k_i}{sum}\kern0.5em \in \left(0,1\right) $$
(6)

where \( sum=\sum_{i\in ts}{k}_i \), \(k_i\) is the rank value of the attribute \(i\), and \( \sum_i{w}_i=1 \).

In this study we used the Gain Ratio to calculate the rank of attribute i as follows:

$$ {k}_i=\frac{information\ gain\ \left(i\ \right)}{intrinsic\ information\ (i)} $$
(7)

The details of calculating the information gain and the intrinsic information can be found in [57,58,59].

The attributes with a higher weight contain more information for classification.
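A sketch of Eqs. (6) and (7): the gain-ratio ranks k_i are hard-coded here for illustration (in DGS they come from the information gain divided by the intrinsic information), and the weights are simply the ranks normalized to sum to 1:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Attribute weights per Eq. (6): w_i = k_i / sum(k), where k_i is the gain
 *  ratio of attribute i (Eq. (7): information gain / intrinsic information). */
public class AttributeWeights {

    static Map<String, Double> weights(Map<String, Double> gainRatio) {
        double sum = gainRatio.values().stream().mapToDouble(Double::doubleValue).sum();
        Map<String, Double> w = new LinkedHashMap<>();
        gainRatio.forEach((attr, k) -> w.put(attr, k / sum));
        return w; // each w_i lies in (0,1), and the weights sum to 1
    }

    public static void main(String[] args) {
        // Hypothetical gain-ratio ranks for four attributes.
        Map<String, Double> k = new LinkedHashMap<>();
        k.put("a3", 0.8); k.put("a7", 0.5); k.put("a0", 0.4); k.put("a6", 0.3);
        System.out.println(weights(k)); // {a3=0.4, a7=0.25, a0=0.2, a6=0.15}
    }
}
```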

Mutation

Mutation is an important genetic operator which can significantly affect the individual’s development. It makes a minor variation in the genome by exchanging one component with another. In evolutionary algorithms, the changes made by mutation might bring substantial differences to chromosomes. For example, a mutation might make a chromosome better in terms of fitness, or important attributes might be lost due to a random mutation, which could decrease accuracy and increase processing time.

The critical question is which attribute/terminal should be added or deleted when performing a mutation. Ideally, a weak terminal deleted by the mutation operation should be replaced by a strong one. This can be achieved by using the following improved mutation operation.

To clarify the DGS mutation operation, we provide a simple example shown in Fig. 4. In the example, the chromosome consists of a single gene (− / a6 a2 a0 a9 a7). The gene head size (h) is 3. The function set is {Q, +, −, *, /}, which means n = 2. According to Eq. (2), the gene tail size (t) is 4 and the chromosome length is (3 + 4) = 7.

Fig. 4 Example of mutation operation for DGS

All the terminals in the dataset are weighted once at the beginning of the program and sorted in descending order of weight, as shown at the top of Fig. 4. In this example a3 has the highest weight while a8 has the lowest. Terminal a6 is identified by the DGS mutation as the weakest terminal, as it has the lowest weight among all terminals in the example chromosome.

For this weak terminal a6, DGS mutation has two options: replace it with a function such as (+), or replace it with a terminal. In the latter case, the replacing terminal should have a higher weight than a6; in this example terminal a7 is selected. With stronger terminals/attributes after mutation, the new chromosome might achieve a higher fitness value than the previous one. The details of this mutation operator are outlined in Algorithm 2.

(Algorithm 2 is shown as an image in the original article.)
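As Algorithm 2 is likewise only an image, here is a minimal Java sketch of the weight-guided mutation under the stated rules; the helper structures, the 50/50 function-or-terminal choice and the random pick among stronger terminals are assumptions:

```java
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Weight-guided DGS mutation: replace the weakest terminal of a chromosome
 *  with a terminal of higher weight (or a function, if it sits in the head). */
public class Mutation {
    static final Random RND = new Random();

    static void mutate(List<String> chrom, int headSize,
                       Map<String, Double> weight, List<String> sortedTerminals,
                       List<String> functions) {
        // 1. Find the position of the lowest-weight terminal in the chromosome.
        int weakest = -1;
        for (int i = 0; i < chrom.size(); i++) {
            String s = chrom.get(i);
            if (!weight.containsKey(s)) continue; // skip functions
            if (weakest < 0 || weight.get(s) < weight.get(chrom.get(weakest)))
                weakest = i;
        }
        if (weakest < 0) return;
        // 2. Head positions may take a function; tail positions only terminals.
        if (weakest < headSize && RND.nextBoolean()) {
            chrom.set(weakest, functions.get(RND.nextInt(functions.size())));
            return;
        }
        // 3. Otherwise replace it with a randomly chosen higher-weight terminal
        //    (sortedTerminals is sorted by descending weight).
        int rank = sortedTerminals.indexOf(chrom.get(weakest));
        if (rank > 0) chrom.set(weakest, sortedTerminals.get(RND.nextInt(rank)));
    }

    public static void main(String[] args) {
        List<String> chrom = new java.util.ArrayList<>(
                List.of("-", "/", "a6", "a2", "a0", "a9", "a7")); // gene of Fig. 4
        Map<String, Double> w = Map.of("a0", 0.5, "a2", 0.4, "a6", 0.1,
                                       "a7", 0.3, "a9", 0.2);     // illustrative weights
        List<String> sorted = List.of("a0", "a2", "a7", "a9", "a6");
        mutate(chrom, 3, w, sorted, List.of("+", "-", "*", "/", "Q"));
        System.out.println(chrom); // a6 replaced by a stronger symbol
    }
}
```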

Recombination

The second genetic operation we used in this proposed method is the recombination operation.

Generally, in the recombination operation, pairs of chromosomes (parents) are randomly selected and combined to generate a new pair. To generate the new chromosomes, the parents exchange one or more parts (short sequences) with each other. The exchanged part can also be an entire gene, swapped with the equivalent gene of the other parent.

In this study, we replace the random exchange process with a new controlled process, illustrated by the example in Fig. 5. The DGS program records the fitness values of all chromosomes and selects two of them. In this example, the fitness value of chromosome 1 is 80% and that of chromosome 2 is 70%. The DGS recombination operation selects the “strong” gene (the gene with the highest weight summation ∑wi) from the chromosome with the lower fitness value (lc) and exchanges it with the “weak” gene (the gene with the lowest weight summation) from the chromosome with the higher fitness value (hc). The process is repeated until the program obtains a new chromosome (hc′) with a fitness value higher than both parents (the original chromosomes). This idea comes from the gene structure [60].

Fig. 5 DGS recombination example

Based on the above improvements and innovations, the Deep Gene Selection (DGS) algorithm is presented as pseudocode in Algorithm 3 below.

(Algorithm 3 is shown as an image in the original article.)
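Algorithm 3 is also reproduced only as an image; this top-level sketch ties the preceding pieces together under stated assumptions (nextTerminalSet is a hypothetical stand-in for the population generation, evaluation, mutation and recombination steps sketched above):

```java
import java.util.List;

/** Top-level DGS loop (sketch of Algorithm 3). All helpers are hypothetical
 *  stand-ins for the components sketched in the previous subsections. */
public class DgsDriver {

    static List<String> run(List<String> allAttributes, int ch, int maxGenerations) {
        List<String> ts = allAttributes;        // first generation: T = m
        for (int gen = 0; gen < maxGenerations; gen++) {
            // Generate CH chromosomes (head size resized per Eq. (3)),
            // evaluate them with Eq. (4), apply the weight-guided mutation
            // and recombination, then keep the attributes of the top 50%.
            List<String> next = nextTerminalSet(ts, ch);
            if (next.equals(ts)) break;         // no change: solution has converged
            ts = next;
        }
        return ts;                              // the final selected genes
    }

    // Hypothetical stub; see the population-generation sketch above.
    static List<String> nextTerminalSet(List<String> ts, int ch) { return ts; }

    public static void main(String[] args) {
        List<String> all = List.of("a0", "a1", "a2", "a3");
        System.out.println(run(all, 200, 50));
    }
}
```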

Availability of data and materials

The lung cancer dataset GSE68465 was downloaded from NCBI.

Abbreviations

a0, …, am: gene coding

AC: accuracy value

c: chromosome

CH: the number of chromosomes in each generation

DGS: Deep Gene Selection

e: element

fs: function set

g: gene

GEP: Gene Expression Programming

GSP: Gene Selection Programming

h: head

hc: chromosome with the higher fitness value

I: the number of iterations

k: the rank value of the attribute

L: chromosome length

lt: lowest/weakest terminal in the chromosome

n: the maximum number of parameters needed in the function set

N: the number of genes of a chromosome

r: weight controlling the importance of AC

s: the number of selected attributes in the chromosome

t: tail

T: terminal set size

ts: terminal set

w: the weight of each attribute

References

  1. Hoopes L. Genetic diagnosis: DNA microarrays and cancer. 2008.

  2. Aljahdali SH, El-Telbany ME. Bio-inspired machine learning in microarray gene selection and cancer classification. In: 2009 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT); 2009. p. 339–43.

  3. Kumar CA, Ramakrishnan S. Binary classification of cancer microarray gene expression data using extreme learning machines. In: 2014 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC); 2014. p. 1–4.

  4. Bhola A, Tiwari AK. Machine learning based approaches for cancer classification using gene expression data. Mach Learn Appl. 2015;2(3/4):1–12.

  5. Cho S-B, Won H-H. Machine learning in DNA microarray analysis for cancer classification. In: Proceedings of the First Asia-Pacific Bioinformatics Conference, vol. 19. Australian Computer Society; 2003. p. 189–98.

  6. Azzawi H, Hou J, Xiang Y, Alanni R. A hybrid neural network approach for lung cancer classification with gene expression dataset and prior biological knowledge. In: International Conference on Machine Learning for Networking, vol. 11407. Cham: Springer; 2018. p. 279–93.

  7. Han F, Sun W, Ling Q-H. A novel strategy for gene selection of microarray data based on gene-to-class sensitivity information. PLoS One. 2014;9(5):e97530.

  8. Wang Y, et al. Gene selection from microarray data for cancer classification—a machine learning approach. Comput Biol Chem. 2005;29(1):37–46.

  9. Liu Q, et al. Gene selection and classification for cancer microarray data based on machine learning and similarity measures. BMC Genomics. 2011;12(5):S1.

  10. Lu Y, Wang L, Liu P, Yang P, You M. Gene-expression signature predicts postoperative recurrence in stage I non-small cell lung cancer patients. 2012;7(1):e30880.

  11. Liu W, et al. Identification of genes associated with cancer progression and prognosis in lung adenocarcinoma: analyses based on microarray from Oncomine and The Cancer Genome Atlas databases. 2019;7(2):e00528.

  12. Hayes J, Peruzzi PP, Lawler S. MicroRNAs in cancer: biomarkers, functions and therapy. 2014;20(8):460–9.

  13. Wang W, et al. The value of plasma-based microRNAs as diagnostic biomarkers for ovarian cancer. 2019.

  14. Das S, Meher PK, Rai A, Bhar LM, Mandal BN. Statistical approaches for gene selection, hub gene identification and module interaction in gene co-expression network analysis: an application to aluminum stress in soybean (Glycine max L.). PLoS One. 2017;12(1):e0169605.

  15. Mundra PA, Rajapakse JC. SVM-RFE with MRMR filter for gene selection. IEEE Trans Nanobioscience. 2010;9(1):31–7.

  16. Mhamdi H, Mhamdi F. Feature selection methods on biological knowledge discovery and data mining: a survey. In: 2014 25th International Workshop on Database and Expert Systems Applications (DEXA); 2014. p. 46–50.

  17. Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng. 2014;40(1):16–28.

  18. Sheikhpour R, Sarram MA, Gharaghani S, Chahooki MAZ. A survey on semi-supervised feature selection methods. Pattern Recogn. 2017;64:141–58.

  19. Wan W, Birch JB. An improved hybrid genetic algorithm with a new local search procedure. J Appl Math. 2013;2013.

  20. Apolloni J, Leguizamón G, Alba E. Two hybrid wrapper-filter feature selection algorithms applied to high-dimensional microarray experiments. Appl Soft Comput. 2016;38:922–32.

  21. Han F, et al. A gene selection method for microarray data based on binary PSO encoding gene-to-class sensitivity information. IEEE/ACM Trans Comput Biol Bioinform. 2017;14(1):85–96.

  22. Alshamlan H, Badr G, Alohali Y. mRMR-ABC: a hybrid gene selection algorithm for cancer classification using microarray gene expression profiling. Biomed Res Int. 2015;2015.

  23. Moradi P, Gholampour M. A hybrid particle swarm optimization for feature subset selection by integrating a novel local search strategy. Appl Soft Comput. 2016;43:117–30.

  24. Yang J, Honavar V. Feature subset selection using a genetic algorithm. In: Feature extraction, construction and selection. Springer; 1998. p. 117–36.

  25. Koza JR. Genetic programming as a means for programming computers by natural selection. Stat Comput. 1994;4(2):87–112.

  26. Shi Y. Particle swarm optimization: developments, applications and resources. In: Proceedings of the 2001 Congress on Evolutionary Computation, vol. 1; 2001. p. 81–6.

  27. Karaboga D. An idea based on honey bee swarm for numerical optimization. Technical report TR06, Erciyes University, Engineering Faculty, Computer Engineering Department; 2005.

  28. Alanni R, Hou J, Azzawi H, Xiang Y. A novel gene selection algorithm for cancer classification using microarray datasets. BMC Med Genomics. 2019;12(1):10.

  29. Ferreira C, Gepsoft U. What is gene expression programming. 2008.

  30. Azzawi H, Hou J, Xiang Y, Alanni R. Lung cancer prediction from microarray data by gene expression programming. IET Syst Biol. 2016;10(5):168–78.

  31. Alanni R, Hou J, Abdu-aljabar RD, Xiang Y. Prediction of NSCLC recurrence from microarray data with GEP. IET Syst Biol. 2017;11(3):77–85.

  32. Alanni R, Hou J, Azzawi H, Xiang Y. New gene selection method using gene expression programing approach on microarray data sets. In: Lee R, editor. Computer and information science. Cham: Springer International Publishing; 2019. p. 17–31.

  33. Azzawi H, Hou J, Alanni R, Xiang Y. SBC: a new strategy for multiclass lung cancer classification based on tumour structural information and microarray data. In: 17th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2018); 2018. p. 68–73.

  34. Alanni R, Hou J, Azzawi H, Xiang Y. Cancer adjuvant chemotherapy prediction model for non-small cell lung cancer. IET Syst Biol. 2019.

  35. Alanni R, Hou J, Azzawi H, Xiang Y. Risk classification for NSCLC survival using microarray and clinical data. Presented at the 207th IIER International Conference, 12-12-2018; 2019. Available: http://worldresearchlibrary.org/proceeding.php?pid=2429

  36. Ferreira C. Gene expression programming in problem solving. In: Soft computing and industry. Springer; 2002. p. 635–53.

  37. Azzawi H, Hou J, Alanni R, Xiang Y, Abdu-Aljabar R, Azzawi A. Multiclass lung cancer diagnosis by gene expression programming and microarray datasets. In: International Conference on Advanced Data Mining and Applications. Springer; 2017. p. 541–53.

  38. Ferreira C. Gene expression programming: a new adaptive algorithm for solving problems. Complex Systems. 2001;13(2):87–129.

  39. Mohamad MS, Omatu S, Deris S, Yoshioka M. A modified binary particle swarm optimization for selecting the small subset of informative genes from gene expression data. IEEE Trans Inf Technol Biomed. 2011;15(6):813–22.

  40. Yang C-H, Chuang L-Y, Yang CH. IG-GA: a hybrid filter/wrapper method for feature selection of microarray data. J Med Biol Eng. 2010;30(1):23–8.

  41. Lai C-M, Yeh W-C, Chang C-Y. Gene selection using information gain and improved simplified swarm optimization. Neurocomputing. 2016.

  42. Mohamad MS, Omatu S, Deris S, Yoshioka M, Abdullah A, Ibrahim Z. An enhancement of binary particle swarm optimization for gene selection in classifying cancer classes. Algorithms Mol Biol. 2013;8(1):1.

  43. Moosa JM, Shakur R, Kaykobad M, Rahman MS. Gene selection for cancer classification with the help of bees. BMC Med Genomics. 2016;9(2):47.

  44. Su AI, et al. Molecular classification of human carcinomas by use of gene expression signatures. Cancer Res. 2001;61(20):7388–93.

  45. Staunton JE, et al. Chemosensitivity prediction by transcriptional profiling. Proc Natl Acad Sci. 2001;98(19):10787–92.

  46. Pomeroy SL, et al. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature. 2002;415(6870):436.

  47. Nutt CL, et al. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res. 2003;63(7):1602–7.

  48. Golub TR, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–7.

  49. Armstrong SA, et al. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat Genet. 2002;30(1):41.

  50. Bhattacharjee A, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci. 2001;98(24):13790–5.

  51. Khan J, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med. 2001;7(6):673–9.

  52. Singh D, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002;1(2):203–9.

  53. Shipp MA, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med. 2002;8(1):68–74.

  54. Thomas J. Gene expression programming for Java. 2010.

  55. Rajaguru H, Ganesan K, Bojan VK. Earlier detection of cancer regions from MR image features and SVM classifiers. Int J Imaging Syst Technol. 2016;26(3):196–208.

  56. Le Thi HA, Nguyen MC. DCA based algorithms for feature selection in multi-class support vector machine. Ann Oper Res. 2017;249(1):273–300.

  57. Priyadarsini RP, Valarmathi M, Sivakumari S. Gain ratio based feature selection method for privacy preservation. ICTACT J Soft Comput. 2011;1(04):2229–6956.

  58. Karegowda AG, Manjunath A, Jayaram M. Comparative study of attribute selection using gain ratio and correlation based feature selection. Int J Inform Technol Knowl Manag. 2010;2(2):271–7.

  59. Yang P, Zhou BB, Zhang Z, Zomaya AY. A multi-filter enhanced genetic ensemble system for gene selection and sample classification of microarray data. BMC Bioinformatics. 2010;11(1):S5.

  60. Suryamohan K, Halfon MS. Identifying transcriptional cis-regulatory modules in animal genomes. Wiley Interdiscip Rev Dev Biol. 2015;4(2):59–84.


Acknowledgements

The authors would like to thank the editor and anonymous reviewers for their valuable comments.

Funding

No funding was received.

Author information


Contributions

RA designed the study, wrote the code and drafted the manuscript; JH designed the model and the experiments and revised the manuscript; HA and YX participated in the model design and coordination and helped to draft the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Russul Alanni.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Alanni, R., Hou, J., Azzawi, H. et al. Deep gene selection method to select genes from microarray datasets for cancer classification. BMC Bioinformatics 20, 608 (2019). https://doi.org/10.1186/s12859-019-3161-2
