Neural Networks
Volume 121, January 2020, Pages 441-451

Integrating joint feature selection into subspace learning: A formulation of 2DPCA for outliers robust feature selection

https://doi.org/10.1016/j.neunet.2019.08.030

Abstract

Since principal component analysis (PCA) and its variants are sensitive to outliers, which affect their performance and applicability in real-world settings, several variants have been proposed to improve robustness. However, most existing methods are still sensitive to outliers and are unable to select useful features. To overcome the sensitivity of PCA to outliers, in this paper we introduce two-dimensional outliers-robust principal component analysis (ORPCA) by imposing joint constraints on the objective function. ORPCA relaxes the orthogonality constraints and penalizes the regression coefficients; it thus selects important features while discarding features that are already captured by other principal components. It is well known that the squared Frobenius norm is sensitive to outliers. To overcome this issue, we devise an alternative way to derive the objective function. Experimental results on four publicly available benchmark datasets show the effectiveness of joint feature selection and demonstrate better performance than state-of-the-art dimensionality-reduction methods.

Introduction

With recent advances in data acquisition devices, data can be acquired at substantially faster rates and higher resolutions. The data interpretation process, however, faces several challenges due to high dimensionality. Dimensionality reduction is a serious challenge not only for classification but also for several other domains such as data visualization, data compression, pattern recognition, and computer vision. The aim of dimensionality reduction is to transform high-dimensional data into a low-dimensional representation that preserves the quality of the data so that it can be classified efficiently. To deal with this issue, several vector-based methods have been in use over the last two decades, such as Principal Component Analysis (PCA) (Turk & Pentland, 1991), Linear Discriminant Analysis (LDA) (Belhumeur et al., 1997, Razzak et al., 2010, Ye et al., 2018), LPP (He & Niyogi, 2004), SPP (Qiao, Chen, & Tan, 2010), SPPE (Zhang, Yan, & Zhao, 2013), Isomap (Zhang et al., 2018) and NPE (He & Niyogi, 2004). Principal Component Analysis is one of the most extensively used unsupervised dimensionality reduction methods; it projects the high-dimensional representation into a linear orthogonal space. However, a major drawback is that each principal component is a linear combination of all original features with loadings that are typically non-zero, which makes the result difficult to interpret; moreover, PCA remains sensitive to outliers (its covariance matrix is derived from the ℓ2-norm), which affects its performance. Thus, it fails to deal with outliers that often appear in real-world data. Moreover, before applying PCA and LDA, the image has to be converted into a one-dimensional vector, so these methods may not exploit the image's spatial structural information very well (Feng et al., 2013, He and Niyogi, 2004, Netrapalli et al., 2014, Turk and Pentland, 1991, Vaswani et al., 2018, Xu et al., 2010, Yi et al., 2017, Zou et al., 2006), which is very important for image representation. To overcome these issues, several variants of PCA have been proposed to improve the effectiveness of dimensionality reduction and the robustness against outliers.
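
To make the vectorization step concrete, the following minimal sketch (Python with NumPy, on synthetic data; shapes and sizes are our assumptions) shows classical vector-based PCA: each image is flattened before the covariance eigendecomposition, which is exactly where the spatial structure is lost and where the squared ℓ2 criterion lets outliers dominate.

```python
# Minimal sketch of classical (vector-based) PCA on images; synthetic data,
# illustrative only.
import numpy as np

rng = np.random.default_rng(0)
images = rng.standard_normal((100, 32, 32))   # 100 images of size 32 x 32

# Step 1: vectorize -- each 32x32 image becomes a 1024-dim row vector,
# discarding the 2D spatial arrangement of pixels.
X = images.reshape(100, -1)
X = X - X.mean(axis=0)                        # center the data

# Step 2: eigendecomposition of the covariance matrix. The squared l2
# criterion behind this matrix is what makes classical PCA outlier-sensitive.
C = X.T @ X / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(C)          # eigenvalues in ascending order
W = eigvecs[:, -50:]                          # top 50 principal directions

Y = X @ W                                     # low-dimensional representation
print(Y.shape)                                # (100, 50)
```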

Matrix-based subspace learning methods have been widely applied for dimensionality reduction (Li et al., 2017, Li et al., 2010, Tian et al., 2017, Yang et al., 2004, Yang et al., 2005). Results have shown that 2DPCA (Yang et al., 2004), 2DLDA (Yang et al., 2005), multi-linear PCA (Lu, Plataniotis, & Venetsanopoulos, 2008), and JGSPCA (Khan, Shafait, & Mian, 2015) are far more efficient than one-dimensional subspace learning, owing to their direct formulation on two-dimensional images. Two-dimensional subspace learning methods calculate the scatter matrices directly from image matrices and can hence exploit the spatial structural information of the image, which is quite important for image classification. To select important features, several efforts have been made, such as robust 2DPCA and the use of the nuclear norm, the ℓ1-norm, the ℓ2,1-norm, and the Frobenius norm, which showed considerable improvement against outliers and are able to select discriminant patterns.
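
For contrast with the vectorized sketch above, here is a minimal 2DPCA sketch in the spirit of Yang et al. (2004): the image scatter matrix is computed directly from the 2D image matrices, so no flattening is needed. The synthetic input and the choice of d are our assumptions.

```python
# Sketch of 2DPCA (Yang et al., 2004 style): the scatter matrix is built
# directly from image matrices, preserving spatial structure.
import numpy as np

def two_d_pca(images, d):
    """images: (N, h, w) array; returns a (w, d) projection matrix."""
    mean = images.mean(axis=0)                    # h x w mean image
    centered = images - mean
    # Image scatter matrix: G = (1/N) * sum_i (A_i - mean)^T (A_i - mean)
    G = np.einsum('nhw,nhv->wv', centered, centered) / len(images)
    _, eigvecs = np.linalg.eigh(G)
    return eigvecs[:, -d:]                        # top-d eigenvectors of G

rng = np.random.default_rng(0)
imgs = rng.standard_normal((100, 32, 32))
X = two_d_pca(imgs, d=8)
features = imgs @ X       # each image reduces to a 32 x 8 feature matrix
print(features.shape)     # (100, 32, 8)
```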

Recently, ℓ1-norm-based subspace learning methods have shown great performance against outliers for tensor data classification (Razzak et al., 0000, Wang et al., 2012, Wang and Wang, 2013). Ke and Kanade presented matrix factorization as an ℓ1-norm minimization problem that can handle missing data straightforwardly. Wang et al. presented robust 2DPCA with non-greedy ℓ1-norm maximization, in which all projection directions are optimized simultaneously (Wang & Gao, 2016). Luo et al. extended it by learning the projection matrix through maximizing the sum of the projected differences between each pair of instances, rather than the difference between each instance and the mean of the data (Luo et al., 2017). Although ℓ1-norm-based methods provide great performance, they do not relate to the covariance matrix, which characterizes the geometric structure of the data, whereas the F-norm can efficiently exploit the spatial structure embedded in the data. Several efforts have been made to use the F-norm for subspace learning, such as 2DPCA (Yang et al., 2004, Yang et al., 2005), 2D-PCA (Tian et al., 2017), F-norm 2DPCA (Li et al., 2017), NM-2DPCA (Chen et al., 2018, Wang et al., 2017), and N-2DNPP (Zhang, Li, Zhao, Zhang, & Yan, 2017). However, these methods either still suffer from the effect of outliers or are unable to select important features. Furthermore, the sensitivity of the F-norm is another challenge. Wang et al. presented non-squared F-norm minimization to overcome this challenge (Wang et al., 2017). However, it affects the selection of important features.
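
Our reading of the non-greedy ℓ1-norm maximization scheme can be sketched as follows: fix the signs of the current projections, then update all directions at once by solving an orthogonal Procrustes-type subproblem. The initialization and the fixed iteration count are our assumptions, not the exact settings of Wang and Gao (2016).

```python
# Hedged sketch of non-greedy l1-norm 2DPCA: maximize sum_i ||A_i W||_1
# subject to W^T W = I, updating all d directions simultaneously.
import numpy as np

def l1_2dpca_nongreedy(images, d, n_iter=50):
    N, h, w = images.shape
    rng = np.random.default_rng(0)
    W = np.linalg.qr(rng.standard_normal((w, d)))[0]   # orthonormal init
    for _ in range(n_iter):
        S = np.sign(images @ W)                  # fix signs of projections
        M = np.einsum('nhw,nhd->wd', images, S)  # M = sum_i A_i^T S_i
        # Maximize tr(M^T W) over orthonormal W: take the polar factor of M.
        U, _, Vt = np.linalg.svd(M, full_matrices=False)
        W = U @ Vt
    return W
```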

To overcome the aforementioned issues of robust feature selection and the sensitivity of the Frobenius norm, in this paper we present a novel formulation of PCA that combines subspace learning and feature selection in order to exclude the effect of redundant patterns and perform joint feature selection. We employ the Frobenius norm as the distance metric and seek the projection matrix by jointly minimizing the regularizer and penalty terms. We relax the orthogonality constraint on the transformation matrix and introduce another transformation that helps to jointly select important features and enhances robustness against outliers. To overcome the sensitivity caused by the squared Frobenius norm, we devise an efficient way to compute the F-norm. As a result, the proposed objective function not only weakens the effect of large distances but also has the rotational invariance property. The theoretical and empirical key contributions of this work are as follows:

  • We present outliers-robust two-dimensional principal component analysis by efficiently integrating the robustness of traditional 2DPCA with a regularization term ‖Q‖²_F that relaxes the orthogonality constraint.

  • The regularization term ‖Q‖²_F reduces the constraints and enables the objective function to select features jointly (see the sketch after this list). Furthermore, the term ‖Q‖²_F is convex and can be easily optimized.

  • To overcome the sensitivity of the F-norm to outliers, we derive the objective function in an efficient way.

  • The penalty term penalizes all regression coefficients corresponding to a single feature as a whole, which makes it possible for PCA to select features jointly. Hence, ORPCA approximates the high-dimensional representation in a flexible manner and has more freedom to select low-dimensional features efficiently.

  • One major drawback of the F-norm is its sensitivity to outliers: outlying measurements can arbitrarily skew the solution away from the desired one because of the squared objective function. As a result, the F-norm cannot exploit the underlying geometric structure in a real sense. To cope with the sensitivity caused by the squared F-norm, the non-squared F-norm has recently been used.

  • The proposed method is evaluated empirically on four benchmark datasets. Experimental evaluation (discriminant features, computational cost, and convergence analysis) shows considerable improvement in most cases, while the time complexity remains very attractive.
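
Since these snippets do not show the exact ORPCA objective, the following speculative sketch (referenced in the list above) only illustrates the two ingredients the contributions name: the convex ridge term ‖Q‖²_F on a relaxed, non-orthogonal transformation, and a row-wise ℓ2,1-style penalty whose zero rows drop whole features at once, i.e. joint feature selection. All names and shapes are illustrative assumptions.

```python
# Illustration of the two regularizers named in the contributions; this is
# NOT the paper's objective, only the building blocks it mentions.
import numpy as np

def frobenius_sq(Q):
    # ||Q||_F^2: sum of squared entries; convex and smooth in Q.
    return np.sum(Q ** 2)

def l21_norm(W):
    # ||W||_{2,1}: sum over rows of each row's l2 norm. Minimizing it drives
    # entire rows of W to zero, so a feature is kept or dropped as a whole.
    return np.linalg.norm(W, axis=1).sum()

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 3))
W[[2, 5, 7]] = 0.0            # zero rows: features 2, 5 and 7 are dropped
kept = np.flatnonzero(np.linalg.norm(W, axis=1) > 1e-12)
print(frobenius_sq(W), l21_norm(W), kept)
```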

The rest of the paper is organized as follows. In Section 2, we present basic notation and related work. In Section 3, we present the motivation, followed by the proposed objective function and its optimization. In Section 5, we provide detailed experimental evaluations. Finally, the conclusion is drawn in Section 6.

Section snippets

Related work

Recently, subspace-learning techniques have shown great performance and have been widely applied for high-dimensional data representation and classification. In recent years, researchers have proposed a number of methods to reduce the effect of outliers, and several variants have been presented in the literature. PCA is one of the most widely used dimensionality-reduction approaches. Unlike traditional PCA, two-dimensional PCA is based on two-dimensional image matrices rather than one-dimensional vectors…

Motivation

As discussed in Sections 1 (Introduction) and 2 (Related work), for the classification of high-dimensional noisy data it is always important to find salient features that belong to a specific part of the image. Since an outlier does not have a precise mathematical meaning, the RPCA problem is not yet well defined. Selecting important information while ignoring the redundant could help to improve feature selection. However, most PCA-based methods are sensitive…

Outliers robust 2DPCA

In this section, we present the outliers-robust dimensionality reduction approach (ORPCA) in detail. As described in earlier sections, the projection procedure involves all the original features, so it may also include irrelevant and redundant features that could degrade the performance of dimensionality reduction and, in turn, the classification performance. Furthermore, outliers strongly affect feature selection, which degrades classification performance. In this work, we…

Experimental results

To evaluate the performance of the proposed ORPCA, in this section we discuss and compare its performance on four commonly used image datasets: AR (Martínez & Kak, 2001), Yale B (Sim, Baker, & Bsat, 2002), ORL, and CMU PIE. We use the k-nearest-neighbor classifier (with k=1) for classification. The main contribution of this work is introducing joint feature selection in order to select useful features by effectively combining the robustness of traditional…
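
A sketch of the evaluation protocol just described, assuming feature matrices have already been produced by some learned projection: the 1-nearest-neighbor rule with Euclidean distance is spelled out, while data loading and the projection itself are placeholders.

```python
# 1-NN evaluation over projected feature matrices; labels are numpy arrays.
import numpy as np

def one_nn_accuracy(train_feats, train_labels, test_feats, test_labels):
    # Flatten each (h x d) feature matrix and use Euclidean distance.
    tr = train_feats.reshape(len(train_feats), -1)
    te = test_feats.reshape(len(test_feats), -1)
    d2 = ((te[:, None, :] - tr[None, :, :]) ** 2).sum(-1)  # pairwise dists
    pred = train_labels[d2.argmin(axis=1)]                 # 1-NN (k = 1)
    return float((pred == test_labels).mean())
```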

Discussion

We notice that matrix-based methods perform better than vector-based methods. The results show that the proposed ORPCA finds representative features in the high-dimensional space that are then used for classification. Unlike ℓ1-norm-based 2DPCA, ORPCA has the rotational invariance property and the freedom to jointly select important and contributive features, such as the nose, eyes, and lips in face images, and the contours of different objects in non-facial datasets. Traditional methods…

Conclusion

In this paper, we presented a robust dimensionality reduction method that relaxes the orthogonality constraints on the transformation matrix and imposes a penalty function on the regularization term. In contrast to previous work on robustness in PCA, we jointly select the important features. The introduction of the penalty function yields robustness against outliers by reducing their impact on the projection matrix. Compared with state-of-the-art methods, our evaluation results show improvement…

Acknowledgment

This work is partially supported by the Australian Research Council under Linkage Project LP170100891.

References (42)

  • Belhumeur, Peter N., Hespanha, João P., & Kriegman, David J. (1997). Eigenfaces vs. fisherfaces: Recognition using...
  • Chen, Yudong, et al. (2018). Nuclear norm based two-dimensional sparse principal component analysis. International Journal of Wavelets, Multiresolution and Information Processing.
  • Feng, Jiashi, Xu, Huan, & Yan, Shuicheng (2013). Online robust PCA via stochastic optimization. In: Advances in neural...
  • He, Xiaofei, & Niyogi, Partha (2004). Locality preserving projections. In: Advances in neural information processing...
  • Khan, Zohaib, et al. (2015). Joint group sparse PCA for compressed hyperspectral imaging. IEEE Transactions on Image Processing.
  • Li, Xuelong, et al. (2010). L1-norm-based 2DPCA. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics).
  • Lu, Haiping, et al. (2008). MPCA: Multilinear principal component analysis of tensor objects. IEEE Transactions on Neural Networks.
  • Luo, Minnan, et al. (2017). Avoiding optimal mean ℓ2,1-norm maximization-based robust PCA for reconstruction. Neural Computation.
  • Martinez, Aleix M. (1998). The AR face database. CVC Technical Report.
  • Martínez, Aleix M., et al. (2001). PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Netrapalli, Praneeth, Niranjan, U. N., Sanghavi, Sujay, Anandkumar, Animashree, & Jain, Prateek (2014). Non-convex...