Design of multi-view based email classification for IoT systems via semi-supervised learning

doi:10.1016/j.jnca.2018.12.002

Journal of Network and Computer Applications

Volume 128, 15 February 2019, Pages 56-63

https://doi.org/10.1016/j.jnca.2018.12.002

Abstract

Suspicious emails, which aim to induce users to click and then redirect them to a phishing webpage, are a major threat to Internet of Things (IoT) security. To protect IoT systems, email classification is an essential mechanism for separating spam from legitimate emails. In the literature, most email classification approaches adopt supervised learning algorithms that require a large amount of labeled data for classifier training. However, data labeling is time consuming and expensive, so only a very small set of labeled data is available in practice, which greatly degrades the effectiveness of email classification. To mitigate this problem, we develop an email classification approach based on multi-view disagreement-based semi-supervised learning. The underlying idea is that a multi-view method can offer richer information for classification, which is often ignored in the literature, while semi-supervised learning can leverage both labeled and unlabeled data. In the evaluation, we investigate the performance of our proposed approach with two datasets and in a real network environment. Experimental results demonstrate that multi-view data yields more accurate email classification than single-view data, and that our approach is more effective than several existing similar algorithms.

Introduction

The Internet of Things (IoT) represents a network of physical objects containing embedded technologies to sense, communicate and interact with their internal states or the external environment through Internet connections. With the rapid development of the Internet, email has emerged as an effective and essential way to exchange information within various IoT environments. However, due to the rapid increase of IoT devices and nodes, spam or junk emails have become an annoying issue for Internet Service Providers (ISPs) as well as a major threat to IoT security (Islam and Xiang, 2010; Wang et al., 2018; Zhang et al., 2018). These suspicious emails can cause various security and privacy issues if they are not detected and handled in time; e.g., spammers could send phishing content as HTML mail, which can carry embedded malicious code or attachments that contain macro viruses. The main goal of spam emails is to redirect recipients to pre-built phishing websites that induce users to input their credentials, or that automatically infer and collect personal information (Peng et al., 2017a; Shinder, 2013). As a result, there is a great need for an appropriate security mechanism to classify emails and detect malicious content (Liu et al., 2018c; Zhang et al., 2015).

In the literature, many supervised machine learning algorithms have been studied to build an email classification system, such as Naive Bayes (Marsono et al., 2006), decision tree (Shi et al., 2012), k-nearest neighbor (Firte et al., 2010) and support vector machine (SVM) (Amayri and Bouguila, 2010). Although these supervised methods reported good results in spam identification, they still suffer from several issues in a practical scenario.

• Demand for diverse labeled data. Typically, supervised email classification systems require a large amount of labeled data (or instances) for classifier training. In other words, numerous training examples with ground-truth labels must be given in advance. However, in a practical environment only a very small proportion of the data is labeled, while most of it remains unlabeled.
• Heavy burden of expert (human) labeling. Extensive human effort is needed to label data items for training a supervised learning algorithm. Due to the high cost of expert labeling, it is very difficult to obtain enough labeled data for classifier training, which significantly hinders the development of supervised email classification systems.
• Hard to handle unseen data. In addition, it is very hard to establish an accurate profile for supervised email classification systems, as the amount of labeled data is often limited. Nowadays, spammers often manipulate an email to bypass a known email system, so that the content and structure of spam emails may be quite different from the emails that were used to train a classifier. Therefore, a traditional supervised email classification system cannot detect ‘zero-day’ emails without appropriate training.

Contributions. Motivated by the challenges above, in this work we focus on email classification and propose an effective approach that combines multi-view data with disagreement-based semi-supervised learning. First, we investigate the impact of multi-view data on email classification, which is often ignored in the literature. Then, we apply disagreement-based semi-supervised learning to enhance the performance of spam detection by leveraging both labeled and unlabeled data. The contributions of this work can be summarized as follows:

• In this work, we develop an email classification model based on both multi-view data and semi-supervised learning, which adopts two feature sets: internal feature set (IFS) and external feature set (EFS). The former contains features that are related to email text (or body), while the latter mainly contains features that are related to routing and forwarding.
• In addition, we revise and deploy a disagreement-based semi-supervised learning algorithm to automatically leverage both labeled and unlabeled data during email classification. This algorithm can make a label decision by means of either “Average of Probabilities” or “Majority Voting”. These two methods were also compared in the evaluation part.
• To investigate the performance, we first evaluated our proposed classification approach with two datasets: a public dataset and a real (private) dataset, respectively. Then we collaborated with an IT organization and evaluated our approach in a real network environment. Experimental results indicate that our approach can achieve better classification performance as compared to several similar algorithms.
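
The two label-decision rules named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.5 threshold, the tie-breaking rule and the example probabilities are all assumptions made for illustration. Each classifier is assumed to output an estimated probability that an email is spam.

```python
from collections import Counter
from typing import Sequence


def average_of_probabilities(spam_probs: Sequence[float], threshold: float = 0.5) -> str:
    """Average the classifiers' spam probabilities, then threshold the mean."""
    if not spam_probs:
        raise ValueError("need at least one classifier output")
    mean = sum(spam_probs) / len(spam_probs)
    return "spam" if mean >= threshold else "legitimate"


def majority_voting(spam_probs: Sequence[float], threshold: float = 0.5) -> str:
    """Let each classifier cast one hard vote, then take the majority."""
    if not spam_probs:
        raise ValueError("need at least one classifier output")
    votes = Counter("spam" if p >= threshold else "legitimate" for p in spam_probs)
    # Assumption: a tie is resolved in favour of "legitimate".
    return "spam" if votes["spam"] > votes["legitimate"] else "legitimate"


# Hypothetical outputs of three classifiers for one email.
probs = [0.95, 0.40, 0.45]
by_average = average_of_probabilities(probs)  # mean is about 0.6
by_vote = majority_voting(probs)              # one spam vote against two
```

With these hypothetical numbers the two rules disagree: one very confident classifier pulls the average above the threshold, while the majority of hard votes still says legitimate. This is one reason the two rules are worth comparing empirically.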

The remaining parts are organised as follows. In Section 2, we review related research studies regarding the application of machine learning in email classification. Section 3 describes our proposed email classification approach, including how to construct a multi-view dataset and how the disagreement-based semi-supervised learning algorithm works. Section 4 presents the experimental settings and analyzes the evaluation results. Finally, we conclude our work in Section 5.

Section snippets

Related work

Email classification is considered to be one promising and commonly adopted method to detect spam emails (e.g., in mobile social networks (Peng et al., 2017b)). Many machine learning algorithms have been studied to distinguish the suspicious emails from the legitimate ones, e.g., supervised learning algorithms and semi-supervised learning algorithms.

Supervised learning algorithms. In the literature, numerous supervised machine learning algorithms have been studied, such as Naive Bayes, decision

Our proposed approach

In this section, we detail the proposed email classification model, including how to construct multi-view data and how the disagreement-based semi-supervised learning algorithm works.
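
Since this excerpt does not reproduce the algorithm itself, the following is only a generic co-training-style sketch of how two views can teach each other from unlabeled data; it should not be read as the paper's revised algorithm. The nearest-centroid base learner, the feature values, the 0.75 confidence threshold and all names (`CentroidClassifier`, `co_train`, `classify`) are hypothetical. Labels are 1 for spam, 0 for legitimate and `None` for unlabeled.

```python
from typing import List, Optional

Vector = List[float]


class CentroidClassifier:
    """Tiny nearest-centroid learner standing in for any base classifier."""

    def fit(self, X: List[Vector], y: List[int]) -> "CentroidClassifier":
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def spam_probability(self, x: Vector) -> float:
        # Distance to each class centroid; closer to spam gives a higher score.
        d = {label: sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5
             for label, c in self.centroids.items()}
        total = d[0] + d[1]
        return 0.5 if total == 0 else d[0] / total


def fit_view(view: List[Vector], labels: List[Optional[int]]) -> CentroidClassifier:
    """Train one view's classifier on the emails that currently have a label."""
    known = [(x, t) for x, t in zip(view, labels) if t is not None]
    return CentroidClassifier().fit([x for x, _ in known], [t for _, t in known])


def co_train(ifs, efs, labels, rounds=5, confidence=0.75):
    """Each view pseudo-labels confident unlabeled emails for the other view."""
    labels_ifs, labels_efs = list(labels), list(labels)
    for _ in range(rounds):
        clf_ifs, clf_efs = fit_view(ifs, labels_ifs), fit_view(efs, labels_efs)
        changed = False
        for i in range(len(labels)):
            p_ifs = clf_ifs.spam_probability(ifs[i])
            p_efs = clf_efs.spam_probability(efs[i])
            if labels_efs[i] is None and max(p_ifs, 1 - p_ifs) >= confidence:
                labels_efs[i] = int(p_ifs >= 0.5)
                changed = True
            if labels_ifs[i] is None and max(p_efs, 1 - p_efs) >= confidence:
                labels_ifs[i] = int(p_efs >= 0.5)
                changed = True
        if not changed:
            break
    # Refit so the returned classifiers reflect the final pseudo-labels.
    return fit_view(ifs, labels_ifs), fit_view(efs, labels_efs), labels_ifs, labels_efs


def classify(clf_ifs, clf_efs, x_ifs: Vector, x_efs: Vector) -> int:
    """Combine the two views by averaging their spam probabilities."""
    mean = (clf_ifs.spam_probability(x_ifs) + clf_efs.spam_probability(x_efs)) / 2
    return int(mean >= 0.5)


# Toy data: IFS = [link count, suspicious-word ratio], EFS = [hop count, sender mismatch].
ifs_view = [[5.0, 0.9], [0.0, 0.1], [4.0, 0.8], [1.0, 0.2]]
efs_view = [[8.0, 1.0], [2.0, 0.0], [7.0, 1.0], [3.0, 0.0]]
initial_labels = [1, 0, None, None]

clf_ifs, clf_efs, final_ifs, final_efs = co_train(ifs_view, efs_view, initial_labels)
prediction = classify(clf_ifs, clf_efs, [4.5, 0.85], [7.5, 1.0])
```

In this sketch each view keeps its own label pool, so a pseudo-label produced by the IFS classifier is only ever used to train the EFS classifier and vice versa. That is one simple way to keep the two learners from reinforcing their own mistakes.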

Evaluation

In this section, we evaluate our proposed email classification model using two datasets (a public dataset and a real dataset) and in a real network environment. The two datasets are used to investigate the performance of the disagreement-based learning algorithm and the impact of multi-view data. The evaluation in a real network environment aims to explore the real performance of our approach. Below are the metrics adopted in the evaluation.

• Area under an ROC curve (AUC). This is an important
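
As a reminder of what this metric measures, the sketch below computes AUC from its standard pairwise definition: the probability that a randomly chosen spam email receives a higher score than a randomly chosen legitimate one, with ties counted as one half. The function name and the example scores are illustrative assumptions, not values from the paper.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: fraction of (spam, legitimate) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical spam scores for two spam (1) and two legitimate (0) emails.
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Here three of the four spam/legitimate pairs are ordered correctly, giving 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.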

Conclusion

Suspicious emails are a major threat to IoT security. To mitigate this issue, email classification is one basic and important solution. In the literature, many supervised learning classifiers have been studied; however, several challenges remain unsolved in practice, such as the requirement for a large amount of labeled data, the heavy burden of expert labeling and the difficulty of handling unseen data.

In this work, we developed an effective email classification model for IoT systems, by combining both

Acknowledgement

Dr. Meng was partially supported by H2020 SU-ICT-03-2018 CyberSec4Europe.

References (71)

X. Chen et al. Multi-view dimensionality reduction based on Universum learning. Neurocomputing (2018)
E.M. El-Alfy et al. Using GMDH-based networks for improved spam detection and email feature analysis. Appl. Soft Comput. (2011)
L. Jiang et al. A trust-based collaborative filtering algorithm for E-commerce recommendation system. J. Ambient Intell. Human. Comput. (2018)
W. Li et al. Enhancing collaborative intrusion detection networks against insider attacks using supervised intrusion sensitivity-based trust management model. J. Netw. Comput. Appl. (2017)
Y. Li et al. Distance metric optimization driven convolutional neural network for age invariant face recognition. Pattern Recogn. (2018)
Y. Liu et al. Using contextual features and multi-view ensemble learning in product defect identification from online discussion forums. Decis. Support Syst. (2018)
C. Lopes et al. Symbiotic filtering for spam email detection. Expert Syst. Appl. (2011)
W. Meng et al. EFM: enhancing the performance of signature-based network intrusion detection systems using enhanced filter mechanism. Comput. Secur. (2014)
W. Meng et al. TouchWB: touch behavioral user authentication based on web browsing on smartphones. J. Netw. Comput. Appl. (2018)
T. Ouyang et al. A large-scale empirical analysis of email spam detection through network characteristics in a stand-alone enterprise. Comput. Network. (2014)
S. Peng et al. Social influence modeling using information theory in mobile social networks. Inf. Sci. (2017)
T. Peng et al. Collaborative trajectory privacy preserving scheme in location-based services. Inf. Sci. (2017)
J. Tang et al. Multi-view learning based on nonparallel support vector machine. Knowl.-Based Syst. (2018)
C. Wang et al. A novel security scheme based on instant encrypted transmission for Internet of Things. Secur. Commun. Network. (2018)
Y. Zhang et al. Binary PSO with mutation operator for feature selection using decision tree applied to spam detection. Knowl.-Based Syst. (2014)
O. Amayri et al. A study of spam filtering using support vector machines. Artif. Intell. Rev. (2010)
A. Blum et al. Combining labeled and unlabeled data with co-training.
G. Caruana et al. A survey of emerging approaches to spam filtering. ACM Comput. Surv. (2008)
V. Cheng et al. Personalized spam filtering with semi-supervised classifier ensemble.
V. Cheng et al. Combining supervised and semi-supervised classifier for personalized spam filtering.
H. Drucker et al. Support vector machines for spam categorization. IEEE Trans. Neural Network. (1999)
L. Firte et al. Spam detection filter using KNN algorithm and resampling.
D.M. Freeman. Using Naive Bayes to detect spammy names in social networks.
Y. Gao et al. Semi supervised image spam hunter: a regularized discriminant EM approach.
R. Islam et al. Email classification using data reduction method.
S. Kiritchenko et al. Email classification with temporal features. Proceed. Int. Intell. Inf. Syst. (IIS) (2004)
J. Kittler et al. On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell. (1998)
K.-Y. Lee et al. Stitching for multi-view videos with large parallax based on adaptive pixel warping. IEEE Access (2018)
W. Li et al. Towards designing an email classification system using multi-view based semi-supervised learning.
J. Li et al. Significant permission identification for machine-learning-based android malware detection. IEEE Trans. Ind. Inf. (2018)
Y. Liu et al. Finger vein secure biometric template generation based on deep learning. Soft Comput. (2018)
L. Liu et al. Detecting and preventing cyber insider threats: a survey. IEEE Commun. Surv. Tutor. (2018)
C.-H. Mao et al. Semi-supervised co-training and active learning based approach for multi-view intrusion detection.
M.N. Marsono et al. Binary LNS-based naive Bayes hardware classifier for spam control. Proc. IEEE Int. Symp. Circ. Syst. (2006)
S. Martin et al. Analyzing behaviorial features for email classification.


Wenjuan Li is currently a Ph.D. student in the Department of Computer Science, City University of Hong Kong (CityU), and is holding a visiting position at Department of Applied Mathematics and Computer Science, Technical University of Denmark (DTU), Denmark. Prior to this, she worked as Research Assistant in CityU HK and was previously a Lecturer in the Department of Computer Science, Zhaoqing Foreign Language College, China. She was a Winner of Cyber Quiz and Computer Security Competition, Final Round of Kaspersky Lab “Cyber Security for the Next Generation” Conference in 2014. Her research interests include network management and security, collaborative intrusion detection, spam detection, trust computing, web technology and E-commerce technology. She is a student member of IEEE.

Weizhi Meng is currently an assistant professor in the Cyber Security Section, Department of Applied Mathematics and Computer Science, Technical University of Denmark (DTU), Denmark. He obtained his Ph.D. degree in Computer Science from the City University of Hong Kong (CityU), Hong Kong. Prior to joining DTU, he worked as a research scientist in the Infocomm Security (ICS) Department, Institute for Infocomm Research, A*STAR, Singapore, and as a senior research associate in the CS Department, CityU. He won the Outstanding Academic Performance Award during his doctoral study, and is a recipient of the Hong Kong Institution of Engineers (HKIE) Outstanding Paper Award for Young Engineers/Researchers in both 2014 and 2017. He is also a recipient of the Best Paper Award from ISPEC 2018 and the Best Student Paper Award from NSS 2016. His primary research interests are cyber security and intelligent technology in security, including intrusion detection, smartphone security, biometric authentication, HCI security, trust computing, blockchain in security, and malware analysis. He has served as a program committee member for 20+ international conferences. He has been or will be a co-PC chair for IEEE Blockchain 2018, IEEE ATC 2019, IFIPTM 2019 and Socialsec 2019. He has also served as guest editor for FGCS, JISA, Sensors, CAEE, IJDSN, SCN, WCNC, etc. He is a member of IEEE.

Zhiyuan Tan is a Lecturer in Cybersecurity at the School of Computing, Edinburgh Napier University (ENU), United Kingdom. He is a Member of IEEE and EAI. His research interests include cybersecurity, machine learning, pattern recognition, data analytics, virtualisation and cyber-physical system. Prior to joining ENU in 2016, Dr Tan held different research positions at three research intensive universities, respectively. He was a Postdoctoral Researcher in Cybersecurity at the University of Twente (UT), the Netherlands from 2014 to 2016; a Research Associate at the University of Technology, Sydney (UTS), Australia in 2014; and a Senior Research Assistant at La Trobe University, Australia in 2013. He serves on the editorial board of International Journal of Computer Sciences and its Applications. He is Associate Editor of IEEE Access and has organised Special Issues for International Journal of Distributed Sensor Networks, Computers & Electrical Engineering, IEEE Access, etc.

Yang Xiang received his PhD in Computer Science from Deakin University, Australia. He is the Dean of Digital Research & Innovation Capability Platform, Swinburne University of Technology, Australia. His research interests include cyber security, which covers network and system security, data analytics, distributed systems, and networking. In particular, he is currently leading his team developing active defense systems against large-scale distributed network attacks. He is the Chief Investigator of several projects in network and system security, funded by the Australian Research Council (ARC). He has published more than 200 research papers in many international journals and conferences, such as IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Information Security and Forensics, and IEEE Journal on Selected Areas in Communications. He served as the Associate Editor of IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, Security and Communication Networks (Wiley), and the Editor of Journal of Network and Computer Applications. He is the Coordinator, Asia for IEEE Computer Society Technical Committee on Distributed Processing (TCDP). He is a Senior Member of the IEEE.

^☆: A preliminary version of this paper appears in Proceedings of the 13th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pp. 174–181, 2014 (Li et al., 2014).
