Design of multi-view based email classification for IoT systems via semi-supervised learning☆
Introduction
The Internet of Things (IoT) is a network of physical objects with embedded technologies that sense, communicate and interact with their internal states or the external environment through Internet connections. With the rapid development of the Internet, email has emerged as an effective and essential way to exchange information within various IoT environments. However, due to the rapid increase of IoT devices and nodes, spam or junk emails have become an annoying issue for Internet Service Providers (ISPs) as well as a big threat to IoT security (Islam and Xiang, 2010; Wang et al., 2018; Zhang et al., 2018). These suspicious emails can cause various security and privacy issues if they are not detected and handled in time; for example, spammers could send phishing content as HTML mail, which can carry embedded malicious code or attachments that contain macro viruses. The main goal of spam emails is to redirect recipients to pre-built phishing websites that induce users to enter their credentials, or that automatically infer and collect personal information (Peng et al., 2017a; Shinder, 2013). As a result, there is a great need for an appropriate security mechanism to classify emails and detect malicious content (Liu et al., 2018c; Zhang et al., 2015).
In the literature, many supervised machine learning algorithms have been studied to build an email classification system, such as Naive Bayes (Marsono et al., 2006), decision tree (Shi et al., 2012), k-nearest neighbor (Firte et al., 2010) and support vector machine (SVM) (Amayri and Bouguila, 2010). Although these supervised methods reported good results in spam identification, they still suffer from several issues in a practical scenario.
- Demand for diverse labeled data. Typically, supervised email classification systems require a large amount of labeled data (or instances) for classifier training. In other words, numerous training examples with ground-truth labels must be given in advance. However, in a practical environment only a very small proportion of the data is labeled, while most of it remains unlabeled.
- Heavy burden of expert (human) labeling. Training a supervised learning algorithm demands extensive human effort to label data items. Due to the high cost of expert labeling, it is very difficult to obtain enough labeled data for classifier training, which significantly hinders the development of supervised email classification systems.
- Hard to handle unseen data. In addition, it is very hard to establish an accurate profile for supervised email classification systems, as the amount of labeled data is often small. Nowadays, spammers often manipulate an email to bypass a known email system; that is, the content and structure of spam emails may differ considerably from the emails used to train a classifier. Therefore, a traditional supervised email classification system cannot detect ‘zero-day’ emails without appropriate training.
Contributions. Motivated by the challenges above, in this work we focus on email classification and propose an effective approach that combines multi-view data with disagreement-based semi-supervised learning. First, we investigate the impact of multi-view data on email classification, which is often ignored in the literature. Then, we apply disagreement-based semi-supervised learning to enhance the performance of spam detection by leveraging both labeled and unlabeled data. The contributions of this work can be summarized as follows:
- We develop an email classification model based on both multi-view data and semi-supervised learning, which adopts two feature sets: an internal feature set (IFS) and an external feature set (EFS). The former contains features related to the email text (or body), while the latter mainly contains features related to routing and forwarding.
- We revise and deploy a disagreement-based semi-supervised learning algorithm to automatically leverage both labeled and unlabeled data during email classification. This algorithm can make a label decision by means of either “Average of Probabilities” or “Majority Voting”. These two methods are also compared in the evaluation.
- To investigate the performance, we first evaluated our proposed classification approach with two datasets: a public dataset and a real (private) dataset. Then we collaborated with an IT organization and evaluated our approach in a real network environment. Experimental results indicate that our approach can achieve better classification performance than several similar algorithms.
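The two label-decision rules named in the contributions can be illustrated with a minimal sketch. It assumes each classifier reports a spam probability for an email; the function names, the 0.5 threshold, and the tie-breaking rule are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter
from typing import Sequence

SPAM, HAM = "spam", "ham"


def average_of_probabilities(spam_probs: Sequence[float], threshold: float = 0.5) -> str:
    """Average the spam probabilities of all classifiers, then threshold once."""
    mean = sum(spam_probs) / len(spam_probs)
    return SPAM if mean >= threshold else HAM


def majority_voting(spam_probs: Sequence[float], threshold: float = 0.5,
                    tie_label: str = HAM) -> str:
    """Let each classifier cast one hard vote; the most frequent label wins."""
    votes = Counter(SPAM if p >= threshold else HAM for p in spam_probs)
    if votes[SPAM] == votes[HAM]:
        return tie_label  # assumption: ties fall back to the legitimate class
    return SPAM if votes[SPAM] > votes[HAM] else HAM


# Hypothetical outputs of three classifiers for one email.
probs = [0.9, 0.4, 0.45]
print(average_of_probabilities(probs))  # spam: the mean is about 0.58
print(majority_voting(probs))           # ham: two of three classifiers vote ham
```

The example shows why the two rules can disagree: one very confident classifier can pull the average over the threshold, while majority voting ignores how confident each vote is.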
The remaining parts are organised as follows. In Section 2, we review related research on the application of machine learning to email classification. Section 3 describes our proposed email classification approach, including how to construct a multi-view dataset and how the disagreement-based semi-supervised learning algorithm works. Section 4 presents the experimental settings and analyzes the evaluation results. Finally, we conclude our work in Section 5.
Related work
Email classification is considered a promising and commonly adopted method to detect spam emails (e.g., in mobile social networks (Peng et al., 2017b)). Many machine learning algorithms have been studied to distinguish suspicious emails from legitimate ones, including supervised learning algorithms and semi-supervised learning algorithms.
Supervised learning algorithms. In the literature, numerous supervised machine learning algorithms have been studied, such as Naive Bayes, decision
Our proposed approach
In this section, we detail the proposed email classification model, including how to construct multi-view data and how the disagreement-based semi-supervised learning algorithm works.
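As a rough illustration of how a disagreement-based learner can exploit two views, the following sketch implements a plain co-training-style loop, one common member of the disagreement-based family. It is not the revised algorithm of this paper: the toy nearest-centroid classifier, the function names, and the example features are all assumptions made for illustration. Each view's classifier pseudo-labels the unlabeled email it is most confident about and hands it to the classifier of the other view.

```python
from math import dist
from typing import List, Tuple

Vector = Tuple[float, ...]
Labeled = List[Tuple[Vector, int]]   # (features, label), 1 = spam, 0 = legitimate
Pool = List[Tuple[Vector, Vector]]   # unlabeled emails as (IFS view, EFS view)


def centroid(points: List[Vector]) -> Vector:
    """Component-wise mean; points must be non-empty."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def spam_probability(train: Labeled, x: Vector) -> float:
    """Toy nearest-centroid score in [0, 1]; a stand-in for any real classifier.

    Assumes train contains at least one example of each class.
    """
    d_spam = dist(x, centroid([v for v, y in train if y == 1]))
    d_ham = dist(x, centroid([v for v, y in train if y == 0]))
    total = d_spam + d_ham
    return 0.5 if total == 0 else d_ham / total


def teach(teacher: Labeled, student: Labeled, pool: Pool, view: int) -> None:
    """The teacher pseudo-labels its most confident email for the other view."""
    scores = [spam_probability(teacher, email[view]) for email in pool]
    best = max(range(len(pool)), key=lambda k: abs(scores[k] - 0.5))
    email = pool.pop(best)
    student.append((email[1 - view], int(scores[best] >= 0.5)))


def co_train(ifs: Labeled, efs: Labeled, unlabeled: Pool, rounds: int = 5):
    """Return enlarged training sets for the IFS and EFS classifiers."""
    ifs, efs, pool = list(ifs), list(efs), list(unlabeled)
    for _ in range(rounds):
        if pool:
            teach(ifs, efs, pool, view=0)  # IFS classifier labels for EFS
        if pool:
            teach(efs, ifs, pool, view=1)  # EFS classifier labels for IFS
    return ifs, efs


# Hypothetical one-dimensional features, chosen only for illustration.
ifs_seed = [((0.9,), 1), ((0.1,), 0)]   # e.g. share of suspicious words in the body
efs_seed = [((8.0,), 1), ((2.0,), 0)]   # e.g. number of relay hops
pool = [((0.85,), (7.0,)), ((0.3,), (3.0,))]
ifs_out, efs_out = co_train(ifs_seed, efs_seed, pool)
print(ifs_out)
print(efs_out)
```

In practice the toy scorer would be replaced by a real classifier per view, and the final label of a new email could be decided by combining the two classifiers' outputs.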
Evaluation
In this section, we evaluate our proposed email classification model using two datasets (a public dataset and a real dataset) and in a real network environment. The two datasets are used to investigate the performance of the disagreement-based learning algorithm and the impact of multi-view data. The evaluation in a real network environment aims to explore the practical performance of our approach. Below are the metrics adopted in the evaluation.
- Area under an ROC curve (AUC). This is an important
Conclusion
Suspicious emails are a big threat to IoT security. To mitigate this issue, email classification is a basic and important solution. In the literature, many supervised learning classifiers have been studied; however, several challenges remain unsolved in practice, such as the requirement for large amounts of labeled data, the heavy burden of expert labeling and the difficulty of handling unseen data.
In this work, we developed an effective email classification model for IoT systems, by combining both
Acknowledgement
Dr. Meng was partially supported by H2020 SU-ICT-03-2018 CyberSec4Europe.
References (71)
- et al. Multi-view dimensionality reduction based on Universum learning. Neurocomputing (2018)
- et al. Using GMDH-based networks for improved spam detection and email feature analysis. Appl. Soft Comput. (2011)
- et al. A trust-based collaborative filtering algorithm for E-commerce recommendation system. J. Ambient Intell. Human. Comput. (2018)
- et al. Enhancing collaborative intrusion detection networks against insider attacks using supervised intrusion sensitivity-based trust management model. J. Netw. Comput. Appl. (2017)
- et al. Distance metric optimization driven convolutional neural network for age invariant face recognition. Pattern Recogn. (2018)
- et al. Using contextual features and multi-view ensemble learning in product defect identification from online discussion forums. Decis. Support Syst. (2018)
- et al. Symbiotic filtering for spam email detection. Expert Syst. Appl. (2011)
- et al. EFM: enhancing the performance of signature-based network intrusion detection systems using enhanced filter mechanism. Comput. Secur. (2014)
- et al. TouchWB: touch behavioral user authentication based on web browsing on smartphones. J. Netw. Comput. Appl. (2018)
- et al. A large-scale empirical analysis of email spam detection through network characteristics in a stand-alone enterprise. Comput. Network. (2014)
- Social influence modeling using information theory in mobile social networks. Inf. Sci.
- Collaborative trajectory privacy preserving scheme in location-based services. Inf. Sci.
- Multi-view learning based on nonparallel support vector machine. Knowl.-Based Syst.
- A novel security scheme based on instant encrypted transmission for Internet of Things. Secur. Commun. Network.
- Binary PSO with mutation operator for feature selection using decision tree applied to spam detection. Knowl.-Based Syst.
- A study of spam filtering using support vector machines. Artif. Intell. Rev.
- Combining labeled and unlabeled data with co-training
- A survey of emerging approaches to spam filtering. ACM Comput. Surv.
- Personalized spam filtering with semi-supervised classifier ensemble
- Combining supervised and semi-supervised classifier for personalized spam filtering
- Support vector machines for spam categorization. IEEE Trans. Neural Network.
- Spam detection filter using KNN algorithm and resampling
- Using Naive Bayes to detect spammy names in social networks
- Semi supervised image spam hunter: a regularized discriminant EM approach
- Email classification using data reduction method
- Email classification with temporal features. Proceed. Int. Intell. Inf. Syst. (IIS)
- On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell.
- Stitching for multi-view videos with large parallax based on adaptive pixel warping. IEEE Access
- Towards designing an email classification system using multi-view based semi-supervised learning
- Significant permission identification for machine-learning-based android malware detection. IEEE Trans. Ind. Inf.
- Finger vein secure biometric template generation based on deep learning. Soft Comput.
- Detecting and preventing cyber insider threats: a survey. IEEE Commun. Surv. Tutor.
- Semi-supervised co-training and active learning based approach for multi-view intrusion detection
- Binary LNS-based naive Bayes hardware classifier for spam control. Proc. IEEE Int. Symp. Circ. Syst.
- Analyzing behavioral features for email classification
Wenjuan Li is currently a Ph.D. student in the Department of Computer Science, City University of Hong Kong (CityU), and is holding a visiting position at Department of Applied Mathematics and Computer Science, Technical University of Denmark (DTU), Denmark. Prior to this, she worked as Research Assistant in CityU HK and was previously a Lecturer in the Department of Computer Science, Zhaoqing Foreign Language College, China. She was a Winner of Cyber Quiz and Computer Security Competition, Final Round of Kaspersky Lab “Cyber Security for the Next Generation” Conference in 2014. Her research interests include network management and security, collaborative intrusion detection, spam detection, trust computing, web technology and E-commerce technology. She is a student member of IEEE.
Weizhi Meng is currently an assistant professor in the Cyber Security Section, Department of Applied Mathematics and Computer Science, Technical University of Denmark (DTU), Denmark. He obtained his Ph.D. degree in Computer Science from the City University of Hong Kong (CityU), Hong Kong. Prior to joining DTU, he worked as a research scientist in the Infocomm Security (ICS) Department, Institute for Infocomm Research, A*STAR, Singapore, and as a senior research associate in the CS Department, CityU. He won the Outstanding Academic Performance Award during his doctoral study, and is a recipient of the Hong Kong Institution of Engineers (HKIE) Outstanding Paper Award for Young Engineers/Researchers in both 2014 and 2017. He is also a recipient of the Best Paper Award from ISPEC 2018 and the Best Student Paper Award from NSS 2016. His primary research interests are cyber security and intelligent technology in security, including intrusion detection, smartphone security, biometric authentication, HCI security, trust computing, blockchain in security, and malware analysis. He has served as a program committee member for 20+ international conferences. He has been or will be a co-PC chair for IEEE Blockchain 2018, IEEE ATC 2019, IFIPTM 2019 and Socialsec 2019. He has also served as a guest editor for FGCS, JISA, Sensors, CAEE, IJDSN, SCN, WCNC, etc. He is a member of IEEE.
Zhiyuan Tan is a Lecturer in Cybersecurity at the School of Computing, Edinburgh Napier University (ENU), United Kingdom. He is a Member of IEEE and EAI. His research interests include cybersecurity, machine learning, pattern recognition, data analytics, virtualisation and cyber-physical system. Prior to joining ENU in 2016, Dr Tan held different research positions at three research intensive universities, respectively. He was a Postdoctoral Researcher in Cybersecurity at the University of Twente (UT), the Netherlands from 2014 to 2016; a Research Associate at the University of Technology, Sydney (UTS), Australia in 2014; and a Senior Research Assistant at La Trobe University, Australia in 2013. He serves on the editorial board of International Journal of Computer Sciences and its Applications. He is Associate Editor of IEEE Access and has organised Special Issues for International Journal of Distributed Sensor Networks, Computers & Electrical Engineering, IEEE Access, etc.
Yang Xiang received his PhD in Computer Science from Deakin University, Australia. He is the Dean of Digital Research & Innovation Capability Platform, Swinburne University of Technology, Australia. His research interests include cyber security, which covers network and system security, data analytics, distributed systems, and networking. In particular, he is currently leading his team developing active defense systems against large-scale distributed network attacks. He is the Chief Investigator of several projects in network and system security, funded by the Australian Research Council (ARC). He has published more than 200 research papers in many international journals and conferences, such as IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Information Security and Forensics, and IEEE Journal on Selected Areas in Communications. He served as the Associate Editor of IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, Security and Communication Networks (Wiley), and the Editor of Journal of Network and Computer Applications. He is the Coordinator, Asia for IEEE Computer Society Technical Committee on Distributed Processing (TCDP). He is a Senior Member of the IEEE.
☆ A preliminary version of this paper appears in Proceedings of the 13th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pp. 174–181, 2014 (Li et al., 2014).