Design of multi-view based email classification for IoT systems via semi-supervised learning

https://doi.org/10.1016/j.jnca.2018.12.002

Abstract

Suspicious emails, which aim to induce users to click and then redirect them to a phishing webpage, are a major threat to Internet of Things (IoT) security. To protect IoT systems, email classification is an essential mechanism for separating spam from legitimate emails. In the literature, most email classification approaches adopt supervised learning algorithms that require a large amount of labeled data for classifier training. However, data labeling is time consuming and expensive, so only a very small set of labeled data is available in practice, which greatly degrades the effectiveness of email classification. To mitigate this problem, in this work, we develop an email classification approach based on multi-view disagreement-based semi-supervised learning. The underlying idea is that a multi-view method can offer richer information for classification, which is often ignored in the literature, while semi-supervised learning can leverage both labeled and unlabeled data. In the evaluation, we investigate the performance of our proposed approach with two datasets and in a real network environment. Experimental results demonstrate that multi-view data yields more accurate email classification than single-view data, and that our approach is more effective than several existing similar algorithms.

Introduction

The Internet of Things (IoT) is a network of physical objects containing embedded technologies to sense, communicate and interact with their internal states or the external environment through Internet connections. With the rapid development of the Internet, email has emerged as an effective and essential way to exchange information within various IoT environments. However, due to the rapid increase of IoT devices and nodes, spam or junk emails have become an annoying issue for Internet Service Providers (ISPs) as well as a major threat to IoT security (Islam and Xiang, 2010; Wang et al., 2018; Zhang et al., 2018). These suspicious emails can cause various security and privacy issues if they are not detected and handled in time. For example, spammers could send phishing content as HTML mail, which can carry embedded malicious code or attachments that contain macro viruses. The main goal of spam emails is to redirect recipients to pre-built phishing websites that induce users to input their credentials, or that automatically infer and collect personal information (Peng et al., 2017a; Shinder, 2013). As a result, there is a great need for an appropriate security mechanism to classify emails and detect malicious content (Liu et al., 2018c; Zhang et al., 2015).

In the literature, many supervised machine learning algorithms have been studied to build an email classification system, such as Naive Bayes (Marsono et al., 2006), decision tree (Shi et al., 2012), k-nearest neighbor (Firte et al., 2010) and support vector machine (SVM) (Amayri and Bouguila, 2010). Although these supervised methods reported good results in spam identification, they still suffer from several issues in a practical scenario.

  • Demand for diverse labeled data. Typically, supervised email classification systems require a large number of labeled instances for classifier training. In other words, numerous training examples with ground-truth labels must be given in advance. However, in a practical environment only a very small proportion of the data is labeled, while most of it remains unlabeled.

  • Heavy burden of expert (human) labeling. Human efforts are extensively demanded for labeling data items to train a supervised learning algorithm. However, due to the high cost of expert labeling, it is very difficult to obtain enough labeled data for classifier training, which significantly hinders the development of supervised email classification systems.

  • Hard to handle unseen data. In addition, it is very hard to establish an accurate profile for supervised email classification systems, as the amount of labeled data is often limited. Nowadays, spammers often manipulate an email to bypass a known email system, so the content and structure of spam emails may be quite different from the emails that were used to train a classifier. Therefore, a traditional supervised email classification system cannot detect ‘zero-day’ emails without appropriate training.

Contributions. Motivated by the challenges above, in this work, we focus on email classification and propose an effective approach that combines multi-view data with disagreement-based semi-supervised learning. First, we investigate the impact of multi-view data on email classification, which is often ignored in the literature. Then, we apply disagreement-based semi-supervised learning to enhance the performance of spam detection by leveraging both labeled and unlabeled data. The contributions of this work can be summarized as follows:

  • In this work, we develop an email classification model based on both multi-view data and semi-supervised learning, which adopts two feature sets: internal feature set (IFS) and external feature set (EFS). The former contains features that are related to email text (or body), while the latter mainly contains features that are related to routing and forwarding.

  • In addition, we revise and deploy a disagreement-based semi-supervised learning algorithm to automatically leverage both labeled and unlabeled data during email classification. This algorithm can make a label decision by means of either “Average of Probabilities” or “Majority Voting”. These two methods were also compared in the evaluation part.

  • To investigate the performance, we first evaluated our proposed classification approach with two datasets: a public dataset and a real (private) dataset, respectively. Then we collaborated with an IT organization and evaluated our approach in a real network environment. Experimental results indicate that our approach can achieve better classification performance as compared to several similar algorithms.
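The two label-decision rules named above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the label order, the function names and the example probabilities are all assumptions made for illustration, and the sketch only shows how averaging class probabilities and counting votes can lead to different decisions for the same set of classifier outputs.

```python
from collections import Counter
from typing import Sequence

# Hypothetical label order for this sketch: index 0 = legitimate, index 1 = spam.
LABELS = ("legitimate", "spam")


def _argmax(values: Sequence[float]) -> int:
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda k: values[k])


def average_of_probabilities(probs: Sequence[Sequence[float]]) -> str:
    """Average each class probability over all classifiers, then pick the top class."""
    n = len(probs)
    mean = [sum(p[k] for p in probs) / n for k in range(len(LABELS))]
    return LABELS[_argmax(mean)]


def majority_voting(probs: Sequence[Sequence[float]]) -> str:
    """Let each classifier vote for its own top class, then pick the most common vote.

    Ties go to the label that was voted for first, which is an arbitrary
    choice made for this sketch.
    """
    votes = Counter(LABELS[_argmax(p)] for p in probs)
    return votes.most_common(1)[0][0]


# Made-up outputs of three classifiers for one email: [P(legitimate), P(spam)].
example = [[0.45, 0.55], [0.40, 0.60], [0.95, 0.05]]
print(average_of_probabilities(example))  # legitimate (mean P(legitimate) is about 0.6)
print(majority_voting(example))           # spam (two of three classifiers vote spam)
```

In this made-up case one very confident classifier outweighs two weakly confident ones under averaging but is outvoted under majority voting, which is one reason the two rules are worth comparing empirically.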

The remaining parts are organized as follows. In Section 2, we review related research studies regarding the application of machine learning in email classification. Section 3 describes our proposed email classification approach, including how to construct a multi-view dataset and how the disagreement-based semi-supervised learning algorithm works. Section 4 presents the experimental settings and analyzes the evaluation results. Finally, we conclude our work in Section 5.

Related work

Email classification is considered to be one promising and commonly adopted method to detect spam emails (e.g., in mobile social networks (Peng et al., 2017b)). Many machine learning algorithms have been studied to distinguish the suspicious emails from the legitimate ones, e.g., supervised learning algorithms and semi-supervised learning algorithms.

Supervised learning algorithms. In the literature, numerous supervised machine learning algorithms have been studied, such as Naive Bayes, decision

Our proposed approach

In this section, we detail the proposed email classification model, including how to construct multi-view data and how the disagreement-based semi-supervised learning algorithm works.
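As a rough illustration of how two views and unlabeled data can interact, the following Python sketch implements a generic co-training-style loop, the classic form of disagreement-based semi-supervised learning. It is not the revised algorithm deployed in this work: the nearest-centroid classifier, the confidence score, the threshold and the toy feature values are all assumptions chosen to keep the example self-contained. One classifier is trained per view (IFS and EFS), and whenever a classifier is confident about an unlabeled email, its predicted label is handed to the other view's training pool.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


class CentroidClassifier:
    """Tiny stand-in classifier (an assumption of this sketch): nearest class centroid."""

    def fit(self, X: List[Vector], y: List[str]) -> "CentroidClassifier":
        self.centroids: Dict[str, List[float]] = {}
        for label in sorted(set(y)):  # sorted for a deterministic order
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_with_confidence(self, x: Vector) -> Tuple[str, float]:
        dists = {label: sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5
                 for label, c in self.centroids.items()}
        label = min(dists, key=dists.get)
        total = sum(dists.values())
        # Heuristic confidence: 1.0 on a centroid, 0.5 halfway between two centroids.
        conf = 1.0 - dists[label] / total if total > 0 else 1.0
        return label, conf


def co_train(ifs_l, efs_l, y_l, ifs_u, efs_u, rounds=5, threshold=0.8):
    """Generic co-training over two views; returns one classifier per view."""
    pool_a = (list(ifs_l), list(y_l))  # training pool for the IFS-view classifier
    pool_b = (list(efs_l), list(y_l))  # training pool for the EFS-view classifier
    unlabeled = list(range(len(ifs_u)))
    clf_a = CentroidClassifier().fit(*pool_a)
    clf_b = CentroidClassifier().fit(*pool_b)
    for _ in range(rounds):
        moved = set()
        for i in unlabeled:
            label_a, conf_a = clf_a.predict_with_confidence(ifs_u[i])
            label_b, conf_b = clf_b.predict_with_confidence(efs_u[i])
            if conf_a >= threshold and conf_a >= conf_b:
                # The IFS view is the more confident one: it teaches the EFS view.
                pool_b[0].append(efs_u[i])
                pool_b[1].append(label_a)
                moved.add(i)
            elif conf_b >= threshold:
                # The EFS view is confident: it teaches the IFS view.
                pool_a[0].append(ifs_u[i])
                pool_a[1].append(label_b)
                moved.add(i)
        if not moved:
            break  # no confident prediction left, so stop early
        unlabeled = [i for i in unlabeled if i not in moved]
        clf_a = CentroidClassifier().fit(*pool_a)
        clf_b = CentroidClassifier().fit(*pool_b)
    return clf_a, clf_b


# Toy data with made-up feature values: two labeled and three unlabeled emails.
ifs_l = [[0.9, 0.8], [0.1, 0.2]]   # hypothetical body-text features
efs_l = [[5.0], [1.0]]             # hypothetical routing feature
y_l = ["spam", "legitimate"]
ifs_u = [[0.85, 0.9], [0.5, 0.5], [0.15, 0.1]]
efs_u = [[4.5], [4.8], [1.2]]

clf_ifs, clf_efs = co_train(ifs_l, efs_l, y_l, ifs_u, efs_u)
print(clf_ifs.predict_with_confidence([0.6, 0.6])[0])
print(clf_efs.predict_with_confidence([1.5])[0])
```

In the toy data, the second unlabeled email is ambiguous in the body-text view but clear in the routing view, so the routing-view classifier supplies its label to the body-text classifier. This is the kind of complementary information that a multi-view setup is meant to exploit.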

Evaluation

In this section, we evaluate our proposed email classification model using two datasets (a public dataset and a real dataset) and in a real network environment. The two datasets are used to investigate the performance of the disagreement-based learning algorithm and the impact of multi-view data. The evaluation in a real network environment explores how our approach performs in practice. Below are the metrics adopted in the evaluation.

  • Area under an ROC curve (AUC). This is an important

Conclusion

Suspicious emails are a big threat to IoT security. To mitigate this issue, email classification is one basic and important solution. In the literature, many supervised learning classifiers have been studied; however, several challenges remain unsolved in practice, such as the requirement for a large amount of labeled data, the heavy burden of expert labeling and the difficulty of handling unseen data.

In this work, we developed an effective email classification model for IoT systems, by combining both

Acknowledgement

Dr. Meng was partially supported by H2020 SU-ICT-03-2018 CyberSec4Europe.

References (71)

  • S. Peng et al. Social influence modeling using information theory in mobile social networks. Inf. Sci. (2017)
  • T. Peng et al. Collaborative trajectory privacy preserving scheme in location-based services. Inf. Sci. (2017)
  • J. Tang et al. Multi-view learning based on nonparallel support vector machine. Knowl.-Based Syst. (2018)
  • C. Wang et al. A novel security scheme based on instant encrypted transmission for Internet of Things. Secur. Commun. Network. (2018)
  • Y. Zhang et al. Binary PSO with mutation operator for feature selection using decision tree applied to spam detection. Knowl.-Based Syst. (2014)
  • O. Amayri et al. A study of spam filtering using support vector machines. Artif. Intell. Rev. (2010)
  • A. Blum et al. Combining labeled and unlabeled data with co-training
  • G. Caruana et al. A survey of emerging approaches to spam filtering. ACM Comput. Surv. (2008)
  • V. Cheng et al. Personalized spam filtering with semi-supervised classifier ensemble
  • V. Cheng et al. Combining supervised and semi-supervised classifier for personalized spam filtering
  • H. Drucker et al. Support vector machines for spam categorization. IEEE Trans. Neural Network. (1999)
  • L. Firte et al. Spam detection filter using KNN algorithm and resampling
  • D.M. Freeman. Using Naive Bayes to detect spammy names in social networks
  • Y. Gao et al. Semi supervised image spam hunter: a regularized discriminant EM approach
  • R. Islam et al. Email classification using data reduction method
  • S. Kiritchenko et al. Email classification with temporal features. Proceed. Int. Intell. Inf. Syst. (IIS) (2004)
  • J. Kittler et al. On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell. (1998)
  • K.-Y. Lee et al. Stitching for multi-view videos with large parallax based on adaptive pixel warping. IEEE Access (2018)
  • W. Li et al. Towards designing an email classification system using multi-view based semi-supervised learning
  • J. Li et al. Significant permission identification for machine-learning-based android malware detection. IEEE Trans. Ind. Inf. (2018)
  • Y. Liu et al. Finger vein secure biometric template generation based on deep learning. Soft Comput. (2018)
  • L. Liu et al. Detecting and preventing cyber insider threats: a survey. IEEE Commun. Surv. Tutor. (2018)
  • C.-H. Mao et al. Semi-supervised co-training and active learning based approach for multi-view intrusion detection
  • M.N. Marsono et al. Binary LNS-based naive Bayes hardware classifier for spam control. Proc. IEEE Int. Symp. Circ. Syst. (2006)
  • S. Martin et al. Analyzing behaviorial features for email classification

    Wenjuan Li is currently a Ph.D. student in the Department of Computer Science, City University of Hong Kong (CityU), and is holding a visiting position at Department of Applied Mathematics and Computer Science, Technical University of Denmark (DTU), Denmark. Prior to this, she worked as Research Assistant in CityU HK and was previously a Lecturer in the Department of Computer Science, Zhaoqing Foreign Language College, China. She was a Winner of Cyber Quiz and Computer Security Competition, Final Round of Kaspersky Lab “Cyber Security for the Next Generation” Conference in 2014. Her research interests include network management and security, collaborative intrusion detection, spam detection, trust computing, web technology and E-commerce technology. She is a student member of IEEE.

    Weizhi Meng is currently an assistant professor in the Cyber Security Section, Department of Applied Mathematics and Computer Science, Technical University of Denmark (DTU), Denmark. He obtained his Ph.D. degree in Computer Science from the City University of Hong Kong (CityU), Hong Kong. Prior to joining DTU, he worked as a research scientist in the Infocomm Security (ICS) Department, Institute for Infocomm Research, A*STAR, Singapore, and as a senior research associate in the CS Department, CityU. He won the Outstanding Academic Performance Award during his doctoral study, and is a recipient of the Hong Kong Institution of Engineers (HKIE) Outstanding Paper Award for Young Engineers/Researchers in both 2014 and 2017. He is also a recipient of the Best Paper Award from ISPEC 2018, and the Best Student Paper Award from NSS 2016. His primary research interests are cyber security and intelligent technology in security, including intrusion detection, smartphone security, biometric authentication, HCI security, trust computing, blockchain in security, and malware analysis. He has served as a program committee member for 20+ international conferences. He has been or will be a co-PC chair for IEEE Blockchain 2018, IEEE ATC 2019, IFIPTM 2019, and Socialsec 2019. He has also served as guest editor for FGCS, JISA, Sensors, CAEE, IJDSN, SCN, WCNC, etc. He is a member of IEEE.

    Zhiyuan Tan is a Lecturer in Cybersecurity at the School of Computing, Edinburgh Napier University (ENU), United Kingdom. He is a Member of IEEE and EAI. His research interests include cybersecurity, machine learning, pattern recognition, data analytics, virtualisation and cyber-physical systems. Prior to joining ENU in 2016, Dr Tan held research positions at three research-intensive universities. He was a Postdoctoral Researcher in Cybersecurity at the University of Twente (UT), the Netherlands from 2014 to 2016; a Research Associate at the University of Technology, Sydney (UTS), Australia in 2014; and a Senior Research Assistant at La Trobe University, Australia in 2013. He serves on the editorial board of International Journal of Computer Sciences and its Applications. He is Associate Editor of IEEE Access and has organised Special Issues for International Journal of Distributed Sensor Networks, Computers & Electrical Engineering, IEEE Access, etc.

    Yang Xiang received his PhD in Computer Science from Deakin University, Australia. He is the Dean of Digital Research & Innovation Capability Platform, Swinburne University of Technology, Australia. His research interests include cyber security, which covers network and system security, data analytics, distributed systems, and networking. In particular, he is currently leading his team in developing active defense systems against large-scale distributed network attacks. He is the Chief Investigator of several projects in network and system security, funded by the Australian Research Council (ARC). He has published more than 200 research papers in many international journals and conferences, such as IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Information Forensics and Security, and IEEE Journal on Selected Areas in Communications. He served as the Associate Editor of IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, Security and Communication Networks (Wiley), and the Editor of Journal of Network and Computer Applications. He is the Coordinator, Asia for IEEE Computer Society Technical Committee on Distributed Processing (TCDP). He is a Senior Member of the IEEE.

    A preliminary version of this paper appears in Proceedings of the 13th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pp. 174–181, 2014 (Li et al., 2014).
