Differentially private model publishing in cyber physical systems

https://doi.org/10.1016/j.future.2018.04.016

Highlights

  • We first identify the privacy preserving problem in cyber physical systems, especially in the fog computing environment.

  • Based on the problem definition, we propose a novel differentially private data publishing method, MLDP, which transfers the data publishing problem to a machine learning problem, solving the current challenges of cyber physical systems.

  • We analyze the key difference between the proposed method and previous differentially private machine learning methods, which can only protect training samples.

  • We analyze both the privacy and the utility of MLDP, demonstrating that MLDP satisfies differential privacy and proving its accuracy bound. We conduct extensive experiments on both real and simulated datasets to demonstrate the effectiveness of the proposed MLDP.

Abstract

With the development of Cyber Physical Systems, privacy has become an important topic in the past few years. It is worthwhile to apply differential privacy, one of the most influential privacy definitions, in cyber physical systems. However, as the essential idea of differential privacy is to release query results rather than entire datasets, a large volume of noise has to be introduced. To provide high-quality services, we need to decrease the correlation among large sets of queries while also predicting results for newly entered queries. This paper transfers the data publishing problem in cyber physical systems to a machine learning problem, in which a prediction model is shared with clients. The prediction model is used to answer currently submitted queries and to predict results for newly entered queries from the public.

Introduction

With advances in Cyber Physical Systems, privacy preserving and security issues have attracted substantial attention [[1], [2]]. The fog-enhanced Internet of Things, shown in Fig. 1(a), is a typical cyber physical system: sensor data are processed or pre-processed at the network edge in a local fog computing environment, while the processed data are stored in one or more datasets to provide services to clients. In this process, without any privacy preserving method, sensor information will be obtained by clients directly and the privacy of sensors might be compromised [3].

Most current research uses cryptography to preserve the privacy of sensors [[4], [5]]. However, cryptography requires the maintenance of encryption keys and cannot deal with situations where data need to be shared with the public. Differential privacy is now widely used to tackle privacy issues [6]. The essential idea of differential privacy is to release query results rather than sharing datasets with clients. This, however, may not be suitable for cyber physical systems, as such a system needs to exchange a large number of queries with clients every day. In that case, a large volume of noise has to be added to the released query results. This obstacle hinders the implementation of differential privacy in cyber physical systems.
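For reference, the formal definition [6] is: a randomized mechanism M satisfies ϵ-differential privacy if, for any two neighboring datasets D and D′ and any set S of possible outputs,

```latex
\Pr[M(D) \in S] \;\le\; e^{\epsilon} \cdot \Pr[M(D') \in S].
```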

The difficulty in multiple query release lies in the high correlation between those queries [7]. Given a fixed privacy level, the sensitivity is defined to capture the maximal change in query results caused by the addition or removal of a single record in a dataset. Correlations between m queries lead to a higher sensitivity (normally m times the sensitivity of a single query) than independent queries, so a large volume of noise has to be added to the query results [8]. Two challenges need to be tackled in multiple query release.

  • How to decrease correlation among queries? Correlation among queries introduces large noise, and according to the definition of differential privacy this high level of noise must be added to every query. We therefore have to decrease the correlation between queries to reduce the introduced noise.

  • How to deal with newly entered queries? As the cyber physical system cannot know what users will ask after the data has been published, it has to consider all possible queries and add pre-defined noise. In Big Data scenarios, it is impossible to list all queries; even if the system could list them, this pre-defined noise would dramatically decrease the utility of the published results, as the sketch after this list illustrates.
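To illustrate the first challenge, the following minimal sketch (ours, not the authors' code) contrasts the Laplace mechanism's error when m count queries are mutually independent, i.e. touch disjoint records, versus fully correlated, where the joint L1 sensitivity grows to m:

```python
import numpy as np

def laplace_mechanism(answers, sensitivity, epsilon):
    """Perturb each true answer with Laplace noise of scale sensitivity/epsilon."""
    return answers + np.random.laplace(0.0, sensitivity / epsilon, size=answers.shape)

m = 1000                          # number of count queries to release
true_answers = np.full(m, 50.0)   # toy true results, one per query
epsilon = 1.0                     # total privacy budget

# Disjoint (independent) queries: one record changes at most one answer,
# so the joint L1 sensitivity stays 1 and each answer gets Lap(1/eps) noise.
independent = laplace_mechanism(true_answers, sensitivity=1.0, epsilon=epsilon)

# Fully correlated queries: one record can shift every answer, so the joint
# L1 sensitivity is m and every answer must carry Lap(m/eps) noise.
correlated = laplace_mechanism(true_answers, sensitivity=float(m), epsilon=epsilon)

print("mean |error|, independent:", np.abs(independent - true_answers).mean())
print("mean |error|, correlated :", np.abs(correlated - true_answers).mean())
```

With m = 1000, the per-answer noise in the correlated case is a thousand times larger, which is exactly the utility loss the paper sets out to avoid.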

Several works have been carried out over the last decade to address the first challenge. Xiao et al. [9] proposed a wavelet transformation to decrease the correlation between queries. Li et al. [7] applied the matrix mechanism to transform the set of queries into a suitable workload. Similarly, Huang et al. [8] transformed the query sets into a set of orthogonal queries. However, these methods only partly solve the first challenge, and the second challenge has not been touched by them. For complex queries, such as similarity queries, it is hard for a system to figure out all independent queries. In addition, how the system should generate the result for a newly entered query by combining old queries remains unknown.

We observe that these two challenges can be overcome by transferring the data publishing problem to a machine learning problem. Rather than releasing a set of queries or a synthetic dataset, we treat the queries as training samples used to generate a prediction model, as shown in Fig. 1(b). Clients can submit queries to the model to obtain query results. For the first challenge, correlation between queries, we apply a limited set of queries to train the model. These limited queries have lower correlation than the original query set. If we can guarantee that the training queries cover most possible scenarios, the output model will have high prediction capability.

For the second challenge, the model can be used to predict the remaining queries, including fresh ones. In effect, the model's prediction is a combination of the training queries. Consequently, the quality of the model is determined by two key factors: the coverage of the training samples and the prediction capability of the model. The model can thus help to answer an unlimited number of complex queries.
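As a rough sketch of this pipeline (our own illustration under simplifying assumptions, not the paper's exact algorithm: the model class, the training-query selection, and the featurize function mapping a query to a feature vector are all placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def mldp_publish(dataset, training_queries, featurize, epsilon):
    """Sketch of the MLDP flow: noisy answers to a limited training-query set
    are used to fit a model, and the published model answers fresh queries."""
    # Answer only the limited, low-correlation training queries on the raw data.
    true_answers = np.array([q(dataset) for q in training_queries])

    # The only privacy-budget-consuming step: Laplace perturbation of answers.
    sensitivity = 1.0  # assumed sensitivity; depends on the actual query class
    noisy_answers = true_answers + np.random.laplace(
        0.0, sensitivity / epsilon, size=true_answers.shape)

    # Train a prediction model on (query features, noisy answer) pairs.
    X = np.array([featurize(q) for q in training_queries])
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X, noisy_answers)

    # The model itself is released; answering a fresh query never touches
    # the original dataset, so it is pure post-processing.
    def answer(fresh_query):
        return float(model.predict(np.array([featurize(fresh_query)]))[0])
    return answer
```

Since training and prediction operate only on the already-noised answers, the returned function can serve any number of fresh queries without further privacy loss.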

The target of this paper is to propose a novel differentially private data publishing method for cyber physical systems. We make the following contributions:

  • We propose a novel differentially private data publishing method, MLDP, which transfers the data publishing problem to a machine learning problem, solving the current challenges of cyber physical systems.

  • We analyze both the privacy and the utility of MLDP, demonstrating that MLDP satisfies ϵ-differential privacy and proving its accuracy bound.

  • We conduct extensive experiments on both real and simulated datasets to demonstrate the effectiveness of the proposed MLDP. Comparing our method with the traditional Laplace mechanism and other prevalent publishing methods, we conclude that MLDP achieves better performance when answering a large set of queries.

Section snippets

Notation

We consider a finite data universe X of size |X|. Let r be a record with d attributes sampled from the universe X, and let a dataset D be an unordered set of n records from the domain X. Two datasets D and D′ are neighboring datasets if they differ in only one record. A query f is a function that maps a dataset D to an abstract range R: f : D → R. A group of queries is denoted as F = {f1, …, fm}, and F(D) denotes {f1(D), …, fm(D)}. We use the symbol m to denote the number of queries in F.

The maximal difference in the results of a query set F on neighboring datasets defines its sensitivity, ΔF = max_{D,D′} ‖F(D) − F(D′)‖₁, which determines the scale of the Laplace noise added to F(D).

Overview

This section presents the implementation of the Machine Learning Differentially Private (MLDP) publishing method. For an original dataset D, suppose a set of queries Fy = {f1, …, fy} on D is waiting to be published. Fig. 2 presents the flows of the traditional Laplace method and the MLDP method. The first flow shows the Laplace method: when Fy is issued on D, the method measures the sensitivity of the query set Fy. To simplify the notation, we re-write the sensitivity of a group of queries Fy as ΔFy.

Privacy analysis of the MLDP method

According to the definition of differential privacy, if the data processing satisfies differential privacy at each step, the combined result also satisfies differential privacy [6].
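This step-wise argument is the sequential composition property of differential privacy, which can be stated as follows:

```latex
% Sequential composition: if each M_i satisfies \epsilon_i-differential
% privacy, then releasing all of their outputs on the same dataset satisfies
% the sum of the budgets.
\mathcal{M}(D) = \bigl(M_1(D), \ldots, M_k(D)\bigr)
\ \text{satisfies}\ \Bigl(\sum_{i=1}^{k} \epsilon_i\Bigr)\text{-differential privacy.}
```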

Algorithm 1 shows that the privacy budget is consumed only in Step 2, in which Laplace noise is added to the answers of the training queries Fx. As the original dataset D is accessed only by Fx, the subsequent steps, model training and fresh query prediction, do not disclose any private information. Therefore, we conclude that MLDP satisfies ϵ-differential privacy.
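The budget accounting above relies on the post-processing property of differential privacy: any computation that does not touch the raw data stays private at the same level. Here, model training and fresh query prediction play the role of the mapping g:

```latex
% Post-processing: for any \epsilon-differentially private mechanism M and any
% (possibly randomized) mapping g that does not access the raw data,
g \circ M \ \text{satisfies}\ \epsilon\text{-differential privacy.}
```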

Experiment configuration

The experiments involve four real-world datasets and one simulated dataset. Three of them, Search Log, NetTrace and Social, are derived from Hay's work [15] and represent data from cyber physical systems. The Netflix dataset is widely used and represents service data, while the simulated dataset is mainly used to show the trend of the learning model.

  • NetTrace: this dataset contains the IP-level network trace at a border gateway of a university. Each record reports the number of

Related work

A plethora of methods has been proposed for differentially private data publishing [16]. Among them, two types of method preserve differential privacy for non-interactive data publishing: one publishes a synthetic dataset, and the other publishes batch queries.

Synthetic dataset publishing attempts to publish a perturbed dataset instead of the original one. Mohammed et al. [17] proposed an anonymization algorithm, DiffGen, to preserve privacy for data publishing.

Conclusions

Differential privacy is an influential notion in privacy preserving data publishing research, but existing differentially private methods fail to provide accurate results in cyber physical systems. Two challenges must be tackled in this process: how to decrease the correlation between queries, and how to deal with queries that are unknown before publishing. This paper proposes a query learning solution to deal with both challenges and makes the following contributions: we propose a novel MLDP method that transfers the data publishing problem to a machine learning problem.

Acknowledgments

This work is supported by the National Science Foundation, United States, through grants IIS-1526499 and CNS-1626432, and by the National Natural Science Foundation of China under Grants 61672313 and 61502362.


References (20)

  • Z. Wang et al., Achieving location error tolerant barrier coverage for wireless sensor networks, Comput. Netw. (2017).
  • I. Stojmenovic, S. Wen, The fog computing paradigm: Scenarios and security issues, in: Proceedings of the 2014...
  • I. Stojmenovic et al., An overview of fog computing and its security issues, Concurr. Comput.: Pract. Exper. (2016).
  • L. Chen et al., Robustness, security and privacy in location-based services for future IoT: A survey, IEEE Access (2017).
  • W. Meng et al., When intrusion detection meets blockchain technology: A review, IEEE Access (2018).
  • C. Dwork, A firm foundation for private data analysis, Commun. ACM (2011).
  • C. Li et al., Optimal error of query sets under the differentially-private matrix mechanism, in: ICDT '13.
  • D. Huang, S. Han, X. Li, P.S. Yu, Orthogonal mechanism for answering batch queries with differential privacy, SSDBM...
  • X. Xiao et al., Differential privacy via wavelet transforms, IEEE Trans. Knowl. Data Eng. (2011).
  • A. Blum et al., A learning theory approach to non-interactive database privacy.

There are more references available in the full text version of this article.


Tianqing Zhu received her B.Eng. and M.Eng. degrees from Wuhan University, China, in 2000 and 2004, respectively, and a Ph.D. degree from Deakin University in Computer Science, Australia, in 2014. Dr. Tianqing Zhu is currently a lecturer of cyber security in the School of Information Technology, Deakin University, Australia. Her research interests include privacy preserving, data mining and network security.

Ping Xiong received his B.Eng. degree from LanZhou Jiaotong University, China in 1997. He received his M.Eng. and Ph.D. degrees from Wuhan University, China, in 2002 and 2005, respectively. He is currently the associate professor of School of Information and Security Engineering, Zhongnan University of Economics and Law, China. His research interests are network security, data mining and privacy preserving.

Gang Li received his Ph.D. in computer science from Deakin University (Australia) in 2005, and currently an associate professor in the school of IT, Deakin University. His research interests are in the area of data mining, machine learning and business intelligence. He served on the Program Committee for over 100 international conferences in artificial intelligence, data mining and machine learning, tourism and hospitality management. He is currently an associate editor for Decision Support Systems (Elsevier, 2014-), and has been the guest editor for Enterprise Information Systems (Taylor & Francis), Chinese Journal of Computer, Concurrency and Computation: Practise and Experience (Wiley), and Future Generation Computer Systems (Elsevier).

Wanlei Zhou received the B.Eng. and M.Eng. degrees from Harbin Institute of Technology, Harbin, China in 1982 and 1984, respectively, and the Ph.D. degree from The Australian National University, Canberra, Australia, in 1991, all in Computer Science and Engineering. He also received a D.Sc. degree from Deakin University in 2002. He is currently the Alfred Deakin Professor and Chair of Information Technology, School of Information Technology, Deakin University. Professor Zhou has published more than 300 papers in refereed international journals and refereed international conferences proceedings. He has also chaired many international conferences and has been invited to deliver keynote address in many international conferences. Prof. Zhou’s research interests include distributed systems, network security, bioinformatics, and e-learning. Prof. Zhou is a Senior Member of the IEEE.

Philip S. Yu received the B.S. Degree in E.E. from National Taiwan University, the M.S. and Ph.D. degrees in E.E. from Stanford University, and the M.B.A. degree from New York University. He is a Distinguished Professor in Computer Science at the University of Illinois at Chicago and also holds the Wexler Chair in Information Technology. Before joining UIC, Dr. Yu was with IBM, where he was manager of the Software Tools and Techniques department at the Watson Research Center. His research interest is on big data, including data mining, data stream, database and privacy. He has published more than 1000 papers in refereed journals and conferences. He holds or has applied for more than 300 US patents. Dr. Yu is a Fellow of the ACM and the IEEE. He is the Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data. Dr. Yu is the recipient of ACM SIGKDD 2016 Innovation Award for his influential research and scientific contributions on mining, fusion and anonymization of big data, the IEEE Computer Society’s 2013 Technical Achievement Award for “pioneering and fundamentally innovative contributions to the scalable indexing, querying, searching, mining and anonymization of big data”, and the Research Contributions Award from IEEE Intl. Conference on Data Mining (ICDM) in 2003 for his pioneering contributions to the field of data mining. He also received the ICDM 2013 10-year Highest-Impact Paper Award, and the EDBT Test of Time Award (2014).
