Aggregating privatized medical data for secure querying applications

https://doi.org/10.1016/j.future.2016.11.028

Highlights

  • Proposes solutions for the aggregation and data querying of sensitive data.

  • Delineates applications of aggregation of sensitive medical data.

  • Introduces an efficient diagonal data aggregation method.

  • Presents a method for privately and efficiently querying the aggregated data.

  • Data service manager is untrusted beyond seeing the privatized contributed data.

Abstract

Public and private organizations generate large amounts of data which they are happy to allow others to query as long as it is privatized. (One example is that of medical data which can be used for research purposes.) Aggregation of such data on a cloud provides an opportunity for querying over rich data. This paper provides a solution for sharing sensitive data where large numbers of data contributors publish their privatized data sets which are then aggregated by a cloud manager on a cloud so that data can be made available to anyone who wants to query it. Additionally, our solution determines how aggregated data can be efficiently and effectively queried, while retaining privacy not only of the data, but also of the original data owner, the query and the person querying. We introduce a non-standard diagonal data aggregation method and, by experimental testing, demonstrate that our data querying procedure is efficient, maintains acceptable data privacy and acceptable data utility, along with practical computation and storage costs. Our solution also accepts a number of varied queries including join, aggregate, range, nested, ordered by and pattern matching. Finally, we discuss four potential threats posed by our cloud manager against which our scheme is resistant.

Introduction

The importance of data sharing in the international context of global issues such as health, environmental change, and food production, is amplified by projects such as the International Cancer Genome Consortium (https://icgc.org/icgc/goals-structure-policies-guidelines/b-consortium-goals).

The prodigious amount of data accumulated by science and business needs to be aggregated in order to extract information and gain knowledge. Such data sets often result from the systematic collection of published data from multiple sources and are eventually transmitted to a cloud server, on which efficient processing is required to produce high-level, high-quality information. Given the sensitive nature of much data and the varying social and legal implications of its disclosure, privacy is a major concern when sharing data. In order to prevent disclosure of individually identifiable information, usually only de-identified data sets are shared. De-identification is implemented by means of privacy preserving data mining methods [1]. Several major challenges face those wishing to aggregate data for the purposes of data sharing and querying. One challenge is to obtain an aggregated data set which achieves acceptable privacy and utility levels; a second is maintaining practical storage and computation costs. As public data sets are not under the data provider’s control, data confidentiality and integrity are of concern in outsourced databases. The primary way to protect sensitive data sets is to privatize the data before sharing [2], [3], [4]. Once the data have been appropriately aggregated, a cloud service manager hosts the data of its clients and provides a variety of data management functionalities, including modifications and queries. A third challenge is ensuring that the server implements querying operations correctly while not having access to the identifiable information.

Requirements of Privacy Preserving Querying Services: On a daily basis, people query public online data services such as search engines, social network sites and news portals. While users need such public data services, they may also be concerned that their personal information could be disclosed or compromised. User queries can be revealed intentionally to advertisers (in some cases without the user’s knowledge) such as in some of the Google and Facebook applications (http://news.cnet.com/2702-1009_3-986.html). In our proposed data aggregation system, we enable private querying on public data services so that the contents of user queries and their replies are hidden from the service manager.

Private Data Querying Methods: A great deal of work has been done in the area of private query processing [4], [5], [6]. We review existing privacy preserving data querying methods in Section 2.2. Query processing protocols on encrypted data sets stored on a cloud have been extensively studied (e.g., [7]), while query processing that preserves both the data privacy of the data providers and the query privacy of the data requester is a relatively new research area.

Contributions of this paper. In this paper, we propose a new approach to aggregating data, which we call the “diagonal data aggregation” method. In Section 5.1, we compare the proposed method experimentally with the widely used horizontal and vertical methods and find that it is more efficient for data modification operations while keeping data privacy and data utility acceptable. We propose a method of data aggregation and data querying which achieves high levels of privacy and utility with acceptable cost, and performs data modification efficiently. However, there is a trade-off for this improvement: increased data storage costs. The results of our work are presented in Tables 2 to 6 of Section 5. Tables 5 and 6 of Section 5.2 show that the proposed solution provides privacy preserving data querying processes, for several query types, with low computation and communication cost. Our proposal also supports data modification and credential revocation processes.
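To make the contrast between the three aggregation styles concrete, the following is a minimal sketch in Python. The horizontal and vertical layouts are the standard ones (stacking rows over shared columns, and appending columns over shared records). The text available here does not define the diagonal layout, so the `diagonal` function is only an assumption: it places each contributed table in its own rows and its own columns, block-diagonal style, and pads the rest with `None`. All function names and the toy tables are hypothetical.

```python
from typing import List, Optional

Table = List[List[Optional[int]]]


def horizontal(a: Table, b: Table) -> Table:
    """Stack rows: both contributors use the same columns."""
    return [row[:] for row in a] + [row[:] for row in b]


def vertical(a: Table, b: Table) -> Table:
    """Append columns: both contributors describe the same records."""
    return [ra + rb for ra, rb in zip(a, b)]


def diagonal(tables: List[Table]) -> Table:
    """ASSUMED layout (not taken from the paper): every contributor keeps
    its own rows and its own columns; all other cells are padded with None."""
    total_cols = sum(len(t[0]) for t in tables)
    out: Table = []
    offset = 0  # first column owned by the current contributor
    for t in tables:
        width = len(t[0])
        for row in t:
            left = [None] * offset
            right = [None] * (total_cols - offset - width)
            out.append(left + row[:] + right)
        offset += width
    return out


# Two toy contributed tables with made-up values.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

h = horizontal(a, b)  # 4 rows x 2 columns
v = vertical(a, b)    # 2 rows x 4 columns
d = diagonal([a, b])  # 4 rows x 4 columns, half of the cells are padding
```

Under this assumed layout, a contributor's update touches only its own block, which is consistent with the efficiency claim for data modification, while the padding cells are one plausible source of the increased storage cost noted above.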

Our main contributions are (i) computationally efficient data aggregation and data querying procedures (Table 3, Table 6) and (ii) aggregation and querying procedures in which the cloud service manager has no access to the original data. Additionally, our system supports several types of queries (Section  4.4) over public data sets as well as managing data updates and credential revocation.
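For readers unfamiliar with the query classes named in the abstract (join, aggregate, range, nested, order by and pattern matching), the sketch below shows one plain SQL instance of each, run against an in-memory SQLite database from Python. It illustrates only the query shapes, not the private querying protocol of the paper. The tables `patient` and `visit`, their columns, and all values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE visit (patient_id INTEGER, chol INTEGER);
INSERT INTO patient VALUES (1, 'Ann', 63), (2, 'Bob', 41), (3, 'Anil', 57);
INSERT INTO visit VALUES (1, 233), (2, 204), (3, 286), (1, 250);
""")

# One hypothetical example per query class.
queries = {
    "join": "SELECT p.name, v.chol FROM patient p "
            "JOIN visit v ON v.patient_id = p.id ORDER BY v.chol",
    "aggregate": "SELECT AVG(chol) FROM visit",
    "range": "SELECT name FROM patient WHERE age BETWEEN 50 AND 60",
    "nested": "SELECT name FROM patient WHERE id IN "
              "(SELECT patient_id FROM visit WHERE chol > 240) ORDER BY name",
    "order by": "SELECT name FROM patient ORDER BY age DESC",
    "pattern matching": "SELECT name FROM patient "
                        "WHERE name LIKE 'An%' ORDER BY name",
}

results = {kind: conn.execute(sql).fetchall() for kind, sql in queries.items()}
for kind, rows in results.items():
    print(f"{kind}: {rows}")
```

In a privacy preserving setting, each of these shapes would have to be evaluated over privatized data without revealing the query or its reply to the service manager, and that part is not reproduced here.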

The paper is organized as follows. In Section 2, we review the current literature related to the challenges mentioned in this Introduction; in particular, we survey recent work on private data aggregation methods and querying methods in Sections 2.1 and 2.2, respectively. We identify the gaps addressed by our work in Section 2.3 and provide solutions to fill those gaps in Section 2.4. Sections 3 and 4 describe the workflow processes between the components of our architecture: communication between a data contributor (DC) and the data service manager (DSM), and between a data requester (DR) and the DSM, respectively. In the experimental Section 5, we aggregate four data sets using our method and then demonstrate the querying process on the result. A comparison with other research work indicates that our aggregation method is more computationally efficient than others (Table 3) and that our data querying method for varied queries is also more computationally efficient than others (Table 6). Tables 2 and 5 present the time needed for communication between a DC and the DSM, and between a DR and the DSM, respectively. In Section 6, we demonstrate the prevention of several insider attacks that could be mounted from the data server. Table 7 indicates that our architecture provides better protection from insider threats than do several recent papers. Section 7 summarizes and presents conclusions.

Section snippets

Current solutions on privacy preserving data aggregation and data querying

This section presents previous work on sharing public data sets and is divided into two subsections. The first presents the literature on privacy preserving data aggregation methods for distributed data sets. The second covers existing work on private data querying over public data sets.

Communication between the data contributor and the DSM

This section describes the communication processing steps between the DSM and a DC. The DSM and data repository are used to manage and process each step of communication between the DSM and a DC.

The communication steps are presented in Fig. 2. Each module of this communication is described in Sections 3.1, 3.2, 3.3, and 3.4, respectively.

Communication between the data requester and the DSM

This section details an interaction between the DSM and a DR, presented in Fig. 5. At the DSM end, the DSM, a data repository, a policy enforcement point (PEP), and a policy decision point (PDP) are used for each communication step. The DSM is responsible for each send and receive process on the cloud server, in which policies are established using the PEP and PDP (details in Section 6), and decisions are made on the basis of these policies. This section illustrates querying processing steps between a

Experimental work

This section demonstrates the implementation of the proposed architecture and compares it with well-known existing methods. Our experiments focus on aggregation of the CDSs, the data querying analysis on the DR, and on the communication between the DSM and a DR. We use both real and synthetic data sets for our experimental work. These data sets are described in [1]. The sizes of the real data sets are 294×14 and 303×14. The sizes of the synthetic data sets are 294×121 and 303×121.
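As a rough back-of-the-envelope illustration of the storage trade-off mentioned in the Introduction, the following Python sketch counts cells for the four data set sizes given above. It assumes, purely for illustration, that diagonal aggregation places each data set in its own rows and columns of one padded table (a block-diagonal layout); this layout is an assumption and is not taken from the paper.

```python
# (rows, columns) of the two real and two synthetic data sets.
shapes = [(294, 14), (303, 14), (294, 121), (303, 121)]

# Cells that actually hold contributed values.
data_cells = sum(r * c for r, c in shapes)

# ASSUMED block-diagonal layout: row counts add up and column counts add up.
block_rows = sum(r for r, _ in shapes)
block_cols = sum(c for _, c in shapes)
block_cells = block_rows * block_cols

padding_ratio = block_cells / data_cells

print(data_cells)     # 80595
print(block_cells)    # 322380 (1194 rows x 270 columns)
print(padding_ratio)  # 4.0
```

Under that assumption the aggregated table would hold four times as many cells as the contributed data, unless a sparse representation is used. The paper's own storage figures are those reported in its Section 5 tables.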

Insider threat model

Section  5 demonstrated the comparative computational efficiency of our data aggregation and data querying procedures as claimed in Section  1. We now address the second claim from that section: the fact that we prevent the cloud service manager from accessing original (plaintext) versions of both submitted data sets and of query responses. In particular, in this section, we focus first on the ability of the DSM to (1) estimate the original data set from the privatized data set and (2) identify

Summary and conclusions

This paper presented a solution for sharing sensitive data sets in which a large number of data contributors publish their privatized data sets on a cloud server, so that the data sets are made available to anyone who wants access to them, for whatever purpose. Our proposed architecture supports data modification operations. We aggregate data using a new diagonal method which we demonstrate to be more efficient than current methods. We present a method for efficiently querying the

Acknowledgement

This project was supported by ARC grant LP0989756.

Kalpana Singh is pursuing PostDoc at CEA, France. She did her Ph.D. at Deakin University, Australia. Her academic record is laden with First class throughout. She has been teaching successfully at Department of Computer Science and Information Technology, GLA University INDIA. She has a number of research publications to her credit in reputed journals and conferences in the area of cryptography and Information Security.

References (38)

  • D. Thilakanathan et al., A platform for secure monitoring and sharing of generic health data in the cloud, Future Gener. Comput. Syst. (2014)
  • Y. Jararweh et al., Software defined cloud: Survey, system and evaluation, Future Gener. Comput. Syst. (2016)
  • K. Singh, J. Rong, L. Batten, Sharing sensitive medical data sets for research purposes - a case study, in: Proceedings...
  • R. Agrawal, J. Kiernan, R. Srikant, Y. Xu, Order preserving encryption for numeric data, in: SIGMOD’04 Proceedings of...
  • S. Tu et al., Processing analytical queries over encrypted data, Proc. VLDB Endow. (2013)
  • H. Hacigümüş, B. Iyer, C. Li, S. Mehrotra, Executing SQL over encrypted data in the database-service-provider model,...
  • S. Wang et al., Towards practical private processing of database queries over public data, Distrib. Parallel Databases (2014)
  • N. Cao et al., Privacy-preserving multi-keyword ranked search over encrypted cloud data, IEEE Trans. Parallel Distrib. Syst. (2014)
  • C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, M. Naor, Our data, ourselves: Privacy via distributed noise...
  • D.X. Song, D. Wagner, A. Perrig, Practical techniques for searches on encrypted data, in: IEEE Symposium on Security...
  • C. Chu et al., Key-aggregate cryptosystem for scalable data sharing in cloud storage, IEEE Trans. Parallel Distrib. Syst. (2014)
  • D. McCarthy, P. Malone, J. Hange, K. Doyle, E. Robson, D. Conway, S. Ivanov, Ł. Radziwonowicz, R. Kleinfeld, T....
  • Q. Zhou et al., An efficient secure data aggregation based on homomorphic primitives in wireless sensor networks, Int. J. Distrib. Sens. Netw. (2014)
  • Y. Yang et al., Sdap: A secure hop-by-hop data aggregation protocol for sensor networks, ACM Trans. Inf. Syst. Secur. (TISSEC) (2008)
  • N. An, S. Weber, On the performance overhead tradeoff of distributed principal component analysis via data...
  • L. Li et al., Privacy-preserving-outsourced association rule mining on vertically partitioned databases, IEEE Trans. Inf. Forensics Secur. (2016)
  • L. Bellatreche, R. Bouchakri, A. Cuzzocrea, S. Maabout, Horizontal partitioning of very-large data warehouses under...
  • M. Kantarcıoğlu et al., Privacy preserving distributed mining of association rules on horizontally partitioned data, IEEE Trans. Knowl. Data Eng. (2004)

    Lynn Batten holds the Deakin Research Chair in Mathematics and is Director of Information Security Research at Deakin University. She is a Fellow of the Australian Computer Society, a Graduate of the Australian Institute of Company Directors and a Senior Member of the IEEE. Her research interests cover a broad set of areas in information security from cryptography to malicious software and digital forensics.

    1 DRT/LIST/DACLE/SCSN/L3S, Commissariat à l’Energie Atomique, NanoInnov Centre de Saclay, 91191 Gif-sur-Yvette Cedex, France.
