Elsevier

Methods

Volume 69, Issue 3, 1 October 2014, Pages 237-246

Data Integration Protocol In Ten-steps (DIPIT): A new standard for medical researchers

https://doi.org/10.1016/j.ymeth.2014.07.001

Highlights

  • Currently no documented protocols exist for best practice integration of data files.

  • Poor quality integration processes cause errors and loss of confidence in the data.

  • DIPIT is a systematic approach for integrating multiple heterogeneous data files.

  • DIPIT is designed to minimise errors and streamline the integration process.

Abstract

Introduction

The exponential increase in data, computing power and the availability of readily accessible analytical software has allowed organisations around the world to leverage the benefits of integrating multiple heterogeneous data files for enterprise-level planning and decision making. The benefits of effective data integration for the health and medical research community include more trustworthy research, higher service quality, improved personnel efficiency, reduction of redundant tasks, facilitation of auditing, and more timely, relevant and specific information. Poor quality integration processes, by contrast, elevate the risk of erroneous outcomes and erode confidence in the data and in the organisations using these data. To date, no documented set of standards exists for best practice integration of heterogeneous data files for research purposes. Therefore, the aim of this paper is to describe a clear protocol for data file integration (Data Integration Protocol In Ten-steps; DIPIT), translational to any field of research.

Methods and results

The DIPIT approach consists of a set of 10 systematic methodological steps to ensure that the final data are appropriate for the analysis and meet the research objectives; that legal and ethical requirements are met; and that data definitions are clear, concise and comprehensive. This protocol is neither file specific nor software dependent, but aims to be transportable to any data-merging situation, to minimise redundancy and error, and to be translational to any field of research. DIPIT aims to generate a master data file that is of optimal integrity to serve as the basis for research analysis.

Conclusion

With linking of heterogeneous data files becoming increasingly common across all fields of medicine, DIPIT provides a systematic approach to the potentially complex task of integrating a large number of files and variables. The DIPIT protocol will ensure that the final integrated data are consistent and of high integrity for the research requirements, making it useful for practical application across all fields of medical research.

Introduction

The exponential increase in available data, computing power and the availability of readily accessible analytical software has allowed organisations around the world to leverage the benefits of integrating multiple heterogeneous data files for enterprise-level planning and decision making [1]. The growth of data analytics [2] has meant that organisational information flows have become more targeted and focussed. Benefits from effective data integration include more trustworthy research, higher service quality, improved personnel efficiency, reduction of redundant tasks, facilitation of auditing and more timely, relevant and specific information. Considerable resources are being invested in quality initiatives surrounding data integration; however, poor quality processes underpinning these analytics elevate the risk of erroneous outcomes. The result can be wasted resources and, ultimately, an erosion of confidence in the data and the organisations using these data. The sharing of information can potentially improve policy-making and integrated public services [1], [3].

Data file integration has enhanced knowledge across a broad spectrum of health and medical research, such as health employee research [4], behavioural survey data [5], social sciences [2], patient hospital records [4], [6], [7], [8], cancer and other health research [9], [10] and in bio molecular systems [11], [12], [13], [14], genetics and genomics [15], [16]. In the field of health and medical science it is becoming an increasing requirement to integrate or merge extensive heterogeneous data files for research purposes, and data files can be linked from multiple providers to perform complex analyses [17], [18], [19]. For example, patient data are compiled from both institutional and community settings, including patient records, digital scans, observational surveys, behavioural surveys and official records, and these are often available in diverse and fragmented formats [6].
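The paper is software independent, but the ingestion of such diverse and fragmented formats can be illustrated with a minimal pandas sketch (our choice of tool, not the authors'; the field names and values here are hypothetical). Each fragment is normalised into a common tabular structure carrying a shared linkage key before any merging takes place:

```python
import io
import json

import pandas as pd

# Hypothetical fragments of patient data arriving in different formats.
csv_source = io.StringIO("patient_id,age\n101,34\n102,57\n")
json_source = json.dumps([
    {"patient_id": 101, "survey_score": 7},
    {"patient_id": 103, "survey_score": 4},
])

# Normalise each source into a DataFrame with a shared linkage key.
records = pd.read_csv(csv_source)
surveys = pd.DataFrame(json.loads(json_source))

# An outer merge preserves patients seen in only one source.
combined = records.merge(surveys, on="patient_id", how="outer")
print(combined)
```

Whatever the original formats, once every source exposes the same key column the downstream merging logic no longer needs to know where each fragment came from.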

The plethora of analytical functions required to effectively and accurately integrate heterogeneous data files is challenging and sometimes overwhelming. Significant funds are often invested in quality initiatives that rely on data integration, but variable methodology, and thus variable quality, underpinning these analytics elevates the risk of erroneous outcomes. To date, however, there is no documented set of standards for best practice integration of heterogeneous data files for research purposes. Therefore, the aim of this paper is to describe a clear operational protocol for data file integration (Data Integration Protocol In Ten-steps; DIPIT).

Even though the concept of integrating many files to form a single data file for analysis [20] appears relatively straightforward, the actual integration requires careful preparation and a systematic approach to ensure the resulting data are in the correct format, appropriate for the analytical task.
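The gap between an apparently straightforward merge and a careful one can be sketched in pandas (again an illustrative tool choice, with hypothetical data): a duplicate key in one file silently multiplies rows in a naive merge, whereas pre-merge validation and a linkage indicator surface the problem:

```python
import pandas as pd

patients = pd.DataFrame({"id": [1, 2, 3], "sex": ["F", "M", "F"]})
# A duplicate key in the second file would silently multiply rows.
labs = pd.DataFrame({"id": [1, 1, 2], "hb": [13.1, 12.9, 14.2]})

# validate= raises MergeError instead of quietly producing duplicates.
try:
    patients.merge(labs, on="id", validate="one_to_one")
except pd.errors.MergeError:
    print("duplicate keys detected in labs")

# indicator= records which file each row came from, for a linkage audit.
merged = patients.merge(labs, on="id", how="outer", indicator=True)
print(merged["_merge"].value_counts())
```

Auditing the `_merge` column after every join is one simple way to confirm that the resulting data are in the expected shape before analysis proceeds.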

The management process involved in producing a reliable and robust integrated data set from multiple sources, with varying heterogeneous formats, is fraught with potential traps. Large organisations dedicated to providing data file integration services have proliferated over the last few decades. The potential to enrich knowledge rises as data integration complexity increases, and with it the potential pitfalls. As data files expand in volume and complexity, problems can compound to negatively influence the quality of the final integrated data. The requirements for careful management of the merging and organisational processes are often underestimated, yet imperative for reliable results. Data files often exhibit considerable noise or meaningless data, missing information and unstructured text, and all of these problems need to be addressed when integrating data [6].
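The kinds of cleaning these problems demand can be illustrated with a small hypothetical extract (column names, the sentinel code 999 and the diagnosis strings are all invented for the example): sentinel missing-value codes are converted to true missing values, and unstructured text is normalised to a consistent form before integration:

```python
import numpy as np
import pandas as pd

# Hypothetical raw extract: sentinel codes for missing, untidy free text.
raw = pd.DataFrame({
    "subject": [1, 2, 3],
    "weight_kg": [72.5, 999, 68.0],      # 999 used as a missing-value code
    "diagnosis": [" Depression ", "depression", "DIABETES  "],
})

clean = raw.copy()
# Replace the sentinel with a true missing value before any analysis.
clean["weight_kg"] = clean["weight_kg"].replace(999, np.nan)
# Normalise unstructured text to a consistent form.
clean["diagnosis"] = clean["diagnosis"].str.strip().str.lower()
print(clean)
```

Left unaddressed, a sentinel like 999 would flow into summary statistics as a real weight, which is exactly the class of silent error the protocol is designed to prevent.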


Methods and results

The DIPIT approach consists of a set of systematic methodological steps (Table 1) to ensure that the final data are appropriate for the analysis and meet the research objectives; that legal and ethical requirements are met; and that data definitions are clear, concise and comprehensive. This protocol is neither file specific nor software dependent, but aims to be transportable to any data-merging situation to minimise redundancy and error. It aims to facilitate the generation of a master file that is of optimal integrity to serve as the basis for research analysis.
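The transportability claim — that the same merging logic should apply to any number of files in any data-merging situation — can be sketched as a single reusable routine (a minimal illustration in pandas, not the authors' implementation; the file names and key are hypothetical):

```python
from functools import reduce

import pandas as pd

def build_master(frames, key, how="outer"):
    """Fold an arbitrary list of data files into one master DataFrame,
    joined on a shared key, regardless of how many files there are."""
    return reduce(lambda left, right: left.merge(right, on=key, how=how),
                  frames)

demo = pd.DataFrame({"id": [1, 2], "age": [40, 55]})
exam = pd.DataFrame({"id": [1, 2], "bmi": [24.1, 27.8]})
lab = pd.DataFrame({"id": [2], "glucose": [5.4]})

master = build_master([demo, exam, lab], key="id")
print(master.columns.tolist())
```

Because the routine takes the file list, key and join type as parameters, the same three lines of merging logic serve two files or two hundred, which is the sense in which a protocol can be transportable rather than file specific.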

Example use of DIPIT

DIPIT was used to integrate data files from the National Health and Nutrition Examination Surveys (NHANES), a United States population-based cross-sectional study, with the research objective to study selected demographic, examination and laboratory risk factors for depression. Table 3 outlines the tools used at each DIPIT step for the integration of the selected 80 demographic, examination and laboratory data files downloaded from the NHANES website based on the guidelines provided [80].
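The mechanics of such an NHANES merge can be sketched in pandas (the real files are SAS transport `.xpt` files, readable with `pd.read_sas`, and share the respondent sequence number SEQN as their linkage key; the tiny frames and values below are simulated stand-ins, not actual NHANES data):

```python
import pandas as pd

# Simulated stand-ins for three NHANES component files. Variable names
# (RIDAGEYR: age in years, BMXBMI: body mass index, LBXGLU: glucose)
# follow NHANES conventions; the values are invented.
demographic = pd.DataFrame({"SEQN": [1001, 1002, 1003],
                            "RIDAGEYR": [34, 61, 45]})
examination = pd.DataFrame({"SEQN": [1001, 1003], "BMXBMI": [22.4, 30.1]})
laboratory = pd.DataFrame({"SEQN": [1002, 1003], "LBXGLU": [99.0, 112.0]})

# Left-merge onto the demographic file so every sampled respondent is
# retained even when a component measurement is absent for that person.
master = (demographic
          .merge(examination, on="SEQN", how="left")
          .merge(laboratory, on="SEQN", how="left"))
print(master.shape)
```

Scaling the same pattern from three files to the 80 files used in the worked example changes only the length of the file list, not the merging logic.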

Conclusion

The integration of 80 selected demographic, examination and laboratory data files downloaded from the NHANES website has been used to highlight how DIPIT ensures that the final data file is appropriate for the research objectives, that legal and ethical requirements are met, that data definitions are clear, concise and comprehensive, and that linkage quality issues and missing information are identified and addressed.

The linking of a set of heterogeneous files is becoming increasingly common across all fields of medicine.

Competing interests

There are no competing interests with regard to this manuscript.

Authors’ contributions

JFD conceived and designed DIPIT and drafted the manuscript. M.B., F.N.J., L.J.W., S.D. and J.A.P. critically appraised the manuscript. All authors read and approved the final manuscript.

Authors’ Information

J.F.D. is a PhD student with the School of Medicine at Deakin University and sessional academic, lecturing in statistics, with the Department of Statistics, Data Science and Epidemiology at Swinburne University of Technology.

M.B. is currently an NHMRC Senior Principal Research Fellow, and is Alfred Deakin Chair of Psychiatry at Deakin University, where he heads the IMPACT Strategic Research Centre. He is also an Honorary Professorial Research Fellow in the Department of Psychiatry, the Florey

References (80)

  • S.L. Puller et al., Energy Econ. (1999)
  • Y. Kiyota et al., Am. Heart J. (2004)
  • M. Weiner et al., Int. J. Med. Inform. (2003)
  • V.J. Zhu et al., J. Am. Med. Inform. Assoc. (2009)
  • J.J. Berman, Artif. Intell. Med. (2002)
  • S.R. Wisniewski et al., Biol. Psychiatry (2006)
  • A. Pickles, Encycl. Social Meas. (2005)
  • J. Zeng, Center for Technology in Government, University at Albany/SUNY, ...
  • D. Bollier, C.M. Firestone, The Promise and Peril of Big Data, Aspen Institute, Communications and Society Program ...
  • S.S. Dawes, J. Policy Anal. Manage. (1996)
  • P.E. Spector et al., J. Appl. Psychol. (1991)
  • A.B. Rothbard et al., Adm. Policy Ment. Health Ment. Health Serv. Res. (1990)
  • P.E. Spector et al., J. Appl. Psychol. (1988)
  • A. Daemen et al., Pac. Symp. Biocomput. (2008)
  • J.K. Choi et al., Bioinformatics (2003)
  • B. Smith et al., Nat. Biotechnol. (2007)
  • S.P. Akula et al., Bioinformation (2009)
  • C.F. Quo et al., Briefings Bioinf. (2012)
  • J.A. Seoane et al., Curr. Comput. Aided Drug Des. (2013)
  • J.S. Hamid et al., Hum. Genomics Proteomics (2009)
  • R. Jansen et al., J. Struct. Funct. Genomics (2002)
  • S. Gomatam et al., Stat. Med. (2002)
  • O.U. Press, Definition of merge, ...
  • Y.S.o. Medicine, ...
  • J. Ma et al., BMC Med. Res. Methodol. (2011)
  • M. Greiver et al., BMC Health Serv. Res. (2012)
  • J. Braa et al., Bull. World Health Organ. (2012)
  • L. Gu, R. Baxter, D. Vickers, C. Rainsford, CSIRO Mathematical and Information Sciences Technical Report, 3 (2003) ...
  • C.I. Neutel, Pharmacoepidemiol. Drug Saf. (1998)
  • D. Elgesem, Philos. Perspect. Comput. Mediated Commun. (1996)
  • G.B. Bell et al., Commun. ACM (2001)
  • G.W.M. Krysztof et al., Artif. Intell. Med. (2002)
  • M.B. Van Der Weyden, Med. J. Aust. (2006)
  • N.H.a.M.R.C, Australian Government, NHMRC, ...
  • Y. Reingewertz, Available at SSRN 2200023, ...
  • R. Kammann, Hum. Factors (1975)
  • P. Dargan et al., Emerg. Med. J. (2002)
  • T. Crews, Computer Science Teaching Centre Digital Library, Western Kentucky University, USA, 2001. ...
  • SAS Institute, SAS 9.3 Output Delivery System: User's Guide, SAS Institute, ...
  • StataCorp, Stata Data Management: Reference Manual: Release 12 (2011)