Data Integration Protocol In Ten-steps (DIPIT): A new standard for medical researchers
Introduction
The exponential growth in available data, computing power and readily accessible analytical software has allowed organisations around the world to leverage the benefits of integrating multiple heterogeneous data files for enterprise-level planning and decision making [1]. The growth of data analytics [2] has meant that organisational information flows have become more targeted and focussed. Benefits of effective data integration include more trustworthy research, higher service quality, improved personnel efficiency, reduction of redundant tasks, facilitation of auditing, and more timely, relevant and specific information. Considerable resources are being invested in quality initiatives surrounding data integration; however, poor-quality processes underpinning these analytics elevate the risk of erroneous outcomes. The result can be wasted resources and, ultimately, an erosion of confidence in the data and in the organisations using them. The sharing of information can potentially improve policy-making and integrated public services [1], [3].
Data file integration has enhanced knowledge across a broad spectrum of health and medical research, such as health employee research [4], behavioural survey data [5], social sciences [2], patient hospital records [4], [6], [7], [8], cancer and other health research [9], [10] and in bio molecular systems [11], [12], [13], [14], genetics and genomics [15], [16]. In the field of health and medical science it is becoming an increasing requirement to integrate or merge extensive heterogeneous data files for research purposes, and data files can be linked from multiple providers to perform complex analyses [17], [18], [19]. For example, patient data are compiled from both institutional and community settings, including patient records, digital scans, observational surveys, behavioural surveys and official records, and these are often available in diverse and fragmented formats [6].
The range of analytical functions required to integrate heterogeneous data files effectively and accurately can be challenging, and sometimes overwhelming. Significant funds are often invested in quality initiatives that rely on data integration, yet variable methodology, and therefore variable quality, underpinning these analytics elevates the risk of erroneous outcomes. To date, however, there is no documented set of standards for best-practice integration of heterogeneous data files for research purposes. The aim of this paper is therefore to describe a clear operational protocol for data file integration (Data Integration Protocol In Ten-steps; DIPIT).
Even though the concept of integrating many files to form a single data file for analysis [20] appears relatively straightforward, the actual integration requires careful preparation and a systematic approach to ensure the resulting data are in the correct format, appropriate for the analytical task.
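The basic operation the text describes, combining records from separate files into one analysis-ready table, can be sketched as follows. This is an illustrative example, not from the paper: the column names (`id`, `age`, `score`) and the two small tables are hypothetical, and pandas stands in for whatever statistical software is used.

```python
import pandas as pd

# Two hypothetical heterogeneous source files sharing a participant identifier.
demographics = pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, 29]})
lab_results = pd.DataFrame({"id": [2, 3, 4], "score": [0.7, 0.4, 0.9]})

# An outer join keeps every participant from both files; the "indicator"
# column records which source(s) each row came from, which supports
# auditing of linkage quality.
merged = demographics.merge(lab_results, on="id", how="outer", indicator=True)
```

Even in this minimal case, the unmatched rows (participants present in only one file) show why the resulting data must be checked for format and completeness before analysis.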
The management process involved in producing a reliable and robust integrated data set from multiple sources, with varying heterogeneous formats, is fraught with potential traps. Large organisations, dedicated to providing data file integration services, have proliferated over the last few decades. The potential to enrich knowledge rises as data integration complexity increases, and with this, potential pitfalls increase. As data files expand in volume and complexity, problems can compound to negatively influence the quality of the final integrated data. The requirements for careful management of the merging and organisational processes are often underestimated, but imperative for reliable results. Many data files often exhibit considerable noise or meaningless data, missing information and unstructured text. All these problems need to be addressed when integrating data [6].
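The problems listed above, redundant records, noise codes and unstructured text, typically require a cleaning pass before any merge. The following sketch is hypothetical (the column names and the `"N/A"` noise code are assumptions, not taken from the paper) and shows one common pattern with pandas:

```python
import pandas as pd

# Hypothetical raw file exhibiting a duplicate record, a meaningless
# noise code ("N/A") and untrimmed free text.
raw = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "diagnosis": ["  Depression ", "  Depression ", "N/A", None],
})

cleaned = (
    raw.drop_duplicates(subset="id")           # remove redundant records
       .assign(diagnosis=lambda d: d["diagnosis"]
               .str.strip()                    # normalise free text
               .replace({"N/A": None}))        # map noise codes to missing
)
```

Making such steps explicit, rather than cleaning ad hoc inside the merge, keeps the provenance of the final integrated data auditable.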
Methods and results
The DIPIT approach consists of a set of systematic methodological steps (Table 1) to ensure that: the final data are appropriate for the analysis to meet the research objectives; legal and ethical requirements are met; and that data definitions are clear, concise, and comprehensive. This protocol is neither file specific nor software dependent, but aims to be transportable to any data-merging situation to minimise redundancy and error. It aims to facilitate the generation of a master file that
Example use of DIPIT
DIPIT was used to integrate data files from the National Health and Nutrition Examination Surveys (NHANES), a United States population-based cross-sectional study, with the research objective to study selected demographic, examination and laboratory risk factors for depression. Table 3 outlines the tools used at each DIPIT step for the integration of the selected 80 demographic, examination and laboratory data files downloaded from the NHANES website based on the guidelines provided [80].
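The linkage pattern for NHANES component files can be sketched as below. The assumptions here: NHANES files are keyed by the respondent sequence number `SEQN` (real NHANES files are SAS transport `.XPT` files readable with `pandas.read_sas`), and the small synthetic frames with NHANES-style variable names stand in for the 80 downloaded files.

```python
from functools import reduce
import pandas as pd

# Synthetic stand-ins for downloaded NHANES component files, keyed on SEQN.
demo = pd.DataFrame({"SEQN": [101, 102, 103], "RIAGENDR": [1, 2, 2]})
exam = pd.DataFrame({"SEQN": [101, 103], "BMXBMI": [24.1, 31.5]})
lab = pd.DataFrame({"SEQN": [102, 103], "LBXGLU": [98, 110]})

# Fold all component files into a single master file, one row per
# respondent, keeping respondents missing from some components.
master = reduce(
    lambda left, right: left.merge(right, on="SEQN", how="outer"),
    [demo, exam, lab],
)
```

With the real files, the same fold would be applied to the list of frames returned by `pandas.read_sas` for each `.XPT` file, after the per-file preparation steps of the protocol.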
Conclusion
The integration of 80 selected demographic, examination and laboratory data files, downloaded from the NHANES website, was used to highlight how DIPIT ensures that the final data file is appropriate for the research objectives, that legal and ethical requirements are met, that data definitions are clear, concise and comprehensive, and that linkage quality and missing information are identified and addressed.
The linking of a set of heterogeneous files is becoming increasingly common
Competing interests
There are no competing interests with regards to this manuscript.
Authors’ contributions
J.F.D. conceived and designed DIPIT and drafted the manuscript. M.B., F.N.J., L.J.W., S.D. and J.A.P. critically appraised the manuscript. All authors read and approved the final manuscript.
Authors’ Information
J.F.D. is a PhD student with the School of Medicine at Deakin University and sessional academic, lecturing in statistics, with the Department of Statistics, Data Science and Epidemiology at Swinburne University of Technology.
M.B. is currently an NHMRC Senior Principal Research Fellow and Alfred Deakin Chair of Psychiatry at Deakin University, where he heads the IMPACT Strategic Research Centre. He is also an Honorary Professorial Research Fellow in the Department of Psychiatry, the Florey
References (80)
- et al., Energy Econ. (1999)
- et al., Am. Heart J. (2004)
- et al., Int. J. Med. Inform. (2003)
- et al., J. Am. Med. Inform. Assoc. (2009)
- Artif. Intell. Med. (2002)
- et al., Biol. Psychiatry (2006)
- Encycl. Social Meas. (2005)
- J. Zeng, Center for Technology in Government, University at Albany/SUNY, ...
- D. Bollier, C.M. Firestone, The Promise and Peril of Big Data, Aspen Institute, Communications and Society Program ...
- J. Policy Anal. Manage. (1996)
- J. Appl. Psychol.
- Adm. Policy Ment. Health Ment. Health Serv. Res.
- J. Appl. Psychol.
- Pac. Symp. Biocomput.
- Bioinformatics
- Nat. Biotechnol.
- Bioinformation
- Briefings Bioinf.
- Curr. Comput. Aided Drug Des.
- Hum. Genomics Proteomics
- J. Struct. Funct. Genomics
- Stat. Med.
- BMC Med. Res. Methodol.
- BMC Health Serv. Res.
- Bull. World Health Organ.
- Pharmacoepidemiol. Drug Saf.
- Philos. Perspect. Comput. Mediated Commun.
- Commun. ACM
- Artif. Intell. Med.
- Med. J. Aust.
- Hum. Factors
- Emerg. Med. J.
- Stata Data Management: Reference Manual: Release 12
Cited by (14)
Why so GLUMM? Detecting depression clusters through graphing lifestyle-environs using machine-learning methods (GLUMM)
2017, European Psychiatry. Citation excerpt: "Oversampling of subgroups of the population of particular public health interest was performed to increase the reliability and precision of population estimates [28]. Questionnaire data from the NHANES website were downloaded and integrated using the Data Integration Protocol In Ten-steps (DIPIT) [29]. Variables from the questionnaire component of the NHANES study were categorised according to type (i.e. medical or lifestyle-environ) and initially selected if considered a lifestyle-environ factor."
The association between dietary patterns, diabetes and depression
2015, Journal of Affective Disorders. Citation excerpt: "The sampling methodology involved a stratified, multistage probability process, with data collected annually but released in blocks of two years. Relevant NHANES data files were downloaded from the website and integrated using the Data Integration Protocol In Ten-Steps (DIPIT) (Dipnall et al., 2014). Across the two years studied, markers of depressive symptoms were available for 4656 participants, and 4588 of these were identified with or without diabetes and thus were included in this study."
Functional Requirements for Medical Data Integration into Knowledge Management Environments: Requirements Elicitation Approach Based on Systematic Literature Analysis
2023, Journal of Medical Internet Research
Getting RID of the blues: Formulating a Risk Index for Depression (RID) using structural equation modeling
2017, Australian and New Zealand Journal of Psychiatry