Data Integration Protocol In Ten-steps (DIPIT): A new standard for medical researchers
Introduction
The exponential growth in available data, computing power and readily accessible analytical software has allowed organisations around the world to leverage the benefits of integrating multiple heterogeneous data files for enterprise-level planning and decision making [1]. The growth of data analytics [2] has meant that organisational information flows have become more targeted and focussed. Benefits of effective data integration include more trustworthy research, higher service quality, improved personnel efficiency, reduction of redundant tasks, facilitation of auditing, and more timely, relevant and specific information. Considerable resources are being invested in quality initiatives surrounding data integration; however, poor-quality processes underpinning these analytics elevate the risk of erroneous outcomes. The result can be wasted resources and, ultimately, an erosion of confidence in the data and in the organisations using them. The sharing of information can potentially improve policy-making and integrated public services [1], [3].
Data file integration has enhanced knowledge across a broad spectrum of health and medical research, such as health employee research [4], behavioural survey data [5], social sciences [2], patient hospital records [4], [6], [7], [8], cancer and other health research [9], [10] and in bio molecular systems [11], [12], [13], [14], genetics and genomics [15], [16]. In the field of health and medical science it is becoming an increasing requirement to integrate or merge extensive heterogeneous data files for research purposes, and data files can be linked from multiple providers to perform complex analyses [17], [18], [19]. For example, patient data are compiled from both institutional and community settings, including patient records, digital scans, observational surveys, behavioural surveys and official records, and these are often available in diverse and fragmented formats [6].
The range of analytical functions required to integrate heterogeneous data files effectively and accurately can be challenging, and sometimes overwhelming. Significant funds are often invested in quality initiatives that rely on data integration, yet variable methodology, and therefore variable quality, underpinning these analytics elevates the risk of erroneous outcomes. To date, however, there is no documented set of standards for best-practice integration of heterogeneous data files for research purposes. The aim of this paper is therefore to describe a clear operational protocol for data file integration (Data Integration Protocol In Ten-steps; DIPIT).
Even though the concept of integrating many files to form a single data file for analysis [20] appears relatively straightforward, the actual integration requires careful preparation and a systematic approach to ensure the resulting data are in the correct format, appropriate for the analytical task.
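The basic operation the text describes, combining records from separate files into one analysis-ready table, can be sketched as follows. This is an illustrative example, not from the paper: the column names (`id`, `age`, `score`) and the two small tables are hypothetical, and pandas stands in for whatever statistical software is used.

```python
import pandas as pd

# Two hypothetical heterogeneous source files sharing a participant identifier.
demographics = pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, 29]})
lab_results = pd.DataFrame({"id": [2, 3, 4], "score": [0.7, 0.4, 0.9]})

# An outer join keeps every participant from both files; the "indicator"
# column records which source(s) each row came from, which supports
# auditing of linkage quality.
merged = demographics.merge(lab_results, on="id", how="outer", indicator=True)
```

Even in this minimal case, the unmatched rows (participants present in only one file) show why the resulting data must be checked for format and completeness before analysis.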
The management process involved in producing a reliable and robust integrated data set from multiple sources, with varying heterogeneous formats, is fraught with potential traps. Large organisations, dedicated to providing data file integration services, have proliferated over the last few decades. The potential to enrich knowledge rises as data integration complexity increases, and with this, potential pitfalls increase. As data files expand in volume and complexity, problems can compound to negatively influence the quality of the final integrated data. The requirements for careful management of the merging and organisational processes are often underestimated, but imperative for reliable results. Many data files often exhibit considerable noise or meaningless data, missing information and unstructured text. All these problems need to be addressed when integrating data [6].
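The problems listed above, redundant records, noise codes and unstructured text, typically require a cleaning pass before any merge. The following sketch is hypothetical (the column names and the `"N/A"` noise code are assumptions, not taken from the paper) and shows one common pattern with pandas:

```python
import pandas as pd

# Hypothetical raw file exhibiting a duplicate record, a meaningless
# noise code ("N/A") and untrimmed free text.
raw = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "diagnosis": ["  Depression ", "  Depression ", "N/A", None],
})

cleaned = (
    raw.drop_duplicates(subset="id")           # remove redundant records
       .assign(diagnosis=lambda d: d["diagnosis"]
               .str.strip()                    # normalise free text
               .replace({"N/A": None}))        # map noise codes to missing
)
```

Making such steps explicit, rather than cleaning ad hoc inside the merge, keeps the provenance of the final integrated data auditable.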
Methods and results
The DIPIT approach consists of a set of systematic methodological steps (Table 1) to ensure that: the final data are appropriate for the analysis to meet the research objectives; legal and ethical requirements are met; and that data definitions are clear, concise, and comprehensive. This protocol is neither file specific nor software dependent, but aims to be transportable to any data-merging situation to minimise redundancy and error. It aims to facilitate the generation of a master file that
Example use of DIPIT
DIPIT was used to integrate data files from the National Health and Nutrition Examination Surveys (NHANES), a United States population-based cross-sectional study, with the research objective to study selected demographic, examination and laboratory risk factors for depression. Table 3 outlines the tools used at each DIPIT step for the integration of the selected 80 demographic, examination and laboratory data files downloaded from the NHANES website based on the guidelines provided [80].
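The linkage pattern for NHANES component files can be sketched as below. The assumptions here: NHANES files are keyed by the respondent sequence number `SEQN` (real NHANES files are SAS transport `.XPT` files readable with `pandas.read_sas`), and the small synthetic frames with NHANES-style variable names stand in for the 80 downloaded files.

```python
from functools import reduce
import pandas as pd

# Synthetic stand-ins for downloaded NHANES component files, keyed on SEQN.
demo = pd.DataFrame({"SEQN": [101, 102, 103], "RIAGENDR": [1, 2, 2]})
exam = pd.DataFrame({"SEQN": [101, 103], "BMXBMI": [24.1, 31.5]})
lab = pd.DataFrame({"SEQN": [102, 103], "LBXGLU": [98, 110]})

# Fold all component files into a single master file, one row per
# respondent, keeping respondents missing from some components.
master = reduce(
    lambda left, right: left.merge(right, on="SEQN", how="outer"),
    [demo, exam, lab],
)
```

With the real files, the same fold would be applied to the list of frames returned by `pandas.read_sas` for each `.XPT` file, after the per-file preparation steps of the protocol.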
Conclusion
The integration of 80 selected demographic, examination and laboratory data files, downloaded from the NHANES website, was used to highlight how DIPIT ensures that the final data file is appropriate for the research objectives, that legal and ethical requirements are met, that data definitions are clear, concise and comprehensive, and that linkage quality and missing information are identified and addressed.
The linking of a set of heterogeneous files is becoming increasingly common
Competing interests
There are no competing interests with regards to this manuscript.
Authors’ contributions
J.F.D. conceived and designed DIPIT and drafted the manuscript. M.B., F.N.J., L.J.W., S.D. and J.A.P. critically appraised the manuscript. All authors read and approved the final manuscript.
Authors’ Information
J.F.D. is a PhD student with the School of Medicine at Deakin University and sessional academic, lecturing in statistics, with the Department of Statistics, Data Science and Epidemiology at Swinburne University of Technology.
M.B. is currently an NHMRC Senior Principal Research Fellow and Alfred Deakin Chair of Psychiatry at Deakin University, where he heads the IMPACT Strategic Research Centre. He is also an Honorary Professorial Research Fellow in the Department of Psychiatry, the Florey
References (80)
- et al., Energy Econ. (1999)
- et al., Am. Heart J. (2004)
- et al., Int. J. Med. Inform. (2003)
- et al., J. Am. Med. Inform. Assoc. (2009)
- Artif. Intell. Med. (2002)
- et al., Biol. Psychiatry (2006)
- Encycl. Social Meas. (2005)
- J. Zeng, Center for Technology in Government, University at Albany/SUNY, ...
- D. Bollier, C.M. Firestone, The Promise and Peril of Big Data, Aspen Institute, Communications and Society Program ...
- J. Policy Anal. Manage. (1996)
- J. Appl. Psychol.
- Adm. Policy Ment. Health Ment. Health Serv. Res.
- J. Appl. Psychol.
- Pac. Symp. Biocomput.
- Bioinformatics
- Nat. Biotechnol.
- Bioinformation
- Briefings Bioinf.
- Curr. Comput. Aided Drug Des.
- Hum. Genomics Proteomics
- J. Struct. Funct. Genomics
- Stat. Med.
- BMC Med. Res. Methodol.
- BMC Health Serv. Res.
- Bull. World Health Organ.
- Pharmacoepidemiol. Drug Saf.
- Philos. Perspect. Comput. Mediated Commun.
- Commun. ACM
- Artif. Intell. Med.
- Med. J. Aust.
- Hum. Factors
- Emerg. Med. J.
- Stata Data Management: Reference Manual: Release 12
Cited by (14)
Why so GLUMM? Detecting depression clusters through graphing lifestyle-environs using machine-learning methods (GLUMM)
2017, European Psychiatry. Citation excerpt: "Oversampling of subgroups of the population of particular public health interest was performed to increase the reliability and precision of population estimates [28]. Questionnaire data from the NHANES website were downloaded and integrated using the Data Integration Protocol In Ten-steps (DIPIT) [29]. Variables from the questionnaire component of the NHANES study were categorised according to type (i.e. medical or lifestyle-environ) and initially selected if considered a lifestyle-environ factor."
The association between dietary patterns, diabetes and depression
2015, Journal of Affective Disorders. Citation excerpt: "The sampling methodology involved a stratified, multistage probability process, with data collected annually but released in blocks of two years. Relevant NHANES data files were downloaded from the website and integrated using the Data Integration Protocol In Ten-Steps (DIPIT) (Dipnall et al., 2014). Across the two years studied, markers of depressive symptoms were available for 4656 participants, and 4588 of these were identified with or without diabetes and thus were included in this study."
Functional Requirements for Medical Data Integration into Knowledge Management Environments: Requirements Elicitation Approach Based on Systematic Literature Analysis
2023, Journal of Medical Internet Research
Getting RID of the blues: Formulating a Risk Index for Depression (RID) using structural equation modeling
2017, Australian and New Zealand Journal of Psychiatry