Opinion Article

Best practice data life cycle approaches for the life sciences

[version 1; peer review: 2 approved with reservations]
PUBLISHED 31 Aug 2017


Abstract

Throughout history, the life sciences have been revolutionised by technological advances; in our era this is manifested by advances in instrumentation for data generation, and consequently researchers now routinely handle large amounts of heterogeneous data in digital formats. The simultaneous transitions towards biology as a data science and towards a ‘life cycle’ view of research data pose new challenges. Researchers face a bewildering landscape of data management requirements, recommendations and regulations, without necessarily being able to access data management training or possessing a clear understanding of practical approaches that can assist in data management in their particular research domain.

Here we provide an overview of best practice data life cycle approaches for researchers in the life sciences/bioinformatics space with a particular focus on ‘omics’ datasets and computer-based data processing and analysis. We discuss the different stages of the data life cycle and provide practical suggestions for useful tools and resources to improve data management practices.

Keywords

data sharing, data management, open science, bioinformatics, reproducibility

Introduction

Technological data production capacity is revolutionising biology1, but is not necessarily correlated with the ability to efficiently analyse and integrate data, or with enabling long-term data sharing and reuse. There are selfish as well as altruistic benefits to making research data reusable2: it allows one to find and reuse one’s own previously-generated data easily; it is associated with higher citation rates3,4; and it ensures eligibility for funding from, and publication in, venues that mandate data sharing, an increasingly common requirement (e.g. the final NIH statement on sharing research data, the Wellcome Trust policy on data management and sharing, and the Bill & Melinda Gates Foundation open access policy). Currently we are losing data at a rapid rate, with up to 80% unavailable after 20 years5. This affects reproducibility - assessing the robustness of scientific conclusions by ensuring experiments and findings can be reproduced - which underpins the scientific method. Once access to the underlying data is lost, replicability, reproducibility and extensibility6 are reduced.

At a broader societal level, the full value of research data may go beyond the initial use case in unforeseen ways7,8, so ensuring data quality and reusability is crucial to realising its potential value9–12. The recent publication of the FAIR principles9,13 identifies four key criteria for high-quality research data: the data should be Findable, Accessible, Interoperable and Reusable. Whereas a traditional view of data focuses only on collecting, processing and analysing data and publishing results, a life cycle view reveals the additional importance of finding, storing and sharing data11. Throughout this article, we present a researcher-focused data life cycle framework that has commonalities with other published frameworks (e.g. the DataONE Data Life Cycle, the US Geological Survey science data lifecycle model, and refs 11,14,15), but is aimed specifically at life science researchers (Figure 1).


Figure 1. The Data Life Cycle framework for bioscience, biomedical and bioinformatics data that is discussed throughout this article.

Black arrows indicate the ‘traditional’, linear view of research data; the green arrows show the steps necessary for data reusability. This framework is likely to be a simplified representation of any given research project, and in practice there would be numerous ‘feedback loops’ and revisiting of previous stages. In addition, the publishing stage can occur at several points in the data life cycle.

Learning how to find, store and share research data is not typically an explicit part of undergraduate or postgraduate training in the biological sciences16–18. The scope, size and complexity of datasets in many fields have increased dramatically over the last 10–20 years, but the knowledge of how to manage this data is currently limited to specific cohorts of ‘information managers’ (e.g. research data managers, research librarians, database curators and IT professionals with expertise in databases and data schemas18). In response to institutional and funding requirements around data availability, a number of tools and educational programs have been developed to help researchers create Data Management Plans that address elements of the data lifecycle19; however, even when a plan is mandated, there is often a gap between the plan and the actions of the researcher10.

During the week of 24–28 October 2016, EMBL Australia Bioinformatics Resource (EMBL-ABR)20 led workshops on the data life cycle for life science researchers working in the plant, animal, microbial and medical domains. The workshops provided opportunities to (i) map the current approaches to the data life cycle in biology and bioinformatics, and (ii) present and discuss best practice approaches and standards for key international projects with Australian life scientists and bioinformaticians. Discussions during these workshops have informed this publication, which targets life science researchers wanting to improve their data management practice; throughout we highlight some specific data management challenges mentioned by participants.

An earlier version of this article can be found on bioRxiv (https://doi.org/10.1101/167619).

Finding data

In biology, research data is frequently published as supplementary material to articles, on personal or institutional websites, or in non-discipline-specific repositories like Figshare and Dryad21. In such cases, data may exist behind a paywall; there is no guarantee it will remain extant; and, unless one already knows that it exists and its exact location, it may remain undiscovered22. It is only when a dataset is added to a public data repository, along with accompanying standardised descriptive metadata (see Collecting data), that it can be indexed and made publicly available23. Data repositories also provide unique identifiers that increase findability by enabling persistent linking from other locations and permanent association between data and its metadata.
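Persistent linking of the kind described above is often built on compact, resolvable identifiers. The short sketch below shows the idea using the identifiers.org resolution pattern (prefix:accession); the specific prefixes are illustrative assumptions and should be checked against the identifiers.org registry.

```python
# Sketch: turning repository accessions into persistent, resolvable links
# using identifiers.org 'compact identifiers' (prefix:accession). The
# specific prefixes below are illustrative assumptions; the registry at
# identifiers.org defines the authoritative prefix list.

def compact_identifier_url(prefix: str, accession: str) -> str:
    """Build a resolvable identifiers.org URL for a repository record."""
    return f"https://identifiers.org/{prefix}:{accession}"

print(compact_identifier_url("pdb", "2gc4"))
# https://identifiers.org/pdb:2gc4
```

Because the identifier travels with the record rather than with any one website, such links remain stable even if the hosting repository reorganises its pages.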

In the field of molecular biology, a number of bioinformatics-relevant organisations host public data repositories. National and international organisations of this kind include the European Bioinformatics Institute (EMBL-EBI)24, the National Center for Biotechnology Information (NCBI)25, the DNA Data Bank of Japan (DDBJ)26, the Swiss Institute of Bioinformatics (SIB)27, and the four data center members of the worldwide Protein Data Bank28, which mirror their shared data with regular, frequent updates. This shared central infrastructure is hugely valuable to research and development: EMBL-EBI resources, for example, have been valued at over £270 million per year and contribute to ~£1 billion in research efficiencies, a 20-fold return on investment29.

Numerous repositories are available for biological data (see Table 1 for an overview), though repositories are still lacking for some data types and sub-domains30. Many specialised data repositories exist outside the shared central infrastructure mentioned above, often run voluntarily or with minimal funding; support for biocuration, hosting and maintenance of these smaller-scale but key resources is a pressing problem31–33. The variable quality of user-submitted data in public repositories34,35 can mean that public datasets require extra curation before reuse. Unfortunately, due to low uptake of the established methods for correcting such data35 (see the EMBL-EBI and NCBI third-party annotation policies, and ref. 36), the results of this extra curation may not find their way back into the repositories. Repositories are often not easily searched by generic web search engines30. Registries, which form a secondary layer linking multiple primary repositories, may offer a more convenient way to search across repositories for data relevant to a researcher’s topics of interest37.

Table 1. Overview of some representative databases, registries and other tools to find life science data.

- Gene Ontology (database): Repository of functional roles of gene products, including proteins, ncRNAs and complexes. Data types: functional roles as determined experimentally or through inference, including evidence for these roles and links to the literature. URL: http://geneontology.org/
- Kyoto Encyclopedia of Genes and Genomes (KEGG) (database): Repository for pathway relationships of molecules, genes and cells, especially molecular networks. Data types: protein, gene, cell and genome pathway membership data. URL: http://www.genome.jp/kegg/
- OrthoDB (database): Repository for gene ortholog information. Data types: protein sequences and orthologous group annotations for evolutionarily related species groups. URL: http://www.orthodb.org/
- eggNOG (database with analysis layer): Repository for gene ortholog information with a functional annotation prediction tool. Data types: protein sequences, orthologous group annotations and phylogenetic trees for evolutionarily related species groups. URL: http://eggnogdb.embl.de/
- European Nucleotide Archive (ENA) (database): Repository for nucleotide sequence information. Data types: raw next-generation sequencing data, genome assembly and annotation data. URL: http://www.ebi.ac.uk/ena
- Sequence Read Archive (SRA) (database): Repository for nucleotide sequence information. Data types: raw high-throughput DNA sequencing and alignment data. URL: https://www.ncbi.nlm.nih.gov/sra/
- GenBank (database): Repository for nucleotide sequence information. Data types: annotated DNA sequences. URL: https://www.ncbi.nlm.nih.gov/genbank/
- ArrayExpress (database): Repository for genomic expression data. Data types: RNA-seq, microarray, ChIP-seq, Bisulfite-seq and more (see https://www.ebi.ac.uk/arrayexpress/help/experiment_types.html for the full list). URL: https://www.ebi.ac.uk/arrayexpress/
- Gene Expression Omnibus (GEO) (database): Repository for genetic/genomic expression data. Data types: RNA-seq, microarray and real-time PCR data on gene expression. URL: https://www.ncbi.nlm.nih.gov/geo/
- PRIDE (database): Repository for proteomics data. Data types: protein and peptide identifications, post-translational modifications and supporting spectral evidence. URL: https://www.ebi.ac.uk/pride/archive/
- Protein Data Bank (PDB) (database): Repository for protein structure information. Data types: 3D structures of proteins, nucleic acids and complexes. URL: https://www.wwpdb.org/
- MetaboLights (database): Repository for metabolomics experiments and derived information. Data types: metabolite structures, reference spectra and biological characteristics; raw and processed metabolite profiles. URL: http://www.ebi.ac.uk/metabolights/
- ChEBI (ontology/database): Ontology and repository for chemical entities. Data types: small molecule structures and chemical properties. URL: https://www.ebi.ac.uk/chebi/
- Taxonomy (database): Repository of taxonomic classification information. Data types: taxonomic classification and nomenclature data for organisms in public NCBI databases. URL: https://www.ncbi.nlm.nih.gov/taxonomy
- BioStudies (database): Repository for descriptions of biological studies, with links to data in other databases and publications. Data types: study descriptions and supplementary files. URL: https://www.ebi.ac.uk/biostudies/
- BioSamples (database): Repository for information about biological samples, with links to data generated from these samples in other databases. Data types: sample descriptions. URL: https://www.ebi.ac.uk/biosamples/
- IntAct (database with analysis layer): Repository for molecular interaction information. Data types: molecular interactions and evidence type. URL: http://www.ebi.ac.uk/intact/
- UniProtKB (SwissProt and TrEMBL) (database): Repository for protein sequence and function data; combines curated (UniProtKB/SwissProt) and automatically annotated, uncurated (UniProtKB/TrEMBL) databases. Data types: protein sequences, protein function and evidence type. URL: http://www.uniprot.org/
- European Genome-Phenome Archive (database): Controlled-access repository for sequence and genotype experiments from human participants whose consent agreements authorise data release for specific research use. Data types: raw, processed and/or analysed sequence and genotype data along with phenotype information. URL: https://www.ebi.ac.uk/ega/
- EBI Metagenomics (database with analysis layer): Repository and analysis service for metagenomics and metatranscriptomics data; data is archived in ENA. Data types: next-generation sequencing metagenomic and metatranscriptomic data; metabarcoding (amplicon-based) data. URL: https://www.ebi.ac.uk/metagenomics/
- MG-RAST (database with analysis layer): Repository and analysis service for metagenomics data. Data types: next-generation sequencing metagenomic and metabarcoding (amplicon-based) data. URL: http://metagenomics.anl.gov/
- Omics DI (registry): Registry for dataset discovery that currently spans 11 data repositories: PRIDE, PeptideAtlas, MassIVE, GPMDB, EGA, MetaboLights, Metabolomics Workbench, MetabolomeExpress, GNPS, ArrayExpress and ExpressionAtlas. Data types: genomic, transcriptomic, proteomic and metabolomic data. URL: http://www.omicsdi.org
- DataMed (registry): Registry for biomedical dataset discovery that currently spans 66 data repositories. Data types: genomic, transcriptomic, proteomic, metabolomic, morphology, cell signalling, imaging and other data. URL: https://datamed.org
- Biosharing (registry): Curated registry for biological databases, data standards and policies. Data types: information on databases, standards and policies, including fields of research and usage recommendations by key organisations. URL: https://biosharing.org/
- re3data (registry): Registry for research data repositories across multiple research disciplines. Data types: information on research data repositories, terms of use and research fields. URL: http://www.re3data.org

Collecting data

The most useful data has associated information about its creation, its content and its context - called metadata. If metadata is well structured, uses consistent element names and contains element values with specific descriptions from agreed-upon vocabularies, it enables machine readability, aggregation, integration and tracking across datasets: allowing for Findability, Interoperability and Reusability9,30. One key approach in best-practice metadata collection is to use controlled vocabularies built from ontology terms. Biological ontologies are tools that provide machine-interpretable representations of some aspect of biological reality30,38. They are a way of organising and defining objects (i.e. physical entities or processes), and the relationships between them. Sourcing metadata element values from ontologies ensures that the terms used in metadata are consistent and clearly defined. There are several user-friendly tools available to assist researchers in accessing, using and contributing to ontologies (Table 2).

Table 2. Useful ontology tools to assist in metadata collection.

- Ontology Lookup Service: discover different ontologies and their contents. URL: http://www.ebi.ac.uk/ols/
- OBO Foundry: table of open biomedical ontologies with information on development status, licence and content. URL: http://obofoundry.org/
- Zooma: assign ontology terms using curated mappings. URL: http://www.ebi.ac.uk/spot/zooma/
- Webulous: create new ontology terms easily. URL: https://www.ebi.ac.uk/efo/webulous/
- Ontobee: a linked data server that facilitates ontology data sharing, visualisation and use. URL: http://www.ontobee.org
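The effect of sourcing metadata values from a controlled vocabulary, as described above, can be sketched in a few lines: each free-text label is replaced by an ontology term ID plus its canonical name, and anything outside the vocabulary is rejected. The tiny GO subset below is illustrative only; in practice terms would be looked up via a service such as the Ontology Lookup Service or Zooma.

```python
# Minimal sketch of a controlled-vocabulary check for metadata values.
# The two Gene Ontology terms below are a toy subset for illustration.

ALLOWED_TERMS = {
    "GO:0006915": "apoptotic process",
    "GO:0008150": "biological_process",
}

def annotate(field: str, term_id: str) -> dict:
    """Return a structured metadata entry, or raise if the term is unknown."""
    if term_id not in ALLOWED_TERMS:
        raise ValueError(f"{term_id} is not in the controlled vocabulary")
    return {"field": field, "term_id": term_id,
            "term_name": ALLOWED_TERMS[term_id]}

entry = annotate("biological process studied", "GO:0006915")
print(entry["term_name"])  # apoptotic process
```

Because every value carries a term ID, records annotated this way can be aggregated and compared across datasets without guessing whether two free-text labels mean the same thing.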

Adopting standard data and metadata formats and syntax is critical for compliance with FAIR principles9,23,30,37,39. Biological and biomedical research has been considered an especially challenging research field in this regard, as datatypes are extremely heterogeneous and not all have defined data standards39,40; many existing data standards are complex and therefore difficult to use40, or only informally defined, and therefore subject to variation, misrepresentation, and divergence over time39. Nevertheless, well-established standards exist for a variety of biological data types (Table 3). FAIRsharing is a useful registry of data standards and policies that also indicates the current status of standards for different data types and those recommended by databases and research organisations37.

Table 3. Overview of common standard data formats for ‘omics data.

- Raw DNA/RNA sequence (FASTA, FASTQ, HDF5, SAM/BAM/CRAM): FASTA is a common text format for storing DNA/RNA/protein sequences41, and FASTQ combines base-quality information with the nucleotide sequence42. HDF5 is a newer sequence-read format used by long-read sequencers, e.g. PacBio and Oxford Nanopore (https://support.hdfgroup.org/HDF5/). Raw sequence can also be stored in unaligned SAM/BAM/CRAM format (https://samtools.github.io/hts-specs/). Accepting repositories: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/, http://www.ebi.ac.uk/ena/submit/data-formats
- Assembled DNA sequence (FASTA, flat file, AGP): Assemblies without annotation are generally stored in FASTA format41. Annotation can be integrated with assemblies in contig, scaffold or chromosome flat-file format (http://www.ebi.ac.uk/ena/submit/contig-flat-file, http://www.ebi.ac.uk/ena/submit/scaffold-flat-file). AGP files describe how smaller fragments are placed in an assembly but do not themselves contain the sequence information (https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/). Accepting repositories: http://www.ebi.ac.uk/ena/submit/genomes-sequence-submission
- Aligned DNA sequence (SAM/BAM, CRAM): Sequences aligned to a reference are represented in sequence alignment/map (SAM) format; its binary version is called BAM, and further compression can be achieved with the CRAM format (https://samtools.github.io/hts-specs/). Accepting repositories: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#bam
- Gene model or genomic feature annotation (GTF/GFF/GFF3, BED, GB/GBK): General feature format or general transfer format are commonly used to store genomic features in tab-delimited flat text format. GFF3 is a more advanced version of the basic GFF that allows description of more complex features (https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md). BED is a tab-delimited text format that also allows definition of how a feature should be displayed, e.g. on a genome browser (https://genome.ucsc.edu/FAQ/FAQformat.html). GenBank flat file format (GB/GBK) is also commonly used but not well standardised (https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html). Accepting repositories: http://www.ensembl.org/info/website/upload/gff.html, http://www.ensembl.org/info/website/upload/gff3.html
- Gene functional annotation (GAF; GPAD and RDF will also be available in 2018): A GAF file is a GO Annotation File containing annotations made to the GO by a contributing resource such as FlyBase or PomBase. The GAF standard is also applicable outside of GO, e.g. with other ontologies such as PO. GAF (v2) is a simple tab-delimited file format with 17 columns describing an entity (e.g. a protein), its annotation and some annotation metadata (http://geneontology.org/page/go-annotation-file-format-20). Accepting repositories: http://geneontology.org/page/submitting-go-annotations
- Genetic/genomic variants (VCF): A tab-delimited text format storing meta-information as header lines, followed by information about variant positions in the genome; the current version is VCF4.2 (https://samtools.github.io/hts-specs/VCFv4.2.pdf). Accepting repositories: http://www.ensembl.org/info/website/upload/var.html
- Interaction data (PSI-MI XML, MITAB): Data formats developed to exchange molecular interaction data and related metadata, and to fully describe molecule constructs (http://psidev.info/groups/molecular-interactions). Accepting repositories: http://www.ebi.ac.uk/intact
- Raw metabolite profile (mzML, nmrML): XML-based data formats that define mass spectrometry and nuclear magnetic resonance raw data in metabolomics (http://www.psidev.info/mzml, http://nmrml.org/).
- Protein sequence (FASTA): A text-based format for representing nucleotide or protein sequences, in which nucleotides or amino acids are represented by single-letter codes41. Accepting repositories: http://www.uniprot.org
- Raw proteome profile (mzML): A formally defined XML format for representing mass spectrometry data; files typically contain sequences of mass spectra, plus metadata about the experiment (http://www.psidev.info/mzml). Accepting repositories: http://www.ebi.ac.uk/pride
- Organisms and specimens (Darwin Core): The Darwin Core (DwC) standard facilitates the exchange of information about the geographic location of organisms and associated collection specimens (http://rs.tdwg.org/dwc/).
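The two simplest formats in Table 3 are easy to work with programmatically, which is part of why they remain so widely used. The sketch below reads both; parse_fasta handles multi-line records, while parse_fastq assumes the common four-line record layout (identifier, sequence, '+', per-base qualities) rather than the full, more permissive specification.

```python
# Sketch of reading FASTA and (four-line) FASTQ records.

def parse_fasta(lines):
    """Yield (header, sequence) tuples from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def parse_fastq(lines):
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    it = iter([l.rstrip() for l in lines if l.strip()])
    for ident in it:
        seq, plus, qual = next(it), next(it), next(it)
        yield ident[1:], seq, qual

records = list(parse_fasta([">seq1", "ACGT", "TTGA", ">seq2", "GGCC"]))
print(records[0])  # ('seq1', 'ACGTTTGA')
```

For production work, established libraries (e.g. Biopython) handle the many edge cases of these formats; the point here is only that a standard format makes such tooling possible at all.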

Most public repositories for biological data (see Table 1 and the Storing data section) require that minimum metadata accompany each submitted dataset (Table 4). These minimum metadata specifications typically have broad community input43. However, minimum metadata standards may not include the crucial metadata fields that give the full context of a particular research project43, so it is important to gather metadata early, understand how to extend a minimum metadata template with additional fields in a structured way, and think carefully about all the pieces of metadata that might be required for reuse.

Table 4. Some community-designed minimum information criteria for metadata specifications in life sciences.

- MINSEQE: Minimum Information about a high-throughput SEQuencing Experiment. Developed by the Functional Genomics Data Society; used in the NCBI Sequence Read Archive and ArrayExpress. URL: http://fged.org/site_media/pdf/MINSEQE_1.0.pdf
- MIxS - MIGS/MIMS: Minimum Information about a (Meta)Genome Sequence; the MIMS extension includes key environmental metadata. Developed by the Genomic Standards Consortium; numerous adopters including NCBI/EBI/DDBJ databases. URL: http://wiki.gensc.org/index.php?title=MIGS/MIMS
- MIMARKS: Minimum Information about a MARKer gene Sequence, an extension of MIGS/MIMS for environmental sequences. Developed by the Genomic Standards Consortium; numerous adopters including NCBI/EBI/DDBJ databases. URL: http://wiki.gensc.org/index.php?title=MIMARKS
- MIMIx: Minimum Information about a Molecular Interaction eXperiment. Developed by the Proteomics Standards Initiative; adopted by the IMEx Consortium databases. URL: http://www.psidev.info/mimix
- MIAPE: Minimum Information About a Proteomics Experiment. Developed by the Proteomics Standards Initiative; adopted by the PRIDE, World-2DPAGE and ProteomeXchange databases. URL: http://www.psidev.info/miape
- Metabolomics Standards Initiative (MSI) standards: minimal reporting structures that represent different parts of the metabolomics workflow. Developed by the MSI and the Coordination of Standards in Metabolomics (COSMOS) consortium. URL: http://www.metabolomics-msi.org/
- MIRIAM: Minimal Information Required In the Annotation of Models, for annotation and curation of computational models in biology. Initiated by the BioModels.net effort; adopted by the EBI BioModels database and others. URL: http://co.mbine.org/standards/miriam
- MIAPPE: Minimum Information About a Plant Phenotyping Experiment; covers study, environment, experimental design, sample management, biosource, treatment and phenotype. Adopted by the Plant Phenomics and Genomics Research Data Repository and the Genetic and Genomic Information System (GnpIS). URL: http://cropnet.pl/phenotypes/wp-content/uploads/2016/04/MIAPPE.pdf
- MDM: Minimal Data for Mapping, for sample and experimental metadata for pathogen genome-scale sequence data. Developed by the Global Microbial Identifier Initiative and EBI; complies with EBI ENA database submission requirements. URL: http://www.ebi.ac.uk/ena/submit/pathogen-data
- FAANG sample metadata specification: metadata specification for biological samples derived from animals (animals, tissue samples, cells or other biological materials); complies with EBI database requirements and BioSamples database formats. Developed and used by the Functional Annotation of Animal Genomes Consortium. URL: https://github.com/FAANG/faang-metadata/blob/master/docs/faang_sample_metadata.md
- FAANG experimental metadata specification: metadata specification for sequencing and array experiments on animal samples. Developed and used by the Functional Annotation of Animal Genomes Consortium. URL: https://github.com/FAANG/faang-metadata/blob/master/docs/faang_experiment_metadata.md
- FAANG analysis metadata specification: metadata specification for analysis results (NB: no public repository exists for this specific datatype). Developed and used by the Functional Annotation of Animal Genomes Consortium. URL: https://github.com/FAANG/faang-metadata/blob/master/docs/faang_analysis_metadata.md
- SNOMED-CT: medical terminology and pharmaceutical product standard; a commercial but collaboratively-designed product. URL: http://www.snomed.org/snomed-ct
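Extending a minimum metadata template in a structured way, as recommended above, can be as simple as layering project-specific fields over the community-required ones. In this sketch, the MINSEQE-style field names and the extension keys are illustrative, not the official specification.

```python
# Sketch: extending a minimum-metadata template with structured
# project-specific fields. All field names here are illustrative.

minimum_template = {
    "experiment_description": None,
    "sample_description": None,
    "protocol": None,
    "processed_data": None,
    "raw_reads_accession": None,
}

project_extension = {
    "growth_temperature_celsius": None,  # extra context needed for reuse
    "sampling_timepoint_hours": None,
}

def build_metadata(values: dict) -> dict:
    """Merge values into the extended template, rejecting unknown keys."""
    template = {**minimum_template, **project_extension}
    unknown = set(values) - set(template)
    if unknown:
        raise KeyError(f"fields not in template: {sorted(unknown)}")
    template.update(values)
    return template

record = build_metadata({"protocol": "RNA-seq, TruSeq stranded",
                         "growth_temperature_celsius": 22})
print(record["growth_temperature_celsius"])  # 22
```

Keeping the extension explicit (rather than adding ad hoc keys per dataset) means every record in a project carries the same fields, which is what makes later aggregation and reuse tractable.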

Processing and analysing data

Recording and reporting how research data is processed and analysed computationally is crucial for reproducibility and assessment of research quality1,44. Full reproducibility requires access to the software, software versions, dependencies and operating system used, as well as to the data and the analysis code itself45. Therefore, although computational work is often seen as enabling reproducibility in the short term, in the long term it is fragile and reproducibility is limited (e.g. discussion by D. Katz, K. Hinsen and C.T. Brown). Best-practice approaches for preserving data processing and analysis code involve hosting source code in a repository where it receives a unique identifier and is under version control, and where it is open, accessible, interoperable and reusable - broadly mapping to the FAIR principles for data. GitHub and Bitbucket, for example, fulfil these criteria, and Zenodo additionally generates Digital Object Identifiers (DOIs) for submissions and guarantees long-term archiving. Several recent publications have suggested ways to improve current practice in research software development15,46–48.

The same points hold for wet-lab data production: for full reproducibility, it is important to capture and enable access to specimen cell lines, tissue samples and/or DNA, as well as reagents. Wet-lab methods can be captured in electronic laboratory notebooks and reported in the BioSamples database49, protocols.io or OpenWetWare; specimens can be lodged in biobanks, culture or museum collections50–54; but the effort involved in enabling full reproducibility remains extensive. Electronic laboratory notebooks are frequently suggested as a sensible way to make this information openly available and archived55. Some partial solutions exist (e.g. LabTrove, BlogMyData, Benchling and others56), including tools for specific domains such as the Scratchpad Virtual Research Environment for natural history research57. Other tools can act as, or be combined to produce, notebooks for small standalone code-based projects (Boettiger, 201758 and update), including Jupyter Notebook, Rmarkdown and Docker. However, it remains a challenge to implement online laboratory notebooks that cover both field/lab work and computer-based work, especially when the computer work is extensive, involved and non-modular44. Currently, no best-practice guidelines or minimum information standards exist for the use of electronic laboratory notebooks6. We suggest that appropriate minimum information to record for most computer-based tasks includes the date, task name and brief description, aim, actual command(s) used, software names and versions, input/output file names and locations, and script names and locations.
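The minimum information suggested above can be captured mechanically, for example as one JSON entry per task appended to a project log. The field names in this sketch follow our suggested list but are not an established standard.

```python
# Sketch: recording the suggested minimum information for a computer-based
# task (date, task, aim, commands, software versions, inputs/outputs,
# scripts) as a JSON log entry. Field names are our suggestion only.

import json
import datetime
import platform

def log_task(task, aim, commands, software, inputs, outputs, scripts):
    entry = {
        "date": datetime.date.today().isoformat(),
        "task": task,
        "aim": aim,
        "commands": commands,
        "software_versions": software,        # e.g. {"bwa": "0.7.17"}
        "input_files": inputs,
        "output_files": outputs,
        "scripts": scripts,
        "python": platform.python_version(),  # captured automatically
    }
    return json.dumps(entry, indent=2)

print(log_task(
    task="read alignment",
    aim="align trimmed reads to the reference genome",
    commands=["bwa mem ref.fa reads.fq > aln.sam"],
    software={"bwa": "0.7.17"},
    inputs=["ref.fa", "reads.fq"],
    outputs=["aln.sam"],
    scripts=["align.sh"],
))
```

Because each entry is machine-readable, such a log can later be searched, summarised, or converted into the methods section of a manuscript with minimal manual effort.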

During the EMBL-ABR workshop series, participants identified the data processing and analysis stage as one of the most challenging for openness. A few participants had put intensive individual effort into developing custom online lab (and code) notebook approaches, but the majority had little awareness of this as a useful goal. This suggests a gap between modern biological research as a field of data science, and biology as it is still mostly taught in undergraduate courses, with little or no focus on computational analysis, or on project or data management. As reported elsewhere16–18, this gap has left researchers lacking key knowledge and skills required to implement best practices across the life cycle of their data.

Publishing data

Traditionally, scientific publications included raw research data, but in recent times datasets have grown beyond the scope of practical inclusion in a manuscript11,44. Selected data outputs are often included without sharing or publishing the underlying raw data14. Journals increasingly recommend or require deposition of raw data in a public repository [e.g. 59], although exceptions have been made for publications containing commercially-relevant data60. The current data-sharing mandate is somewhat field-dependent5,61 and also varies within fields62. For example, in the field of bioinformatics, the UPSIDE principle63 is referred to by some journals (e.g. Bioinformatics), while others have journal- or publisher-specific policies (e.g. BMC Bioinformatics).

The vast majority of scientific journals require inclusion of processing and analysis methods in ‘sufficient detail for reproduction’ (e.g. the Public Library of Science submission and data availability guidelines; the International Committee of Medical Journal Editors manuscript preparation guidelines; Science instructions for authors; Elsevier Cell Press STAR Methods; and ref. 64), though journal requirements are diverse and complex65, and the level of detail authors provide can vary greatly in practice66,67. More recently, many authors have highlighted that full reproducibility requires sharing data and resources at all stages of the scientific process, from raw data (including biological samples) to full methods and analysis workflows1,6,53,67. However, this remains a challenge68,69, as discussed in the Processing and analysing data section. To our knowledge, strategies for enabling computational reproducibility are currently not mandated by any scientific journal.

A recent development in the field of scientific publishing is the establishment of ‘data journals’: scientific journals that publish papers describing datasets. This gives authors a vehicle to accrue citations (still a dominant metric of academic impact) for data production alone, which can often be labour-intensive and expensive yet is typically not well recognised under the traditional publishing model. Examples of this article type include the Data Descriptor in Scientific Data and the Data Note in GigaScience, which do not include detailed new analysis but rather focus on describing and enabling reuse of datasets.

The movement towards sharing research publications themselves (‘Open Access Publishing’) has been discussed extensively elsewhere [e.g. 22,70,71]. Publications have associated metadata (creator, date, title etc.; see Dublin Core Metadata Initiative metadata terms) and unique identifiers (PubMed ID for biomedical and some life science journals, DOIs for the vast majority of journals; see Table 5). The ORCID system enables researchers to claim their own unique identifier, which can be linked to their publications. The use of unique identifiers within publications referring to repository records (e.g. genes, proteins, chemical entities) is not generally mandated by journals, although it would ensure a common vocabulary is used and so make scientific results more interoperable and reusable72. Some efforts are underway to make this easier for researchers: for example, Genetics and other Genetics Society of America journals assist authors in linking gene names to model organism database entries.

Table 5. Identifiers throughout the data life cycle.

| Name | Relevant stage of data life cycle | Description | URL |
| --- | --- | --- | --- |
| Digital Object Identifier (DOI) | Publishing, Sharing, Finding | A unique identifier for a digital (or physical or abstract) object | https://www.doi.org/ |
| Open Researcher and Contributor ID (ORCID) | Publishing | An identifier for a specific researcher that persists across publications and other research outputs | https://orcid.org/ |
| Repository accession number | Finding, Processing/Analyzing, Publishing, Sharing, Storing | A unique identifier for a record within a repository. Format will be repository-specific. Examples include NIH UIDs (unique identifiers) and accession numbers; ENA accession numbers; PDB IDs | For example, https://support.ncbi.nlm.nih.gov/link/portal/28045/28049/Article/499/ and http://www.ebi.ac.uk/ena/submit/accession-number-formats |
| PubMed ID (PMID) | Publishing | An example of a repository-specific unique identifier: PubMed IDs are used for research publications indexed in the PubMed database | https://www.ncbi.nlm.nih.gov/pubmed/ |
| International Standard Serial Number (ISSN) | Publishing | A unique identifier for a journal, magazine or periodical | http://www.issn.org/ |
| International Standard Book Number (ISBN) | Publishing | A unique identifier for a book, specific to the title, edition and format | https://www.isbn-international.org |

Storing data

While primary data archives are the best location for raw data and some downstream data outputs (Table 1), researchers also need local data storage solutions during the processing and analysis stages. Data storage requirements vary among research domains, with major challenges often evident for groups working on taxa with large genomes (e.g. crop plants), which require large storage resources, or on human data, where privacy regulations may require local data storage, access controls and conversion to non-identifiable data if data is to be shared (see the Australian National Data Service de-identification guide, the National Health and Medical Research Council statement on ethical conduct in human research, and the Australian National Medical Research Storage Facility discussion paper on legal, best practice and security frameworks). In addition, long-term preservation of research data should consider threats such as storage failure, mistaken erasure, bit rot, outdated media, outdated formats, loss of context and organisational failure73.
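Several of these threats (storage failure, bit rot, mistaken erasure) can be detected early by keeping fixity information alongside locally stored data. A minimal sketch that writes a SHA-256 checksum manifest for a data directory (the manifest layout and file name are illustrative, not a preservation standard):

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files do not exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(data_dir: Path, manifest: Path) -> None:
    """Record a checksum for every file under data_dir.
    Re-running the hashes later and comparing against the manifest
    reveals silent corruption or accidental deletion."""
    checksums = {str(p.relative_to(data_dir)): sha256_of(p)
                 for p in sorted(data_dir.rglob("*")) if p.is_file()}
    manifest.write_text(json.dumps(checksums, indent=2))
```

Scheduling periodic re-verification against such a manifest is a lightweight complement to, not a replacement for, proper backup and archival strategies.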

Sharing data

The best-practice approach to sharing biological data is to deposit it (with associated metadata) in a primary archive suitable for that datatype8 that complies with FAIR principles. As highlighted in the Storing data section, these archives have both data storage and public sharing as their core mission, making them the most reliable location for long-term data storage. Alternative data sharing venues (e.g. FigShare, Dryad) do not require or implement specific metadata or data standards. This means that while these venues have a low barrier to entry for submitters, the data is not FAIR unless submitters have independently decided to comply with more stringent criteria. If there is no suitable primary archive for a datatype, an institutional repository, where available, may be a good option. Importantly, plans for data sharing should be made at the start of a research project and reviewed during the project, to ensure ethical approval is in place and that the resources and metadata needed for effective sharing are available at earlier stages of the data life cycle3.
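To make the "associated metadata" above concrete: even a small machine-readable record using Dublin Core terms improves findability when a repository does not impose its own schema. A sketch of such a record (all field values below are placeholders, and the choice of JSON is ours; many repositories accept or require other serialisations):

```python
import json

# A minimal dataset description using Dublin Core terms
# (see the Dublin Core Metadata Initiative metadata terms, http://purl.org/dc/terms/).
# Every value here is a placeholder to be replaced for a real deposit.
record = {
    "dcterms:title": "Example transcriptomics dataset (placeholder title)",
    "dcterms:creator": ["Surname, Given"],
    "dcterms:created": "YYYY-MM-DD",
    "dcterms:identifier": "(DOI or accession assigned by the repository)",
    "dcterms:license": "https://creativecommons.org/licenses/by/4.0/",
    "dcterms:description": "Free-text description of how the data were generated.",
}

# Serialise for upload alongside the data files.
print(json.dumps(record, indent=2))
```

Even this minimal set of fields (title, creator, date, identifier, licence, description) covers the core metadata most sharing venues ask for, and is trivial to extend with domain-specific terms.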

During the EMBL-ABR workshop series, the majority of participants were familiar with at least some public primary data repositories, and many had submitted data to them previously. A common complaint concerned the usability of current data submission tools and a lack of transparency around metadata requirements and the rationale behind them. A few workshop participants raised specific issues about the potential limitations of public data repositories where their data departed from the assumptions of the repository (e.g. unusual gene models supported by experimental evidence that were rejected by the automated NCBI curation system). Most workshop participants were unaware they could provide feedback to the repositories to deal with such situations, and this could also be made clearer on the repository websites. Again, this points in part to existing limitations in the undergraduate and postgraduate training received by researchers, where the concepts discussed in this article are treated as afterthoughts, if at all. On the repository side, while there is a lot of useful information and training material available to guide researchers through the submission process (e.g. the EMBL-EBI Train Online webinars and online training modules), it is not always linked clearly from the database portals or submission pages themselves. Similarly, while there are specifications and standards available for many kinds of metadata [Table 4; also see FAIRsharing], many do not have example templates available, which would assist researchers in implementing the standards in practice.

What can the research community do to encourage best-practice?

We believe that the biological/biomedical community and individual researchers have a responsibility to the public to help advance knowledge by making research data FAIR for reuse9, especially if the data were generated using public funding. There are several steps that can assist in this mission:

1. Senior scientists should lead by example and ensure all the data generated by their laboratories is well-managed, fully annotated with the appropriate metadata and made publicly available in an appropriate repository.

2. The importance of data management and the benefits of data reuse should be taught at the undergraduate and postgraduate levels18. Computational biology and bioinformatics courses in particular should include material about data repositories, data and metadata standards, and data discovery and access strategies. Material should be domain-specific enough for students to attain learning outcomes directly relevant to their research field.

3. Funding bodies are already taking a lead role in this area by requiring the incorporation of a data management plan into grant applications. A next step would be a formal check, at the end of the grant period, that this plan has been adhered to and that the data are available in an appropriate format for reuse10.

4. Funding bodies and research institutions should treat the generation of high-quality datasets as a valued metric when evaluating grant or promotion applications.

5. Similarly, leadership of and participation in community efforts on data and metadata standards, and in open software and workflow development, should be recognised as academic outputs.

6. Data repositories should ensure that the data deposition and third-party annotation processes are as FAIR and painless as possible for the novice researcher, without the need for extensive bioinformatics support35.

7. Journals should require editors and reviewers to check manuscripts to ensure that all data, including research software code and samples where appropriate, have been made publicly available in an appropriate repository, and that methods have been described in enough detail to allow reuse and meaningful reanalysis8.

8. Finally, researchers reusing any data should openly acknowledge this fact and fully cite the dataset, including unique identifiers8,10,30.

Conclusions

While the concept of a life cycle for research data is appealing from an Open Science perspective, challenges remain for life science researchers to put this into practice. During the EMBL-ABR Data Life Cycle workshop series, we noted limited awareness among attendees of the resources available to researchers that assist in finding, collecting, processing, analysing, publishing, storing and sharing FAIR data. We believe this article provides a useful overview of the relevant concepts and an introduction to key organisations, resources and guidelines to help researchers improve their data management practices.

Furthermore, we note that data management in the era of biology as a data science is a complex and evolving topic, and both best practices and challenges are highly domain-specific, even within the life sciences. This factor may not always be appreciated at the organisational level, but it has major practical implications for the quality and interoperability of shared life science data. Finally, domain-specific education and training in data management would be of great value to the life science research workforce, and we note an existing gap in this area at the undergraduate, postgraduate and short-course levels.

How to cite this article: Griffin PC, Khadake J, LeMay KS et al. Best practice data life cycle approaches for the life sciences [version 1; peer review: 2 approved with reservations]. F1000Research 2017, 6:1618 (https://doi.org/10.12688/f1000research.12344.1)

Open Peer Review
Reviewer Report 15 Dec 2017
Sven Nahnsen, Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany
Status: Approved with Reservations
The article "Best practice data life cycle approaches for the life sciences", submitted by Griffin et al. reports opinions on how to best manage the growing complexity of scientific data in the life sciences. […]
How to cite this report: Nahnsen S. Reviewer Report For: Best practice data life cycle approaches for the life sciences [version 1; peer review: 2 approved with reservations]. F1000Research 2017, 6:1618 (https://doi.org/10.5256/f1000research.13366.r27113)
Reader Comment 04 Jun 2018
Pip Griffin, The University of Melbourne, Australia
Response to Review 1: Thank you very much to Dr. Nahnsen for his review. We have responded to his comments below (reviewer comments in italics, our responses in plain text). […]
Reviewer Report 21 Nov 2017
Johannes Starlinger, Department of Anesthesiology and Operative Intensive Care Medicine, Charité – Universitätsmedizin Berlin, Berlin, Germany; Department of Computer Science, Humboldt-Universität zu Berlin, Berlin, Germany
Status: Approved with Reservations
The article gives a brief overview of the data life cycle in the life sciences and offers an entry point for accessing relevant information about current approaches to increasing compliance with the FAIR data sharing principles at each step of […]
How to cite this report: Starlinger J. Reviewer Report For: Best practice data life cycle approaches for the life sciences [version 1; peer review: 2 approved with reservations]. F1000Research 2017, 6:1618 (https://doi.org/10.5256/f1000research.13366.r27111)
Reader Comment 04 Jun 2018
Pip Griffin, The University of Melbourne, Australia
Response to Review 2: We thank Dr. Starlinger for his review and respond to his comments below (reviewer comments in italics, our responses in plain text). […]