Review
Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST): A web tool for generating protein sequence similarity networks

https://doi.org/10.1016/j.bbapap.2015.04.015

Highlights

  • Sequence–function space can be visualized using protein sequence similarity networks.

  • The EFI-EST webtool is available for generating sequence similarity networks.

  • A tutorial is provided that describes the use of EFI-EST.

  • The community is encouraged to use EFI-EST without cost.

Abstract

The Enzyme Function Initiative, an NIH/NIGMS-supported Large-Scale Collaborative Project (EFI; U54GM093342; http://enzymefunction.org/), is focused on devising and disseminating bioinformatics and computational tools as well as experimental strategies for the prediction and assignment of functions (in vitro activities and in vivo physiological/metabolic roles) to uncharacterized enzymes discovered in genome projects. Protein sequence similarity networks (SSNs) are visually powerful tools for analyzing sequence relationships in protein families (H.J. Atkinson, J.H. Morris, T.E. Ferrin, and P.C. Babbitt, PLoS One 2009, 4, e4345). However, the members of the biological/biomedical community have not had access to the capability to generate SSNs for their “favorite” protein families. In this article we announce the EFI-EST (Enzyme Function Initiative-Enzyme Similarity Tool) web tool (http://efi.igb.illinois.edu/efi-est/) that is available without cost for the automated generation of SSNs by the community. The tool can create SSNs for the “closest neighbors” of a user-supplied protein sequence from the UniProt database (Option A) or of members of any user-supplied Pfam and/or InterPro family (Option B). We provide an introduction to SSNs, a description of EFI-EST, and a demonstration of the use of EFI-EST to explore sequence–function space in the OMP decarboxylase superfamily (PF00215). This article is designed as a tutorial that will allow members of the community to use the EFI-EST web tool for exploring sequence/function space in protein families.

Section snippets

Introduction: the functional assignment challenge

The identities and functions of the complete set of proteins encoded by a genome should allow a comprehensive understanding of the physiology of the organism. However, a conservative estimate is that only ~50% of the proteins discovered in genome projects have reliable functional annotations in the sequence databases; the remainder have unknown, uncertain, or incorrect functional annotations (and the identities of these are unknown!) [1], [2]. Genome projects should provide information of

Sequence similarity networks (SSNs)

Dendrograms and trees (Fig. 2A) are the most common tools for surveying sequence–function space in enzyme families. However, their construction and interpretation are computationally intensive and require an accurate sequence alignment that is difficult to achieve on a large scale. Babbitt and coworkers described the use of protein sequence similarity networks (SSNs) as an “easy to compute” alternative method for assessing sequence relationships within enzyme families [12]. In an SSN (Fig. 2B),

The value of SSNs and how to use them

An SSN is a visual aid that allows a user to segregate a functionally diverse superfamily (different substrate specificities and/or reaction mechanisms) into putative isofunctional groups [12]. At small alignment scores (low sequence identity), most of the nodes in the SSN for a homologous family will be connected to one another by edges resulting in a single large cluster (“hairball”, Fig. 4A). As the alignment score used to draw edges is increased (the sequence identity is increased), edges
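The effect of the alignment score threshold on clustering can be illustrated with a minimal sketch. It assumes a toy set of five sequences with made-up pairwise alignment scores (the names `nodes`, `pairs`, and `count_clusters` are hypothetical and are not part of EFI-EST). An edge is drawn only when a pair's score meets the threshold, and the number of connected clusters is then counted with a simple union-find.

```python
def count_clusters(nodes, scored_pairs, threshold):
    """Count connected clusters when an edge is drawn only for
    pairs whose alignment score is >= threshold."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters
    return len({find(n) for n in nodes})


# Hypothetical toy data: (sequence, sequence, alignment score).
nodes = ["A", "B", "C", "D", "E"]
pairs = [
    ("A", "B", 80),
    ("B", "C", 75),
    ("C", "D", 30),
    ("D", "E", 90),
    ("A", "D", 20),
]

for threshold in (10, 50, 85):
    print(threshold, count_clusters(nodes, pairs, threshold))
```

With these toy scores, the lowest threshold leaves a single cluster, and raising the threshold progressively splits it into smaller groups, which is the behavior used to look for putative isofunctional clusters.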

Representative node networks

The number of nodes in an SSN (N) is the number of sequences in the family; the number of edges connecting the nodes varies with the alignment score and is ≤ [N × (N − 1)] / 2. The memory available to Cytoscape 3.2, which is used to visualize SSNs, limits the number of edges that can be displayed: with 4 GB RAM, an SSN with up to ~500,000 edges can be opened and manipulated; with 64 GB RAM, an SSN with up to ~5,000,000 edges can be opened and manipulated.
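As a rough worked example of these limits, the sketch below computes the N × (N − 1) / 2 upper bound on edges and finds the largest family size whose fully connected network would still fit under the approximate edge ceilings quoted above. The function names are hypothetical, and the ceilings are taken as the approximate figures in the text, not as exact Cytoscape limits.

```python
def max_edges(n_nodes: int) -> int:
    """Upper bound on edges among n_nodes sequences: N * (N - 1) / 2."""
    return n_nodes * (n_nodes - 1) // 2


def largest_full_network(edge_limit: int) -> int:
    """Largest N whose fully connected network has <= edge_limit edges."""
    n = 1
    while max_edges(n + 1) <= edge_limit:
        n += 1
    return n


# Approximate edge ceilings quoted in the text (GB of RAM -> edges).
EDGE_LIMITS = {4: 500_000, 64: 5_000_000}

for ram_gb, limit in EDGE_LIMITS.items():
    n = largest_full_network(limit)
    print(f"{ram_gb} GB RAM: worst case fits up to N = {n} sequences")
```

In the worst case, where every pair of sequences is connected, roughly a thousand sequences already reach the 4 GB ceiling. Real SSNs drawn at a useful alignment score have far fewer edges than this bound, so larger families can usually be opened.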

The edge/node ratio in an SSN is determined by the degree of

Sequences and node attributes used by EFI-EST

EFI-EST uses sequences from the UniProt database and their associated descriptions (node attributes; vide infra) from the UniProtKB (http://www.uniprot.org/uniprot/). The EFI's choice of UniProt, instead of GenBank (http://www.ncbi.nlm.nih.gov/genbank) [23], recognizes the ability of any member of the community to update or correct an annotation in UniProtKB based on experimental evidence; in contrast, GenBank is an archive—the annotations can be changed only by the depositor of the sequence.

Protein families/domains described by Pfam and InterPro

EFI-EST provides the user with two options for generating SSNs:

  • 1) Option A to explore local sequence–function space defined by a user-specified sequence, often resulting in a small fraction of the membership of a Pfam and/or InterPro entry. The user either has the sequence or can find it in UniProt (or GenBank).

  • 2) Option B to generate the SSN for any Pfam or InterPro entry (or combination of Pfam and/or InterPro entries that populate a homologous protein family). The InterPro entry identifiers

Generating SSNs with EFI-EST

The following subsections describe the steps involved in generating the SSN; Section 8 illustrates the use of these steps.

Example: generation, visualization, and analysis of the SSN for the OMP decarboxylase superfamily (Pfam Entry PF00215)

For the remainder of this review, the OMP decarboxylase superfamily (Pfam entry PF00215) is used to illustrate the use of both Options A and B. The OMP decarboxylase (OMPDC) superfamily is functionally diverse with three characterized reactions (Fig. 13): 1) OMPDC in pyrimidine nucleotide biosynthesis; 2) 3-keto-L-gulonate 6-monophosphate decarboxylase (KGPDC) in L-ascorbate catabolism [24]; and 3) D-arabino-hex-3-ulose synthase (HUMPS) in two metabolic contexts, detoxification of

Summary

Assignment of in vitro enzymatic activities and in vivo metabolic (physiological) functions to uncharacterized enzymes discovered in genome projects is a major challenge confronting many segments of the biological community. The identification of isofunctional clusters is the first step in exploring sequence–function space in enzyme families and devising strategies to determine the functions in unexplored space.

This review describes the use of the EFI-EST web tool to facilitate analysis of

Acknowledgements

This work was supported by the NIH U54GM093342. The authors acknowledge Gabriel Horton (UIUC) for web design and thank the HPCBio group (UIUC) and Drs. Suwen Zhao (UCSF), Matthew P. Jacobson (UCSF), Michael Carter (UIUC), and Brian San Francisco (UIUC) for their helpful discussions.

References (52)

  • W. Tian et al.

    How well is enzyme function conserved as a function of pairwise sequence identity?

    J. Mol. Biol.

    (2003)
  • S.D. Brown et al.

    Inference of functional properties from large-scale analysis of enzyme superfamilies

    J. Biol. Chem.

    (2012)
  • J.A. Gerlt et al.

    Divergent evolution in enolase superfamily: strategies for assigning functions

    J. Biol. Chem.

    (2012)
  • A.M. Schnoes et al.

    Annotation error in public databases: misannotation of molecular function in enzyme superfamilies

    PLoS Comput. Biol.

    (2009)
  • R. Caspi et al.

    The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases

    Nucleic Acids Res.

    (2014)
  • UniProt Consortium

    UniProt: a hub for protein information

    Nucleic Acids Res.

    (2015)
  • S. Zhao et al.

    Prediction and characterization of enzymatic activities guided by sequence similarity and genome neighborhood networks

    Elife

    (2014)
  • J.C. Hermann et al.

    Predicting substrates by docking high-energy intermediates to enzyme structures

    J. Am. Chem. Soc.

    (2006)
  • L. Song et al.

    Prediction and assignment of function for a divergent N-succinyl amino acid racemase

    Nat. Chem. Biol.

    (2007)
  • C. Kalyanaraman et al.

    Studying enzyme-substrate specificity in silico: a case study of the Escherichia coli glycolysis pathway

    Biochemistry

    (2010)
  • S. Zhao et al.

    Discovery of new enzymes and metabolic pathways by using structure and genome context

    Nature

    (2013)
  • J.A. Gerlt et al.

    The Enzyme Function Initiative

    Biochemistry

    (2011)
  • R.D. Finn et al.

    Pfam: the protein families database

    Nucleic Acids Res.

    (2014)
  • H.J. Atkinson et al.

    Using sequence similarity networks for visualization of relationships across diverse protein superfamilies

    PLoS One

    (2009)
  • D. Peterhoff et al.

    A comprehensive analysis of the geranylgeranylglyceryl phosphate synthase enzyme family identifies novel members and reveals mechanisms of substrate specificity and quaternary structure organization

    Mol. Microbiol.

    (2014)
  • K.L. Dunbar et al.

    Discovery of a new ATP-binding motif involved in peptidic azoline biosynthesis

    Nat. Chem. Biol.

    (2014)
  • M.A. Ortega et al.

    Structure and mechanism of the tRNA-dependent lantibiotic dehydratase NisB

    Nature

    (2015)
  • M.W. Vetting et al.

    Experimental strategies for functional annotation and metabolism discovery: targeted screening of solute binding proteins and unbiased panning of metabolomes

    Biochemistry

    (2015)
  • E. Akiva et al.

    The Structure–Function Linkage Database

    Nucleic Acids Res.

    (2014)
  • A. Mitchell et al.

    The InterPro protein families database: the classification resource after 15 years

    Nucleic Acids Res.

    (2015)
  • A.E. Barber et al.

    Pythoscape: a framework for generation of large protein similarity networks

    Bioinformatics

    (2012)
  • X. Zhang et al.

    A unique cis-3-hydroxy-L-proline dehydratase in the enolase superfamily

    J. Am. Chem. Soc.

    (2015)
  • D.M. Schmidt et al.

    Evolutionary potential of (beta/alpha)8-barrels: functional promiscuity produced by single substitutions in the enolase superfamily

    Biochemistry

    (2003)
  • W. Li et al.

    Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

    Bioinformatics

    (2006)
  • D.A. Benson et al.

    GenBank

    Nucleic Acids Res.

    (2015)
  • W.S. Yew et al.

    Utilization of L-ascorbate by Escherichia coli K-12: assignments of functions to products of the yjf-sga and yia-sgb operons

    J. Bacteriol.

    (2002)