Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics
Review
Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST): A web tool for generating protein sequence similarity networks
Section snippets
Introduction: the functional assignment challenge
The identities and functions of the complete set of proteins encoded by a genome should allow a comprehensive understanding of the physiology of the organism. However, a conservative estimate is that only ~50% of the proteins discovered in genome projects have reliable functional annotations in the sequence databases; the remainder have unknown, uncertain, or incorrect functional annotations (and the identities of these are unknown!) [1], [2]. Genome projects should provide information of
Sequence similarity networks (SSNs)
Dendrograms and trees (Fig. 2A) are the most common tools for surveying sequence–function space in enzyme families. However, their construction and interpretation are computationally intensive and require an accurate sequence alignment that is difficult to achieve on a large scale. Babbitt and coworkers described the use of protein sequence similarity networks (SSNs) as an “easy to compute” alternative method for assessing sequence relationships within enzyme families [12]. In an SSN (Fig. 2B),
The value of SSNs and how to use them
An SSN is a visual aid that allows a user to segregate a functionally diverse superfamily (different substrate specificities and/or reaction mechanisms) into putative isofunctional groups [12]. At small alignment scores (low sequence identity), most of the nodes in the SSN for a homologous family will be connected to one another by edges resulting in a single large cluster (“hairball”, Fig. 4A). As the alignment score used to draw edges is increased (the sequence identity is increased), edges
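The effect of raising the alignment score threshold can be illustrated with a minimal sketch. It assumes that the pairwise alignment scores are already available as a list of `(node, node, score)` tuples; the function name `clusters_at_threshold`, the node labels, and the scores are all hypothetical and are not part of EFI-EST. The sketch keeps only edges at or above the threshold and reports the resulting connected clusters.

```python
from collections import defaultdict


def clusters_at_threshold(nodes, scored_edges, min_score):
    """Group nodes into clusters using only edges with score >= min_score.

    Uses a simple union-find structure; this is an illustration of the idea,
    not the procedure used by EFI-EST or Cytoscape.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_edges:
        if score >= min_score:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Hypothetical toy family of five sequences with made-up alignment scores.
nodes = ["A", "B", "C", "D", "E"]
scored_edges = [
    ("A", "B", 80), ("B", "C", 75), ("C", "D", 20),
    ("D", "E", 60), ("A", "E", 10),
]

low = clusters_at_threshold(nodes, scored_edges, 10)   # permissive threshold
high = clusters_at_threshold(nodes, scored_edges, 50)  # stringent threshold
print(low)
print(high)
```

With the permissive threshold all five toy sequences fall into one cluster; with the stringent threshold the weak edges are dropped and the network separates into two clusters, which is the behavior used to look for putative isofunctional groups.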
Representative node networks
The number of nodes in an SSN (N) is the number of sequences in the family; the number of edges connecting the nodes varies with the alignment score and is at most N × (N − 1) / 2. The memory available to Cytoscape 3.2, which is used to visualize SSNs, limits the number of edges that can be displayed: with 4 GB RAM, an SSN with up to ~500,000 edges can be opened and manipulated; with 64 GB RAM, an SSN with up to ~5,000,000 edges can be opened and manipulated.
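To get a feel for these limits, the following sketch computes the upper bound N × (N − 1) / 2 and inverts it to find the largest family that would still fit under a given edge limit if every pair of sequences were connected (the worst case, as in a "hairball" at a low alignment score). The edge limits are the approximate figures quoted above; the function names are hypothetical, and real networks at useful alignment scores will usually have far fewer edges than this bound.

```python
import math

# Approximate edge limits quoted in the text, keyed by RAM in GB.
EDGE_LIMITS = {4: 500_000, 64: 5_000_000}


def max_edges(n):
    """Upper bound on the number of edges in an SSN with n nodes."""
    return n * (n - 1) // 2


def largest_fully_connected(edge_limit):
    """Largest n such that n * (n - 1) / 2 <= edge_limit.

    Solves the quadratic n^2 - n - 2 * edge_limit <= 0 with integer
    arithmetic to avoid floating-point rounding.
    """
    return (1 + math.isqrt(1 + 8 * edge_limit)) // 2


for ram_gb, limit in EDGE_LIMITS.items():
    n = largest_fully_connected(limit)
    print(f"{ram_gb} GB RAM: worst-case family size ~{n} "
          f"({max_edges(n)} edges <= {limit})")
```

Under this worst-case assumption, a fully connected network reaches the 4 GB limit at about 1,000 sequences and the 64 GB limit at about 3,160 sequences, which suggests why networks for large families need fewer nodes or more stringent alignment scores to be displayed.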
The edge/node ratio in an SSN is determined by the degree of
Sequences and node attributes used by EFI-EST
EFI-EST uses sequences from the UniProt database and their associated descriptions (node attributes; vide infra) from the UniProtKB (http://www.uniprot.org/uniprot/). The EFI's choice of UniProt, instead of GenBank (http://www.ncbi.nlm.nih.gov/genbank) [23], recognizes the ability of any member of the community to update or correct an annotation in UniProtKB based on experimental evidence; in contrast, GenBank is an archive—the annotations can be changed only by the depositor of the sequence.
Protein families/domains described by Pfam and InterPro
EFI-EST provides the user with two options for generating SSNs:
1) Option A to explore local sequence–function space defined by a user-specified sequence, often resulting in a small fraction of the membership of a Pfam and/or InterPro entry. The user either has the sequence or can find it in UniProt (or GenBank).
2) Option B to generate the SSN for any Pfam or InterPro entry (or combination of Pfam and/or InterPro entries that populate a homologous protein family). The InterPro entry identifiers
Generating SSNs with EFI-EST
The following subsections describe the steps involved in generating the SSN; Section 8 illustrates the use of these steps.
Example: generation, visualization, and analysis of the SSN for the OMP decarboxylase superfamily (Pfam Entry PF00215)
For the remainder of this review, the OMP decarboxylase superfamily (Pfam entry PF00215) is used to illustrate the use of both Options A and B. The OMP decarboxylase (OMPDC) superfamily is functionally diverse with three characterized reactions (Fig. 13): 1) OMPDC in pyrimidine nucleotide biosynthesis; 2) 3-keto-L-gulonate 6-monophosphate decarboxylase (KGPDC) in L-ascorbate catabolism [24]; and 3) D-arabino-hex-3-ulose synthase (HUMPS) in two metabolic contexts, detoxification of
Summary
Assignment of in vitro enzymatic activities and in vivo metabolic (physiological) functions to uncharacterized enzymes discovered in genome projects is a major challenge confronting many segments of the biological community. The identification of isofunctional clusters is the first step in exploring sequence–function space in enzyme families and devising strategies to determine the functions in unexplored space.
This review describes the use of the EFI-EST web tool to facilitate analysis of
Transparency document
Acknowledgements
This work was supported by NIH grant U54GM093342. The authors acknowledge Gabriel Horton (UIUC) for web design and thank the HPCBio group (UIUC) and Drs. Suwen Zhao (UCSF), Matthew P. Jacobson (UCSF), Michael Carter (UIUC), and Brian San Francisco (UIUC) for their helpful discussions.
References (52)
- et al. How well is enzyme function conserved as a function of pairwise sequence identity? J. Mol. Biol. (2003)
- et al. Inference of functional properties from large-scale analysis of enzyme superfamilies. J. Biol. Chem. (2012)
- et al. Divergent evolution in enolase superfamily: strategies for assigning functions. J. Biol. Chem. (2012)
- et al. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput. Biol. (2009)
- et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. (2014)
- UniProt: a hub for protein information. Nucleic Acids Res. (2015)
- et al. Prediction and characterization of enzymatic activities guided by sequence similarity and genome neighborhood networks. Elife (2014)
- et al. Predicting substrates by docking high-energy intermediates to enzyme structures. J. Am. Chem. Soc. (2006)
- et al. Prediction and assignment of function for a divergent N-succinyl amino acid racemase. Nat. Chem. Biol. (2007)
- et al. Studying enzyme-substrate specificity in silico: a case study of the Escherichia coli glycolysis pathway. Biochemistry (2010)