STRUCTURAL GENOMICS RESOURCE: Help Page

More information about annotations
sgTarget calculates a wide range of properties for each putative target. An understanding of the properties listed below should aid you in filtering and prioritizing potential targets.

GC content
Codon Adaptation Index
Fold prediction
Function prediction
Prevalence
Instability Index
Half-life
Molecular weight
GRand AVerage hydropathY
Isoelectric point
Solubility
Transmembrane regions
Fibrous regions
Disordered regions

Apologies, this page is not yet complete.

GC content
Nucleotide base composition varies both between and within the genomes of different species. Variation in the GC content (the percentage of guanine and cytosine bases relative to the total number of bases) of different genomes was first identified through Erwin Chargaff's chemical treatment studies of DNA from a variety of organisms (known as Chargaff's GC rule) (Chargaff, 1951).

The expression of genes from organisms with a GC content that is very different from that of the chosen expression system will often result in translational stalling and low yields of recombinant protein (Baca and Hol, 2000). This is often due to the divergence in preferential synonymous codon usage between the heterologous gene and the expression system’s genome (Grantham et al., 1981). The GC content of a gene can thus prove useful in the selection of targets, in an ad hoc way, through comparison with the range of GC contents that can be expected to successfully express in a particular system.

The number of nucleotides and both the relative and absolute nucleotide base composition of a gene and/or its coding sequence can be calculated from the gene's primary structure. sgTarget calculates the GC content of every loaded gene's coding sequence, and indicates whether it falls within the GC content range of the most commonly used expression hosts: E. coli and S. cerevisiae. These ranges were established by determining the minimum and maximum GC content observed across all genes in each host's genome: 26.9 to 66.8% in E. coli and 25.9 to 59.1% in S. cerevisiae (see figure 1).
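The check described above can be sketched in a few lines of Python. This is a minimal illustration rather than sgTarget's own code: the function names are hypothetical, the host ranges are the ones quoted in the text, and ignoring ambiguous bases is an assumption.

```python
# Host GC ranges (percent) as quoted in the text above.
HOST_GC_RANGES = {
    "E. coli": (26.9, 66.8),
    "S. cerevisiae": (25.9, 59.1),
}


def gc_content(cds: str) -> float:
    """Return the GC content of a nucleotide sequence as a percentage."""
    # Assumption: ambiguous bases (e.g. N) are ignored rather than counted.
    counted = [base for base in cds.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G or T")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


def hosts_in_range(cds: str) -> dict:
    """Map each host to True if the sequence's GC content lies in its range."""
    gc = gc_content(cds)
    return {host: low <= gc <= high for host, (low, high) in HOST_GC_RANGES.items()}
```

For example, hosts_in_range("ATGCATGC") reports True for both hosts, because a GC content of 50% lies inside both ranges.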

Figure 1 Distribution of GC content for all (a) E. coli and (b) S. cerevisiae genes. The distribution of GC content for cytoplasmic ribosomal protein coding genes (56 and 137 genes respectively, as annotated in their GenBank genome sequence files) is superposed in grey. Maximum and minimum GC contents are shown as dashed labeled lines.

Codon Adaptation Index
Synonymous codons, those that encode the same amino acid, are not used randomly. The frequencies of synonymous codon usage have been found to be species-specific, and even taxon-specific. In E. coli, for example, the amino acid arginine is almost always encoded by the CGT codon (referred to as an optimal codon), whereas in the yeast S. cerevisiae, AGA is the optimal codon for arginine. Certain rare codons, which are found in very small proportions throughout specific genomes, can affect the translation rates of genes, particularly when they are present in clusters (Robinson et al., 1984).

The resource identifies and highlights the occurrence of rare codons for the most commonly used expression hosts, E. coli and S. cerevisiae, according to the classification of Sharp and Li (Sharp and Li, 1987).

Synonymous codon usage bias can be measured using the codon adaptation index (CAI) (Sharp and Li, 1987). This index quantifies the relative conformance of a particular gene to an organism's coding strategy, by comparing it to the optimal codon usage for that organism's genome (based on the expected and the observed frequencies of all codons). A gene can have CAI values ranging from 0 to 1, where 1 indicates that it is optimally adapted to the organism's codon preferences.
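As an illustration, the CAI of Sharp and Li is the geometric mean of the relative adaptiveness values (w) of the codons in a gene, where w is the usage of a codon relative to the most used synonymous codon in a reference set. The sketch below assumes the w values are already available; the function name and the toy weights are hypothetical and are not taken from CodonW or from any real genome.

```python
import math


def cai(cds: str, weights: dict) -> float:
    """Geometric mean of the relative adaptiveness (w) of each scored codon."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Codons without a weight (e.g. Met, Trp and stop codons) are skipped.
    logs = [math.log(weights[codon]) for codon in codons if codon in weights]
    if not logs:
        raise ValueError("no codon in the sequence has a weight")
    return math.exp(sum(logs) / len(logs))


# Toy weights for illustration only: invented values, not real codon usage.
TOY_WEIGHTS = {"CGT": 1.0, "AGA": 0.25}
```

With these toy weights, a gene made only of CGT codons scores 1, and mixing in AGA codons lowers the score.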

The resource calculates the CAI for every gene with respect to both E. coli and S. cerevisiae using CodonW (Peden). Sharp and colleagues defined CAI thresholds for expression in these two expression systems by contrasting the distribution of CAI values of the ribosomal protein coding genes to that of other genes (Sharp and Li, 1987). However, since these thresholds were defined using a limited dataset (165 E. coli genes and 160 S. cerevisiae genes), their procedure was applied to the now available whole genome sequences for these systems, thus refining the established CAI thresholds (figure 2 shows the distributions for each genome). The resource employs the refined CAI thresholds to classify genes as those likely to have high, low, or no expression in the two hosts (table 1).

Figure 2 The distribution of CAI values for all (a) E. coli and (b) S. cerevisiae genes. The distribution of CAI values for cytoplasmic ribosomal protein coding genes (56 and 137 genes respectively, as annotated in their GenBank genome sequence files) is superposed in grey. Maximum and minimum CAIs are shown as dashed labeled lines.

Table 1 Classes of expression-likelihood according to CAI.

Host             Expression class*   Minimum CAI
E. coli          Low                 0.084
E. coli          High                0.357
S. cerevisiae    Low                 0.041
S. cerevisiae    High                0.221

* These classes were extrapolated from the data shown in figure 2, as follows: the minimum CAI for the Low expression class is the minimum CAI observed for the genome; the minimum CAI for the High expression class is the minimum CAI observed for cytoplasmic ribosomal protein coding genes.
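The classification in table 1 can be sketched as a simple threshold lookup. The function name is hypothetical, and treating the minimum CAI values as inclusive lower bounds is an assumption.

```python
# Minimum CAI per expression class, as listed in table 1.
CAI_THRESHOLDS = {
    "E. coli": {"low": 0.084, "high": 0.357},
    "S. cerevisiae": {"low": 0.041, "high": 0.221},
}


def expression_class(cai_value: float, host: str) -> str:
    """Return 'high', 'low' or 'none' for a CAI value in the given host."""
    thresholds = CAI_THRESHOLDS[host]
    # Assumption: a CAI equal to a threshold belongs to the higher class.
    if cai_value >= thresholds["high"]:
        return "high"
    if cai_value >= thresholds["low"]:
        return "low"
    return "none"
```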

Fold Prediction
Protein fold prediction is a crucial step in the annotation of gene products for structural genomics (Brenner, 2000). It can help to identify protein sequence families lacking structural information, which are high priority targets in some structural genomics projects, and it helps to avoid duplication of effort by highlighting those protein sequences that can be comparatively modeled on a target structure.

sgTarget employs BLASTP, the protein search algorithm of the BLAST software package, to search the sequences of proteins in the PDB, and identify protein sequence similarities that are indicative of structural similarity.

A corrected normalized score (S'' = -ln(E) in nats or, equivalently, -log2(E) in bits, where E is the Expect value or E-value) was calibrated using the SCOP database (Andreeva et al., 2004) as a benchmark of structural similarity. A cutoff value for the corrected normalized score that is indicative of structural similarity was established by examining the ratio of false positives to true positives reported by BLASTP at varying scores, where:

True positives are those alignments reported by BLASTP with a score higher than the cutoff score, for domains which are related at least at the superfamily level; and
False positives are those alignments reported by BLASTP with a score higher than the cutoff score, for domains that are not related at the superfamily or family levels.
Figure 3 shows the two cutoff values that were established through this procedure: a conservative cutoff, which eliminates all false positives; and a natural cutoff, which is the highest score before the number of false positives rises abruptly for a small increase in the number of true positives. The cutoff values are summarized in table 2, which includes their conversion to the more standard E-values.

Figure 3 Calibration of the corrected normalized score, S. The number of protein pairs which are related at the family or superfamily levels of SCOP classification and whose alignment is above varying S cutoff values (true positives), as a function of the number of protein pairs which are not related at those SCOP levels but whose alignment is also above the S cutoff (false positives). The S cutoff values shown indicate the corrected normalized score at which the highest sensitivity can be achieved without the introduction of false positives (35.49 bits), the corrected normalized score at which an increase in sensitivity results in a sharp increase of false positives (12.18 bits), the corrected normalized score as defined by Salamov et al., 1999b (7.01 bits, originally derived as 18 nats by virtue of the different normalized score employed by the authors), and the corrected normalized score as defined by Yang and Honig, 2000b (18.75 bits or 13.00 nats), albeit for a different algorithm. The inset shows the ratio of false positives to true positives as a function of S for ratios smaller than 1%, highlighting the sequence similarity thresholds of 12.18, and 35.49 bits for structural significance.

Table 2 Sequence similarity thresholds for structural significance at the superfamily level

Threshold       S'' (bits)   E-value          Ratio of false positives (%)
Natural         12.18        2.15 x 10^-4     0.2
Conservative    35.49        2.07 x 10^-11    0.0
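The conversion between the corrected normalized score and the E-value in table 2 can be checked with a short sketch. The relation S'' = -ln(E) gives the score in nats; the values tabulated in bits are consistent with -log2(E). The function names are hypothetical.

```python
import math


def evalue_to_bits(evalue: float) -> float:
    """Corrected normalized score in bits for a given E-value."""
    return -math.log2(evalue)


def bits_to_evalue(score_bits: float) -> float:
    """E-value corresponding to a corrected normalized score in bits."""
    return 2.0 ** (-score_bits)


def nats_to_bits(score_nats: float) -> float:
    """Convert a score from nats (natural log units) to bits."""
    return score_nats / math.log(2)
```

For instance, evalue_to_bits(2.15e-4) is about 12.18, the natural cutoff in table 2.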

Function Prediction
Protein function can form the basis of rational selection and prioritization of targets for structure determination experiments. When the core strategy of a structural genomics project is protein function, the identification of proteins from specific families, those involved in particular metabolic pathways, or any proteins with unknown function is decisive. However, even if the project's strategy is not directly related to protein function, function annotations may still help guide the structure determination process by suggesting, for example, potential biochemical assays and ligands to be tested.

sgTarget employs the InterPro database and InterProScan suite of programs cross-referenced by the GO (Gene Ontology) database to enable the functional annotation of proteins (Mulder et al., 2003; Kanapin et al., 2002; GO Consortium, 2000). The system is suitable for automated function predictions since it integrates state-of-the-art annotation transfer software and databases (through InterProScan-InterPro), whilst circumventing problems associated with the definition of function (by conformance to the GO standard).

Prevalence
The prevalence of a protein and/or protein family can aid in the identification of targets for structure determination by highlighting, on the one hand, ORFan proteins, which may uncover novel folds and functions, and on the other hand, universally widespread proteins, which may be essential to life (Fischer, 1999).

sgTarget uses the BLAST sequence similarity search algorithm to identify, for every protein, its homologues in the NRDB (non-redundant database) of protein sequences. The NRDB was originally created to hold a non-redundant set of protein sequences but is presently formed by combining the protein databases PIR (Protein Information Resource) (Wu et al., 2003), SWISS-PROT (Boeckmann et al., 2003), TrEMBL (or Translated EMBL) (Boeckmann et al., 2003) and PDB, thus holding the majority of known full protein sequences. The Taxonomy database is employed to enable the taxonomical classification of the identified homologues, as it is cross-referenced by the NRDB protein files. Thus, sgTarget estimates the prevalence of a protein through the taxonomical distribution of the protein's homologues in as large a database as possible, but does not attempt to distinguish between true orthologues and paralogues. A stringent threshold is employed in the sequence similarity searches: only alignments longer than forty residues with a maximum Expect (E)-value of 10^-10 are considered true hits.
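The hit threshold and the taxonomical summary described above might be sketched as follows. The Hit record and the function names are hypothetical; only the two threshold values (more than forty aligned residues, E-value of at most 10^-10) come from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one BLAST hit against the NRDB."""
    subject_id: str
    taxon: str
    alignment_length: int
    evalue: float


def is_true_hit(hit: Hit, min_length: int = 40, max_evalue: float = 1e-10) -> bool:
    """Apply the stringent threshold: longer than 40 residues and E <= 1e-10."""
    return hit.alignment_length > min_length and hit.evalue <= max_evalue


def taxa_with_homologues(hits: list) -> set:
    """Taxa represented among the hits that pass the threshold."""
    return {hit.taxon for hit in hits if is_true_hit(hit)}
```

A protein whose filtered hits come from no other taxon would be a candidate ORFan, whereas hits spread across many taxa suggest a widespread family.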

Instability Index
The occurrence of certain dipeptides in protein sequences is significantly different for stable and unstable proteins (Guruprasad et al., 1990). Guruprasad and colleagues determined dipeptide instability weight values (DIWVs) for each of the 400 possible dipeptides, on the basis of their observed contribution to the instability of proteins (Guruprasad et al., 1990). Using these weights, an instability index was developed according to the following equation:

\[ II = \frac{10}{L} \sum_{i=1}^{L-1} DIWV(x_i y_{i+1}) \]

where DIWV is the dipeptide instability weight value, x_i y_{i+1} is a dipeptide along the protein sequence, L is the length of the sequence and 10 is a scaling factor (Guruprasad et al., 1990). Generally, stable proteins were found to have instability indices smaller than 40, whereas unstable proteins had instability indices larger than 40 (Guruprasad et al., 1990). This measure cannot take into account higher-order properties that also affect the stability of proteins (e.g. the degree of cross-linking), hence exceptions to this threshold are likely to occur.
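A minimal sketch of this calculation follows: the index is ten times the sum of the weights of all consecutive dipeptides, divided by the sequence length. The function names are hypothetical and the published table of 400 DIWVs is not reproduced here, so the weights are passed in as a dictionary; the toy weights below are invented for illustration only.

```python
def instability_index(seq: str, diwv: dict) -> float:
    """Instability index: (10 / L) * sum of DIWVs over consecutive dipeptides."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("at least two residues are required")
    total = sum(diwv[seq[i:i + 2]] for i in range(len(seq) - 1))
    return 10.0 * total / len(seq)


def is_predicted_stable(index: float, threshold: float = 40.0) -> bool:
    """Indices below 40 are generally associated with stable proteins."""
    return index < threshold


# Invented weights for illustration; real values are in Guruprasad et al., 1990.
TOY_DIWV = {"AC": 1.0, "CA": 2.0}
```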

Half-life
The N-degron is a degradation signal contained within a protein targeted by the proteolytic system (Varshavsky, 1991). It exists in both eukaryotes and prokaryotes, but the mechanism differs slightly between them. In eukaryotes two determinants are required: a destabilizing amino-terminal residue according to the N-end rule, and at least one internal lysine residue in spatial proximity to the amino-terminus (Bachmair et al., 1986; Gonda et al., 1989). In prokaryotes one determinant is sufficient, namely a destabilizing amino-terminal residue according to the N-end rule (Tobias et al., 1991).

The N-end rule states that the in vivo half-life of a protein is a function of the nature of its amino-terminal residue (Bachmair et al., 1986). It has a hierarchical structure, where certain residues are destabilizing per se (primary destabilizing residues), others are destabilizing through their ability to be conjugated to primary destabilizing residues (secondary destabilizing residues), whereas others are destabilizing through their ability to be converted via selective deamidation into secondary destabilizing residues (tertiary destabilizing residues) (Gonda et al., 1989).

The estimated in vivo half-lives for proteins depending on their amino-terminus residue are tabulated below (table 4):

Table 4 The estimated in vivo half-lives as a function of the protein's amino-terminal residue.^*

Amino acid   Half-life in bacteria   Amino acid   Half-life in bacteria
A            10 h                    L            2 min
R            2 min                   K            2 min
N            10 h                    M            10 h
D            10 h                    F            2 min
C            10 h                    P            -
Q            10 h                    S            10 h
E            10 h                    T            10 h
G            10 h                    W            2 min
H            10 h                    Y            2 min
I            10 h                    V            10 h

* Data compiled from Tobias et al., 1991.
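Table 4 amounts to a lookup on the first residue of the sequence, which can be sketched as below. The function name is hypothetical; the values are those tabulated above, and proline returns None because the table gives no value for it.

```python
from typing import Optional

# Amino-terminal residues with a 2 min half-life in bacteria (table 4).
SHORT_LIVED = set("RLKFWY")
# Residues listed in table 4 with a 10 h half-life.
LONG_LIVED = set("ANDCQEGHIMSTV")


def bacterial_half_life(seq: str) -> Optional[str]:
    """Estimated in vivo half-life in bacteria from the amino-terminal residue."""
    if not seq:
        raise ValueError("empty sequence")
    first = seq[0].upper()
    if first in SHORT_LIVED:
        return "2 min"
    if first in LONG_LIVED:
        return "10 h"
    return None  # proline (no value in table 4) or a non-standard residue
```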

Molecular weight
The molecular weight of a protein can be calculated using the equation:

\[ Mw = \sum_{i=1}^{N} Mw_i \]

where i is the residue position, N is the total number of residues in the protein and Mw_i is the molecular weight of the amino acid residue at position i (notwithstanding natural isotopic abundance). The molecular weight of proteins that have been labelled with commonly used isotopes or alternative amino acids (e.g., 15N, 13C or selenomethionine) can be calculated in the same manner by adjusting the relevant amino acid molecular weights.
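The summation can be sketched as follows. The masses are average residue masses (amino acid minus water) rounded to two decimals, so one water molecule is added for the free termini; that convention, the function name and the selenomethionine adjustment from approximate atomic weights are assumptions of this sketch, not statements about sgTarget.

```python
# Average residue masses in Da (amino acid minus one water), rounded.
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.11, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.015


def molecular_weight(seq: str, masses: dict = RESIDUE_MASS, add_water: bool = True) -> float:
    """Sum the residue masses, optionally adding one water for the termini."""
    total = sum(masses[aa] for aa in seq.upper())
    return total + (WATER if add_water else 0.0)


# Labelled variant: replace Met by selenomethionine (Se replaces S).
# Approximate atomic weights: Se 78.97, S 32.06.
SEMET_MASSES = dict(RESIDUE_MASS, M=RESIDUE_MASS["M"] + 78.97 - 32.06)
```

Calling molecular_weight(seq, masses=SEMET_MASSES) then gives the weight of the selenomethionine-labelled protein in the same manner.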

GRand AVerage hydropathY
A GRAVY (Grand Average of Hydropathy) score can be calculated as the sum of the hydropathy values for all the amino acids in a protein sequence divided by the number of residues in the sequence:

\[ GRAVY = \frac{1}{N} \sum_{i=1}^{N} h_i \]

where h_i is the hydropathy value of the residue at position i and N is the total number of residues in the protein (Kyte and Doolittle, 1982). In essence, the GRAVY score summarizes the overall hydropathy of the protein: positive values indicate a predominantly hydrophobic sequence and negative values a predominantly hydrophilic one. Although the GRAVY score takes no account of positional or interaction effects between adjacent residues, it still provides some indication of the physical state of the protein (Kyte and Doolittle, 1982).
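A minimal sketch of the GRAVY calculation, using the hydropathy scale of Kyte and Doolittle (1982), is given below. The function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq: str) -> float:
    """Sum of residue hydropathy values divided by the number of residues."""
    if not seq:
        raise ValueError("empty sequence")
    return sum(HYDROPATHY[aa] for aa in seq.upper()) / len(seq)
```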

Isoelectric point
The isoelectric point (pI) of a protein molecule is the pH at which the protein carries no net electric charge. It is often the point of lowest solubility for the protein, probably because intermolecular repulsion is weakest at this point, so that molecules tend to form aggregates. Recently, a significant relationship was also established between the theoretical pI of a protein and the difference between its pI and the pH reported for successfully crystallized proteins (Kantardjieff and Rupp, 2004).

The isoelectric point of a protein can be estimated from its net charge at a given pH, which is obtained by adding the number of positively charged residues (i.e., protonated lysine, arginine and histidine), minus the number of negatively charged residues (deprotonated tyrosine, cysteine, glutamate and aspartate), plus the number of protonated amino-termini, minus the number of deprotonated carboxyl-termini. The pI is the pH at which this net charge is zero. This calculation does not take into account any ionization perturbations incurred through electrostatic interactions, which can be very significant.
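One way to sketch this estimate is to compute the net charge from the fraction of each ionizable group that is charged at a given pH, and then search for the pH at which the net charge is zero. The pKa values below are one illustrative set among several in common use; the page does not state which values sgTarget uses, and the function names are hypothetical.

```python
# Illustrative side-chain and terminal pKa values (an assumption of this sketch).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_NTERM = 8.6
PKA_CTERM = 3.6


def net_charge(seq: str, ph: float) -> float:
    """Net charge at a given pH from the charged fraction of each group."""
    seq = seq.upper()
    positive = [PKA_POSITIVE[aa] for aa in seq if aa in PKA_POSITIVE] + [PKA_NTERM]
    negative = [PKA_NEGATIVE[aa] for aa in seq if aa in PKA_NEGATIVE] + [PKA_CTERM]
    charge = sum(1.0 / (1.0 + 10.0 ** (ph - pka)) for pka in positive)
    charge -= sum(1.0 / (1.0 + 10.0 ** (pka - ph)) for pka in negative)
    return charge


def isoelectric_point(seq: str, tolerance: float = 1e-4) -> float:
    """Bisect on pH; the net charge decreases monotonically as pH rises."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if net_charge(seq, middle) > 0:
            low = middle
        else:
            high = middle
    return (low + high) / 2.0
```

Bisection is sufficient here because every term of the net charge decreases as the pH increases, so there is a single zero crossing.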

Solubility
In 1991, Wilkinson and Harrison established a statistical model for predicting the likelihood of a protein's solubility in Escherichia coli (in vivo) based on its amino acid composition (Wilkinson and Harrison, 1991). Since then, the model has been revised to include the results of studies on the factors influencing protein folding and inclusion body formation (these bodies result from the precipitation and aggregation of insoluble proteins originating from folding intermediates) (Chrunyk et al., 1993; Hockney, 1994; Kane and Hartley, 1991). The revised Wilkinson-Harrison statistical solubility model (Davis et al., 1999) depends on two parameters only: the fraction of residues with a high index for forming turns (according to the Chou and Fasman index; Chou and Fasman, 1978) and the approximate average charge of the protein in vivo (as an estimate of the protein's net charge) (Davis et al., 1999). A canonical variable (CV) is calculated using these two parameters as follows:

\[ CV = 15.43 \, \frac{N_{Asn} + N_{Gly} + N_{Pro} + N_{Ser}}{N} - 29.56 \, \left| \frac{(N_{Arg} + N_{Lys}) - (N_{Asp} + N_{Glu})}{N} - 0.03 \right| \]

where N is the number of residues, N_{Res} is the number of residues of type Res, (N_{Asn} + N_{Gly} + N_{Pro} + N_{Ser})/N is the turn-forming residue fraction, |((N_{Arg} + N_{Lys}) - (N_{Asp} + N_{Glu}))/N - 0.03| is the approximate average charge in vivo, and the two coefficients (15.43 and -29.56) indicate each parameter's relative weight. According to the statistical method used to derive the model, a discriminating value (CV') was also calculated (CV' = 1.71). When CV - CV' is negative the protein is predicted to be soluble, whereas when CV - CV' is positive the protein is predicted to be insoluble. The probability of the protein being soluble or insoluble (depending on the sign of CV - CV') can be estimated using the following equation:

This statistical model has been shown to be useful in the selection of proteins with high solubility (Davis et al., 1999; Harrison, 2000).
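The canonical variable and the resulting prediction can be sketched as below. The function names are hypothetical. The sketch assumes that the charge term carries a negative coefficient (-29.56) and that 0.03 is subtracted inside the absolute value, as in the published form of the revised model; the probability estimate is not reproduced.

```python
CV_PRIME = 1.71  # discriminating value quoted in the text


def canonical_variable(seq: str) -> float:
    """CV from the turn-forming residue fraction and approximate average charge."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    turn_fraction = sum(seq.count(aa) for aa in "NGPS") / n
    net = (seq.count("R") + seq.count("K")) - (seq.count("D") + seq.count("E"))
    charge_term = abs(net / n - 0.03)
    # Assumption: the second coefficient is negative (see the lead-in).
    return 15.43 * turn_fraction - 29.56 * charge_term


def is_predicted_soluble(seq: str) -> bool:
    """Soluble when CV - CV' is negative, insoluble when it is positive."""
    return canonical_variable(seq) - CV_PRIME < 0
```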

Transmembrane regions
Even though membrane proteins can be purified (using standard protocols, whilst maintaining the protein in a suitable detergent solution), they are often too low in abundance and too unstable in detergent solutions to allow structure determination studies. Since these proteins and/or domains are not amenable to high-throughput structural studies, it is pertinent for the resource to attempt to identify them.

A recent evaluation of software for predicting membrane-spanning regions revealed TMHMM to have the best overall performance (Möller et al., 2001). TMHMM is a method for membrane protein topology prediction based on a hidden Markov model (HMM) (Krogh et al., 2001; Sonnhammer et al., 1998). The HMM combines the constraints discussed above with the positive-inside rule (which states that positively charged residues are mainly found in the cytoplasmic loops; von Heijne, 1986) and organization restraints (i.e., cytoplasmic and non-cytoplasmic loops have to alternate; Krogh et al., 2001), in order to determine both the localization and the topology of transmembrane proteins. TMHMM can correctly identify 80-85% of all membrane-spanning regions (in a data set containing proteins that were not used for training the algorithm) with a false positive rate of 8.6% (Möller et al., 2001), although its topology predictions are not as accurate (correctly predicted for 63% of the proteins). The algorithm can discriminate between soluble and transmembrane proteins with 99% specificity and sensitivity in the absence of signal peptides (Krogh et al., 2001).

sgTarget employs TMHMM to predict the localization of transmembrane segments, thus distinguishing between soluble and membrane proteins.

Fibrous regions
Fibrous proteins often perform structural roles, achieved through the repeated use of secondary structure elements. These repeats form extended filamentous structures, which give the molecules their required mechanical properties. Among such conformations are the long regions of antiparallel β-sheet found in silk fibroin, the α-helical coiled-coils present in keratin and the triple helix encountered in collagen. These structures are also encoded by repetitive sequence elements, which makes it possible to identify certain fibrous proteins and domains by examining the protein's amino acid sequence.

sgTarget employs the programme ncoils (Russell and Lupas, 1999) to predict coiled-coil regions in protein sequences. This software is a component of InterProScan.

Disordered regions
Intrinsically disordered domains can cause a multitude of adverse effects in structural determination studies. Although some of these segments become ordered upon interaction with binding partners to perform specific functions (Wright, 1999), their structural characterization would be difficult even when prior knowledge of the required cofactors was available.

sgTarget employs the charge-hydrophobicity phase-space boundary of Uversky et al. (Uversky et al., 2000), complemented by the putative lower bound complexity threshold of Romero and colleagues (Romero et al., 2001), to predict regions of intrinsic disorder:

First, the low-complexity detection software SEG (Wootton, 1993) is employed to detect any subsequences of at least 45 residues with a complexity value lower than 2.90 (the lower bound complexity threshold of Romero and colleagues). Such regions are annotated as probable non-globular protein stretches.

Then, for the remaining subsequences, the mean hydrophobicity (the sum of the normalized hydrophobicities from Kyte and Doolittle, 1982, divided by the number of residues) and the mean net charge at pH 7.0 are calculated, and used in the following equation to predict whether the subsequence is likely to be intrinsically disordered:

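\[ \langle R \rangle = 2.785 \langle H \rangle - 1.151 \]

where \langle H \rangle is the mean hydrophobicity and \langle R \rangle is the mean net charge (Uversky, 2002; Uversky et al., 2000). Subsequences whose mean net charge lies above this boundary for their mean hydrophobicity are predicted to be intrinsically disordered.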
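The second step can be sketched as follows, using the boundary published by Uversky and colleagues, in which a subsequence is predicted to be disordered when its mean hydrophobicity is lower than (mean net charge + 1.151) / 2.785. The function names are hypothetical, and several details are assumptions of this sketch: hydropathy values are rescaled linearly to the range 0 to 1, the mean net charge is taken as an absolute value, and histidine is treated as neutral at pH 7.0.

```python
# Kyte-Doolittle hydropathy values, rescaled below to the range 0..1.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(seq: str) -> float:
    """Mean of the hydropathy values normalized as (h + 4.5) / 9."""
    seq = seq.upper()
    return sum((HYDROPATHY[aa] + 4.5) / 9.0 for aa in seq) / len(seq)


def mean_net_charge(seq: str) -> float:
    """Absolute mean net charge at pH 7.0 (Lys, Arg positive; Asp, Glu negative)."""
    seq = seq.upper()
    net = seq.count("K") + seq.count("R") - seq.count("D") - seq.count("E")
    return abs(net) / len(seq)


def is_predicted_disordered(seq: str) -> bool:
    """True when the subsequence falls on the disordered side of the boundary."""
    boundary = (mean_net_charge(seq) + 1.151) / 2.785
    return mean_hydrophobicity(seq) < boundary
```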
