Chapter 2
Protein crystallographic methods

2.1  Introduction to protein crystallography

A crystal is a solid with regular three-dimensional ordering. The regular repeat of the unique unit (unit cell) in a crystal can be represented by a regular repeat of points, the lattice, with the same geometry as the original unit cells. The symmetry of the lattice should reflect the symmetry of the crystal. This means that the symmetry most representative of the crystal does not always give rise to a primitive cell. A primitive unit cell can be described by a single representative point, commonly chosen to be the origin (0,0,0), from which the crystal can be built up by generating points with unit translations along the unit cell edges. Non-primitive lattices would have additional representative points translated 1/2 along appropriate cell edges.

The lattice, or lattice function l(r), can be seen as a three-dimensional array of d-functions corresponding to the lattice points. The unit cell function u(r) describes the detailed information of the structure of the molecule(s) in the unit cell. The crystal structure function c(u) is the convolution of the lattice function with the unit cell function [Sherwood, 1976]:

c(u) = l(r) Ä u(r) (1)

The unit cell contains one or more molecules built up of atoms. In a classical description, the charged particles in the atoms, namely the protons and electrons, interact with electromagnetic radiation like X-rays, producing the diffraction pattern observed when a crystal is put in an X-ray beam. The particles start to oscillate, becoming a source of radiation themselves, thus scattering the incoming wave. Because the intensity of scattered radiation is inversely proportional to the square of the mass of the scattering particle, electrons scatter X-rays around one million times more effectively than protons. It is therefore the electron cloud of atoms that is responsible for the abovementioned diffraction pattern. The scattering from the electron cloud is proportional to that of a free electron, assuming the electrons oscillate in phase. In other words, the scattering of a volume element dr at position r is proportional to the electron density r(r) at r. The use of the term electron "cloud" implies a finite size to the space the electrons around an atom may occupy. The approximation of an atom by a point-atom, useful though it may be in other cases, would in an X-ray experiment for a single atom give rise to equal scattering in all directions due to lack of interference between wavelets scattered by different volume elements dr. However, the real dimensions of the electron cloud are comparable to the wavelength of the X-rays used, resulting in interference which causes scattering to fall off with increasing scattering angle q. It can be shown that thermal motion, influencing the apparent size of the atom, causes the scattering to fall off more rapidly with q. Atoms can be assumed to have spherical symmetry, so the scattering from an individual atom is a function which depends merely on q. This function is known as the 'atomic form factor' or 'atomic scattering factor' and has been tabulated for all chemical elements [International Tables Vol. C, 1992].

The diffraction pattern resulting from the scattering of X-rays by a distribution of electrons characterised by an electron density function r(r) is the Fourier transform F(Dk):

F(Dk ) = ó
õ


all r 
fe r(r) eiDk·r dr
(2)

where fe is the electronic scattering factor and Dk is the scattering vector, which can be expressed as Dk = kd-ki. Here ki is the wave vector of the incident beam and kd the wave vector of the diffracted beam.

If the electron density r(r) is expressed as a function of fractional coordinates x,y,z and only a unit cell is considered, for which fe can be seen as a constant that can be taken up in F, one gets

F(Dk ) = V ó
õ
x = 1

x = 0 
ó
õ
y = 1

y = 0 
ó
õ
z = 1

z = 0 
r(x,y,z) eiD k·(xa+yb+zc) dxdydz
(3)

In order to obtain a complete crystal, the unit cell is convoluted with the crystal lattice. The amplitude of the unit cell diffraction pattern, as described by equation 3, envelops the diffraction pattern of the crystal lattice, which is a series of sharp peaks forming the reciprocal lattice. The effect of this envelopment is the sampling of the unit cell diffraction function at points where the crystal lattice gives rise to a diffraction maximum. The diffraction pattern intensities, representing the reciprocal lattice weighted by the unit cell contents, can thus be described by means of delta functions [Sherwood, 1976]:

|F(Dk )|2 = [ ¥
å
h = -¥
d(Dk · a - 2hp) ]2 [ ¥
å
k = -¥
d(Dk · b - 2kp) ]2 [ ¥
å
l = -¥
d(Dk · c - 2lp) ]2
(4)

where a, b and c are unit cell edge vectors, and h, k and l are integers. This function can only be non-zero at points where Dk·a = 2hp, Dk·b = 2kp and Dk·c = 2lp, known as the Laue equations. In other words, Dk is restricted to those values which correspond to the crystal lattice diffraction maxima, so Dk is proportional to the reciprocal lattice vector Ghkl as follows:

Dk = 2pGhkl (5)

Dk forms an equilateral triangle with the abovementioned vectors ki and kd (see equation 2), where the angle between the incident and diffracted beam vectors is called the 'scattering angle' 2q.

|Dk| = 2ksinq (6)

where k is the magnitude of both the incident and the diffracted beam vectors; k is also equal to 2p/l because of wave theory. The magnitude of the reciprocal lattice vector Ghkl in equation 5 is inversely proportional to the spacing dhkl between reciprocal lattice planes (hkl). Thus, following from equation 6, 1/dhkl = 2/lsinq, so

l = 2dhklsinq (7)

equation 7 is known as Bragg's Law. The angle q is called the 'Bragg angle', and is equal to half of the scattering angle. The wave vector triangle described above can be used in a geometrical interpretation of Bragg's Law: the Ewald sphere (see figure 2.1). The radius of the sphere is taken as k/2p = 1/l.

The Ewald sphere construction
Figure 2.1: The Ewald sphere construction.
I = point where incident beam crosses the Ewald sphere; C = position of the crystal; O = exit point of direct beam, commonly chosen as the origin of reciprocal space; P = exit point of diffracted beam; ki = incident wave vector; kd = diffracted wave vector; Dk = reciprocal lattice vector; q = Bragg angle; 2q = scattering angle.

From this construction it can be seen that, when the crystal rotates around C and its reciprocal lattice rotates around O, diffraction maxima only occur when wave vector Dk is an exact chord of the Ewald sphere, i.e. when a reciprocal lattice point passes through the Ewald sphere. This chord is the reciprocal lattice vector Ghkl.

For Ghkl = ha*+kb*+lc* the expression Dk·r in equation 2 becomes 2p(hx+ky+lz). Equation 3 may now be rewritten as

Fhkl = V ó
õ
1

0 
ó
õ
1

0 
ó
õ
1

0 
r(x,y,z) e2pi(hx+ky+lz) dxdydz
(8)

which is known as the 'structure factor' of the diffraction maximum at reciprocal lattice point hkl. Fhkl is a complex quantity and can be expressed as Fhkl=|Fhkl| eiahkl. The experimentally observable intensity is Ihkl=|Fhkl|2.

One of the more interesting properties of a diffraction experiment is that the diffraction pattern is the Fourier transform of the electron density function, which means that the electron density function r(x,y,z) can be expressed as

r(x,y,z) = 1
V
ó
õ


all Dk 
Fhkl e-2pi(hx+ky+lz)d(D k )
(9)

Because the diffraction pattern is not a continuous function of Dk but a discrete set of values represented by Fhkl for each maximum, this expression reduces to

r(x,y,z) = 1
V

å
h 

å
k 

å
l 
Fhkl e-2pi(hx+ky+lz)
(10)

Assuming all Fhkl can be derived from the diffraction pattern, the electron density can be calculated at every point x,y,z by adding up the contributions from all the reflections hkl. The total density function r(x,y,z) can be computed from the local densities at every point x,y,z. The computation of the electron density from structure factors is called 'Fourier synthesis'.

Unfortunately, experimental data does not provide amplitudes Fhkl, but intensities Ihkl, where Ihkl=|Fhkl|2. For a Fourier synthesis, the complete information of Fhkl, i.e. both |Fhkl| and the phase ahkl, are needed. Thus part of the information is lost, giving rise to what is called the Phase Problem.

2.2  Crystallisation

The crystallisation of proteins and protein-like substances has been a science for over a century. Van Deen (1864) believed that all naturally occurring organic substances can be crystallised when manipulated effectively. Schimper (1881) reported on the first active crystallisation attempt by Maschke in 1859, who evaporated the solution from a preparation of brazil nuts in order to obtain crystals.

Insulin exists in crystalline form in storage granules in the pancreas. Crystallisation of insulin was pioneered by Abel in 1925, after which Scott established the need for zinc for successful reproducible crystallisation. The first insulin crystals used in X-ray work were grown according to his methods [Crowfoot, 1935]. Over the decades, the protocols for insulin crystallisation were standardised and extended to produce different crystal forms. Schlichtkrull (1958) had a major role in the standardisation of the crystallisation with zinc and other metals, ultimately leading to the successful crystallisation of suitable heavy atom derivatives for the solution of the structure of native 2Zn insulin in 1969 [Adams et al.]. It became clear that a minimum amount of 2 Zn atoms per rhombohedral cell (trigonal setting) was needed, or 4 Zn atoms per cell in crystallisations with ³ 6% halide. There is also a maximum amount of 6 Zn atoms per cell, beyond which no crystals will form or crystallisation is very slow [Schlichtkrull, 1958 and Brange, 1987].

Crystallisation of insulin
Figure 2.2: The crystallisation behaviour of insulin. Reproduced from Brange (1987)

Brange (1987) shows a very elegant graph of the crystallisation conditions for insulin in its various crystal forms (see figure 2.2). In general, crystallisations of modified insulins are carried out with the purpose of the modification in mind. Mutants which form supposedly more stable hexamers than native insulin, are crystallised following the 2Zn or 4Zn protocols described in table 3.2. For even more stabilisation of the hexamer, phenol is added in some cases, according to the monoclinic insulin crystallisation protocol. In order to study metal binding, metal salts are added where appropriate, resulting in either aggregation to hexamers (for which metal ions are necessary in general) or dimers (which can be formed without metal ions, or may still be formed in their presence when the metal binding site is removed). Organic solvents like ethanol and acetone are used in insulin experiments for various reasons. Insulin is very soluble in acid solutions containing ethanol, whereas most proteins are not. This property is still utilised in the extraction of insulin from pancreases. Furthermore, the addition of many organic solvents miscible with water (such as ethanol and acetic acid) counteracts the association of insulin molecules due to non-covalent interactions [Brange, 1987]. Organic solvents can also be used as precipitating agents in crystallisations because of their effect in reducing the dielectric constant of the crystallisation mixture [Xiao, 1990]. Some organic solvents (e.g. 2-methyl-2,4-pentanediol, MPD) and organic polymers (e.g. lower molecular weight PEG) present in crystallisations as precipitating agents are directly suitable as cryoprotectants.

2.3  Data collection and processing

Equation 8 can be rewritten in a form more suitable for the calculation of structure factors from the contents of the unit cell, and for the discussion of the symmetry of a diffraction pattern [Sherwood, 1976]. Introducing the atomic electron density function rj (r-rj ) for every atom j, based on a vector triangle r = rj+Rj defining the position of an electron with respect to its atomic nucleus (Rj ) and the position of the atom in the unit cell (rj ), equation 8 becomes

Fhkl = ó
õ


unit cell 

å
j 
rj (Rj ) e2piGhkl·rj e2piGhkl·Rj dRj
(11)

assuming the positions of the atomic nuclei are constant so that dRj is equal to dxdydz. Reversing the integral and the sum, factoring the constant e2piGhkl·rj out of the integral and changing the limits of the integral from unit cell to atom (re-building the integral with the summation over all j atoms), Fhkl becomes

Fhkl =
å
j 
e2piGhkl·rj ó
õ


atom 
rj (Rj ) e2piGhkl·Rj dRj
(12)

The aforementioned atomic scattering factor can now be reintroduced in mathematical form:

fj = ó
õ


atom 
rj (Rj ) e2piGhkl·Rj dRj
(13)

so that

Fhkl =
å
j 
fj e2pi(hxj+kyj+lzj)
(14)

with hxj+kyj+lzj = Ghkl·rj.

This expression is central to the determination of the crystal symmetry and the symmetry of a diffraction pattern. An interesting property of equation 14 is that Fhkl and F¯h¯k¯l only differ in phase, in such a way that F¯h¯k¯l is the complex conjugate of Fhkl. Their magnitudes are equal:

|Fhkl| = |F¯h¯k¯l|
(15)

which is known as Friedel's Law. Thus the observed intensities of reflections hkl and ¯h¯k¯l are equal. This means that a weighted reciprocal lattice will always be centrosymmetrical in nature. Equation 14 can also be used in the determination of the presence of translational symmetry in the crystal. A simple example is reproduced from Sherwood (1976). In the case of a body-centred cubic lattice for metallic caesium with only two independent sites, (0,0,0) and (1/2,1/2,1/2), the structure factor only has two terms:

Fhkl = fj e2pi(h0+k0+l0)+ fj e2pi(h1/2+k1/2+l1/2)
(16)

Since the atomic contents of the two sites are equivalent, fj is equal in both cases, so

Fhkl = fj [1+eip(h+k+l)]
(17)

Because h,k and l are integral numbers, Fhkl can only be 0, if h+k+l is odd, or a finite value 2fj, if h+k+l is even. Half of the diffraction maxima are missing, the so-called systematic absences. It can be shown that all translational symmetry gives rise to specific systematic absences. The systematic absences corresponding to lattice type unambiguously determine, given the crystal system, what is called the Bravais lattice, of which there are 14.

The reciprocal lattice, and thus a diffraction pattern, does not explicitly have translational symmetry. Therefore, the symmetry of a diffraction pattern must be that of one of the 32 point-groups. Only eleven of those have the, by Friedel's Law, required centrosymmetry, and are referred to as the Laue groups.

Buerger has calculated there are 122 different possible symmetry types for diffraction experiments, the diffraction symbols, one of which must fully describe the diffraction pattern. Fifty-eight of these diffraction symbols uniquely define a space group, the others give rise to ambiguity. Many of the latter 64 contain mirror planes and inversion centres, which makes them impossible for natural proteins. So usually in protein crystallography the space group of a crystal is reasonably easily determined from a diffraction pattern, with the help of data processing computer programs.

The theory of data processing is mainly concerned with correcting for the assumptions and approximations made above. What is actually measured in a diffraction experiment, is the total energy scattered by the crystal towards the diffraction maximum hkl. Sherwood (1976) calls this quantity Ehkl:

Ehkl = I0
w
KLhklphklA'hklehkl |(Fhkl)T|2V
(18)

The proportionality with the intensity of the incident beam I0 and the volume of the crystal V is intuitively obvious. A'hkl and ehkl correct for absorption and extinction effects. w is the angular velocity of the rotation of the crystal, which comes into play when considering the amount of time needed for a reciprocal lattice point to pass through the Ewald sphere. K represents a number of fundamental constants involved with the physics of the experiment (e.g. the mass of an electron and the wavelength of the radiation used). The polarisation factor phkl is specific for the type of incident beam used, with different values for e.g. randomly polarised and monochromated beams. The geometry of the diffraction experiment and the position of a diffraction maximum relative to O (see figure 2.1), have an effect on what is recorded when the reciprocal lattice point passes through the Ewald sphere, which is taken into account with the Lorentz factor:

Lhkl = 1
l
w
nn
(19)

where nn is the component of velocity n normal to the Ewald sphere surface in the direction of C. n is the linear velocity in the direction PI if the crystal is rotating with angular velocity w around an axis through O perpendicular to the plane of the paper (see figure 2.1). l is the wavelength of the X-ray experiment.

The effect of thermal motion is taken up in (Fhkl )T, which has (fj )T = fj e-Bj(sin2q/l2) as atomic scattering factor rather than just fj, and thus falls off more rapidly with q, as discussed in section 2.1. Bj is the so-called temperature factor.

In protein crystallography there is another major factor of influence in data collection: protein crystals suffer radiation damage. Performing the diffraction experiment at cryogenic temperatures will reduce this effect dramatically. Any remaining damage needs to be corrected for during data processing.

Apart from the desired discrete diffraction due to the crystal, there is continuous background scattering of X-rays due to scattering by the air and incoherent scattering of the crystal (a quantum mechanical effect). The final number assigned to a diffraction maximum is obtained by integration, the total area of the spot, rather than the absolute peak height. It depends on numerous parameters of the specific diffraction experiment, and thus can not be determined on an absolute scale. After integration, equivalent reflections are scaled and merged, and standard deviations determined by estimation from the counting statistics, attempting to take into account systematic errors of the diffraction experiment. Because of counting errors, some of the resulting measured intensities will have a negative value, although a true intensity can, of course, not be negative [French and Wilson, 1978]. If left negative, such measurements will pose problems upon taking the square root when calculating the structure factor modulus. Simply omitting them will bias the resulting structure, as will resetting them to zero without adjusting their standard deviations. French and Wilson have devised a way of treating negative intensity measurements more advantageously, based on Bayesian statistics. Their methods only have an effect on the smallest terms, without the discontinuity observed in other methods, and without resetting them to a constant value. This should result in least possible loss of information and least possible bias. Unfortunately, some data processing programs still insist on the conventional approach of resetting to zero or omitting.

2.4  Structure Determination

When a new structure can be assumed to be isomorphous with one that is already known, structure determination is simple: (2Fobs-Fcalc) and (Fobs-Fcalc) maps can be calculated with the observed magnitudes of Fhkl from the new structure, and the calculated magnitudes and phases of Fhkl from the presumed isomorphous known structure. These maps should show the small differences between the two structures well. Isomorphism can be assumed on the basis of a strong similarity between the cell constants of two structures, and a low value for the isomorphous R-factor:

Riso =

å
hkl 
||FPH| - |FP||


å
hkl 
|FP|
(20)

In all other cases the crystallographer has to deal with the aforementioned Phase Problem. For small molecules this problem can be solved with several methods based on the Patterson function, or by so-called Direct Methods. Direct methods are based on statistical relationships between structure factors, and often involve probability functions which depend on the reciprocal of the number of atoms. Moreover, they usually require data to extend to atomic resolution. For proteins, with relatively large numbers of atoms and generally lower than atomic resolution data, these methods are in general not powerful enough to solve the structure. However, a lot of effort is applied to determine ever larger structures ab initio, with some successes for small proteins and peptides (see for example Sheldrick et al., 1993).

The Patterson function is defined as

P(u,v,w) = 1
V

å
h 

å
k 

å
l 
|Fhkl|2 e-2pi(hu+kv+lw)
(21)

which is the inverse Fourier transform of the relative intensities |Fhkl|2 which are the experimentally measurable quantities. Because of the symmetry of direct and inverse Fourier transforms, it is valid to say the Patterson function is simply the Fourier transform of |Fhkl|2. Invoking some complex number theory, the convolution theorem and the notion that the electron density is the Fourier transform of the structure factor, it can be proved that

P(u,v,w) = V[r(x,y,z) Ä r(-x,-y,-z)]
(22)

Or, in words, the Patterson function is the convolution of the electron density with its centrosymmetric image, scaled by the volume of the unit cell. (u,v,w) represents a set of vectors between all pairs of atoms in the unit cell. The peaks in the Patterson function occurring at (u,v,w) have the strength of the product of the strengths at either side of the vector, which are directly linked to the number of electrons of the atoms in the atom pair, apart from scale factors. For a structure with N atoms there are N(N-1) non-origin peaks. Thus the Patterson map of a protein would contain a large number of peaks, many of which coincide or overlap, since the width of a Patterson peak is the sum of the widths of the appropriate two peaks in the corresponding electron density maps. This generally makes the direct determination of a protein structure from a Patterson map impossible, although there are exceptions (see for example Frazão et al., 1995). The protein crystallographer therefore has to resort to one of two other methods to overcome the Phase Problem: isomorphous replacement or molecular replacement. Molecular replacement can be used when the structure of a similar protein is known, e.g. a native protein in the case of a mutant, the same protein from a different species, a model from NMR studies, or an otherwise homologous protein which might serve as a model for the protein under study. In the case of a completely "new" protein for which no related structure is known, isomorphous replacement is the only available option. It is by this method that the very first insulin structure was determined [Adams et al., 1969].

2.4.1  Isomorphous replacement

The isomorphous replacement method is based on the heavy atom technique as developed for small molecules. The idea is that the diffraction pattern and phases are dominated by the contribution of one or a few heavy atoms if their atomic number is sufficiently large relative to that of the other atoms present. The position of the heavy atoms can be determined from a Patterson map, after which an electron density map can be phased on those atoms only. This map should reveal the positions of the other atoms. If no heavy atom is present, a suitable atom might be replaced by a heavy atom. If this heavy atom derivative can be crystallised isomorphously, the heavy atom method will facilitate the solution of both the native and modified structures. In proteins the production of a heavy atom derivative is often reduced to the soaking in of suitable heavy atoms into the solvent channels. When the procedure is effective, the structure of the protein is not disturbed and the heavy atoms bind sufficiently to certain groups. The small changes in the diffraction pattern can then be directly attributed to the heavy atoms. If FP is the structure factor for the native protein and FPH that for the derivative, the contribution of the heavy atoms is

FH = FPH - FP (23)

Although only the magnitudes |FP| and |FPH| can be measured, the vector FH can be calculated once the positions of the heavy atoms have been determined. The interpretation of the Patterson function can be made easier by taking (|FPH|-|FP|)2 = |FH|2cos2(aPH-aH) as coefficients. This is called isomorphous-difference Patterson Synthesis and in the resulting Patterson map all the available information which the native protein and the derivative have in common, has cancelled out, leaving a clearer picture of the heavy atom information, although some noise has been introduced. This noise stems from the cosine term in the derivation of the coefficients: ½|FH|2|FH|2 cos2(aPH-aH). Since aPH and aH are not correlated, this term only contributes noise.

In centric zones the structure factors are real quantities, so |FH| = |FPH| ± |FP|. Usually, because |FH| is relatively small, the difference is the correct estimate. Only when |FPH| and |FP| are both also small, the sum can be correct. In acentric cases the Phase Problem is more complicated. It can be solved with the aid of a phasing diagram as devised by Harker (1956). Figure 2.3 shows a phasing diagram for one isomorphous derivative. It can be seen that the phasing circles for the native protein and that for the derivative intersect at two points, giving rise to two phase triangles fulfilling equation 23. The Phase Problem is not solved unambiguously.

SIR phasing diagram
Figure 2.3: Phasing diagram for Single Isomorphous Replacement.
Two circles are drawn, one with radius |FP| centred on the origin, and one with radius |FPH| about the end of the vector -FH. The two points of intersection A and B of the two circles correspond to two possible values of the protein phase.

This can be remedied by adding another derivative circle as in figure 2.4. So long as the heavy atoms do not occupy the same sites, i.e. the derivatives are sufficiently different, and the native protein and the derivatives are sufficiently isomorphous, so that the phasing circles are well-defined, the ambiguity is resolved. The correct solution is the point where the three circles intersect.

MIR phasing diagram
Figure 2.4: Phasing diagram for Multiple Isomorphous Replacement.
See also legend to figure 2.3. When two derivatives are available, a third circle can be drawn, and in favourable circumstances the three circles intersect in one clearly defined point (point A in the diagram), thus resolving the phase ambiguity.

Another method of resolving the ambiguity is to take into account that the electrons in the crystal are not really free; atoms scatter X-rays as if they contain electronic dipole oscillators. This phenomenon, called anomalous scattering, will be especially noticeable if the X-ray wavelength is just shorter than that which corresponds to an absorption edge of the atom in question, so that the electrons in the atom can change quantum state. The atomic scattering factor depends on the frequency of the radiation, and is of the form

fj = (fj )C + Df' + iDf'' (24)

(fj )C is the centrosymmetrical approximation. Df' is the dispersion component and does not involve absorption. It is responsible for the change in refractive index away from unity. Df'' is responsible for the absorption coefficient of the medium. This additional information causes the breakdown of Friedel's Law (see equation 15) and can be used to resolve the phase ambiguity, as can be seen in figure 2.5.

SIRAD phasing diagram
Figure 2.5: Phasing diagram for SIR with Anomalous Dispersion.
See also legend to figure 2.3. The effect of anomalous scattering is to introduce a phase shift, which is different for reflections forming a Friedel pair.
Thus there are circles with radii |(Fhkl)PH| and |(F¯h¯k¯l)*PH| centred at -(FH+FH'') and -(FH-FH'') (points B and C in the diagram), intersecting with the circle corresponding to |FP| in only one point (point A in the diagram).

Without anomalous scattering
In the presence of anomalously scattering heavy atoms

In the case of insulin, anomalous scattering data was used in the form of what is called the 'anomalous Patterson', which uses (|Fhkl| - |F¯h¯k¯l| )2 as coefficients. This method was first used for proteins by Rossmann in 1961 and produced information of the distances between anomalously scattering atoms.

2.4.2  Molecular replacement

Molecular replacement methods were first developed by Rossmann and Blow (1962) to exploit the presence of non-crystallographic symmetry to obtain phase information and reduce phase uncertainties. The ideas were extended by Bricogne (1976) for real-space symmetry averaging to improve the electron density. The methods are based on a simple idea: finding a molecule in different places in a unit cell.

More recently, the term 'molecular replacement' is often used for the process of finding a similar molecule in a different crystal. If the structure of a molecule related to that under study is already known, it can be used to arrive at starting phases through orientation and positioning of the known molecule into the unit cell of the unknown. The basic principle of molecular replacement is rooted in the Patterson function. A Patterson map contains information of two types of vectors:

  1. intramolecular or self-Patterson vectors of pairs of atoms in the same molecule. These vectors are relatively short and are thus clustered around the origin;
  2. intermolecular or cross-Patterson vectors, which are generally longer than self vectors.

The self vector cluster would be equal for non-crystallographically related molecules in the same unit cell but also very similar for similar molecules in different crystals, apart from a rotation difference. The cross vectors provide information on the required translation of molecules to their positions relative to symmetry elements. Thus the process of molecular replacement can be divided into two: orientation and positioning, obliterating the need for a direct six-dimensional search, which is computationally unfeasible.

  The rotation function

The construction of a Patterson function can be regarded as consecutively placing each atom in the origin and drawing vectors from such an atom to all other atoms in the unit cell. The value assigned to the vector is proportional to the product of the atomic numbers at either end of the vector. Thus, the height of the origin peak in a Patterson map is proportional to the sum of the squared atomic numbers, due to the null-vectors from each atom to itself. As stated before, the cross vectors in a protein Patterson are generally longer than the self vectors, unless the protein molecules are unusually closely packed or non-globular in shape.

In the self-rotation function, the Patterson of the unknown structure is rotated with respect to itself. Maximum overlap of these Patterson maps will occur at zero rotation (obviously) and at rotations representing (non-crystallographic) symmetry. The overlap is calculated by integrating the product over an appropriate spherical part of Patterson space [Rossmann and Blow, 1962]. Similarly, two Patterson maps of related molecules (one known, one unknown) can be superposed when performing a cross-rotation function. The origin peak may be excluded by using an inner limit for the integration. The integration radius used should reflect the dimensions of the molecule and packing considerations, in order to exclude cross vectors. In the cross-rotation function the occurrence of cross vectors of the known molecule can be prevented altogether by putting the molecule in an artificially large P1 cell, so that cross vectors lie well outside the integration radius.

The general notation for the rotation function, seen as an overlap function of a Patterson function P(u) and a rotated Patterson function Pr(ur), dependent on Eulerian angles a,b,g, is

R(a,b,g) = ó
õ


U 
P(u).Pr(ur)du
(25)

where U is the appropriate volume of integration, and u is a Patterson vector (u,v,w).

If there is any prior knowledge of non-crystallographic symmetry, this can be used as a constraint in the calculation of what is named the 'locked' rotation function, a procedure first described by Rossmann et al., (1972). Assumed point-group symmetry can be imposed on the unknown molecule, improving the rotation function. The explicit use of non-crystallographic symmetry acts as a way of averaging the previously separate rotation functions Ri [Tong and Rossmann, 1990]:

RLOCKED =
n
å
  i = 1 
Ri

n
(26)

where n is the number of non-crystallographic symmetry elements.

The rotation functions used in the studies of the structures in this thesis, were performed with the program package AMoRe [Navaza, 1994]. The rotation function in AMoRe [Navaza, 1987] is based on the fast rotation function, using spherical harmonics and Bessel function expansions [Crowther, 1972], the detailed mathematics of which are beyond the scope of this thesis. The adaptations made by Navaza permit more accurate computation of the rotation matrices [Navaza, 1990], and enhance the resolution of the rotation function peaks. Removal of the origin (usually performed by using an inner limit for the integration radius) is achieved by excluding the lower order spherically symmetrical Bessel functions.

AMoRe uses an uncommon philosophy at the stage of the structure factor calculation from the model electron density. The model is rotated so that its principle axes of inertia are aligned with the cell axes and then placed, with its centre of mass, at the origin of a P1 cell with linear dimensions around four times the size of the model. Continuous Fourier coefficients are then calculated on a fine grid. Therefore the problem of the need for sampling at non-integral reciprocal lattice points during the calculation of the rotated Patterson function, does not occur. The structure factors can be interpolated from the continuous set, rather than being calculated for every rotation. The loss of accuracy is small: only 1% in terms of crystallographic R-factor for the situation of a quadruple-size box. For a double-size box this discrepancy would be 8% [Castellano et al., 1992].

The program outputs a list of rotation function peaks with Eulerian angles a,b,g and a correlation coefficient. Because the model has been placed at the origin and rotated, the output is not in straightforward Euler angles. It needs to be shifted back, away from the origin, after which it may, in some cases, reveal directly interpretable information about the molecule under study.

  The translation function

The first fast translation function expression was devised by Crowther and Blow (1967):

Tjk(v ) = ó
õ


V 
Pjk(u,v) . Pobs(u)du
(27)

Pjk(u,v) is the cross-Patterson function of the model structure in which two molecules (j and k) are related by crystallographic symmetry. v is the intermolecular vector between the local origins of these two molecules. Pobs(u) is the Patterson function of the unknown structure. This function effectively searches one pair of asymmetric units, i.e. one crystallographic symmetry element, at a time. The function has two limitations:

  1. The self vectors confuse the picture. They can easily be removed by subtraction, resulting in the T1 function:

    Tjk1(v) = ó
    õ


    V 
    [Pobs(u) -
    å
    i 
    Pii(u)] . Pjk(u,v)du
    (28)

    assuming the intramolecular vector set of the unknown structure is the same as that of the model;

  2. The solution of the translation function only represents one symmetry operator; all other symmetry information (including non-crystallographic symmetry) is lost. An improved translation function has been derived, resulting in peaks for all possible intermolecular vectors. The model now has as many molecules in the cell as the unknown structure. It is more convenient to express this translation function in terms of the translation vector t for the identity asymmetric unit. The expression contains the calculated Patterson for the correctly oriented model with unknown translation:

    T0(t) = ó
    õ


    V 
    Pobs(u) . Pcalc(u,t)du
    (29)

Both improvements can be incorporated together in the most commonly used T2 translation function:

T2(t) = ó
õ


V 
[Pobs(u) -
å
i 
Pii(u)] . [Pcalc(u,t) -
å
i 
Pii(u)]du
(30)

Computation of translation functions is usually performed in reciprocal space, sometimes using normalised structure factor amplitudes

Ehkl = Uhkl
á |Uhkl| 2 ñ1/2
(31)

where Uhkl is the unitary structure factor

Uhkl = Fhkl
å
j 
fj
(32)

and á |Uhkl |2 ñ is the average value of |Uhkl|2. The normalised partial structure factor for the j'th asymmetric unit is related to the symmetry-equivalent structure factor for the identity by a simple phase shift, so there is no need to completely recalculate the structure factors every time.

Translation function calculations can be improved by

The translation function may be by-passed altogether by a trial and error search, moving the correctly oriented model through the asymmetric unit and calculating the conventional R-factor as a function of the molecular position

R =

å
hkl 
||Fobs| - k|Fcalc||

å
hkl 
|Fobs|
(33)

During these R-factor search calculations, the Fourier transform of the molecule only needs to be calculated once, for translations correspond to a pure phase shift which can be applied directly. A slight improvement would be to calculate standard linear correlation coefficients, which are insensitive to structure factor scaling, rather than R-factors:

SLCC =

å
hkl 
|Fobs|2 - á |Fobs|2 ñ ) . ( |Fcalc|2 - á |Fcalc|2 ñ )

é
ë

å
hkl 
|Fobs|2 - á |Fobs|2 ñ )2 .
å
hkl 
|Fcalc|2 - á |Fcalc|2 ñ )2 ù
û
1/2

(34)

AMoRe uses straightforward T0 translation functions (Navaza and Vernoslova, 1995), with the option to incorporate information concerning already placed models [Navaza, 1994], enabling the positioning of additional molecules in an asymmetric unit. The program outputs a,b,g, a translation vector (x,y,z) and an R-factor and correlation coefficient, the rotation and translation information again in "internal AMoRe format" (vide supra), which can be converted back to standard Euler angles.

  More about AMoRe

A very powerful component of AMoRe is its fast rigid-body refinement procedure, which is very efficient in the evaluation of the correctness of the molecular replacement solution. The ideas on which it is based, were first developed by Huber and Schneider (1985). It is not a least-squares rigid-body refinement of coordinates, but rather a minimisation of the misfit, where rotations and translations are calculated by interpolation and phase shifts.

The sub-program (FITING) is described by Castellano et al. (1992). The minimisation procedure involves derivatives which are calculated from analytical expressions rather than numerical differentiation as used by Huber and Schneider (1985). The radius of convergence is around 10o for the angles and half the resolution of the data for the translations [Castellano et al., 1992].

2.5  Refinement of protein structures

After determination of an initial model, either through molecular replacement or isomorphous replacement, the structure needs to be refined in order to obtain a set of atomic coordinates that corresponds best to the observed data. Various systematic and random errors have an effect on the accuracy of initial models. These errors have their origin for example

Therefore, the initial model is only an approximation, and the structure factors calculated from the model are in poor agreement with the observed structure factors. This agreement is commonly stated in terms of the reliability index R:

R =

å
hkl 
||Fobs| - k |Fcalc||


å
hkl 
|Fobs|
(35)

where k is the scale factor between |Fobs| and |Fcalc|. Wilson (1950) has shown that for an entirely incorrect non-centrosymmetric structure R=0.586. During the course of refinement, progress is followed by calculating the reliability index at various stages. This R-factor may reach artificially low values due to model bias introduced during the refinement procedure. From the statistical method of cross validation a new method for avoiding bias in macromolecular structure refinement has been developed by Brünger (1992). It is based on the idea that a fully refined model should be equally valid if new data is introduced. In the statistical cross validation method it is common to omit a single data point from the data, after which refinement takes place, rather than obtaining new data after refinement. Since the information from one data point is statistically unreliable, many data points have to be omitted in turn and the remaining information has to be fully refined. This full protocol would be impractical in general, because it would involve as many refinements as there are observations. In protein crystallography, however, the data usually exhibit a certain degree of redundancy, which means that a statistically relevant number of data may be omitted from the data prior to refinement. These data can then be used to follow the progress of refinement by means of the 'free R-factor', as introduced by Brünger (1992):

RTfree =

å
hkl Î T 
||Fobs| - k |Fcalc||


å
hklÎ T 
|Fobs|
(36)

RTfree refers to the 'test set T', a set of reflections (commonly 5-10% of the observed data) which are excluded from refinement. Refinement is carried out with the remaining reflections only, called 'working set W'.

Refinement is usually performed by 'least-squares' minimisation, i.e. the sum of the squared differences between calculated and observed structure factors is minimised. In traditional least-squares methods in protein crystallography, the function to be minimised is commonly chosen as

Q = å
hkl 
w(hkl)(|Fobs| - k |Fcalc|)2
(37)

Q is called the crystallographic residual. The summation is taken over all independent reflections hkl, and w(hkl) is the weight applied to each individual observation. w(hkl) can be any weighting scheme, but is usually, at least initially, chosen to be 1. Some computing packages have the option of use of s, the standard deviation of the observation hkl, in the form of [1/(s2(hkl))]. The minimum of Q is reached when the first derivative with respect to each parameter pj is zero:

Q
pj
= 0
(38)

or

å
hkl 
w(hkl)D |Fcalc|
pj
= 0
(39)

where D = |Fobs|-|Fcalc| (assuming absolute scale). Starting from a set of parameters p and necessary corrections x, D can be expanded as a Taylor series to the first order:

D(p + x) = D(p) - nå
i = 1 
xi |Fcalc|
pj
(40)

for n parameters. Combining equations 39 and 40 results in normal equations of the form:

nå
i = 1 
é
ê
ë
å
hkl 
w(hkl) |Fcalc|
pi
|Fcalc|
pj
ù
ú
û
xi = å
hkl 
w(hkl)D |Fcalc|
pj
(41)

The n normal equations may be expressed in matrix form:

Ax = b or å
i 
aijxi = bj
(42)

where

aij = å
hkl 
w(hkl) |Fcalc|
pi
|Fcalc|
pj
(43)

and

bj = å
hkl 
w(hkl)D |Fcalc|
pj
(44)

These equations are solved by matrix inversion for the set of shifts x which are then applied to the estimates p. The process is repeated until convergence is reached, i.e. when the xi become negligible.

In protein crystallography the sheer size of matrix A prohibits calculation of all the elements and full inversion. A method must be chosen to approximate these steps. In the various refinement packages available today, different ways have been chosen for the minimisation:

The success of protein structure refinement depends largely on the degree of overdetermination of the system, i.e. the number of observed data for every parameter to be refined. The ratio of observed data over parameters may be changed favourably either by providing additional data (often the incorporation of stereochemical knowledge in the form of restraints) or by reducing the number of parameters (often by constraining features of the structure to specific values). In the case of restrained refinement, in practice the function minimised is then a composite of diffraction and stereochemical information:

Qcomposite = Q + Qstereo
(45)

In PROLSQ Qstereo contains terms for bonding distances, planar groups, chiral centres, nonbonded contacts and torsion angles. The formalism of equation 45 can also be used for the description of the minimisation function in refinement packages other than PROLSQ. For example, in CORELS Q would be the usual sum of structure factor differences, while Qstereo would be replaced by three terms, namely for all interatomic distances, nonbonded close contacts and a term to prevent the structure from moving away too far from the input coordinates [Sussman, 1985]. In EREF again the Q represents the crystallographic term, and Qstereo takes on the form of a conformational energy term consisting of contributions for bonds, bond angles, torsion angles and nonbonded interactions as described by Levitt (1974). Each term in Qcomposite requires appropriate weighting and scaling.

In general, least-squares refinement as described above, has a limited radius of convergence. It generally does not correct errors in atomic coordinates which are larger than 1Å and is easily trapped in local minima. To avoid this, a refinement technique is required which will allow up-hill steps away from the nearest minimum. Simulated annealing is such a method. Metropolis and co-workers (1953) introduced an algorithm to simulate a collection of atoms at a given temperature. In each step of the simulation an atom is moved at random and the effect on the total energy is computed. If the energy decreases, the move is accepted, if it increases, it is accepted with a probability which is proportional to exp[-DE/kbT] (DE is the energy difference between the two states, kb is the Boltzmann constant and T the temperature). When the energy function is replaced by the cost function of a minimisation problem, and the atomic parameters by the parameter set x, the Metropolis procedure can be used to generate a set of configurations at a given temperature of the minimisation problem. The temperature parameter determines the amount of energy available to the system to overcome energy barriers, i.e. at a higher temperature more steps are accepted which lead to an increase in the cost function. This allows the minimisation to escape from local minima. Simulated annealing is a process in which the temperature is first raised to a relatively high value, after which the system is slowly cooled until it freezes and no more changes occur. The maximum temperature, the rate of cooling, and the number of rearrangements of x determine the 'annealing schedule'. A direct application of the Metropolis algorithm to the refinement of macromolecules is very inefficient: most of the random changes in x would lead to violation of covalent bonds. Instead, molecular dynamics is used to follow the gradients of a target function which includes an empirical energy term to introduce stereochemical and other restraints. Thus the total energy is the sum of an empirical energy describing the restraints and an effective potential energy to include the X-ray information. The empirical energy contains terms for bond lengths, bond angles, dihedral torsion angles, chiral centres, planarity of aromatic rings and Van der Waals and electrostatic interactions:

Ei = Ebonds + Eangles + Edihedrals + Echiral, planar +EvdWaals, elec
(46)

The program package X-PLOR essentially uses the stereochemical and nonbonded parameters as formulated for the molecular dynamics program CHARMM [Brooks et al., 1983]. Conventional refinement is performed using a conjugate-gradient minimisation algorithm. The application of simulated annealing consists of 'heating and cooling' the structure. An initial temperature, typically 3000K, is applied to the system. Two protocols are possible: either the system is extensively equilibrated at this temperature after which cooling takes place, or the system is allowed to cool directly, very slowly. In the first protocol the initial velocities are assigned from a Maxwellian distribution at the appropriate temperature. The temperature is controlled through rescaling of the velocities at set intervals of the simulation. The scale factor is determined from:

Wa =
á
å
i 
mViold(t)2 ñ

nkbT
(47)

where m is the mass and Viold the velocity of atom i at time t, kb is the Boltzmann constant, T the temperature, n the number of degrees of freedom and á ñ the average over the time interval between rescaling of the velocities. It has been shown [Weis and Brünger, 1989] that towards the end of refinement a reduction of the weight by 10-40% is beneficial, in order to prevent deterioration of the model geometry. Cooling of the system is effected through stepwise lowering of the temperature. At each temperature the system is allowed to equilibrate employing the rescaling as expressed in equation 47.

A large number of applications (for an extensive overview see Goodfellow et al., 1989) has shown that the shifts observed are very much larger than for conventional refinement methods. This leads to dramatic improvements for initial models of macromolecular structures. Although the radius of convergence of refinement with simulated annealing/molecular dynamics techniques is much increased, manual intervention in the form of map inspection and model rebuilding is still necessary.

Until a few years ago, manual rebuilding was mainly performed with the graphics program FRODO [Jones, 1978]. Recently, the molecular modelling package Quanta [Molecular Simulations, 1996] has been extended with a large crystallographic module, containing the application X-AUTOFIT [Oldfield, 1996]. X-AUTOFIT is designed to enhance the efficiency of de novo map building as well as general model building in later stages of macromolecular refinement. The model building is carried out with the aid of real space refinement, regularisation and rigid body refinement algorithms, as well as traditional manual editing facilities. Three refinement techniques are implemented: grid searching about torsions, torsion angle real space gradient refinement and Monte Carlo fitting. For the automatic building and refining of ligand and water coordinates Quanta provides the applications X-LIGAND and X-SOLVATE, respectively. X-LIGAND is designed to search for unsatisfied electron density, sort these in order of volume and fit a ligand automatically. It can be used to fit small (in)organic molecules (such as sulfate ions and phenol molecules), polysaccharides and small polypeptides. X-SOLVATE allows the search of an electron density map for water peaks in a specified space around the protein.

Building up the water structure is also very conveniently performed with the program ARP [Lamzin and Wilson, 1992]. The program seeks to improve the fit of atomic coordinates to electron density maps by appropriate removal and addition of atoms. This procedure is alternated with reciprocal space refinement of the complete set of coordinates with, for example, PROLSQ or REFMAC.

2.6  Maximum entropy and maximum likelihood

2.6.1  The concept of entropy

From information theory came the realisation that there is a unique, unambiguous criterion for the amount of uncertainty represented by a probability distribution

q(x) = N
å
i = 1 
xi

(N is the number of states of xi, and åi qi = 1) or òV xdx for a continuous distribution in volume V. This quantity

S(qi ) = - N
å
i = 1 
qiln qi
(48)

is continuous in qi, increases with increasing uncertainty and is additive for independent sources of uncertainty [Shannon and Weaver, 1949]. In the simple case where the system consists of two events x and y, it can be shown that

S(x,y) £ S(x)+S(y)
(49)

Or, in words, the uncertainty of a joint event is less than or equal to the sum of the individual uncertainties (equality only occurring if the events are independent). Also, the probability of event y is influenced by the occurrence of event x, resulting in a conditional entropy for event y, Sx(y). It then follows that

S(x,y) = S(x)+Sx(y)
(50)

Thus the uncertainty of the joint event x,y is the uncertainty of x plus the uncertainty of y when x is known. Combining equations 49 and 50, it can be seen that S(y) ³ Sx(y), or, the uncertainty of y is never increased by knowledge of x. It will be decreased unless x and y are independent events, in which case it is not changed.

The quantity S is equivalent to the expression for entropy as found in statistical mechanics [Jaynes, 1957]. The minimum value of the entropy, S = 0, is reached when one of the probabilities qi is 1, i.e. when there is no uncertainty. In that case there is no freedom of choice and no information is gained from the situation. The unconditional maximum value for S, namely lnN, is reached when all parameters xi are equally probable, i.e. for a uniform distribution. Thus S increases when the qi are more 'average'.

2.6.2  Maximum entropy

Whenever an experiment is undertaken, there is some form of prior knowledge, even if it is only intuitive. Jaynes (1979) writes: "merely knowing the physical meaning of [the] parameters [of our experiment] already constitutes highly relevant prior information which our intuition is able to use at once. Can we analyse how our intuition does this, extract the essence, and express it as a formal mathematical principle that might apply in cases where our intuition fails us?". In fact, an experiment in itself constitutes information, which has an effect on the entropy of the system under study.

In most experiments, the unconditional maximum value for S, as described in the previous section, will not be reached, simply because of the requirements the prior knowledge puts on the experiment. If testable information is available, the process of finding the maximum value of S will be put under constraints. Here 'testable information' is defined as: such that it may be determined unambiguously whether the information does or does not agree with the proposed tested distribution pi, the posterior distribution. An example of testable information is: "It is certain that cosa < 0.7". Untestable information (e.g. "The mean value of cosa might be less than 0.7") can at present not be used in mathematical theory. Jaynes, however, believes that new principles for this must exist [Jaynes, 1979].

In an experiment undertaken to test information, linear constraints may be given in the following form:

N
S
i = 1
pi fk(xi) = Fk for k = 1,m
(51)

where Si pi = 1, which is in fact the zero-order constraint. Fk is the expected value of fk in the experiment (see example I). fk(xi) is a set of m functions of the parameters xi. If m < N, the maximum entropy may be calculated. The derivation of the formula for the maximum entropy may be done as follows (for the mathematical principle of 'Lagrange multipliers' see appendix A):

pi = 1
Z(l1...lm)
exp[l1 f1(xi) + ... + lm fm(xi)]
(52)

where

Z(l1...lm) º N
å
i = 1 
exp[l1 f1(xi) + ... + lm fm(xi)]
(53)

is the partition function, which is a normalisation function so that the sum of the pi is 1. lk are the Lagrange multipliers, which are chosen so as to satisfy the constraints 51, namely

Fk =
lk
lnZ for k = 1,m
(54)

This is a set of m equations for m unknowns. The resulting maximum entropy is a function of the prior information:

S(Fk) = lnZ -
å
k
lk Fk
(55)

  Example I

An example of the use of entropy in sciences other than physics was given by Jaynes in 1962, and was reproduced in an extensive article called "Where do we stand on Maximum Entropy?" [Jaynes, 1979]. A seemingly ordinary die has been tested prior to a real experiment, and the information from this test is: the average number of spots up is not 3.5 as in an 'honest' die, but 4.5. What is the probability assigned to i spots (1 £ i £ 6) on the next throw? The data may be interpreted in the form of a constraint:

6
å
i = 1
i pi = 4.5
(56)

The partition function is Z(l) = åi eli = x(1-x)-1(1-x6) with x º el. Equation 56 becomes
[()/( l)] lnZ = [(1-7x6+6x7)/( (1-x)(1-x6))] = 4.5
or 3x7-5x6+9x-7 = 0. The desired root is x=1.44925, giving l=0.37105 and Z=26.66365. The maximum entropy probabilities are pi = Z-1xi, or {p1...p6}={0.05435, 0.07877, 0.11416, 0.16545, 0.23977, 0.34749}. This is the distribution based on the least assumptions, i.e. the distribution which is nearest to being uniform under the constraints presented in the description of the problem. The entropy of this distribution is S=1.61358. The unconditional maximum entropy, i.e. the entropy of an honest die, is ln6=1.79176, representing a system with no constraints and a totally uniform distribution.

2.6.3  Relative entropy

The availability of prior knowledge implies the existence of a prior probability distribution q(x), which can be either discrete or continuous. After performing an experiment, a posterior probability distribution p(x) may be calculated. The relative entropy of the posterior distribution (the distribution of the outcome of the experiment) with respect to the prior distribution (i.e. before the experiment) is defined as:

S(p,q) = - N
å
i = 1
piln(pi/qi)
(57)

where N is the number of possible states of the system. This illustrates that the probability distribution over the system states is modified by the experiment, from prior distribution q(x) to posterior distribution p(x) (see example II).

  Example II

An experimenter is given a set of dice, commonly expected to have qi=1/6. An experiment of throwing the dice and calculating the entropy of each system is undertaken (see figure 2.6).

Statistics for different dice
Figure 2.6: Throwing dice.
Different types of dice and their associated distributions, indicating the correlation between entropy and information. 0ln0 is assumed to be 0. Example reproduced from Kinneging (1986).

From this example it can be seen that the more predictable the outcome is, the lower is the relative entropy of the system under investigation. Again totally predictable is seen as gaining no information (compare section 2.6.1).

  The minimum entropy principle

The main goal of a successful experiment, is obtaining as much undoubtable data (to avoid confusion, the term 'information' has not been used here) as possible, leaving the least uncertainty. Entropy may therefore be used as a guideline in the choice of experiment. An experiment which provides maximum data has minimum entropy relative to any other experiment pertaining to the same question. These undoubtable minimum-entropy data provide a boundary for the entropy maximisation to obtain the best possible distribution based solely on trusted facts and no unprovable assumptions. In cases where the minimum-entropy experiment can not be chosen, the resulting data will not be optimal.

2.6.4  Bayes' theorem and likelihood

The use of prior knowledge and the inference of information from the observations of an experiment may also be dealt with entirely through statistics. To this end, Bayes proposed the following theorem:

P(B|A) = P(B)P(A|B)
(58)

P(B) is the prior probability of event B. P(A|B) is the conditional probability of obtaining experimental information A in an experiment to test B. P(A|B) modifies prior probability P(B) in such a way that posterior probability P(B|A) is obtained, the probability of B when A has been seen. This simple description implies that experimental information A and the connected probabilities constitute a learning experience about B. The right-hand side of equation 58 can be normalised for a set of events Br, resulting in

P(Br|A) = P(Br)P(A|Br)

å
r
P(Br)P(A|Br)
(59)

In the next step of the learning process, the previous information A is seen as prior information H, after which new experimental information A will be obtained through a new experiment. The following form of Bayes' theorem is used at this stage:

P(Br|A,H) = P(Br|H)P(A|Br,H)
å
r
P(Br|H)P(A|Br,H)
= P(Br,A|H)
å
r
P(Br,A|H)
(60)

The terms of this equation may be explained as follows:

Br
is a set of mutually exclusive and exhaustive events (in the statistical sense), i.e. the knowledge tested through an experiment.
P(Br|H)
is a prior probability, which is the probability of obtaining the parameters Br solely on the basis of prior information H.
P(Br|A,H)
is a posterior probability, i.e. the probability of obtaining new parameters Br on the basis of prior information H and experimental information A.
[(P(A|Br,H))/( å rP(Br|H)P(A|Br,H))]
may be called the likelihood L(Br). This is the probability that experimental information A is obtained in an experiment to study parameters Br when prior information H is available.
P(Br,A|H)
is called the joint probability of Br and A conditional on H.

Thus the theorem states that the probability that Br occurs given the occurrence of A and currently available information H, is proportional to the probability of Br on H multiplied by the probability of A on Br and H. Or in other words, it gives the probabilities of Br when A is known to have occurred.

  Example III

The use of Bayes' Theorem may be explained by means of an example taken from Kendall's Advanced Theory of Statistics (1987). A box contains four dice, which are known to be either (a) all white, or (b) two white and two black. A die is drawn at random and found to be white. What is the probability that all the dice are white? Based on (a) and (b) there are two hypotheses B1 and B2. On B1 the probability P(A|Br,H) (see equation 60) of getting a white die is 1, on B2 it is 1/2. From equation 60 it can be seen that

P(B1|A,H) =
P(B1|H)
P(B1|H)+ 1
2
P(B2|H)
(61)
P(B2|A,H) =
1
2
P(B2|H)

P(B1|H)+ 1
2
P(B2|H)

Bayes postulated that it is reasonable to take prior probabilities to be equal when nothing is known to the contrary, so P(B1|H) = P(B2|H) = 1/2. Thus

P(B1|A,H) = 2
3
and P(B2|A,H) = 1
3
(62)

The experiment continues by replacing the die and again drawing one at random. If it is found to be black, hypothesis (a) is rejected outright. If it is white, new posterior probabilities can be calculated in which the probabilities from equation 62 become priors: P(B1|H) = 2/3 and P(B2|H) = 1/3, where H includes A, the occurrences from the previous experiment. The new posterior probabilities based on the new event A' are

P(B1|A',H) = 4
5
and P(B2|A',H) = 1
5
(63)

On repeating the process and again drawing a white die, the new posterior probability of (a) will be still higher, in agreement with common sense: repeated sampling (with replacement) without producing a black die increases the probability that there are no black dice present.

2.6.5  Entropy and likelihood in protein crystallography

In 1982 Collins proposed the use of iterative entropy maximisation for macromolecular crystallography. The ideas involved have their foundations in information theory and image enhancement, and their application has the potential to produce electron density maps of higher interpretability (through higher signal to noise ratio) than expected for Fourier synthesis of the available structure factors [Collins, 1982]. Applying the general statements from sections 2.6.1 to 2.6.3 to crystallography it can be seen that

Although the Phase Problem will not be solved experimentally in the foreseeable future, and is mathematically indeterminate when only amplitudes are known, some combinations of phases are more acceptable than others when solving a crystal structure. This is because the phases will have to yield a chemically valid distribution of atoms.

Bricogne (1984) proposed the use of a phasing method based on entropy maximisation and likelihood ranking as a means of applying probabilistic direct methods to macromolecules. The mere assumption that a crystal structure is made up of atoms, is enough to provide statistical restraints on the structure factors derived from the data, even if the atoms are randomly placed and independent of each other.

There are a number of advantages to maximum entropy methods (hereafter called MEM) over conventional direct methods (CDM). In principle the starting point of CDM and entropy/likelihood methods is the same, but the latter do not require a uniform distribution of the probability distribution of the atoms in the asymmetric unit [Bricogne and Gilmore, 1990]. Also, MEM do not require high overdetermination by the data, unlike CDM [Bricogne, 1991a]. Thirdly, CDM are most accurate for small Fs because of the method of approximation of the probability distribution P(F) (the joint probability distribution of all Fs), whereas through MEM it is possible to use the more informative large Fs (see below). Fourthly, in deriving P(F) from the random atom model, CDM have, until recently, never made use of the complete P(F) but instead considered individual terms for small groups of reflections which were combined (joined) later. However, Peschar and Schenk (1987) have addressed this problem for CDM.

  Bricogne's Multisolution Tree

For a uniform prior distribution of atoms (q([r]) º 1/V) the joint probability distribution P(F) may be calculated to various degrees of approximation. Observed magnitudes |F| may be substituted into P(F) after which only phases remain as free variables, which yields a conditional distribution for the phases under the constraints of the observed |F|s. This conditional distribution should imply that certain combinations of phases are more probable than others. For small |F|s the conditional distributions of the phases will be least informative because they do not significantly narrow down the uncertainty in F. With MEM 'recentering' is possible by setting up local approximations to P(F) away from |F| = 0 by means of 'trial phase' shifts. This is equivalent to using a non-uniform prior distribution of atoms, the maximum entropy distribution qME([r]). From this description of recentering to obtain local approximations of P(F) it may be seen it is not so much the maximum entropy distribution qME([r]) which is the really useful result, but the optimal approximation to P(F). The maximum entropy distribution is almost just a by-product of the calculations of P(F). The resulting joint probability distribution for all Fs is called PSP(F), the saddlepoint approximation to P(F). In practice, the diagonal approximation to PSP(F) is used.

The first set of phases H chosen on the basis of the first conditional distribution may be seen as the origin-fixing phases and are assigned to the 'root node' of a 'solution tree', which will help to keep track of the phase choices made at various stages. At this node with corresponding basis set H the unique distribution qME, having maximum entropy compatible with the data attached to the node, is constructed by maximising the relative entropy

Sm(q) = - ó
õ


V
q(r) ln
q(r)

m(r)
d r
(64)

under the constraints

ó
õ


V 
q(r)exp(2pi h · r) dr = |U[h]|obsexp[if(h)]    for all h Î H
(65)

m([r]) is the normalised prior distribution of q([r]) which comes about because of the use of unitary structure factors U[h]. Bricogne has named m([r]) the 'prior prejudice' [Bricogne, 1984] and the 'probability density' [Bricogne and Gilmore, 1990]. f([h]) is the phase of reflection [h].

Besides reproducing the amplitudes and phases attached to the node for reflections in H, qME has Fourier coefficients U[k]ME with non-negligible amplitude for many non-basis reflections [k]. This means that the conditional distribution of |U[k]| deviates from the normal Gaussian distribution of Wilson statistics, which would correspond to |U[k]ME| = 0, and that the atoms are distributed according to qME rather than uniformly. The two distributions just derived (q0 and qME) under two different hypotheses on U[h] can not be easily compared because they involve phases f([k]) of non-basis reflections. However these phases may be integrated out to obtain the conditional probability distributions not of structure factors U[k] for reflections [k] in non-basis set K but of their moduli |U[k]|. Then the conditional probabilities assigned to the observed values of the moduli under both hypotheses may be compared, the better assumption being that which assigns the highest probability to them. This quantity acts as a figure of merit and is called the 'likelihood' of the hypothesis, defined as 'the probability it assigned to the actual result of an observation before that observation was performed' [Bricogne, 1991a]. The comparison is best carried out using the 'likelihood ratio':

L(U[h]ME)
L(0)
= P(U[k]obs | U[h] = U[h]ME)
P(U[k]obs | U[h] = 0)
(66)

where U[k]obs = |U[k]obs|. The likelihood ratio measures the extent to which the observed values of yet unphased moduli |U[k]obs| are made more probable by the assumption that U[h] = U[h]ME than by the assumption that U[h] = 0. The logarithm of the likelihood ratio is called the 'log-likelihood gain', which is also often used.

Invoking Bayes' theorem the posterior probability distribution of U[h]ME may be calculated by

Ppost(U[h]ME) µ PSP(U[h]MEL(U[h]ME)
(67)

Then the basis-set phases in U[h] can be refined by maximising the posterior probability with respect to the phases.

From the descriptions so far, it can be summarised that the growth of the 'multisolution tree' has entailed:

  1. choice of basis set H of origin-fixing (and/or enantiomorph-defining) reflections and assignment of the root node;
  2. update of the prior distribution of atoms q([r]) to the maximum entropy distribution qME([r]) compatible with the phase choices, giving rise to structure factor values U[h] for that node;
  3. construction of the conditional distribution PdiagSP of the yet unphased structure factors in non-basis set K;
  4. construction of the likelihood function L by integrating the conditional distribution over the unknown phases and substituting the observed values of the moduli in K;
  5. refining the basis set phases in U[h]ME by maximising L with respect to these basis phases.

The subsequent growth of the tree involves 'branching out' and 'pruning':

  1. identification of the local maxima of the conditional distribution with respect to a subset of unknown phases when the values of the corresponding moduli are introduced;
  2. expansion of the current node by creating a branch leading to a new tip node for each of these maxima;
  3. in order to avoid development of spurious branches while maximising the chance of finding the correct set of phases, two criteria are used at each node:

The validity of the phase choices for H and some for K firmly established, the tree may be further developed from each surviving node by iteration from step 2. The entropy and the likelihood are complementary measures of the worth of the trial phase set, in that the entropy 'looks back' at the cost of accommodating the current phase assumptions, and the likelihood 'looks ahead' at the match between its predictions and the actual observations for yet unphased data.

Over the years Bricogne's theories have been updated with the incorporation of 'all sources of phase information normally used in macromolecular crystallography' [Bricogne, 1993] and the possibility to use the system for data other than from single crystal X-ray crystallography (e.g. powder diffraction data [Gilmore et al., 1991]). Bricogne envisages the complete implementation of the 'Bayesian programme', i.e. the use of Bayesian theory in every stage of macromolecular structure determination, from first phasing (ultimately ab initio) to full refinement [Bricogne, 1993]. Three program packages have been developed as a direct result of these theories, namely MICE [Bricogne and Gilmore, 1990], BUSTER [Bricogne, 1991b,Bricogne, 1991c] and SHARP [de la Fortelle and Bricogne, 1996].

  REFMAC

More recently, the program REFMAC [Murshudov et al., 1996b] has been developed, to perform maximum likelihood refinement. The likelihood function (vide infra) is constructed from available parameters at any point in a refinement procedure, whether immediately after structure determination or much later on in refinement. Instead of maximisation of the likelihood function, an equivalent calculation is performed: the logarithm of this function (the likelihood residual) is minimised to a desired degree of convergence through conventional methods. This procedure of constructing the likelihood function and then minimisation may be seen as a macrocycle of likelihood refinement, which may be iterated to convergence itself. The program uses prior knowledge of the stereochemistry of macromolecules, like in least-squares methods, but works with conditional probability distributions rather than the Gaussian distributions employed in least-squares. Partial phase information may be added and there is an option to use the information provided by experimental standard deviations. Different parts of a structure may be treated in different ways if the errors in the separate domains are known or suspected to be different. Bulk solvent scaling according to Tronrud (1996) has been incorporated.

When experimental data (e.g. observed magnitudes |F|) are available and parameters x (e.g. atomic coordinates) are to be estimated, Bayes' theorem may be written as:

P(x;|F|) = q(x) P(|F|;x)
P(|F|)
= q(x)L(x;|F|)
(68)

where ';' denotes conditionality to avoid confusion with the '|' in |F|. P(x;|F|) is the posterior probability of x on the basis of the data |F|; q(x) is the prior distribution of the atoms. In theory, this could be the maximum entropy distribution qME as described in the previous section. However, REFMAC does not deal with maximum entropy calculations. Therefore, q(x) is typically the distribution of a Molecular or Isomorphous Replacement model.

L(x;|F|) = P(|F|;x)
P(|F|)
(69)

is the likelihood function, where P(|F|;x) is the conditional distribution of experimental data when the parameters are known. For x Fcalc may be used, since Fcalc can be calculated directly from x. Thus

P(|F|;x) º P(|Fobs|all reflections;Fcalcall reflections)
(70)

This joint conditional probability function may be built up by generating the conditional probability distribution of each reflection and multiplying these together:

P(|Fobs|all reflections;Fcalcall reflections) = Õ
all reflections 
P(|Fobs|;Fcalc)
(71)

The refinement consists of maximising the likelihood function, which is proportional to the product function of the conditional probability distributions for each reflection (see equation 71). Maximising this function is equivalent to the minimisation of its negative logarithm, giving rise to what is called the '-log-likelihood':

- lnP(x;|Fobs|all reflections) = - lnq(x) - å
all reflections 
lnL(Fcalc;|Fobs|)
(72)

Because P(|F|) (see equation 69) will not change, effectively the function to be maximised is P(|Fobs|;Fcalc).

The principle of maximum likelihood may be formulated as follows: "an organised search for those combinations of phases associated with a 'basis set' of reflections which have maximum likelihood, i.e. which lead to the assignment of the highest conditional probability to the observed moduli belonging to reflections outside the basis set" [Bricogne and Gilmore, 1990], as described in the previous section. This is only one of many ways of formulating the use of Bayesian statistics and the principle of maximum likelihood in the field of protein crystallography.

One of the parameters of the likelihood function used in REFMAC is sA, which was first described by Srinivasan and Ramachandran (1965). sA is a combined measure of the completeness and the accuracy of a partial structure [Read, 1986]. In REFMAC sA is estimated from the 'free' reflections [Brünger, 1992] rather than all reflections as by Read, which can be done satisfactorily using only 200 reflections [Murshudov et al., 1996b]. From the theories by Luzzati (1952) and Read(1986,1990) the parameters m and Dj are derived:

Dj = ácos(2p h·D rj ) ñ
(73)

where á ñ denotes an expected value or probability weighted average. [h] is the reciprocal lattice vector, and [r]j are the atomic coordinates (in Å).

m = ácos(aN - aPcalc ) ñ
(74)

m has different values for acentric and centric reflections, and also depends on whether the partial structure P is known to contain errors or not. N = P+Q is the total number of atoms when Q is the missing structure. Therefore

FN = |FN|exp(i aN) = FPcalc + FQ
(75)

sA is then

sA = Dj Ö

åP / åN
(76)

where

åN = N
å
j = 1
fj2 = á|FN|2/eñ

in which e is a correction factor for the expected intensity in the appropriate reciprocal-lattice zone, denoting the number of identical contributions arising from symmetry (åP analogous to åN). fj is the atomic scattering factor. sA has a value between 0 and 1, where 0 indicates that the partial structure provides no phase information (because it contains no atoms or the partial structure bears no relation to the total structure), and 1 indicates a perfect and complete partial structure. Dj varies with resolution, so sA needs to be estimated for several resolution ranges.

The likelihood function in REFMAC has the following form:

LLK =
å
h 
LLKh
(77)

where

-LLKh = ì
ï
ï
ï
í
ï
ï
ï
î
c - ln(|Fobs|) +ln(2sFobs2+Swc) + |Fobs|2+|Fwc|2
2sFobs2+Swc
-lnI0( 2|Fobs||Fwc|
2sFobs2+Swc
)
(acentric)
  1
2
ln(sFobs2+Swc) + |Fobs|2+|Fwc|2
2(sFobs2+Swc)
-lncosh( |Fobs||Fwc|
(sFobs2+Swc)
)
(centric)
(78)

Here

Fwc = Npart
å
j = 1
DjFjcalc = |Fwc|exp(ifwc)

is the weighted sum of partial calculated structure factors.

åwc = e Npart
å
j = 1
åj(1-Dj2 ) where åj = Natom
å
k = 1
fkj2 for j-th partial structure.

Natom is the number of atoms, and Npart is the number of partial structures. sFobs is the experimental uncertainty in the amplitude |Fobs|. I0 is a zero order modified Bessel function.

Murshudov et al. (1996b) describe a number of special cases, where the likelihood function is reduced to simpler functions because of various assumptions and approximations:

  1. if some atomic parameters are approximately known and it is assumed that the missing atoms are distributed uniformly over the asymmetric unit the total distribution reduces to a distribution which neglects experimental errors for the missing structure (i.e. Dj for the missing structure is 0, and only the partial structure affects Dj ) [Srinivasan and Ramachandran, 1965];
  2. if the partial structure may be assumed correct and the only error is due to missing atoms, in which case Dj = 1 and åwc = åQ [Sim, 1959];
  3. in the case of a Wilson distribution (vide supra) no atomic parameters are known and the contents of the unit cell is distributed uniformly;
  4. when [(2|Fobs||Fwc|)/( åwc)] is small, the -log-likelihood residual is similar to the normalised intensity based Patterson correlation function (which is related to normalised intensity based least squares) [Bricogne, 1992];
  5. when |Fobs| » |Fwc| so that ln[(|Fobs|)/( |Fwc|)] is small, the -LLK residual is similar to an amplitude based least squares residual, the Gaussian based likelihood function. This assumption is reasonable near the end of refinement;
  6. if all reflections are measured with equal sFobs (which is very unrealistic) and all Dj » 1, -LLK will become a unit weighted least squares residual.

From parameters m and Dj Fourier coefficients are generated which are used in map calculations. Because absent reflections cause unpredictable noise (see for example Cowtan (1996)) REFMAC approximates them by setting them equal to their expected value (this is done for all reflections, including the 'free' reflections). Therefore the map coefficients are:

2FWT = ì
í
î
(2m|Fobs|-Dj|Fcalc|)exp(iacalc)
for reflections present in refinement
D|Fcalc|exp(iacalc)
for missing data
(79)
DFWT = ì
í
î
(m|Fobs|-Dj|Fcalc|)exp(iacalc)
for reflections present in refinement
0
for missing data

The maps produced are more informative and contain less bias than conventional 2|Fobs|-|Fcalc| and |Fobs|-|Fcalc| maps. Luzzati (1953) showed that a map with coefficients |Fobs|exp(iacalc) shows missing atoms at half weight if the structure is almost complete, and slightly lower when more of the structure is missing. Therefore the conventional map with coefficients (2|Fobs|-|Fcalc|)exp(iacalc) will bring up missing atoms with almost full weight. It has been determined [Vijayan, 1980] that for different amounts of missing structure a specific value of n in map coefficients [n|Fobs|-(n-1)|Fcalc|]exp(iacalc) is most appropriate. Also, Main (1979) showed that m|Fobs|exp(iacalc) @ 1/2Fobs+1/2Fcalc so that coefficients which reduce model bias are given by (2m|Fobs|-|Fcalc|)exp(iacalc) in the case of a perfect partial structure. Read (1986) extended this to the case of a partial structure with errors, resulting in the map coefficients (2m|Fobs|-Dj|Fcalc|)exp(iacalc) (see above).

2.7  Protein structure validation

The best statistical validation of a protein structure would be if it were to be determined and refined independently by a statistically significant number of experimenters. Not surprisingly, this never happens in protein crystallography because of the time needed to obtain a fully refined protein structure. The correctness and precision of the atomic parameters in the structure will therefore need to be assessed thoroughly, both during and after refinement. It has to be kept in mind that validation through computer programs is only as good as the parameters prespecified as ideal values inside these programs.

Protein refinement is inherently difficult because of the data being weak, not highly overdetermined, not to atomic resolution and prone to data error. Data quality has already significantly improved during the past five years for various reasons [Dodson, Kleywegt and Wilson, 1996]:

A first indication of the reliability of a protein crystal structure is the resolution to which the data has been collected and the structure refined. This can only be a rough measure, but it has been shown that at higher resolution the structure is closer to the correct structure [Hubbard and Blundell, 1987]. A measure of the accuracy and usefulness of intensity data, by measuring the agreements of equivalent reflections, is:

Rmerge =

å
hkl 
N
å
i = 1 
|áIhklñ-Ihkl(i)|


å
hkl 
N
å
i = 1 
Ihkl(i)
(80)

where Ihkl(i) is the i-th measurement of the reflection with indices hkl and áIhklñ is the mean value of the N equivalent reflections. In general, Rmerge values less than 5% indicate excellent data quality, values between 5 and 10% indicate average data quality, values in the 10-15% range are acceptable but may indicate problems with crystals or experiment, while values greater than 15% indicate poor quality data that may not be useful for crystal structure analysis [Ealick, 1995]. The value of using as much data as possible has been demonstrated by various people. As mentioned before, Cowtan (1996) has shown how missing reflections affect the reproduction of an image from diffraction data. The completeness of low resolution data is important in the placement of missing parts of the structure, while refinement may benefit from the inclusion of high resolution data even if the merging R is up to 40% [Dodson, Kleywegt and Wilson, 1996]. Accurate bond and angle parameters for X-ray protein structure refinement have been extracted from the Cambridge Structural Database and compiled by Engh and Huber (1991), providing a reliable measure for ideal values for bond lengths and angles. Ever increasing computing power allows for more advanced software, both for refinement calculations and graphics facilities. The development of program packages like REFMAC and Quanta's X-AUTOFIT, with extensive maximum likelihood calculations and model building applications respectively, has only been possible because of this.

2.7.1  R-factors

A few specific validation techniques have already been described. The conventional R-factor is the most widely accepted indicator of the general quality of a crystal structure. It is not a good independent validator, since it can be manipulated by excluding data or adding parameters inappropriately. A better indication of the fit between observed and calculated structure factors is the 'free R-factor' [Brünger, 1992], which aids bias-free refinement by indicating overfitting. If the free R-factor does not decrease with a decreasing conventional R-factor in a particular stage of refinement, the new parameters introduced in that stage do not give a significantly improved model. Commonly between 5 and 10 % of diffraction data is excluded from refinement, with a set of between 500 and 1000 reflections considered to produce reliable statistics. Non-crystallographic symmetry will affect the discrepancy between conventional and free R-factors because of phase relationships between reflections in the working and free sets. The expected value of the free R-factor may be estimated as [Cruickshank, 1996]:

EFRF = R Ö
Nobs / ( Nobs-Npar )
(81)

where Nobs is the number of observations, Npar is the number of parameters and R is the conventional R-factor.

R-factors as described above are global indicators, which can not point out local errors. Therefore the real space R-factor has been devised [Jones et al., 1991], which can be plotted as a function of the residues along the polypeptide chain:

Rreal space =
å
|robs - rcalc|

å
|robs + rcalc|
(82)

robs is taken from a real electron density map calculated from the model on a certain grid G. rcalc is taken from an 'artificial' electron density map calculated on the same grid G from a set of idealised Gaussian distribution atoms at all atomic sites in the model, all with average temperature factors. The maps are scaled together with one overall scale factor. A mask is determined for the desired subset of atoms (i.e. a residue, or a side chain) on the same grid G, after which the values for robs and rcalc for this subset may be read from the two map grids, and the R-factor calculated.

2.7.2  Estimates of coordinate precision

  Luzzati plot and Cruickshank's sd

Luzzati (1952) developed the idea of plots of R' = [(á||F|-|F+DF||ñ)/(á|F|ñ)] (for centrosymmetric structures) and R" = [(á||F|-|F+DF||ñ)/(á|F|ñ)] (for non-centrosymmetric structures) vs. sinq, where DF is the error in structure factor F due to coordinate errors D[r]j. He argued that the crystallographic R-factor is dependent on resolution (and even on the way the structure factors are calculated), while curves of R' and R" are independent of those factors. They do, however, depend on the assumption that coordinate errors D[r]j are the sole cause of the difference between Fobs and Fcalc. R' and R" are functions of Dj = ácos(2p[hD[r]j ñ (see equation 73) only. If D[r]j = 0 then Dj = 1 and R'=R"=0, i.e. a completely perfect structure, which would also have the crystallographic R=0. For a completely wrong structure, where the distribution of D[r]j is constant, Dj tends to 0 and R" (for acentric structures like all natural proteins) approaches 2-Ö2. The theoretical curves for R' and R" may indicate an upper limit to the value for D[r]j.

At the Validation Workshop [Dodson, Kleywegt and Wilson, 1996] Cruickshank showed that the Luzzati (1952) plot per se is not an appropriate indication of coordinate error. It is a means for estimating how far the refinement still has to go to reach R = 0, assuming that |Fobs| has no errors, Fcalc is perfect apart from coordinate errors and the Gaussian probability distribution for the coordinate errors is the same for all atoms. The atoms are not required to be identical, and the errors are not required to be small. However, the assumption of perfection apart from coordinate errors is not applicable. The Luzzati plot may only indicate an upper limit for the coordinate error at the end of refinement. Cruickshank has therefore devised a formula for estimating the expected positional error, caused by diffraction error only, for an atom x with average temperature factor [Cruickshank, 1996]:

sd(x) = 0.7 [N/p]1/2C-1/3dminR
(83)

where R is the conventional R-factor, N is the number of non-hydrogen atoms in the asymmetric unit, p is the difference between the number of observations and the number of parameters, C is the fractional completeness of the data, and dmin is the resolution. The smaller dmin and R, the better the precision of the structure. There is a definite direct link between coordinate precision and temperature factor [Dodson, Kleywegt and Wilson, 1996]. With ever higher resolution data it is already becoming possible to use least squares matrix inversion at the end of refinement to obtain estimated standard deviations for the parameters describing the well ordered parts of the cell. This will allow testing of Cruickshank's sd.

  Luzzati, Read and REFMAC: sA

Another estimate of coordinate error is sA, which has been discussed in section 2.6. Luzzati showed that

Dj
=
á cos(2p ®
h
 
·D ®
r
 

j 
)ñ
=
exp[-p3(á| D ®
r
 

j
|ñ)2(sinq/l)2]
(84)

where á ñ denotes average value. Thus, for sA = Dj(åP/åN)1/2 as discussed before,

lnsA = 1/2 ln(åP/åN) - p3(á|D[r]j|ñ)2([(sinq)/( l)])2. If sA can be estimated, a plot of lnsA vs. (sinq/l)2 will yield |D[r]j|2 from the slope and (åP/åN) from the intercept. However, like for the Luzzati plot, the sA theory assumes no errors except coordinate errors, so the sA-plot needs to be interpreted with care. Brünger has shown that the cross-validated, i.e. free R-factor based, Luzzati and sA plots correlate better with the actual coordinate errors than those based on conventional methods [Brünger, 1996]. The sA indicator is used in the calculation of several parameters in REFMAC and thus used explicitly in the improvement of refinement. Maximum likelihood itself is a validation technique in that it uses proper weighting of prior and experimental information by separating errors due to a poor model from experimental errors.

2.7.3  Stereochemistry and residue environment checks

  Ramachandran plot

After experiment duplication, the best validation is by criteria that can not and/or have not been used during the course of refinement. One of the validation tools that is very hard to use during automatic refinement, is the Ramachandran plot [Ramachandran et al., 1963]. It is therefore a good check of the structure, and residues in odd positions in the plot need further investigation. This does mean that most experimenters now use the Ramachandran plot in the model building stages of structure refinement, so that it is not a completely independent check of the stereochemistry of the final structure. In X-AUTOFIT [Oldfield, 1996] it is now possible to apply weak restraints so that residues will assume the Ramachandran conformation of either a helix, a b-strand, or the nearest allowed region. The default, however, is not to apply any restraints automatically.

  SQUID

SQUID displays and analyses molecular coordinates from crystallography, NMR and molecular dynamics [Oldfield, 1992]. The structure analysis consists of various atomic checks (e.g. distance, chirality, neighbours, non-bond clashes etc.), analysis of local structure by packing angles, Ramachandran plot and other properties, an overall check of weight, volume, secondary structure, hydrogen bonds and temperature factors, and the possibility of comparison of all those properties with those in other structures. The program outputs detailed warnings about errors and anomalies in the structure under investigation. It provides better independent checks of the structure through root mean square deviations of c and w angles, which are not easily tampered with by the experimenter. The mean and spread of the distributions of these angles have been determined from a set of reliable structures.

  PROCHECK

PROCHECK is a program to check the stereochemical quality of protein structures [Laskowski et al., 1993]. It outputs various plots and a comprehensive residue-by-residue listing, providing an assessment of the overall quality of the structure as compared with well refined structures of the same resolution and highlighting regions which may need further investigation. It may therefore be used to follow the process of refinement and assess the final structure, as well as in the assessment of existing structures to be used for modeling or molecular replacement study purposes. Multiple X-ray crystal structures as well as ensembles derived by NMR spectroscopy can be checked simultaneously.

  WHAT-IF and other geometry checking programs

WHAT-IF is a computer program originally written to aid macromolecular modeling and drug design [Vriend, 1990]. It provides for displaying, manipulating and analysing small molecules, proteins, nucleic acids, and their interactions, and incorporates access to protein structure databases for reference. WHAT-IF's protein validation portion, called WHAT-CHECK, checks for clashes between symmetry-related molecules, and includes the determination of a 'quality factor' assessing the distribution of different atom types in the environment around side chain fragments. The expected distributions are compiled from a data set of high resolutions structures [Vriend and Sander, 1993].

Several smaller program packages have been specifically written for protein structure validation. The program PAP [Callahan et al., 1990] provides several plots of main chain geometry, and information on temperature factors. GEOM [Cohen, 1993] provides an analysis of main chain distances, angles and torsion angles, and of side chain torsion angles. ERRAT [Colovos and Yeates, 1993] analyses the relative frequencies of non-bonded interactions between any C, N and O atoms. Because this results in the consideration of only six different pairs, the statistics are better but the information is less specific than in the case of analysing all different atoms specifically, as in e.g. WHAT-IF. The program SurVol [Wodak et al., 1995] calculates for each atom in the structure the atomic volume and compares it with a pre-computed average for each atom type. A scoring method was devised in which a high score indicates uncertainty in the structure.

  Folding profile methods

Determining the potential fold of a protein sequence is most often approached by searching databases for sequences of known structure that are similar to the sequence under study. This only works if the sequence identity is reasonably high. The advances of the 'tertiary template method' [Ponder and Richards, 1987] were limited because of the rigidity of the methods employed. Eisenberg and co-workers have devised a method which allows more flexibility, both of the backbone of the protein and of the spacing between particular residues [Bowie et al., 1991]. Three features determine a residue's environment:

  1. the total area of the side chain buried by other protein atoms;
  2. the fraction of the side chain area covered by polar atoms or water;
  3. the local secondary structure, i.e. a-helix, b-sheet and 'other'.

Each residue is then assigned one of 18 environment classes, converting the three-dimensional structure to a one-dimensional string.

It is well established that each of the 20 natural amino acid types has a clear preference for a certain environment. On the basis of well-refined three-dimensional structures a score is assigned to the occurrence of every amino acid in each of the 18 environment classes. These scores can then be used to find the best alignment of amino acid sequences to the one-dimensional string of the protein under study, producing a compatibility score of the sequence with its three-dimensional structure. Local scores may be calculated as an average over 21 residues with the residue-to-be-scored in the centre of the 21 residue window.

Instead of stereochemical parameters, Sippl (1993) uses knowledge of the forces which stabilise proteins in solution in the analysis of the energy distribution of a protein profile. These energy forces are obtained from well-refined structures in the form of potentials as a function of the spatial separation of two atoms. Ca-Ca pairs and Cb-Cb pairs seem to work equally well, indicating that the method can be applied if only a Ca-trace is available. The atom pair interaction energy is a function of the sequence, and the structure with the correct conformation will have a lower energy than any alternative conformation. Since no stereochemical parameters are used in the method, the occurrence of high energies in a particular structure is not a consequence of violations of basic steric requirements.

2.7.4  Brookhaven Protein Data Bank

Over the last twenty years, the Brookhaven Protein Data Bank [Bernstein et al., 1977] has grown to be a collection of protein and nucleic acid structural data from various sources. Most scientific journals require deposition of atomic coordinates (and often also structure factor data) at the Data Bank upon publication of articles about the structure of proteins, nucleic acids and complexes. Although coordinate checking software is available at the Data Bank and rigorous checks are performed, it is the ultimate responsibility of the experimenters to provide a reliable set of data pertaining to the structure described in an article. This data may be used for molecular replacement studies, molecular modelling studies etc., and it should be clear from the combination of the article and the Data Bank information, what the limitations are of the information provided.


File translated from TEX by TTH, version 1.54.
and edited manually afterwards.

Back to the Table of Contents