High-resolution protein-ligand complexes that can be assessed by reconstructing the electron density for the ligand using the deposited structure factors were identified. The complexes have been clustered according to the protein sequences, and clusters have been discarded if they do not represent proteins thought to be of direct interest to the pharmaceutical or agrochemical industry. Rules have been used to exclude complexes containing non-drug-like ligands. One complex from each cluster has been selected where a structure of sufficient quality was available. The final Astex diverse set contains 85 diverse, relevant protein-ligand complexes, which have been prepared in a format suitable for docking and are freely available to the entire research community (http://www.ccdc.cam.ac.uk)1.
The Astex Diverse Test Set is also available on the website as a compressed tar.gz archive. The performance of the docking program rDock against this set can be assessed using the standard "dock" docking protocol and 10 docking runs. To uncompress the archive, type:
$ tar xvfz astex_test_set.tar.gz
A directory named "astex_test_set" will be created. It contains a subdirectory called astex_diverse_set and a bash shell script that can be used for validating rDock (you have to edit it first). The astex_diverse_set directory includes 85 folders, each of which carries the information for one protein-ligand complex with a solved crystal structure. These folders also include the output files generated by running rDock with the standard "dock" protocol and 10 docking runs. More specifically, each folder contains the following files:
The crystallographic coordinates, used to establish the docking site and also for the RMSD comparison
The ligand to be docked
The input protein in Tripos mol2 format with appropriate atom typing and charges applied
The output file generated by rDock using the dock protocol and 10 runs
The parameter file, which refers to the protein file (.mol2) and a reference ligand (_c.sd)
The receptor cavity file
The output file generated by the rbrms command. It lists the RMSD between each conformation predicted by rbdock and the crystallographic reference conformation contained in the _c.sd file, together with the scores assigned to that conformation.
The same as above, but with the conformations sorted by overall score (the lowest score is the best). The overall score is taken as the measure of how good a prediction is.
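As a quick sanity check after unpacking, the number of complex folders can be counted and compared with the 85 complexes described above. The following is only a minimal sketch: the function name count_complexes is hypothetical, and the path in the usage comment assumes the archive was unpacked in the current directory.

```shell
#!/usr/bin/env bash
# Count the immediate subdirectories of a directory (one folder per complex).
# Plain files sitting next to the folders are ignored.
count_complexes() {
    local set_dir=$1
    find "$set_dir" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Usage (assumed path; adjust to where the archive was unpacked):
#   count_complexes astex_test_set/astex_diverse_set
```

If the extraction completed, the count should match the 85 folders mentioned above.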
The bash shell script included in the tar.gz file is called "validation". When executed, it counts the complexes that give an RMSD < 2 Å; for each complex, only the RMSD of the conformation with the best (lowest) score is considered. First open "validation" and replace [directory] with the full path to the astex_diverse_set directory, without the trailing "/". Then run the script and you will get output like this (only part of it is shown):
Now I am testing the ligand-protein complex No. 31, out of 85 :
Now I am testing the ligand-protein complex No. 32, out of 85 :
Now I am testing the ligand-protein complex No. 33, out of 85 :
Now I am testing the ligand-protein complex No. 34, out of 85 :
Bad Prediction: /home/thomas/Documents/Group_Project/astex_diverse_set/1sg0
Now I am testing the ligand-protein complex No. 35, out of 85 :
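The counting rule described above (take the best-scoring conformation of each complex and check whether its RMSD is below 2 Å) can be sketched as follows. This is not the "validation" script itself, only an illustration under stated assumptions: the per-complex file name rmsd_scores.txt and its two-column "RMSD score" layout are hypothetical, as are the function names, and the real rbrms output may be laid out differently.

```shell
#!/usr/bin/env bash
# Print the RMSD of the best-scoring (lowest-score) conformation in a file.
# ASSUMPTION: each line holds "<rmsd> <score>"; the real layout may differ.
best_pose_rmsd() {
    awk 'NF >= 2 && (!seen || $2 + 0 < best) { best = $2 + 0; rmsd = $1; seen = 1 }
         END { if (seen) print rmsd }' "$1"
}

# Count the complexes whose best-scoring conformation is below the RMSD cutoff.
# ASSUMPTION: every complex folder holds a file with the hypothetical name below.
count_good_predictions() {
    local set_dir=$1 cutoff=${2:-2.0} good=0 total=0 dir rmsd
    for dir in "$set_dir"/*/; do
        [ -f "${dir}rmsd_scores.txt" ] || continue
        rmsd=$(best_pose_rmsd "${dir}rmsd_scores.txt")
        [ -n "$rmsd" ] || continue          # skip files without usable lines
        total=$((total + 1))
        # awk performs the floating-point comparison that the shell lacks
        if awk -v r="$rmsd" -v c="$cutoff" 'BEGIN { exit !(r + 0 < c + 0) }'; then
            good=$((good + 1))
        else
            echo "Bad Prediction: ${dir%/}"
        fi
    done
    echo "$good of $total complexes below $cutoff A"
}

# Usage (assumed path):
#   count_good_predictions /full/path/to/astex_diverse_set
```

Selecting the row by lowest score, rather than by lowest RMSD, matters here: a run may contain a near-native conformation that is not ranked first, and under the rule above such a complex still counts as a bad prediction.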