BUCCANEER (CCP4: Supported Program)


buccaneer - Statistical protein chain tracing


cbuccaneer -mtzin-ref filename -pdbin-ref filename -mtzin-wrk filename -pdbin-wrk filename -seqin-wrk filename -pdbout-wrk filename -colin-ref-fo colpath -colin-ref-hl colpath -colin-wrk-fo colpath -colin-wrk-hl colpath -resolution resolution -find -grow -join -link -sequence -correct -filter -prune -build -cycles number of cycles -fragments number of fragments -fragments-per-100-residues number of fragments -ramachandran-filter type -main-chain-likelihood-radius radius/A -side-chain-likelihood-radius radius/A -sequence-reliability reliability -new-residue-name type -new-residue-type type -known-structure known-structure-spec -verbose verbosity -stdin
[Keyworded input]


'buccaneer' performs statistical chain tracing by identifying connected alpha-carbon positions using a likelihood-based density target.

The target distributions are generated by a simulation calculation using a known 'reference' structure for which calculated phases are available. The success of the method is dependent on the features of the reference structure matching those of the unsolved, 'work' structure. For almost all cases, a single reference structure can be used, with modifications automatically applied to the reference structure to match its features to the work structure.

Buccaneer does not do refinement or rebuilding. Until these are implemented it will not be directly comparable to existing automated model building packages. However, it is quite quick, and can work at reasonably low resolutions given good phases.


A set of reference structures is provided with the program. The structure 1TQW is good for typical protein problems at resolutions up to 1.25A, although in practice including data much beyond 2.0A makes little difference. For exotic cases you might want to provide your own reference structures.

The calculation involves 6 main stages:

Finding C-alphas
Candidate C-alpha positions are located by searching the electron density. This stage can be disabled by providing an input PDB for the work structure.
Growing fragments
The candidate C-alphas or input chains are grown by adding residues at either end, according to the density. This stage can be disabled by the -no-grow keyword.
Joining Fragments
Overlapping fragments are joined to make longer chains. If this leads to a junction in a chain, the contested residue is removed. This stage can be disabled by the -no-join keyword.
Pruning Fragments
Clashing fragments are examined and the one with the worse density is removed. This stage can be disabled by the -no-prune keyword.
Assigning Sequence
Likelihood comparison between the density of each residue in the work structure and the residues of the reference structure allows sequence to be assigned to longer fragments. An optional correction step allows one-residue shifts to be fixed automatically.
Rebuilding
Side chain atoms and carbonyl oxygens are rebuilt.
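
The stage switches mentioned above can be collected programmatically. The following is a minimal sketch, not part of the program: the helper name disable_stage_args is hypothetical, and only the three keywords named in the text (-no-grow, -no-join, -no-prune) are included. Whether other stages accept a -no- prefix is not confirmed here.

```python
from typing import Iterable, List

# Only the disabling keywords that the stage descriptions name explicitly.
DOCUMENTED_DISABLE_KEYWORDS = {
    "grow": "-no-grow",
    "join": "-no-join",
    "prune": "-no-prune",
}


def disable_stage_args(stages: Iterable[str]) -> List[str]:
    """Return the command-line keywords that switch off the given stages.

    Raises ValueError for any stage whose disabling keyword is not
    documented in the stage list (hypothetical helper, for illustration).
    """
    args: List[str] = []
    for stage in stages:
        if stage not in DOCUMENTED_DISABLE_KEYWORDS:
            raise ValueError("no documented disabling keyword for stage: %r" % stage)
        args.append(DOCUMENTED_DISABLE_KEYWORDS[stage])
    return args
```

For example, disable_stage_args(["join", "prune"]) gives the two keywords needed to skip joining and pruning. The finding stage is not covered because the text disables it by supplying an input PDB for the work structure rather than by a keyword.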


-pdbin-ref filename
Input PDB file containing the final model for the reference structure.
-mtzin-ref filename
Input 'reference' MTZ file. This contains the data for a known, reference structure. The required columns are F, sigF, and a set of Hendrickson-Lattman (HL) coefficients describing the calculated phases from the final model. Suitable reference structures can be constructed from the PDB using the 'Make Pirate reference' task.
-mtzin-wrk filename
Input 'work' MTZ file. This contains the data for the unknown, work structure. The required columns are F, sigF, and a set of HL coefficients from a phasing program (experimental or molecular replacement).
-pdbin-wrk filename
[Optional] Input PDB file containing an initial model.
-pdbout-wrk filename
Output PDB file. This will contain the new chain trace.
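
As an illustration of how the file keywords from the synopsis fit together, the sketch below assembles an argument list in Python. The function name buccaneer_file_args and all file names are hypothetical. The keywords themselves are taken from the synopsis. Nothing is executed here; the list could be passed to a process launcher of your choice.

```python
from typing import List, Optional


def buccaneer_file_args(mtzin_ref: str,
                        pdbin_ref: str,
                        mtzin_wrk: str,
                        pdbout_wrk: str,
                        pdbin_wrk: Optional[str] = None,
                        seqin_wrk: Optional[str] = None) -> List[str]:
    """Build the file-related part of a cbuccaneer command line.

    Hypothetical helper: it only arranges the keywords listed in the
    synopsis and does not check that the files exist.
    """
    args = ["cbuccaneer",
            "-mtzin-ref", mtzin_ref,
            "-pdbin-ref", pdbin_ref,
            "-mtzin-wrk", mtzin_wrk]
    if pdbin_wrk is not None:
        # Optional initial model for the work structure.
        args += ["-pdbin-wrk", pdbin_wrk]
    if seqin_wrk is not None:
        # Optional sequence file for the work structure.
        args += ["-seqin-wrk", seqin_wrk]
    args += ["-pdbout-wrk", pdbout_wrk]
    return args


# Example with made-up file names.
example_args = buccaneer_file_args("reference.mtz", "reference.pdb",
                                   "work.mtz", "built.pdb",
                                   seqin_wrk="work.seq")
```

Column keywords such as -colin-wrk-fo would be appended to the same list in the same way.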


See Note on keyword input.

-colin-ref-fo colpath

Observed F and sigma for reference structure. See Note on column paths.

-colin-ref-hl colpath

Hendrickson-Lattman coefficients for reference structure. If you do not have these, they can be generated using the accompanying chltofom program. See Note on column paths.

-colin-wrk-fo colpath

Observed F and sigma for work structure. See Note on column paths.

-colin-wrk-hl colpath

Hendrickson-Lattman coefficients for work structure. See Note on column paths.

-resolution resolution/A

[Optional] Resolution limit for the calculation. All data are truncated to this limit.

-find

[Optional] Enable finding of candidate C-alpha positions.

-grow

[Optional] Enable growing of fragments.

-join

[Optional] Enable joining of fragments.

-link

[Optional] Enable linking of nearby fragments.

-sequence

[Optional] Enable sequencing of fragments.

-correct

[Optional] Enable correction of any missing or extra residues uncovered during the sequencing process.

-filter

[Optional] Enable removal of residues that lie in low density or that link disjoint sequence segments.

-prune

[Optional] Enable pruning of fragments.

-build

[Optional] Enable rebuilding of side chains and carbonyl oxygens.

-cycles number of cycles

[Optional] Number of cycles of building to run. Running multiple cycles leads to a more complete model, although it is not as effective as recycling with refmac.

-fragments number of fragments

[Optional] Maximum number of fragments to build.

-fragments-per-100-residues number of fragments

[Optional] Approximate number of fragments to build per 100 residues (assuming average solvent).

-ramachandran-filter type

[Optional] Only use particular types of residues when preparing the main chain likelihood search function. By selecting particular secondary structure types, it is possible to preferentially find different types of sequence. type may be one of all, helix, strand, nonhelix.

-main-chain-likelihood-radius radius/A

[Optional] Default 4.0A. For very low resolution maps it may be worth increasing this.

-side-chain-likelihood-radius radius/A

[Optional] Default 5.5A.

-sequence-reliability reliability

[Optional] Values between 0.5 and 1.0 vary the reliability cutoff for docking a sequence. The value is the probability at which the sequence will be accepted. 0.5 means every sequence will be docked, 1.0 means that no sequences are docked. Default = 0.95.
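
The value constraints described for -ramachandran-filter and -sequence-reliability can be checked before the program is launched. This is a minimal sketch under the assumption that you are assembling the command line in Python. The helper name tuning_args is hypothetical, and the allowed values are copied from the keyword descriptions above.

```python
from typing import List, Optional

# Filter types listed under -ramachandran-filter.
RAMACHANDRAN_FILTER_TYPES = ("all", "helix", "strand", "nonhelix")


def tuning_args(ramachandran_filter: Optional[str] = None,
                sequence_reliability: Optional[float] = None) -> List[str]:
    """Validate two optional settings and return them as keywords.

    Hypothetical helper: settings left as None are omitted, so the
    program's own defaults apply.
    """
    args: List[str] = []
    if ramachandran_filter is not None:
        if ramachandran_filter not in RAMACHANDRAN_FILTER_TYPES:
            raise ValueError("unknown filter type: %r" % ramachandran_filter)
        args += ["-ramachandran-filter", ramachandran_filter]
    if sequence_reliability is not None:
        # 0.5 docks every sequence, 1.0 docks none.
        if not 0.5 <= sequence_reliability <= 1.0:
            raise ValueError("reliability must lie between 0.5 and 1.0")
        args += ["-sequence-reliability", str(sequence_reliability)]
    return args
```

Rejecting out-of-range values early is a design choice of this sketch; how the program itself reacts to such values is not stated in this document.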

-new-residue-name type

[Optional] Set the name which will be given to newly built residues.

-new-residue-type type

[Optional] Set the type of residue to be used when building new residues.

-known-structure known-structure-spec

A single known-structure group can be specified in the general parameters (above); for more complex cases, multiple groups can be defined using keyword input. The known-structure keyword allows atoms or chains from the input model (given using the 'Specify input model to be extended' button at the top of the window) to be preserved. This can be useful when heavy atoms or nucleotide chains make up a significant portion of the scattering.
Syntax: known-structure coordinateID:radius
Atoms specified by the coordinateID will be retained in the output structure. If a radius is specified, then no main chain atoms will be built within the given radius of the specified atoms. Multiple known-structure keywords may be given with different radii. Examples:
  • known-structure /A/*/*/:2.0
Keep all atoms in the A chain and don't build within 2A.
  • known-structure /*/*/ZN  /:3.0
Keep all zinc atoms and don't build within 3A.
  • known-structure //*/*/
Keep all atoms in the unlabelled chain.
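
The coordinateID:radius syntax can be generated with a small helper, which is handy when several groups are needed. This is a sketch only: the function name known_structure_spec is hypothetical, and formatting the radius with one decimal place simply follows the examples above.

```python
from typing import Optional


def known_structure_spec(coordinate_id: str,
                         radius: Optional[float] = None) -> str:
    """Format a known-structure argument as coordinateID[:radius].

    Hypothetical helper. The coordinate ID is passed through unchanged,
    including any padding spaces inside an atom or residue name.
    """
    if radius is None:
        # No exclusion radius: the atoms are kept, nothing else changes.
        return coordinate_id
    if radius <= 0:
        raise ValueError("radius must be positive")
    return "%s:%.1f" % (coordinate_id, radius)


# The three examples from the text, rebuilt with the helper.
specs = [
    known_structure_spec("/A/*/*/", 2.0),
    known_structure_spec("/*/*/ZN  /", 3.0),
    known_structure_spec("//*/*/"),
]
```

Each resulting string would follow its own -known-structure keyword, since multiple keywords may be given with different radii.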

-verbose verbosity

Note on column paths:

When using the command line, MTZ columns are described as groups using a slash-separated format that includes the crystal and dataset name. If your data were generated by another program that uses column groups, you can just specify the name of the group, for example '/native/peak/Fobs'. You can wildcard the crystal and dataset if the file does not contain any duplicate labels, e.g. '/*/*/Fobs'. You can also access individual non-grouped columns from existing files by giving a comma-separated list of names in square brackets, e.g. '/*/*/[FP,SIGFP]'.
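
The column path forms described above can be produced by a short formatting function. The sketch below is illustrative only: the name column_path is hypothetical, and it performs no check against the contents of any MTZ file.

```python
from typing import Sequence, Union


def column_path(columns: Union[str, Sequence[str]],
                crystal: str = "*",
                dataset: str = "*") -> str:
    """Format an MTZ column path as /crystal/dataset/columns.

    A single string is treated as a column-group name. A list or tuple
    is treated as individual columns and written in square brackets.
    Crystal and dataset default to the '*' wildcard.
    """
    if isinstance(columns, str):
        selection = columns
    else:
        selection = "[" + ",".join(columns) + "]"
    return "/%s/%s/%s" % (crystal, dataset, selection)


group_path = column_path("Fobs", crystal="native", dataset="peak")
wild_path = column_path("Fobs")
list_path = column_path(["FP", "SIGFP"])
```

Here group_path, wild_path and list_path reproduce the three example paths given in the note. Remember that the wildcard form is only safe when the file contains no duplicate labels.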

Note on keyword input:

Keywords may appear on the command line or, if the '-stdin' flag is given, on standard input. In the latter case, one keyword is given per line, the leading '-' is optional, and the rest of the line is the argument of that keyword (if one is required), so no quoting is needed.
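
To make the two input styles concrete, the sketch below converts a command-line style argument list into the one-keyword-per-line form read with '-stdin'. The function names are hypothetical, and the conversion assumes that every token beginning with '-' is a keyword, so it would mishandle an argument value that itself starts with '-'.

```python
import subprocess
from typing import List


def to_stdin_keywords(args: List[str]) -> str:
    """Turn ['-cycles', '3', '-no-grow'] into 'cycles 3' and 'no-grow' lines.

    The leading '-' is dropped, which the note above allows, and any
    following tokens become the rest of that keyword's line.
    """
    lines: List[str] = []
    for token in args:
        if token.startswith("-"):
            lines.append(token[1:])
        elif lines:
            lines[-1] += " " + token
        else:
            raise ValueError("argument given before any keyword: %r" % token)
    return "".join(line + "\n" for line in lines)


def run_with_stdin(args: List[str], executable: str = "cbuccaneer"):
    """Run the program with keywords on standard input (not called here)."""
    return subprocess.run([executable, "-stdin"],
                          input=to_stdin_keywords(args),
                          text=True, check=True)
```

Because the rest of each line is taken as the argument, values containing spaces need no quoting in this form, which is the main practical difference from the command line.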

Reading the Output:

The program outputs a short list of statistics each cycle. The Free-E correlation is probably the most useful (larger is better). After the first cycle these may be biased in various ways. They are fairly useful for selecting a reference structure from a list of candidates or for selecting a radius. They can be used to control the likelihood weighting, but see the notes under the keyword for the appropriate protocol.



Kevin Cowtan, York.