AI Docking: DiffDock – CDD Support

To enable AI in your vault, please contact support@collaborativedrug.com. Once enabled, please note that the number of allowed Folding and Docking jobs is set on a per-account basis. Please contact support@collaborativedrug.com if you have questions regarding the number of allowed jobs for your account.

Implementing a docking protocol in CDD Vault:

Creating a docking protocol in CDD Vault is as simple as creating a traditional protocol. For a refresher on creating a protocol, please refer to this article before proceeding.

In the following example, we will work through creating a docking protocol against the serotonin receptor 5HT2B. This type of protocol requires two readout definitions: a docking readout definition and a docking trigger readout definition. The docking readout definition contains the computational model, whereas the docking trigger is a safeguard against accidentally launching docking predictions for a large set of ligands, which would require significant computational power.

Users will first create the docking trigger readout definition. In this example, we have chosen a “pick list” data type with an allowed value of “Yes”; however, this could be a text or number field as well. Please note that whenever the value of this field changes, the docking protocol will trigger. For example, if a docking trigger is a pick list with values of “Yes” or “No”, the docking protocol would execute regardless of which value is picked, including when it is changed from “Yes” to “No”. If users intend to retrigger this protocol after the initial run, it is recommended to use a pick list with two values or a free text/numeric field.
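The trigger rule described above can be illustrated with a minimal Python sketch. This is not CDD Vault code: `should_trigger` is a hypothetical helper that only models the stated behaviour (a run starts whenever the trigger value changes), and how an emptied field is treated is an assumption, since the article does not say.

```python
from typing import Optional


def should_trigger(previous: Optional[str], current: Optional[str]) -> bool:
    """Return True when the trigger field's value has changed.

    Hypothetical model of the rule in the text: any change of value
    starts a docking run, whichever value is chosen.
    """
    return previous != current


# A pick list with a single allowed value ("Yes"):
print(should_trigger(None, "Yes"))   # first entry is a change
print(should_trigger("Yes", "Yes"))  # same value again is not a change

# A pick list with two allowed values can be switched to retrigger:
print(should_trigger("Yes", "No"))
```

Under this model a single-value pick list cannot change again after the first run, which is consistent with the recommendation to use two values or a free text/numeric field when retriggering is expected.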

Next, we will create the docking readout definition. Users will choose “Docking” for the data type, define the name of the readout definition, select the specific model (DiffDock in this example), and specify the previously created docking trigger. Next, a PDB file will be uploaded into the “Upload PDB” section. Users may upload an experimentally resolved crystal structure or use one of the predicted protein structures from their folding protocols. It is best practice to remove other ligands, cofactors, and water molecules from the input PDB file unless they are directly involved in forming the binding pocket or are essential for the biological activity of the protein. Consider using a tool such as PyMOL to clean and prepare PDB files automatically, or use a text editor to manually remove the line types listed below (keep only ATOM lines if you want a purely protein structure):

HETATM: Usually for non-standard residues (ligands, metal ions, water).
CONECT: Connectivity info, often used for ligands or non-standard residues.
HOH or WAT: Lines containing water molecules.
Ligands/cofactors: Lines with non-standard residue names like ATP, HEM, NAD, ZN, etc.
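As an alternative to editing by hand, the manual clean-up above could be scripted. The following is a minimal Python sketch rather than a CDD Vault feature: the function names are hypothetical, it relies on the fixed-column PDB layout (record name in columns 1-6, residue name in columns 18-20), and keeping TER and END records alongside ATOM is an assumption made here so that chain breaks and the file terminator survive.

```python
from pathlib import Path

# Record types kept for a protein-only structure (TER/END are an assumption).
KEEP_RECORDS = ("ATOM", "TER", "END")
# Water residue names that may also appear on ATOM lines.
WATER_NAMES = ("HOH", "WAT")


def keep_protein_lines(pdb_text: str) -> str:
    """Return PDB text without HETATM, CONECT, water, or header lines."""
    kept = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()      # columns 1-6: record name
        residue = line[17:20].strip()  # columns 18-20: residue name
        if record not in KEEP_RECORDS:
            continue                   # drops HETATM, CONECT, HEADER, ...
        if record == "ATOM" and residue in WATER_NAMES:
            continue                   # drops waters written as ATOM records
        kept.append(line)
    return "\n".join(kept) + "\n"


def clean_pdb_file(src: str, dst: str) -> None:
    """Read the PDB file at src and write the protein-only version to dst."""
    Path(dst).write_text(keep_protein_lines(Path(src).read_text()))
```

Because this removes every HETATM record, it also removes cofactors or metal ions that form part of the binding pocket. If those are needed, as described above, they would have to be kept deliberately, for example by checking the residue name before dropping the line.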

We have now created the minimum requirements to run docking modelling within the safety and convenience of CDD Vault.

In order to run the docking model, users may choose to run a single compound or specify a set of compounds to run in bulk. To run one docking simulation for one compound, create a run within the docking protocol and then select “add a readout” from the All Data tab of that run.

Next, fill out the relevant information including the molecule name, batch number, and the docking trigger prior to clicking “add this readout”.

The job has now been submitted. The job status will update periodically to show users which stage the job has reached, including when it started, its current status, and when it completed. Depending on the size of the protein structure and the ligand, docking simulations typically take about 10 minutes per molecule to complete.

Users may also start docking jobs in bulk through the importer found in the “Import Data” tab of CDD Vault, keeping in mind the compute resources needed to run large data sets through docking models. The typical data format to trigger and run docking simulations is quite simple and is exemplified below:
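As a hedged illustration only, an import file could be generated with a short Python script, assuming it carries the same three fields entered for a single readout (molecule name, batch, and docking trigger). The column headers, compound identifiers, and function name below are hypothetical placeholders, not a documented CDD Vault format; the headers would be mapped to the matching fields during import.

```python
import csv
import io

# Hypothetical column headers; map them to the matching fields during import.
COLUMNS = ["Molecule Name", "Batch", "Docking Trigger"]


def build_import_csv(rows):
    """Return CSV text with one line per (molecule, batch, trigger) tuple."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder compounds; each row would start one docking job.
csv_text = build_import_csv([
    ("CMPD-0001", "1", "Yes"),
    ("CMPD-0002", "1", "Yes"),
])
print(csv_text)
```

Since every row starts a job, and each job typically takes about 10 minutes, it is worth checking the row count against the number of jobs allowed for the account before importing.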

Viewing and interpreting docking data in CDD Vault:

Docking results may be viewed and searched on the Explore Data tab just like any other data type stored in Vault.

When a docking protocol has been executed using DiffDock, the following parameters will be output, most notably the Docking PDB File and the DiffDock Confidence Score:

Docking PDB File (protein plus ligand, showing the predicted binding site and pose)
Docking Score (the confidence score; please see below for further details)
Docking Trigger
Docking Job ID
Docking Job Status
Docking Job Errors
Docking Started At
Docking Finished At
Docking Updated At

How to Interpret the DiffDock Confidence Score:

It can be hard to interpret and compare the confidence scores of different complexes or different protein conformations; however, below is a rough guideline that is typically used (c is the confidence score of the top pose):

c > 0 high confidence
-1.5 < c < 0 moderate confidence
c < -1.5 low confidence

This assumes the complex is similar to what DiffDock saw in its training set, i.e. a drug-like molecule that is not too large, bound to a medium-sized protein (1 or 2 chains) in a conformation similar to the bound one (e.g. if it comes from a homologous crystal structure). If you are dealing with a large ligand, a large protein complex, and/or an apo/unbound protein conformation, you should shift these intervals down.
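The guideline above can be written as a small Python helper for labelling exported scores. This is a sketch under stated assumptions: `confidence_band` and its `shift` parameter are hypothetical, the guideline does not say how to treat scores exactly equal to 0 or -1.5 (here they fall into the lower band), and no specific shift size is given in this article, so the 0.5 used below is purely illustrative.

```python
def confidence_band(c: float, shift: float = 0.0) -> str:
    """Map a top-pose confidence score to a rough confidence band.

    shift (hypothetical parameter) lowers both cut-offs by the same
    amount, for large ligands, large complexes, or apo conformations.
    """
    high_cutoff = 0.0 - shift
    low_cutoff = -1.5 - shift
    if c > high_cutoff:
        return "high"
    if c > low_cutoff:
        return "moderate"
    return "low"


print(confidence_band(0.3))              # above 0
print(confidence_band(-0.7))             # between -1.5 and 0
print(confidence_band(-2.0))             # below -1.5
print(confidence_band(-1.7, shift=0.5))  # cut-offs lowered to -0.5 and -2.0
```

These bands remain a rough guide to pose quality and are not a measure of binding affinity.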

CDD Vault’s PDB Viewer:

CDD Vault has a native PDB viewer that can be accessed for molecular docking, protein folding, or viewing experimentally resolved crystal structures. The right-hand menu allows users to quickly toggle on the active-site specific view as well as turn on protein residue labels to quickly view where key interactions may be taking place.

Additionally, five options for viewing the protein surface are available including:

Solvent accessible
Solvent excluded
Van der Waals
On sided surface

Users will also have the option to toggle on the van der Waals surface for the ligand.

Please note, whichever view selections you make in the PDB viewer will be retained when you exit back to the search results page as shown below:

Output:

DiffDock does not directly predict the binding affinity of the ligand to the protein. It predicts the 3D structure of the complex and outputs a confidence score. The confidence score is a measure of the quality of the prediction, i.e. the model's confidence in its predicted binding structure. A correlation with binding affinity has been observed (if a ligand does not bind, there will be no good pose), but the score is not a direct measure of affinity.

How Does DiffDock Work?

DiffDock is a deep learning model that predicts how small-molecule ligands dock to a protein target using a diffusion-generative approach. It performs blind docking, sampling likely ligand binding poses across the whole protein structure. DiffDock consists of two models:

The Score Model: Generates a series of potential poses for protein-ligand binding by running a reverse diffusion process. DiffDock does not require any information about a binding pocket. During the diffusion process, the protein is held static while the molecule's position relative to the protein, its orientation, and its torsion angles are allowed to change. Running the learned reverse diffusion process transforms a distribution of noisy prior molecule poses into the distribution learned by the model.
The Confidence Model: Ranks the many poses sampled by the score model. It estimates the quality of each predicted ligand pose, i.e. how close it is likely to be to the true binding pose (a low root-mean-square deviation). It uses a trained neural network to evaluate each generated pose based on geometric and chemical features. The top-ranked ligand pose and its associated confidence are then taken as DiffDock’s top-1 prediction and confidence score.
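The ranking step can be pictured with a short Python sketch. It does not call DiffDock: the `Pose` type, the function names, and the confidence values are hypothetical stand-ins for the sampled poses and the scores assigned by the confidence model.

```python
from typing import List, NamedTuple


class Pose(NamedTuple):
    pose_id: str
    confidence: float  # stand-in for the confidence model's score


def rank_poses(poses: List[Pose]) -> List[Pose]:
    """Order sampled poses from most to least confident."""
    return sorted(poses, key=lambda p: p.confidence, reverse=True)


def top1(poses: List[Pose]) -> Pose:
    """Return the top-ranked pose, i.e. the top-1 prediction."""
    if not poses:
        raise ValueError("no poses were sampled")
    return rank_poses(poses)[0]


# Made-up scores for three sampled poses of one ligand.
sampled = [Pose("pose_a", -1.8), Pose("pose_b", 0.4), Pose("pose_c", -0.6)]
best = top1(sampled)
print(best.pose_id, best.confidence)
```

Only the best-ranked pose and its score matter for the final result; the lower-ranked poses are used solely for the comparison.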

CDD Vault reports only the confidence score for the top predicted pose for protein-ligand binding.

Can DiffDock be used for modelling protein-protein or protein-nucleic acid interactions?

While the program might not throw an error when given large biomolecules as input, the model has only been designed, trained, and tested for docking small molecules to proteins. DiffDock is therefore only likely to handle small peptides and nucleic acids as ligands, and we do not recommend using it for the interactions of larger biomolecules.

CDD Vault uses the following parameters to run each docking job:

batch_size: int = 32
no_final_step_noise: bool = False
inference_steps: int = 20
actual_steps: int = 20

For further information regarding the theory and methodology behind the development of the DiffDock model, we welcome you to view the original and official publication listed below.

Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi Jaakkola. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. Retrieved from https://par.nsf.gov/biblio/10404353. International Conference on Learning Representations (ICLR 2023).