
CDD Vault Deep Learning

The Research Informatics Group at CDD has developed a computational tool to enhance drug discovery alongside CDD Vault®. This tool is an innovative iterative deep learning model designed to assist medicinal chemists. Please contact your Account Manager at CDD if you would like to try this new feature.

Method

Chemically Rich Vectors (CRVs)

Deep Learning Network

ChEMBL, SureChEMBL and Enamine

Using the Deep Learning Tool in CDD Vault

Suggest New Molecules with Bioisosteric Replacements

 

 Method

This novel approach to similarity searching converts a chemical structure into a numerical representation and couples that representation with a generative model that can generate SMILES strings.

 

To implement this, the Research Informatics Group coupled a graph-based encoder to a complementary decoder, creating an autoencoder.

The autoencoder maps molecules into chemically rich vectors (CRVs) and was first trained so that its output is identical to its input.

The resulting neural network was then trained to generate an output that is chemically similar to the input to the encoder. The pre-trained encoder can then be used to generate CRVs for use in predictive models.

 

Chemically Rich Vectors (CRVs)

Graph convolutional networks, a form of deep learning architecture, describe a structure as a vector of real numbers, the so-called latent vector.

It has been demonstrated that the latent vector representation can be used to build QSAR models with state-of-the-art performance.

These CRVs therefore serve as concise numerical summaries of chemical properties, surpassing conventional molecular descriptors in several ways:

  • They allow for the reversible reconstruction of original molecules, enabling the encoding of comprehensive structure-activity information in a multi-dimensional numerical space.
  • They eliminate the need for expertise in computational chemistry to select molecular descriptors.
  • They establish a foundation for performing inverse QSAR.
  • The models can operate as unsupervised background processes, facilitating automation in drug discovery.

This innovation is a hybrid training system that combines CRVs and labeled graphs. By training the encoder with labeled graphs and the decoder to generate SMILES strings and structural fingerprints, the graph invariance problem is addressed while retaining the ability to recreate the original molecule.

 

Deep Learning Network

The most relevant aspects of the network architecture for this study are:

  • The network was trained using structures from ChEMBL version 28.
  • The network combines a graph convolutional network, which represents a molecule as a chemically rich vector of length 384, with a generative model that derives structures matching a given CRV.
  • The network was trained using a combination of objectives to ensure that the CRV represents structural information, covers the latent space well, and can be used to reconstruct the chemical structure it represents.
  • The CRV structure representation, combined with a suitable distance metric, can be used for similarity searches that are complementary to established approaches.
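
The last point can be illustrated with a minimal sketch of a vector-based similarity search. It is not CDD's implementation: the choice of cosine similarity as the "suitable distance metric", the function names, and the randomly generated placeholder vectors are all assumptions made for illustration. Only the vector length of 384 comes from the description above.

```python
import math
import random

CRV_LENGTH = 384  # vector length stated in the text above


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, library, top_n=100):
    """Return (name, similarity) pairs for the top_n library entries, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Placeholder data: random vectors stand in for real CRVs (hypothetical names).
rng = random.Random(0)
library = {
    f"cmpd-{i}": [rng.gauss(0, 1) for _ in range(CRV_LENGTH)] for i in range(500)
}
# A query that is a slightly perturbed copy of one library vector.
query = [v + rng.gauss(0, 0.1) for v in library["cmpd-42"]]

hits = most_similar(query, library, top_n=5)
```

In this toy setup the perturbed copy of "cmpd-42" ranks first, because its vector points in almost the same direction as the query. In practice, an exhaustive scan like this would likely be replaced by an indexed nearest-neighbor search when the library holds millions of vectors.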

 

ChEMBL, SureChEMBL and Enamine

ChEMBL is a manually curated database of bioactive molecules that includes ~2.5 million synthesizable compounds with drug-like properties. Most of these have been tested against one or more targets, providing a rich, accessible chemical space to explore.

SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents.

The Enamine collection is a commercially available library.

These datasets are hosted as private instances behind the CDD firewall, so your structure is not exposed to the internet. This new vector-based methodology will help you uncover structurally related compounds quickly. These structures can then be exported for further analysis and purchasing.

 

Using the Deep Learning Tool in CDD Vault

Once a hit compound has been identified from your research, conducting an extended similarity search in the ChEMBL database can be beneficial for several reasons: it can uncover structurally similar compounds or analogs to test for improved potency, selectivity, or pharmacokinetic properties, and it supports structure-activity relationship (SAR) exploration by comparing the biological activities of similar compounds, aiding in the rational design of more effective drugs.

Conducting a thorough similarity search around a hit compound is a crucial step in drug discovery, contributing to the strategic development and eventual success of new therapeutic agents.

 

From the Molecule page of the hit compound in CDD Vault, click "Find ChEMBL compounds using deep learning similarity" on the left-hand side of the page.

 

deep_learning.png

 

The search runs against the ~2.5 million structures in ChEMBL and returns the 100 structures most similar to the query. Information about these structures can be downloaded as a CSV file.

 

deep_learning2.png

The downloaded file contains three columns: name, smiles and scaffold. This information can then be used either to order compounds from chemical vendors or to further investigate the activities of these compounds in ChEMBL.
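
As a rough illustration of working with the downloaded results, the sketch below groups hits by scaffold using only the Python standard library. The exact header spelling (name, smiles, scaffold), the sample rows, and the function name are assumptions for illustration; check them against an actual export before relying on this.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows standing in for a real export.
sample_csv = """name,smiles,scaffold
example-1,CC(=O)Oc1ccccc1C(=O)O,c1ccccc1
example-2,OC(=O)c1ccccc1O,c1ccccc1
example-3,OC(=O)c1cccnc1,c1ccncc1
"""


def group_by_scaffold(csv_text):
    """Map each scaffold SMILES to the list of compound names that share it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["scaffold"]].append(row["name"])
    return dict(groups)


groups = group_by_scaffold(sample_csv)
for scaffold, names in groups.items():
    print(scaffold, len(names), names)
```

To read a file from disk instead, the same `csv.DictReader` call can be given an open file handle in place of the `io.StringIO` wrapper.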

 


 


Suggest New Molecules with Bioisosteric Replacements

 

Method

Using pre-defined retrosynthetic rules (BRICS, Degen et al. 2008), the reference structure is broken down into fragments. A numerical representation of each fragment is then derived using the CDD Deep Learning model. By comparing these numerical representations with a large library of suitable replacement fragments found in ChEMBL, the model suggests bioisosteric replacements. The suggestions are sorted by Tanimoto similarity to the original fragment; the similarity is calculated from the Morgan-2 fingerprint folded to 2048 bits.
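
The sorting step can be sketched as follows. This is an illustration only, not the CDD implementation: real Morgan fingerprints would come from a cheminformatics toolkit, whereas here hypothetical integer feature IDs stand in for them, and the function and fragment names are invented for the example. The Tanimoto formula itself is standard: the number of shared on-bits divided by the number of bits set in either fingerprint.

```python
FP_SIZE = 2048  # folded fingerprint length stated in the text above


def fold(feature_ids, size=FP_SIZE):
    """Fold arbitrary integer feature IDs into a set of on-bit positions."""
    return {fid % size for fid in feature_ids}


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity: shared on-bits / on-bits present in either set."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def rank_replacements(reference_bits, candidates):
    """Sort candidate fragments by similarity to the reference, best first."""
    scored = [
        (name, tanimoto(reference_bits, bits)) for name, bits in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical feature IDs; 11 and 2059 collide on bit 11 after folding.
reference = fold([11, 2059, 4100, 77, 300])
candidates = {
    "frag-A": fold([11, 4, 77, 301]),
    "frag-B": fold([11, 4, 77, 300]),
    "frag-C": fold([900, 901]),
}
ranked = rank_replacements(reference, candidates)
```

Folding keeps the fingerprint at a fixed length at the cost of occasional bit collisions, as the two colliding feature IDs in the example show.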


Each fragment is highlighted in the original reference structure, and the suggested replacements are shown in the table next to it. Click on a suggested structure to select it. Build up a list of favorites to be included in the downloaded file.


Settings

  • Open the settings dialog by clicking the settings icon
  • Maximum number of hits per fragment
  • Control display of additional information (e.g. first patent for structure)
  • Select the properties to be shown for each suggestion
  • The following color coding is used: properties will be red if they are higher than the target, blue if lower or black if the same
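
The color-coding rule in the last bullet amounts to a three-way comparison, sketched below. The function name and the example values are hypothetical, and the sketch assumes an exact comparison with no rounding or tolerance, which may differ from how the interface treats near-equal values.

```python
def property_color(value, target):
    """Return the display color for a property value relative to its target."""
    if value > target:
        return "red"    # higher than the target
    if value < target:
        return "blue"   # lower than the target
    return "black"      # same as the target


# Hypothetical property values for one suggestion compared with the target.
colors = {
    "molecular weight": property_color(312.4, 298.3),
    "log P": property_color(1.9, 2.4),
    "H-bond donors": property_color(2, 2),
}
```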


Download and Visualization

  • Click the download icon to export the suggestions and their properties as a CSV file
  • The file will contain the original reference structure, the selected suggestions and their properties
  • You can use this file to further analyze the suggestions in CDD Visualization or import it into an ideas Vault
  • You can open the information directly in CDD Visualization using the viz icon


The exported data contain a synthetic accessibility score. This score measures how unusual the structure is. To do this, we look at the circular substructures up to a radius of two bonds from each atom. For each circular substructure, we identify how often it occurred in structures found in ChEMBL release 20. The synthetic accessibility score is the number of substructures that are not found in ChEMBL. A value of 0 means that no unusual substructures were encountered; larger values reflect that the structure contains unusual substructures which can mean that it is harder to synthesize.
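
The counting logic described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the circular substructures are represented by hypothetical string identifiers rather than being enumerated from a real structure, the reference set stands in for the substructures observed in ChEMBL, and each distinct unusual substructure is counted once (the text does not say whether repeated occurrences count separately).

```python
def synthetic_accessibility_score(substructures, known_substructures):
    """Count distinct circular substructures absent from the reference set."""
    return sum(1 for s in set(substructures) if s not in known_substructures)


# Hypothetical identifiers standing in for radius-0..2 atom environments.
known_in_reference = {"env:a", "env:b", "env:c"}

common_molecule = ["env:a", "env:b", "env:b", "env:c"]
unusual_molecule = ["env:a", "env:b", "env:x", "env:x", "env:y"]

score_common = synthetic_accessibility_score(common_molecule, known_in_reference)
score_unusual = synthetic_accessibility_score(unusual_molecule, known_in_reference)
```

Here `score_common` is 0 because every environment appears in the reference set, while `score_unusual` is 2 because "env:x" and "env:y" do not.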

 

Where to find the AI tools:

Search Results Table

  • Run a search on the Explore Data tab
  • Click "Customize your report" and select the 'bioisosteres' option under the Structure section
  • Click the 'bioisosteres' link displayed under the chemical structure

Molecule Overview Page

  • Run a search on the Explore Data tab and click on a molecule ID, or click the Molecules tab and select a molecule
  • The two links for Deep Learning Similarity Searching and Bioisosteric suggestions are displayed under the chemical structure

AI links.png

 

Ketcher Structure Editor

  • Navigate to the Explore Data tab 
  • Click to open the structure editor
  • Draw/lookup a structure in the canvas
  • Launch the AI tools with the links at the bottom of the structure editor
  • You can watch a Demo of how CDD Vault Deep Learning can be used by Chemists and Biologists here.
  • You can watch the Recorded Webinar "Unlocking Potential Hits with an Advanced Deep Learning Methodology" here.