CDD Vault Deep Learning

The Research Informatics Group at CDD has developed a computational tool to enhance drug discovery alongside CDD Vault®. This tool is an innovative iterative deep learning model designed to assist medicinal chemists. Please contact your Account Manager at CDD if you would like to try this new feature.


Chemically Rich Vectors (CRVs)

Deep Learning Network


Using the Deep Learning Tool in CDD Vault



This novel approach to similarity search converts a chemical structure into a numerical representation and couples that representation with a generative model that can generate SMILES strings.


To implement this, the Research Informatics Group coupled a graph-based encoder to a complementary decoder, creating an autoencoder.

The autoencoder maps molecules into chemically rich vectors (CRVs). It was first trained to ensure that the output is identical to the input.

The resulting neural network was then trained to generate an output that is chemically similar to the input to the encoder. The pre-trained encoder can then be used to generate CRVs, which can in turn be used in predictive models.


Chemically Rich Vectors (CRVs)

Graph convolutional networks, a form of deep learning architecture, describe a structure as a vector of real numbers, the so-called latent vector.

It has been demonstrated that the latent vector representation can be used to build QSAR models with state-of-the-art performance.

These CRVs therefore serve as concise numerical summaries of chemical properties, surpassing conventional molecular descriptors in several ways:

  • They allow for the reversible reconstruction of original molecules, enabling the encoding of comprehensive structure-activity information in a multi-dimensional numerical space.
  • They eliminate the need for expertise in computational chemistry to select molecular descriptors.
  • They establish a foundation for performing inverse QSAR.
  • The models can operate as unsupervised background processes, facilitating automation in drug discovery.

This innovation is a hybrid training system that combines CRVs and labeled graphs. By training the encoder with labeled graphs and the decoder to generate SMILES strings and structural fingerprints, the graph invariance problem is addressed while retaining the ability to recreate the original molecule.


Deep Learning Network

The most relevant aspects of the network architecture for this study are:

  • The network was trained using structures from ChEMBL version 28.
  • The network is a combination of a graph convolutional network to represent a molecule as a chemically rich vector of length 384 and a generative model to derive structures matching a given CRV.
  • The network was trained using a combination of objectives, to ensure that the CRV represents structural information, covers the latent space well, and can be used to reconstruct the chemical structure represented by a CRV.
  • The CRV structure representation, combined with a suitable distance metric, can be used for similarity searches that are complementary to established approaches.
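
As a rough illustration of the last point, the sketch below ranks a small library of vectors by distance to a query vector. It is not CDD's implementation: cosine distance is only an assumed choice of metric, the function names `cosine_distance` and `most_similar` are hypothetical, and random numbers stand in for real CRVs. Only the vector length of 384 comes from the description above.

```python
import heapq
import math
import random

CRV_LENGTH = 384  # vector length stated in the article


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def most_similar(query, library, k=100):
    """Return the k (distance, name) pairs closest to the query vector."""
    scored = ((cosine_distance(query, vec), name) for name, vec in library.items())
    return heapq.nsmallest(k, scored)


# Random stand-ins for real CRVs (hypothetical data, for illustration only).
rng = random.Random(0)
library = {
    f"compound_{i}": [rng.gauss(0.0, 1.0) for _ in range(CRV_LENGTH)]
    for i in range(1000)
}
# A query that is a slightly perturbed copy of compound_42.
query = [x + rng.gauss(0.0, 0.05) for x in library["compound_42"]]

hits = most_similar(query, library, k=5)
for distance, name in hits:
    print(f"{name}\t{distance:.4f}")
```

Because the query is a noisy copy of one library entry, that entry should come back as the closest hit. A search over millions of vectors would normally use a vectorized or indexed nearest-neighbor method rather than this linear scan.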



ChEMBL is a manually curated database of bioactive molecules that includes ~2.5 million synthesizable compounds with drug-like properties. Most of these have been tested against one or more targets, providing a rich, accessible chemical space to explore.

SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents.

CDD Vault now has a private instance of the ChEMBL structure database available for you to search. It sits behind the CDD firewall, so your own structures are not exposed.

This new vector-based methodology will help you uncover structurally related compounds quickly. These structures can then be exported for further analysis and purchasing.


Using the Deep Learning Tool in CDD Vault

Once a hit compound has been identified from your research, conducting an extended similarity search in the ChEMBL database can be beneficial for several reasons. It can uncover structurally similar compounds or analogs to test for improved potency, selectivity, or pharmacokinetic properties. It also supports structure-activity relationship (SAR) exploration: comparing the biological activities of similar compounds aids the rational design of more effective drugs.

Conducting a thorough similarity search around a hit compound is a crucial step in drug discovery, contributing to the strategic development and eventual success of new therapeutic agents.


From the Molecule page of the hit compound in CDD Vault, click "Find ChEMBL compounds using deep learning similarity" on the left-hand side of the page.




The search runs against the ~2.5 million structures in ChEMBL and returns the 100 structures most similar to the query. Information about these structures can be downloaded as a CSV file.



The downloaded file contains three columns: name, smiles and scaffold. This information can then be used either to order compounds from chemical vendors or to further investigate the activities of these compounds in ChEMBL.
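
For follow-up analysis outside CDD Vault, an export like this can be read with a few lines of Python. The sketch below groups hits by their scaffold column, which gives a quick view of which chemotypes dominate the results. The rows are made-up placeholders, the helper name `group_by_scaffold` is hypothetical, and the exact header spelling in a real export is an assumption. Only the three column names come from the description above.

```python
import csv
import io
from collections import defaultdict

# Made-up rows standing in for a real export; only the three column
# names (name, smiles, scaffold) come from the article.
EXPORT = """name,smiles,scaffold
example_1,Cc1ccccc1,c1ccccc1
example_2,Oc1ccccc1,c1ccccc1
example_3,Cc1ccncc1,c1ccncc1
"""


def group_by_scaffold(csv_text):
    """Map each scaffold SMILES to the (name, smiles) rows that share it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["scaffold"]].append((row["name"], row["smiles"]))
    return dict(groups)


groups = group_by_scaffold(EXPORT)
for scaffold, members in groups.items():
    print(f"{scaffold}: {len(members)} compound(s)")
```

To process a real download, read the file's text (for example with `open(path, newline="")`) and pass it to the same function, adjusting the column names if the actual headers differ.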




  • You can watch a Demo of how CDD Vault Deep Learning can be used by Chemists and Biologists here.
  • You can watch the Recorded Webinar "Unlocking Potential Hits with an Advanced Deep Learning Methodology" here.