For a technical review of the development, execution, and deployment process of CDD Vault’s Zero-Click Inference Models, please see the manuscript below published by the CDD Informatics team.
Peter Gedeck, Jonathan Bisson, Kurt Werle, et al. Automated QSAR — how good is it in practice?. ChemRxiv. 16 January 2026. DOI: https://doi.org/10.26434/chemrxiv-2026-l1d11
How to interact with Zero-Click Inference models in CDD Vault
How to interpret predicted scores and error
Integrated AI workflow in CDD Vault
Introduction to QSAR models
CDD Vault Zero-Click Inference Models are predictive models built directly from your experimental protocol data, requiring no configuration or user interaction. Model training, QC, and deployment occur automatically, and the model is refined every time new data is added. CDD Vault analyzes protocol data, identifies relevant endpoints, and builds a regression model. If the model meets the release criteria, it is released into your Vault. You do not need to configure or maintain anything.
Quantitative Structure-Activity Relationship (QSAR) is a computational modeling technique that correlates the chemical structure of a molecule with its biological activity. QSAR models work by converting chemical structures into numerical descriptors (predictors) and using machine learning algorithms to learn relationships between these predictors and experimentally measured activities. CDD Inference Models automatically build regression models using molecular fingerprints and machine learning methods including Bayesian ridge regression, random forest, and gradient boosting, selecting the best-performing model through cross-validation.
Using public datasets from ChEMBL, the CDD Informatics team developed a fully automated workflow for model training and continuous evaluation. Regression models are released when a conservative performance threshold is achieved. To give users a handle on model uncertainty, CDD also provides conformal prediction intervals. Automated and continuously updated QSAR modeling can provide practical and scalable decision support for drug discovery, particularly in settings where dedicated modeling expertise is limited.
How the models work
During the development of these inference models, CDD Informatics considered the following constraints.
- The system should be fully automated and require no user input.
- The system should be able to handle a wide variety of datasets, including small datasets with as few as 20 data points, as well as large datasets with several thousand data points.
- The system should be able to provide reliable estimates of model performance.
- The trained models should be easily accessible to users for making predictions on new compounds on the fly; because CDD Vault is a cloud-based application, models should be executable in the browser.
Training data sets
CDD Informatics used a large collection of public datasets from ChEMBL (version 35) to develop and validate this modeling approach. The datasets cover a wide range of target classes and therapeutic areas (550 diverse targets) and should give a reasonable representation of the type of SAR datasets encountered in drug discovery.
Data Preparation
Chemical structures are standardized and converted into molecular fingerprints: Morgan count vectors of length 2048 with a radius of 3, with atoms typed by their features. These roughly correspond to FCFC6 count vectors.
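As a rough illustration of this featurization, the sketch below uses the open-source RDKit toolkit. The article does not say which toolkit CDD uses, so the library choice, the helper name `featurize`, and the example SMILES are all assumptions, and the standardization step is not shown.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Assumption: RDKit stands in for whatever toolkit CDD actually uses.
# Feature-based atom invariants give FCFC-like (rather than ECFC-like) vectors.
_GENERATOR = rdFingerprintGenerator.GetMorganGenerator(
    radius=3,
    fpSize=2048,
    atomInvariantsGenerator=rdFingerprintGenerator.GetMorganFeatureAtomInvGen(),
)


def featurize(smiles: str) -> np.ndarray:
    """Return a length-2048 Morgan count vector for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return _GENERATOR.GetCountFingerprintAsNumPy(mol)


# Illustrative input only (aspirin); not taken from the article.
fingerprint = featurize("CC(=O)Oc1ccccc1C(=O)O")
```

Each position of the vector counts how often a hashed substructure occurs, which is what distinguishes count vectors from plain bit vectors.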
Automated Model Training
The system automatically builds three candidate regression models using cross-validation (5-fold) and hyperparameter tuning for each dataset.
- Bayesian ridge regression: a linear regression model with L2 regularization that is implicitly tuned during model training.
- Random forest regression: a non-linear ensemble model with tunable hyperparameters.
- XGBoost: a non-linear gradient boosting model with several tunable hyperparameters.
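The cross-validation step for the candidate models might look like the following minimal sketch. It assumes scikit-learn, uses synthetic data in place of real fingerprints and activities, and omits both hyperparameter tuning and the XGBoost candidate to keep dependencies small; none of the variable names come from the CDD implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
# Synthetic stand-in for count fingerprints (X) and measured activities (y).
X = rng.poisson(1.0, size=(80, 64)).astype(float)
y = X[:, :5].sum(axis=1) + rng.normal(0.0, 0.5, size=80)

# Two of the three candidate model types, with illustrative settings.
candidates = {
    "bayesian_ridge": BayesianRidge(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# 5-fold cross-validation: every compound is predicted exactly once by a
# model that did not see it during training (an out-of-fold prediction).
folds = KFold(n_splits=5, shuffle=True, random_state=0)
out_of_fold = {
    name: cross_val_predict(model, X, y, cv=folds)
    for name, model in candidates.items()
}
```

The out-of-fold predictions are the natural input for both comparing candidates and estimating prediction errors.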
Model Selection
The system then statistically compares model performance and selects the simplest model that is significantly predictive. A more complex model is selected only if it is significantly better than a simpler one; otherwise, the simpler model is preferred. The selected model is then retrained on the entire dataset using the best hyperparameters found during cross-validation.
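The "prefer the simpler model" rule can be sketched as follows. The article does not name the statistical test, so the paired t-test on per-fold errors, the hard-coded critical value, and the function names are assumptions chosen for illustration. The separate check that a model is significantly predictive at all is not shown.

```python
from math import sqrt
from statistics import mean, stdev

# Assumption: one-sided critical t value for 4 degrees of freedom (5 folds)
# at the 5% level. The article does not state the test or threshold used.
T_CRITICAL = 2.132


def significantly_better(errors_simple, errors_complex, t_critical=T_CRITICAL):
    """Paired t-test on per-fold errors: is the complex model's error lower?"""
    diffs = [s - c for s, c in zip(errors_simple, errors_complex)]
    spread = stdev(diffs)
    if spread == 0.0:
        return mean(diffs) > 0.0
    t_stat = mean(diffs) / (spread / sqrt(len(diffs)))
    return t_stat > t_critical


def select_model(fold_errors):
    """fold_errors: list of (name, per-fold errors), ordered simple to complex."""
    chosen_name, chosen_errors = fold_errors[0]
    for name, errors in fold_errors[1:]:
        # Switch only when the more complex model is significantly better.
        if significantly_better(chosen_errors, errors):
            chosen_name, chosen_errors = name, errors
    return chosen_name


# Hypothetical per-fold RMSE values, for illustration only.
fold_rmse = [
    ("bayesian_ridge", [0.80, 0.82, 0.79, 0.81, 0.83]),
    ("random_forest", [0.79, 0.83, 0.80, 0.80, 0.82]),
    ("xgboost", [0.60, 0.62, 0.58, 0.61, 0.63]),
]
selected = select_model(fold_rmse)
```

With these made-up numbers the random forest is not clearly better than the linear model, so only the gradient boosting model displaces it.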
Uncertainty Estimation
Conformal regression is an approach that derives prediction intervals for individual predictions based on the errors observed during training. A cross-conformal regression approach is implemented where a model is trained to predict the absolute residuals of the out-of-fold predictions from the cross-validation of the selected model.
This approach generates a prediction interval that quantifies confidence for each result, giving scientists immediate insight into the question “how confident should I be in this predicted activity value?”
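One common way to turn predicted absolute residuals into intervals is normalized conformal regression, sketched below. This is an assumed formulation, not necessarily the one CDD uses: the 80% coverage level (`alpha=0.2`), the `eps` floor, and all names are illustrative.

```python
from math import ceil


def conformal_scale(y_true, y_oof, predicted_abs_residual, alpha=0.2, eps=1e-6):
    """Calibrate a multiplier for the residual model from out-of-fold results."""
    # Nonconformity score: actual error relative to the predicted error.
    scores = sorted(
        abs(y - p) / max(r, eps)
        for y, p, r in zip(y_true, y_oof, predicted_abs_residual)
    )
    n = len(scores)
    # Finite-sample conformal rank, capped at the largest observed score
    # (a simplification for very small calibration sets).
    rank = min(ceil((n + 1) * (1 - alpha)), n)
    return scores[rank - 1]


def prediction_interval(prediction, predicted_abs_residual, scale, eps=1e-6):
    """Return (low, high) for one new compound."""
    half_width = scale * max(predicted_abs_residual, eps)
    return prediction - half_width, prediction + half_width


# Hypothetical calibration data, for illustration only.
y_true = [5.0, 6.0, 7.0, 5.5, 6.5]
y_oof = [5.2, 5.7, 7.1, 5.0, 6.6]
residual_model_output = [0.2, 0.3, 0.2, 0.5, 0.1]

scale = conformal_scale(y_true, y_oof, residual_model_output)
low, high = prediction_interval(6.0, 0.4, scale)
```

Compounds for which the residual model predicts a larger error receive a proportionally wider interval.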
Deployment & Continuous Updating (Open Neural Network Exchange)
The final model combines feature pre-processing, the trained machine learning model, and the conformal prediction model. It is packaged in the Open Neural Network Exchange (ONNX) format for in-browser use and is automatically retrained whenever new protocol data becomes available.
Security
The CDD Vault Zero-Click Inference Models are designed with security in mind. Your data remains inside your secure and private CDD Vault environment. All models are deployed on CDD-controlled servers, and at no point is data transferred to external servers.
Model deployment criteria
Model deployment is fully automated. Models are deployed only once they have been validated and have met a set of criteria intended to ensure robust and accurate predictions:
- Applicable data sets are dose-response intercept values calculated within CDD Vault.
- Model training starts once six data points are available, but six data points are generally insufficient, and users should not expect reasonable models from so few. In practice, at least 30 data points with a reasonable spread are typically needed to see models of sufficient quality.
- Once training is finished, the model performance is checked against our criterion (R² > 0.4). If the criterion is met, the new model is released.
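The release check above can be sketched in a few lines. The six-point minimum and the R² > 0.4 threshold come from this article; the function names, the use of cross-validated predictions as input, and the mapping onto the status labels shown in the Inference Models tab are assumptions.

```python
MIN_TRAINING_POINTS = 6  # minimum stated in the article
R2_THRESHOLD = 0.4       # release criterion stated in the article


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0.0:
        return 0.0  # no spread in the data, so nothing to explain
    return 1.0 - ss_res / ss_tot


def model_status(y_true, y_cv_pred):
    """Assumed mapping of the criteria onto the three status labels."""
    if len(y_true) < MIN_TRAINING_POINTS:
        return "Insufficient Data"
    if r_squared(y_true, y_cv_pred) > R2_THRESHOLD:
        return "Released"
    return "Low Model Quality"


# Hypothetical measured and predicted activities, for illustration only.
measured = [5.0, 5.5, 6.0, 6.5, 7.0, 7.5]
predicted = [5.1, 5.4, 6.2, 6.4, 7.1, 7.3]
status = model_status(measured, predicted)
```

A model that always predicts the mean activity has an R² of 0, so it would fall below the threshold and not be released.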
How to interact with Zero-Click Inference Models in CDD Vault
Inference models are directly integrated in your CDD Vault environment in four locations.
Molecule overview page:
Predictions are shown below the calculated properties. By default, you will first see the user-selected models, followed by a list of the most recently updated models. You can add or remove models by clicking the pin and unpin icons that appear when you move the mouse over the model name.
For more control over model selection, click the pencil icon to open the settings dialog, where you can:
- Select a list of preferred models
- Change the order of the models
- Use Filter models to filter models by name
- Select all/none to add or remove all the filtered models
The list of preferred models is saved in your browser and will persist across sessions.
Structure editor:
Model predictions are shown side by side with molecular properties for any molecule you draw, allowing for real-time QSAR activity predictions. Select the gear icon to open the prediction settings, where you can toggle physchem property predictions on and off and select or filter inference model predictions. Because protocol names can be very long, CDD labels predictions A, B, and so on. Hover over a label to see the protocol name.
Bioisosteric suggestions (AI module):
Predictions are calculated for all suggested molecules. Predictions are more likely to be accurate for bioisosteres that are similar to the parent compounds in the training sets. The similarity is shown next to each bioisostere to help you gauge how much to trust its prediction. Select the gear icon in the top menu bar to select models, filter models, and sort suggested structures by predicted activity.
Deep learning similarity (AI module):
Predictions are calculated for all molecules. Deep learning similarity identifies compounds that are physically available from Enamine for SAR-by-catalogue and compounds in SureChEMBL for patent novelty.
View Inference Model Details for Specific Protocols:
The Inference Models tab provides a centralized view of all available predictive models associated with a specific protocol. Please navigate to the protocol overview page and select the “Inference Models” tab to find this dashboard. For each model, you will see key details including the associated project and the readout definition that was used to model the data. If the protocol has conditions, a separate model for each condition set is reported. The final three columns inform you about the unit of measurement, the status of the model, and the performance.
Model performance is summarized using three standard regression statistics, each reported with an uncertainty estimate, alongside the number of data points used to train the model:
- R²
- MAE (Mean Absolute Error)
- RMSE (Root Mean Square Error)
Models can be filtered by keyword and status:
- Released
- Low Model Quality
- Insufficient Data
This makes it easy to identify which models are ready for use and which may require additional data.
How to interpret predicted scores and error
All outputs from CDD Zero-Click Inference Models will be reported in the following format:
- Intercept prediction (activity) ± conformal regression interval prediction (uncertainty)
The current implementation does not address the applicability domain of the models directly. CDD Informatics has shown that random, unrelated structures receive predictions in the low-activity range of the experimental data. This means that if the experimental data are 10 µM or higher, the model will predict an activity of around 10 µM. At the same time, the conformal prediction interval is larger for these compounds, which signals the higher uncertainty and indicates that the prediction should be interpreted with caution.
Integrated AI workflow in CDD Vault
Researchers can now use in silico tools embedded directly within CDD Vault to expedite every step of their DMTA cycle. Users may register an initial hit or lead compound (virtual or production) into their private CDD Vault database and then use the generative AI bioisostere suggestion tool to iterate on SAR cycles, find new chemical matter, or improve specific attributes of their lead compound. They can also run deep learning similarity searches to find similar compounds from the literature or commercial catalogues. All suggested compounds from either tool are run through the validated, relevant Zero-Click Inference Models, which provide activity predictions for each suggested compound. Predicted physicochemical properties and synthetic accessibility scores are also shown to help triage new compounds and focus research objectives. Users may then consider pushing the compounds that perform best in silico into docking models such as DiffDock or Boltz2 to further validate the design hypothesis and improve decision making.
For further information on Zero-Click Inference Models please contact CDD Support at support@collaborativedrug.com.