
Log D Calculated by CDD Vault During Chemical Registration

Users may register new Molecules into CDD Vault either manually, one at a time, through the interface, or by using the Data Import wizard. Whichever mechanism is used, a set of chemical properties is automatically calculated by CDD Vault for every chemical structure (small molecule) registered. Here are details on the Log D model used in CDD Vault.

Log D Model

Peter Gedeck, CDD Research Informatics

The logD model was developed using the same computational approach described in Gedeck et al. (2017) for logP. The model predicts logD with a median absolute error of 0.38, a mean absolute error of 0.5, and a root mean squared error of 0.7. This performance is comparable to results reported in the literature for other methods.

The approach used to develop the model allows us to incorporate additional experimental data without having access to structural information (see Gedeck et al., 2017). One limitation of the current implementation is that, for structures with tautomeric groups, predictions can vary depending on how the tautomers are drawn. While we incorporate some tautomer normalization, the implementation is not complete. We are currently exploring approaches to develop tautomer-independent fingerprints, which we expect to improve the model in the future.

 

Further details

Dataset
The model was developed using data from several public datasets.

Dataset               Size    Range (logD)
Wang et al. (2015)    1,117   -3.6 to 6.8
Wu et al. (2018)      4,199   -1.5 to 4.5
Combined              5,133   -3.6 to 6.8

 

After cleanup and removal of duplicate values, the combined dataset contains 5,133 structures for the logD model. Each chemical structure is preprocessed as follows:

  • Explicit hydrogens are removed
  • Functional groups are normalized
  • Some tautomers are normalized
  • Acids and bases are protonated/deprotonated
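
As an illustration, these steps can be approximated with RDKit's standardization utilities. The actual CDD Vault preprocessing pipeline is not published, so this is only a sketch assuming RDKit's `rdMolStandardize` module:

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def preprocess(smiles: str) -> str:
    """Illustrative approximation of the preprocessing steps with RDKit."""
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.RemoveHs(mol)                        # remove explicit hydrogens
    mol = rdMolStandardize.Normalize(mol)           # normalize functional groups
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)  # partial tautomer normalization
    mol = rdMolStandardize.Reionize(mol)            # adjust acid/base protonation states
    return Chem.MolToSmiles(mol)


canonical = preprocess("C1=CC=CC=C1")  # benzene passes through unchanged
```

Note that the tautomer canonicalization step is exactly where the limitation mentioned above arises: differently drawn tautomers are not always mapped to the same canonical form.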

 

Descriptors
A variety of fingerprints implemented in RDKit were explored:

  • alogP fragments
  • Morgan fragments with radius 1 or 2
  • RDKit path fingerprints with lengths 3 to 7

In all cases, fragment counts were used as descriptors.
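
To illustrate how fragment counts become descriptors, the sketch below folds integer fragment identifiers (stand-ins for fingerprint environments such as Morgan atom environments) into a fixed-length count vector. The identifiers and vector length here are illustrative, not the model's actual parameters:

```python
from collections import Counter


def count_vector(fragment_ids, n_bits=1024):
    """Fold integer fragment identifiers into a fixed-length count vector.

    In practice the identifiers would come from a fingerprint generator;
    here they are arbitrary integers for illustration.
    """
    vec = [0] * n_bits
    for frag, count in Counter(fragment_ids).items():
        vec[frag % n_bits] += count  # folding: distinct fragments may collide
    return vec


# 13 and 2048 + 13 collide onto the same index after folding
v = count_vector([13, 13, 2048 + 13, 7])
```

Using counts rather than binary presence/absence preserves how often a fragment occurs in the molecule, which matters for additive properties such as lipophilicity.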

 

Model training and validation


The descriptors were used to build Bayesian ridge regression models. Bayesian ridge regression is a form of regularized linear regression. Other approaches were explored as well; however, they did not lead to further improvement.
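
For illustration, the regularized linear regression underlying this approach can be sketched in closed form with NumPy. A fixed regularization strength `alpha` is assumed here; Bayesian ridge regression (e.g. scikit-learn's `BayesianRidge`) instead infers it from the data:

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)


# Toy data with known coefficients: y ~ 2*x0 - 1*x1 plus small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.01, size=200)
w = ridge_fit(X, y, alpha=1e-3)
```

The regularization term `alpha * I` shrinks the coefficients toward zero, which keeps the model stable when many fragment-count descriptors are correlated or sparse.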

Based on five-fold cross-validation results, Morgan fingerprints with radius 2 (Morgan-2) gave the best performance:

  • Median absolute error: MedAE = 0.38
  • Mean absolute error: MAE = 0.5
  • Root mean squared error: RMSE = 0.7
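
These metrics can be computed directly from the prediction errors. A minimal sketch using Python's standard library, with made-up values rather than the model's actual predictions:

```python
import math
from statistics import mean, median


def error_metrics(predicted, observed):
    """Compute MedAE, MAE, and RMSE from paired predictions and observations."""
    errors = [abs(p - o) for p, o in zip(predicted, observed)]
    return {
        "MedAE": median(errors),                          # median absolute error
        "MAE": mean(errors),                              # mean absolute error
        "RMSE": math.sqrt(mean(e * e for e in errors)),   # root mean squared error
    }


# Illustrative values only
m = error_metrics([1.0, 2.0, 4.0], [1.5, 2.0, 2.0])
```

Note that MedAE is insensitive to outliers, MAE weights all errors equally, and RMSE penalizes large errors most heavily, which is why RMSE (0.7) is the largest of the three reported values.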

While the performance metrics provide a useful overall measure of model quality, it is also informative to look at the distribution of the errors: 62% of all predictions are within 0.5 log units of the experimental value, and 89% are within 1 log unit. Only 1.3% of predictions are off by more than 2 log units. The following figure shows predicted versus experimental logD values.

[Figure: predicted versus experimental logD values]

The performance of the model is similar to results reported in the literature. Tetko and Poda (2004) evaluated several methods on two Pfizer internal logD datasets. ACD Labs LogD achieved MAEs of 0.69 and 0.97 and RMSEs of 0.99 and 1.32. Pallas PrologD performed slightly worse (MAE 1.29 and 1.06, RMSE 1.52 and 1.41). The ALOGPS software gave MAEs of 1.09 and 1.17 and RMSEs of 1.33 and 1.17; retraining the ALOGPS model on part of the internal dataset improved this to MAEs of 0.45 and 0.48 and RMSEs of 0.68 and 0.69. Bruneau and McElroy (2006) used a Bayesian regularized neural network trained on internal AstraZeneca data and report an RMSE of 0.63. Schroeter et al. (2007) achieved an RMSE of 0.66 on an internal dataset from Bayer Schering Pharma. Fu et al. (2020) reported RMSE values around 0.5 using several non-linear modeling methods. While this seems like a significant improvement over the earlier results, the errors increased considerably when the models were applied to an external test set; there, a linear regression model like ours performed comparably to the non-linear models, an indication that the non-linear models were overfitting the training data.

All in all, the performance of our logD model is comparable to other models.

 

References

Bruneau, P.; McElroy, N.R. logD7.4 Modeling Using Bayesian Regularized Neural Networks. Assessment and Correction of the Errors of Prediction. J. Chem. Inf. Model. 2006, 46, 1379-1387

Gedeck, P.; Skolnik, S.; Rodde, S. Developing Collaborative QSAR Models Without Sharing Structures. J. Chem. Inf. Model. 2017, DOI: 10.1021/acs.jcim.7b00315.

Fu, L.; Liu, L.; Yang, Z.-J.; Li, P.; Ding, J.-J.; Yun, Y.-H.; Lu, A.-P.; Hou, T.-J.; Cao, D.-S. Systematic Modeling of log D7.4 Based on Ensemble Machine Learning, Group Contribution, and Matched Molecular Pair Analysis. J. Chem. Inf. Model. 2020, 60, 63–76, DOI: 10.1021/acs.jcim.9b00718.

Schroeter, T. S.; Schwaighofer, A.; Mika, S.; Ter Laak, A.; Suelzle, D.; Ganzer, U.; Heinrich, N.; Muller, K.R. Predicting Lipophilicity of Drug-Discovery Molecules using Gaussian Process Models. ChemMedChem 2007, 2, 1265–1267, DOI: 10.1002/cmdc.200700041

Tetko, I.V.; Poda, G.I. Application of ALOGPS 2.1 to Predict log D Distribution Coefficient for Pfizer Proprietary Compounds. J. Med. Chem. 2004, 47, 23, 5601–5604, DOI: 10.1021/jm049509l.

Wang, J.-B.; Cao, D.-S.; Zhu, M.-F.; Yun, Y.-H.; Xiao, N.; Liang, Y.-Z. In Silico Evaluation of logD7.4 and Comparison with Other Prediction Methods. J. Chemom. 2015, 29, 389–398, DOI: 10.1002/cem.2718.

Wu, Z.; Ramsundar, B.; Feinberg, E. N.; Gomes, J.; Geniesse, C.; Pappu, A. S.; Leswing, K.; Pande, V. MoleculeNet: A Benchmark for Molecular Machine Learning. Chem. Sci. 2018, 9, 513–530, DOI: 10.1039/C7SC02664A.