Users may register new Molecules into CDD Vault either manually, one at a time through the interface, or by using the Data Import wizard. Whichever mechanism is used, CDD Vault automatically calculates a set of chemical properties for every chemical structure (small molecule) registered. This article describes the Log D model used in CDD Vault.
Log D Model
Peter Gedeck, CDD Research Informatics
The development of the logD model was based on the method described in Gedeck et al. (2017). The model predicts logD with a median absolute error of 0.26, a mean absolute error of 0.39, and a root mean squared error of 0.61. This performance is comparable to results reported in the literature for other methods.
One limitation of the current implementation is that, for structures with tautomeric groups, predictions can vary depending on how the tautomers are drawn. We normalize some tautomers, but the normalization is not complete. We are currently exploring approaches to develop tautomer-independent fingerprints and expect this to improve the model in the future. Proprietary data can be incorporated during the development process in a confidential manner.
Further details
Dataset
The model was developed using data from several public datasets.
Dataset | Size | Range |
Wu et al. (2018) | 4,200 | -1.5 to 4.5 |
Aliagas et al. (2022) | 4,190 | -1.5 to 4.5 |
Customers (2023) | 1,944 | -1.4 to 5.6 |
Wang et al. (2015) | 1,130 | -3.6 to 6.8 |
Francisco et al. (2021) | 61 | -1.4 to 3.8 |
Lassalas et al. (2016) | 35 | -1.4 to 3.8 |
Bergazin et al. (2021 SAMPL7) | 22 | 0.8 to 3.0 |
CDD curation | 19 | -1.4 to 3.4 |
Combined | 7,209 | -3.6 to 6.8 |
After cleanup and handling duplicate values, the combined dataset has 7,209 structures for the logD model.
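The article does not say how duplicate values were handled. As a minimal sketch of one common approach, the snippet below collapses repeated measurements of the same structure into their median. The `merge_duplicates` function, the use of a canonical SMILES string as the structure key, the choice of the median, and the example values are all assumptions made for illustration; they are not taken from the CDD implementation or from the datasets listed above.

```python
from collections import defaultdict
from statistics import median


def merge_duplicates(records):
    """Collapse repeated measurements of the same structure into one value.

    `records` is an iterable of (structure_key, logd) pairs, where
    structure_key is a hypothetical canonical identifier such as a
    canonical SMILES string. Using the median is an assumption of this
    sketch, not the documented CDD procedure.
    """
    grouped = defaultdict(list)
    for key, logd in records:
        grouped[key].append(logd)
    return {key: median(values) for key, values in grouped.items()}


# Made-up values for illustration only.
records = [
    ("c1ccccc1O", 1.5),
    ("c1ccccc1O", 1.4),
    ("c1ccccc1O", 1.9),
    ("CCO", -0.3),
]
merged = merge_duplicates(records)
```

The median is less sensitive to a single outlying measurement than the mean, which is why it is a frequent choice when merging data from several sources.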
Each chemical structure is preprocessed as follows:
- Explicit hydrogens are removed
- Normalization of functional groups
- Normalization of a few tautomers
- Acids and bases are protonated/deprotonated
Extended dataset
To further increase the training set size, we extended the dataset using logP data points for structures that are not charged at pH 7.4. For neutral molecules, logP and logD7.4 values are identical. The ionization state was determined using the CDD pKa model. This approach leads to a dataset of 20,706 data points.
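The reason logP and logD7.4 coincide for neutral molecules can be illustrated with the textbook relationship for a monoprotic acid or base, assuming that only the uncharged species partitions into octanol. The sketch below is not the CDD pKa model: the function names and the `min_neutral` threshold of 0.99 are hypothetical choices for illustration.

```python
import math

PH = 7.4


def neutral_fraction(pka, kind, ph=PH):
    """Fraction of a monoprotic acid or base that is uncharged at the given pH."""
    if kind == "acid":
        return 1.0 / (1.0 + 10.0 ** (ph - pka))
    if kind == "base":
        return 1.0 / (1.0 + 10.0 ** (pka - ph))
    raise ValueError("kind must be 'acid' or 'base'")


def logd_from_logp(logp, pka, kind, ph=PH):
    """logD, assuming only the neutral species partitions into octanol."""
    return logp + math.log10(neutral_fraction(pka, kind, ph))


def is_effectively_neutral(pka, kind, ph=PH, min_neutral=0.99):
    """Hypothetical filter: accept a logP value as a logD value only if
    the compound is almost entirely uncharged at the given pH."""
    return neutral_fraction(pka, kind, ph) >= min_neutral


# A weak acid (pKa 10) is essentially neutral at pH 7.4, so logD is close to logP.
weak_acid_logd = logd_from_logp(2.0, 10.0, "acid")
# A stronger acid (pKa 4.4) is mostly ionized, so logD is about 3 log units lower.
strong_acid_logd = logd_from_logp(2.0, 4.4, "acid")
```

Under these assumptions, the correction term vanishes as the neutral fraction approaches 1, so a logP measurement for such a compound can stand in for logD7.4.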
Descriptors
The model is a hierarchical linear regression trained on counts of Morgan fragments with radii of 0, 1, and 2.
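To make the idea of fragment counts at increasing radii concrete, here is a deliberately simplified, pure-Python sketch of Morgan-style circular fragments. It is not the descriptor code used in CDD Vault: it describes atoms only by element and number of heavy-atom neighbors, ignores bond orders, charges, and ring membership, and does not remove structurally duplicate fragments the way production fingerprint implementations do. The function name and the input format are assumptions made for this example.

```python
from collections import Counter


def morgan_fragment_counts(atoms, bonds, max_radius=2):
    """Count circular (Morgan-style) fragments up to max_radius.

    atoms: list of element symbols for the heavy atoms, e.g. ["C", "C", "O"]
    bonds: list of (i, j) atom index pairs; bond orders are ignored here.
    Returns a Counter keyed by (radius, identifier).
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Radius 0: each atom is described by its element and heavy-atom degree.
    identifiers = [(atoms[i], len(neighbors[i])) for i in range(len(atoms))]
    counts = Counter((0, ident) for ident in identifiers)

    for radius in range(1, max_radius + 1):
        # Radius r: combine an atom's previous identifier with the sorted
        # previous identifiers of its direct neighbors.
        identifiers = [
            (identifiers[i], tuple(sorted(identifiers[n] for n in neighbors[i])))
            for i in range(len(atoms))
        ]
        counts.update((radius, ident) for ident in identifiers)
    return counts


# Diethyl ether, heavy atoms only: C-C-O-C-C
ether_counts = morgan_fragment_counts(
    ["C", "C", "O", "C", "C"],
    [(0, 1), (1, 2), (2, 3), (3, 4)],
)
```

Each distinct (radius, identifier) key would become one column of the regression's feature matrix, with the count as its value. Because the molecule is symmetric, the two terminal carbons produce the same fragment at every radius and are counted twice.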
Model training and validation
The hierarchical linear regression model is described in more detail in the CDD logP model documentation.
The five-fold cross validation results for this model are:
Median absolute error: MedAE | 0.263 |
Mean absolute error: MAE | 0.391 |
Root mean squared error: RMSE | 0.611 |
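For readers who want to compute the same statistics for their own predictions, the following is a minimal sketch using only the Python standard library. The `error_summary` function and the example values are illustrative assumptions; this is not the validation code used for the model. The function also reports the share of predictions that fall within a given error tolerance, which complements the three summary metrics.

```python
import math
from statistics import mean, median


def error_summary(experimental, predicted, tolerances=(0.5, 1.0)):
    """Summarize absolute prediction errors in log units."""
    abs_errors = [abs(p - e) for e, p in zip(experimental, predicted)]
    summary = {
        "MedAE": median(abs_errors),
        "MAE": mean(abs_errors),
        "RMSE": math.sqrt(mean(err ** 2 for err in abs_errors)),
    }
    # Fraction of predictions whose absolute error is within each tolerance.
    for tol in tolerances:
        within = sum(err <= tol for err in abs_errors)
        summary[f"within_{tol}"] = within / len(abs_errors)
    return summary


# Made-up values for illustration only.
summary = error_summary([1.0, 2.0, 3.0, 4.0], [1.2, 1.9, 3.6, 2.5])
```

Because RMSE squares the errors before averaging, it is always at least as large as MAE and is more sensitive to a few large errors, which is why the three metrics are usually read together.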
While these performance metrics are useful for judging the overall quality of the model, it is also informative to look at the distribution of the errors. 74.4% of all predictions are within 0.5 log units of the experimental value, and 93% are within 1 log unit. Only 1.2% of predictions have an error greater than 2 log units. The following figure shows predicted versus experimental logD values.
The performance of the model is similar to results reported in the literature. Tetko and Poda (2004) evaluated the performance of several methods on two Pfizer internal logD datasets. ACD Labs LogD achieved MAEs of 0.69 and 0.97 and RMSEs of 0.99 and 1.32. Pallas PrologD performed slightly worse, with MAEs of 1.29 and 1.06 and RMSEs of 1.52 and 1.41. The ALOGPS software gave MAEs of 1.09 and 1.17 and RMSEs of 1.33 and 1.17. This last result was improved by retraining the ALOGPS model using parts of the internal dataset: after retraining, the MAEs were 0.45 and 0.48 and the RMSEs 0.68 and 0.69. Bruneau and McElroy (2006) used a Bayesian regularized neural network trained on internal data from AstraZeneca and reported an RMSE of 0.63. Schroeter et al. (2007) achieved an RMSE of 0.66 using an internal dataset from Bayer Schering Pharma. Li Fu et al. (2020) reported RMSE values around 0.5 using several non-linear modeling methods. While this seems like a significant improvement over the previous results, performance dropped considerably when the models were applied to an external test set. In that setting, a linear regression model, like the one used in our model, performed comparably to the non-linear models, which indicates that the non-linear models were overfitting the dataset.
All in all, the performance of our logD model is comparable to other models.
References
Aliagas, Ignacio, Alberto Gobbi, Man-Ling Lee, and Benjamin D. Sellers. “Comparison of LogP and LogD Correction Models Trained with Public and Proprietary Data Sets.” Journal of Computer-Aided Molecular Design 36, no. 3 (March 1, 2022): 253–62. https://doi.org/10.1007/s10822-022-00450-9.
Bergazin, Teresa Danielle, Nicolas Tielker, Yingying Zhang, Junjun Mao, M. R. Gunner, Karol Francisco, Carlo Ballatore, Stefan M. Kast, and David L. Mobley. “Evaluation of Log P, PKa, and Log D Predictions from the SAMPL7 Blind Challenge.” Journal of Computer-Aided Molecular Design 35, no. 7 (July 1, 2021): 771–802. https://doi.org/10.1007/s10822-021-00397-3.
Bruneau, P.; McElroy, N.R. logD7.4 Modeling Using Bayesian Regularized Neural Networks. Assessment and Correction of the Errors of Prediction. J. Chem. Inf. Model. 2006, 46, 1379-1387.
Francisco, Karol R., Carmine Varricchio, Thomas J. Paniak, Marisa C. Kozlowski, Andrea Brancale, and Carlo Ballatore. “Structure Property Relationships of N-Acylsulfonamides and Related Bioisosteres.” European Journal of Medicinal Chemistry 218 (June 5, 2021): 113399. https://doi.org/10.1016/j.ejmech.2021.113399.
Gedeck, P.; Skolnik, S.; Rodde, S. Developing Collaborative QSAR Models Without Sharing Structures. Journal of Chemical Information and Modeling 2017, DOI: 10.1021/acs.jcim.7b00315.
Lassalas, Pierrik, Bryant Gay, Caroline Lasfargeas, Michael J. James, Van Tran, Krishna G. Vijayendran, Kurt R. Brunden, et al. “Structure Property Relationships of Carboxylic Acid Isosteres.” Journal of Medicinal Chemistry 59, no. 7 (April 14, 2016): 3183–3203. https://doi.org/10.1021/acs.jmedchem.5b01963.
Li Fu, Lu Liu, Zhi-Jiang Yang, Pan Li, Jun-Jie Ding, Yong-Huan Yun, Ai-Ping Lu, Ting-Jun Hou, and Dong-Sheng Cao. Systematic Modeling of log D7.4 Based on Ensemble Machine Learning, Group Contribution, and Matched Molecular Pair Analysis. J. Chem. Inf. Model. 2020, 60, 1, 63–76, DOI: 10.1021/acs.jcim.9b00718
Schroeter, T. S.; Schwaighofer, A.; Mika, S.; Ter Laak, A.; Suelzle, D.; Ganzer, U.; Heinrich, N.; Muller, K.R. Predicting Lipophilicity of Drug-Discovery Molecules using Gaussian Process Models. ChemMedChem 2007, 2, 1265–1267, DOI: 10.1002/cmdc.200700041
Tetko, I.V.; Poda, G.I. Application of ALOGPS 2.1 to Predict log D Distribution Coefficient for Pfizer Proprietary Compounds. J. Med. Chem. 2004, 47, 23, 5601–5604, DOI: 10.1021/jm049509l.
Wang, J.-B.; Cao, D.-S.; Zhu, M.-F.; Yun, Y.-H.; Xiao, N.; Liang, Y.-Z. In Silico Evaluation of logD7.4 and Comparison with Other Prediction Methods. Journal of Chemometrics 2015, 29, 389–398, DOI: 10.1002/cem.2718.
Wu, Z.; Ramsundar, B.; Feinberg, E. N.; Gomes, J.; Geniesse, C.; Pappu, A. S.; Leswing, K.; Pande, V. MoleculeNet: A Benchmark for Molecular Machine Learning. Chemical Science 2018, 9, 513–530, DOI: 10.1039/C7SC02664A.