Log P Calculated by CDD Vault During Chemical Registration

Users may register new Molecules into CDD Vault either manually, one at a time, through the interface, or by using the Data Import wizard. Whichever mechanism is used, a set of chemical properties is automatically calculated by CDD Vault for every chemical structure (small molecule) registered. Here are details on the Log P model used in CDD Vault.

Log P Model (2023.03)

Peter Gedeck, CDD Research Informatics

The development of the logP model was based on the method described in Gedeck et al. (2017). The model predicts logP with a median absolute error of 0.245, a mean absolute error of 0.41, and a root mean squared error of 0.71. This performance is comparable to results reported in the literature for other methods.

One limitation of the current implementation is that, for structures with tautomeric groups, predictions can vary depending on how tautomers are drawn. While we incorporate some normalization of tautomers, the implementation is not complete. We are currently exploring approaches to develop tautomer-independent fingerprints, which we expect to improve the model in the future. Proprietary data can be incorporated during the development process in a confidential manner.

Further details

Dataset
The model was developed using data from several public datasets. The PubChem entry below combines logP values reported in about 200 PubChem result sets (Kim et al. 2023), with outliers identified and corrected.

Dataset                          Size      Range
Ulrich et al. (2021)             23,391    -5.1 to 11.3
OpenChem (Popova et al. 2018)    14,176    -5.4 to 11.3
Mansouri et al. (2016)           13,839    -12.0 to 11.3
Cui et al. (2020)                 9,789    -10.9 to 20.9
PubChem (2023-01)                 2,446    -3.5 to 14.6
Martel et al. (2013)                707    0.3 to 7.0
Bergazin et al. (2021, SAMPL7)       22    0.8 to 3.0
Francisco et al. (2021)              15    0.6 to 2.7
SAMPL6                               11    2.0 to 4.1
Combined                         35,984    -12.0 to 20.9


After cleanup and handling of duplicate values, the combined dataset contains 35,984 data points for building the logP model. Each chemical structure is preprocessed as follows:

  • Explicit hydrogens are removed
  • Functional groups are normalized
  • A limited set of tautomers is normalized
  • Acids and bases are protonated/deprotonated
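These steps can be sketched with RDKit's standardization utilities. This is a rough sketch under assumptions: the exact normalization rules and protonation/deprotonation conventions used by CDD Vault are not described here, and the `Uncharger` below simply neutralizes charges rather than applying a specific pKa convention.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def preprocess(smiles: str) -> str:
    """Approximate the preprocessing steps: remove explicit hydrogens,
    normalize functional groups and tautomers, neutralize acids/bases.
    (Illustrative only; not the exact CDD Vault pipeline.)"""
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.RemoveHs(mol)                       # drop explicit hydrogens
    mol = rdMolStandardize.Normalize(mol)          # normalize functional groups
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)  # canonical tautomer
    mol = rdMolStandardize.Uncharger().uncharge(mol)  # neutralize acids and bases
    return Chem.MolToSmiles(mol)

print(preprocess("OC(=O)c1ccccc1[NH3+]"))  # protonated amine is neutralized
```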


Descriptors

The model is a hierarchical linear regression trained on counts of Morgan fragments with radii of 0, 1, and 2.


Model training and validation

The descriptors were used to build Bayesian ridge regression models in a hierarchical process. The initial model is trained using Morgan-0 fragment counts to predict logP (level 0). This model is then refined by training on the residuals of the level 0 predictions using counts of Morgan fragments of radius 1. To reduce the size of the regression problem, fragments were combined based on their model coefficients before the level 1 model was trained. Similarly, Morgan fragments of radius 2 were used to train a level 2 model that corrects the error of the combined level 0 and level 1 predictions; this time, fragments were merged based on coefficients from univariate linear regression models. This approach groups the more than 60,000 unique fragments into 97 Morgan-0 fragments, 399 groups of Morgan-1 fragments, and 379 groups of Morgan-2 fragments.
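The residual-fitting hierarchy can be sketched with plain ridge regression on synthetic count data. This is a toy stand-in: the production model uses Bayesian ridge regression, real Morgan fragment counts, and the coefficient-based fragment grouping described above, all of which are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression (a simple stand-in for Bayesian ridge)."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

# Toy stand-ins for Morgan fragment counts at radii 0, 1, and 2
# (synthetic data; the real model uses tens of thousands of fragments).
n = 200
X0 = rng.integers(0, 4, size=(n, 5)).astype(float)
X1 = rng.integers(0, 3, size=(n, 8)).astype(float)
X2 = rng.integers(0, 2, size=(n, 10)).astype(float)
y = X0 @ rng.normal(size=5) + 0.3 * (X1 @ rng.normal(size=8))
y += rng.normal(scale=0.1, size=n)

# Level 0: predict logP from Morgan-0 counts.
w0 = ridge_fit(X0, y)
resid0 = y - X0 @ w0

# Level 1: fit the level-0 residuals using Morgan-1 counts.
w1 = ridge_fit(X1, resid0)
resid1 = resid0 - X1 @ w1

# Level 2: correct the remaining error using Morgan-2 counts.
w2 = ridge_fit(X2, resid1)

def predict(x0, x1, x2):
    """Sum of the level 0, 1, and 2 contributions."""
    return x0 @ w0 + x1 @ w1 + x2 @ w2

rmse = float(np.sqrt(np.mean((y - predict(X0, X1, X2)) ** 2)))
rmse_level0 = float(np.sqrt(np.mean(resid0 ** 2)))
```

Because each level minimizes a regularized least-squares objective on the previous level's residuals, the training error can only decrease (or stay equal) as levels are added.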


The five-fold cross validation results for this model are:

  • Median absolute error: MedAE = 0.245
  • Mean absolute error: MAE = 0.410
  • Root mean squared error: RMSE = 0.715
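For reference, these error metrics can be computed from a set of predictions as follows (a minimal sketch with made-up values; the actual cross-validation predictions are not reproduced here):

```python
import math
from statistics import median

def error_metrics(y_true, y_pred):
    """MedAE, MAE, and RMSE of a set of predictions."""
    errs = [abs(t - p) for t, p in zip(y_true, y_pred)]
    medae = median(errs)
    mae = sum(errs) / len(errs)
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    return medae, mae, rmse

def fraction_within(y_true, y_pred, cutoff):
    """Fraction of predictions within `cutoff` log units of experiment."""
    errs = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(e <= cutoff for e in errs) / len(errs)

# Hypothetical experimental and predicted logP values
y_true = [1.0, 2.5, -0.3, 3.2]
y_pred = [1.2, 2.0, -0.1, 3.3]
medae, mae, rmse = error_metrics(y_true, y_pred)
```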

While the performance metrics are a useful measure of the overall quality of the model, it is also informative to look at the distribution of the errors. 75% of all predictions are within 0.5 log units of the experimental value, and 92% are within 1 log unit. Only 2% have errors greater than 2 log units. The following figure shows predicted versus experimental logP values.

[Figure LogP_PvE.PNG: predicted versus experimental logP values]

Comparison with Literature

As the datasets used to train this model are publicly available, many literature models were trained on the same data. When comparing our model to reported results, it is therefore useful to also keep in mind the performance of the model on its training data: MedAE = 0.21, MAE = 0.27, and RMSE = 0.36.

Lenselink et al. (2021) report RMSE values of 0.67 for XlogP3 and 0.40 for S+logP. Ulrich et al. (2021) report RMSE values for various software packages ranging from 0.34 (OCHEM) to 0.97 (COSMO-RS). Schroeter et al. (2007) trained logP models with a variety of linear and non-linear regression techniques. They report MAE values on a public dataset ranging from 0.38 to 0.59 and RMSE values ranging from 0.66 to 0.89. They compare their results with a variety of commercial models with MAE between 0.25 and 0.76 and RMSE between 0.9 and 1.32; the large discrepancy between MAE and RMSE was due to outliers.

All in all, the performance of our logP model compares favorably to other models.


References

Bergazin, Teresa Danielle, Nicolas Tielker, Yingying Zhang, Junjun Mao, M. R. Gunner, Karol Francisco, Carlo Ballatore, Stefan M. Kast, and David L. Mobley. “Evaluation of Log P, PKa, and Log D Predictions from the SAMPL7 Blind Challenge.” Journal of Computer-Aided Molecular Design 35, no. 7 (July 1, 2021): 771–802. https://doi.org/10.1007/s10822-021-00397-3.

Cui, Qiuji, Shuai Lu, Bingwei Ni, Xian Zeng, Ying Tan, Ya Dong Chen, and Hongping Zhao. “Improved Prediction of Aqueous Solubility of Novel Compounds by Going Deeper With Deep Learning.” Frontiers in Oncology 10 (February 11, 2020): 121. https://doi.org/10.3389/fonc.2020.00121.

Francisco, Karol R., Carmine Varricchio, Thomas J. Paniak, Marisa C. Kozlowski, Andrea Brancale, and Carlo Ballatore. “Structure Property Relationships of N-Acylsulfonamides and Related Bioisosteres.” European Journal of Medicinal Chemistry 218 (June 5, 2021): 113399. https://doi.org/10.1016/j.ejmech.2021.113399.

Gedeck, P., S. Skolnik, and S. Rodde. “Developing Collaborative QSAR Models Without Sharing Structures.” Journal of Chemical Information and Modeling, 2017. https://doi.org/10.1021/acs.jcim.7b00315.

Kim, Sunghwan, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, et al. “PubChem 2023 Update.” Nucleic Acids Research 51, no. D1 (January 6, 2023): D1373–80. https://doi.org/10.1093/nar/gkac956.

Lenselink, Eelke B., and Pieter F. W. Stouten. “Multitask Machine Learning Models for Predicting Lipophilicity (LogP) in the SAMPL7 Challenge.” Journal of Computer-Aided Molecular Design 35, no. 8 (2021): 901–9. https://doi.org/10.1007/s10822-021-00405-6.

Mansouri, K., C. M. Grulke, A. M. Richard, R. S. Judson, and A. J. Williams. “An Automated Curation Procedure for Addressing Chemical Errors and Inconsistencies in Public Datasets Used in QSAR Modelling.” SAR and QSAR in Environmental Research 27, no. 11 (November 1, 2016): 911–37. https://doi.org/10.1080/1062936X.2016.1253611.

Martel, Sophie, Fabrice Gillerat, Emanuele Carosati, Daniele Maiarelli, Igor V. Tetko, Raimund Mannhold, and Pierre-Alain Carrupt. “Large, Chemically Diverse Dataset of Log P Measurements for Benchmarking Studies.” European Journal of Pharmaceutical Sciences 48, no. 1–2 (January 23, 2013): 21–29. https://doi.org/10.1016/j.ejps.2012.10.019.

Popova, Mariya, Olexandr Isayev, and Alexander Tropsha. “Deep Reinforcement Learning for de Novo Drug Design.” Science Advances 4, no. 7 (July 25, 2018). https://doi.org/10.1126/sciadv.aap7885.

Schroeter, Timon, Anton Schwaighofer, Sebastian Mika, Antonius Ter Laak, Detlev Suelzle, Ursula Ganzer, Nikolaus Heinrich, and Klaus-Robert Müller. “Machine Learning Models for Lipophilicity and Their Domain of Applicability.” Molecular Pharmaceutics 4, no. 4 (August 1, 2007): 524–38. https://doi.org/10.1021/mp0700413.

Ulrich, Nadin, Kai-Uwe Goss, and Andrea Ebert. “Exploring the Octanol–Water Partition Coefficient Dataset Using Deep Learning Techniques and Data Augmentation.” Communications Chemistry 4, no. 1 (June 14, 2021): 1–10. https://doi.org/10.1038/s42004-021-00528-9.