
Log P Calculated by CDD Vault During Chemical Registration

Users can register new Molecules in CDD Vault either manually, one at a time through the interface, or with the Data Import wizard. Whichever mechanism is used, CDD Vault automatically calculates a set of chemical properties for every chemical structure (small molecule) registered. Here are details on the Log P model used in CDD Vault.

Log P Model

Peter Gedeck, CDD Research Informatics

The logP model was developed using the method described in Gedeck et al. (2017). The model predicts logP with a median absolute error of 0.26, a mean absolute error of 0.36, and a root mean squared error of 0.53. This performance is comparable to results reported in the literature for other methods.

The approach used to develop the model allows us to incorporate additional experimental data without having access to structural information (see Gedeck et al., 2017). One limitation of the current implementation is that, for structures with tautomeric groups, predictions can vary depending on how the tautomers are drawn. While we normalize some tautomers, the implementation is not complete. We are currently exploring approaches to develop tautomer-independent fingerprints, which we expect to improve the model in the future.

 

Further details

Dataset
The model was developed using data from several public datasets.

Dataset                   Size      logP range
Mansouri et al. (2016)    13,752    -12 to 11.3
Martel et al. (2013)      707       0.3 to 7.0
OpenChem                  14,176    -5.4 to 11.3
Combined                  17,430    -12 to 11.3

 

After cleanup and removal of duplicate values, the combined dataset contains 17,430 data points for building the logP model. Each chemical structure is preprocessed as follows:

  • Explicit hydrogens are removed
  • Functional groups are normalized
  • A few tautomers are normalized
  • Acids and bases are protonated/deprotonated

 

Descriptors
A variety of fingerprints implemented in RDKit were explored:

  • alogP fragment
  • Morgan fragments with radius 1 or 2
  • RDKit path fingerprint with lengths 3 to 7

In all cases, fragment counts were used as descriptors.
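To make the idea of count descriptors concrete, here is a minimal sketch that turns per-molecule fragment lists into a count matrix. The fragment labels and the `count_matrix` helper are hypothetical and chosen only for illustration. In the model described here, the fragments come from RDKit fingerprints, which this sketch does not call.

```python
from collections import Counter


def count_matrix(fragment_lists):
    """Turn per-molecule fragment lists into a matrix of fragment counts.

    Returns (vocabulary, rows). The vocabulary is the sorted list of all
    fragment identifiers seen; each row holds one molecule's counts in
    vocabulary order.
    """
    counts = [Counter(fragments) for fragments in fragment_lists]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[fragment] for fragment in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical fragment identifiers for two molecules (illustration only).
molecules = [
    ["C-C", "C-C", "C-O"],
    ["C-O", "C=O", "C-N", "C-N", "C-N"],
]
vocabulary, rows = count_matrix(molecules)

# A presence/absence (bit) descriptor discards how often a fragment occurs.
bits = [[1 if value else 0 for value in row] for row in rows]
```

With counts, a molecule containing a fragment three times is distinguishable from one containing it once. A bit vector would treat both the same.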

 

Model training and validation


The descriptors were used to build Bayesian ridge regression models. Bayesian ridge regression is a form of regularized linear regression. Other modeling approaches that were explored did not lead to further improvement.

Based on five-fold cross-validation results, Morgan fingerprints with radius 2 (Morgan-2) gave the best performance.
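The following is a minimal sketch of five-fold cross-validation around a regularized linear model. It is not the model described here. It assumes a single made-up descriptor and a plain ridge fit with a fixed penalty `alpha`, whereas Bayesian ridge regression estimates the regularization strength from the data. All function names and data values are hypothetical.

```python
import random
import statistics


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def fit_ridge_1d(x, y, alpha=1.0):
    """Fit y ~ intercept + slope * x with an L2 penalty on the slope."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / (sxx + alpha)  # alpha shrinks the slope toward zero
    return mean_y - slope * mean_x, slope


def cross_val_predict(x, y, n_folds=5, alpha=1.0, seed=0):
    """Return one out-of-fold prediction per sample."""
    predictions = [None] * len(x)
    for test_idx in k_fold_indices(len(x), n_folds, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(x)) if i not in held_out]
        intercept, slope = fit_ridge_1d(
            [x[i] for i in train], [y[i] for i in train], alpha
        )
        # Each sample is predicted by a model that never saw it.
        for i in test_idx:
            predictions[i] = intercept + slope * x[i]
    return predictions


# Made-up data: one descriptor with a roughly linear response.
x = [float(i) for i in range(20)]
y = [0.5 * xi - 2.0 + (0.1 if i % 2 else -0.1) for i, xi in enumerate(x)]
oof = cross_val_predict(x, y)
mae = statistics.fmean([abs(p - t) for p, t in zip(oof, y)])
```

Because every prediction comes from a model trained without that sample, error metrics computed on `oof` estimate performance on unseen data rather than on the training set.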

  • Median absolute error: MedAE = 0.26
  • Mean absolute error: MAE = 0.36
  • Root mean squared error: RMSE = 0.53
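For reference, the three metrics above can be computed from paired experimental and predicted values as in the sketch below. The `error_metrics` helper and the data are made up for illustration and are not taken from the model's validation set.

```python
import math
import statistics


def error_metrics(experimental, predicted):
    """Return MedAE, MAE, and RMSE for paired experimental/predicted values."""
    if len(experimental) != len(predicted) or not experimental:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_errors = [abs(p - e) for e, p in zip(experimental, predicted)]
    return {
        "MedAE": statistics.median(abs_errors),
        "MAE": statistics.fmean(abs_errors),
        "RMSE": math.sqrt(statistics.fmean([err * err for err in abs_errors])),
    }


# Made-up values; the last prediction is a deliberate outlier.
experimental = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.8, 3.3, 4.0, 7.0]
metrics = error_metrics(experimental, predicted)
```

In this example the single outlier barely moves MedAE, raises MAE moderately, and dominates RMSE. This is why the three values are usually reported together.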

 

While the performance metrics are a useful measure of the overall quality of the model, it is also useful to look at the distribution of the errors. Of all predictions, 87% are within 0.5 log units of the experimental value and 98.9% are within 1 log unit. Only 0.1% of predictions have errors greater than 2 log units. The following figure shows predicted versus experimental logP values.

[Figure: predicted versus experimental logP values (LogP1.PNG)]
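An error distribution summary like the one quoted above can be tabulated as the fraction of predictions whose absolute error falls within each threshold. The sketch below uses made-up values and a hypothetical `fraction_within` helper. It treats "within" as less than or equal to the threshold, which is an assumption.

```python
def fraction_within(experimental, predicted, thresholds=(0.5, 1.0, 2.0)):
    """Return, per threshold, the fraction of absolute errors <= threshold."""
    abs_errors = [abs(p - e) for e, p in zip(experimental, predicted)]
    if not abs_errors:
        raise ValueError("at least one prediction is required")
    n = len(abs_errors)
    return {t: sum(err <= t for err in abs_errors) / n for t in thresholds}


# Made-up experimental and predicted logP values (illustration only).
experimental = [0.5, 1.2, 2.0, 3.1, 4.4, -0.3, 2.7, 5.0]
predicted = [0.7, 1.0, 2.6, 3.0, 4.5, 0.9, 2.8, 2.5]
coverage = fraction_within(experimental, predicted)
```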

Comparison with Literature

Because the datasets used to train this model are publicly available, most models reported in the literature were trained with the same data. To compare our model with reported results, it is therefore useful to also keep in mind the model's performance on the training data. In this case, the performance metrics are MedAE = 0.21, MAE = 0.27, and RMSE = 0.36.

Lenselink et al. (2021) report RMSE values of 0.67 for XlogP3 and 0.40 for S+logP. Ulrich et al. (2021) report RMSE values for various software packages ranging from 0.34 (OCHEM) to 0.97 (COSMO-RS). Schroeter et al. (2007) trained logP models with a variety of linear and non-linear regression techniques. They report MAE values on a public dataset ranging from 0.38 to 0.59 and RMSE values ranging from 0.66 to 0.89. They compare their results with a variety of commercial models that have MAE values between 0.25 and 0.76 and RMSE values between 0.9 and 1.32. The large discrepancy between MAE and RMSE was due to outliers.

All in all, the performance of our logP model compares favorably to other models. 

References

Gedeck, P., S. Skolnik, and S. Rodde. “Developing Collaborative QSAR Models Without Sharing Structures.” Journal of Chemical Information and Modeling (2017). https://doi.org/10.1021/acs.jcim.7b00315.

Lenselink, Eelke B., and Pieter F. W. Stouten. “Multitask Machine Learning Models for Predicting Lipophilicity (LogP) in the SAMPL7 Challenge.” Journal of Computer-Aided Molecular Design 35, no. 8 (2021): 901–9. https://doi.org/10.1007/s10822-021-00405-6.

Mansouri, K., C. M. Grulke, A. M. Richard, R. S. Judson, and A. J. Williams. “An Automated Curation Procedure for Addressing Chemical Errors and Inconsistencies in Public Datasets Used in QSAR Modelling.” SAR and QSAR in Environmental Research 27, no. 11 (November 1, 2016): 911–37. https://doi.org/10.1080/1062936X.2016.1253611.

Martel, Sophie, Fabrice Gillerat, Emanuele Carosati, Daniele Maiarelli, Igor V. Tetko, Raimund Mannhold, and Pierre-Alain Carrupt. “Large, Chemically Diverse Dataset of Log P Measurements for Benchmarking Studies.” European Journal of Pharmaceutical Sciences 48, no. 1–2 (January 23, 2013): 21–29. https://doi.org/10.1016/j.ejps.2012.10.019.

Popova, Mariya, Olexandr Isayev, and Alexander Tropsha. “Deep Reinforcement Learning for de Novo Drug Design.” Science Advances 4, no. 7 (July 25, 2018). https://doi.org/10.1126/sciadv.aap7885.

Schroeter, Timon, Anton Schwaighofer, Sebastian Mika, Antonius Ter Laak, Detlev Suelzle, Ursula Ganzer, Nikolaus Heinrich, and Klaus-Robert Müller. “Machine Learning Models for Lipophilicity and Their Domain of Applicability.” Molecular Pharmaceutics 4, no. 4 (August 1, 2007): 524–38. https://doi.org/10.1021/mp0700413.

Ulrich, Nadin, Kai-Uwe Goss, and Andrea Ebert. “Exploring the Octanol–Water Partition Coefficient Dataset Using Deep Learning Techniques and Data Augmentation.” Communications Chemistry 4, no. 1 (June 14, 2021): 1–10. https://doi.org/10.1038/s42004-021-00528-9.