
pKa Calculated by CDD Vault During Chemical Registration

Users may register new Molecules into CDD Vault either manually, one at a time through the interface, or by using the Data Import wizard. Whichever mechanism is used, CDD Vault automatically calculates a set of chemical properties for every chemical structure (small molecule) registered. This article describes the pKa calculation used in CDD Vault.

pKa model (2023.03)

Peter Gedeck, CDD Research Informatics

The pKa model was developed using an approach similar to the method described in Lu et al. (2019). The mean absolute error obtained using cross-validation is between 0.46 and 0.84. The predictive performance of the model is in line with results reported for other models.

 

Further details

Dataset
The model was developed using data from several public datasets.

Dataset                      Acids    Bases    Mixed
Baltruschat et al. (2020)    3,372    4,248        0
Mansouri et al. (2019a)      4,051    3,402       43
Mansouri et al. (2019)       2,680    3,104      435
Datawarrior (2015)           2,360    2,889      366
Hunt (2020)                    972    1,410        0
Yu (2010)                      580      563        0
Settimo (2014)                 118      460       45
CDD internal (2023)            174       94        0
AID781326                       49       73        0
AID781327                        0      173        0
SpiroKit (2023)                  0       66        0
Francisco (2021)                45        0        0
Jensen (2017)                    2       41        0
Franz (2001)                    36        0        0
SAMPL6 (Hunt, 2020)              5       15        4
SAMPL7                          22        0        0
Manual curation                563      566        0
Combined                     7,701    8,589    1,039

 

Each chemical structure is preprocessed as follows:

  • Explicit hydrogens are removed
  • Functional groups are normalized
  • A few tautomers are normalized
  • Acids and bases are protonated/deprotonated

In several cases only an acid or only a base pKa value was reported, but in many cases we identified several ionization centers. To derive an initial model (see below), we restricted the initial training to structures with a single ionization center, which gave 6,541 acids and 7,585 bases.

 

Ionization Sites

To identify acidic and basic ionization sites, we use a set of SMARTS patterns, which we categorized into several subsets. Model training showed that splitting the dataset into these subsets and training an individual model for each is beneficial.

 

Descriptors

Following the results from the work by Lu et al. (2019), we describe ionization sites using rooted topological torsions with lengths between 1 and 7 bonds. We modified this approach by reducing the rooted topological torsions to shortest paths only.
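The idea of describing a site by shortest paths rooted at the ionization center can be sketched as follows. This is a minimal illustration, not the CDD Vault or Lu et al. implementation: the molecule is a hypothetical adjacency structure, atoms are typed only by element symbol, and when several shortest paths lead to the same atom only the first one found by breadth-first search is kept.

```python
from collections import deque

def rooted_shortest_paths(atoms, bonds, root, max_bonds=7):
    """Return label paths from `root` to every atom within `max_bonds` bonds.

    atoms: dict of atom index -> label (assumed typing: element symbol only)
    bonds: iterable of (i, j) atom index pairs
    """
    neighbors = {i: [] for i in atoms}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Breadth-first search visits atoms in order of increasing bond distance,
    # so the first path that reaches an atom is a shortest path.
    paths = {root: (root,)}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        if len(paths[current]) - 1 == max_bonds:
            continue  # do not extend paths beyond the maximum length
        for nxt in sorted(neighbors[current]):
            if nxt not in paths:
                paths[nxt] = paths[current] + (nxt,)
                queue.append(nxt)

    # Encode each path (lengths 1..max_bonds) as a string of atom labels.
    return sorted(
        "-".join(atoms[i] for i in path)
        for atom, path in paths.items()
        if atom != root
    )

# Heavy atoms of ethylamine, N(0)-C(1)-C(2), rooted at the basic nitrogen.
atoms = {0: "N", 1: "C", 2: "C"}
bonds = [(0, 1), (1, 2)]
features = rooted_shortest_paths(atoms, bonds, root=0)
short_features = rooted_shortest_paths(atoms, bonds, root=0, max_bonds=1)
print(features)
```

A real descriptor would use a richer atom typing than the element symbol; the sketch only shows how rooting at the ionization site and keeping shortest paths limits the number of features per site.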

To avoid overtraining the model, we reduced the number of fingerprint features by requiring each feature to occur in several training set structures. This frequency cutoff depends on the path length. The pruning reduces the size of the fingerprint by 70%.
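A length-dependent frequency cutoff of this kind could look like the sketch below. The cutoff values, the string encoding of features, and the function name are assumptions for illustration; the actual cutoffs are not given here.

```python
from collections import Counter

def prune_features(fingerprints, min_count_by_length):
    """Keep only features that occur in enough training structures.

    fingerprints: list of sets, one set of feature strings per structure
    min_count_by_length: dict of path length in bonds -> minimum number of
        structures (hypothetical values chosen for this example)
    """
    # Each structure contributes at most once per feature because it is a set.
    counts = Counter(f for fp in fingerprints for f in fp)

    def length(feature):
        # Assumed encoding: labels joined by "-", so bonds = number of dashes.
        return feature.count("-")

    kept = {
        f for f, n in counts.items()
        if n >= min_count_by_length[length(f)]
    }
    return [fp & kept for fp in fingerprints], kept

fingerprints = [
    {"N-C", "N-C-C"},
    {"N-C", "N-C-O"},
    {"N-C", "N-C-C"},
]
pruned, kept = prune_features(fingerprints, {1: 2, 2: 2})
print(sorted(kept))
```

In this toy set the feature "N-C-O" occurs in only one structure and is dropped, while the other two features are retained.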

 

Model training and validation

We trained individual models for each subset. Each model was trained using an iterative process. The initial model was trained using only the data points with a single ionization center and a single experimental pKa value. In the next step, the model predictions were used to assign experimental pKa values to ionization sites that were ignored in the initial model. An assignment was made if the prediction was within one pKa unit of the experimental value. This iterative process greatly increased the number of data points that could be used for model building. We repeated this process six times.
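The iterative assignment can be sketched as below. This is a simplified illustration under several assumptions: the "model" is a toy stand-in that predicts the mean pKa per site type, the data layout and function names are hypothetical, and "within one pKa unit" is read as an absolute difference of at most 1.0.

```python
from statistics import mean

def train(data):
    """Toy stand-in model: mean pKa per site type (not the actual learner)."""
    by_type = {}
    for site_type, pka in data:
        by_type.setdefault(site_type, []).append(pka)
    return {site_type: mean(values) for site_type, values in by_type.items()}

def assign(model, molecule, tolerance=1.0):
    """Assign experimental values to sites predicted within `tolerance`."""
    assigned = []
    remaining = list(molecule["experimental"])
    for site_type in molecule["sites"]:
        if site_type not in model or not remaining:
            continue
        predicted = model[site_type]
        closest = min(remaining, key=lambda value: abs(value - predicted))
        if abs(closest - predicted) <= tolerance:
            assigned.append((site_type, closest))
            remaining.remove(closest)
    return assigned

def iterate(single_site_data, multi_site_molecules, rounds=6):
    """Retrain, then reassign multi-site data with the latest model."""
    data = list(single_site_data)
    for _ in range(rounds):
        model = train(data)
        data = list(single_site_data)
        for molecule in multi_site_molecules:
            data.extend(assign(model, molecule))
    return train(data), data

single = [
    ("carboxylic_acid", 4.2), ("carboxylic_acid", 4.8),
    ("aliphatic_NH2", 10.5), ("aliphatic_NH2", 9.9),
]
multi = [
    {"sites": ["carboxylic_acid", "aliphatic_NH2"], "experimental": [4.0, 9.5]},
    {"sites": ["carboxylic_acid", "aliphatic_NH2"], "experimental": [2.3, 9.7]},
]
model, data = iterate(single, multi)
print(len(data))
```

In this toy run the value 2.3 is never assigned because no prediction comes within one unit of it, whereas the other three experimental values from the multi-site molecules are added to the training data.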

 

The performance metrics for 5-fold cross-validation training of the final iteration are:

Type                 Size    RMSE     MAE    MedAE
carboxylic_acid      4688    0.290    0.158    0.079
aromatic_nitrogen    3469    0.609    0.334    0.194
aromatic_alcohol     2333    0.538    0.294    0.163
aliphatic_alcohol     906    1.165    0.540    0.163
thiol                 443    0.411    0.195    0.062
aliphatic_NH2        2956    0.347    0.175    0.077
aliphatic_NH         2404    0.440    0.264    0.147
aliphatic_N          4785    0.404    0.245    0.138
aliphatic_Nsp2        813    0.598    0.310    0.102
acid                 2683    0.764    0.345    0.141
base                   71    1.594    0.533    0.039
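For reference, the three error metrics reported above (root mean squared error, mean absolute error, and median absolute error) can be computed as in this short sketch. The function name and the example values are illustrative only and are not taken from the model's data.

```python
from math import sqrt
from statistics import mean, median

def regression_errors(experimental, predicted):
    """Return RMSE, MAE and MedAE for paired experimental/predicted values."""
    errors = [p - e for e, p in zip(experimental, predicted)]
    abs_errors = [abs(x) for x in errors]
    return {
        "RMSE": sqrt(mean([x * x for x in errors])),
        "MAE": mean(abs_errors),
        "MedAE": median(abs_errors),
    }

# Two small errors and one large outlier (made-up values).
metrics = regression_errors([4.0, 9.5, 7.0], [4.5, 9.0, 9.0])
print(metrics)
```

A single large error raises RMSE more than MAE and leaves MedAE almost unchanged, which is why a subset can show a high RMSE together with a low MedAE.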

 

The following figure shows the predicted versus experimental pKa values for the final model. The left graph shows the predictions on the training data and the right graph shows the cross-validated predictions.

[Figure: pka_PvE.PNG]

Most of the predictions are within one pKa unit of the experimental value.

 

Comparison to Literature

There are several studies in the literature that compare different methods. Kalliokoski and Sinervo (2019) looked at four different commercial products (Simulations Plus ADMET-Predictor S+pKa, ACD/Labs Percepta Classic, ACD/Labs Percepta GALAS and Epik). The best method gave a median absolute error of 0.69. Settimo et al. (2014) obtained similar results. They reported median absolute errors between 0.3 and 0.8 for several methods (Chemaxon, Epik, ACD). Yu et al. (2010) compared methods from SPARC and ACD and obtained mean absolute errors of 0.22 to 0.43. Hunt et al. (2020) developed various models and reported MAE values between 0.65 and 1.43. Baltruschat and Czodrowski (2020) trained models for pKa prediction and compared them against the performance of Chemaxon. They report MAE of 0.532 (internal model) and 0.57 (Chemaxon) for public data and MAE of 1.15 and 0.86 for Novartis internal data. All these studies show a wide range of performance metrics. The performance of our model compares well to the other methods. 

It is useful to point out the drop of 0.3 to 0.6 pKa units reported by Baltruschat and Czodrowski when comparing performance of models between public data and the Novartis internal dataset. The quality of pKa predictions greatly depends on how well an ionization site is described in the training set. The longitudinal study by Gedeck et al. (2015) observed model performance over time for a very large Novartis internal dataset. They showed that even if models are trained on internal data, the appearance of new structural features leads to a drop in model performance.
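The quoted drop of 0.3 to 0.6 pKa units follows directly from the MAE values cited above for Baltruschat and Czodrowski (2020), as this small calculation shows. The dictionary layout is only for illustration.

```python
# MAE values quoted above from Baltruschat and Czodrowski (2020).
mae = {
    "internal model": {"public": 0.532, "Novartis": 1.15},
    "Chemaxon": {"public": 0.57, "Novartis": 0.86},
}

# Increase in MAE when moving from public data to the Novartis internal data.
drops = {
    name: round(values["Novartis"] - values["public"], 3)
    for name, values in mae.items()
}
for name, drop in drops.items():
    print(f"{name}: MAE increases by {drop} pKa units")
```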

 

 

References

  • AID 781326. National Center for Biotechnology Information. “PubChem Bioassay Record for AID 781326, Source: ChEMBL.” PubChem. Accessed December 10, 2022. https://pubchem.ncbi.nlm.nih.gov/bioassay/781326.
  • AID 781327. National Center for Biotechnology Information. “PubChem Bioassay Record for AID 781327, Source: ChEMBL.” PubChem. Accessed December 10, 2022. https://pubchem.ncbi.nlm.nih.gov/bioassay/781327.
  • Baltruschat, Marcel, and Paul Czodrowski. “Machine Learning Meets pKa.” F1000Research 9 (2020): Chem Inf Sci-113. https://doi.org/10.12688/f1000research.22090.2.
  • Francisco, Karol R., Carmine Varricchio, Thomas J. Paniak, Marisa C. Kozlowski, Andrea Brancale, and Carlo Ballatore. “Structure Property Relationships of N-Acylsulfonamides and Related Bioisosteres.” European Journal of Medicinal Chemistry 218 (June 5, 2021): 113399. https://doi.org/10.1016/j.ejmech.2021.113399.
  • Franz, Robert G. “Comparisons of PKa and Log P Values of Some Carboxylic and Phosphonic Acids: Synthesis and Measurement.” AAPS PharmSci 3, no. 2 (June 1, 2001): 1. https://doi.org/10.1208/ps030210.
  • Gedeck, Peter, Yipin Lu, Suzanne Skolnik, Stephane Rodde, Gavin Dollinger, Weiping Jia, Giuliano Berellini, Riccardo Vianello, Bernard Faller, and Franco Lombardo. “Benefit of Retraining pKa Models Studied Using Internally Measured Data.” Journal of Chemical Information and Modeling 55, no. 7 (July 27, 2015): 1449–59. https://doi.org/10.1021/acs.jcim.5b00172.
  • Hunt, Peter, Layla Hosseini-Gerami, Tomas Chrien, Jeffrey Plante, David J. Ponting, and Matthew Segall. “Predicting pKa Using a Combination of Semi-Empirical Quantum Mechanics and Radial Basis Function Methods.” Journal of Chemical Information and Modeling 60, no. 6 (June 22, 2020): 2989–97. https://doi.org/10.1021/acs.jcim.0c00105.
  • Jensen, Jan H., Christopher J. Swain, and Lars Olsen. “Prediction of PKa Values for Druglike Molecules Using Semiempirical Quantum Chemical Methods.” The Journal of Physical Chemistry A 121, no. 3 (January 26, 2017): 699–707. https://doi.org/10.1021/acs.jpca.6b10990.
  • Kalliokoski, Tuomo, and Kai Sinervo. “Predicting pKa for Small Molecules on Public and In-House Datasets Using Fast Prediction Methods Combined with Data Fusion.” Molecular Informatics 38, no. 7 (2019): 1800163. https://doi.org/10.1002/minf.201800163.
  • Lu, Yipin, Shankara Anand, William Shirley, Peter Gedeck, Brian P. Kelley, Suzanne Skolnik, Stephane Rodde, Mai Nguyen, Mika Lindvall, and Weiping Jia. “Prediction of pKa Using Machine Learning Methods with Rooted Topological Torsion Fingerprints: Application to Aliphatic Amines.” Journal of Chemical Information and Modeling 59, no. 11 (November 25, 2019): 4706–19. https://doi.org/10.1021/acs.jcim.9b00498.
  • Mansouri, Kamel, Neal F. Cariello, Alexandru Korotcov, Valery Tkachenko, Chris M. Grulke, Catherine S. Sprankle, David Allen, Warren M. Casey, Nicole C. Kleinstreuer, and Antony J. Williams. “Open-Source QSAR Models for pKa Prediction Using Multiple Machine Learning Approaches.” Journal of Cheminformatics 11, no. 1 (September 18, 2019): 60. https://doi.org/10.1186/s13321-019-0384-1.
  • Mansouri, Kamel, Neal F. Cariello, Alexandru Korotcov, Valery Tkachenko, Chris M. Grulke, Catherine S. Sprankle, David Allen, Warren M. Casey, Nicole C. Kleinstreuer, and Antony J. Williams. “MOESM1 of Open-Source QSAR Models for PKa Prediction Using Multiple Machine Learning Approaches.” figshare, September 19, 2019. https://doi.org/10.6084/m9.figshare.9877349.v1.
  • Sander, Thomas, Joel Freyss, Modest von Korff, and Christian Rufener. “DataWarrior: An Open-Source Program For Chemistry Aware Data Visualization And Analysis.” Journal of Chemical Information and Modeling 55, no. 2 (February 23, 2015): 460–73. https://doi.org/10.1021/ci500588j.
  • Settimo, Luca, Krista Bellman, and Ronald M. A. Knegtel. “Comparison of the Accuracy of Experimental and Predicted pKa Values of Basic and Acidic Compounds.” Pharmaceutical Research 31, no. 4 (April 1, 2014): 1082–95. https://doi.org/10.1007/s11095-013-1232-z.
  • SpiroChem. “SpiroKits | SpiroChem | Tailor-Made Molecules.” Accessed March 21, 2023. https://www.spirochem.com/spirokits.
  • Yu, Haiying, Ralph Kühne, Ralf-Uwe Ebert, and Gerrit Schüürmann. “Comparative Analysis of QSAR Models for Predicting pKa of Organic Oxygen Acids and Nitrogen Bases from Molecular Structure.” Journal of Chemical Information and Modeling 50, no. 11 (November 22, 2010): 1949–60. https://doi.org/10.1021/ci100306k.