
pKa Calculated by CDD Vault During Chemical Registration

Users may register new Molecules into CDD Vault either manually, one at a time through the interface, or by using the Data Import wizard. Whichever mechanism is used, a set of chemical properties is automatically calculated by CDD Vault for every chemical structure (small molecule) registered. This article describes the pKa calculation used in CDD Vault.

pKa model

Peter Gedeck, CDD Research Informatics

The pKa model was developed using an approach similar to the method described in Lu et al. (2019). The mean absolute error obtained using cross-validation is between 0.46 and 0.84 pKa units, depending on the type of ionization site. The predictive performance of the model is in line with results reported for other models.

 

Further details

Dataset
The model was developed using data from several public datasets.

| Dataset | Acids | Bases | Mixed |
|---|---|---|---|
| Mansouri et al. (2019) | 2,680 | 3,104 | 435 |
| AID 781326 | 49 | 73 | 0 |
| AID 781327 | 0 | 173 | 0 |
| Manual curation | 121 | 28 | 0 |
| Combined | 2,719 | 3,294 | 435 |

 

Each chemical structure is preprocessed as follows:

  • Explicit hydrogens are removed
  • Functional groups are normalized
  • A few tautomers are normalized
  • Acids and bases are protonated/deprotonated

In some cases, only an acid or only a base pKa value was reported, but in many cases we identified several ionization centers. By restricting the initial training to structures with only a single ionization center, we identified 1,293 acids and 1,365 bases. We use this unambiguous assignment to derive an initial model that is then iteratively refined.

Ionization Sites

In order to identify acidic and basic ionization sites, we use a set of SMARTS patterns. 
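As an illustration of how SMARTS patterns can flag ionization sites, here is a minimal sketch using RDKit. The three patterns are common textbook examples and are assumptions made for this example; they are not the pattern set used by the CDD Vault model, and the names `IONIZATION_PATTERNS` and `find_ionization_sites` are hypothetical.

```python
# Illustrative patterns only (assumption): name -> (kind, SMARTS, position of
# the ionizable atom within the pattern). Not the pattern set used by the model.
IONIZATION_PATTERNS = {
    "carboxylic acid": ("acid", "[CX3](=O)[OX2H1]", 2),
    "phenol": ("acid", "[OX2H1]c", 0),
    "primary or secondary amine": ("base", "[NX3;H2,H1;!$(NC=O)]", 0),
}


def find_ionization_sites(smiles):
    """Return sorted (atom index, kind, pattern name) tuples for one structure."""
    # RDKit is imported lazily so the pattern table can be inspected without it.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    sites = []
    for name, (kind, smarts, site_pos) in IONIZATION_PATTERNS.items():
        pattern = Chem.MolFromSmarts(smarts)
        for match in mol.GetSubstructMatches(pattern):
            # match holds the atom indices in pattern order; keep the ionizable atom
            sites.append((match[site_pos], kind, name))
    return sorted(sites)
```

With RDKit installed, calling `find_ionization_sites("NCC(=O)O")` (glycine) would be expected to report the amine nitrogen as a basic site and the hydroxyl oxygen of the carboxyl group as an acidic site. A production pattern set would need many more patterns and careful handling of overlapping matches.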

 

Descriptors

Following the results of Lu et al. (2019), we describe ionization sites using rooted topological torsions with lengths between 1 and 7 bonds. To avoid overtraining the model, we reduced the number of fingerprint features by requiring each of them to occur in several training set structures. This frequency cutoff depends on the torsion length. The pruning reduces the size of the fingerprint by 70%.
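The idea of rooted features and frequency-based pruning can be sketched in a few lines of plain Python. This is a simplified stand-in, not the descriptor implementation used for the model: atoms are typed by element symbol only, features are simple paths that start at the ionization site, and the cutoff function is a hypothetical choice made for this example.

```python
from collections import Counter


def rooted_paths(atoms, bonds, root, max_bonds=7):
    """Enumerate simple paths of 1..max_bonds bonds that start at `root`.

    atoms: list of atom labels (element symbols only, a simplification)
    bonds: list of (i, j) atom index pairs
    Returns a set of label tuples such as ("N", "C", "C").
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    features = set()
    stack = [(root,)]
    while stack:
        path = stack.pop()
        if len(path) > 1:
            features.add(tuple(atoms[i] for i in path))
        if len(path) - 1 == max_bonds:
            continue  # longest allowed path reached, do not extend further
        for nxt in neighbors[path[-1]]:
            if nxt not in path:
                stack.append(path + (nxt,))
    return features


def prune_by_frequency(fingerprints, min_count):
    """Keep only features that occur in enough training structures.

    fingerprints: list of feature sets, one per ionization site
    min_count: function mapping path length in bonds to the required count
    """
    counts = Counter(f for fp in fingerprints for f in fp)
    kept = {f for f, n in counts.items() if n >= min_count(len(f) - 1)}
    return [fp & kept for fp in fingerprints], kept


# Toy data: ethylamine, propylamine, and 2-fluoroethylamine, rooted at nitrogen (index 0).
sites = [
    (["N", "C", "C"], [(0, 1), (1, 2)]),
    (["N", "C", "C", "C"], [(0, 1), (1, 2), (2, 3)]),
    (["N", "C", "C", "F"], [(0, 1), (1, 2), (2, 3)]),
]
fingerprints = [rooted_paths(a, b, root=0) for a, b in sites]

# Hypothetical length-dependent cutoff: paths longer than 2 bonds must occur 3 times.
pruned, kept = prune_by_frequency(fingerprints, lambda n_bonds: 2 if n_bonds <= 2 else 3)
```

In this toy set, the two three-bond paths each occur in only one structure and are dropped, while the shorter paths shared by all three amines are kept.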

 

Model training and validation

We decided to build three different models. 

  • Carboxylic acid model
  • Acid model (all remaining acidic ionization sites)
  • Base model

About 30% of all acidic ionization sites are carboxylic acids. This is a sufficiently large subset that justifies training a separate model. 

Each model was trained using an iterative process. The initial model was trained using only the data points with a single ionization center and a single experimental pKa value (566 carboxylic acids, 726 acids, and 1,358 bases). In the next step, the model predictions were used to assign pKa values to ionization sites that were ignored in the initial model. An assignment was made if the prediction was within one pKa unit of the experimental value. This iterative process greatly increased the number of data points that could be used for model building. We repeated this process eight times.
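The iterative assignment described above can be sketched as follows. This is a toy illustration under stated assumptions: the stand-in model simply predicts the mean pKa per site class (the real model uses fingerprints), each pending site carries the list of measured values of its molecule, and "within one pKa unit" is treated as inclusive. All function names are hypothetical.

```python
from statistics import mean


def fit_mean_model(training):
    """Stand-in model (assumption): mean pKa per site class."""
    by_class = {}
    for site_class, pka in training:
        by_class.setdefault(site_class, []).append(pka)
    return {cls: mean(values) for cls, values in by_class.items()}


def iterative_assignment(training, pending, n_rounds=8, tolerance=1.0):
    """Grow the training set by assigning measured pKa values to ambiguous sites.

    training: list of (site_class, pka) with unambiguous assignments
    pending:  list of (site_class, [measured pKa values of the molecule])
    """
    training = list(training)
    pending = list(pending)
    for _ in range(n_rounds):
        model = fit_mean_model(training)  # refit on everything assigned so far
        still_pending = []
        for site_class, measured in pending:
            predicted = model.get(site_class)
            if predicted is None:
                still_pending.append((site_class, measured))
                continue
            closest = min(measured, key=lambda value: abs(value - predicted))
            if abs(closest - predicted) <= tolerance:
                training.append((site_class, closest))
            else:
                still_pending.append((site_class, measured))
        if len(still_pending) == len(pending):
            break  # nothing new was assigned, so later rounds cannot differ
        pending = still_pending
    return training, pending


initial = [("amine", 10.0), ("amine", 10.6), ("phenol", 9.9)]
ambiguous = [
    ("amine", [4.2, 9.8]),   # assigned in round 1
    ("phenol", [9.5, 3.0]),  # assigned in round 1
    ("amine", [9.2]),        # too far in round 1, assigned after the model is refit
    ("pyridine", [5.2]),     # no training data for this class, stays unassigned
]
final_training, unassigned = iterative_assignment(initial, ambiguous)
```

The third entry shows why repeating the process helps: it only falls within the tolerance once the data points assigned in the first round have shifted the model.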

The performance metrics for 5-fold cross-validation training are:

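| Model | Type | Size | RMSE | MAE | MedAE |
|---|---|---|---|---|---|
| Initial model | Carboxylic acid | 566 | 0.81 | 0.44 | 0.25 |
| Initial model | Acid | 726 | 1.18 | 0.77 | 0.72 |
| Initial model | Base | 1,358 | 1.35 | 0.87 | 0.51 |
| Final model | Carboxylic acid | 566 | 0.87 | 0.46 | 0.26 |
| Final model | Acid | 1,322 | 1.36 | 0.84 | 0.48 |
| Final model | Base | 2,921 | 0.90 | 0.57 | 0.38 |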
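For reference, the three error measures (root mean squared error, mean absolute error, and median absolute error) and a 5-fold split can be computed as in the following minimal sketch. The fold assignment shown here (item i goes to fold i modulo k) is an assumption made for brevity; the splitting scheme actually used is not described in this article, and the pKa values are made up.

```python
import math
from statistics import mean, median


def error_metrics(experimental, predicted):
    """Return RMSE, MAE, and MedAE for paired experimental/predicted values."""
    errors = [abs(p - e) for e, p in zip(experimental, predicted)]
    return {
        "RMSE": math.sqrt(mean([err ** 2 for err in errors])),
        "MAE": mean(errors),
        "MedAE": median(errors),
    }


def k_fold_splits(n_items, k=5):
    """Yield (train, test) index lists; item i is held out in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


# Made-up values for illustration only.
experimental = [4.8, 9.3, 10.1, 3.9, 7.0]
predicted = [4.6, 9.9, 10.0, 5.4, 7.1]
metrics = error_metrics(experimental, predicted)

# Every item is held out exactly once across the five folds.
held_out = sorted(i for _, test in k_fold_splits(10, k=5) for i in test)
```

RMSE can never be smaller than MAE for the same set of errors, and a large gap between RMSE and MedAE indicates that a few predictions with large errors dominate, as the single 1.5-unit error does in this toy example.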

 

The following figure shows the predicted versus experimental pKa values for the final model. The left graph shows the predictions on the training data and the right graph shows the cross-validated predictions. Green points belong to the carboxylic acid model, orange points to the base model, and blue points to the acid model.

[Figure pKa1.PNG: predicted versus experimental pKa values for the final model]

We can see that most of the predictions are within one pKa unit of the experimental value.

 

Comparison to Literature

Several studies in the literature compare different methods. Kalliokoski and Sinervo (2019) looked at four commercial products (Simulations Plus ADMET-Predictor S+pKa, ACD/Labs Percepta Classic, ACD/Labs Percepta GALAS, and Epik); the best method gave a median absolute error of 0.69. Settimo et al. (2014) obtained similar results, reporting median absolute errors between 0.3 and 0.8 for several methods (Chemaxon, Epik, ACD). Yu et al. (2010) compared methods from SPARC and ACD and obtained mean absolute errors of 0.22 to 0.43. Hunt et al. (2020) developed various models and reported MAE values between 0.65 and 1.43. Baltruschat and Czodrowski (2020) trained models for pKa prediction and compared them against Chemaxon. They report an MAE of 0.532 (internal model) and 0.57 (Chemaxon) for public data, and an MAE of 1.15 and 0.86, respectively, for Novartis internal data. Taken together, these studies show a wide range of performance metrics. With cross-validated MAE values between 0.46 and 0.84, our model compares well to the other methods.

It is worth noting the loss of accuracy reported by Baltruschat and Czodrowski between public data and the Novartis internal dataset: the MAE increases by roughly 0.3 to 0.6 pKa units (from 0.57 to 0.86 for Chemaxon and from 0.532 to 1.15 for the internal model). The quality of pKa predictions depends greatly on how well an ionization site is represented in the training set. The longitudinal study by Gedeck et al. (2015) followed model performance over time for a very large Novartis internal dataset and showed that, even if models are trained on internal data, the appearance of new structural features leads to a drop in model performance.

 

References

AID 781326. National Center for Biotechnology Information. “PubChem Bioassay Record for AID 781326, Source: ChEMBL.” PubChem. https://pubchem.ncbi.nlm.nih.gov/bioassay/781326. Accessed 10 December 2022.

AID 781327. National Center for Biotechnology Information. “PubChem Bioassay Record for AID 781327, Source: ChEMBL.” PubChem. https://pubchem.ncbi.nlm.nih.gov/bioassay/781327. Accessed 10 December 2022.

Baltruschat, Marcel, and Paul Czodrowski. “Machine Learning Meets pKa.” F1000Research 9 (2020): Chem Inf Sci-113. https://doi.org/10.12688/f1000research.22090.2.

Gedeck, Peter, Yipin Lu, Suzanne Skolnik, Stephane Rodde, Gavin Dollinger, Weiping Jia, Giuliano Berellini, Riccardo Vianello, Bernard Faller, and Franco Lombardo. “Benefit of Retraining pKa Models Studied Using Internally Measured Data.” Journal of Chemical Information and Modeling 55, no. 7 (July 27, 2015): 1449–59. https://doi.org/10.1021/acs.jcim.5b00172.

Hunt, Peter, Layla Hosseini-Gerami, Tomas Chrien, Jeffrey Plante, David J. Ponting, and Matthew Segall. “Predicting pKa Using a Combination of Semi-Empirical Quantum Mechanics and Radial Basis Function Methods.” Journal of Chemical Information and Modeling 60, no. 6 (June 22, 2020): 2989–97. https://doi.org/10.1021/acs.jcim.0c00105.

Kalliokoski, Tuomo, and Kai Sinervo. “Predicting pKa for Small Molecules on Public and In-House Datasets Using Fast Prediction Methods Combined with Data Fusion.” Molecular Informatics 38, no. 7 (2019): 1800163. https://doi.org/10.1002/minf.201800163.

Lu, Yipin, Shankara Anand, William Shirley, Peter Gedeck, Brian P. Kelley, Suzanne Skolnik, Stephane Rodde, Mai Nguyen, Mika Lindvall, and Weiping Jia. “Prediction of pKa Using Machine Learning Methods with Rooted Topological Torsion Fingerprints: Application to Aliphatic Amines.” Journal of Chemical Information and Modeling 59, no. 11 (November 25, 2019): 4706–19. https://doi.org/10.1021/acs.jcim.9b00498.

Mansouri, Kamel, Neal F. Cariello, Alexandru Korotcov, Valery Tkachenko, Chris M. Grulke, Catherine S. Sprankle, David Allen, Warren M. Casey, Nicole C. Kleinstreuer, and Antony J. Williams. “Open-Source QSAR Models for pKa Prediction Using Multiple Machine Learning Approaches.” Journal of Cheminformatics 11, no. 1 (September 18, 2019): 60. https://doi.org/10.1186/s13321-019-0384-1.

Settimo, Luca, Krista Bellman, and Ronald M. A. Knegtel. “Comparison of the Accuracy of Experimental and Predicted pKa Values of Basic and Acidic Compounds.” Pharmaceutical Research 31, no. 4 (April 1, 2014): 1082–95. https://doi.org/10.1007/s11095-013-1232-z.

Yu, Haiying, Ralph Kühne, Ralf-Uwe Ebert, and Gerrit Schüürmann. “Comparative Analysis of QSAR Models for Predicting pKa of Organic Oxygen Acids and Nitrogen Bases from Molecular Structure.” Journal of Chemical Information and Modeling 50, no. 11 (November 22, 2010): 1949–60. https://doi.org/10.1021/ci100306k.