pKa Calculated by CDD Vault During Chemical Registration – CDD Support

Users may register new Molecules into CDD Vault either manually, one at a time, through the interface or by using the Data Import wizard. Whichever mechanism is used, a set of chemical properties is automatically calculated by CDD Vault for every chemical structure (small molecule) registered. Here are details on the pKa calculation used in CDD Vault.

pKa model (2023.03)

Peter Gedeck, CDD Research Informatics

The pKa model was developed using an approach similar to the method described in Lu et al. (2019). The mean absolute error obtained using cross-validation is between 0.46 and 0.84. The predictive performance of the model is in line with results reported for other models.

Further details

Dataset
The model was developed using data from several public datasets.

Dataset	Acids	Bases	Mixed
Baltruschat et al. (2020)	3,372	4,248	0
Mansouri et al. (2019a)	4,051	3,402	43
Mansouri et al. (2019)	2,680	3,104	435
Datawarrior (2015)	2,360	2,889	366
Hunt (2020)	972	1,410	0
Yu (2010)	580	563	0
Settimo (2014)	118	460	45
CDD internal (2023)	174	94	0
AID781326	49	73	0
AID781327	0	173	0
SpiroKit (2023)	0	66	0
Francisco (2021)	45	0	0
Jensen (2017)	2	41	0
Franz (2001)	36	0	0
SAMPL6 (Hunt, 2020)	5	15	4
SAMPL7	22	0	0
Manual curation	563	566	0
Combined	7,701	8,589	1,039

Each chemical structure is preprocessed as follows:

Explicit hydrogens are removed
Normalization of functional groups
Normalization of a few tautomers
Acids and bases are protonated/deprotonated

In several cases, only an acid or only a base pKa value was reported, but in many cases we identified several ionization centers. By restricting the initial training to structures with only a single ionization center, we identified 6,541 acids and 7,585 bases. These were used to derive an initial model (see below).

Ionization Sites

To identify acidic and basic ionization sites, we use a set of SMARTS patterns, which we categorized into several subsets. Model training showed that splitting the dataset into these subsets and training an individual model for each is beneficial.

Descriptors

Following the results from the work by Lu et al. (2019), we describe ionization sites using rooted topological torsions with lengths between 1 and 7 bonds. We modified this approach by reducing the rooted topological torsions to shortest paths only.
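As a rough illustration of the "shortest paths only" idea, the sketch below enumerates one shortest path from a root atom (the ionization site) to every atom within 7 bonds and encodes it as a sequence of atom labels. This is a simplified stand-in, not the fingerprint used in CDD Vault: the function name `rooted_shortest_paths`, the plain element labels, and the adjacency-list molecule representation are all assumptions made for the example.

```python
from collections import deque


def rooted_shortest_paths(atoms, bonds, root, max_bonds=7):
    """Return one shortest path per atom reachable from `root` within
    1..max_bonds bonds, encoded as a tuple of atom labels.

    Hypothetical helper for illustration only; real topological torsions
    use richer atom typing than plain element symbols.
    """
    # Build an adjacency list from the bond list.
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Breadth-first search reaches every atom by a shortest path first.
    # Where several equally short paths exist (rings), only the first
    # one found is kept.
    paths = {root: (root,)}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        if len(paths[current]) - 1 == max_bonds:
            continue  # do not extend paths beyond the length limit
        for nxt in neighbors[current]:
            if nxt not in paths:
                paths[nxt] = paths[current] + (nxt,)
                queue.append(nxt)

    features = [
        tuple(atoms[i] for i in path)
        for atom, path in paths.items()
        if atom != root
    ]
    return sorted(features)


# Propylamine, heavy atoms only: N0-C1-C2-C3, rooted at the nitrogen.
atoms = ["N", "C", "C", "C"]
bonds = [(0, 1), (1, 2), (2, 3)]
features = rooted_shortest_paths(atoms, bonds, root=0)
print(features)
```

Restricting the enumeration to shortest paths yields at most one feature per atom in the environment of the site, which keeps the descriptor count small compared with enumerating all paths.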

To avoid overtraining the model, we reduced the number of fingerprint features by requiring each feature to occur in several training set structures. This frequency cutoff depends on the path length. The pruning reduces the size of the fingerprint by 70%.
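The pruning step can be sketched as a frequency filter. The cutoff values below are invented for the example (the actual length-dependent cutoffs are not stated here), and the names `min_count_for_length` and `prune_features` are hypothetical.

```python
from collections import Counter


def min_count_for_length(n_bonds):
    """Hypothetical cutoff: longer paths must occur in more structures."""
    return 2 if n_bonds <= 3 else 3


def prune_features(structure_features, min_count=min_count_for_length):
    """Keep features that occur in enough training structures.

    `structure_features` holds one list of features per structure; each
    feature is a tuple of atom labels, so its length in bonds is len - 1.
    """
    counts = Counter()
    for feats in structure_features:
        counts.update(set(feats))  # count each feature once per structure
    return {f for f, n in counts.items() if n >= min_count(len(f) - 1)}


structures = [
    [("N", "C"), ("N", "C", "C")],
    [("N", "C"), ("N", "C", "O")],
    [("N", "C"), ("N", "C", "C"), ("N", "C", "C", "C", "C")],
]
kept = prune_features(structures)
print(sorted(kept))
```

In this toy set, the features seen in only one structure are dropped, as is the long path that does not reach its stricter cutoff.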

Model training and validation

We trained individual models for each subset. Each model was trained using an iterative process. The initial model was trained using only the data points with a single ionization center and a single experimental pKa value. In the next step, the model predictions were used to assign pKa values to ionization sites that were ignored in the initial model. An assignment was made if the prediction was within one pKa unit of the experimental value. This iterative process greatly increased the number of data points that could be used for model building. We repeated this process six times.
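The iterative assignment can be outlined as follows. This is a toy sketch under stated assumptions: the "model" is just the mean pKa per site type, the data are invented, and an assignment is made only when exactly one candidate site is predicted within the tolerance (that uniqueness rule is an assumption added to keep the example unambiguous). The function names `fit` and `iterate_assignments` are hypothetical.

```python
from statistics import mean


def fit(assigned):
    """Toy stand-in for a real model: mean pKa per site type."""
    by_type = {}
    for site_type, pka in assigned:
        by_type.setdefault(site_type, []).append(pka)
    return {t: mean(v) for t, v in by_type.items()}


def iterate_assignments(assigned, unassigned, n_rounds=6, tolerance=1.0):
    """Repeatedly refit and assign experimental values to candidate sites."""
    assigned = list(assigned)
    unassigned = list(unassigned)
    for _ in range(n_rounds):
        model = fit(assigned)
        still_open = []
        for candidate_sites, experimental in unassigned:
            # Candidate sites whose prediction is close to the measurement.
            hits = [
                s for s in candidate_sites
                if s in model and abs(model[s] - experimental) <= tolerance
            ]
            if len(hits) == 1:
                assigned.append((hits[0], experimental))
            else:
                still_open.append((candidate_sites, experimental))
        if len(still_open) == len(unassigned):
            break  # nothing new was assigned in this round
        unassigned = still_open
    return assigned, unassigned


# Invented data: (site type, pKa) pairs with a single ionization center ...
initial = [
    ("carboxylic_acid", 4.2), ("carboxylic_acid", 4.8),
    ("aliphatic_NH2", 10.4), ("aliphatic_NH2", 10.0),
]
# ... and measurements on structures with several candidate sites.
ambiguous = [
    (["carboxylic_acid", "aliphatic_NH2"], 9.8),
    (["carboxylic_acid", "aliphatic_NH2"], 4.0),
    (["thiol", "aliphatic_NH2"], 7.0),
]
assigned, remaining = iterate_assignments(initial, ambiguous)
print(len(assigned), len(remaining))
```

Each round enlarges the training set, so later rounds can resolve measurements that the initial model could not place.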

The performance metrics for 5-fold cross-validation training of the final iteration are:

Type	Size	RMSE	MAE	MedAE
carboxylic_acid	4688	0.290	0.158	0.079
aromatic_nitrogen	3469	0.609	0.334	0.194
aromatic_alcohol	2333	0.538	0.294	0.163
aliphatic_alcohol	906	1.165	0.540	0.163
thiol	443	0.411	0.195	0.062
aliphatic_NH2	2956	0.347	0.175	0.077
aliphatic_NH	2404	0.440	0.264	0.147
aliphatic_N	4785	0.404	0.245	0.138
aliphatic_Nsp2	813	0.598	0.310	0.102
acid	2683	0.764	0.345	0.141
base	71	1.594	0.533	0.039
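For reference, the three error metrics in the table (root mean squared error, mean absolute error, and median absolute error) can be computed as in the sketch below. The pKa values are invented for the example and the function name `error_metrics` is hypothetical.

```python
from math import sqrt
from statistics import mean, median


def error_metrics(experimental, predicted):
    """Compute RMSE, MAE and MedAE for paired pKa values."""
    abs_errors = [abs(p - e) for e, p in zip(experimental, predicted)]
    return {
        "RMSE": sqrt(mean([err ** 2 for err in abs_errors])),
        "MAE": mean(abs_errors),
        "MedAE": median(abs_errors),
    }


# Invented values; absolute errors are 0.0, 0.5, 1.0 and 2.0.
experimental = [4.0, 9.0, 10.0, 5.0]
predicted = [4.0, 9.5, 9.0, 7.0]
metrics = error_metrics(experimental, predicted)
print(metrics)
```

Because RMSE weights large errors most heavily and MedAE ignores them, a few poorly predicted compounds produce the pattern seen in several rows of the table: a low MedAE alongside a much higher RMSE.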

The following figure shows the predicted versus experimental pKa values for the final model. The left graph shows predictions on the training data and the right graph shows the cross-validated predictions.

We can see that most of the predictions are within one pKa unit.

Comparison to Literature

There are several studies in the literature that compare different methods. Kalliokoski and Sinervo (2019) looked at four different commercial products (Simulations Plus ADMET-Predictor S+pKa, ACD/Labs Percepta Classic, ACD/Labs Percepta GALAS and Epik). The best method gave a median absolute error of 0.69. Settimo et al. (2014) obtained similar results. They reported median absolute errors between 0.3 and 0.8 for several methods (Chemaxon, Epik, ACD). Yu et al. (2010) compared methods from SPARC and ACD and obtained mean absolute errors of 0.22 to 0.43. Hunt et al. (2020) developed various models and reported MAE values between 0.65 and 1.43. Baltruschat and Czodrowski (2020) trained models for pKa prediction and compared them against the performance of Chemaxon. They report MAE of 0.532 (internal model) and 0.57 (Chemaxon) for public data and MAE of 1.15 and 0.86 for Novartis internal data. All these studies show a wide range of performance metrics. The performance of our model compares well to the other methods.

It is useful to point out the drop of 0.3 to 0.6 pKa units reported by Baltruschat and Czodrowski when comparing performance of models between public data and the Novartis internal dataset. The quality of pKa predictions greatly depends on how well an ionization site is described in the training set. The longitudinal study by Gedeck et al. (2015) observed model performance over time for a very large Novartis internal dataset. They showed that even if models are trained on internal data, the appearance of new structural features leads to a drop in model performance.

References

AID 781326. National Center for Biotechnology Information. "PubChem Bioassay Record for AID 781326, Source: ChEMBL" PubChem, https://pubchem.ncbi.nlm.nih.gov/bioassay/781326. Accessed 10 December, 2022.
AID 781327. National Center for Biotechnology Information. "PubChem Bioassay Record for AID 781327, Source: ChEMBL" PubChem, https://pubchem.ncbi.nlm.nih.gov/bioassay/781327. Accessed 10 December, 2022.
Baltruschat, Marcel, and Paul Czodrowski. “Machine Learning Meets pKa.” F1000Research 9 (2020): Chem Inf Sci-113. https://doi.org/10.12688/f1000research.22090.2.
Francisco, Karol R., Carmine Varricchio, Thomas J. Paniak, Marisa C. Kozlowski, Andrea Brancale, and Carlo Ballatore. “Structure Property Relationships of N-Acylsulfonamides and Related Bioisosteres.” European Journal of Medicinal Chemistry 218 (June 5, 2021): 113399. https://doi.org/10.1016/j.ejmech.2021.113399.
Franz, Robert G. “Comparisons of PKa and Log P Values of Some Carboxylic and Phosphonic Acids: Synthesis and Measurement.” AAPS PharmSci 3, no. 2 (June 1, 2001): 1. https://doi.org/10.1208/ps030210.
Gedeck, Peter, Yipin Lu, Suzanne Skolnik, Stephane Rodde, Gavin Dollinger, Weiping Jia, Giuliano Berellini, Riccardo Vianello, Bernard Faller, and Franco Lombardo. “Benefit of Retraining pKa Models Studied Using Internally Measured Data.” Journal of Chemical Information and Modeling 55, no. 7 (July 27, 2015): 1449–59. https://doi.org/10.1021/acs.jcim.5b00172.
Hunt, Peter, Layla Hosseini-Gerami, Tomas Chrien, Jeffrey Plante, David J. Ponting, and Matthew Segall. “Predicting pKa Using a Combination of Semi-Empirical Quantum Mechanics and Radial Basis Function Methods.” Journal of Chemical Information and Modeling 60, no. 6 (June 22, 2020): 2989–97. https://doi.org/10.1021/acs.jcim.0c00105.
Jensen, Jan H., Christopher J. Swain, and Lars Olsen. “Prediction of PKa Values for Druglike Molecules Using Semiempirical Quantum Chemical Methods.” The Journal of Physical Chemistry A 121, no. 3 (January 26, 2017): 699–707. https://doi.org/10.1021/acs.jpca.6b10990.
Kalliokoski, Tuomo, and Kai Sinervo. “Predicting pKa for Small Molecules on Public and In-House Datasets Using Fast Prediction Methods Combined with Data Fusion.” Molecular Informatics 38, no. 7 (2019): 1800163. https://doi.org/10.1002/minf.201800163.
Lu, Yipin, Shankara Anand, William Shirley, Peter Gedeck, Brian P. Kelley, Suzanne Skolnik, Stephane Rodde, Mai Nguyen, Mika Lindvall, and Weiping Jia. “Prediction of pKa Using Machine Learning Methods with Rooted Topological Torsion Fingerprints: Application to Aliphatic Amines.” Journal of Chemical Information and Modeling 59, no. 11 (November 25, 2019): 4706–19. https://doi.org/10.1021/acs.jcim.9b00498.
Mansouri, Kamel, Neal F. Cariello, Alexandru Korotcov, Valery Tkachenko, Chris M. Grulke, Catherine S. Sprankle, David Allen, Warren M. Casey, Nicole C. Kleinstreuer, and Antony J. Williams. “Open-Source QSAR Models for pKa Prediction Using Multiple Machine Learning Approaches.” Journal of Cheminformatics 11, no. 1 (September 18, 2019): 60. https://doi.org/10.1186/s13321-019-0384-1.
Mansouri, Kamel, Neal F. Cariello, Alexandru Korotcov, Valery Tkachenko, Chris M. Grulke, Catherine S. Sprankle, David Allen, Warren M. Casey, Nicole C. Kleinstreuer, and Antony J. Williams. “MOESM1 of Open-Source QSAR Models for PKa Prediction Using Multiple Machine Learning Approaches.” figshare, September 19, 2019. https://doi.org/10.6084/m9.figshare.9877349.v1.
Sander, Thomas, Joel Freyss, Modest von Korff, and Christian Rufener. “DataWarrior: An Open-Source Program For Chemistry Aware Data Visualization And Analysis.” Journal of Chemical Information and Modeling 55, no. 2 (February 23, 2015): 460–73. https://doi.org/10.1021/ci500588j.
Settimo, Luca, Krista Bellman, and Ronald M. A. Knegtel. “Comparison of the Accuracy of Experimental and Predicted pKa Values of Basic and Acidic Compounds.” Pharmaceutical Research 31, no. 4 (April 1, 2014): 1082–95. https://doi.org/10.1007/s11095-013-1232-z.
SpiroKits | SpiroChem | Tailor-made Molecules. “SpiroKits | SpiroChem | Tailor-Made Molecules.” Accessed March 21, 2023. https://www.spirochem.com/spirokits.
Yu, Haiying, Ralph Kühne, Ralf-Uwe Ebert, and Gerrit Schüürmann. “Comparative Analysis of QSAR Models for Predicting pKa of Organic Oxygen Acids and Nitrogen Bases from Molecular Structure.” Journal of Chemical Information and Modeling 50, no. 11 (November 22, 2010): 1949–60. https://doi.org/10.1021/ci100306k.

CDD Vault automates the generation of chemical properties based on structures.