Prediction of NMR Chemical Shifts. A Chemometrical Approach
K.A. Blinov, Y.D. Smurnyy, T.S. Churanova, M.E. Elyashberg
Advanced Chemistry Development (ACD)
Structure and its spectral data
[Figure: a structure and its spectra]
Sometimes the solution is not obvious
• In many cases we obtain several structures that are consistent with the spectral data.
• In this case we need a method to rank the structures.
• The most powerful method is to compare the experimental and predicted 13C NMR spectra.
13C NMR spectral data
[Figure: experimental and predicted spectra; values shown: 2.00 and 9.62]
How to find the best structure?
• In most cases the predicted spectrum of the "correct structure" gives the best fit to the experimental spectrum.
• In practice, the "correct structure" has an average deviation of 2-3 ppm between the predicted and experimental spectra.
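The ranking idea can be sketched in a few lines of Python. This is only an illustration: the shift values and candidate names are made up, and pairing the shifts by sorting both lists is an assumption, since the slides do not say how predicted and experimental shifts are matched.

```python
def mean_deviation(experimental, predicted):
    """Mean absolute deviation (ppm) between two equally long shift lists.

    Assumption: shifts are paired by sorting both lists; a real
    assignment would pair them atom by atom.
    """
    if len(experimental) != len(predicted):
        raise ValueError("shift lists must have the same length")
    pairs = zip(sorted(experimental), sorted(predicted))
    return sum(abs(e - p) for e, p in pairs) / len(experimental)


def rank_candidates(experimental, candidates):
    """Return (name, deviation) tuples, best fit first."""
    scored = [(name, mean_deviation(experimental, shifts))
              for name, shifts in candidates.items()]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical data, for illustration only.
experimental = [14.1, 22.7, 31.9, 128.5]
candidates = {
    "A": [15.0, 23.0, 30.0, 130.0],
    "B": [20.0, 35.0, 60.0, 110.0],
}
ranking = rank_candidates(experimental, candidates)
for name, deviation in ranking:
    print(f"{name}: {deviation:.2f} ppm")
```

The candidate with the smallest mean deviation comes first in `ranking`.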
The role of spectra prediction
• Real-world task: an unknown structure with molecular formula C29H32N2O5 and spectral data (1D and 2D NMR).
• 20 min to generate all structures (> 12 000)
• 24 hours to predict the 13C NMR spectra of all the generated structures
• The speed of spectra prediction should be increased
Methods of NMR spectra prediction
• Quantum mechanics: extremely slow
• Database approach (HOSE codes, maximum common substructure): accurate but slow
• Rule-based (additive scheme, neural networks): fast but inaccurate
• Our choice: improve the accuracy of a fast method
Additive scheme
$\delta = \sum_i a_i x_i$ (where $a_i$ is an atom increment and $x_i$ the number of atoms of that kind)
Worked example: $\delta = 153.71 - 1.85 - 4.49 - 1.39 - 2.79 + 1.43 + 0.52 + 0.52 - 1.35 = 144.31$
• Main problem: finding the correct values of the atom increments
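The arithmetic of the additive scheme can be checked with a short Python sketch. The numbers are the ones shown on the slide. Treating 153.71 as a base value, and the two 0.52 terms as one increment with a count of 2, are assumptions about how the terms are grouped.

```python
def additive_shift(base, terms):
    """Base value plus the sum of increment * count over all terms."""
    return base + sum(increment * count for increment, count in terms)


BASE_SHIFT = 153.71          # starting value on the slide, ppm
TERMS = [                    # (increment a_i in ppm, count x_i)
    (-1.85, 1),
    (-4.49, 1),
    (-1.39, 1),
    (-2.79, 1),
    (1.43, 1),
    (0.52, 2),               # the slide lists +0.52 twice
    (-1.35, 1),
]

shift = additive_shift(BASE_SHIFT, TERMS)
print(f"{shift:.2f} ppm")    # 144.31 ppm
```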
Available data
• We have a database of 1.5 million 13C chemical shifts.
• We can try to obtain the correct values!
How to encode atom environment
[Figure: the input variables are the numbers of atoms of each type (CH3, CH2, CH2, CH, C, O, …) in the 1st and 2nd spheres around the atom; example counts shown: 2, 1, 1, 1, 1, 1]
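A minimal sketch of such a sphere-based encoding, assuming the molecule is given as a bond list plus a type label per atom. The function name, the data layout, and the example molecule are hypothetical and not taken from the slides.

```python
from collections import Counter, deque


def sphere_counts(bonds, atom_types, center, max_sphere=2):
    """Count atom types at each topological distance 1..max_sphere."""
    neighbors = {}
    for a, b in bonds:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Breadth-first search gives the distance in bonds from the center.
    distance = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if distance[atom] == max_sphere:
            continue  # do not expand beyond the last sphere
        for nxt in neighbors.get(atom, ()):
            if nxt not in distance:
                distance[nxt] = distance[atom] + 1
                queue.append(nxt)

    counts = Counter()
    for atom, d in distance.items():
        if d > 0:
            counts[(d, atom_types[atom])] += 1
    return counts


# Hypothetical example: pentan-2-ol skeleton, central atom is C1.
atom_types = {0: "CH3", 1: "CH", 2: "CH2", 3: "CH2", 4: "CH3", 5: "OH"}
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5)]
counts = sphere_counts(bonds, atom_types, center=1)
for (sphere, atom_type), n in sorted(counts.items()):
    print(f"sphere {sphere}: {n} x {atom_type}")
```

Each `(sphere, atom type)` key corresponds to one input variable, and its count is the value of that variable for the central atom.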
Data for PLS regression
[Figure: matrix X holds the atom environment encoding and Y the chemical shifts; each row is one sample]
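To show how X and Y enter the regression, here is a small pure-Python PLS1 sketch (NIPALS-style, one response variable). It is not the authors' implementation, and the toy data are synthetic. It only illustrates that, with enough components, PLS reproduces a linear relation between encoded environments and shifts.

```python
def pls1_fit(X, y, n_components):
    """Fit PLS1 by successive deflation; returns means and components."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    components = []
    for _ in range(n_components):
        # Weight vector: direction of maximum covariance with y.
        w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:
            break  # nothing left to explain
        w = [v / norm for v in w]
        t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        load = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(yc[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next component.
        Xc = [[Xc[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        yc = [yc[i] - q * t[i] for i in range(n)]
        components.append((w, load, q))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict one response by applying the stored deflation steps."""
    x_mean, y_mean, components = model
    residual = [xj - mj for xj, mj in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in components:
        t = sum(r * wj for r, wj in zip(residual, w))
        y_hat += q * t
        residual = [r - t * lj for r, lj in zip(residual, load)]
    return y_hat


# Synthetic toy data: y = 2*x1 + 3*x2 + 5 (not real chemical shifts).
X = [[1, 0], [0, 1], [1, 1], [2, 1], [0, 2]]
y = [7, 8, 10, 12, 11]
model = pls1_fit(X, y, n_components=2)
prediction = pls1_predict(model, [3, 1])
print(round(prediction, 6))
```

In the setting of the slides, each row of X would be the encoded environment of one carbon atom and the matching entry of Y its chemical shift.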
Find the best structure encoding
• Initially, the best scheme of structure representation is not evident.
• We should find the scheme that gives the best accuracy.
• We should optimize:
  • the substituent coding scheme
  • the number of "spheres" used
Data used
• 210 K chemical shifts were used as the training set.
• 170 K chemical shifts from recent literature were used as an external validation set.
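How to describe atom type
Each "substituent" atom is described relative to the "central" atom (example values from the slide in parentheses):
• Atom type (C, O, etc.) (7, i.e. N)
• Hybridization (sp3, sp2, etc.) (1, i.e. sp3)
• Valence (3)
• Number of neighboring H atoms (2)
• Charge (0)
• Distance to the "central" atom, in bonds (3)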
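One way to hold such a descriptor in code is a small immutable record, sketched below. The class and field names are hypothetical, and the numeric code 1 for sp3 simply follows the example values on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomDescriptor:
    """Hypothetical container for the six properties listed above."""
    atomic_number: int   # atom type, e.g. 7 for N
    hybridization: int   # numeric code; 1 is taken to mean sp3
    valence: int
    hydrogens: int       # number of attached H atoms
    charge: int
    distance: int        # bonds between this atom and the central atom


# Example values from the slide: an NH2 nitrogen three bonds away.
substituent = AtomDescriptor(7, 1, 3, 2, 0, 3)

# Frozen dataclasses are hashable, so a descriptor can be used as a key
# that maps each distinct atom type to a column of the X matrix.
variable_index = {substituent: 0}
print(variable_index[AtomDescriptor(7, 1, 3, 2, 0, 3)])
```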
Is it the best possible accuracy?
• The best average deviation obtained so far is 3.5 ppm.
• We need less than 3 ppm (2 ppm is preferable).
• Should we use additional variables?
• We should be very careful when adding variables.
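Substituent interference (cross effect)
[Figure: 13C shifts for substituted structures; values shown: 125.38, 134.16, 138.30, 125.90, 141.48; increments +11.26, +2.48; 122.90, 136.64, 127.86, 145.42; differences Δ −1.94, Δ +1.34, Δ −3.94]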
Enhanced structure encoding
[Figure: the input variables now cover atoms and pairs of atoms (crosses); for each atom pair type (CH2 and CH, C and O, …) the variable is the number of such pairs (1 and 1 in the example)]
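A sketch of how pair ("cross") variables could be counted from the substituent atoms around one carbon. Taking every unordered pair of substituents is an assumption. The slides suggest that the distance between the atoms within a cross also matters, which is left out here.

```python
from collections import Counter
from itertools import combinations


def pair_counts(substituent_types):
    """Count unordered pairs of atom types among the substituents."""
    counts = Counter()
    for a, b in combinations(substituent_types, 2):
        counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical environment with three substituent atoms.
crosses = pair_counts(["CH2", "CH", "O"])
for pair, n in sorted(crosses.items()):
    print(pair, n)
```

These pair counts would be appended to the single-atom counts as additional columns of X.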
Result for atom pairs (crosses)
[Plot: mean error (ppm) versus distance between atoms within a cross and number of spheres]
More enhancements?
• The accuracy is now good enough (2.3 ppm).
• But it is still poor in some cases.
• Unfortunately, these cases are very important.
• These "special" cases should be taken into account.
Stereo effects: double bonds
• We use the "topological" distance.
• Sometimes equal topological distances correspond to different "real" distances.
[Figure: two carbons at the same topological distance: 25.7 ppm at 3.9 Å and 17.6 ppm at 2.9 Å]
Modified structure encoding
[Figure: the variables now cover atoms, pairs of atoms (crosses) and "stereo" effects]
Prediction of spectra by different methods (mean error, ppm)
Size of training set
• We have 1.5 million chemical shifts.
• We should try to use all the available data.
• There is only one problem: the matrix size.
• In many cases the matrix becomes larger than 2 GB.
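A rough estimate shows why the matrix size becomes a problem. The sketch assumes a dense matrix of 8-byte floating-point values and reads "2 GB" as 2 GiB. The slides do not state the storage format or the number of variables.

```python
def dense_matrix_bytes(rows, cols, bytes_per_value=8):
    """Memory needed for a dense rows x cols matrix."""
    return rows * cols * bytes_per_value


ROWS = 1_500_000              # one row per chemical shift
LIMIT = 2 * 1024**3           # 2 GiB in bytes (assumed meaning of "2 GB")

# Largest number of columns that still fits under the limit.
max_cols = LIMIT // (ROWS * 8)
print(max_cols)               # 178
```

Under these assumptions, fewer than 200 input variables already fill 2 GB when all 1.5 million shifts are used.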
The final results
Faster by 3 orders of magnitude!
Prediction time: the past and present C29H32N2O5
Conclusions
• Combining a "new" method with an old, well-known algorithm can produce a very good (and unexpected) result.