Two cases of chemometrics application in protein crystallography European Molecular Biology Laboratory (EMBL), Hamburg, Germany Andrey Bogomolov
Outline • Protein crystallography: a brief introduction • Case I: determination of protein secondary structure from the raw diffraction data using PLS-R • Case II: modeling of crystal radiation damage • Potential applications of chemometric techniques to crystallography (of biological macromolecules)
Protein crystallography: introduction • Protein (macromolecular) crystallography is a scientific discipline that studies… • biological objects: proteins, DNA, RNA etc. … • by physical means: X-ray diffraction, synchrotron radiation … • on the chemical level: 3D-structure, complexes, interactions … • with the extensive use of mathematics: data analysis, modeling • The main objectives: • solve 3D-structure of a molecule • explain its biological function at the atomic level • Today’s hot topic: • drug design • part of the global “-omics” project (genomics/proteomics)
Protein crystallography workflow • expression & purification: protein (DNA, RNA) solution • crystallization: protein crystal • data collection: diffraction pattern • phasing: electron density map • structure solution: 3D structure
Protein Data Bank (PDB) Global data collection (>30000 records) • www.pdb.org • 3D structures • experimental data • biological and chemical information
Crystallographic data collection: Wilson plot [Figure: X-ray beam and data collection, with control and optimization; theoretical and experimental Wilson plots]
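The Wilson plot named on this slide is, in its textbook form, ln(⟨I⟩/Σf²) plotted against sin²θ/λ² = (1/d)²/4, a straight line with slope −2B for an overall B-factor B. The sketch below illustrates that relation on synthetic data only. It assumes intensities already normalized by Σf², an exponential (acentric) intensity distribution, and a hypothetical B of 20 Å²; the function name `wilson_plot` and all numbers are illustrative, not taken from the presentation.

```python
import numpy as np

def wilson_plot(s, intensity, n_bins=20):
    """Average intensities in shells of equal width in s**2 (s = 1/d)
    and return (mean s**2 per shell, ln of mean intensity per shell)."""
    s2 = s ** 2
    edges = np.linspace(s2.min(), s2.max(), n_bins + 1)
    shell = np.clip(np.digitize(s2, edges) - 1, 0, n_bins - 1)
    filled = [k for k in range(n_bins) if np.any(shell == k)]  # skip empty shells
    x = np.array([s2[shell == k].mean() for k in filled])
    y = np.array([np.log(intensity[shell == k].mean()) for k in filled])
    return x, y

# Synthetic reflections (assumption): <I> = exp(-B * s**2 / 2), B = 20 A^2.
rng = np.random.default_rng(0)
true_b = 20.0
s = np.sqrt(rng.uniform(0.01, 0.27, 5000))   # 1/d up to about 0.52 A^-1
mean_i = np.exp(-true_b * s ** 2 / 2)
intensity = rng.exponential(mean_i)          # acentric Wilson statistics

x, y = wilson_plot(s, intensity)
slope, intercept = np.polyfit(x, y, 1)       # straight-line fit of ln<I> vs s**2
b_est = -2.0 * slope                         # slope = -B/2 on the s**2 axis
print(f"estimated overall B-factor: {b_est:.1f} A^2")
```

On the s² = (1/d)² axis used here the slope is −B/2, which is why the estimate multiplies the fitted slope by −2.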
Case I: Determination of protein secondary structure Problem: • determine the contents (fractions of the polypeptide chain) of secondary structure elements (α-helix, β-sheet) in a protein molecule from the raw diffraction data (Wilson plot) • this is a well-established method for CD and IR spectra of protein solutions • PLS regression is one of the best methods for it • for the Wilson plot, only qualitative data on an existing correlation are available, and only for “theoretical” data
Secondary structure determination: data • Data sets: theoretical and experimental • Data preprocessing: • averaging with an optimal bin size* • special scaling (correction for anisotropic B-factor)* • taking the natural logarithm • conversion into the matrix (Wilson plots in rows)* • auto-scaling • outlier detection and removal* • *) experimental data only
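Part of the preprocessing chain above can be sketched as follows. This is a minimal illustration on random placeholder curves, assuming a fixed bin size of 4 points. The anisotropic B-factor scaling and the outlier removal are left out because the slide does not say how they were done. The helper names `bin_average` and `preprocess` are hypothetical.

```python
import numpy as np

def bin_average(curve, bin_size):
    """Average consecutive points in groups of bin_size (any remainder is dropped)."""
    n = (len(curve) // bin_size) * bin_size
    return curve[:n].reshape(-1, bin_size).mean(axis=1)

def preprocess(curves, bin_size=4):
    """Bin-average, take the natural logarithm, stack the curves as rows
    of a matrix, then auto-scale each column to zero mean and unit variance."""
    rows = [np.log(bin_average(np.asarray(c, dtype=float), bin_size))
            for c in curves]
    X = np.vstack(rows)                  # one Wilson plot per row
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    return (X - mean) / std, mean, std

# Placeholder data (assumption): 6 positive curves with 40 points each.
rng = np.random.default_rng(0)
curves = rng.uniform(1.0, 10.0, size=(6, 40))
X_scaled, col_mean, col_std = preprocess(curves, bin_size=4)
print(X_scaled.shape)
```

The column means and standard deviations are returned so that new samples can be scaled with the calibration-set statistics instead of their own.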
Secondary structure determination: data (2) [Figure: Wilson plots; panel labels: theoretical, experimental; 1hq3 (α), 1at0 (β), 1d5t (α+β)]
Secondary structure determination: calibration results • RMSEP & correlation coefficients for different methods, α-helix (theoretical) [table values not preserved] • *) Resolution (1/d) = 0.52 Å⁻¹ (~1.9 Å) • References: • S. Navea, R. Tauler, A. de Juan, Elucidation of protein secondary structure, Anal. Biochem. 336 (2005) 231–242 • K.A. Oberg, J.-M. Ruysschaert, E. Goormaghtigh, The optimization of protein secondary structure determination with infrared and circular dichroism spectra, Eur. J. Biochem. 271 (2004) 2937–2948
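The two figures of merit on this slide, RMSEP and the correlation coefficient, can be computed as in the sketch below. It uses a small NIPALS-style PLS1 regression and leave-one-out cross-validation on synthetic data. The algorithm variant, the validation scheme, the number of components, and the data are all assumptions, since the slide does not specify them, and the function names are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """PLS1 (single response) fitted by the NIPALS deflation scheme.
    Returns regression vector b and the centering terms."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = E.T @ f
        w /= np.linalg.norm(w)           # weight vector
        t = E @ w                        # scores
        tt = t @ t
        p = E.T @ t / tt                 # X loadings
        c = f @ t / tt                   # y loading
        E = E - np.outer(t, p)           # deflate X
        f = f - c * t                    # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)  # b = W (P'W)^-1 q
    return b, x_mean, y_mean

def pls1_predict(X, b, x_mean, y_mean):
    return (X - x_mean) @ b + y_mean

def loo_rmsep(X, y, n_comp):
    """Leave-one-out predictions, RMSEP, and correlation with the reference."""
    y_hat = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = pls1_fit(X[keep], y[keep], n_comp)
        y_hat[i] = pls1_predict(X[i:i + 1], *model)[0]
    rmsep = np.sqrt(np.mean((y - y_hat) ** 2))
    r = np.corrcoef(y, y_hat)[0, 1]
    return rmsep, r

# Synthetic calibration set (assumption): 40 samples, 30 variables, 3 latent factors.
rng = np.random.default_rng(1)
T = rng.normal(size=(40, 3))
X = T @ rng.normal(size=(3, 30)) + 0.01 * rng.normal(size=(40, 30))
y = T @ np.array([0.5, -0.3, 0.2]) + 0.01 * rng.normal(size=40)

rmsep, r = loo_rmsep(X, y, n_comp=3)
print(f"RMSEP = {rmsep:.3f}, r = {r:.3f}")
```

In a real calibration the rows of `X` would be preprocessed Wilson plots and `y` the reference α-helix or β-sheet fractions; the number of components would normally be chosen by the same cross-validation.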
Case II: Modeling radiation damage • A biological crystal exposed to X-rays undergoes radiation damage • Modeling of radiation damage is important for: • understanding of the effect on the protein • optimization of data collection • Present state of the problem: • no comprehensive theory of radiation damage (RD) • specific effects are well known, but the main changes are non-specific • Suggestion by Gleb Bourenkov: • radiation dose has a linear effect on atomic B-factors • Task: • check for linearity, find the reason(s) for deviation
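One simple way to check the suggested linear dependence of B-factor on dose is to fit a straight line and then test whether a quadratic term adds anything. The sketch below does this for a single hypothetical atom on synthetic data; the dose units, slope, intercept, and noise level are invented for illustration and do not come from the presentation.

```python
import numpy as np

# Synthetic dose series (assumption): B = 15 + 1.2 * dose plus small noise.
rng = np.random.default_rng(0)
dose = np.linspace(0.0, 10.0, 21)            # arbitrary dose units
b_obs = 15.0 + 1.2 * dose + rng.normal(0.0, 0.05, dose.size)

# Straight-line model: B(dose) = intercept + slope * dose.
slope, intercept = np.polyfit(dose, b_obs, 1)
residuals = b_obs - (intercept + slope * dose)
r = np.corrcoef(dose, b_obs)[0, 1]

# Linearity check: the quadratic coefficient should be close to zero
# if the dose effect is really linear.
curvature = np.polyfit(dose, b_obs, 2)[0]

print(f"slope = {slope:.3f}, r = {r:.4f}, curvature = {curvature:.4f}")
print(f"largest residual = {np.abs(residuals).max():.3f}")
```

Atoms whose residuals or curvature stand out from the rest would be the natural starting point when looking for reasons of deviation.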
Radiation damage modeling: results • r = 0.999 • RMSEP = 9.4×10⁻³
Conclusions • Multivariate data analysis has great potential for protein crystallography • currently its application is episodic • it rarely goes beyond PCA • A method-centric approach would be beneficial: • “I have a method, I am looking for problems”
X-files • Chemometric methods: • PCA, Factor Analysis • SIMCA, PLSD • Multivariate Regression • MSPC, Design of Experiment • Curve Resolution • Multivariate Image Analysis • Target Factor Analysis • PARAFAC, 3(multi)-way • Wavelet Transform • Crystallographic tasks: • crystallization, HTPC • crystal screening • crystal auto-mounting • data collection • data reduction • radiation damage • phasing • structure solution • structure refinement
Challenge • Critical re-assessment of the entire protein crystallographic workflow with a multivariate approach in mind: an ambitious project for chemometricians?
Acknowledgements • Alexander Popov • Gleb Bourenkov • Victor Lamzin