The TEXTAL System: Automated Model-Building Using Pattern Recognition Techniques. Dr. Thomas R. Ioerger Department of Computer Science Texas A&M University. Collaboration with: Dr. James C. Sacchettini, Center for Structural Biology, Texas A&M Univ.
The TEXTAL System: Automated Model-Building Using Pattern Recognition Techniques Dr. Thomas R. Ioerger Department of Computer Science Texas A&M University Collaboration with: Dr. James C. Sacchettini, Center for Structural Biology, Texas A&M Univ. With support from: National Institutes of Health
Automated Structure Determination • Key step to high-throughput Structural Genomics, structure-based drug design, etc. • Many computational tools to generate a map, but... • Given electron density map, how to extract atomic coordinates automatically? • Currently requires humans (+O): potential bottleneck • Sources of difficulty: complexity, low resolution, phase errors, weak density • Related methods: Shake&Bake, ARP/wARP, X-Powerfit, template convolution...
Overview of TEXTAL • Apply pattern recognition techniques • Exploit database of previously-solved maps • Model molecular structures in local regions (e.g. spheres of 5 Angstrom radius) • Intuitive principles: 1) Have I ever seen a region with a pattern of density like this before? 2) If so, what were previous local atomic coordinates?
Overview (cont’d) • Divide-and-Conquer: 1) identify alpha-carbon positions (chain-tracing) 2) model regions around alpha-carbons (CAs), including backbone and side-chain atoms 3) concatenate local models back together, resolve any conflicts • Database contains many regions centered on CAs from previous maps • ~5A radius right for “structural repetition”
Main Stages of TEXTAL [flowchart] • electron density map → CAPRA → C-alpha chains → LOOKUP (build-in side-chain and main-chain atoms locally around each CA) → model (initial coordinates) → Post-processing routines (example: real-space refinement) → model (final coordinates) • Other boxes in the diagram: Human Crystallographer (editing); Reciprocal-space refinement/ML DM
Feature Extraction • Database: ~10^5 regions from ~100 maps • How to identify the closest match efficiently? • Calculate numerical features that represent the pattern in each region • Must be rotation-invariant • Search can be very fast: just compare features
F=<1.72,-0.39,1.04,1.55...> F=<1.58,0.18,1.09,-0.25...> F=<0.90,0.65,-1.40,0.87...> F=<1.79,-0.43,0.88,1.52...>
Rotation-Invariant Features • Average density: $m = (1/n)\sum_i \rho_i$, where $\rho_i$ is the density at each lattice point in the region • Other statistical features: standard deviation, kurtosis… • Distance to center of mass: • $\langle x_c, y_c, z_c \rangle = (1/n)\,\langle \sum_i x_i\rho_i/m,\ \sum_i y_i\rho_i/m,\ \sum_i z_i\rho_i/m \rangle$ • $d_{cen} = \sqrt{x_c^2 + y_c^2 + z_c^2}$
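The statistical features on this slide can be sketched in a few lines of Python. This is a minimal illustration, not TEXTAL's implementation: the function name `region_features` and the input layout (a list of `(x, y, z, rho)` tuples with coordinates measured from the region center) are assumptions made for the example.

```python
import math

def region_features(points):
    """Rotation-invariant statistics for one region.

    points: list of (x, y, z, rho) tuples; coordinates are assumed to be
    relative to the region center, rho is the density at that lattice point.
    """
    n = len(points)
    rho = [p[3] for p in points]
    mean = sum(rho) / n                              # m = (1/n) * sum(rho_i)
    var = sum((r - mean) ** 2 for r in rho) / n
    std = math.sqrt(var)
    # Kurtosis as the fourth standardized moment; undefined for flat density.
    kurt = (sum((r - mean) ** 4 for r in rho) / n) / var ** 2 if var > 0 else float("nan")
    total = n * mean                                 # equals sum(rho_i)
    xc = sum(x * r for x, _, _, r in points) / total
    yc = sum(y * r for _, y, _, r in points) / total
    zc = sum(z * r for _, _, z, r in points) / total
    d_cen = math.sqrt(xc ** 2 + yc ** 2 + zc ** 2)   # distance to center of mass
    return {"mean": mean, "std": std, "kurtosis": kurt, "d_cen": d_cen}

# Toy region (hypothetical values) and the same region rotated 90 degrees about z.
region = [(1, 0, 0, 2.0), (0, 1, 0, 1.0), (0, 0, 1, 1.0), (-1, 0, 0, 0.5)]
rotated = [(-y, x, z, r) for x, y, z, r in region]
print(region_features(region))
print(region_features(rotated))
```

Both calls print the same values, which is the property the slide asks for: the features depend on the density pattern, not on its orientation.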
More Features • Moments of inertia • measures dispersion around axes of symmetry in a density distribution • calculate 3x3 inertia matrix • diagonalize to get eigenvalues • sort from largest to smallest • take magnitudes and ratios of moments
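The steps listed here (build the 3x3 inertia matrix, diagonalize, sort, take ratios) can be sketched with NumPy. The slide does not say which origin is used; centering on the density-weighted center of mass is an assumption of this sketch, as are the function name `inertia_features` and the particular ratios returned.

```python
import numpy as np

def inertia_features(coords, rho):
    """Sorted principal moments (largest first) and ratios of adjacent moments.

    coords: (N, 3) lattice-point coordinates; rho: (N,) densities used as weights.
    """
    coords = np.asarray(coords, dtype=float)
    rho = np.asarray(rho, dtype=float)
    # Assumption: measure dispersion about the density-weighted center of mass.
    com = (coords * rho[:, None]).sum(axis=0) / rho.sum()
    r = coords - com
    # Inertia tensor: I = sum_i rho_i * (|r_i|^2 * Id - r_i r_i^T)
    sq = (r ** 2).sum(axis=1)
    inertia = (rho * sq).sum() * np.eye(3) - (r * rho[:, None]).T @ r
    # eigvalsh handles symmetric matrices and returns ascending eigenvalues.
    moments = np.sort(np.linalg.eigvalsh(inertia))[::-1]
    # Ratios are infinite or undefined if a smaller moment is zero (collinear points).
    ratios = moments[:-1] / moments[1:]
    return moments, ratios

# Toy example (hypothetical points): elongated along x, flat in z.
coords = [(2, 0, 0), (-2, 0, 0), (0, 1, 0), (0, -1, 0)]
rho = [1.0, 1.0, 1.0, 1.0]
moments, ratios = inertia_features(coords, rho)
print(moments, ratios)
```

Because eigenvalues of the inertia matrix do not change when the region is rotated, the moments and their ratios qualify as rotation-invariant features.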
More Features • Spoke angles • if region centered on CA, should have 3 “spokes” of density emanating from center • find best-fit vectors; calc. angles among them • surface area of contours • connectivity of density/bones in region • other geometrical features...
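Once the best-fit spoke vectors are known, the angles among them are simple to compute. The sketch below assumes the three spoke directions are already available (how TEXTAL fits them is not described on the slide); `angle_deg` and `spoke_angles` are hypothetical helper names.

```python
import math
from itertools import combinations

def angle_deg(u, v):
    """Angle between two 3D vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))

def spoke_angles(spokes):
    """Sorted pairwise angles among spoke vectors (sorting removes order dependence)."""
    return sorted(angle_deg(u, v) for u, v in combinations(spokes, 2))

# Hypothetical spokes: two pointing in opposite directions, one perpendicular.
print(spoke_angles([(1, 0, 0), (-1, 0, 0), (0, 2, 0)]))
```

Sorting the three angles makes the feature independent of the order in which the spokes were found, and the angles themselves do not change under rotation of the region.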
CAPRA: C-Alpha Pattern-Recognition Algorithm • Pipeline [diagram]: map → Density Trace → pseudo atoms → Neural Network → predictions of distance to true CA → Linking into C-alpha chains → C-alpha coordinates • Tracer - remove lattice points from map (lowest density first) without breaking connectivity • Neural network - for each pseudo atom, extract features, input to network, predict distances to CAs (1:10 in trace), trained on example points in real maps • Linking - desire long chains, good CA predictions (not in side-chains), “structurally plausible” (e.g. linear, helical)
Database Construction • Ideally would use solved MAD/MIR maps • Using “back-transformed” maps works well • PDB structure factors (include B-factors) • keep reflections down to 2.8A • Fourier transform electron density map • 50 proteins from PDBSelect (non-homol.) • about 50,000 regions • Feature extraction done offline
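The effect of keeping reflections only down to 2.8 Å can be illustrated on a gridded density. The sketch below is a stand-in under stated assumptions, not the database-construction code: it starts from a density grid rather than PDB structure factors, assumes an orthogonal cell with uniform grid spacing, and uses a hypothetical function name `truncate_resolution`.

```python
import numpy as np

def truncate_resolution(density, spacing, d_min=2.8):
    """Zero Fourier coefficients with resolution finer than d_min, then back-transform.

    density: 3D array sampled on an orthogonal grid; spacing: grid step in Angstroms.
    """
    coeffs = np.fft.fftn(density)
    # Spatial frequency |s| = 1/d for each Fourier coefficient (cycles per Angstrom).
    freqs = [np.fft.fftfreq(n, d=spacing) for n in density.shape]
    sx, sy, sz = np.meshgrid(*freqs, indexing="ij")
    s = np.sqrt(sx ** 2 + sy ** 2 + sz ** 2)
    coeffs[s > 1.0 / d_min] = 0.0          # drop reflections beyond the cutoff
    # The mask is symmetric in s, so the result is real up to rounding error.
    return np.fft.ifftn(coeffs).real

# Random toy grid (hypothetical data), 1 Angstrom spacing.
rng = np.random.default_rng(0)
grid = rng.random((16, 16, 16))
smoothed = truncate_resolution(grid, spacing=1.0, d_min=2.8)
print(smoothed.shape, float(grid.mean()), float(smoothed.mean()))
```

The zero-frequency term is kept, so the mean density is unchanged; only detail finer than the cutoff is removed.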
Details of Matching Process • Feature-based matching: • Euclidean distance metric between feature vectors. • $\mathrm{dist}(R_1,R_2) = \sum_i w_i\,(F_i(R_1) - F_i(R_2))^2$ • Must weight features by relevance • less-relevant features add noise • Slider algorithm: optimize weights by comparing features in matching regions versus mismatches • Verify selections by density correlation • requires search for optimal rotation
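The two-step matching described here (rank by weighted feature distance, then verify by density correlation) can be sketched as follows. The feature vectors are the leading four components shown on the earlier slide, truncated and used only as toy data; the region labels, the unit weights, and the function names are assumptions. The Slider weight optimization and the search for the optimal rotation are not reproduced.

```python
import math

def weighted_distance(f1, f2, weights):
    """dist(R1, R2) = sum_i w_i * (F_i(R1) - F_i(R2))^2"""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, f1, f2))

def closest_matches(query, database, weights, k=1):
    """Return the k database entries (label, features) nearest to the query."""
    ranked = sorted(database, key=lambda entry: weighted_distance(query, entry[1], weights))
    return ranked[:k]

def density_correlation(d1, d2):
    """Pearson correlation of two density samples on the same, already aligned grid.

    Raises ZeroDivisionError if either sample has constant density.
    """
    n = len(d1)
    m1, m2 = sum(d1) / n, sum(d2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(d1, d2))
    v1 = sum((a - m1) ** 2 for a in d1)
    v2 = sum((b - m2) ** 2 for b in d2)
    return cov / math.sqrt(v1 * v2)

# Toy data: truncated feature vectors from the slide, hypothetical labels.
query = (1.72, -0.39, 1.04, 1.55)
database = [
    ("region_B", (1.58, 0.18, 1.09, -0.25)),
    ("region_C", (0.90, 0.65, -1.40, 0.87)),
    ("region_D", (1.79, -0.43, 0.88, 1.52)),
]
weights = (1.0, 1.0, 1.0, 1.0)  # placeholder; in TEXTAL the weights are optimized
best = closest_matches(query, database, weights, k=1)
print(best[0][0])
```

In practice the feature ranking would return several candidates, and the density correlation (computed after rotating each candidate into the best orientation) would pick among them.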
Post-Processing Routines • Imperfections in the initial model: • backbone atoms not necessarily juxtaposed between adjacent residues, or in same direction • side-chains occasionally “flipped” into backbone • residue identities often incorrect (based on dens.) • Fixing “flips” and direction - take candidate match with next highest correlation • Real-space refinement: regularizes backbone • Use sequence alignment to fix identities?
New Results on Real MAD Maps aCZRA: missed a 5-res loop (weak density) and C-terminus bM01: missed a 17-res helix, 9 deletions, 5 due to breaks, 3-res false backbone
Analysis of Amino Acid Types • [Figure: confusion matrix for CZRA; axes: amino acid in true structure vs. amino acid in TEXTAL model]