The TEXTAL System: Automated Model-Building Using Pattern Recognition Techniques

Dr. Thomas R. Ioerger, Department of Computer Science, Texas A&M University
Collaboration with: Dr. James C. Sacchettini, Center for Structural Biology, Texas A&M University
With support from: National Institutes of Health

Automated Structure Determination
• Key step to high-throughput Structural Genomics, structure-based drug design, etc.
• Many computational tools to generate a map, but...
• Given an electron density map, how to extract atomic coordinates automatically?
• Currently requires humans (+ O): potential bottleneck
• Sources of difficulty: complexity, low resolution, phase errors, weak density
• Related methods: Shake&Bake, ARP/wARP, X-Powerfit, template convolution...

Overview of TEXTAL
• Apply pattern recognition techniques
• Exploit database of previously-solved maps
• Model molecular structures in local regions (e.g. spheres of 5 Å radius)
• Intuitive principles:
  1) Have I ever seen a region with a pattern of density like this before?
  2) If so, what were the previous local atomic coordinates?

Overview (cont’d)
• Divide-and-Conquer:
  1) identify alpha-carbon positions (chain-tracing)
  2) model regions around alpha-carbons (CAs), including backbone and side-chain atoms
  3) concatenate local models back together, resolve any conflicts
• Database contains many regions centered on CAs from previous maps
• ~5 Å radius is about right for “structural repetition”

Main Stages of TEXTAL
[Flowchart: electron density map → CAPRA → C-alpha chains → LOOKUP (build in side-chain and main-chain atoms locally around each CA) → model (initial coordinates) → post-processing routines (example: real-space refinement) → model (final coordinates). Side boxes: Human Crystallographer (editing); Reciprocal-space refinement/ML DM.]

Feature Extraction
• Database: ~10^5 regions from ~100 maps
• How to identify the closest match (efficiently)?
• Calculate numerical features that represent the pattern in each region
• Must be rotation-invariant
• Search can be very fast: just compare features

F=<1.72,-0.39,1.04,1.55...>
F=<1.58,0.18,1.09,-0.25...>
F=<0.90,0.65,-1.40,0.87...>
F=<1.79,-0.43,0.88,1.52...>

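Rotation-Invariant Features
• Average density: $m = \frac{1}{n}\sum_i \rho_i$, where $\rho_i$ is the density at each lattice point in the region
• Other statistical features: standard deviation, kurtosis…
• Distance to center of mass:
  • $\langle x_c, y_c, z_c \rangle = \frac{1}{n}\left\langle \sum_i x_i\rho_i/m,\ \sum_i y_i\rho_i/m,\ \sum_i z_i\rho_i/m \right\rangle$
  • $d_{cen} = \sqrt{x_c^2 + y_c^2 + z_c^2}$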
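The statistical features above can be sketched in a few lines of Python. This is a minimal illustration, not TEXTAL's code: the function name `basic_features` and the input layout (a list of `(x, y, z, density)` tuples with coordinates measured from the region center) are assumptions made for the example.

```python
import math

def basic_features(points):
    """Mean density, standard deviation, and distance to the center of mass.

    points: list of (x, y, z, density) tuples; coordinates are assumed to be
    relative to the center of the region (hypothetical input layout).
    """
    n = len(points)
    densities = [rho for _, _, _, rho in points]
    mean = sum(densities) / n
    std_dev = math.sqrt(sum((rho - mean) ** 2 for rho in densities) / n)

    # (1/n) * sum(x_i * rho_i) / m is the same as sum(x_i * rho_i) / sum(rho_i)
    total = sum(densities)
    xc = sum(x * rho for x, _, _, rho in points) / total
    yc = sum(y * rho for _, y, _, rho in points) / total
    zc = sum(z * rho for _, _, z, rho in points) / total
    d_cen = math.sqrt(xc * xc + yc * yc + zc * zc)
    return mean, std_dev, d_cen

# Toy region: four lattice points with made-up densities.
region = [(1.0, 0.0, 0.0, 2.0), (0.0, 1.0, 0.0, 1.0),
          (0.0, 0.0, 1.0, 1.0), (-1.0, 0.0, 0.0, 0.5)]
# The same region rotated 90 degrees about the z axis.
rotated = [(-y, x, z, rho) for x, y, z, rho in region]
print(basic_features(region))
print(basic_features(rotated))
```

Both calls give the same three numbers, which is the point of rotation invariance: none of these features depends on how the region is oriented.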

More Features
• Moments of inertia
  • measure dispersion around axes of symmetry in a density distribution
  • calculate 3x3 inertia matrix
  • diagonalize to get eigenvalues
  • sort from largest to smallest
  • take magnitudes and ratios of moments
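The moment-of-inertia steps can be illustrated as follows. This is a sketch under stated assumptions: density plays the role of mass, the tensor is taken about the region center (the slides do not say which origin is used), and the eigenvalues come from the standard closed-form solution for symmetric 3x3 matrices so that no external library is needed. All function names are hypothetical.

```python
import math

def inertia_matrix(points):
    """3x3 inertia tensor of (x, y, z, density) points about the origin."""
    ixx = iyy = izz = ixy = ixz = iyz = 0.0
    for x, y, z, rho in points:
        ixx += rho * (y * y + z * z)
        iyy += rho * (x * x + z * z)
        izz += rho * (x * x + y * y)
        ixy -= rho * x * y
        ixz -= rho * x * z
        iyz -= rho * y * z
    return [[ixx, ixy, ixz], [ixy, iyy, iyz], [ixz, iyz, izz]]

def symmetric_eigenvalues(a):
    """Eigenvalues of a symmetric 3x3 matrix, sorted largest to smallest."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return sorted((a[0][0], a[1][1], a[2][2]), reverse=True)
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
         for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    e2 = 3.0 * q - e1 - e3
    return [e1, e2, e3]

def moment_features(points):
    """Sorted moments plus ratios of the smaller moments to the largest."""
    e1, e2, e3 = symmetric_eigenvalues(inertia_matrix(points))
    if e1 == 0.0:
        return e1, e2, e3, 0.0, 0.0
    return e1, e2, e3, e2 / e1, e3 / e1

# Toy rod of density along the x = y diagonal.
print(moment_features([(1.0, 1.0, 0.0, 1.0), (-1.0, -1.0, 0.0, 1.0)]))
```

Because eigenvalues do not change when the coordinate frame is rotated, the sorted moments and their ratios describe the shape of the density regardless of orientation.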

More Features
• Spoke angles
  • if region centered on CA, should have 3 “spokes” of density emanating from center
  • find best-fit vectors; calc. angles among them
• Surface area of contours
• Connectivity of density/bones in region
• Other geometrical features...
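As an illustration of the spoke-angle feature, the sketch below computes the three pairwise angles among spoke vectors that are assumed to have been fitted already; the fitting step itself is not shown, and the function names are hypothetical. Sorting the angles makes the result independent of the order in which the spokes are listed.

```python
import math

def angle_between(u, v):
    """Angle in degrees between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))

def spoke_angles(spokes):
    """Sorted pairwise angles among three spoke vectors."""
    a, b, c = spokes
    return sorted([angle_between(a, b), angle_between(a, c), angle_between(b, c)])

# Toy spokes (made-up directions, not fitted to real density).
print(spoke_angles([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]))
```

Angles between vectors are preserved under rotation, so this feature satisfies the rotation-invariance requirement.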

Feature Weights

CAPRA: C-Alpha Pattern-Recognition Algorithm
[Flowchart: map → density trace → pseudo atoms → neural network → predictions of distance to true CA → linking into C-alpha chains → C-alpha coordinates]
• Tracer: remove lattice points from map (lowest density first) without breaking connectivity
• Neural network: for each pseudo atom, extract features, input to network, predict distances to CAs (1:10 in trace); trained on example points in real maps
• Linking: desire long chains, good CA predictions (not in side-chains), “structurally plausible” (e.g. linear, helical)

Example of the CAPRA Process

Example of CAPRA chains

The LOOKUP Process

Database Construction
• Ideally would use solved MAD/MIR maps
• Using “back-transformed” maps works well
  • PDB → structure factors (include B-factors)
  • keep reflections down to 2.8 Å
  • Fourier transform → electron density map
• 50 proteins from PDBSelect (non-homologous)
• about 50,000 regions
• Feature extraction done offline
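The back-transformation idea can be shown on a one-dimensional toy cell. This is only a sketch of the principle, not the procedure used to build the TEXTAL database: real maps are three-dimensional and use proper atomic scattering factors and symmetry, whereas here each atom is a made-up `(fractional_x, scattering_weight, b_factor)` triple and all names are hypothetical. The structure factors are computed from the coordinates with a B-factor damping term, reflections with spacing below the resolution limit are dropped, and the density is recovered by a Fourier sum.

```python
import cmath
import math

def structure_factors_1d(atoms, cell_length, d_min):
    """Return {h: F(h)}, keeping reflections with spacing d = a/|h| >= d_min."""
    h_max = int(cell_length / d_min)
    factors = {}
    for h in range(-h_max, h_max + 1):
        inv_d_sq = (h / cell_length) ** 2  # 1/d^2 for this reflection
        total = 0j
        for x, weight, b_factor in atoms:
            damping = math.exp(-b_factor * inv_d_sq / 4.0)  # B-factor term
            total += weight * damping * cmath.exp(2j * math.pi * h * x)
        factors[h] = total
    return factors

def density_1d(factors, cell_length, n_points):
    """Fourier-sum the structure factors back into density on a regular grid."""
    rho = []
    for k in range(n_points):
        x = k / n_points
        value = sum(f * cmath.exp(-2j * math.pi * h * x)
                    for h, f in factors.items())
        rho.append(value.real / cell_length)
    return rho

# Toy cell of 30 Å with two atoms, truncated at 2.8 Å as on the slide.
toy_atoms = [(0.25, 6.0, 20.0), (0.60, 8.0, 20.0)]
factors = structure_factors_1d(toy_atoms, cell_length=30.0, d_min=2.8)
density = density_1d(factors, cell_length=30.0, n_points=100)
print(len(factors), "reflections kept")
```

Lowering `d_min` keeps more reflections and sharpens the peaks; raising it broadens them, which is how a back-transformed map can be made to resemble an experimental map at a given resolution.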

Details of Matching Process
• Feature-based matching:
  • Euclidean distance metric between feature vectors
  • $\mathrm{dist}(R_1, R_2) = \sum_i w_i \left(F_i(R_1) - F_i(R_2)\right)^2$
• Must weight features by relevance
  • less-relevant features add noise
  • Slider algorithm: optimize weights by comparing features in matching regions versus mismatches
• Verify selections by density correlation
  • requires search for optimal rotation
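A minimal sketch of the feature-based matching step follows. It ranks database regions by the weighted sum of squared feature differences given above and includes a plain Pearson correlation as a stand-in for the density-correlation check. The function names, region labels, and unit weights are assumptions for illustration; the Slider weight optimization and the rotation search are not shown. The feature vectors are the first four components of the examples shown earlier in the slides, which are truncated there.

```python
import math

def weighted_distance(f1, f2, weights):
    """Weighted sum of squared feature differences between two regions."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, f1, f2))

def top_matches(query, database, weights, k=3):
    """Return the ids of the k database regions closest to the query.

    database: list of (region_id, feature_vector) pairs (hypothetical layout).
    """
    ranked = sorted(database,
                    key=lambda entry: weighted_distance(query, entry[1], weights))
    return [region_id for region_id, _ in ranked[:k]]

def density_correlation(a, b):
    """Pearson correlation of two density samples on already-aligned grids."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

query = (1.72, -0.39, 1.04, 1.55)
database = [
    ("region_B", (1.58, 0.18, 1.09, -0.25)),
    ("region_C", (0.90, 0.65, -1.40, 0.87)),
    ("region_D", (1.79, -0.43, 0.88, 1.52)),
]
unit_weights = (1.0, 1.0, 1.0, 1.0)  # placeholder; real weights would be tuned
print(top_matches(query, database, unit_weights, k=2))
```

Comparing short feature vectors is cheap, so the expensive correlation check only needs to run on the few top-ranked candidates.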

Post-Processing Routines
• Imperfections in the initial model:
  • backbone atoms not necessarily juxtaposed between adjacent residues, or in same direction
  • side-chains occasionally “flipped” into backbone
  • residue identities often incorrect (based on density)
• Fixing “flips” and direction: take candidate match with next highest correlation
• Real-space refinement: regularizes backbone
• Use sequence alignment to fix identities?

New Results on Real MAD Maps
(a) CZRA: missed a 5-res loop (weak density) and C-terminus
(b) M01: missed a 17-res helix, 9 deletions, 5 due to breaks, 3-res false backbone

Histograms of Distances Between Matched Atoms

Analysis of Amino Acid Types
• Confusion matrix for CZRA [axes: amino acid in true structure; amino acid in TEXTAL model]
