Leveraging Genetic Algorithm and Neural Networks in Automated Protein Crystal Recognition

Leveraging Genetic Algorithm and Neural Networks in Automated Protein Crystal Recognition. Ming Jack Po and Andrew Laine, Department of Biomedical Engineering, Columbia University, New York, NY, USA. August 22nd, 2008. IEEE EMBS Annual Conference 2008, Vancouver, Canada.

Agenda • Introduction • Current Algorithm • Future Direction

Protein Structure Determination currently relies on X-ray crystallography • The production of protein crystals is crucial to protein structure determination via X-ray crystallography. • In 2000, the US National Institute of General Medical Sciences of the National Institutes of Health funded the Protein Structure Initiative (PSI), a ten-year project to uncover the three-dimensional shapes of a wide range of proteins (see reference 1). • Unfortunately, there is currently no reliable methodology for predicting the environments that lead to protein crystallization. • High-throughput experiments with varying crystallization parameters are being performed in order to “brute force” the problem. 1) http://www.nature.com/nmeth/journal/v5/n2/full/nmeth0208-203.html

HTP Protein Crystallization Screening is currently the bottleneck in protein crystal discovery • An extensive backlog of images has accumulated • 1536 Wells / Plate * 5K Plates * 6 time points ~ 46M Images* • Manual inspection of images from HTP experiments is not practical • Qualified and trained crystallographers are in short supply. • Crystallographers cannot keep up with the speed of the robotic systems used in production experiments. • Automated Protein Crystallization Screening is needed to handle both the existing backlog of images and future images * Feb 2002 to October 2006 only

Several key challenges have to be overcome for automated protein crystal recognition • Arbitrary geometric orientation and structure of crystals • Presence of organic matter • Non-uniform lighting conditions • Irregular droplet boundaries and size.

Our Solution to the problem – Neural Networks • Advantages • Allows for incremental learning • Can deal with the seemingly arbitrary geometric orientation and structure of crystals • Fast classification speed once neural net has been trained. • Disadvantages • Black-box methodology • Identification of good feature set necessary for good performance • Need sufficiently large training set to be robust

Training database has been compiled by HWI expert crystallographers • Dr. George DeTitta et al. at HWI (Buffalo, NY) have compiled a data set of 73,632 manually classified images. • 3 independent crystallographers each categorized 75,000 images into one of a fixed set of categories. • 73,632 of these images have consensus between at least two crystallographers. Only these images were used for validation and training.

Pre-Processing Steps • Pipeline stages: Image Normalization → MPGA 1 (ROI Detection) → MPGA 2 (Area of Crystals) → Linearity Detector → Laplacian Pyramidal Decomposition → Feature Extraction • Images are converted to Sobel edge sets and single edge points are removed. • A multi-population genetic algorithm is run on the image to find an ellipsoidal Region of Interest (elaborated upon on the next few slides).
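The edge step described above can be sketched in a few lines. This is a minimal illustration, not the authors' C++ code: the function names `sobel_edges` and `remove_isolated`, the gradient-magnitude threshold, and the rule that a "single edge point" is an edge pixel with no edge pixel among its 8 neighbours are all assumptions made for the example.

```python
def sobel_edges(img, threshold):
    """Return a binary edge map (lists of 0/1) from a grayscale image.

    img is a list of equal-length rows of numbers. Border pixels are
    left as non-edges because the 3x3 Sobel kernel does not fit there.
    """
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal and vertical Sobel responses.
            gx = (img[y-1][x+1] + 2*img[y][x+1] + img[y+1][x+1]
                  - img[y-1][x-1] - 2*img[y][x-1] - img[y+1][x-1])
            gy = (img[y+1][x-1] + 2*img[y+1][x] + img[y+1][x+1]
                  - img[y-1][x-1] - 2*img[y-1][x] - img[y-1][x+1])
            if (gx*gx + gy*gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges


def remove_isolated(edges):
    """Drop edge pixels that have no edge pixel among their 8 neighbours."""
    h, w = len(edges), len(edges[0])
    out = [row[:] for row in edges]
    for y in range(h):
        for x in range(w):
            if not edges[y][x]:
                continue
            neighbours = sum(
                edges[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x))
            if neighbours == 0:
                out[y][x] = 0
    return out
```

The surviving edge pixels would then serve as the candidate points from which the genetic algorithm draws its chromosomes.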

Multiple Population Genetic Algorithm • Randomly select 100 “chromosomes” of 5 points each. • Fitness is based on a similarity metric and a distance metric. • Similarity = [equation missing] • Distance = [equation missing] • Evolution proceeds through selection and diversification. • Optimize for a high fitness score based on a combination of the similarity and distance scores. • Selection eliminates low-fitness populations. • Diversification is realized through crossover, mutation and clustering. • Significant speed and accuracy improvements vs. Randomized Hough Transforms. • Processing time dropped 50% to ~ 10 seconds for ROI detection. Reference: Yao, J., Kharma, N., and Grogono, P., "A multi-population genetic algorithm for robust and fast ellipse detection", Pattern Analysis & Applications, Volume 8, Issue 1-2, Sep 2005, pp. 149-162
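To make the selection and diversification loop concrete, here is a simplified single-population sketch. It is not the multi-population algorithm of Yao et al.: clustering and multiple populations are omitted, the function names and the `keep` and `rate` parameters are hypothetical, and the fitness function is left to the caller because the similarity and distance formulas are not given in the slide text.

```python
import random


def init_population(edge_points, size=100, k=5, rng=random):
    """Each chromosome is k distinct edge points sampled at random."""
    return [rng.sample(edge_points, k) for _ in range(size)]


def crossover(a, b, rng=random):
    """Single-point crossover: head of a joined to tail of b."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]


def mutate(chrom, edge_points, rate=0.1, rng=random):
    """Replace each point by a random edge point with probability `rate`."""
    return [rng.choice(edge_points) if rng.random() < rate else p
            for p in chrom]


def evolve(population, edge_points, fitness, keep=0.5, rng=random):
    """One generation: keep the fittest fraction, refill with offspring.

    `fitness` maps a chromosome to a score (higher is better); it is a
    placeholder for the combined similarity/distance score.
    """
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:max(2, int(len(ranked) * keep))]
    children = []
    while len(survivors) + len(children) < len(population):
        a, b = rng.sample(survivors, 2)
        children.append(mutate(crossover(a, b, rng), edge_points, rng=rng))
    return survivors + children
```

Because the survivors are carried over unchanged, the best score in the population never decreases from one generation to the next in this sketch.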

Ellipsoidal Geometry • The equation of a conic through 5 points is $ax^2 + 2hxy + by^2 + 2gx + 2fy + 1 = 0$ • This conic is an ellipse iff $ab - h^2 > 0$ • With 5 (x,y) pairs, it is possible to solve for the parameters (a,h,b,g,f), and thus in turn for the physically meaningful ellipse parameters (center, r_maj, r_min and orientation).
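The five-point solve can be worked through numerically. The sketch below assumes the conic is written as a x^2 + 2h xy + b y^2 + 2g x + 2f y + 1 = 0, with the constant term normalized to 1 (an assumption that fails for conics passing through the origin), and tests ab - h^2 > 0 for an ellipse. The function names are hypothetical, and orientation is omitted for brevity.

```python
import math


def solve_linear(A, rhs):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(rhs)
    M = [list(row) + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("points do not determine a unique conic")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def conic_through(points):
    """Return (a, h, b, g, f) for the conic through five (x, y) points."""
    A = [[x * x, 2 * x * y, y * y, 2 * x, 2 * y] for x, y in points]
    return solve_linear(A, [-1.0] * 5)


def ellipse_params(points):
    """Return center and semi-axes, or None if the conic is not an ellipse."""
    a, h, b, g, f = conic_through(points)
    det = a * b - h * h
    if det <= 0:
        return None  # parabola or hyperbola
    # Center: where both partial derivatives of the conic vanish.
    xc = (h * f - b * g) / det
    yc = (g * h - a * f) / det
    # Value of the conic at the center; the centered form is
    # a*X^2 + 2h*X*Y + b*Y^2 = -c0.
    c0 = g * xc + f * yc + 1.0
    mean = (a + b) / 2.0
    spread = math.hypot((a - b) / 2.0, h)
    # Eigenvalues of [[a, h], [h, b]] are mean +/- spread.
    ratios = [-c0 / (mean + spread), -c0 / (mean - spread)]
    if min(ratios) <= 0:
        return None  # no real points satisfy the equation
    radii = sorted(math.sqrt(r) for r in ratios)
    return {"center": (xc, yc), "r_maj": radii[1], "r_min": radii[0]}
```

For five points sampled from an ellipse centered at (3, 2) with semi-axes 2 and 1, `ellipse_params` should recover those values up to floating-point error. Five points on the hyperbola xy = 1 should yield `None`.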

MPGA is run twice, with different fitness criteria for each pass • Similarity = [equation missing] • Distance = [equation missing] • The multi-population genetic algorithm gives significantly faster and more robust search results than the Randomized Hough Transform. • MPGA 1 – ROI Detection • Heavy distance penalties for points that do not line up exactly on the perimeter of the projected ellipse. • Looks for r_maj close to r_min (more circular shapes: droplets, wells). • r_maj and r_min are bounded at empirically determined values. • MPGA 2 – Crystal Detection • Only run inside the ROI • Heavy distance penalties only for far-away points, which allows the ellipsoidal shape to be more “flexible”. • Looks for r_maj far from r_min (more elongated ellipses, closer to crystals). • r_maj and r_min are bounded above by ½ of the ROI’s r_maj and r_min.

Crystal Recognition Code Execution Speed * Not scale invariant, and done on original scale

Performance for current algorithm • Performance metrics derived using a 10% randomized holdout, averaged over 3 iterations. • Current false negative rate ~ 10%. • Working to reduce this to below 5% at minimum before putting it into production.* • Current false negatives are total misses, so they cannot be corrected through thresholding. There is also no intuitive visual pattern among them. • Current true negative rate ~ 99%. * Based on conversations with John Hunt
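The evaluation protocol above (10% randomized holdout, averaged over 3 iterations) can be sketched as follows. The slides do not define the rates, so the standard definitions are assumed here: false negative rate = FN / (FN + TP) and true negative rate = TN / (TN + FP). The function name `holdout_rates` and the `train_fn` interface are hypothetical and stand in for the actual Matlab neural network training.

```python
import random


def holdout_rates(samples, labels, train_fn, holdout=0.10, iterations=3, seed=0):
    """Average false negative and true negative rates over random holdouts.

    train_fn(train_samples, train_labels) must return a predict(sample)
    callable that yields True for "crystal" and False otherwise.
    """
    rng = random.Random(seed)
    fn_rates, tn_rates = [], []
    for _ in range(iterations):
        idx = list(range(len(samples)))
        rng.shuffle(idx)
        n_test = max(1, int(len(idx) * holdout))
        test, train = idx[:n_test], idx[n_test:]
        predict = train_fn([samples[i] for i in train],
                           [labels[i] for i in train])
        tp = fn = tn = fp = 0
        for i in test:
            predicted = bool(predict(samples[i]))
            if labels[i]:
                if predicted:
                    tp += 1
                else:
                    fn += 1
            else:
                if predicted:
                    fp += 1
                else:
                    tn += 1
        # Skip a rate for this split if its denominator is empty.
        if tp + fn:
            fn_rates.append(fn / (tp + fn))
        if tn + fp:
            tn_rates.append(tn / (tn + fp))

    def mean(values):
        return sum(values) / len(values) if values else None

    return {"false_negative_rate": mean(fn_rates),
            "true_negative_rate": mean(tn_rates)}
```

A classifier that never reports a crystal would score a true negative rate of 1.0 and a false negative rate of 1.0 under these definitions, which is why the two figures are reported together.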

Future Directions • Incremental neural network training has been implemented in Matlab. • Allows us to learn new crystal shapes and precipitate. Negligible performance hit. • Porting the simulation portion of the network classifier to C++. • The current program consists of • Preprocessing done in C++ inside the IT++ framework • Neural Network Toolbox in Matlab • Currently working on making new training data sets. • Selectively biasing the training data set in order to increase accuracy. • Expansion of feature sets in order to improve false negative rates. Reference: Bishop, C. Neural Networks for Pattern Recognition.

Acknowledgements • This project is part of the Northeast Structural Genomics Consortium (NESG), sponsored by the NIH to evaluate the feasibility, costs, economies of scale, and value of structural genomics. • Protein crystal images were acquired from the Hauptman-Woodward Medical Research Institute, Buffalo, NY.
