Leveraging Genetic Algorithm and Neural Networks in Automated Protein Crystal Recognition
Ming Jack Po and Andrew Laine, Department of Biomedical Engineering, Columbia University, New York, NY, USA
August 22nd, 2008. IEEE EMBS Annual Conference 2008, Vancouver, Canada
Agenda • Introduction • Current Algorithm • Future Direction
Protein Structure Determination currently relies on X-ray crystallography
• The production of protein crystals is crucial to protein structure determination via X-ray crystallography.
• In 2000, the US National Institute of General Medical Sciences of the National Institutes of Health funded the Protein Structure Initiative (PSI), a ten-year project to uncover the three-dimensional shapes of a wide range of proteins (1).
• Unfortunately, there is currently no reliable methodology for predicting the environments that lead to protein crystallization.
• High-throughput experiments with varying crystallization parameters are being performed in order to “brute force” the problem.
1) http://www.nature.com/nmeth/journal/v5/n2/full/nmeth0208-203.html
HTP Protein Crystallization Screening is currently the bottleneck in protein crystal discovery
• An extensive backlog of images has accumulated.
• 1536 wells/plate × 5K plates × 6 time points ≈ 46M images*
• Manual inspection of images from HTP experiments is not practical.
• Qualified and trained crystallographers are in short supply.
• Crystallographers cannot keep up with the speed of the robotic systems used in production experiments.
• Automated protein crystallization screening is needed to tackle both the existing backlog of images and future images.
* Feb 2002 to October 2006 only
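The backlog estimate on this slide is a simple product, reproduced below as a small sketch. The `review_years` helper and all of its parameters (seconds per image, working hours, working days) are hypothetical and not taken from the slides; they are only there to let a reader plug in their own inspection rate.

```python
def backlog_images(wells_per_plate, plates, time_points):
    """Total number of images: one image per well, per plate, per time point."""
    return wells_per_plate * plates * time_points


def review_years(total_images, seconds_per_image, hours_per_day=8.0, days_per_year=250):
    """Person-years needed to inspect every image by hand.

    All rate parameters are hypothetical placeholders, not figures from the talk.
    """
    seconds_per_year = 3600 * hours_per_day * days_per_year
    return total_images * seconds_per_image / seconds_per_year


# Figures quoted on the slide: 1536 wells/plate, ~5K plates, 6 time points.
total_images = backlog_images(1536, 5_000, 6)
print(f"{total_images:,} images")  # prints "46,080,000 images"
```

The exact product is 46.08 million, which the slide rounds to 46M.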
Several key challenges have to be overcome for automated protein crystal recognition
• Arbitrary geometric orientation and structure of crystals
• Presence of organic matter
• Non-uniform lighting conditions
• Irregular droplet boundaries and sizes
[Figure: example images labeled “Hits”]
Our Solution to the problem – Neural Networks
• Advantages
  • Allows for incremental learning
  • Can deal with the seemingly arbitrary geometric orientation and structure of crystals
  • Fast classification once the neural net has been trained
• Disadvantages
  • Black-box methodology
  • Identification of a good feature set is necessary for good performance
  • Needs a sufficiently large training set to be robust
Training database has been compiled by HWI expert crystallographers
• Dr. George DeTitta et al. at HWI (Buffalo, NY) have compiled a data set of 73,632 manually classified images.
• 3 independent crystallographers each categorized 75,000 images into one of the categories shown on the slide.
• 75,632 of these images have consensus between at least two crystallographers. Only these images were used for validation and training.
Agenda • Introduction • Current Algorithm • Future Direction
Pre-Processing Steps
[Pipeline diagram: Image Normalization; MPGA 1 – ROI Detection; MPGA 2 – Area of Crystals; Linearity Detector; Laplacian Pyramidal Decomposition; Feature Extraction]
• Images are converted to Sobel edge sets, and single (isolated) edge points are removed.
• A multi-population genetic algorithm (MPGA) is run on the image to find an ellipsoidal region of interest (ROI); this is elaborated upon in the next few slides.
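The first pre-processing step (Sobel edge map, then removal of single edge points) can be sketched as follows. This is a minimal NumPy illustration, not the authors' C++ code: the gradient-magnitude `threshold` and the 8-neighbour definition of a "single" edge point are assumptions, since the slides do not specify either.

```python
import numpy as np


def sobel_edges(image, threshold):
    """Boolean edge map: Sobel gradient magnitude above `threshold` (assumed rule)."""
    img = np.asarray(image, dtype=float)
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    # Accumulate the 3x3 kernel response one offset at a time.
    for dy in range(3):
        for dx in range(3):
            window = padded[dy:dy + h, dx:dx + w]
            gx += kx[dy, dx] * window
            gy += ky[dy, dx] * window
    return np.hypot(gx, gy) > threshold


def remove_isolated_points(edges):
    """Drop edge pixels that have no other edge pixel among their 8 neighbours."""
    e = np.asarray(edges, dtype=bool)
    h, w = e.shape
    padded = np.pad(e, 1, mode="constant", constant_values=False).astype(int)
    neighbours = np.zeros((h, w), dtype=int)
    for dy in range(3):
        for dx in range(3):
            if dy == 1 and dx == 1:
                continue  # skip the pixel itself
            neighbours += padded[dy:dy + h, dx:dx + w]
    return e & (neighbours > 0)
```

The surviving edge pixels would then serve as the candidate points from which the genetic algorithm draws its 5-point chromosomes.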
Multiple Population Genetic Algorithm
• Randomly select 100 “chromosomes” of 5 points each.
• Fitness is based on a similarity metric and a distance metric.
• Similarity = [equation not preserved]; Distance = [equation not preserved]
• Evolution proceeds through selection and diversification.
• Optimize for a high fitness score based on a combination of the similarity and distance scores.
• Selection eliminates low-fitness populations.
• Diversification is realized through crossover, mutation, and clustering.
• Significant speed and accuracy improvements vs. Randomized Hough Transforms.
• Processing time dropped 50% to ~10 seconds for ROI detection.
Yao, J., Kharma, N., and Grogono, P., "A multi-population genetic algorithm for robust and fast ellipse detection", Pattern Analysis & Applications, Volume 8, Issues 1-2, Sep 2005, pp. 149-162.
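Ellipsoidal Geometry
• The equation of a conic through 5 points is $a x^2 + 2 h x y + b y^2 + 2 g x + 2 f y + 1 = 0$ (the general conic with its constant term normalized to 1, which is the form implied by the five parameters (a, h, b, g, f)).
• This conic is an ellipse iff $ab - h^2 > 0$.
• With 5 (x, y) pairs, it is possible to solve for the parameters (a, h, b, g, f), and in turn for the physically meaningful ellipse parameters shown in the slide figure.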
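The five-point fit can be sketched as a small linear solve. The sketch assumes the standard conic form a·x² + 2h·xy + b·y² + 2g·x + 2f·y + 1 = 0, with the constant term normalized to 1; this normalization is an assumption that fits the five parameters named on the slide, and it fails for conics passing through the origin. The function names are illustrative only.

```python
import numpy as np


def conic_through_points(points):
    """Solve a x^2 + 2h xy + b y^2 + 2g x + 2f y + 1 = 0 for (a, h, b, g, f)."""
    pts = np.asarray(points, dtype=float)
    if pts.shape != (5, 2):
        raise ValueError("expected exactly five (x, y) points")
    x, y = pts[:, 0], pts[:, 1]
    design = np.column_stack([x**2, 2 * x * y, y**2, 2 * x, 2 * y])
    # Move the constant term (assumed to be 1) to the right-hand side.
    return np.linalg.solve(design, -np.ones(5))


def is_ellipse(a, h, b):
    """Ellipse test on the quadratic part: ab - h^2 > 0."""
    return a * b - h * h > 0


def ellipse_geometry(a, h, b, g, f):
    """Return (centre, r_maj, r_min) for a conic that passes `is_ellipse`."""
    quad = np.array([[a, h], [h, b]], dtype=float)
    centre = np.linalg.solve(quad, -np.array([g, f], dtype=float))
    # Value of the conic polynomial at the centre (constant term is 1).
    value_at_centre = g * centre[0] + f * centre[1] + 1.0
    radii_sq = -value_at_centre / np.linalg.eigvalsh(quad)
    if np.any(radii_sq <= 0):
        raise ValueError("parameters do not describe a real ellipse")
    r_min, r_maj = np.sort(np.sqrt(radii_sq))
    return centre, r_maj, r_min
```

A candidate chromosome whose five points fail `is_ellipse` can be rejected before any fitness is computed, and `r_maj` and `r_min` give the quantities that the two MPGA passes constrain.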
MPGA is run twice due to variations in fitness criteria
• Similarity = [equation not preserved]; Distance = [equation not preserved]
• The multiple population genetic algorithm gives significantly faster and more robust search results than the Randomized Hough Transform.
• MPGA 1 – ROI Detection
  • Heavy distance penalties for points that do not lie exactly on the perimeter of the projected ellipse.
  • Looks for r_maj close to r_min (more circular shapes: droplets, wells).
  • r_maj and r_min are bounded at empirically determined values.
• MPGA 2 – Crystal Detection
  • Only run inside the ROI.
  • Heavy distance penalties only for far-away points, allowing the ellipsoidal shape to be more “flexible”.
  • Looks for r_maj far from r_min (more elongated ellipses, closer to crystals).
  • r_maj and r_min are bounded by no more than ½ of the ROI’s r_maj and r_min.
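The contrast between the two passes can be illustrated with a toy fitness function. The sketch below does not use the similarity and distance formulas from the slides or from Yao et al., which are not preserved in this transcript. Its stand-ins are assumptions: similarity is the count of edge points within `tol` of the perimeter divided by the approximate circumference, and distance is the mean radial offset of those points. The parameters `tol`, `distance_weight`, `target_axis_ratio`, and `ratio_weight` are hypothetical knobs.

```python
import numpy as np


def radial_distances(points, centre, r_maj, r_min, theta=0.0):
    """Offset of each point from the perimeter, measured along the ray from the centre."""
    p = np.asarray(points, dtype=float) - np.asarray(centre, dtype=float)
    c, s = np.cos(theta), np.sin(theta)
    u = c * p[:, 0] + s * p[:, 1]    # coordinate along the major axis
    v = -s * p[:, 0] + c * p[:, 1]   # coordinate along the minor axis
    d = np.hypot(u, v)
    scale = np.hypot(u / r_maj, v / r_min)  # equals 1 exactly on the perimeter
    on_ray = np.divide(d, scale, out=np.zeros_like(d), where=scale > 0)
    dist = np.abs(d - on_ray)
    dist[scale == 0] = r_min  # a point at the centre: use the nearest perimeter distance
    return dist


def toy_fitness(points, centre, r_maj, r_min, theta=0.0, tol=1.5,
                distance_weight=1.0, target_axis_ratio=1.0, ratio_weight=1.0):
    """Stand-in fitness: reward perimeter support, penalise offset and axis-ratio mismatch."""
    dist = radial_distances(points, centre, r_maj, r_min, theta)
    near = dist <= tol
    # Ramanujan's approximation to the ellipse circumference.
    circumference = np.pi * (3 * (r_maj + r_min)
                             - np.sqrt((3 * r_maj + r_min) * (r_maj + 3 * r_min)))
    similarity = near.sum() / circumference
    mean_dist = dist[near].mean() if near.any() else tol
    ratio_penalty = abs(r_min / r_maj - target_axis_ratio)
    return similarity - distance_weight * mean_dist / tol - ratio_weight * ratio_penalty
```

Under these assumptions, an MPGA 1 style pass would use a small `tol`, a large `distance_weight`, and `target_axis_ratio` near 1, while an MPGA 2 style pass would use a looser `tol`, a smaller `distance_weight`, and a `target_axis_ratio` well below 1.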
Crystal Recognition Code Execution Speed
* Not scale invariant; done on the original scale
Performance for current algorithm
• Performance metrics were derived using a 10% randomized holdout, averaged over 3 iterations.
• Current false negative rate ~10%.
• Working to reduce this to below 5% at minimum before putting it into production.*
• Current false negatives are total misses, so they cannot be corrected through thresholding. There is also no intuitive visual correlation.
• Current true negative rate ~99%.
* Conversations with John Hunt
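The evaluation protocol described here (10% randomized holdout, averaged over 3 iterations) can be sketched as below. The sketch assumes that "positive" means "crystal present" and that the classifier is supplied as a `train_and_predict` callable; both the callable and the function names are hypothetical stand-ins, not the authors' Matlab code.

```python
import numpy as np


def rates(y_true, y_pred):
    """Return (false negative rate, true negative rate) for boolean labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    fn = np.sum(y_true & ~y_pred)
    tp = np.sum(y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return fnr, tnr


def repeated_holdout(features, labels, train_and_predict,
                     holdout=0.10, iterations=3, seed=0):
    """Average (FNR, TNR) over several random train/test splits."""
    features = np.asarray(features)
    labels = np.asarray(labels, dtype=bool)
    rng = np.random.default_rng(seed)
    n = len(labels)
    n_test = max(1, int(round(holdout * n)))
    results = []
    for _ in range(iterations):
        order = rng.permutation(n)
        test_idx, train_idx = order[:n_test], order[n_test:]
        predicted = train_and_predict(features[train_idx], labels[train_idx],
                                      features[test_idx])
        results.append(rates(labels[test_idx], predicted))
    mean_fnr, mean_tnr = np.nanmean(np.array(results, dtype=float), axis=0)
    return mean_fnr, mean_tnr
```

A false negative rate of about 10% together with a true negative rate of about 99% would correspond to `repeated_holdout` returning roughly `(0.10, 0.99)` under this labeling convention.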
Agenda • Introduction • Current Algorithm • Future Direction
Future Directions
• Incremental neural network training has been implemented in Matlab.
• Allows us to learn new crystal shapes and precipitate. Negligible performance hit.
• Porting the simulation portion of the network classifier to C++.
• The current program consists of:
  • Preprocessing done in C++ inside the IT++ framework
  • Neural network toolbox in Matlab
• Currently working on making new training data sets.
• Selectively biasing the training data set in order to increase accuracy.
• Expansion of feature sets in order to improve false negative rates.
Bishop, C. Neural Networks for Pattern Recognition.
Acknowledgements
• This project is part of the Northeast Structural Genomics Consortium (NESG), sponsored by the NIH for evaluating the feasibility, costs, economies of scale, and value of structural genomics.
• Protein crystal images were acquired from the Hauptman-Woodward Medical Research Institute, Buffalo, NY.