Next-Generation Bioinformatics Systems. Jelena Kovačević Center for Bioimage Informatics Department of Biomedical Engineering Carnegie Mellon University. Acknowledgments. Current PhD students. PhD students. Funding. Amina Chebira. Tad Merryman. Gowri Srinivasa. Doru Cristian Balcan.
Next-Generation Bioinformatics Systems • Jelena Kovačević • Center for Bioimage Informatics • Department of Biomedical Engineering • Carnegie Mellon University
Acknowledgments • Current PhD students: Amina Chebira, Tad Merryman, Gowri Srinivasa, Doru Cristian Balcan • PhD students: Elvira Garcia Osuna, Pablo Hennings Yeomans, Jason Thornton • Collaborators: Vijaykumar Bhagavatula, Geoff Gordon, José Moura, Bob Murphy, Markus Püschel, Marios Savvides • Undergrads: Lionel Coulot, Woon Ho Jung, Heather Kirshner • Funding
Goal • Imaging in systems biology • Use informatics to • acquire, store, manipulate and share large bioimaging databases • Leads to • automated, efficient and robust processing • Need • Host of sophisticated tools from many areas (Diagram labels: Application area, Acquisition, Knowledge Extraction, Computation)
Application Areas • Bioimaging • Current focus in biology: mapping out the protein landscape • Fluorescence microscopy used to gather data on subcellular events ► • Biometrics • Biosensing for providing security • to the financial industry • at US borders • Use a person’s biometric characteristic to identify/verify ►
Acquisition • Issues • Resolution of z-stacks and time series • Context-dependent • A slow-changing process can be acquired at coarser resolution • Changes need to be detected and reacted to • Efficiency of acquisition • Acquire only where and when needed (adaptivity) • Sample question • How can we efficiently acquire fluorescence microscopy images? ►
Knowledge Extraction • Sample questions • How can we automatically and efficiently classify proteins based on images of their subcellular locations? ► • How can we identify/verify a person’s identity based on his/her biometric characteristic? ► • Toolbox needed to solve the problem • Signal processing/data mining • Multiresolution tools allow for adaptive and efficient processing ►
Computation • The problem: fast numerical software • Hard to write fast code • Best code platform-dependent • Code becomes obsolete as fast as it is written (Figure labels: vendor library or SPIRAL generated; reasonable implementation; 10x)
The Solution • Automatic generation and optimization of numerical software • Tuning of implementation and algorithm • A new breed of intelligent SW design tools • SPIRAL: a prototype for the domain of DSP algorithms ► SPIRAL: Code Generation for DSP Algorithms (www.spiral.net) • DSP transform (user specified) • Formula generator: produces a fast algorithm as an SPL formula • Formula translator: produces a C/Fortran program • Search engine: controls the formula generator and the formula translator, guided by the runtime on the given platform • Result: platform-adapted code
Bioimaging • Acquisition • How can we efficiently acquire fluorescence microscopy images? ► • Knowledge extraction • How can we automatically and efficiently classify proteins based on images of their subcellular locations? ► • Computation • Automatic code generation and optimization ►
Motivation • Current focus in biological sciences • System-wide research “omics” • Human genome project • Next frontier • Proteomics • Subcellular location one of major components • Grand challenge • Develop an intelligent next-generation bioimaging system capable of fast, robust and accurate classification of proteins based on images of their subcellular locations
MR Acquisition of Fluorescence Microscopy Images • Problem • Why acquire in areas of low fluorescence? • Acquire only when and where needed • Measure of success • Problem dependent • Here: strive to maintain the achieved classification accuracy • Efficient acquisition leads to • Faster acquisition • Possibility of increasing acquisition resolution • Possible increase in classification accuracy due to increased resolution (Image label: ER)
MR Acquisition of Fluorescence Microscopy Images • Approach • Develop algorithm on an acquired data set at maximum resolution • Implement a microscope’s scanning protocol • Algorithm: mimic the “Battleship” strategy • Acquire around the hits (Image labels: 2D, 3D)
Algorithm: Details • Initialize probe locations • Probe • Intensity > T? • Yes: add probe locations • No: continue • Probe locations left? • Yes: probe the next location • No: stop (Diagram labels: 2l, N, M)
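The flowchart above can be sketched in a few lines. This is a minimal illustration run on an already acquired image, as the approach slide suggests, not the authors' implementation. The function name `adaptive_acquire`, the coarse grid spacing `step`, and the choice of the 8 surrounding pixels as the "add probe locations" rule are assumptions made for the example; the slides do not specify the probe geometry.

```python
from collections import deque

def adaptive_acquire(image, step, threshold):
    """Probe a coarse grid; wherever a probe exceeds the threshold, probe around the hit.

    image is a list of rows of intensities (a stand-in for the microscope).
    Returns a dict mapping (row, col) to the acquired intensity.
    """
    rows, cols = len(image), len(image[0])
    # Initialize probe locations on a coarse grid (assumed layout).
    queue = deque((r, c) for r in range(0, rows, step) for c in range(0, cols, step))
    scheduled = set(queue)
    acquired = {}
    while queue:                              # "Probe locations left?"
        r, c = queue.popleft()
        value = image[r][c]                   # "Probe"
        acquired[(r, c)] = value
        if value > threshold:                 # "Intensity > T?"
            # "Add probe locations": the 8 neighbors of the hit (assumed rule).
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and (nr, nc) not in scheduled:
                        scheduled.add((nr, nc))
                        queue.append((nr, nc))
    return acquired
```

The ratio of total pixels to `len(acquired)` gives the compression ratio for one image, which is the quantity the results slides plot against distortion and accuracy.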
Trade-Offs • What will we lose? • Scanning simplicity • What will we gain? • Faster acquisition process • Time is proportional to the savings in samples • Need to take into account the time to operate the scanning unit • Higher resolution in 3D • The laser intensity can be reduced • Reduces photobleaching • Some sources indicate a linear relationship, others a different one
Results in 3D • Mitochondrial data: compression versus distortion • MR algorithm (9.81:1) • Trivial approach (9:1) (Image panels: MR sampling algorithm, trivial approach, approximation, difference image. Plot: MSE versus percent of samples kept / 100)
Results in 2D (Plot: accuracy [%] versus compression ratio)
Current and Future Work • Implementation issues • Can one operate galvo-mirrors fast enough to capitalize on the gain? • Algorithmic issues • Add knowledge from classification (feedback) • Build models http://www.olympusconfocal.com/theory/confocalintro.html
Funding and References • Funding • NSF-0331657, “Next-Generation Bio-Molecular Imaging and Information Discovery,” NSF, $2,500,000, 10/03-9/08. Co-PI. • Journal papers • T.E. Merryman and J. Kovačević, “An adaptive multirate algorithm for acquisition of fluorescence microscopy data sets," IEEE Trans. Image Proc., special issue on Molecular and Cellular Bioimaging, September 2005. • Conference papers • T.E. Merryman, J. Kovačević, E.G. Osuna and R.F. Murphy, "Adaptive multirate data acquisition of 3D cell images," Proc. IEEE Int. Conf. Acoust., Speech, and Signal Proc., Philadelphia, PA, March 2005.
MR Classification of Proteins • Why MR? • Introduction of simple MR features produced a statistically significant jump in accuracy • Introduce adaptivity with little computational cost (Diagram labels: Knowledge Extraction, Segmentation, Classification. Image caption: This is tubulin)
3D HeLa ► 2D HeLa ► 3T3 ► Huang & Murphy, Journal of Biomedical Optics 9(5), 893–912, 2004 Data Sets
3D HeLa Data Set • Cells from Henrietta Lacks (d. 1951, cervical cancer) • Confocal scanning laser microscope (100x) • DNA stain (PI), all-protein stain (Cy5 reactive dye) and fluorescent antibody for a specific protein • 50-58 sets per class • 14-24 2D slices per set • Resolution 0.049 x 0.049 x 0.2 μm • Covers all major subcellular structures ► Huang & Murphy, Journal of Biomedical Optics 9(5), 893–912, 2004
3D HeLa Data Set • Covers all major subcellular structures ► • Golgi apparatus (giantin, gpp130) • Cytoskeleton (actin, tubulin) • Endoplasmic reticulum membrane (ER) • Lysosomes (LAMP2) • Endosomes (transferrin receptor) • Nucleus (nucleolin) • Mitochondria outer membrane http://www.biologymad.com/
2D HeLa Data Set • Cells from Henrietta Lacks (d. 1951, cervical cancer) • Widefield with nearest-neighbor deconvolution (100x) • DNA stain and fluorescent antibody for a specific protein • 78-98 sets per class • Resolution 0.23 x 0.23 μm (Image labels: DNA, Mitochondria, Giantin, Actin, Tubulin, Gpp130, ER, LAMP2, Nucleolin, Tfr) Boland & Murphy, Bioinformatics 17(12), 1213-1223, 2001
Classification: Previous System • Pipeline: input image, preprocessing, feature extraction, classification, class • Preprocessing • Manual shifting • Manual rotation • Feature computation • Subcellular Location Features (SLF) • Drawn from many different feature categories • Texture, morphological, Gabor and wavelet • Gabor and wavelet features improved accuracy significantly (from 88% to 92%) • Classification • Combination of classifiers
MR Classification of Proteins • What do we need? • Want to keep MR (based on results with Gabor and wavelet features) • Avoid manual processing • Rotation invariance • Shift invariance • Adaptivity • Points to • Frames ► • MD frames • Wavelet/frame packets ►
Does Adaptivity Help? • Would like to use wavelet packets ► • Do not have an obvious cost measure • Line of work • Find out if adaptivity helps • If it does, find a cost function to use with wavelet packets • Frame packets • Challenge: same class, different story (Image label: Tubulin)
Training Phase • Number of classes C • Number of training images/class N (Diagram blocks: clustering images, full wavelet tree, feature extraction, K-means clustering, Gaussian modeling, Gaussian models, training image, weight computation, voting, weights)
Full Wavelet Tree Decomposition • Grow a full tree ► • Depth L levels • Total number of subbands S (Diagram path: clustering images, full wavelet tree)
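As an illustration of growing a full tree, here is a small pure-Python sketch using the Haar filters named on the results slide. The function names `haar_split` and `full_tree` are hypothetical, and counting the root and every intermediate node as a subband (so that S = 1 + 4 + ... + 4^L) is an assumption; the slide does not say which nodes are counted.

```python
def haar_split(block):
    """One level of an orthonormal 2D Haar split into four half-size subbands.

    Assumes the block has an even number of rows and columns.
    """
    n, m = len(block) // 2, len(block[0]) // 2
    bands = [[[0.0] * m for _ in range(n)] for _ in range(4)]
    for i in range(n):
        for j in range(m):
            a, b = block[2 * i][2 * j], block[2 * i][2 * j + 1]
            c, d = block[2 * i + 1][2 * j], block[2 * i + 1][2 * j + 1]
            bands[0][i][j] = (a + b + c + d) / 2   # average (lowpass in both directions)
            bands[1][i][j] = (a - b + c - d) / 2   # difference across columns
            bands[2][i][j] = (a + b - c - d) / 2   # difference across rows
            bands[3][i][j] = (a - b - c + d) / 2   # diagonal difference
    return bands

def full_tree(image, depth):
    """Split every subband at every level; return all nodes, root first.

    The image side lengths must be divisible by 2**depth.
    """
    nodes = [image]
    level = [image]
    for _ in range(depth):
        level = [child for band in level for child in haar_split(band)]
        nodes.extend(level)
    return nodes
```

Under the counting assumption above, `len(full_tree(image, 2))` is 21 and `len(full_tree(image, 3))` is 85.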
Feature Extraction • Use Haralick texture features ► • One feature vector per subband s • Indexed by class c, training image n, subband s (Diagram path: clustering images, full wavelet tree, feature extraction)
K-Means Clustering • Clustering in a fixed subband • Max K clusters/class (Equation labels: feature vector for image I from class c and subband s; cluster mean; clustering images of class c. Diagram path: clustering images, full wavelet tree, feature extraction, K-means clustering)
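The clustering step for one class in one fixed subband can be sketched with plain Lloyd iterations. This is a generic K-means sketch, not the authors' code: the function name `kmeans`, the seeded random initialization, and the fixed iteration count are assumptions. How the actual number of clusters (at most K) is chosen is not stated on the slide, so `k` is simply passed in.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Cluster feature vectors (tuples of floats) into k groups; return the cluster means."""
    rng = random.Random(seed)
    # Initialize the means with k distinct feature vectors (assumed initialization).
    means = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each vector to the nearest mean (squared Euclidean distance).
            distances = [sum((a - b) ** 2 for a, b in zip(p, m)) for m in means]
            clusters[distances.index(min(distances))].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous mean
                means[j] = [sum(col) / len(members) for col in zip(*members)]
    return means
```

Because the initialization is random, different seeds can yield different clusterings on real data, which is the source of randomness the conclusions slide refers to.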
Gaussian Modeling • Model each cluster with a Gaussian pdf • Probability the training image belongs to class i • Output: single probability vector (Diagram path: clustering images, full wavelet tree, feature extraction, K-means clustering, Gaussian modeling, training image)
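One way to turn per-cluster Gaussians into a single probability vector is sketched below. Several choices here are assumptions not found in the slides: a diagonal covariance per cluster, scoring a class by its best-matching cluster, and normalizing the class scores to sum to 1. The names `gaussian_pdf` and `class_probabilities` are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian at feature vector x."""
    density = 1.0
    for xi, mi, vi in zip(x, mean, var):
        density *= math.exp(-(xi - mi) ** 2 / (2.0 * vi)) / math.sqrt(2.0 * math.pi * vi)
    return density

def class_probabilities(x, models):
    """Map one feature vector to one probability per class, for a single subband.

    models[c] is the list of (mean, var) pairs, one per cluster of class c.
    """
    # Assumption: each class is scored by its best-matching cluster.
    scores = [max(gaussian_pdf(x, m, v) for m, v in clusters) for clusters in models]
    total = sum(scores)
    if total == 0.0:
        # All densities underflowed: fall back to a uniform vector.
        return [1.0 / len(models)] * len(models)
    return [s / total for s in scores]
```

Applying `class_probabilities` in every subband moves a training image from feature space into probability space, one vector of length C per subband.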
From Feature Space to Probability Space (Diagram labels: Image 1 to Image N from Class 1 through Class C; Subband 1 to Subband S; Class 1 to Class C)
Weight Computation: Initialization • Decision for vector $t_{c,n,s}$ (Diagram labels: Image 1 to Image N from Class 1 through Class C; Subband 1 to Subband S; Class 1 to Class C. Diagram path: clustering images, full wavelet tree, feature extraction, K-means clustering, Gaussian modeling, training image, weight computation)
Weight Computation: Initialization • Initial weight for subband s: probability of correct decision (Diagram: each per-image, per-subband decision is marked correct or incorrect; labels: Image 1 to Image N from Class 1 through Class C; Subband 1 to Subband S; Class 1 to Class C)
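The initialization above amounts to measuring, for each subband, how often its own decision is right on the training images. A minimal sketch follows, assuming the decision for a probability vector is its largest entry; the function name `initial_weights` and the data layout are hypothetical.

```python
def initial_weights(prob_vectors, labels):
    """Return one initial weight per subband: the fraction of correct decisions.

    prob_vectors[n][s] is the class-probability vector of training image n in subband s.
    labels[n] is the true class index of training image n.
    """
    num_subbands = len(prob_vectors[0])
    weights = []
    for s in range(num_subbands):
        correct = 0
        for n, label in enumerate(labels):
            p = prob_vectors[n][s]
            decision = p.index(max(p))   # assumed decision rule: most probable class
            if decision == label:
                correct += 1
        weights.append(correct / len(labels))
    return weights
```

For example, a subband that decides correctly on every training image starts with weight 1.0, and one that is right half the time starts with 0.5.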
Weight Computation • Compute probability vector for each image (Diagram labels: Image 1 to Image N from Class 1 through Class C; Subband 1 to Subband S; Class 1 to Class C)
Weight Adjustment • Make a decision • Decision correct • Do nothing, take next image • Decision incorrect • Adjust the weights, take next image • Make runs through all the images • Does the algorithm converge? (Diagram blocks: clustering images, full wavelet tree, feature extraction, K-means clustering, Gaussian modeling, Gaussian models, training image, weight computation, voting, weights)
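The loop above can be sketched as follows. The slides do not state the update rule, so the one used here is an assumption chosen for illustration: on a wrong decision, subbands that voted for the true class gain a fixed `step` and the others lose it, with weights floored at zero. The weighted sum used for voting and the names `weighted_decision` and `adjust_weights` are likewise hypothetical.

```python
def weighted_decision(subband_probs, weights):
    """Combine per-subband probability vectors by weighted sum; return the winning class."""
    num_classes = len(subband_probs[0])
    overall = [sum(w * p[c] for w, p in zip(weights, subband_probs))
               for c in range(num_classes)]
    return overall.index(max(overall))

def adjust_weights(prob_vectors, labels, weights, epochs=10, step=0.1):
    """Run through all training images several times, adjusting weights after each error."""
    weights = list(weights)
    for _ in range(epochs):
        errors = 0
        for subband_probs, label in zip(prob_vectors, labels):
            if weighted_decision(subband_probs, weights) == label:
                continue                       # decision correct: do nothing
            errors += 1
            for s, p in enumerate(subband_probs):
                if p.index(max(p)) == label:   # this subband voted correctly
                    weights[s] += step
                else:
                    weights[s] = max(0.0, weights[s] - step)
        if errors == 0:                        # a full pass without errors: stop
            break
    return weights
```

At test time the same `weighted_decision` call, with the trained weights, produces the class label. Nothing in this sketch guarantees convergence, which matches the open question on the slide.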
Testing Phase • Compute probabilities for each subband • Compute the overall probability vector • Make the decision (Diagram blocks: testing image, full wavelet tree, feature extraction, Gaussian models, probability space, weights, voting, class label)
Results • C = 10 classes • N = 45 training images • T = 5 testing images • 10-fold cross validation • Training phase • 44 clustering images • 45-fold cross validation • L = 2,3 levels of Haar wavelet decomposition • K = 10 max number of clusters per class
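The nested validation scheme on this slide can be made concrete with index bookkeeping. The sketch assumes 50 images per class (45 training plus 5 testing, as the slide implies) and contiguous folds; the function names `ten_fold_splits` and `leave_one_out` are hypothetical.

```python
def ten_fold_splits(images_per_class=50, folds=10):
    """Outer loop: each fold holds out 5 testing images and trains on the other 45."""
    fold_size = images_per_class // folds
    splits = []
    for k in range(folds):
        test = list(range(k * fold_size, (k + 1) * fold_size))
        train = [i for i in range(images_per_class) if i not in test]
        splits.append((train, test))
    return splits

def leave_one_out(train):
    """Inner loop: each training image is held out once; the other 44 are clustering images."""
    return [([i for i in train if i != held], held) for held in train]
```

For each outer fold, `leave_one_out(train)` yields 45 pairs of 44 clustering images and one held-out training image, matching the 45-fold inner cross validation.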
Weight Adjustment: 2nd Try • Keep the previous best weight • Can do no worse than previous system
Principal Component Analysis • Using eigenspace representations for Haralick texture features • Texture classification (TC) • Decomposition better than no decomposition (with or without PCA) • There is information in the subbands • TC + PCA • Improves accuracy (with or without decomposition) • Dimensionality reduction (DR) • Increases accuracy slightly without much complexity
Effect of Translation Variance • No translation • accuracy(MR frames) > accuracy(MR) • Translation • MR drops • MR frames stable
Conclusions and Future Directions • Adaptivity definitely helps! • Accuracy stable with the increased # of epochs • Investigate the algorithm for convergence • K-means clustering introduces randomness • There is no notion of global, local minima • Reducing K reduces randomness • Weighting • Should be done for each class separately • Would lead to WP trees • Find cost function • Construct frame packets
References • Conference papers • G. Srinivasa, A. Chebira, T. Merryman and J. Kovačević, “Adaptive multiresolution texture features for protein image classification”, Proc. BMES Annual Fall Meeting, Baltimore, MD, September 2005. • K Williams, T. Merryman and J. Kovačević, “A Wavelet Subband Enhancement to Classification”, Proc. Annual Biomed. Res. Conf. for Minority Students, Atlanta, GA, November 2005. Submitted. • A. Mintos, G. Srinivasa, A. Chebira and J. Kovačević, “Combining Wavelet Features with PCA for Classification of Protein Images”, Proc. Annual Biomed. Res. Conf. for Minority Students, Atlanta, GA, November 2005. Submitted. • T. Merryman, K. Williams and J. Kovačević, “A multiresolution enhancement to generic classifiers of subcellular protein location images”, Proc. IEEE Intl. Symp. Biomed. Imaging, Arlington, VA, April 2006. In preparation. • G. Srinivasa, T. Merryman, A. Chebira, A. Mintos and J. Kovačević, “Adaptive multiresolution techniques for subcellular protein location image classification”, Proc. IEEE Int. Conf. Acoust., Speech, and Signal Proc., Toulouse, France, May 2006. Invited paper. In preparation.
Automatic Code Generation • Work in progress
Biometrics • Acquisition • NIST database • Knowledge extraction • How can we identify/verify a person’s identity based on his/her biometric characteristic? ► • Computation • Automatic code generation and optimization ►
Motivation • Security to the financial industry ► • 89,000 cases of identity theft in 2000 • Losses incurred by Visa/MasterCard $68.2 million • Security at US borders • Multimodal biometric systems • Grand challenge • Develop an intelligent next-generation biometric system capable of fast, robust and accurate identification and verification of human biometric characteristics.
Challenges • Variable conditions • Different lighting, indoors/outdoors, different poses, … • Small training sets • Uncooperative biometrics (access to only one picture of a suspected criminal) • Huge databases • Computation becomes an issue • Database sizes: up to hundreds of thousands