Dr. Mahout: Analyzing clinical data using scalable and distributed computing
Shannon Quinn, CPCB
squinn@cmu.edu | spq1@pitt.edu
November 10, 2011
1/29
Punchline
• Cloud computing for biological and clinical data analysis
• Problem: high-dimensional, noisy!
Image credits: tech2date.com; heart tissue: biomedcentral; fMRI: wikipedia; segmentation: biodynamics UCSD
2/29
Disclaimer
• Biology jargon
• Academic jargon
3/29
My Background
• 2nd year Ph.D. student in CPCB Program
• Research in bioimage informatics
4/29
My Background
• Other
http://collegefootballbelt.com/Logos/
http://s3.amazonaws.com/data.tumblr.com/
5/29
Computational biology and …the cloud?
• Biological data
  • is BIG
  • requires repetitive analysis in chunks
  • modeling involves linear algebra and statistics
6/29
Use case 1: protein behavior
[Figure: timescale of relevant motions, on an axis from 10^-15 to 10^0: bond vibration, side-chain rotation, domain shifts / max. catalysis, protein folding, global conformational shifts]
• Sampling vs. detail: a common tradeoff…
7/29
Molecular dynamics
8/29
“The curse of [MD] dimensionality”
• MD :=
  • for every atom
  • for every t
  • …
http://icanhascheezburger.files.wordpress.com/
http://www.pdb.org/pdb/explore/explore.do?structureId=3fxi
9/29
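To make the "for every atom, for every t" point concrete, here is a rough sketch of how the size of a trajectory grows. It assumes three Cartesian coordinates per atom per frame; the atom and frame counts are hypothetical and not taken from the talk.

```python
def trajectory_size(n_atoms, n_frames, dims_per_atom=3):
    """Count scalar values in a trajectory: one per atom, per axis, per frame."""
    return n_atoms * dims_per_atom * n_frames


# Hypothetical sizes, chosen only for illustration.
values = trajectory_size(n_atoms=10_000, n_frames=1_000_000)
print(values)  # 30000000000
print(values * 8 / 1e9, "GB if stored as 64-bit floats")
```

Even these modest, made-up numbers give tens of billions of values, which is the motivation for chunked, distributed analysis.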
Pipeline for MD trajectory analysis
• Find a “surface” of protein shapes
  • MD output
  • Define surface (graph!)
  • Partition surface
http://www.dillgroup.ucsf.edu/
10/29
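The three sub-steps of this pipeline (MD output, define a graph, partition it) can be sketched in a few lines. This is a minimal, single-machine illustration, not the talk's implementation: it assumes a Gaussian affinity between frames and a two-way spectral split found by power iteration. The function names, the `sigma` parameter, and the toy 2-D "frames" (standing in for flattened atom coordinates) are all hypothetical.

```python
import math


def affinity_matrix(frames, sigma=1.0):
    """Gaussian affinity between every pair of frames (the 'graph')."""
    n = len(frames)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(frames[i], frames[j]))
            W[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return W


def spectral_bipartition(W, iterations=200):
    """Split the graph in two using the second eigenvector of D^-1/2 W D^-1/2."""
    n = len(W)
    deg = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # The leading eigenvector of S is proportional to sqrt(degree).
    norm = math.sqrt(sum(deg))
    u1 = [math.sqrt(d) / norm for d in deg]
    v = [float(i + 1) for i in range(n)]  # arbitrary deterministic start vector
    for _ in range(iterations):
        v = [sum(S[i][j] * v[j] for j in range(n)) for i in range(n)]
        dot = sum(a * b for a, b in zip(v, u1))
        v = [a - dot * b for a, b in zip(v, u1)]  # remove the leading eigenvector
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    # The sign of each entry decides which side of the cut a frame falls on.
    return [0 if a < 0 else 1 for a in v]


# Toy stand-in for MD output: two tight groups of "shapes".
frames = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0), (3.1, 3.0), (3.0, 3.1)]
labels = spectral_bipartition(affinity_matrix(frames))
print(labels)
```

Which group receives label 0 depends on the sign of the eigenvector, so only the grouping is meaningful. At real trajectory sizes the affinity matrix and eigenvector computation are exactly the parts that need distributed linear algebra.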
Mahout implementation
Defining surface/graph:
• MatrixMultiplicationJob (matrixmult)
• TransposeJob (transpose)
• DistributedLanczosSolver (svd)
• StochasticSVD (ssvd)
Partitioning surface/graph:
• SpectralKMeans (spectralkmeans)
• Eigencuts (eigencuts)
• Kmeans (kmeans)
• . . .
11/29
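For readers unfamiliar with the partitioning jobs listed here, the following is a plain single-machine sketch of what k-means computes (Lloyd's algorithm). It is not Mahout's code and does not use Mahout's API; the function names, the sample points, and the starting centroids are hypothetical. Spectral k-means applies the same procedure to points embedded via eigenvectors of the graph.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign points to the nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids, [nearest(p, centroids) for p in points]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```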
MD in Mahout conclusion
• MD simulations (x@Home projects)
• Existing Mahout functionality
• Additional algorithms
http://folding.stanford.edu/
12/29
Use case 2: diseases affecting cilia
• What are cilia?
  • Hairlike structures
  • Keep things moving
• Diseased cilia =
http://fc06.deviantart.net/fs71/f/2010/177/d/5/Sad_Panda_by_jinxii24.jpg
13/29
Importance of correct diagnoses
• Symptoms look familiar
• Consequences do not
14/29
Beat pattern of cilia tells a lot!
• What is the motion called?
• Can we create a database of motions?
• Clinicians look at cilia motion in making their diagnoses
15/29
Clinicians’ ultimate goal
[Figure: three unknown samples (?) and three labels: Category 1, Category 2, Category 3]
16/29
Cilia as dynamic textures
• Properties
• Computer vision
Saisan et al. 2001
17/29
The [proposed] pipeline
• Step 1
  • Clinician captures video and uploads it
http://googolplex.dyndns.org/cilia/
18/29
The [proposed] pipeline
• Step 2
  • Mahout job: autoregressive modeling
[Figure labels: Appearance Model, Dynamic Model]
http://web.media.mit.edu/~tristan/phd/dissertation/figures/manifold2.jpg
19/29
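The dynamic-model half of this step can be illustrated with a small least-squares fit. The sketch below assumes a first-order autoregressive model, x[t+1] ≈ A x[t], on a low-dimensional state sequence that the appearance model has already produced; that projection step is not shown. The function name, the 2-D state size, and the matrix `A_true` are hypothetical choices for illustration.

```python
def fit_transition_matrix(states):
    """Least-squares A for x[t+1] ≈ A x[t], restricted to 2-D states."""
    c = [[0.0, 0.0], [0.0, 0.0]]  # running sum of x[t+1] x[t]^T
    g = [[0.0, 0.0], [0.0, 0.0]]  # running sum of x[t] x[t]^T
    for x0, x1 in zip(states, states[1:]):
        for i in range(2):
            for j in range(2):
                c[i][j] += x1[i] * x0[j]
                g[i][j] += x0[i] * x0[j]
    # Invert the 2x2 matrix g in closed form.
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    g_inv = [[g[1][1] / det, -g[0][1] / det], [-g[1][0] / det, g[0][0] / det]]
    # A = c * g^-1
    return [[sum(c[i][k] * g_inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


# Generate a noise-free state sequence from a known matrix, then recover it.
A_true = [[0.9, -0.2], [0.2, 0.9]]
states = [[1.0, 0.0]]
for _ in range(20):
    x = states[-1]
    states.append([A_true[0][0] * x[0] + A_true[0][1] * x[1],
                   A_true[1][0] * x[0] + A_true[1][1] * x[1]])
A_hat = fit_transition_matrix(states)
print(A_hat)
```

The fit is only matrix products, a transpose, and an inverse, which is why it maps naturally onto distributed linear-algebra jobs once the state dimension and number of frames grow.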
The [proposed] pipeline
• Step 3
  • Add the transition matrices to cloud library
A =
20/29
The [proposed] pipeline
• Step 4
  • Recompute network with added videos
[Figure: plot with axes Axis 1 and Axis 2, with a point marked “?”]
21/29
One more thing…
• What’s really cool about AR models:
  • Can you spot the fake?
[Figure labels: Synthetic, Original]
22/29
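A synthetic sequence comes from running a fitted AR model forward. The sketch below assumes the first-order form x[t+1] = A x[t] + noise and generates only the state sequence; mapping states back to video frames through the appearance model is omitted. The function name, its parameters, and the example matrix are hypothetical.

```python
import random


def synthesize(A, x0, n_steps, noise_std=0.0, seed=0):
    """Roll a first-order AR model forward from x0, adding Gaussian noise per step."""
    rng = random.Random(seed)
    states = [list(x0)]
    for _ in range(n_steps):
        prev = states[-1]
        nxt = [
            sum(A[i][j] * prev[j] for j in range(len(prev))) + rng.gauss(0.0, noise_std)
            for i in range(len(A))
        ]
        states.append(nxt)
    return states


# A 90-degree rotation: without noise the state returns to its start every 4 steps.
A = [[0.0, -1.0], [1.0, 0.0]]
clean = synthesize(A, [1.0, 0.0], n_steps=4)
noisy = synthesize(A, [1.0, 0.0], n_steps=4, noise_std=0.05)
print(clean[-1])
print(noisy[-1])
```

The noise term is what keeps a synthesized texture from looking like an exact loop of the original.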
Mahout implementation
Learning autoregressive models:
• MatrixMultiplicationJob (matrixmult)
• TransposeJob (transpose)
• DistributedLanczosSolver (svd)
• StochasticSVD (ssvd)
Comparing autoregressive parameters:
• SpectralKMeans (spectralkmeans)
• Eigencuts (eigencuts)
• Frobenius norm
• Tensors
• ? ? ?
23/29
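Of the comparison options listed, the Frobenius norm is the simplest to show. The sketch below computes the Frobenius norm of the difference between two transition matrices as one possible distance between videos; the function name and the example matrices are hypothetical, and the talk does not say this is the measure finally chosen.

```python
import math


def frobenius_distance(A, B):
    """Frobenius norm of A - B: square root of the sum of squared entry differences."""
    return math.sqrt(
        sum((a - b) ** 2 for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))
    )


identity = [[1.0, 0.0], [0.0, 1.0]]
rotation = [[0.0, -1.0], [1.0, 0.0]]
print(frobenius_distance(identity, identity))  # 0.0
print(frobenius_distance(identity, rotation))  # 2.0
```

Pairwise distances like these could feed the affinity matrix that the spectral clustering jobs operate on.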
Cilia on Mahout conclusions
• Autoregressive modeling uses linear algebra that is already implemented
• Maintaining AR library requires new functionality
• Mahout framework gives us elbow room
24/29
Final Thoughts
• Biological / biomedical data is large, high-dimensional, and noisy
• We extend Mahout’s current linear algebra framework (spectral clustering, autoregressive models)
• We provide a cloud framework!
25/29
Research Group
• University of Pittsburgh
  • Dr. Chakra Chennubhotla Lab (advisor)
• CMU@Qatar
  • Dr. Majd Sakr Lab (collaborator)
• University of Pittsburgh Medical Center
  • Dr. Cecilia Lo Lab (collaborator)
26/29
Sources
• Resources
  • Apache Mahout
  • Spectrally Clustered
• Links
  • Categorizing ciliary motion defects (BSEC 2011)
  • Eigencuts spectral clustering algorithm
  • Technical report (coming soon!)
27/29
Contact
• Shannon Quinn
• squinn@cmu.edu | spq1@pitt.edu
• http://www.magsolweb.net/
28/29
Thank you!
http://icanhascheezburger.files.wordpress.com/
29/29