Mixtures, clustering, spatial [& dynamic] point processes and big data sets
Mike West, Department of Statistical Science, Duke University
Immune response studies
[Figure: flow cytometry scatter plot; axes <G710-A>: CD4 CY55PE and <V705-A>: CD8 Q705; plot annotations 57.8, 0.79, 36.3]
Cellular phenotypes in vaccine adjuvant studies:
i. cell subtyping
ii. spatio-temporal response
Lymphocyte differentiation: multiple cell types, ~15 cell surface marker proteins (+)
i. Cell subtyping: Flow cytometry data
[Figure: flow cytometer schematic (fluidics, laser, optics, electronics) with the scatter plot of <G710-A>: CD4 CY55PE against <V705-A>: CD8 Q705]
- Modest p, large n
- Multiple experiments …
- really big data
- characterise data distributions
- comparisons
Mixtures for flow cytometry data
[Figure: gated scatter plots; axes FSC-H, FSC-W, <Violet H-A>: vAmine CD14PB CD19 PB, <Violet G-A>: CD3 Amcyan; plot annotations 88.3, 41.4; gated population: live T-cells]
- Mixture models (TDP version): Chan et al 2008,9
- MCMC, Bayesian EM
Modal clustering for non-Gaussian mixtures
- Non-Gaussian clusters/cell subtypes
- Flexible mixture model
- Subtypes: groups of components
- Modal grouping
- Mode trace: fast iterative identification of modes
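One way to read "mode trace" and "modal grouping" is as a fixed-point iteration on the mixture density: start at each component mean, climb to the nearest mode, and group components that reach the same mode. The sketch below does this for a one-dimensional Gaussian mixture. The function names, tolerances and the 1D restriction are illustrative assumptions, not details taken from the slides.

```python
import math


def normal_pdf(x, mu, sd):
    """Density of N(mu, sd^2) at x."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def trace_mode(x, weights, means, sds, tol=1e-10, max_iter=1000):
    """Climb from x to a stationary point of the mixture density.

    Setting the gradient of sum_j w_j N(x | m_j, s_j^2) to zero gives the
    fixed-point update x <- sum_j r_j m_j / sum_j r_j,
    with r_j = w_j N(x | m_j, s_j^2) / s_j^2.
    """
    for _ in range(max_iter):
        r = [w * normal_pdf(x, m, s) / (s * s)
             for w, m, s in zip(weights, means, sds)]
        x_new = sum(ri * m for ri, m in zip(r, means)) / sum(r)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x


def modal_groups(weights, means, sds, merge_tol=1e-4):
    """Group components whose means trace to the same mode (hypothetical helper)."""
    modes, labels = [], []
    for m in means:
        mode = trace_mode(m, weights, means, sds)
        for k, known in enumerate(modes):
            if abs(mode - known) < merge_tol:
                labels.append(k)
                break
        else:
            modes.append(mode)
            labels.append(len(modes) - 1)
    return modes, labels


# Two overlapping components form one subtype; the third is its own subtype.
modes, labels = modal_groups([0.3, 0.3, 0.4], [0.0, 0.5, 6.0], [1.0, 1.0, 1.0])
```

Here the first two components share a single mode, so they would be reported as one non-Gaussian cluster made of two normal components.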
Mixtures of mixtures
- Cluster mixture models (TDP version): Cao & West 1993; Merl et al 2009
- Cluster “anchors”
CFSE data: 3 of 7 dimensions: MCMC snapshot
[Figure: data by cluster (dead cells, helper Ts, cytotoxic Ts, other Ts), with components and cluster locations]
Specification & computation
Prior control:
- Anchor cluster locations
- Tie component means “close” to anchors
MCMC iterates:
a. Reallocate data to components: one “big mixture of normals”
b. Sufficient statistics: resample normal parameters
c. Probabilities:
  - Counts of data in clusters
  - Counts in components within clusters
BIG data, many components:
- Exploit parallelisation in modules a, b, c
- Shared-memory multi-threading on multi-core, multi-CPU machines
- Computer cluster: MPI interface
Stickiness:
- New MCMC: split/merge? Component swapping between clusters
- MAP/Bayesian EM
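The three MCMC modules (reallocate data, resample normal parameters from sufficient statistics, resample probabilities from counts) can be sketched as one Gibbs sweep. This is a deliberately simplified stand-in: a finite one-dimensional mixture with known component variance, a conjugate normal prior on the means and a symmetric Dirichlet prior on the weights. It is not the two-level TDP cluster model of the slides, and every name and default value is an assumption made for illustration.

```python
import math
import random


def gibbs_sweep(data, means, weights, rng, sd=1.0,
                prior_mean=0.0, prior_sd=10.0, alpha=1.0):
    """One Gibbs sweep for a finite 1D normal mixture with known sd."""
    K = len(means)

    # a. Reallocate each datum to a component, keeping only the
    #    sufficient statistics (counts and sums) per component.
    counts = [0] * K
    sums = [0.0] * K
    for x in data:
        probs = [w * math.exp(-0.5 * ((x - m) / sd) ** 2)
                 for w, m in zip(weights, means)]
        z = rng.choices(range(K), weights=probs)[0]
        counts[z] += 1
        sums[z] += x

    # b. Resample component means from their conjugate normal posteriors.
    new_means = []
    for k in range(K):
        precision = 1.0 / prior_sd ** 2 + counts[k] / sd ** 2
        post_mean = (prior_mean / prior_sd ** 2 + sums[k] / sd ** 2) / precision
        new_means.append(rng.gauss(post_mean, 1.0 / math.sqrt(precision)))

    # c. Resample weights from Dirichlet(alpha + counts) via normalised gammas.
    gammas = [rng.gammavariate(alpha + counts[k], 1.0) for k in range(K)]
    total = sum(gammas)
    new_weights = [g / total for g in gammas]
    return new_means, new_weights


rng = random.Random(0)
data = ([rng.gauss(-3.0, 1.0) for _ in range(200)]
        + [rng.gauss(3.0, 1.0) for _ in range(200)])
means, weights = [-1.0, 1.0], [0.5, 0.5]
for _ in range(50):
    means, weights = gibbs_sweep(data, means, weights, rng)
```

The loop over data in step a touches each observation independently given the current parameters, which is what makes that module a natural target for multi-threading or distribution over a cluster.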
Inferences: Comparisons
- Mouse cell line: HIV adjuvants
- Common interest: rare cell subsets (e.g. antigen-specific cells << 1%)
- Changes in relative abundance
- Changes in marker levels
Variable selection: Discriminative information
- Measure fewer variables?
- Subtype-characterising variables?
- Redundant variables?
- Discrimination-confusing variables?
[Figure: scatter plot of Marker 1 against Marker 2, showing discriminators]
Discriminatory information:
- high is good
- finds useful & useless variables
- ranks subsets
- involves “concordances”
CFSE discriminative information analysis
- Lose irrelevant markers: no loss in false pos/neg rates
- Simpler, efficient marker subset analysis
[Figure: change in information by subtype when one marker is dropped]
Technology adoption: Many routine analyses
- HIV/AIDS
- Cancer vaccines
- Tropical diseases
- Computation
- Implementation
ii. Spatial responses: Fluorescent histology/microscopy
Example: mouse lymph nodes: compare immune response to various treatments
- 3 or 4 fluorescent tags stain cell types: e.g. B220, IgM, GL-7
- Different time points
Many exploratory questions:
- Regional concentrations of types?
- Overall levels of types?
- Interactions?
- Germinal centres: relative concentrations of GL7/B220
- Etc
[Figure: PA+Alum, day 1]
Immunofluorescent histology: BIG data
- 4 cell types/4 colour channels: B220, IgM, CD4, GL7
- Several treatments, several days
- Pixels: grid to small pixel regions
- Cells: model 2D (3D) spatial intensity, hugely inhomogeneous
- Noisy fluorescence
Flexible model to characterise ...
- … intensity surfaces,
- … uncertain overall levels,
- … noise & signal fluorescence,
- … compare cell types
[Figure: PA alone; panels B220, IgM, CD4, GL7]
Inhomogeneous Poisson process model
- Intensity function
- Point process (latent)
- Measured fluorescence levels
[Figure: B cells: GFP/B220, day 1]
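To make the inhomogeneous Poisson process concrete, the following sketch simulates points on a rectangle by thinning: draw a homogeneous process at a rate that bounds the intensity, then keep each candidate with probability intensity / bound. The single Gaussian "hot spot" intensity, the peak value of 2000 and all function names are assumptions chosen for illustration; the slides do not specify the intensity form at this point.

```python
import math
import random


def simulate_inhomogeneous_poisson(intensity, lam_max, rng,
                                   width=1.0, height=1.0):
    """Simulate an inhomogeneous Poisson process on [0,width] x [0,height].

    intensity(x, y) must satisfy 0 <= intensity(x, y) <= lam_max everywhere.
    """
    # Number of candidates ~ Poisson(lam_max * area), obtained by counting
    # unit-rate exponential arrivals before time lam_max * area.
    total = lam_max * width * height
    n_candidates = 0
    t = rng.expovariate(1.0)
    while t < total:
        n_candidates += 1
        t += rng.expovariate(1.0)

    # Thinning: keep a uniform candidate with probability intensity / lam_max.
    points = []
    for _ in range(n_candidates):
        x = rng.uniform(0.0, width)
        y = rng.uniform(0.0, height)
        if rng.random() < intensity(x, y) / lam_max:
            points.append((x, y))
    return points


def hot_spot(x, y, peak=2000.0, centre=(0.5, 0.5), scale=0.1):
    """Hypothetical intensity: one Gaussian-shaped region of high cell density."""
    d2 = (x - centre[0]) ** 2 + (y - centre[1]) ** 2
    return peak * math.exp(-0.5 * d2 / scale ** 2)


rng = random.Random(0)
cells = simulate_inhomogeneous_poisson(hot_spot, 2000.0, rng)
```

The expected number of points is the integral of the intensity over the region, here roughly 2000 · 2π · 0.1² ≈ 126, almost all of them concentrated near the centre of the square.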
Spatial mixture & measurement model
- Truncated Dirichlet process mixture [Kottas & Sanso 07; Ji et al 09]
- Extend “usual” priors: random effects, Pareto tails
- Data: noise/background vs. signal
Fluorescence intensity signal & noise model
- Fluorescence intensity data
- Mixture model: noise vs. signal
[Figure: fluorescence intensity data with noise and signal components]
Components of posterior
- Grid: (small) pixel regions: area
- MCMC: conditionals
- Gaussian mixture: signal-only observations
- Large K, large N
- Block Gibbs sampler for TDP mixtures
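In a blocked Gibbs sampler for a truncated Dirichlet process (TDP) mixture, the mixture weights are typically updated through their stick-breaking representation, conditional on the current component counts. The sketch below shows only that one step, following the standard blocked Gibbs update (Ishwaran & James style); the function name and the concentration parameter value are illustrative assumptions, and the remaining conditionals of the sampler are not shown.

```python
import random


def sample_tdp_weights(counts, alpha, rng):
    """Draw truncated stick-breaking weights given component counts.

    For k < K: V_k ~ Beta(1 + n_k, alpha + sum_{j>k} n_j), and V_K = 1,
    with w_k = V_k * prod_{j<k} (1 - V_j).
    """
    K = len(counts)
    weights = []
    remaining = 1.0                 # stick length not yet broken off
    tail = sum(counts)              # becomes sum_{j>k} n_j inside the loop
    for k in range(K):
        tail -= counts[k]
        if k < K - 1:
            v = rng.betavariate(1.0 + counts[k], alpha + tail)
        else:
            v = 1.0                 # truncation: last component takes the rest
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


rng = random.Random(0)
w = sample_tdp_weights([1000, 0, 0, 0, 0], alpha=1.0, rng=rng)
```

Because the update needs only the K counts, its cost does not grow with N; the expensive parts for large K and large N are the configuration indicators and the component parameter updates.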
MCMC progression & inferences
- Signal/noise events? Pr(Signal/noise events)?
- Intensity function … estimate …
[Figure: intensity function, B220/day 1 and B220/day 11]
Posterior summaries and explorations
- Quantified germinal centres
[Figure panels: (a) B220, (b) IgM, (c) CD4, (*) B220/(B220+IgM)]
Computation: Multi-core, multi-thread; cluster
- Large K, large N mixture model
- Heavy computation: configuration indicators, Gaussian component parameter updates
- Parallelizable steps within MCMC
- Parallel sub-images: conditional mixture in sub-image
  a) allocate pixel to sub-image …
  b) … then to component in sub-mixture
  … use a) only for pixels “near boundaries”: reduces computation
Dynamic spatial process
- Confocal microscopy: imaging fluorescence in situ
- Model: quantify directional(?) drifts in intensity
- Above model at each time: intensity dynamic
  - Dynamic models for Gaussian parameters
  - Generalized Polya urn scheme for random partitions/pixel-component configurations [Matt Taddy’s talk; Caron, Davy, Doucet, 07, UAI; C. Ji et al, 09 forthcoming]
- Sequential MC: particle filtering
Team & Links
Team:
- Lynn Lin, PhD student
- Quanli Wang, comp. guru
- Dan Merl, postdoc > Livermore
- Chunlin Ji, PhD student
- Cliburn Chan, Immunology & Comp Bio
- Ioanna Manolopoulou, postdoc
- Tom Kepler, Immunology & Comp Bio
Links:
- Chan et al, Cytometry A, 2008
- Ji et al, BA 2009
- New & software: www.stat.duke.edu/~mw