Entropic graphs for high dimensional data analysis
Alfred Hero
Outline • Motivating examples • Statistical modeling approaches • Entropic graphs and applications • Conclusions • Selected references
I. Fairly low dimensional data: Flow cytometry • One 2D projection of 6 color flow cytometry data, N = 30,000 (UM Hemopathology Lab, Dr. W. Finn) • Figure labels: Lymphocytes, Leukocytes, Blasts, Granulocytes
I. Higher dimensional data: Wireless sensor networks • 13x14=182 dimensional RSS data sample collected from 14 node UM wireless sensor network, N=3500 • Figure panels: sample trajectories over time; 64 2D projections of 182 dimensional data
I. Even higher dimensional data: High throughput genomic time series • 280 peripheral blood samples of 20 individuals at 14 timepoints • mRNA, metabolite, protein, and antibody assays at each time point (24,000 probe dimensions) • [Figure: array of sample labels, C06BL, C06BLR, C06D0H0, ..., C08D4, C08D4R]
II. Statistical modeling approaches
• Structured modeling
  • Estimation involves fitting a parametric model to a data sample
    • Frequentist parametric models and Fisher’s ML principle (Fisher25)
    • Bayesian parametric models and minimum risk estimation (Jeffreys39)
  • (Likelihood and prior) models include
    • Exponential families of densities (Lehman57)
    • Graphical models (Lauritzen96)
• Unstructured modeling
  • Estimation is performed directly on the density in data space
    • Nearest neighbor density estimators (FixHodges51)
    • Partitioning density estimators (NobelLugosi96)
  • Models include
    • Multiscale density representations (WadaSato90)
    • Cluster tree density representations (Hartigan75)
II. Unstructured topological model • Density function f(x) • Cutting plane • These level sets are • Minimum volume sets of specified probability • Minimum entropy sets of specified probability • Epigraph sets
II. Toolkit for graphical modeling • Factored representation of density (factor graphs) • Mixture representation of density (hidden variable models) • Parameter and structure estimation with EM, variational Bayes, MCMC, dependency tests
II. Toolkit for topological models • Density cluster tree representations • Morse-Smale representations • Entropic graphs, prim flows • Reference: Statistics and analysis of shapes, Eds. Krim and Yezzi, Birkhauser 2006
III. Entropic Graphs • An i.i.d. sample of points from a density • The density is defined over real Euclidean space • A graph on the sample points as vertices, with edges between them
BHH Theorem Beardwood, Halton, Hammersley Theorem (BHH:1959): *BHH Thm was extended to compact Riemannian manifolds in CostaH:05.
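The formula on this slide did not survive extraction. As a hedged reminder, the result is commonly stated in the entropic-graph literature in the form below. The notation is an assumption and is not taken from the slide: $\mathcal{X}_n$ is the i.i.d. sample from density $f$ on $\mathbb{R}^d$, $L_\gamma(\mathcal{X}_n)$ is the total edge length of a minimal graph (MST, kNNG, ...) with each edge length raised to the power $\gamma$, and $\beta_{d,\gamma}$ is a constant that does not depend on $f$.

```latex
% Assumed notation: d >= 2, 0 < gamma < d, alpha = (d - gamma)/d
\lim_{n \to \infty} \frac{L_\gamma(\mathcal{X}_n)}{n^{(d-\gamma)/d}}
  = \beta_{d,\gamma} \int f^{(d-\gamma)/d}(x)\, dx
  \quad \text{(a.s.)}

% Consequence: a plug-in estimator of the Renyi entropy of order alpha
H_\alpha(f) = \frac{1}{1-\alpha} \log \int f^{\alpha}(x)\, dx ,
\qquad
\hat{H}_\alpha = \frac{1}{1-\alpha}
  \left[ \log \frac{L_\gamma(\mathcal{X}_n)}{n^{\alpha}} - \log \beta_{d,\gamma} \right]
```

The second line is why such graphs are called entropic: the normalized graph length converges to a monotone function of the Rényi entropy of $f$.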
Entropic Graph Estimators • 600 uniformly distributed random samples, d=2, H=9.6 bits • Figure: mean kNNG (k=5) length • Costa&Hero:Birkhauser05
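A minimal sketch of how an entropic graph estimator can be computed, assuming the MST as the minimal graph and the plug-in form (log of normalized graph length minus log of a constant, divided by 1-alpha). The function names, the O(n^2) Prim implementation, and the Monte Carlo calibration of the constant `beta` on the unit cube are illustrative assumptions, not the procedure from the slide or the cited chapter.

```python
import math
import random


def mst_length(points, gamma=1.0):
    """Sum of (edge length)**gamma over the Euclidean MST (O(n^2) Prim), gamma > 0."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # best[v]: distance from v to the current tree
    if n:
        best[0] = 0.0      # start the tree at vertex 0 (contributes zero length)
    total = 0.0
    for _ in range(n):
        # Attach the vertex outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u] ** gamma
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return total


def uniform_cube(n, d, side, rng):
    """n points drawn uniformly from the cube [0, side]^d."""
    return [[side * rng.random() for _ in range(d)] for _ in range(n)]


def calibrate_beta(n, d, gamma=1.0, trials=5, rng=None):
    """Hypothetical calibration: on the unit cube the integral of f**alpha is 1,
    so the normalized MST length approximates the constant beta."""
    rng = rng or random.Random(0)
    alpha = (d - gamma) / d
    vals = [mst_length(uniform_cube(n, d, 1.0, rng), gamma) / n ** alpha
            for _ in range(trials)]
    return sum(vals) / len(vals)


def renyi_entropy_mst(points, beta, gamma=1.0):
    """Plug-in Renyi entropy estimate (in nats) of order alpha = (d - gamma)/d."""
    n, d = len(points), len(points[0])
    alpha = (d - gamma) / d
    length = mst_length(points, gamma)
    return (math.log(length / n ** alpha) - math.log(beta)) / (1.0 - alpha)
```

As a sanity check, for samples drawn uniformly from a square of side 2 the true Rényi entropy of any order is log 4 nats, so `renyi_entropy_mst(uniform_cube(300, 2, 2.0, rng), calibrate_beta(300, 2))` should land near that value, up to sampling fluctuation.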
MNIST Digit Database Global intrinsic dimension estimates: GMST (left) and 5-NN (right) (M = 10, N = 1, Q = 15).
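A hedged sketch of the idea behind graph-based intrinsic dimension estimates: since the graph length grows like n^((d-gamma)/d), the slope of log length against log n reveals d. The code below uses a kNN graph, random subsampling, and an ordinary least-squares fit; the function names, subsample sizes, and resampling scheme are assumptions made for illustration and are not the settings (M, N, Q) quoted on the slide.

```python
import math
import random


def knn_graph_length(points, k=5, gamma=1.0):
    """Sum over all points of (distance to each of the k nearest neighbours)**gamma."""
    total = 0.0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        total += sum(dist ** gamma for dist in dists[:k])
    return total


def intrinsic_dimension(points, sizes, k=5, gamma=1.0, resamples=5, seed=0):
    """Estimate d from the growth rate of kNN graph length with sample size.

    Fits log L(n) = a * log n + b by least squares and returns gamma / (1 - a),
    which inverts a = (d - gamma) / d. Assumes the fitted slope a is below 1.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for n in sizes:
        # Average the graph length over several random subsamples of size n.
        lengths = [knn_graph_length(rng.sample(points, n), k, gamma)
                   for _ in range(resamples)]
        xs.append(math.log(n))
        ys.append(math.log(sum(lengths) / len(lengths)))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return gamma / (1.0 - slope)
```

For example, points lying on a 2D plane embedded in 3D should give an estimate close to 2 even though each sample has three coordinates; rounding the returned value gives an integer dimension.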
Local Dimension/Entropy Statistics Costa&Hero:Birkhauser05
Dimensionality analysis of Internet traffic • Multiple measurement sites (Abilene) • Router labels on map: STTL, CHIN, NYCM, SNVA, DNVR, IPLS, WASH, LOSA, KSCY, ATLA, HSTN
Dimension of Abilene router load • Abilene Netflow data (traffic measured at 11 routers) • At the indicated time point: • Sunnyvale router has increased flow from a single IP address. • Over half of packets had both source and destination IP 128.223.216.xxx within port 119. • The same port showed increased activity on the Atlanta router during this time period as well. • K. Carter et al., SSP 2007, Madison WI.
Entropic image segmentation • Figure panels: satellite image of NYC metro; local dimension map; local dimension histogram; local entropy estimator • Axis label: neighborhood (Nbd) size
GEM anomaly detection • Training: for a large set of n training samples (assumed nominal), construct a k-MST/kNNG for k=(1-a)n points, 0<a<1, over the training samples • Test: for a singleton test sample, merge the test and training samples together • Declare an anomaly at level a if the k-MST/kNNG does not “capture” the test sample • Example: nominal bivariate Gaussian mixture density; the new point X is in the capture region of the k-MST
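The train/test/declare steps above can be illustrated with a simplified surrogate. This sketch does not build a k-MST; it assumes a leave-one-out kNN score (sum of distances to the nearest training neighbours) and treats the (1-a)n lowest-scoring training points as the captured set. All names, including `n_neighbors` (kept distinct from the slide's k=(1-a)n), are hypothetical.

```python
import math
import random


def knn_score(x, train, n_neighbors, skip=None):
    """Sum of distances from x to its n_neighbors nearest training points.
    `skip` is the index of x itself when x is a training point (leave-one-out)."""
    dists = sorted(math.dist(x, t) for j, t in enumerate(train) if j != skip)
    return sum(dists[:n_neighbors])


def fit_threshold(train, a, n_neighbors=5):
    """Score threshold such that about (1-a)*n training points are captured."""
    scores = sorted(knn_score(t, train, n_neighbors, skip=i)
                    for i, t in enumerate(train))
    n_keep = max(1, math.ceil((1 - a) * len(train)))  # plays the role of k=(1-a)n
    return scores[n_keep - 1]


def is_anomaly(x, train, threshold, n_neighbors=5):
    """Declare an anomaly when the test point scores above the captured set."""
    return knn_score(x, train, n_neighbors) > threshold


def sample_mixture(n, rng):
    """Nominal data for the demo: equal-weight bivariate Gaussian mixture
    with unit variance and centers (0, 0) and (4, 4)."""
    pts = []
    for _ in range(n):
        c = rng.choice((0.0, 4.0))
        pts.append((rng.gauss(c, 1.0), rng.gauss(c, 1.0)))
    return pts
```

With `train = sample_mixture(300, random.Random(0))` and `thr = fit_threshold(train, a=0.05)`, a point far from both mixture components such as (10, 10) is flagged, while a point at a component center is not. The design choice here is speed and simplicity: a kNN score needs only sorted distances, whereas an exact k-MST is much harder to compute.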
WSN activity detection Experiment • 14 Mica2 motes randomly distributed inside and outside lab • 14*13=182 pairs of RSSI measurements over 30 minute period • 1 sample acquired every ½ sec. • TDMA broadcast of 24 measurements every 12 secs • Students walk into and out of lab at random times over period • Positions of motes unknown • Webcam recorded activity for ground truth
GEM for WSN activity detection • GEM learns minimum entropy set of given probability [Hero&Michel99] by pruning a minimal spanning tree or kNNG over the feature set • Sliding window draws latest 100 samples from normal • Level of significance = 0.001 • p-values are also generated • [Figure: score (1-p) versus sample number (2 sps), time in secs; key: anomalies from GEM, ground truth of motion (xxxxx); regions labeled In Motion and No Motion]
Other applications • Criminal forensics (Silva and Willett 2008) • Image registration (Neemuchwala et al. 2005) • Database indexing (Neemuchwala et al. 2006) • Neuro tract clustering (Tsai et al. 2007)
IV. Conclusions • Topological inference offers a model-free approach to high dimensional data analysis • Minimal graphs over feature space capture topological properties of the feature distribution • Intrinsic dimension can be a remarkably robust discriminant • Anomaly detection can be implemented by minimum entropy set detection • Applications to security, medical imaging, remote sensing, sensor networks, gene expression analysis…
Selected references
• K. Carter, R. Raich and A. O. Hero, "De-biasing local dimension estimation," IEEE Workshop on Statist. Sig. Processing (SSP), 2007.
• J. Costa and A. O. Hero, "Learning intrinsic dimension and entropy of shapes," in Statistics and analysis of shapes, Eds. H. Krim and T. Yezzi, Birkhauser, pp. 231-252, 2006.
• A. O. Hero, B. Ma, O. Michel and J. Gorman, "Applications of entropic spanning graphs," in IEEE Signal Proc. Magazine (Special Issue on Mathematics in Imaging), Vol. 19, No. 5, pp. 85-95, Sept. 2002.
• A. O. Hero III, "Geometric entropy minimization (GEM) for anomaly detection and localization," in Proc. of Advances in Neural Information Processing Systems (NIPS), Vancouver, Nov. 2006.
• E. Oubel, M. De Craene, M. Gazzola, A. O. Hero and A. Frangi, "Multiview registration of cardiac tagging MRI images," IEEE Intl Symp. on Biomedical Imaging (ISBI), April 2007.
• H. Neemuchwala, A. Hero, S. Zabuawala, and P. Carson, "Image registration methods in high dimensional space," International Journal of Imaging Systems and Technology, Vol. 16, No. 5, pp. 130-145, Mar. 2007.
• J. Silva and R. Willett, "Hypergraph-based anomaly detection in very large networks," (Submitted to IEEE TSP, 2007).
• A. Tsai, C-F Westin, A. O. Hero and A. Willsky, "Fiber Tract Clustering on Manifolds With Dual Rooted-Graphs," IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), July 2007.