Isaac Newton Institute - Cambridge

Object Oriented Data Analysis J. S. Marron, Dept. of Statistics and Operations Research, University of North Carolina, August 15, 2014

Personal Opinions on Mathematical Statistics What is Mathematical Statistics? • Validation of existing methods • Asymptotics (n → ∞) & Taylor expansion • Comparison of existing methods (requires hard math, but really “accounting”???)

Personal Opinions on Mathematical Statistics What could Mathematical Statistics be? • Basis for invention of new methods • Complicated data → mathematical ideas • Do we value creativity? • Since we don’t do this, others do… (where are the £££s???)

Personal Opinions on Mathematical Statistics • Since we don’t do this, others do… • Pattern Recognition • Artificial Intelligence • Neural Nets • Data Mining • Machine Learning • ???

Personal Opinions on Mathematical Statistics Possible Litmus Test: Creative Statistics • Clinical Trials Viewpoint: Worst Imaginable Idea • Mathematical Statistics Viewpoint: ???

Object Oriented Data Analysis, I What is the “atom” of a statistical analysis? • 1st Course: Numbers • Multivariate Analysis Course: Vectors • Functional Data Analysis: Curves • More generally: Data Objects

Object Oriented Data Analysis, II Examples: • Medical Image Analysis • Images as Data Objects? • Shape Representations as Objects • Micro-arrays • Just multivariate analysis?

Object Oriented Data Analysis, III Typical Goals: • Understanding population variation • Visualization • Principal Component Analysis + • Discrimination (a.k.a. Classification) • Time Series of Data Objects

Object Oriented Data Analysis, IV Major Statistical Challenge, I: High Dimension Low Sample Size (HDLSS) • Dimension d >> sample size n • “Multivariate Analysis” nearly useless • Can’t “normalize the data” • Land of Opportunity for Statisticians • Need for “creative statisticians”
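A minimal sketch of why classical multivariate tools struggle when d >> n: the sample covariance matrix has rank at most n - 1, so it cannot be inverted to sphere ("normalize") the data. The sizes and the simulated Gaussian data are illustrative assumptions, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                      # illustrative HDLSS sizes (assumed)
X = rng.standard_normal((n, d))     # simulated data, one case per row

S = np.cov(X, rowvar=False)         # d x d sample covariance matrix
rank_S = np.linalg.matrix_rank(S)   # at most n - 1 after mean centering

# A d x d matrix with rank below d is singular, so S has no inverse.
print(f"d = {d}, n = {n}, rank of covariance = {rank_S}")
```

Any method that needs the inverse covariance (sphering, Fisher Linear Discrimination, Gaussian likelihood ratios) therefore needs a pseudo-inverse or some regularization in this setting.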

Object Oriented Data Analysis, V Major Statistical Challenge, II: • Data may live in non-Euclidean space • Lie Group / Symmet’c Spaces (manifold data) • Trees/Graphs as data objects • Interesting Issues: • What is “the mean” (pop’n center)? • How do we quantify “pop’n variation”?

Statistics in Image Analysis, I First Generation Problems: • Denoising • Segmentation • Registration (all about single images)

Statistics in Image Analysis, II Second Generation Problems: • Populations of Images • Understanding Population Variation • Discrimination (a.k.a. Classification) • Complex Data Structures (& Spaces) • HDLSS Statistics

HDLSS Statistics in Imaging Why HDLSS (High Dim, Low Sample Size)? • Complex 3-d Objects Hard to Represent • Often need d = 100’s of parameters • Complex 3-d Objects Costly to Segment • Often have n = 10’s cases

Medical Imaging – A Challenging Example • Male Pelvis • Bladder – Prostate – Rectum • How do they move over time (days)? • Critical to Radiation Treatment (cancer) • Work with 3-d CT • Very Challenging to Segment • Find boundary of each object? • Represent each Object?

Male Pelvis – Raw Data One CT Slice (in 3d image) Coccyx (Tail Bone) Rectum Prostate

Male Pelvis – Raw Data Prostate: manual segmentation Slice by slice Reassembled

Male Pelvis – Raw Data Prostate: Slices: Reassembled in 3d How to represent? Thanks: Ja-Yeon Jeong

Object Representation • Landmarks (hard to find) • Boundary Rep’ns (no correspondence) • Medial representations • Find “skeleton” • Discretize as “atoms” called M-reps

3-d m-reps • Bladder – Prostate – Rectum (multiple objects, J. Y. Jeong) • Medial Atoms provide “skeleton” • Implied Boundary from “spokes” → “surface”

3-d m-reps • M-rep model fitting • Easy, when starting from binary (blue) • But very expensive (30 – 40 minutes technician’s time) • Want automatic approach • Challenging, because of poor contrast, noise, … • Need to borrow information across training sample • Use Bayes approach: prior & likelihood → posterior • ~Conjugate Gaussians, but there are issues: • Major HDLSS challenges • Manifold aspect of data

PCA for m-reps, I Major issue: m-reps live in (locations, radius and angles) E.g. “average” of: = ??? Natural Data Structure is: Lie Groups ~ Symmetric spaces (smooth, curved manifolds)
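To illustrate why averaging is not straightforward for angle-valued data, here is a small sketch comparing the ordinary arithmetic mean with a circular mean (average the unit vectors, then read off the angle). The two angles and the helper name `circular_mean_deg` are hypothetical, and the circular mean shown is only one of several possible notions of center on a circle.

```python
import numpy as np

def circular_mean_deg(angles_deg):
    """Mean direction in degrees: average the unit vectors, then take the angle."""
    a = np.deg2rad(np.asarray(angles_deg, dtype=float))
    mean_angle = np.arctan2(np.sin(a).mean(), np.cos(a).mean())
    return np.rad2deg(mean_angle) % 360.0

angles = [350.0, 20.0]                    # two directions 30 degrees apart (hypothetical)
naive = float(np.mean(angles))            # 185.0, which points the opposite way
circ = float(circular_mean_deg(angles))   # close to 5 degrees, between the two inputs

print(f"arithmetic mean: {naive:.1f} deg, circular mean: {circ:.1f} deg")
```

The arithmetic mean treats the angles as numbers on a line and ignores the wrap-around at 360 degrees, which is the kind of problem that motivates working directly on the manifold.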

PCA for m-reps, II PCA on non-Euclidean spaces? (i.e. on Lie Groups / Symmetric Spaces) T. Fletcher: Principal Geodesic Analysis Idea: replace “linear summary of data” With “geodesic summary of data”…
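The "geodesic summary" idea can be sketched on the simplest curved manifold, the unit sphere, used here as a stand-in for the m-rep parameter space. The sketch computes a Fréchet mean by iterating log and exp maps, then runs PCA on the tangent vectors at that mean. It is an assumption-laden illustration (simulated data, hypothetical function names), not Fletcher's implementation.

```python
import numpy as np

def sphere_log(p, Q):
    """Log map: tangent vectors at p pointing to each row of Q, length = geodesic distance."""
    cos_t = np.clip(Q @ p, -1.0, 1.0)
    theta = np.arccos(cos_t)
    V = Q - cos_t[:, None] * p                     # component of Q orthogonal to p
    norm = np.linalg.norm(V, axis=1)
    scale = np.where(norm > 1e-12, theta / np.maximum(norm, 1e-12), 0.0)
    return scale[:, None] * V

def sphere_exp(p, v):
    """Exp map: walk from p along the geodesic with initial velocity v."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return p
    return np.cos(t) * p + np.sin(t) * v / t

def frechet_mean(Q, iters=100, tol=1e-10):
    """Iteratively move toward the mean of the tangent vectors."""
    mu = Q.mean(axis=0)
    mu = mu / np.linalg.norm(mu)
    for _ in range(iters):
        step = sphere_log(mu, Q).mean(axis=0)
        mu = sphere_exp(mu, step)
        if np.linalg.norm(step) < tol:
            break
    return mu

def pga(Q):
    """Tangent-space PCA at the Fréchet mean: directions and variances."""
    mu = frechet_mean(Q)
    V = sphere_log(mu, Q)                          # already centered at the mean
    _, s, Vt = np.linalg.svd(V, full_matrices=False)
    return mu, Vt, s**2 / len(Q)

# Simulated data: points spread along a great circle through the north pole.
rng = np.random.default_rng(1)
north = np.array([0.0, 0.0, 1.0])
t = np.linspace(-0.6, 0.6, 25)
s_noise = 0.05 * rng.standard_normal(25)
Q = np.array([sphere_exp(north, np.array([a, b, 0.0])) for a, b in zip(t, s_noise)])

mu, directions, variances = pga(Q)
print("mean:", np.round(mu, 3), "first direction:", np.round(directions[0], 3))
```

In this toy example the first principal geodesic direction comes out close to the x-axis, the direction along which the points were spread.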

PGA for m-reps, Bladder-Prostate-Rectum Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3 (analysis by Ja Yeon Jeong)

HDLSS Classification (i.e. Discrimination) Background: Two Class (Binary) version: Using “training data” from Class +1, and from Class -1 Develop a “rule” for assigning new data to a Class Canonical Example: Disease Diagnosis • New Patients are “Healthy” or “Ill” • Determined based on measurements

HDLSS Classification (Cont.) • Ineffective Methods: • Fisher Linear Discrimination • Gaussian Likelihood Ratio • Less Useful Methods: • Nearest Neighbors • Neural Nets (“black boxes”, no “directions” or intuition)

HDLSS Classification (Cont.) • Currently Fashionable Methods: • Support Vector Machines • Tree-Based Approaches • New High Tech Method • Distance Weighted Discrimination (DWD) • Specially designed for HDLSS data • Avoids “data piling” problem of SVM • Solves more suitable optimization problem

HDLSS Classification (Cont.) • Currently Fashionable Methods: • Tree-Based Approaches • Support Vector Machines:

Distance Weighted Discrimination Maximal Data Piling
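Data piling can be reproduced numerically. One standard way to write the maximal data piling direction is the pseudo-inverse of the overall sample covariance applied to the difference of class means. The sketch below computes it through an SVD of the centered data and checks that each class projects to a single point. The simulated data, sizes, and function name are assumptions for illustration.

```python
import numpy as np

def mdp_direction(X, y, tol=1e-10):
    """Direction proportional to pinv(overall covariance) @ (mean difference)."""
    Xc = X - X.mean(axis=0)
    diff = X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    keep = s > tol * s[0]                          # drop numerically zero directions
    w = Vt[keep].T @ ((Vt[keep] @ diff) / s[keep] ** 2)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
n_pos, n_neg, d = 10, 10, 200                      # d >> n (illustrative sizes)
y = np.array([1] * n_pos + [-1] * n_neg)
X = rng.standard_normal((n_pos + n_neg, d))
X[y == 1, 0] += 1.0                                # small class shift in one coordinate

w = mdp_direction(X, y)
proj = X @ w
spread_pos = np.ptp(proj[y == 1])                  # range of projections within Class +1
spread_neg = np.ptp(proj[y == -1])                 # range of projections within Class -1
print(f"within-class spread: {spread_pos:.2e}, {spread_neg:.2e}")
```

Both spreads are zero up to rounding: the training points of each class pile onto one value. That separates the training data perfectly but reflects sampling artifacts rather than population structure, which is the behavior DWD is designed to avoid.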

Distance Weighted Discrimination Based on Optimization Problem: More precisely: work in an appropriate penalty for violations Optimization Method (Michael Todd): Second Order Cone Programming • Still convex generalization of quadratic programming • Fast greedy solution • Can use existing software
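The optimization problem itself did not survive in this transcript. For reference, a commonly cited form of the DWD problem is sketched below. The notation (data vectors x_i, labels y_i in {+1, -1}, normal vector w, intercept β, slack variables ξ_i, penalty constant C) is assumed here and may differ from the slide.

```latex
\begin{aligned}
\min_{w,\,\beta,\,\xi}\quad & \sum_{i=1}^{n} \frac{1}{r_i} \;+\; C \sum_{i=1}^{n} \xi_i \\
\text{subject to}\quad & r_i = y_i\,(x_i^{\top} w + \beta) + \xi_i, \quad r_i > 0, \quad \xi_i \ge 0, \quad i = 1, \dots, n, \\
& \|w\|_2 \le 1 .
\end{aligned}
```

Here r_i is the (slack-adjusted) signed distance of case i from the separating hyperplane. Summing reciprocals lets every data point influence the direction, with nearby points weighted most, and the term in ξ_i is the penalty for violations. The norm constraint and the reciprocal terms can be expressed with second-order cones, which is why Second Order Cone Programming applies.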

DWD Bias Adjustment for Microarrays Microarray data: • Simult. Measur’ts of “gene expression” • Intrinsically HDLSS • Dimension d ~ 1,000s – 10,000s • Sample Sizes n ~ 10s – 100s My view: Each array is “point in cloud”

DWD Batch and Source Adjustment • For Perou’s Stanford Breast Cancer Data • Analysis in Benito et al. (2004), Bioinformatics https://genome.unc.edu/pubsup/dwd/ • Adjust for Source Effects • Different sources of mRNA • Adjust for Batch Effects • Arrays fabricated at different times
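The adjustment step can be sketched as follows: given a direction separating two batches (or sources), shift each batch along that direction so that its projected mean becomes zero. In this sketch the direction is simply the difference of batch means, used as a plainly labeled stand-in for the DWD direction; the data are simulated and the function name is hypothetical.

```python
import numpy as np

def shift_along_direction(X, labels, w):
    """Subtract from each group its mean projection onto w, times the unit vector w."""
    w = w / np.linalg.norm(w)
    X_adj = X.astype(float).copy()
    for g in np.unique(labels):
        mask = labels == g
        mean_proj = (X[mask] @ w).mean()
        X_adj[mask] -= mean_proj * w
    return X_adj

rng = np.random.default_rng(2)
n_per, d = 15, 500                                 # illustrative HDLSS sizes
batch = np.repeat([0, 1], n_per)
X = rng.standard_normal((2 * n_per, d))
X[batch == 1] += 0.5                               # simulated batch shift in every coordinate

# Stand-in direction: difference of batch means (a real analysis would use DWD).
w = X[batch == 1].mean(axis=0) - X[batch == 0].mean(axis=0)
w_unit = w / np.linalg.norm(w)
X_adj = shift_along_direction(X, batch, w)

gap_before = abs((X[batch == 1] @ w_unit).mean() - (X[batch == 0] @ w_unit).mean())
gap_after = abs((X_adj[batch == 1] @ w_unit).mean() - (X_adj[batch == 0] @ w_unit).mean())
print(f"projected mean gap before: {gap_before:.3f}, after: {gap_after:.2e}")
```

Only the component along the chosen direction changes, so variation orthogonal to it (for example biological class differences) is left in place.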

DWD Adj: Raw Breast Cancer data

DWD Adj: Source Colors

DWD Adj: Batch Colors

DWD Adj: Biological Class Colors

DWD Adj: Biological Class Colors & Symbols

DWD Adj: Biological Class Symbols

DWD Adj: Source Colors

DWD Adj: PC 1-2 & DWD direction

DWD Adj: DWD Source Adjustment

DWD Adj: Source Adj’d, PCA view

DWD Adj: Source Adj’d, Class Colored

DWD Adj: Source Adj’d, Batch Colored

DWD Adj: Source Adj’d, 5 PCs

DWD Adj: S. Adj’d, Batch 1,2 vs. 3 DWD

DWD Adj: S. & B1,2 vs. 3 Adjusted

DWD Adj: S. & B1,2 vs. 3 Adj’d, 5 PCs

DWD Adj: S. & B Adj’d, B1 vs. 2 DWD
