Isaac Newton Institute - Cambridge. Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina August 15, 2014. Personal Opinions on Mathematical Statistics. What is Mathematical Statistics? Validation of existing methods
Isaac Newton Institute - Cambridge Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina August 15, 2014
Personal Opinions on Mathematical Statistics What is Mathematical Statistics? • Validation of existing methods • Asymptotics (n → ∞) & Taylor expansion • Comparison of existing methods (requires hard math, but really “accounting”???)
Personal Opinions on Mathematical Statistics What could Mathematical Statistics be? • Basis for invention of new methods • Complicated data → mathematical ideas • Do we value creativity? • Since we don’t do this, others do… (where are the ₤₤₤s???)
Personal Opinions on Mathematical Statistics • Since we don’t do this, others do… • Pattern Recognition • Artificial Intelligence • Neural Nets • Data Mining • Machine Learning • ???
Personal Opinions on Mathematical Statistics Possible Litmus Test: Creative Statistics • Clinical Trials Viewpoint: Worst Imaginable Idea • Mathematical Statistics Viewpoint: ???
Object Oriented Data Analysis, I What is the “atom” of a statistical analysis? • 1st Course: Numbers • Multivariate Analysis Course : Vectors • Functional Data Analysis: Curves • More generally: Data Objects
Object Oriented Data Analysis, II Examples: • Medical Image Analysis • Images as Data Objects? • Shape Representations as Objects • Micro-arrays • Just multivariate analysis?
Object Oriented Data Analysis, III Typical Goals: • Understanding population variation • Visualization • Principal Component Analysis + • Discrimination (a.k.a. Classification) • Time Series of Data Objects
Object Oriented Data Analysis, IV Major Statistical Challenge, I: High Dimension Low Sample Size (HDLSS) • Dimension d >> sample size n • “Multivariate Analysis” nearly useless • Can’t “normalize the data” • Land of Opportunity for Statisticians • Need for “creative statisticians”
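One reading of the “can’t normalize the data” bullet is that classical multivariate methods rely on inverting the sample covariance matrix, which is impossible when d >> n because that matrix has rank at most n − 1. The sketch below checks this numerically; the sizes `n = 20`, `d = 500` and the Gaussian toy data are assumptions chosen for illustration, not values from the talk.

```python
import numpy as np

# Toy HDLSS data set: n observations in d dimensions, with d >> n.
rng = np.random.default_rng(0)
n, d = 20, 500
X = rng.normal(size=(n, d))          # rows are data objects (vectors)

# The d x d sample covariance matrix is built from only n centred rows,
# so its rank cannot exceed n - 1.
S = np.cov(X, rowvar=False)
rank = np.linalg.matrix_rank(S)

print(f"covariance is {S.shape[0]} x {S.shape[1]}, rank = {rank}")
```

Because the rank is far below d, the covariance matrix is singular, so sphering the data or forming a Gaussian likelihood ratio in the usual way is not available.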
Object Oriented Data Analysis, V Major Statistical Challenge, II: • Data may live in non-Euclidean space • Lie Group / Symmetric Spaces (manifold data) • Trees/Graphs as data objects • Interesting Issues: • What is “the mean” (pop’n center)? • How do we quantify “pop’n variation”?
Statistics in Image Analysis, I First Generation Problems: • Denoising • Segmentation • Registration (all about single images)
Statistics in Image Analysis, II Second Generation Problems: • Populations of Images • Understanding Population Variation • Discrimination (a.k.a. Classification) • Complex Data Structures (& Spaces) • HDLSS Statistics
HDLSS Statistics in Imaging Why HDLSS (High Dim, Low Sample Size)? • Complex 3-d Objects Hard to Represent • Often need d = 100’s of parameters • Complex 3-d Objects Costly to Segment • Often have n = 10’s cases
Medical Imaging – A Challenging Example • Male Pelvis • Bladder – Prostate – Rectum • How do they move over time (days)? • Critical to Radiation Treatment (cancer) • Work with 3-d CT • Very Challenging to Segment • Find boundary of each object? • Represent each Object?
Male Pelvis – Raw Data One CT Slice (in 3d image) Coccyx (Tail Bone) Rectum Prostate
Male Pelvis – Raw Data Prostate: manual segmentation Slice by slice Reassembled
Male Pelvis – Raw Data Prostate: Slices: Reassembled in 3d How to represent? Thanks: Ja-Yeon Jeong
Object Representation • Landmarks (hard to find) • Boundary Rep’ns (no correspondence) • Medial representations • Find “skeleton” • Discretize as “atoms” called M-reps
3-d m-reps • Bladder – Prostate – Rectum (multiple objects, J. Y. Jeong) • Medial Atoms provide “skeleton” • Implied Boundary from “spokes” → “surface”
3-d m-reps • M-rep model fitting • Easy, when starting from binary (blue) • But very expensive (30 – 40 minutes technician’s time) • Want automatic approach • Challenging, because of poor contrast, noise, … • Need to borrow information across training sample • Use Bayes approach: prior & likelihood → posterior • ~Conjugate Gaussians, but there are issues: • Major HDLSS challenges • Manifold aspect of data
PCA for m-reps, I Major issue: m-reps live in (locations, radius and angles) E.g. “average” of: = ??? Natural Data Structure is: Lie Groups ~ Symmetric spaces (smooth, curved manifolds)
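The difficulty with averaging angles can be seen in a few lines. The sketch below compares the ordinary arithmetic mean with a circular mean (average the unit vectors, then take the direction of the result), which is one of several ways to define a centre on a circle. The four angles are assumed values chosen for illustration, since the example on the slide did not survive extraction.

```python
import math

def arithmetic_mean_deg(angles):
    """Ordinary average, treating angles as plain numbers."""
    return sum(angles) / len(angles)

def circular_mean_deg(angles):
    """Direction of the averaged unit vectors, reported in [0, 360)."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c)) % 360.0

# Assumed example: four directions clustered around 0 degrees.
angles = [2.0, 3.0, 358.0, 359.0]

naive = arithmetic_mean_deg(angles)
circular = circular_mean_deg(angles)
print(f"arithmetic mean: {naive:.2f} deg, circular mean: {circular:.2f} deg")
```

The arithmetic mean lands on the opposite side of the circle from every data point, while the circular mean stays inside the cluster. This is the kind of behaviour that motivates treating m-rep parameters as points on a curved manifold.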
PCA for m-reps, II PCA on non-Euclidean spaces? (i.e. on Lie Groups / Symmetric Spaces) T. Fletcher: Principal Geodesic Analysis Idea: replace “linear summary of data” with “geodesic summary of data”…
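A minimal sketch of the “geodesic summary” idea, under strong simplifying assumptions: the data live on the unit sphere S² rather than in the m-rep space, and the principal geodesics are approximated by ordinary PCA in the tangent space at the intrinsic (Fréchet) mean. The function names and the toy data are hypothetical and this is not the code used for the analyses in the talk.

```python
import numpy as np

def log_map(p, q):
    """Tangent vector at p pointing toward q; its length is the geodesic distance."""
    cos_t = np.clip(p @ q, -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-12:
        return np.zeros_like(p)
    return (q - cos_t * p) * (theta / np.sin(theta))

def exp_map(p, v):
    """Point reached by walking along the geodesic from p in direction v."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return p
    return np.cos(norm) * p + np.sin(norm) * (v / norm)

def frechet_mean(points, iters=50):
    """Intrinsic mean: repeatedly step along the average tangent vector."""
    mu = points[0]
    for _ in range(iters):
        step = np.mean([log_map(mu, q) for q in points], axis=0)
        mu = exp_map(mu, step)
        mu = mu / np.linalg.norm(mu)      # guard against rounding drift
    return mu

def tangent_pca(points):
    """PCA of the log-mapped data at the Frechet mean."""
    mu = frechet_mean(points)
    V = np.array([log_map(mu, q) for q in points])
    _, s, Vt = np.linalg.svd(V, full_matrices=False)
    variances = s ** 2 / len(points)
    return mu, Vt, variances

# Toy data: nine points spread along one great circle through the north pole.
t = np.linspace(-0.4, 0.4, 9)
points = np.column_stack([np.sin(t), np.zeros_like(t), np.cos(t)])

mu, directions, variances = tangent_pca(points)
print("mean:", np.round(mu, 6))
print("share of variation on first geodesic:", variances[0] / variances.sum())
```

For this toy data the mean is the north pole and all of the variation lies along a single geodesic, which a linear PCA of the raw 3-d coordinates would not report, because the points curve away from any straight line.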
PGA for m-reps, Bladder-Prostate-Rectum Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3 (analysis by Ja Yeon Jeong)
HDLSS Classification (i.e. Discrimination) Background: Two Class (Binary) version: Using “training data” from Class +1, and from Class -1 Develop a “rule” for assigning new data to a Class Canonical Example: Disease Diagnosis • New Patients are “Healthy” or “Ill” • Determined based on measurements
HDLSS Classification (Cont.) • Ineffective Methods: • Fisher Linear Discrimination • Gaussian Likelihood Ratio • Less Useful Methods: • Nearest Neighbors • Neural Nets (“black boxes”, no “directions” or intuition)
HDLSS Classification (Cont.) • Currently Fashionable Methods: • Support Vector Machines • Tree-Based Approaches • New High Tech Method • Distance Weighted Discrimination (DWD) • Specially designed for HDLSS data • Avoids “data piling” problem of SVM • Solves more suitable optimization problem
HDLSS Classification (Cont.) • Currently Fashionable Methods: • Tree-Based Approaches • Support Vector Machines:
Distance Weighted Discrimination Maximal Data Piling
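Data piling can be reproduced on simulated data. When d exceeds n, there is generically a direction onto which every training point of a class projects to exactly the same value. The sketch below finds such a direction as the minimum-norm least-squares solution of the centred system; it illustrates the piling phenomenon only and is not an implementation of SVM or DWD. The sample sizes, the Gaussian data and the function name `piling_direction` are assumptions.

```python
import numpy as np

def piling_direction(X, y):
    """Unit vector w such that X @ w takes one value per class (needs d >= n - 1)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Minimum-norm solution of Xc @ w = yc; an exact fit exists when Xc has rank n - 1.
    w = np.linalg.lstsq(Xc, yc, rcond=1e-10)[0]
    return w / np.linalg.norm(w)

# Toy HDLSS training set: 10 cases per class in 200 dimensions.
rng = np.random.default_rng(0)
n_per_class, d = 10, 200
X = np.vstack([
    rng.normal(0.5, 1.0, size=(n_per_class, d)),    # Class +1
    rng.normal(-0.5, 1.0, size=(n_per_class, d)),   # Class -1
])
y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])

w = piling_direction(X, y)
proj = X @ w
spread_pos = np.ptp(proj[y == 1])     # range of projections within Class +1
spread_neg = np.ptp(proj[y == -1])    # range of projections within Class -1
gap = proj[y == 1].mean() - proj[y == -1].mean()

print(f"within-class spread: {spread_pos:.2e}, {spread_neg:.2e}; gap: {gap:.3f}")
```

The within-class spreads are zero up to rounding, so the training data look perfectly separated along this direction, yet the piling is an artifact of the training sample and says little about how new data will project.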
Distance Weighted Discrimination Based on Optimization Problem: More precisely, work in appropriate penalty for violations Optimization Method (Michael Todd): Second Order Cone Programming • Still convex generalization of quadratic programming • Fast greedy solution • Can use existing software
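The formula on this slide did not survive extraction. For reference, the DWD problem as usually stated in the literature (Marron, Todd and Ahn) is written out below; the symbols are the standard ones and may not match the slide's notation. Here x_i are the training vectors, y_i ∈ {+1, −1} the class labels, w the normal vector of the separating hyperplane, b its offset, ξ_i the slack that allows violations, and C the penalty weight.

```latex
\begin{aligned}
\min_{w,\, b,\, \xi} \quad & \sum_{i=1}^{n} \frac{1}{r_i} \;+\; C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & r_i = y_i \left( x_i^{\top} w + b \right) + \xi_i, \qquad r_i > 0, \\
& \lVert w \rVert \le 1, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n .
\end{aligned}
```

The sum of reciprocal residuals lets every training point influence the direction, with nearer points weighted more heavily, in contrast to SVM, where only the support vectors matter. The norm constraint on w is what makes this a second order cone program.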
DWD Bias Adjustment for Microarrays Microarray data: • Simultaneous measurements of “gene expression” • Intrinsically HDLSS • Dimension d ~ 1,000s – 10,000s • Sample Sizes n ~ 10s – 100s My view: Each array is “point in cloud”
DWD Batch and Source Adjustment • For Perou’s Stanford Breast Cancer Data • Analysis in Benito et al. (2004), Bioinformatics https://genome.unc.edu/pubsup/dwd/ • Adjust for Source Effects • Different sources of mRNA • Adjust for Batch Effects • Arrays fabricated at different times
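The adjustment step can be sketched as follows, assuming the approach is to find a direction separating the two batches, project each batch onto it, and subtract each batch's mean projection along that direction. In this sketch the separating direction is the simple difference of batch means, used as a stand-in for the DWD direction, and the simulated data and the function name `adjust_batches` are hypothetical.

```python
import numpy as np

def adjust_batches(X, batch, direction):
    """Shift each batch along `direction` so its mean projection becomes zero."""
    w = direction / np.linalg.norm(direction)
    X_adj = X.astype(float).copy()
    for b in np.unique(batch):
        mask = batch == b
        mean_proj = (X[mask] @ w).mean()
        X_adj[mask] -= mean_proj * w      # move the whole batch rigidly
    return X_adj

# Simulated arrays: two batches of 15, 300 "genes", batch 1 offset on 10 genes.
rng = np.random.default_rng(1)
n_per_batch, d = 15, 300
shift = np.zeros(d)
shift[:10] = 3.0
X = np.vstack([
    rng.normal(size=(n_per_batch, d)),
    rng.normal(size=(n_per_batch, d)) + shift,
])
batch = np.array([0] * n_per_batch + [1] * n_per_batch)

# Stand-in for the DWD direction: difference of the two batch means.
direction = X[batch == 1].mean(axis=0) - X[batch == 0].mean(axis=0)
w = direction / np.linalg.norm(direction)

X_adj = adjust_batches(X, batch, direction)
gap_before = (X[batch == 1] @ w).mean() - (X[batch == 0] @ w).mean()
gap_after = (X_adj[batch == 1] @ w).mean() - (X_adj[batch == 0] @ w).mean()
print(f"batch gap along direction: before {gap_before:.3f}, after {gap_after:.2e}")
```

Each batch is moved as a rigid cloud, so the relationships among arrays within a batch are left unchanged; only the systematic offset between batches along the chosen direction is removed.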