Handling Outliers and Missing Data in Statistical Data Models Kaushik Mitra Date: 17/1/2011 ECSU Seminar, ISI
Statistical Data Models • Goal: find structure in data • Applications: finance, engineering, the biological sciences, and wherever else we deal with data • Some examples: regression, matrix factorization • Challenges: outliers and missing data
Outliers Are Quite Common • Google search results for 'male faces'
Need to Handle Outliers Properly • Removing salt-and-pepper (outlier) noise • [Figure: noisy image, Gaussian-filtered image, desired result]
Missing Data Problem • Completing missing tracks in structure from motion • [Figure: incomplete tracks, tracks completed by a suboptimal method, desired result]
Our Focus • Outliers in regression • Linear regression • Kernel regression • Matrix factorization in the presence of missing data
What is Regression? • Regression: find a functional relation between y and x • x: independent variable • y: dependent variable • Given data: (yi, xi) pairs • Model: y = f(x, w) + n • Estimate w • Predict y for a new x
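As a minimal illustration of the linear case, here is a least-squares fit on toy data; all names and numbers are illustrative, not from the talk.

```python
import numpy as np

# Minimal least-squares fit of y = f(x, w) + n with f linear (toy data).
rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.standard_normal((N, D))          # rows are the x_i
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(N)

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # estimate w
y_new = np.array([0.2, 0.1, -0.3]) @ w_hat      # predict y for a new x
```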
Robust Regression • Real-world data are corrupted with outliers • Outliers make estimates unreliable • Robust regression unknowns: the parameter w and the outliers • Combinatorial problem: with N data points and k outliers there are C(N,k) ways to choose the outlier set
Prior Work • Combinatorial algorithms • Random sample consensus (RANSAC) • Least Median of Squares (LMedS) • Exponential in dimension • M-estimators • Robust cost functions, but prone to local minima
Robust Linear Regression Model • Linear regression model: yi = xiTw + ei • ei: Gaussian noise • Proposed robust model: ei = ni + si • ni: inlier noise (Gaussian) • si: outlier noise (sparse) • Matrix-vector form: y = Xw + n + s, where X stacks the rows xiT and y, n, s stack the yi, ni, si • Estimate w and s
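A toy generator for this model (assumed synthetic numbers, not the talk's data) makes the roles of n and s concrete:

```python
import numpy as np

# Toy data for the robust model y = X w + n + s: n dense Gaussian, s sparse.
rng = np.random.default_rng(1)
N, D, k = 100, 3, 10                      # k outliers out of N points
X = rng.standard_normal((N, D))
w = np.array([1.0, -2.0, 0.5])
n = 0.1 * rng.standard_normal(N)          # inlier (Gaussian) noise
s = np.zeros(N)                           # outlier noise: sparse vector
s[rng.choice(N, size=k, replace=False)] = rng.uniform(-10.0, 10.0, size=k)
y = X @ w + n + s                         # ordinary LS on (X, y) is now biased
```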
Simplification • Objective (as in RANSAC): find w that minimizes the number of outliers • Eliminate w from the model y = Xw + n + s • Premultiply by C with CX = 0 (possible for N > D): Cy = CXw + Cs + Cn = Cs + Cn • So z = Cy = Cs + g, with g Gaussian • Problem becomes: solve for s -> identify outliers -> least squares on the inliers -> w
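To make the reduction concrete, here is a hedged numpy/scipy sketch of the elimination step, continuing the toy data above; the helper choices are illustrative, not the talk's code.

```python
import numpy as np
from scipy.linalg import null_space

# Rows of C span the left null space of X, so CX = 0 and z = Cy = Cs + g.
C = null_space(X.T).T            # shape (N - D, N); satisfies C @ X ≈ 0
z = C @ y                        # depends only on the sparse outliers s
# Once s is estimated (next slides), drop outlier rows and refit w by LS:
#   inliers = np.abs(s_hat) < tol        (tol is an assumed threshold)
#   w_hat, *_ = np.linalg.lstsq(X[inliers], y[inliers], rcond=None)
```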
Relation to Sparse Learning • Solve: min ||s||0 subject to ||z - Cs|| ≤ ε • Combinatorial problem • This is sparse basis selection / sparse learning • Two approaches: • Basis Pursuit (Chen, Donoho, Saunders 1995) • Bayesian Sparse Learning (Tipping 2001)
Basis Pursuit Robust Regression (BPRR) • Solve: min ||s||1 subject to ||z - Cs|| ≤ ε • Basis Pursuit Denoising (Chen et al. 1995) • Convex problem • Cubic complexity: O(N^3) • From compressive sensing theory (Candes 2005): equivalent to the original problem if • s is sparse • C satisfies the Restricted Isometry Property (RIP) • Isometry: ||s1 - s2|| ≈ ||C(s1 - s2)|| • Restricted: to the class of sparse vectors • In general, no guarantees for our problem
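A convenient stand-in for basis pursuit denoising is its Lagrangian (Lasso) form, min ||z - Cs||^2 + alpha ||s||1. A hedged sketch using scikit-learn, continuing the previous sketches; alpha and the inlier threshold are assumed tuning choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

# BPRR sketch via the Lasso form of basis pursuit denoising.
lasso = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)
lasso.fit(C, z)                            # C, z from the previous sketch
s_hat = lasso.coef_
inliers = np.abs(s_hat) < 1e-3             # near-zero s_i are inliers
w_hat, *_ = np.linalg.lstsq(X[inliers], y[inliers], rcond=None)
```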
Bayesian Sparse Robust Regression (BSRR) • Sparse Bayesian learning technique (Tipping 2001) • Puts a sparsity-promoting prior on s: p(si) ∝ 1/|si| • Likelihood: p(z|s) = N(Cs, εI) • Solves the MAP problem: maximize the posterior p(s|z) • Cubic complexity: O(N^3)
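For flavor, here is a generic sparse Bayesian learning loop in the spirit of Tipping (2001), with per-coefficient precisions updated by EM; this is a textbook variant, not the talk's implementation.

```python
import numpy as np

def ard_em(Phi, t, noise_var=0.01, n_iter=50):
    """Sparse Bayesian learning sketch: precisions alpha updated by EM."""
    m = Phi.shape[1]
    alpha = np.ones(m)
    for _ in range(n_iter):
        Sigma = np.linalg.inv(Phi.T @ Phi / noise_var + np.diag(alpha))
        mu = Sigma @ Phi.T @ t / noise_var      # posterior mean
        alpha = 1.0 / (mu**2 + np.diag(Sigma))  # EM precision update
    return mu

s_hat = ard_em(C, z)   # sparse outlier estimate; each iteration is O(N^3)
```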
Setup for Empirical Studies • Synthetically generated data • Performance criterion: angle between the ground-truth and estimated hyperplanes
Vary Outlier Fraction • BSRR performs well in all dimensions • Combinatorial algorithms like RANSAC, MSAC, and LMedS are not practical in high dimensions • [Plots: estimation error vs. outlier fraction for dimensions 2, 8, and 32]
Facial Age Estimation • FG-NET dataset: 1002 images of 82 subjects • Regression • y: age • x: geometric feature vector
Outlier Removal by BSRR • Label the data as inliers and outliers • Detected 177 outliers among the 1002 images • Leave-one-out testing
Summary for Robust Linear Regression • Modeled outliers as a sparse variable • Formulated robust regression as a sparse learning problem: BPRR and BSRR • BSRR gives the best performance • Limitation: linear regression model; next, a kernel model
Relevance Vector Machine (RVM) • RVM model: y = Σi wi k(x, xi) + e • k(·,·): kernel function • Examples of kernels • k(xi, xj) = (xiTxj)^2: polynomial kernel • k(xi, xj) = exp(-||xi - xj||^2 / 2σ^2): Gaussian kernel • Kernel trick: k(xi, xj) = ψ(xi)Tψ(xj), i.e., map xi to a feature space ψ(xi)
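A small sketch of building the Gaussian kernel matrix K with K[i, j] = k(xi, xj); the function name and sigma are illustrative.

```python
import numpy as np

# Pairwise Gaussian kernel matrix over the rows of X.
def gaussian_kernel_matrix(X, sigma=1.0):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # ||x_i - x_j||^2
    return np.exp(-d2 / (2.0 * sigma**2))

K = gaussian_kernel_matrix(X)   # X as in the earlier sketches
```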
RVM: A Bayesian Approach • Prior distribution: p(w), a sparsity-promoting prior with p(wi) ∝ 1/|wi| • Why sparse? Use a smaller subset of the training data for prediction, as in the support vector machine • Likelihood: Gaussian noise • Non-robust: susceptible to outliers
Robust RVM Model • Original RVM model: y = Kw + e, with e Gaussian noise • Explicitly model outliers: ei = ni + si • ni: inlier noise (Gaussian) • si: outlier noise (sparse and heavy-tailed) • Matrix-vector form: y = Kw + n + s • Parameters to be estimated: w and s
Robust RVM Algorithms • y = [K|I] ws + n, where ws = [wT sT]T is a sparse vector • Two approaches • Bayesian • Optimization
Robust Bayesian RVM (RB-RVM) • Prior specification • w and s independent: p(w, s) = p(w)p(s) • Sparsity-promoting prior for s: p(si) ∝ 1/|si| • Solve for the posterior p(w, s|y) • Prediction: use the w inferred above • Computation: a bigger RVM, with ws instead of w and [K|I] instead of K
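Since RB-RVM is "a bigger RVM", the same machinery can, in sketch form, run on the augmented design; this reuses the hypothetical ard_em() from the BSRR sketch and assumes K and targets y from the kernel setting.

```python
import numpy as np

# Jointly infer w and s from the augmented design [K | I].
N = K.shape[0]
Phi = np.hstack([K, np.eye(N)])     # "bigger RVM": [K | I]
ws = ard_em(Phi, y)                 # ws = [w; s]
w_hat, s_hat = ws[:N], ws[N:]
y_pred = K @ w_hat                  # prediction uses only w
```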
Basis Pursuit RVM (BP-RVM) • Optimization approach • The exact problem is combinatorial; solve the closest convex approximation • From compressive sensing theory: same solution if [K|I] satisfies the RIP • In general, this cannot be guaranteed
Image Denoising • Salt-and-pepper noise pixels act as outliers • Regression formulation: treat the image as a surface over the 2D grid • y: intensity • x: 2D grid location • Denoised image obtained by prediction
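A toy end-to-end sketch of this formulation on an assumed synthetic patch (not the talk's images), reusing the hypothetical gaussian_kernel_matrix() and ard_em() helpers from the earlier sketches:

```python
import numpy as np

# Each pixel's 2D coordinate is x, its intensity is y; salt-and-pepper
# pixels behave as the sparse outliers s in y = f(x) + n + s.
H = W = 16
ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
grid = np.column_stack([ii.ravel(), jj.ravel()]).astype(float)
clean = (np.sin(ii / 4.0) + np.cos(jj / 4.0)).ravel()   # smooth toy surface
y_img = clean + 0.05 * np.random.default_rng(2).standard_normal(H * W)
y_img[::37] = 1.0                                       # salt-and-pepper outliers
K_img = gaussian_kernel_matrix(grid, sigma=2.0)
ws = ard_em(np.hstack([K_img, np.eye(H * W)]), y_img)   # robust RVM fit
denoised = (K_img @ ws[:H * W]).reshape(H, W)           # prediction = denoised image
```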
Some More Results • [Figure: denoising results for RVM, RB-RVM, and the median filter]
Age Estimation from Facial Images • RB-RVM detected 90 outliers • Leave-one-person-out testing
Summary for Robust RVM • Modeled outliers as sparse variables • Jointly estimated the parameters and the outliers • The Bayesian approach gives very good results
Limitations of Regression • Regression: y = f(x, w) + n assumes noise in y only • Not always reasonable: all variables may have noise • M = [x1 x2 … xN] • Principal component analysis (PCA): [x1 x2 … xN] = ABT • A: principal components • B: coefficients • M = ABT: matrix factorization (our next topic)
Applications in Computer Vision • Matrix factorization: M = ABT • Application: building 3D models from images • Geometric approach (multiple views): structure from motion (SfM) • Photometric approach (multiple lightings): photometric stereo
Matrix Factorization • Applications in vision • Affine structure from motion (SfM) • Photometric stereo • Solution: SVD • M = USVT • Truncate S to rank r • A = US^0.5, B = VS^0.5 • [Figure: in affine SfM, the measurement matrix M of tracked image coordinates (xij, yij) factors as M = CST, a rank-4 matrix, and as M = NST with rank 3 after registration]
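The SVD recipe as a minimal sketch (illustrative names, assumed toy input):

```python
import numpy as np

# Rank-r factorization by SVD: M ≈ A B^T with A = U S^0.5, B = V S^0.5.
def factorize_svd(M, r):
    U, sv, Vt = np.linalg.svd(M, full_matrices=False)
    S_half = np.diag(np.sqrt(sv[:r]))             # truncate S to rank r
    return U[:, :r] @ S_half, Vt[:r].T @ S_half   # A, B

M = np.random.default_rng(3).standard_normal((20, 15))
A, B = factorize_svd(M, r=4)
```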
Missing Data Scenario • Missing feature tracks in SfM • Specularities and shadows in photometric stereo • [Figure: incomplete feature tracks]
Challenges in Missing Data Scenario • Can't use SVD • Solve: min over A, B of ||W ⊙ (M - ABT)||F^2 + λ(||A||F^2 + ||B||F^2) • W: binary weight matrix (1 for observed entries), λ: regularization parameter • Challenges • Non-convex problem • Newton's-method-based algorithm (Buchanan et al. 2005) is very slow • Goal: design an algorithm that is fast (handles large-scale data) and flexible enough to handle additional constraints, e.g., orthonormality constraints in orthographic SfM
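For contrast with what follows, here is a simple alternating-least-squares baseline for this weighted objective; it is one of the standard suboptimal methods, not the LRSDP algorithm proposed next.

```python
import numpy as np

def als_factorize(M, Wmask, r, lam=0.1, n_iter=50):
    """Alternating least squares for
    min ||W .* (M - A B^T)||_F^2 + lam (||A||_F^2 + ||B||_F^2)."""
    m, n = M.shape
    rng = np.random.default_rng(0)
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    for _ in range(n_iter):
        for i in range(m):                                   # rows of A
            w = Wmask[i]
            G = (B * w[:, None]).T @ B + lam * np.eye(r)
            A[i] = np.linalg.solve(G, B.T @ (w * M[i]))
        for j in range(n):                                   # rows of B
            w = Wmask[:, j]
            G = (A * w[:, None]).T @ A + lam * np.eye(r)
            B[j] = np.linalg.solve(G, A.T @ (w * M[:, j]))
    return A, B
```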
Proposed Solution • Formulate matrix factorization as a low-rank semidefinite program (LRSDP) • LRSDP: a fast implementation of SDP (Burer, 2001) using a quasi-Newton algorithm • Advantages of the proposed formulation: • Solves large-scale matrix factorization problems • Handles additional constraints
Low-Rank Semidefinite Programming (LRSDP) • Stated as: min tr(C RRT) subject to tr(Al RRT) = bl, l = 1, …, k • Variable: R • Constants: C (cost matrix), Al, bl (constraints) • Challenge: formulating matrix factorization as an LRSDP, i.e., designing C, Al, bl
Matrix Factorization as LRSDP: Noiseless Case • We want to solve: min ||A||F^2 + ||B||F^2 subject to (ABT)ij = Mij for the observed entries (i, j) • LRSDP formulation: C identity matrix, Al indicator matrices
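One way to write out the embedding, in assumed notation (Ω denotes the observed-entry set), following the slide's hints that C is the identity and the Al are indicator matrices:

```latex
% Sketch of the embedding; \Omega and the index pairs (i_l, j_l) are
% assumed notation. With R = [A; B], the block structure of RR^T gives
% tr(RR^T) = ||A||_F^2 + ||B||_F^2 and (RR^T)_{i, m+j} = (AB^T)_{ij}.
\min_{A,B}\ \|A\|_F^2 + \|B\|_F^2
\quad \text{s.t.}\quad (AB^T)_{ij} = M_{ij},\ (i,j)\in\Omega
\;\Longleftrightarrow\;
\min_{R}\ \mathrm{tr}(I\,RR^T)
\quad \text{s.t.}\quad \mathrm{tr}(A_l\,RR^T) = M_{i_l j_l},\ l = 1,\dots,|\Omega|
```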
Affine SfM • Dinosaur sequence (72% missing data) • MF-LRSDP gives the best reconstruction
Photometric Stereo • Face sequence (42% missing data) • MF-LRSDP and damped Newton give the best results
Additional Constraints: Orthographic Factorization • Dinosaur sequence
Summary • Formulated missing-data matrix factorization as an LRSDP • Handles large-scale problems • Handles additional constraints • Overall summary • Two statistical data models • Regression in the presence of outliers: the role of sparsity • Matrix factorization in the presence of missing data: low-rank semidefinite programming