1 / 55

Learning From Data Locally and Globally

Learning From Data Locally and Globally. Kaizhu Huang Supervisors: Prof. Irwin King, Prof. Michael R. Lyu. Outline. Background Linear Binary classifier Global Learning Bayes optimal Classifier Local Learning Support Vector Machine Contributions

amelia
Download Presentation

Learning From Data Locally and Globally

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Learning From Data Locally and Globally Kaizhu Huang Supervisors: Prof. Irwin King, Prof. Michael R. Lyu Dept. of C.S.E., C.U.H.K.

  2. Outline • Background • Linear Binary classifier • Global Learning • Bayes optimal Classifier • Local Learning • Support Vector Machine • Contributions • Minimum Error Minimax Probability Machine (MEMPM) • Biased Minimax Probability Machine (BMPM) • Maxi-Min Margin Machine (M4) • Local Support Vector Regression (LSVR) • Future work • Conclusion Dept. of C.S.E., C.U.H.K.

  3. Background - Linear Binary Classifier Given two classes of data sampled from x and y, we are trying to find a linear decision plane wT z + b=0, which can correctly discriminate x from y. wT z +b< 0, z is classified as y; wT z + b>0, z is classified as x. wT z + b=0 : decision hyperplane y x Dept. of C.S.E., C.U.H.K.

  4. Traditional Bayes Optimal Classifier Background - Global Learning (I) • Global learning • Basic idea: Focusing on summarizing data usually by estimating a distribution • Example • 1) Assume Gaussinity for the data • 2) Learn the parameters via MLE or other criteria • 3) Exploit Bayes theory to find the optimal thresholding for classification Dept. of C.S.E., C.U.H.K.

  5. Background - Global Learning (II) • Problems Usually have to assume specific models on data, which may NOT always coincide with data “all models are wrong but some are useful…”—by George Box Estimating distributions may be wasteful and imprecise Finding the ideal generator of the data, i.e., the distribution, is only an intermediate goal in many settings, e.g., in classification or regression. Optimizing an intermediate objective may be inefficient or wasteful. Dept. of C.S.E., C.U.H.K.

  6. Background- Local Learning (I) • Local learning • Basic idea: Focus on exploiting part of information, which is directly related to the objective, e.g., the classification accuracy instead of describing data in a holistic way • Example In classification, we need to accurately model the data around the (possible) separating plane, while inaccurate modeling of other parts is certainly acceptable (as is done in SVM). Dept. of C.S.E., C.U.H.K.

  7. Decision Plane Margin Support Vectors Background - Local Learning (II) • Support Vector Machine (SVM) ---The current state-of-the-art classifier Dept. of C.S.E., C.U.H.K.

  8. Background - Local Learning (III) • Problems The fact that the objective is exclusively determined by local information may lose the overall view of data Dept. of C.S.E., C.U.H.K.

  9. A more reasonable hyperplane Learning Locally and Globally Background- Local Learning (IV) An illustrative example Along the dashed axis, the y data is obviously more likely to scatter than the x data. Therefore, a more reasonable hyerplane may lie closer to the x data rather than locating itself in the middle of two classes as in SVM. SVM y x Dept. of C.S.E., C.U.H.K.

  10. Learning Locally and Globally • Basic idea: Focus on using both local informationandcertain robust global information • Do not try to estimate the distribution as in global learning, which may be inaccurate and indirect • Consider robust global information for providing a roadmap for local learning Dept. of C.S.E., C.U.H.K.

  11. Local Learning,e.g., SVM Distribution-free Bayes optimal classifier --- Minimum Error Minimax Probability Machine (MEMPM) Maxi-Min Margin Machine (M4) Summary of Background Optimizing an intermediate objective Problem Can we directly optimize the objective?? Global Learning Problem Assume specific models Without specific model assumption? Local Learning Problem Focusing on local info may lose the roadmap of data Can we learn both globally and locally?? Dept. of C.S.E., C.U.H.K.

  12. Contributions • Mininum Error Minimax Probability Machine (Accepted by JMLR 04) • A worst-case distribution-free Bayes Optimal Classifier • Containing Minimax Probability Machine (MPM) and Biased Minimax Probability Machine (BMPM)(AMAI04,CVPR04) as special cases • Maxi-Min Margin Machine (M4) (ICML 04+Submitted) • A unified framework that learns locally and globally • Support Vector Machine (SVM) • Minimax Probability Machine (MPM) • Fisher Discriminant Analysis (FDA) • Can be linked with MEMPM • Can be extended into regression: Local Support Vector Regression (LSVR) (submitted) Dept. of C.S.E., C.U.H.K.

  13. Hierarchy Graph of Related Models Classification models Global Learning Hybrid Learning Local Learning Generative Learning FDA Non-parametric Learning Gabriel Graph M4 Conditional Learning MEMPM SVM Bayesian Average Learning neural network BMPM Maximum Likelihood Learning Bayesian Point Machine MPM LSVR Maximum Entropy Discrimination Dept. of C.S.E., C.U.H.K.

  14. The class of distributions that have prescribed mean and covariance likewise Minimum Error Minimax Probability Machine (MEMPM) Model Definition: y {w,b} wTy≤b x wTx≥b • θ: prior probability of class x; α(β): represents the worst-case accuracy for class x (y) Dept. of C.S.E., C.U.H.K.

  15. MEMPM: Model Comparison MEMPM (JMLR04) MPM (Lanckriet et al. JMLR 2002) Dept. of C.S.E., C.U.H.K.

  16. MEMPM: Advantages • A distribution-free Bayes optimal Classifier in the worst-case scenario • Containing an explicit accuracy bound, namely, • Subsuming a special case Biased Minimax Probability Machinefor biased classification Dept. of C.S.E., C.U.H.K.

  17. BMPM An ideal model for biased classification. A typical setting: We should maximize the accuracy for the important class as long as the accuracy for the less important class is acceptable(greater than an acceptable level ). MEMPM: Biased MPM Biased Classification: Diagnosis of epidemical disease: Classifying a patient who is infected with a disease into an opposite class results in more serious consequence than the other way around. The classification accuracy should be biased towards the class with disease. Dept. of C.S.E., C.U.H.K.

  18. MEMPM: Biased MPM (I) • Objective • Equivalently Dept. of C.S.E., C.U.H.K.

  19. MEMPM: Biased MPM (II) • Objective • Equivalently, • Equivalently, • Each local optimum is the global optimum • Can be solved in O(n3+Nn2) Conave-Convex Fractional Programming problem N: number of data points n: Dimension Dept. of C.S.E., C.U.H.K.

  20. MEMPM: Optimization (I) • Objective • Equivalently Dept. of C.S.E., C.U.H.K.

  21. MEMPM: Optimization (II) • Objective • Line search + BMPM method Dept. of C.S.E., C.U.H.K.

  22. Learning Locally and Globally MEMPM: Problems • As a global learning approach, the decision plane is exclusivelydependent on global information, i.e., up to second-order moments. • These moments may NOT be accurately estimated! –We may need local information to neutralize the negative effect caused. Dept. of C.S.E., C.U.H.K.

  23. Learning Locally and Globally:Maxi-Min Margin Machine (M4) A more reasonable hyperplane y SVM Model Definition x Dept. of C.S.E., C.U.H.K.

  24. M4: Geometric Interpretation Dept. of C.S.E., C.U.H.K.

  25. M4: Solving Method (I) Divide and Conquer: If we fix ρ to a specific ρn , the problem changes to check whether this ρn satisfies the following constraints: If yes, we increase ρn; otherwise, we decrease it. Second Order Cone Programming Problem!!! Dept. of C.S.E., C.U.H.K.

  26. M4: Solving Method (II) Iterate the following two Divide and Conquer steps: Sequential Second Order Cone Programming Problem!!! Dept. of C.S.E., C.U.H.K.

  27. M4: Solving Method (III) No Yes can it satisfy the constraints? • The worst-case iteration number is log(L/) L: ρmax -ρmin (search range)  : The required precision • Each iteration is a Second Order Cone Programming problem yielding O(n3) • Cost of forming the constraint matrix O(N n3) Total time complexity= O(log(L/) n3+ N n3)O(N n3) N: number of data points n: Dimension Dept. of C.S.E., C.U.H.K.

  28. Span all the data points and add them together M4: Links with MPM (I) + Exactly MPM Optimization Problem!!! Dept. of C.S.E., C.U.H.K.

  29. M4: Links with MPM (II) M4 Remarks: • The procedure is not reversible: MPM is a special case of M4 • MPM focuses on building decision boundary GLOBALLY, i.e., it exclusively depends on the means and covariances. • However, means and covariances may not be accurately estimated. MPM Dept. of C.S.E., C.U.H.K.

  30. M4: Links with SVM (I) 1 4 If one assumes ∑=I 2 Support Vector Machines The magnitude of w can scale up without influencing the optimization. Assume ρ(wT∑w)0.5=1 3 SVM is the special case of M4 Dept. of C.S.E., C.U.H.K.

  31. M4: Links with SVM (II) Assumption 1 Assumption 2 If one assumes ∑=I These two assumptions of SVM are inappropriate Dept. of C.S.E., C.U.H.K.

  32. M4: Links with FDA (I) If one assumes ∑x=∑y=(∑*y+∑*x)/2 FDA Perform a procedure similar to MPM… Dept. of C.S.E., C.U.H.K.

  33. M4: Links with FDA (II) If one assumes ∑x=∑y=(∑*y+∑*x)/2 Assumption ? Still inappropriate Dept. of C.S.E., C.U.H.K.

  34. M4: Links with MEMPM MEMPM M4 (a globalized version) Tand s Κ(α)andΚ(β): The margin from the mean to the decision plane The globalized M4maximizes the weighted margin, while MEMPM Maximizes the weighted worst-case accuracy. Dept. of C.S.E., C.U.H.K.

  35. How to solve?? Line Search+Second Order Cone Programming M4 :Nonseparable Case Introducing slack variables Dept. of C.S.E., C.U.H.K.

  36. M4 : Extended into Regression---Local Support Vector Regression (LSVR) Regression: Find a function to approximate the data LSVR Model Definition SVR Model Definition Dept. of C.S.E., C.U.H.K.

  37. Local Support Vector Regression (LSVR) • When supposing ∑i=I for each observation, LSVR is equivalent with l1-SVR under a mild assumption. Dept. of C.S.E., C.U.H.K.

  38. SVR vs. LSVR Dept. of C.S.E., C.U.H.K.

  39. Globalized LSVR Gloablized and assume ∑x=∑y=(∑*x+∑*y)/2 MEMPM Globalized assume ∑x=∑y=I Short Summary M4 MPM FDA SVM Dept. of C.S.E., C.U.H.K.

  40. Non-linear Classifier : Kernelization (I) • Previous discussions of MEMPM, BMPM, M4 , and LSVR are conducted in the scope of linear classification. • How about non-linear classification problems? Using Kernelization techniques Dept. of C.S.E., C.U.H.K.

  41. Non-linear Classifier : Kernelization (II) • In the next slides, we mainly discuss the kernelization on M4, while the proposed kernelization method is also applicable for MEMPM, BMPM, and LSVR. Dept. of C.S.E., C.U.H.K.

  42. Nonlinear Classifier: Kernelization (III) • Map data to higher dimensional feature space Rf • xi(xi) • yi(yi) • Construct the linear decision plane f(γ ,b)=γT z + bin the feature space Rf, with γЄ Rf, b Є R • In Rf,we need to solve • However, we do not want to solve this in an explicit form of.Instead, we want to solve it in a kernelization form • K(z1,z2)= (z1)T(z2) Dept. of C.S.E., C.U.H.K.

  43. Nonlinear Classifier: Kernelization (IV) Dept. of C.S.E., C.U.H.K.

  44. Nonlinear Classifier: Kernelization (V) Notation Dept. of C.S.E., C.U.H.K.

  45. Experimental Results ---MEMPM (I) Six benchmark data sets From UCI Repository • Platform: Windows 2000 • Developing tool: Matlab 6.5 Evaluate both the linear and the Gaussiankernel with the wide parameter for Gaussian chosen by cross validations. Dept. of C.S.E., C.U.H.K.

  46. Experimental Results ---MEMPM(II) Dept. of C.S.E., C.U.H.K. At the Significance level 0.05

  47. Experimental Results ---MEMPM (III) vs. The test-set accuracy for x (TSAx) Dept. of C.S.E., C.U.H.K.

  48. Experimental Results ---MEMPM (IV) vs. The test-set accuracy for y (TSAy) Dept. of C.S.E., C.U.H.K.

  49. Experimental Results ---MEMPM (V) vs. The overall test-set accuracy (TSA) Dept. of C.S.E., C.U.H.K.

  50. Experimental Results ---M4 (I) • Synthetic Toy Data (1) Two types of data with the same data orientation but different data magnitude Dept. of C.S.E., C.U.H.K.

More Related