Nonlinear Knowledge in Kernel Approximation. Olvi Mangasarian, UW Madison & UCSD La Jolla; Edward Wild, UW Madison
Objectives • Primary objective: Incorporate prior knowledge over completely arbitrary sets into function approximation without transforming (kernelizing) the knowledge • Secondary objective: Achieve transparency of the prior knowledge for practical applications
Outline • Use kernels for function approximation • Incorporate prior knowledge • Previous approaches: require transformation of knowledge • New approach does not require any transformation of knowledge • Knowledge given over completely arbitrary sets • Experimental results • Two synthetic examples and one real world example related to breast cancer prognosis • Approximations with prior knowledge more accurate than approximations without prior knowledge
Function Approximation • Given a set of m points in n-dimensional real space Rn and corresponding function values in R • Points are represented by rows of a matrix A ∈ Rm×n • Exact or approximate function values for each point are given by a corresponding vector y ∈ Rm • Find a function f: Rn → R based on the given data such that • f(Ai′) ≈ yi
Function Approximation • Problem: utilizing only the given data may result in a poor approximation • Points may be noisy • Sampling may be costly • Solution: use prior knowledge to improve the approximation
Adding Prior Knowledge • Standard approach: fit function at given data points without knowledge • Constrained approach: satisfy inequalities at given points • 2004 MSW Paper: satisfy linear inequalities over polyhedral regions • Proposed new approach: satisfy nonlinear inequalities over arbitrary regions
Kernel Approximation • Approximate f by a nonlinear kernel function K • f(x) ≈ K(x′, A′)α + b • Gaussian kernel: (K(x′, A′))i = exp(−μ‖x − Ai′‖²) • Error s in the kernel approximation of the given data • −s ≤ K(A, A′)α + be − y ≤ s • e is a vector of ones in Rm
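The Gaussian kernel above can be sketched directly in NumPy; this is a minimal illustration, and the width parameter μ = 0.1 is an assumption (the slides do not give its value):

```python
import numpy as np

def gaussian_kernel(X, B, mu=0.1):
    """Gaussian kernel: K_ij = exp(-mu * ||X_i - B_j||^2).

    X: (m, n) matrix whose rows are points; B: (p, n) matrix of basis points.
    Returns an (m, p) kernel matrix.
    """
    # Pairwise squared Euclidean distances between rows of X and rows of B.
    d2 = ((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * d2)

# K(A, A') on a tiny 1-D data matrix A.
A = np.array([[0.0], [1.0], [2.0]])
K = gaussian_kernel(A, A)
```

Evaluated at the data matrix itself, the kernel matrix is symmetric with ones on the diagonal, which is what the error bound −s ≤ K(A, A′)α + be − y ≤ s operates on.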
Kernel Approximation • Trade off between solution complexity (‖α‖1) and data fitting (‖s‖1) • Convert to a linear program by bounding α with a vector a ≥ 0: −a ≤ α ≤ a • At the solution • e′a = ‖α‖1 • e′s = ‖s‖1
Incorporating Nonlinear Prior Knowledge (MSW 2004) • Linear knowledge implication: Bx ≤ d ⇒ α′Ax + b ≥ h′x + β • Need to "kernelize" the knowledge from input space to the feature space of the kernel • Requires the change of variable x = A′t • BA′t ≤ d ⇒ α′AA′t + b ≥ h′A′t + β • K(B, A′)t ≤ d ⇒ α′K(A, A′)t + b ≥ h′A′t + β • Motzkin's theorem of the alternative gives an equivalent linear system of inequalities which is added to a linear program • Achieves good numerical results, but the kernelized knowledge is not readily interpretable in the original space
Incorporating Nonlinear Prior Knowledge: New Approach • Start with an arbitrary nonlinear knowledge implication • g(x) ≤ 0 ⇒ K(x′, A′)α + b ≥ h(x), ∀x ∈ Γ ⊂ Rn • g, h are arbitrary functions • g: Γ → Rk, h: Γ → R • Problem: need to add this knowledge to the optimization problem • Logically equivalent system: • g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 has no solution x ∈ Γ
Prior Knowledge as a System of Linear Inequalities • Use a theorem of the alternative for convex functions: Assume that g(x), K(x′, A′)α + b, and −h(x) are convex functions of x, that Γ is convex, and that ∃x ∈ Γ: g(x) < 0. Then either: • I. g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 has a solution x ∈ Γ, or • II. ∃v ∈ Rk, v ≥ 0: K(x′, A′)α + b − h(x) + v′g(x) ≥ 0, ∀x ∈ Γ • But never both. • If we can find v ≥ 0: K(x′, A′)α + b − h(x) + v′g(x) ≥ 0, ∀x ∈ Γ, then by the above theorem • g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 has no solution x ∈ Γ, or equivalently: • g(x) ≤ 0 ⇒ K(x′, A′)α + b ≥ h(x), ∀x ∈ Γ
Proof • ¬I ⇒ II • Follows from OLM 1969, Corollary 4.2.2 and the existence of an x ∈ Γ such that g(x) < 0. • II ⇒ ¬I • Suppose not. That is, there exist x ∈ Γ and v ∈ Rk, v ≥ 0, such that: • g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 (I) • v ≥ 0, v′g(x) + K(x′, A′)α + b − h(x) ≥ 0, ∀x ∈ Γ (II) • Then, since v′g(x) ≤ 0, we have the contradiction: • 0 > v′g(x) + K(x′, A′)α + b − h(x) ≥ 0 • This direction requires no assumptions on g, h, K, or Γ whatsoever
Example • g(x) = 1250 − x³, f(x) = x⁴, h(x) = x² + 5000 • g(x) ≤ 0 ⇒ f(x) ≥ h(x), i.e., x³ ≥ 1250 ⇒ x⁴ ≥ x² + 5000 • Certificate: finding v ≥ 0 with v·g(x) + f(x) − h(x) ≥ 0 for all x establishes ¬I (by II ⇒ ¬I)
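This certificate can be checked numerically. The multiplier v = 8 used below was found by inspection for this sketch (the slides do not state a value), and the grid check is a sanity check over a bounded range rather than a proof:

```python
import numpy as np

# Check condition II for g(x) = 1250 - x^3, f(x) = x^4, h(x) = x^2 + 5000:
# find v >= 0 with v*g(x) + f(x) - h(x) >= 0 for all x.
v = 8.0  # candidate multiplier (an assumption; not given in the slides)

# Dense grid; outside this range the x^4 term dominates, so phi stays positive.
x = np.linspace(-20.0, 30.0, 200001)
phi = v * (1250.0 - x**3) + x**4 - (x**2 + 5000.0)

# phi.min() > 0 on the grid, consistent with the implication
# x^3 >= 1250  =>  x^4 >= x^2 + 5000.
```

Since phi is nonnegative everywhere, the theorem of the alternative rules out system I, which is exactly the knowledge implication.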
Incorporating Prior Knowledge • The knowledge condition is a linear semi-infinite program: an infinite number of constraints, one for each x ∈ Γ • Discretize Γ to obtain a finite linear program • Slacks allow the knowledge to be satisfied inexactly • Add a term to the objective function to drive the slacks to zero
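The steps above can be sketched as a small linear program. The following is a minimal illustration with SciPy's linprog on a hypothetical 1-D problem; the kernel width, trade-off weights, and the knowledge functions g(x) = 1 − x, h(x) = x² are all assumptions for this sketch, not the authors' setup:

```python
import numpy as np
from scipy.optimize import linprog

def gauss(X, B, mu=0.5):
    # 1-D Gaussian kernel matrix (illustrative width mu).
    return np.exp(-mu * (X[:, None] - B[None, :]) ** 2)

A = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # data points
y = A**2                                    # function values of f(x) = x^2
X = np.linspace(1.0, 3.0, 9)                # knowledge discretization of {x : g(x) <= 0}
m, p = len(A), len(X)

K = gauss(A, A)      # K(A, A') on the data
Kx = gauss(X, A)     # kernel rows at the knowledge points
g = 1.0 - X          # g(x_j) <= 0 on the discretization
h = X**2             # h(x_j)

# Variables: [alpha (m), b, a (m), s (m), v, z (p)]; a bounds |alpha|, s is the
# fit error, v the knowledge multiplier, z the knowledge slacks.
n = 3 * m + 2 + p
c = np.zeros(n)
c[m + 1:2 * m + 1] = 1.0        # ||alpha||_1
c[2 * m + 1:3 * m + 1] = 10.0   # C * ||s||_1 (fit term; weight is illustrative)
c[3 * m + 2:] = 10.0            # penalty driving the knowledge slacks z to zero

rows, rhs = [], []
I = np.eye(m)
for sgn in (1.0, -1.0):
    # sgn * alpha - a <= 0  (so -a <= alpha <= a)
    R = np.zeros((m, n)); R[:, :m] = sgn * I; R[:, m + 1:2 * m + 1] = -I
    rows.append(R); rhs.append(np.zeros(m))
    # sgn * (K alpha + b e - y) - s <= 0  (so -s <= K alpha + b e - y <= s)
    R = np.zeros((m, n)); R[:, :m] = sgn * K; R[:, m] = sgn
    R[:, 2 * m + 1:3 * m + 1] = -I
    rows.append(R); rhs.append(sgn * y)
# Discretized knowledge: K(x_j', A')alpha + b - h(x_j) + v g(x_j) + z_j >= 0.
R = np.zeros((p, n)); R[:, :m] = -Kx; R[:, m] = -1.0
R[:, 3 * m + 1] = -g; R[:, 3 * m + 2:] = -np.eye(p)
rows.append(R); rhs.append(-h)

bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m + 1 + p)
res = linprog(c, A_ub=np.vstack(rows), b_ub=np.hstack(rhs), bounds=bounds)
alpha, b = res.x[:m], res.x[m]
```

The semi-infinite constraint over all of Γ is replaced here by its nine discretization points, and the z penalty in the objective plays the role of the slack term described above.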
Numerical Experience • Evaluate on three datasets • Two synthetic datasets • Wisconsin Prognostic Breast Cancer Database • Compare approximation with prior knowledge to one without prior knowledge • Prior knowledge leads to an improved approximation • Prior knowledge used cannot be handled exactly by previous work • No kernelization needed on knowledge set
Two-Dimensional Hyperboloid • f(x1, x2) = x1x2
Two-Dimensional Hyperboloid • Given exact values only at 11 points along the line x1 = x2 • At x1 ∈ {−5, −4, …, 4, 5}
Two-Dimensional Hyperboloid Approximation without Prior Knowledge
Two-Dimensional Hyperboloid • Add prior knowledge: • x1x2 ≥ 1 ⇒ f(x1, x2) ≥ x1x2 • The nonlinear term x1x2 cannot be handled exactly by any previous approach • Discretization used only 11 points along the line x1 = −x2, x1 ∈ {−5, −4, …, 4, 5}
Two-Dimensional Tower Function Data • Given 400 points on the grid [−4, 4] × [−4, 4] • Given values are min{g(x), 2}, where g(x) is the exact tower function (i.e., the data are clipped at 2)
Two-Dimensional Tower Function Approximation without Prior Knowledge
Two-Dimensional Tower Function Prior Knowledge • Add prior knowledge: • (x1, x2) ∈ [−4, 4] × [−4, 4] ⇒ f(x) = g(x) • The prior knowledge is the exact function value • Enforced at 2500 points on the grid [−4, 4] × [−4, 4] through the above implication • The principal objective of the prior knowledge here is to overcome the poor (clipped) given data
Two-Dimensional Tower Function Approximation with Prior Knowledge
Predicting Lymph Node Metastasis • Number of metastasized lymph nodes is an important prognostic indicator for breast cancer recurrence • Determined by surgery in addition to the removal of the tumor • Wisconsin Prognostic Breast Cancer (WPBC) data • Lymph node metastasis for 194 patients • 30 cytological features from a fine-needle aspirate • Tumor size, obtained during surgery • Task: predict the number of metastasized lymph nodes given tumor size alone
Predicting Lymph Node Metastasis • Split data into two portions • Past data: 20% used to find prior knowledge • Present data: 80% used to evaluate performance • Simulates acquiring prior knowledge from an expert’s experience
Prior Knowledge for Lymph Node Metastasis • Use kernel approximation without knowledge on the past data • f1(x) = K(x′, A1′)α1 + b1 • A1 is the matrix of the past data points • Use density estimation to decide where to enforce the knowledge • p(x) is the empirical density of the past data • Knowledge: the number of metastasized lymph nodes is greater than the value predicted on the past data, with a tolerance of 0.01 • p(x) ≥ 0.1 ⇒ f(x) ≥ f1(x) − 0.01
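The density-based enforcement region can be sketched with a kernel density estimate. The data below are synthetic stand-ins, not the WPBC data; only the 0.1 density threshold and the 0.01 tolerance come from the slides:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in for the past tumor-size data (assumption for this sketch).
rng = np.random.default_rng(0)
past_x = rng.normal(loc=2.5, scale=1.0, size=40)

# Empirical density p(x) of the past data via Gaussian KDE.
p = gaussian_kde(past_x)

# Enforce the knowledge f(x) >= f1(x) - 0.01 only where p(x) >= 0.1,
# i.e., where the past data are dense enough for f1 to be trusted.
grid = np.linspace(0.0, 6.0, 121)
enforce = grid[p(grid) >= 0.1]
```

The points in `enforce` would then serve as the discretization of the knowledge region Γ in the linear program.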
Predicting Lymph Node Metastasis: Results • Table shows the root-mean-squared error (RMSE) on the present data (80%) of the past-data (20%) approximation f1(x) • Leave-one-out (LOO) RMSE reported for approximations with and without knowledge • Improvement due to knowledge: 14.8%
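The leave-one-out RMSE reported here can be computed with a simple hold-one-out loop; `fit_predict` below is a placeholder API for any approximation method, not the authors' code:

```python
import numpy as np

def loo_rmse(x, y, fit_predict):
    """Leave-one-out RMSE: refit on all-but-one point, predict the held-out one.

    fit_predict(x_train, y_train, x_query) -> predictions (placeholder signature).
    """
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        pred = fit_predict(x[mask], y[mask], x[i:i + 1])
        errs.append((pred[0] - y[i]) ** 2)
    return float(np.sqrt(np.mean(errs)))

# Example with a trivial mean predictor as the stand-in model.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0, 4.0])
mean_model = lambda xt, yt, xq: np.full(len(xq), yt.mean())
r = loo_rmse(x, y, mean_model)
```

Swapping `mean_model` for the kernel approximation with and without knowledge gives the two LOO RMSE figures being compared.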
Conclusion • Added general nonlinear prior knowledge to kernel approximation • Implemented as linear inequalities in a linear programming problem • Knowledge incorporated transparently • Demonstrated effectiveness • Two synthetic examples • Real world problem from breast cancer prognosis • Future work • More general prior knowledge with inequalities replaced by more general functions • Apply to classification problems
Questions • Websites linking to papers and talks: • http://www.cs.wisc.edu/~olvi/ • http://www.cs.wisc.edu/~wildt/