Nonlinear Knowledge in Kernel Approximation. Olvi Mangasarian, UW Madison & UCSD La Jolla; Edward Wild, UW Madison
Objectives • Primary objective: Incorporate prior knowledge over completely arbitrary sets into function approximation without transforming (kernelizing) the knowledge • Secondary objective: Achieve transparency of the prior knowledge for practical applications
Outline • Use kernels for function approximation • Incorporate prior knowledge • Previous approaches: require transformation of knowledge • New approach does not require any transformation of knowledge • Knowledge given over completely arbitrary sets • Experimental results • Two synthetic examples and one real world example related to breast cancer prognosis • Approximations with prior knowledge more accurate than approximations without prior knowledge
Function Approximation • Given a set of m points in n-dimensional real space Rn and corresponding function values in R • Points are represented by rows of a matrix A ∈ Rm×n • Exact or approximate function values for each point are given by a corresponding vector y ∈ Rm • Find a function f: Rn → R based on the given data such that • f(Ai′) ≈ yi
Function Approximation • Problem: utilizing only the given data may result in a poor approximation • Points may be noisy • Sampling may be costly • Solution: use prior knowledge to improve the approximation
Adding Prior Knowledge • Standard approach: fit function at given data points without knowledge • Constrained approach: satisfy inequalities at given points • 2004 MSW Paper: satisfy linear inequalities over polyhedral regions • Proposed new approach: satisfy nonlinear inequalities over arbitrary regions
Kernel Approximation • Approximate f by a nonlinear kernel function K • f(x) ≈ K(x′, A′)α + b • Gaussian kernel: (K(x′, A′))i = exp(−μ‖x − Ai′‖²) • Error s in the kernel approximation of the given data • −s ≤ K(A, A′)α + be − y ≤ s • e is a vector of ones in Rm
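The Gaussian kernel above can be sketched directly in NumPy; this is a minimal illustration, and the width parameter μ = 0.1 is an assumption (the slides do not give its value):

```python
import numpy as np

def gaussian_kernel(X, B, mu=0.1):
    """Gaussian kernel: K_ij = exp(-mu * ||X_i - B_j||^2).

    X: (m, n) matrix whose rows are points; B: (p, n) matrix of basis points.
    Returns an (m, p) kernel matrix.
    """
    # Pairwise squared Euclidean distances between rows of X and rows of B.
    d2 = ((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * d2)

# K(A, A') on a tiny 1-D data matrix A.
A = np.array([[0.0], [1.0], [2.0]])
K = gaussian_kernel(A, A)
```

Evaluated at the data matrix itself, the kernel matrix is symmetric with ones on the diagonal, which is what the error bound −s ≤ K(A, A′)α + be − y ≤ s operates on.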
Kernel Approximation • Trade off between solution complexity (‖α‖1) and data fitting (‖s‖1) • Convert to a linear program by bounding α with a vector a ≥ 0: −a ≤ α ≤ a • At the solution • e′a = ‖α‖1 • e′s = ‖s‖1
Incorporating Nonlinear Prior Knowledge (MSW 2004) • Linear knowledge implication: Bx ≤ d ⇒ α′Ax + b ≥ h′x + β • Need to "kernelize" the knowledge from input space to the feature space of the kernel • Requires the change of variable x = A′t • BA′t ≤ d ⇒ α′AA′t + b ≥ h′A′t + β • K(B, A′)t ≤ d ⇒ α′K(A, A′)t + b ≥ h′A′t + β • Motzkin's theorem of the alternative gives an equivalent linear system of inequalities which is added to a linear program • Achieves good numerical results, but the kernelized knowledge is not readily interpretable in the original space
Incorporating Nonlinear Prior Knowledge: New Approach • Start with an arbitrary nonlinear knowledge implication • g(x) ≤ 0 ⇒ K(x′, A′)α + b ≥ h(x), ∀x ∈ Γ ⊂ Rn • g, h are arbitrary functions • g: Γ → Rk, h: Γ → R • Problem: need to add this knowledge to the optimization problem • Logically equivalent system: • g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 has no solution x ∈ Γ
Prior Knowledge as a System of Linear Inequalities • Use a theorem of the alternative for convex functions: Assume that g(x), K(x′, A′)α + b, and −h(x) are convex functions of x, that Γ is convex, and that ∃x ∈ Γ: g(x) < 0. Then either: • I. g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 has a solution x ∈ Γ, or • II. ∃v ∈ Rk, v ≥ 0: K(x′, A′)α + b − h(x) + v′g(x) ≥ 0, ∀x ∈ Γ • But never both. • If we can find v ≥ 0: K(x′, A′)α + b − h(x) + v′g(x) ≥ 0, ∀x ∈ Γ, then by the above theorem • g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 has no solution x ∈ Γ, or equivalently: • g(x) ≤ 0 ⇒ K(x′, A′)α + b ≥ h(x), ∀x ∈ Γ
Proof • ¬I ⇒ II • Follows from OLM 1969, Corollary 4.2.2 and the existence of an x ∈ Γ such that g(x) < 0. • II ⇒ ¬I • Suppose not. That is, there exist x ∈ Γ and v ∈ Rk, v ≥ 0, such that: • g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 (I) • v ≥ 0, v′g(x) + K(x′, A′)α + b − h(x) ≥ 0, ∀x ∈ Γ (II) • Then, since v′g(x) ≤ 0, we have the contradiction: • 0 > v′g(x) + K(x′, A′)α + b − h(x) ≥ 0 • This direction requires no assumptions on g, h, K, or Γ whatsoever
Example • g(x) = 1250 − x³, f(x) = x⁴, h(x) = x² + 5000 • g(x) ≤ 0 ⇒ f(x) ≥ h(x), i.e., x³ ≥ 1250 ⇒ x⁴ ≥ x² + 5000 • Certificate: finding v ≥ 0 with v·g(x) + f(x) − h(x) ≥ 0 for all x establishes ¬I (by II ⇒ ¬I)
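This certificate can be checked numerically. The multiplier v = 8 used below was found by inspection for this sketch (the slides do not state a value), and the grid check is a sanity check over a bounded range rather than a proof:

```python
import numpy as np

# Check condition II for g(x) = 1250 - x^3, f(x) = x^4, h(x) = x^2 + 5000:
# find v >= 0 with v*g(x) + f(x) - h(x) >= 0 for all x.
v = 8.0  # candidate multiplier (an assumption; not given in the slides)

# Dense grid; outside this range the x^4 term dominates, so phi stays positive.
x = np.linspace(-20.0, 30.0, 200001)
phi = v * (1250.0 - x**3) + x**4 - (x**2 + 5000.0)

# phi.min() > 0 on the grid, consistent with the implication
# x^3 >= 1250  =>  x^4 >= x^2 + 5000.
```

Since phi is nonnegative everywhere, the theorem of the alternative rules out system I, which is exactly the knowledge implication.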
Incorporating Prior Knowledge • The knowledge condition is a linear semi-infinite program: an infinite number of constraints, one for each x ∈ Γ • Discretize Γ to obtain a finite linear program • Slacks allow the knowledge to be satisfied inexactly • Add a term to the objective function to drive the slacks to zero
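The steps above can be sketched as a small linear program. The following is a minimal illustration with SciPy's linprog on a hypothetical 1-D problem; the kernel width, trade-off weights, and the knowledge functions g(x) = 1 − x, h(x) = x² are all assumptions for this sketch, not the authors' setup:

```python
import numpy as np
from scipy.optimize import linprog

def gauss(X, B, mu=0.5):
    # 1-D Gaussian kernel matrix (illustrative width mu).
    return np.exp(-mu * (X[:, None] - B[None, :]) ** 2)

A = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # data points
y = A**2                                    # function values of f(x) = x^2
X = np.linspace(1.0, 3.0, 9)                # knowledge discretization of {x : g(x) <= 0}
m, p = len(A), len(X)

K = gauss(A, A)      # K(A, A') on the data
Kx = gauss(X, A)     # kernel rows at the knowledge points
g = 1.0 - X          # g(x_j) <= 0 on the discretization
h = X**2             # h(x_j)

# Variables: [alpha (m), b, a (m), s (m), v, z (p)]; a bounds |alpha|, s is the
# fit error, v the knowledge multiplier, z the knowledge slacks.
n = 3 * m + 2 + p
c = np.zeros(n)
c[m + 1:2 * m + 1] = 1.0        # ||alpha||_1
c[2 * m + 1:3 * m + 1] = 10.0   # C * ||s||_1 (fit term; weight is illustrative)
c[3 * m + 2:] = 10.0            # penalty driving the knowledge slacks z to zero

rows, rhs = [], []
I = np.eye(m)
for sgn in (1.0, -1.0):
    # sgn * alpha - a <= 0  (so -a <= alpha <= a)
    R = np.zeros((m, n)); R[:, :m] = sgn * I; R[:, m + 1:2 * m + 1] = -I
    rows.append(R); rhs.append(np.zeros(m))
    # sgn * (K alpha + b e - y) - s <= 0  (so -s <= K alpha + b e - y <= s)
    R = np.zeros((m, n)); R[:, :m] = sgn * K; R[:, m] = sgn
    R[:, 2 * m + 1:3 * m + 1] = -I
    rows.append(R); rhs.append(sgn * y)
# Discretized knowledge: K(x_j', A')alpha + b - h(x_j) + v g(x_j) + z_j >= 0.
R = np.zeros((p, n)); R[:, :m] = -Kx; R[:, m] = -1.0
R[:, 3 * m + 1] = -g; R[:, 3 * m + 2:] = -np.eye(p)
rows.append(R); rhs.append(-h)

bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m + 1 + p)
res = linprog(c, A_ub=np.vstack(rows), b_ub=np.hstack(rhs), bounds=bounds)
alpha, b = res.x[:m], res.x[m]
```

The semi-infinite constraint over all of Γ is replaced here by its nine discretization points, and the z penalty in the objective plays the role of the slack term described above.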
Numerical Experience • Evaluate on three datasets • Two synthetic datasets • Wisconsin Prognostic Breast Cancer Database • Compare approximation with prior knowledge to one without prior knowledge • Prior knowledge leads to an improved approximation • Prior knowledge used cannot be handled exactly by previous work • No kernelization needed on knowledge set
Two-Dimensional Hyperboloid • f(x1, x2) = x1x2
Two-Dimensional Hyperboloid • Given exact values only at 11 points along the line x1 = x2 • At x1 ∈ {−5, −4, …, 4, 5}
Two-Dimensional Hyperboloid Approximation without Prior Knowledge
Two-Dimensional Hyperboloid • Add prior knowledge: • x1x2 ≥ 1 ⇒ f(x1, x2) ≥ x1x2 • The nonlinear term x1x2 cannot be handled exactly by any previous approach • Discretization used only 11 points along the line x1 = −x2, x1 ∈ {−5, −4, …, 4, 5}
Two-Dimensional Tower Function Data • Given 400 points on the grid [−4, 4] × [−4, 4] • Given values are min{g(x), 2}, where g(x) is the exact tower function (i.e., the data are clipped at 2)
Two-Dimensional Tower Function Approximation without Prior Knowledge
Two-Dimensional Tower Function Prior Knowledge • Add prior knowledge: • (x1, x2) ∈ [−4, 4] × [−4, 4] ⇒ f(x) = g(x) • The prior knowledge is the exact function value • Enforced at 2500 points on the grid [−4, 4] × [−4, 4] through the above implication • The principal objective of the prior knowledge here is to overcome the poor (clipped) given data
Two-Dimensional Tower Function Approximation with Prior Knowledge
Predicting Lymph Node Metastasis • Number of metastasized lymph nodes is an important prognostic indicator for breast cancer recurrence • Determined by surgery in addition to the removal of the tumor • Wisconsin Prognostic Breast Cancer (WPBC) data • Lymph node metastasis for 194 patients • 30 cytological features from a fine-needle aspirate • Tumor size, obtained during surgery • Task: predict the number of metastasized lymph nodes given tumor size alone
Predicting Lymph Node Metastasis • Split data into two portions • Past data: 20% used to find prior knowledge • Present data: 80% used to evaluate performance • Simulates acquiring prior knowledge from an expert’s experience
Prior Knowledge for Lymph Node Metastasis • Use kernel approximation without knowledge on the past data • f1(x) = K(x′, A1′)α1 + b1 • A1 is the matrix of the past data points • Use density estimation to decide where to enforce the knowledge • p(x) is the empirical density of the past data • Knowledge: the number of metastasized lymph nodes is greater than the value predicted on the past data, with a tolerance of 0.01 • p(x) ≥ 0.1 ⇒ f(x) ≥ f1(x) − 0.01
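The density-based enforcement region can be sketched with a kernel density estimate. The data below are synthetic stand-ins, not the WPBC data; only the 0.1 density threshold and the 0.01 tolerance come from the slides:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in for the past tumor-size data (assumption for this sketch).
rng = np.random.default_rng(0)
past_x = rng.normal(loc=2.5, scale=1.0, size=40)

# Empirical density p(x) of the past data via Gaussian KDE.
p = gaussian_kde(past_x)

# Enforce the knowledge f(x) >= f1(x) - 0.01 only where p(x) >= 0.1,
# i.e., where the past data are dense enough for f1 to be trusted.
grid = np.linspace(0.0, 6.0, 121)
enforce = grid[p(grid) >= 0.1]
```

The points in `enforce` would then serve as the discretization of the knowledge region Γ in the linear program.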
Predicting Lymph Node Metastasis: Results • Table shows the root-mean-squared error (RMSE) on the present data (80%) of the past-data (20%) approximation f1(x) • Leave-one-out (LOO) RMSE reported for approximations with and without knowledge • Improvement due to knowledge: 14.8%
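The leave-one-out RMSE reported here can be computed with a simple hold-one-out loop; `fit_predict` below is a placeholder API for any approximation method, not the authors' code:

```python
import numpy as np

def loo_rmse(x, y, fit_predict):
    """Leave-one-out RMSE: refit on all-but-one point, predict the held-out one.

    fit_predict(x_train, y_train, x_query) -> predictions (placeholder signature).
    """
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        pred = fit_predict(x[mask], y[mask], x[i:i + 1])
        errs.append((pred[0] - y[i]) ** 2)
    return float(np.sqrt(np.mean(errs)))

# Example with a trivial mean predictor as the stand-in model.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0, 4.0])
mean_model = lambda xt, yt, xq: np.full(len(xq), yt.mean())
r = loo_rmse(x, y, mean_model)
```

Swapping `mean_model` for the kernel approximation with and without knowledge gives the two LOO RMSE figures being compared.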
Conclusion • Added general nonlinear prior knowledge to kernel approximation • Implemented as linear inequalities in a linear programming problem • Knowledge incorporated transparently • Demonstrated effectiveness • Two synthetic examples • Real world problem from breast cancer prognosis • Future work • More general prior knowledge with inequalities replaced by more general functions • Apply to classification problems
Questions • Websites linking to papers and talks: • http://www.cs.wisc.edu/~olvi/ • http://www.cs.wisc.edu/~wildt/