
Machine Learning and Data Mining: A Math Programming-Based Approach



  1. Machine Learning and Data Mining: A Math Programming-Based Approach Glenn Fung CS412 April 10, 2003 Madison, Wisconsin

  2. What is a Support Vector Machine? • An optimally defined surface • Typically nonlinear in the input space • Linear in a higher dimensional space • Implicitly defined by a kernel function
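A kernel function is a similarity measure that stands in for inner products in the higher-dimensional space. As a minimal sketch (the function name and the width parameter `mu` are my own choices; the talk does not fix them), the Gaussian kernel used later for the spiral dataset can be computed as:

```python
import numpy as np

def gaussian_kernel(A, B, mu=1.0):
    """Gaussian (RBF) kernel: K[i, j] = exp(-mu * ||A_i - B_j||^2).
    Substituting K for plain inner products makes the classifier linear
    in a high-dimensional feature space while staying nonlinear in the
    original inputs."""
    # pairwise squared Euclidean distances between rows of A and rows of B
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-mu * sq)
```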

  3. What are Support Vector Machines Used For? • Classification • Regression & Data Fitting • Supervised & Unsupervised Learning (Will concentrate on classification)

  4. Geometry of the Classification Problem: 2-Category Linearly Separable Case [figure: two linearly separable point sets, A+ and A-]

  5. Support Vector Machines: Maximizing the Margin between Bounding Planes [figure: the point sets A+ and A- separated by two parallel bounding planes; the points lying on the planes are the support vectors]

  6. Algebra of the Classification Problem: 2-Category Linearly Separable Case • Given m points in n-dimensional space • Represented by an m-by-n matrix A • Membership of each point in class +1 or -1 specified by an m-by-m diagonal matrix D with +1 & -1 entries • Separate by two bounding planes: A_i·w ≥ γ + 1 when D_ii = +1, and A_i·w ≤ γ - 1 when D_ii = -1 • More succinctly: D(Aw - eγ) ≥ e, where e is a vector of ones
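In code, the bounding-plane condition is a one-liner; the sketch below (function and variable names are mine, not from the slides) checks whether a candidate plane (w, γ) separates labeled data in exactly the sense of D(Aw - eγ) ≥ e:

```python
import numpy as np

def separates(A, d, w, gamma):
    """Check the bounding-plane condition D(Aw - e*gamma) >= e.
    A is the m-by-n data matrix, d the vector of +/-1 labels
    (the diagonal of D), and (w, gamma) the candidate plane."""
    margins = d * (A @ w - gamma)   # row i: d_i * (A_i.w - gamma)
    return bool(np.all(margins >= 1.0))
```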

  7. Support Vector Machines: Quadratic Programming Formulation • Solve the following quadratic program: min over (w, γ, y) of ν·e′y + ½·w′w subject to D(Aw - eγ) + y ≥ e, y ≥ 0, where ν is the weight of the training error • Maximize the margin (the distance between the bounding planes) by minimizing ½‖w‖²
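A minimal sketch of this program, using cvxpy as a generic QP solver (the slides do not prescribe one), with the +/-1 labels d passed as a vector rather than the diagonal matrix D:

```python
import numpy as np
import cvxpy as cp

def svm_qp(A, d, nu=1.0):
    """Soft-margin SVM quadratic program:
         min   nu * e'y + 0.5 * w'w
         s.t.  D(Aw - e*gamma) + y >= e,  y >= 0
    where d (+/-1 labels) is the diagonal of D."""
    m, n = A.shape
    w, gamma, y = cp.Variable(n), cp.Variable(), cp.Variable(m)
    constraints = [cp.multiply(d, A @ w - gamma) + y >= 1, y >= 0]
    objective = cp.Minimize(nu * cp.sum(y) + 0.5 * cp.sum_squares(w))
    cp.Problem(objective, constraints).solve()
    return w.value, gamma.value
```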

  8. Checkerboard Polynomial Kernel Classifier (Best Previous Result: [Kaufman 1998]) [figure: the learned checkerboard decision regions]

  9. Gaussian Kernel PSVM Classifier [figure: Spiral Dataset: 94 Red Dots & 94 White Dots]

  10. The Federalist Papers • Written in 1787-1788 by Alexander Hamilton, John Jay, and James Madison to persuade the citizens of New York to ratify the Constitution • The papers were short essays, 900 to 3,500 words in length • The authorship of 12 of the papers has been in dispute (Madison or Hamilton); these are referred to as the disputed Federalist papers

  11. Previous Work • Mosteller and Wallace (1964): using statistical inference, determined the authorship of the 12 disputed papers • Bosch and Smith (1998): using linear programming techniques and the evaluation of every possible combination of one, two, and three features, obtained a separating hyperplane using only three words

  12. Description of the Data • For every paper: • Machine-readable text was created using a scanner • Relative frequencies were computed for 70 words that Mosteller and Wallace identified as good candidates for author-attribution studies • Each document is represented as a vector of 70 real numbers, the 70 word frequencies • The dataset consists of 118 papers: • 50 Madison papers • 56 Hamilton papers • 12 disputed papers
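As a sketch of this representation (the word list below is a small subset of the 70, and the talk does not specify the normalization; frequency per 1,000 words, which Mosteller and Wallace commonly used, is an assumption):

```python
import re
from collections import Counter

# a few of the 70 Mosteller-Wallace function words, for illustration
FUNCTION_WORDS = ["to", "upon", "would", "by", "on", "there"]

def frequency_vector(text, words=FUNCTION_WORDS, per=1000.0):
    """Represent a document as relative frequencies of function words
    (occurrences per `per` tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [per * counts[w] / len(tokens) for w in words]
```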

  13. Function Words Based on Relative Frequencies

  14. SLA Feature Selection for Classifying the Disputed Federalist Papers • Apply the successive linearization algorithm (SLA) to: • Train on the 106 Federalist papers with known authors • Find a classification hyperplane that uses as few words as possible • Use the hyperplane to classify the 12 disputed papers • The tuning parameter was obtained by a tuning procedure (a sketch of the underlying linear program follows)
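A hedged sketch of the 1-norm linear program at the heart of this feature selection (the full SLA solves a short sequence of such LPs; the variable layout and the use of scipy are my own choices): minimizing the 1-norm of w drives many weights to exactly zero, so the surviving coordinates are the selected words.

```python
import numpy as np
from scipy.optimize import linprog

def one_norm_svm(A, d, nu=1.0):
    """1-norm SVM as a single LP:
         min   nu * e'y + e's
         s.t.  D(Aw - e*gamma) + y >= e,  -s <= w <= s,  y >= 0.
    Returns (w, gamma); zero entries of w are unselected features."""
    m, n = A.shape
    DA = d[:, None] * A                        # rows of A scaled by +/-1 labels
    # variable layout: x = [w (n) | gamma (1) | y (m) | s (n)]
    c = np.concatenate([np.zeros(n + 1), nu * np.ones(m), np.ones(n)])
    # D(Aw - e*gamma) + y >= e  rewritten as  -DA w + d*gamma - y <= -e
    G1 = np.hstack([-DA, d[:, None], -np.eye(m), np.zeros((m, n))])
    # w - s <= 0 and -w - s <= 0 encode |w| <= s
    G2 = np.hstack([np.eye(n), np.zeros((n, m + 1)), -np.eye(n)])
    G3 = np.hstack([-np.eye(n), np.zeros((n, m + 1)), -np.eye(n)])
    G = np.vstack([G1, G2, G3])
    h = np.concatenate([-np.ones(m), np.zeros(2 * n)])
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (m + n)
    res = linprog(c, A_ub=G, b_ub=h, bounds=bounds, method="highs")
    return res.x[:n], res.x[n]
```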

  15. Hyperplane Classifier Using 3 Words • A hyperplane depending on three words was found: 0.5368 to + 24.6634 upon + 2.9532 would = 66.6159 • All disputed papers ended up on the Madison side of the plane
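Evaluating the plane is direct; the sketch below (the helper name is mine, and the slide does not say which sign of the score corresponds to the Madison side) computes a paper's signed position from its three word frequencies:

```python
def hyperplane_score(to, upon, would):
    """Signed position of a paper relative to the 3-word plane
    0.5368*to + 24.6634*upon + 2.9532*would = 66.6159.
    Papers with the same sign fall on the same side; the slide reports
    that all 12 disputed papers share the Madison side."""
    return 0.5368 * to + 24.6634 * upon + 2.9532 * would - 66.6159
```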

  16. Results: 3D Plot of the Resulting Hyperplane [figure]

  17. Comparison with Previous Work & Conclusion • Bosch and Smith (1998) evaluated all possible sets of one, two, and three words to find a separating hyperplane; they solved 118,895 linear programs • Our SLA algorithm for feature selection required the solution of only 6 linear programs • Our classification of the disputed Federalist papers agrees with that of Mosteller-Wallace and Bosch-Smith

  18. Breast Cancer Diagnosis Application: 97% Tenfold Cross-Validation Correctness on 780 Samples (494 Benign, 286 Malignant)
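As a sketch of the evaluation protocol only (the slide does not specify the classifier settings; scikit-learn and a linear-kernel SVC here are my assumptions), tenfold cross-validation correctness can be estimated as:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def tenfold_correctness(X, y):
    """Tenfold cross-validation: train on 9/10 of the samples,
    test on the held-out tenth, and average the ten accuracies."""
    return cross_val_score(SVC(kernel="linear"), X, y, cv=10).mean()
```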

  19. DATAS: Differential Analysis of Transcripts with Alternative Splicing. Detection of Alternative RNA Isoforms via DATAS (Levels of mRNA that Correlate with Sensitivity to Chemotherapy) [figure: DNA is transcribed to pre-mRNA (exons E1-E5 with introns I1-I4); alternative RNA splicing yields different mRNA isoforms, which are translated into different proteins; DATAS compares the chemo-sensitive and chemo-resistant transcript profiles]

  20. Breast Cancer Treatment Response: Joint with ExonHit (French BioTech) http://www.exonhit.com/html/company/index.htm • 35 patients treated by a drug cocktail • 9 partial responders; 26 nonresponders • 25 gene expression measurements made on each patient • 1-norm SVM classifier selected 12 out of 25 genes • Combinatorially selected 6 genes out of the 12 • Separating plane obtained: 2.7915 T11 + 0.13436 S24 - 1.0269 U23 - 2.8108 Z23 - 1.8668 A19 - 1.5177 X05 + 2899.1 = 0 • Leave-one-out error: 1 out of 35 (97.1% correctness)

  21. More on SVMs • My future job: Siemens Medical Solutions • My web page: www.cs.wisc.edu/~gfung • Olvi Mangasarian's web page: www.cs.wisc.edu/~olvi
