Explore the mathematical programming-based approach to machine learning with support vector machines, covering classification, regression, data fitting, geometry of the classification problem, SVM algebra, quadratic programming formulation, and practical applications in author attribution and cancer diagnosis.
Machine Learning and Data Mining: A Math Programming-Based Approach Glenn Fung CS412 April 10, 2003 Madison, Wisconsin
What is a Support Vector Machine? • An optimally defined surface • Typically nonlinear in the input space • Linear in a higher dimensional space • Implicitly defined by a kernel function
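As a rough illustration of "nonlinear in the input space, implicitly defined by a kernel function", the sketch below evaluates a Gaussian kernel classifier on an XOR-style toy set that no single plane can separate. The decision form (a signed kernel expansion minus a threshold), the width parameter `mu`, and the coefficients are assumptions chosen for illustration; they are not taken from the slides.

```python
import math

def gaussian_kernel(x, z, mu=1.0):
    """K(x, z) = exp(-mu * ||x - z||^2); mu is a hypothetical width parameter."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-mu * sq_dist)

def kernel_decision(x, points, labels, coeffs, gamma, mu=1.0):
    """Class (+1 or -1) from sum_i labels[i] * coeffs[i] * K(x, points[i]) - gamma."""
    s = sum(d * u * gaussian_kernel(x, p, mu)
            for p, d, u in zip(points, labels, coeffs))
    return 1 if s - gamma >= 0 else -1

# XOR-style toy data: opposite corners share a class (not linearly separable).
points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
labels = [1, 1, -1, -1]
coeffs = [1.0, 1.0, 1.0, 1.0]  # illustrative coefficients, not trained

predictions = [kernel_decision(p, points, labels, coeffs, gamma=0.0, mu=2.0)
               for p in points]
print(predictions)
```

The surface is curved in the two-dimensional input space, yet the classifier is linear in the kernel values, which is the sense in which it is "linear in a higher dimensional space".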
What are Support Vector Machines Used For? • Classification • Regression & Data Fitting • Supervised & Unsupervised Learning (Will concentrate on classification)
Geometry of the Classification Problem: 2-Category Linearly Separable Case [Figure: point sets A+ and A-]
Support Vector Machines: Maximizing the Margin between Bounding Planes [Figure: point sets A+ and A-, bounding planes, and support vectors]
Algebra of the Classification Problem: 2-Category Linearly Separable Case • Given m points in n-dimensional space • Represented by an m-by-n matrix A • Membership of each point $A_i$ in class +1 or -1 specified by: • An m-by-m diagonal matrix D with +1 & -1 entries • Separate by two bounding planes, $x'w = \gamma + 1$ and $x'w = \gamma - 1$: $A_i w \ge \gamma + 1$ for $D_{ii} = +1$; $A_i w \le \gamma - 1$ for $D_{ii} = -1$ • More succinctly: $D(Aw - e\gamma) \ge e$, where e is a vector of ones.
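A minimal sketch of the separability condition, assuming the bounding planes are written as x·w = γ + 1 and x·w = γ - 1 (the symbols `w` and `gamma` are notation assumed here). The diagonal matrix D is stored as a plain list `d` of +1/-1 labels, and the toy points are invented for illustration.

```python
import math

def satisfies_bounding_planes(A, d, w, gamma):
    """True if d_i * (A_i . w - gamma) >= 1 for every row i,
    i.e. the matrix condition D(Aw - e*gamma) >= e holds."""
    return all(
        di * (sum(a * wj for a, wj in zip(row, w)) - gamma) >= 1
        for row, di in zip(A, d)
    )

def margin(w):
    """Distance between the two bounding planes: 2 / ||w||_2."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))

# Toy data: two points in class +1, two in class -1.
A = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
d = [1, 1, -1, -1]

print(satisfies_bounding_planes(A, d, w=[0.5, 0.5], gamma=1.0))
print(satisfies_bounding_planes(A, d, w=[0.1, 0.1], gamma=0.0))
print(margin([0.5, 0.5]))
```

The first candidate plane separates the classes with both bounding planes touching a point; the second has too small a normal, so the bounding planes would cut through the data.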
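Support Vector Machines: Quadratic Programming Formulation • Solve the following quadratic program: $\min_{w,\gamma,y} \; \nu e'y + \frac{1}{2}\|w\|_2^2$ s.t. $D(Aw - e\gamma) + y \ge e$, $y \ge 0$, where $\nu > 0$ is the weight of the training error, y is the vector of slack (training error) variables, and w and $\gamma$ define the bounding planes $x'w = \gamma \pm 1$ • Maximize the margin $2/\|w\|_2$ by minimizing $\frac{1}{2}\|w\|_2^2$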
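The sketch below is not a quadratic programming solver. It assumes the program has the usual soft-margin form (minimize ν times the total slack plus half the squared norm of w), eliminates the slack variables to get an unconstrained hinge-loss objective, and applies a fixed-step subgradient method while keeping the best iterate. The symbols `w`, `gamma`, `nu`, the step size, and the toy data are all assumptions for illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def objective(A, d, w, gamma, nu):
    """nu * sum_i max(0, 1 - d_i (A_i . w - gamma)) + 0.5 * ||w||^2."""
    slack = sum(max(0.0, 1.0 - di * (dot(row, w) - gamma))
                for row, di in zip(A, d))
    return nu * slack + 0.5 * dot(w, w)

def train_linear_svm(A, d, nu=1.0, step=0.1, iters=200):
    """Fixed-step subgradient descent; returns (best objective, w, gamma)."""
    n = len(A[0])
    w, gamma = [0.0] * n, 0.0
    best = (objective(A, d, w, gamma, nu), list(w), gamma)
    for _ in range(iters):
        grad_w, grad_gamma = list(w), 0.0  # gradient of 0.5 * ||w||^2 is w
        for row, di in zip(A, d):
            if di * (dot(row, w) - gamma) < 1.0:  # point violates its bounding plane
                for j in range(n):
                    grad_w[j] -= nu * di * row[j]
                grad_gamma += nu * di
        w = [wj - step * gj for wj, gj in zip(w, grad_w)]
        gamma -= step * grad_gamma
        f = objective(A, d, w, gamma, nu)
        if f < best[0]:
            best = (f, list(w), gamma)
    return best

A = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
d = [1, 1, -1, -1]
best_f, w, gamma = train_linear_svm(A, d)
predicted = [1 if dot(row, w) - gamma >= 0 else -1 for row in A]
print(predicted)
```

A dedicated QP solver would reach the exact optimum; the subgradient method only gets close, which is enough here to show how the training-error weight and the margin term trade off.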
Checkerboard Polynomial Kernel Classifier • Best Previous Result: [Kaufman 1998]
Gaussian Kernel PSVM Classifier • Spiral Dataset: 94 Red Dots & 94 White Dots
The Federalist Papers • Written in 1787-1788 by Alexander Hamilton, John Jay and James Madison to persuade the citizens of New York to ratify the Constitution. • Papers consisted of short essays, 900 to 3500 words in length. • Authorship of 12 of those papers has been in dispute (Madison or Hamilton). These papers are referred to as the disputed Federalist papers.
Previous Work • Mosteller and Wallace (1964) • Using statistical inference, determined the authorship of the 12 disputed papers. • Bosch and Smith (1998). • Using linear programming techniques and the evaluation of every possible combination of one, two and three features, obtained a separating hyperplane using only three words.
Description of the data • For every paper: • Machine-readable text was created using a scanner. • Computed relative frequencies of 70 words that Mosteller-Wallace identified as good candidates for author-attribution studies. • Each document is represented as a vector containing the 70 real numbers corresponding to the 70 word frequencies. • The dataset consists of 118 papers: • 50 Madison papers • 56 Hamilton papers • 12 disputed papers
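The feature construction described above can be sketched as follows. The tokenizer, the scaling to occurrences per 1000 words, and the three-word vocabulary are assumptions for illustration; the slides only say that relative frequencies of 70 words were computed.

```python
import re
from collections import Counter

def relative_frequencies(text, vocabulary, per=1000):
    """Occurrences of each vocabulary word per `per` words of text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    if total == 0:
        return [0.0 for _ in vocabulary]
    return [per * counts[word] / total for word in vocabulary]

# Hypothetical 3-word vocabulary; the study used 70 function words.
vocabulary = ["to", "upon", "would"]
sample = "To be or not to be would depend upon it"
features = relative_frequencies(sample, vocabulary)
print(features)
```

Each paper then becomes one row of the data matrix, with one column per vocabulary word.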
SLA Feature Selection for Classifying the Disputed Federalist Papers • Apply the successive linearization algorithm (SLA) to: • Train on the 106 Federalist papers with known authors • Find a classification hyperplane that uses as few words as possible • Use the hyperplane to classify the 12 disputed papers • The parameter was obtained by a tuning procedure.
Hyperplane Classifier Using 3 Words • A hyperplane depending on three words was found: 0.5368 to + 24.6634 upon + 2.9532 would = 66.6159 • All disputed papers ended up on the Madison side of the plane
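A small sketch of how a paper would be placed relative to this plane. The coefficients and threshold are the ones quoted on the slide; the input frequencies and their units are hypothetical, and the code deliberately reports only which side of the plane a document falls on, since the slide does not state which side corresponds to which author.

```python
# Coefficients and threshold as quoted on the slide.
WEIGHTS = {"to": 0.5368, "upon": 24.6634, "would": 2.9532}
THRESHOLD = 66.6159

def plane_value(freqs):
    """Weighted sum of the three word frequencies minus the threshold."""
    return sum(WEIGHTS[word] * freqs[word] for word in WEIGHTS) - THRESHOLD

def side(freqs):
    """'above' if the document lies on the positive side of the plane, else 'below'."""
    return "above" if plane_value(freqs) > 0 else "below"

# Hypothetical frequency vectors, invented for illustration only.
doc_a = {"to": 40.0, "upon": 0.2, "would": 5.0}
doc_b = {"to": 40.0, "upon": 3.0, "would": 8.0}
print(side(doc_a), side(doc_b))
```

Because the coefficient on "upon" is by far the largest, that word's frequency dominates which side a document lands on.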
Comparison with Previous Work & Conclusion • Bosch and Smith (1998) calculated all the possible sets of one, two and three words to find a separating hyperplane. They solved 118,895 linear programs. • Our SLA algorithm for feature selection required the solution of only 6 linear programs. • Our classification of the disputed Federalist papers agrees with that of Mosteller-Wallace and Bosch-Smith.
Breast Cancer Diagnosis Application • 97% Tenfold Cross Validation Correctness • 780 Samples: 494 Benign, 286 Malignant
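Tenfold cross validation correctness can be computed along the following lines. This is a generic sketch: the fold assignment, the nearest-class-mean stand-in classifier, and the toy data are assumptions, not the SVM or the dataset used for the reported result.

```python
import random

def k_fold_indices(m, k=10, seed=0):
    """Shuffle indices 0..m-1 and deal them into k folds."""
    idx = list(range(m))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_correctness(X, y, train, predict, k=10, seed=0):
    """Fraction of points classified correctly when each fold is held out once."""
    correct = 0
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        X_tr = [x for i, x in enumerate(X) if i not in held_out]
        y_tr = [yi for i, yi in enumerate(y) if i not in held_out]
        model = train(X_tr, y_tr)
        correct += sum(predict(model, X[i]) == y[i] for i in fold)
    return correct / len(X)

# Stand-in classifier: assign each point to the nearest class mean.
def train_nearest_mean(X, y):
    means = {}
    for label in set(y):
        rows = [x for x, yi in zip(X, y) if yi == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return means

def predict_nearest_mean(means, x):
    return min(means, key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, means[lab])))

# Toy data: two well-separated one-dimensional clusters.
X = [[float(i)] for i in range(10)] + [[100.0 + i] for i in range(10)]
y = [-1] * 10 + [1] * 10
correctness = cross_val_correctness(X, y, train_nearest_mean, predict_nearest_mean)
print(correctness)
```

With 780 samples and ten folds, each held-out fold contains 78 samples.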
DATAS: Differential Analysis of Transcripts with Alternative Splicing • Detection of Alternative RNA Isoforms via DATAS (Levels of mRNA that Correlate with Sensitivity to Chemotherapy) • [Figure: DNA with exons E1-E5 and introns I1-I4 is transcribed to pre-mRNA; alternative RNA splicing yields the mRNA isoforms E1-E2-E4-E5 and E1-E2-E3-E4-E5, which are translated into different proteins; panels labeled Chemo-Sensitive and Chemo-Resistant]
Breast Cancer Treatment Response • Joint with ExonHit (French BioTech) http://www.exonhit.com/html/company/index.htm • 35 patients treated by a drug cocktail • 9 partial responders; 26 nonresponders • 25 gene expression measurements made on each patient • 1-Norm SVM classifier selected: 12 out of 25 genes • Combinatorially selected 6 genes out of 12 • Separating plane obtained: 2.7915 T11 + 0.13436 S24 - 1.0269 U23 - 2.8108 Z23 - 1.8668 A19 - 1.5177 X05 + 2899.1 = 0. • Leave-one-out error: 1 out of 35 (97.1% correctness)
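To give a sense of scale for the combinatorial step, the sketch below enumerates every 6-element subset of 12 candidates and keeps the one with the best score. The gene names `g0`..`g11` and the scoring function are placeholders; the slide does not say how subsets were scored, so a real run would plug in something like leave-one-out correctness of a classifier trained on each subset.

```python
from itertools import combinations
from math import comb

def best_subset(features, k, score):
    """Exhaustively score every k-element subset and return the best one."""
    return max(combinations(features, k), key=score)

genes = [f"g{i}" for i in range(12)]  # placeholder names, not the real genes
n_subsets = comb(12, 6)               # number of subsets to evaluate

def toy_score(subset):
    """Placeholder score for illustration: prefers low-numbered genes."""
    return -sum(int(name[1:]) for name in subset)

chosen = best_subset(genes, 6, toy_score)
loo_correctness = round(100 * 34 / 35, 1)  # 1 error in 35 leave-one-out trials
print(n_subsets, chosen, loo_correctness)
```

Exhaustive search is feasible here only because the 1-norm SVM first reduced 25 genes to 12.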
More on SVMs: • My future job: Siemens, Medical solutions • My web page: www.cs.wisc.edu/~gfung • Olvi Mangasarian web page: www.cs.wisc.edu/~olvi