Classification and Feature Selection Algorithms for Multi-class CGH data Jun Liu, Sanjay Ranka, Tamer Kahveci http://www.cise.ufl.edu
The number of copies of genes can vary from person to person: ~0.4% of the gene copy numbers differ between pairs of people. Variations in copy numbers can alter resistance to disease; for example, the EGFR copy number can be higher than normal in non-small cell lung cancer. [Figure: gene copy number in lung images (ALA), cancer vs. healthy]
Example CGH dataset • 862 genomic intervals in the Progenetix database
Problem description • Given a new sample, which class does this sample belong to? • Which features should we use to make this decision?
Outline • Support Vector Machine (SVM) • SVM for CGH data • Maximum Influence Feature Selection algorithm • Results
SVM in a nutshell
Classification with SVM • Consider a two-class, linearly separable classification problem • Many decision boundaries are possible! • The decision boundary should be as far away from the data of both classes as possible • We should maximize the margin, m [Figure: Class 1 and Class 2 separated by a boundary with margin m]
SVM Formulation • Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi • Maximize over the αi: $J = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$, subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$ • The inner product $x_i^T x_j$ measures the similarity between xi and xj • The decision boundary can be constructed as $f(x) = \sum_i \alpha_i y_i \, x_i^T x + b$
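As a concrete illustration (not from the slides), here is a minimal sketch of the formulation above using scikit-learn; the toy data and the large-C hard-margin approximation are assumptions for the example, and dual_coef_ exposes the αi·yi values from the dual.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data; labels in {1, -1} as in the formulation
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

# A large C approximates the hard-margin (maximum-margin) problem
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# The decision function is f(x) = sum_i alpha_i y_i x_i^T x + b;
# dual_coef_ holds alpha_i * y_i for the support vectors
print(clf.support_vectors_)                 # the x_i with alpha_i > 0
print(clf.dual_coef_, clf.intercept_)
print(clf.decision_function([[1.5, 1.5]]))  # positive => class 1
```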
SVM for CGH data
Pairwise similarity measures • Raw measure • Count the number of genomic intervals at which both samples have a gain (or both have a loss) • Example: Raw = 3
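A minimal sketch of the Raw measure (the function name raw_similarity and the toy vectors are illustrative, not the authors' code):

```python
import numpy as np

def raw_similarity(x, y):
    """Raw measure: count intervals where both samples show the same
    aberration (both gain, +1, or both loss, -1)."""
    x, y = np.asarray(x), np.asarray(y)
    return int(np.sum((x == y) & (x != 0)))

# Example: two samples agreeing at three aberrant intervals
x = np.array([1, 1, 0, -1, 0, 1])
y = np.array([1, 1, -1, -1, -1, 0])
print(raw_similarity(x, y))  # 3
```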
SVM based on Raw kernel • Using SVM with the Raw kernel amounts to solving the following quadratic program: maximize over the αi $J = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathrm{Raw}(x_i, x_j)$ • The Raw kernel simply replaces the inner product $x_i^T x_j$ • The resulting decision function is $f(x) = \sum_i \alpha_i y_i \, \mathrm{Raw}(x_i, x) + b$
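A sketch of plugging the Raw kernel into an SVM via a precomputed Gram matrix in scikit-learn; raw_kernel_matrix and the toy data are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC

def raw_kernel_matrix(A, B):
    """Gram matrix of the Raw measure between rows of A and rows of B.
    Entries of A and B are CGH statuses in {1, 0, -1}."""
    # (a == b) and (a != 0): same non-zero aberration at that interval
    return ((A[:, None, :] == B[None, :, :]) & (A[:, None, :] != 0)).sum(axis=2)

X_train = np.array([[1, 0, -1, 1], [0, 1, 1, -1], [-1, -1, 0, 0], [1, 1, 0, -1]])
y_train = np.array([1, 1, -1, -1])

clf = SVC(kernel="precomputed")
clf.fit(raw_kernel_matrix(X_train, X_train), y_train)

# At test time the kernel holds similarities to the training samples
X_test = np.array([[1, 0, 0, 1]])
print(clf.predict(raw_kernel_matrix(X_test, X_train)))
```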
Is Raw kernel valid? • Not every similarity function can serve as a kernel: the underlying kernel matrix M must be positive semi-definite • M is positive semi-definite if vTMv ≥ 0 for all vectors v
Is Raw kernel valid? • Proof: define a function Φ: {1, 0, -1}m → {0, 1}2m that maps each interval status to two bits: • Φ(gain) = Φ(1) = 01 • Φ(no-change) = Φ(0) = 00 • Φ(loss) = Φ(-1) = 10 • Then Raw(X, Y) = Φ(X)TΦ(Y) • Example: X = 0 1 1 0 1 -1 and Y = 0 1 0 -1 -1 -1 give Φ(X) = 00 01 01 00 01 10 and Φ(Y) = 00 01 00 10 10 10, so Raw(X, Y) = 2 = Φ(X)TΦ(Y)
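The identity can be checked numerically; a minimal sketch (phi and raw are illustrative names):

```python
import numpy as np

# Phi maps each interval status to two bits:
# gain (1) -> 01, no-change (0) -> 00, loss (-1) -> 10
PHI = {1: (0, 1), 0: (0, 0), -1: (1, 0)}

def phi(x):
    return np.array([bit for v in x for bit in PHI[v]])

def raw(x, y):
    return sum(1 for a, b in zip(x, y) if a == b and a != 0)

# Example from the slide
X = [0, 1, 1, 0, 1, -1]
Y = [0, 1, 0, -1, -1, -1]
print(raw(X, Y))        # 2
print(phi(X) @ phi(Y))  # 2, matching Phi(X)^T Phi(Y)
```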
Raw Kernel is valid! • The Raw kernel can be written as Raw(X, Y) = Φ(X)TΦ(Y) • Define the 2m-by-n matrix P = [Φ(x1) Φ(x2) … Φ(xn)], so that the kernel matrix M of Raw is M = PTP • Then for any vector v, vTMv = vTPTPv = (Pv)T(Pv) = ‖Pv‖2 ≥ 0, so M is positive semi-definite
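A quick numeric confirmation of positive semi-definiteness on random CGH-like vectors (the setup is an assumption for illustration):

```python
import numpy as np

PHI = {1: (0, 1), 0: (0, 0), -1: (1, 0)}  # gain -> 01, no-change -> 00, loss -> 10

def phi(x):
    return np.array([bit for v in x for bit in PHI[int(v)]])

rng = np.random.default_rng(0)
samples = rng.integers(-1, 2, size=(10, 25))     # 10 samples, 25 intervals

P = np.stack([phi(s) for s in samples], axis=1)  # the 2m-by-n matrix P
M = P.T @ P                                      # kernel matrix of Raw: M = P^T P
print(np.linalg.eigvalsh(M).min() >= -1e-9)      # True: eigenvalues are all >= 0
```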
MIFS algorithm
MIFS for multi-class data • Train one-versus-all SVMs; each SVM ranks every feature by its contribution, from high to low (e.g., Feature 8, Feature 4, Feature 9, Feature 33, Feature 2, Feature 48, Feature 27, Feature 1, …) • Each feature thus gets a vector of ranks, one per class: Feature 1 → [8, 1, 3], Feature 2 → [5, 15, 8], Feature 3 → [12, 4, 3], Feature 4 → [2, 31, 1] • Sort each feature's rank vector in ascending order: [1, 3, 8], [5, 8, 15], [3, 4, 12], [1, 2, 31] • Sort the features by their sorted rank vectors: Feature 4 ([1, 2, 31]) comes first • Insert the most promising feature, Feature 4, into the feature set
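A sketch of the ranking step as reconstructed from this slide, using scikit-learn one-versus-all linear SVMs; ranking features within each class by |weight| is an assumption, since the slide does not state how the per-class ranks are produced.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mifs_rank(X, y):
    """Train a one-versus-all linear SVM per class, rank each feature
    within each class, sort each feature's rank vector ascending, then
    order features by their sorted rank vectors (best first)."""
    classes = np.unique(y)
    ranks = np.zeros((X.shape[1], len(classes)), dtype=int)
    for c_idx, c in enumerate(classes):
        svm = LinearSVC(C=1.0, max_iter=10000).fit(X, (y == c).astype(int))
        order = np.argsort(-np.abs(svm.coef_[0]))      # best feature first
        ranks[order, c_idx] = np.arange(1, X.shape[1] + 1)
    sorted_ranks = np.sort(ranks, axis=1)              # per-feature rank vector
    # lexicographic order on sorted rank vectors (lexsort keys go last-first)
    return np.lexsort(sorted_ranks.T[::-1])            # most promising first

# Tiny usage example with random data (illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
y = rng.integers(0, 3, size=60)
print(mifs_rank(X, y)[:5])  # five most promising feature indices
```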
Results
Dataset Details • Data taken from the Progenetix database
Datasets • [Table: dataset sizes]
Experimental results • Comparison of linear and Raw kernels • Averaged over sixteen datasets, the Raw kernel improves predictive accuracy by 6.4% compared to the linear kernel
Experimental results • [Plot: accuracy vs. number of features] • Using 80 features results in accuracy that is comparable to or better than using all features • Using 40 features results in accuracy that is comparable to using all features (Fu and Fu-Liu, 2005; Ding and Peng, 2005)
Using MIFS for feature selection • Results testing the hypothesis that 40 features are enough and 80 features are better
A Web Server for Mining CGH Data http://cghmine.cise.ufl.edu:8007/CGH/Default.html
Minimum Redundancy and Maximum Relevance (MRMR) • Relevance V is defined as the average mutual information between features and class labels • Redundancy W is defined as the average mutual information between all pairs of features • Incrementally select features by maximizing (V / W) or (V − W) • [Figure: example dataset with features 1–4, samples x1–x6, and their class labels]
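A greedy sketch of MRMR with the (V − W) criterion, using scikit-learn's mutual_info_score for discrete features; the function name and details are illustrative, not the reference implementation.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr(X, y, k):
    """Greedily add the feature maximizing relevance to the class labels
    minus mean redundancy with the already selected features.
    Assumes discrete features (e.g. CGH statuses in {1, 0, -1})."""
    n = X.shape[1]
    relevance = np.array([mutual_info_score(X[:, j], y) for j in range(n)])
    selected = [int(np.argmax(relevance))]   # start from the most relevant
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info_score(X[:, j], X[:, s])
                                  for s in selected])
            score = relevance[j] - redundancy          # the (V - W) criterion
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```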
Support Vector Machine Recursive Feature Elimination (SVM-RFE) • Train a linear SVM on the current feature set • Compute the weight vector w • Compute the ranking coefficient wi2 for the ith feature • Remove the feature with the smallest ranking coefficient • Repeat until the feature set is empty
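A direct sketch of the loop above for a two-class problem (scikit-learn's LinearSVC stands in for "a linear SVM"; sklearn.feature_selection.RFE offers the same idea off the shelf):

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y):
    """SVM-RFE: repeatedly train a linear SVM, rank features by w_i^2,
    and drop the lowest-ranked feature. Returns feature indices in
    elimination order (last eliminated = most important)."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while remaining:
        svm = LinearSVC(max_iter=10000).fit(X[:, remaining], y)
        w2 = svm.coef_[0] ** 2                  # ranking coefficients w_i^2
        worst = int(np.argmin(w2))
        eliminated.append(remaining.pop(worst))
    return eliminated
```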
Pairwise similarity measures • Sim measure • A segment is a contiguous block of aberrations of the same type • Count the number of overlapping segment pairs of the same type • Example: Sim = 2
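A sketch of the Sim measure as described here; the segment extraction and overlap test are my reconstruction of the slide, not the authors' code.

```python
def segments(x):
    """Maximal runs of identical non-zero aberrations, as (type, start, end)."""
    segs, start = [], None
    for i, v in enumerate(list(x) + [0]):          # sentinel flushes the last run
        if start is not None and v != x[start]:
            segs.append((x[start], start, i - 1))
            start = None
        if v != 0 and start is None:
            start = i
    return segs

def sim(x, y):
    """Sim measure: number of same-type, overlapping segment pairs."""
    return sum(1 for (tx, sx, ex) in segments(x)
                 for (ty, sy, ey) in segments(y)
                 if tx == ty and sx <= ey and sy <= ex)

x = [1, 1, 0, -1, -1, 0]
y = [0, 1, 1, -1, 0, 0]
print(sim(x, y))  # 2: the gain segments overlap, and so do the loss segments
```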
Non-linear Decision Boundary • How do we generalize SVM when the two-class classification problem is not linearly separable? • Key idea: transform xi to a higher-dimensional feature space to "make life easier" • Input space: the space where the points xi are located • Feature space: the space of the f(xi) after transformation • [Figure: a map f(.) from input space to feature space, where a linear decision boundary can be found]
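A minimal sketch of the idea with an explicit feature map; the map f and the toy data are assumptions for illustration (kernels such as RBF perform this mapping implicitly).

```python
import numpy as np
from sklearn.svm import SVC

# Inner points (class 1) vs. an outer ring (class -1):
# not linearly separable in the 2-D input space
X = np.array([[0.0, 0.0], [0.1, -0.1], [1, 0], [0, 1], [-1, 0], [0, -1]])
y = np.array([1, 1, -1, -1, -1, -1])

def f(X):
    """Explicit map to feature space: append x1^2 + x2^2 as a 3rd coordinate.
    In this space a plane (a radius threshold) separates the classes."""
    return np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])

clf = SVC(kernel="linear", C=1e3).fit(f(X), y)   # linear SVM in feature space
print(clf.predict(f(np.array([[0.05, 0.05], [0.8, 0.6]]))))  # [ 1 -1 ]
```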