Y.-J. Lee, O. L. Mangasarian & W.H. Wolberg

Survival-Time Classification of Breast Cancer Patients
DIMACS Workshop on Data Mining and Scalable Algorithms
August 22-24, 2001, Rutgers University
Y.-J. Lee, O. L. Mangasarian & W.H. Wolberg
Data Mining Institute, University of Wisconsin - Madison
Second Annual Review, June 1, 2001

American Cancer Society Year 2001 Breast Cancer Estimates
• Breast cancer, the most common cancer among women, is the second leading cause of cancer deaths in women (after lung cancer)
• 192,200 new cases of breast cancer in women will be diagnosed in the United States
• 40,600 deaths will occur from breast cancer (40,200 among women, 400 among men) in the United States
• According to the World Health Organization, more than 1.2 million people will be diagnosed with breast cancer this year worldwide

Key Objective
• Identify breast cancer patients for whom adjuvant chemotherapy prolongs survival time
• Main Difficulty: Cannot carry out comparative tests on human subjects
• Similar patients must be treated similarly
• Our Approach: Classify patients into Good, Intermediate & Poor groups
  • Classification based on: 5 cytological features plus tumor size
  • Classification criteria: Tumor size & lymph node status

Principal Results for 253 Breast Cancer Patients
• All 69 patients in the Good group:
  • Had the best survival rate
  • Had no chemotherapy
• All 73 patients in the Poor group:
  • Had the worst survival rate
  • Had chemotherapy
• For the 111 patients in the Intermediate group:
  • The 67 patients who had chemotherapy had a better survival rate than the 44 patients who did not have chemotherapy
  • This last result reverses the role of chemotherapy observed for the overall population
  • Very useful for treatment prescription

Outline
• Tools used
  • Support vector machines (SVMs)
    • Feature selection
    • Classification
  • Clustering
    • k-Median (k-Mean fails!)
• Cluster chemo patients into chemo-good & chemo-poor
• Cluster no-chemo patients into no-chemo-good & no-chemo-poor
• Three final classes
  • Good = No-chemo good
  • Poor = Chemo poor
  • Intermediate = Remaining patients
• Generate survival curves for three classes
• Use SVM to classify new patients into one of above three classes

Support Vector Machines Used in this Work
• Feature selection: SVM with 1-norm approach,
  $\min_{w,\gamma,y} \; \nu e'y + \|w\|_1$ s.t. $D(Aw - e\gamma) + y \ge e$, $y \ge 0$,
  where $D_{ii} = \pm 1$ denotes Lymph node > 0 or Lymph node = 0
• 6 out of 31 features selected by SVM:
  • 5 out of 30 cytological features that describe nuclear size, shape and texture
  • Tumor size from surgery
• Classification: Use SSVMs with Gaussian kernel
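The 1-norm SVM used for feature selection can be solved as a linear program, which is one reason it tends to drive many feature weights to exactly zero. The block below is a sketch of that standard rewriting. The symbols are assumed standard notation rather than taken from the slide: $A$ is the patient-by-feature matrix, $D$ the diagonal matrix of $\pm 1$ lymph node labels, $e$ a vector of ones, $y$ the slack vector, $\nu > 0$ a trade-off weight, and the weight vector is split as $w = p - q$ with $p, q \ge 0$.

```latex
\begin{aligned}
\min_{p,\,q,\,\gamma,\,y} \quad & \nu\, e'y + e'(p + q) \\
\text{s.t.} \quad & D\bigl(A(p - q) - e\gamma\bigr) + y \ge e, \\
& p \ge 0,\; q \ge 0,\; y \ge 0 .
\end{aligned}
```

At a solution, $e'(p + q) = \|w\|_1$, and any feature $j$ with $w_j = p_j - q_j = 0$ plays no role in the separating plane, so it can be dropped.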

Clustering in Data Mining
General Objective
• Given: A dataset of m points in n-dimensional real space
• Problem: Extract hidden distinct properties by clustering the dataset

Concave Minimization Formulation of Clustering Problem
• Given: Set $\mathcal{A}$ of $m$ points in $R^n$ represented by the matrix $A \in R^{m \times n}$, and a number $k$ of desired clusters
• Problem: Determine centers $C_l$, $l = 1, \ldots, k$, in $R^n$ such that the sum of the minima over $l \in \{1, \ldots, k\}$ of the 1-norm distance between each point $A_i$, $i = 1, \ldots, m$, and cluster centers $C_l$, $l = 1, \ldots, k$, is minimized
• Objective Function: Sum of m minima of linear functions, hence it is piecewise-linear concave
• Difficulty: Minimizing a general piecewise-linear concave function over a polyhedral set is NP-hard

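Clustering via Concave Minimization
• Minimize the sum of 1-norm distances between each data point $A_i$ and the closest cluster center $C_l$:
  $\min_{C_l, D_{il}} \; \sum_{i=1}^{m} \min_{l=1,\ldots,k} \{ e'D_{il} \}$
  s.t. $-D_{il} \le A_i' - C_l \le D_{il}$, $i = 1, \ldots, m$, $l = 1, \ldots, k$
• Bilinear reformulation:
  $\min_{C_l, D_{il}, T_{il}} \; \sum_{i=1}^{m} \sum_{l=1}^{k} T_{il} \, e'D_{il}$
  s.t. $-D_{il} \le A_i' - C_l \le D_{il}$, $\sum_{l=1}^{k} T_{il} = 1$, $T_{il} \ge 0$, $i = 1, \ldots, m$, $l = 1, \ldots, k$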
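To make the two objectives concrete, here is a minimal Python sketch that evaluates both for fixed centers. The function names (`one_norm`, `k_median_objective`, `nearest_assignment`, `bilinear_objective`) are illustrative and do not come from the slides. It assumes the bound variables are tight, so the 1-norm distance between a point and a center stands in for the sum of the components of the corresponding bound vector, and that each point carries nonnegative assignment weights summing to 1.

```python
def one_norm(u, v):
    """1-norm (Manhattan) distance between two points."""
    return sum(abs(a - b) for a, b in zip(u, v))

def k_median_objective(points, centers):
    """Sum over points of the distance to the closest center."""
    return sum(min(one_norm(p, c) for c in centers) for p in points)

def nearest_assignment(points, centers):
    """Assignment weights T: each row puts weight 1 on the nearest center."""
    T = []
    for p in points:
        dists = [one_norm(p, c) for c in centers]
        best = dists.index(min(dists))
        T.append([1.0 if l == best else 0.0 for l in range(len(centers))])
    return T

def bilinear_objective(points, centers, T):
    """Weighted sum of distances; T[i][l] is the weight of center l for point i."""
    return sum(
        T[i][l] * one_norm(p, c)
        for i, p in enumerate(points)
        for l, c in enumerate(centers)
    )
```

For fixed centers, the bilinear value is smallest when each point puts all of its weight on its nearest center, and that smallest value equals the sum-of-minima objective.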

Finite K-Median Clustering Algorithm (Minimizing Piecewise-linear Concave Function)
Step 0 (Initialization): Given k initial cluster centers
• Different initial centers will lead to different clusters
Step 1 (Cluster Assignment): Assign points to the cluster with the nearest cluster center in 1-norm
Step 2 (Center Update): Recompute location of center for each cluster as the cluster median (closest point to all cluster points in 1-norm)
Step 3 (Stopping Criterion): Stop if the cluster centers are unchanged, else go to Step 1
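The steps above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: the names `k_median` and `one_norm`, the `max_iter` safeguard, and the choice to keep the old center for an empty cluster are all assumptions. The coordinate-wise median is used for the center update because it minimizes the sum of 1-norm distances to the cluster's points.

```python
from statistics import median

def one_norm(u, v):
    """1-norm (Manhattan) distance between two points."""
    return sum(abs(a - b) for a, b in zip(u, v))

def k_median(points, initial_centers, max_iter=100):
    """Alternate assignment and median updates until the centers stop changing."""
    centers = [tuple(c) for c in initial_centers]  # Step 0: given initial centers
    labels = []
    for _ in range(max_iter):
        # Step 1: assign each point to the nearest center in the 1-norm
        labels = [
            min(range(len(centers)), key=lambda l: one_norm(p, centers[l]))
            for p in points
        ]
        # Step 2: move each center to the coordinate-wise median of its cluster
        new_centers = []
        for l, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == l]
            if members:
                new_centers.append(tuple(median(col) for col in zip(*members)))
            else:
                new_centers.append(old)  # assumption: an empty cluster keeps its center
        # Step 3: stop when the centers are unchanged
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

Calling `k_median(points, [(1, 1), (9, 9)])` on two well-separated groups of points returns the final centers together with one cluster label per point. Starting from different initial centers can give a different result.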

Clustering Process: Feature Selection & Initial Cluster Centers
• 6 out of 31 features selected by a linear SVM
  • SVM separating lymph node positive (Lymph > 0) from lymph node negative (Lymph = 0)
• Perform k-Median algorithm in 6-dimensional feature space
• Initial cluster centers used: Medians of Good1 & Poor1
  • Good1: Patients with Lymph = 0 AND Tumor < 2
  • Poor1: Patients with Lymph > 4 OR Tumor >= 4
  • Typical indicator for chemotherapy

Clustering Process
• 253 Patients (113 NoChemo, 140 Chemo)
• Compute Initial Cluster Centers
  • Good1: Lymph = 0 AND Tumor < 2; compute median using 6 features
  • Poor1: Lymph >= 5 OR Tumor >= 4; compute median using 6 features
• Cluster 113 NoChemo Patients: use k-Median algorithm with initial centers: medians of Good1 & Poor1
  • 69 NoChemo Good
  • 44 NoChemo Poor
• Cluster 140 Chemo Patients: use k-Median algorithm with initial centers: medians of Good1 & Poor1
  • 67 Chemo Good
  • 73 Chemo Poor
• Final groups
  • Good: 69 NoChemo Good
  • Intermediate: 44 NoChemo Poor + 67 Chemo Good
  • Poor: 73 Chemo Poor
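The way the four clusters collapse into three final groups, and the resulting group sizes, can be checked with a short Python sketch. The function name `final_class` and the `"good"` / `"poor"` cluster labels are illustrative assumptions. The counts are the ones reported on the slide.

```python
from collections import Counter

def final_class(had_chemo, cluster):
    """Map a patient's chemo status and k-Median cluster to a final group.

    `cluster` is "good" or "poor" within the patient's own subset
    (chemo or no-chemo), as produced by the two clustering runs.
    """
    if not had_chemo and cluster == "good":
        return "Good"
    if had_chemo and cluster == "poor":
        return "Poor"
    return "Intermediate"  # NoChemo Poor and Chemo Good

# Cluster sizes reported on the slide, keyed by (had_chemo, cluster)
cluster_sizes = {
    (False, "good"): 69,
    (False, "poor"): 44,
    (True, "good"): 67,
    (True, "poor"): 73,
}

group_sizes = Counter()
for (had_chemo, cluster), n in cluster_sizes.items():
    group_sizes[final_class(had_chemo, cluster)] += n
```

With these counts the Good, Intermediate and Poor groups contain 69, 111 and 73 patients, which together account for all 253.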

Survival Curves for Good, Intermediate & Poor Groups

Survival Curves for Intermediate Group: Split by Chemo & NoChemo

Survival Curves for All Patients: Split by Chemo & NoChemo

Survival Curves for Intermediate Group: Split by Lymph Node & Chemotherapy

Survival Curves for All Patients: Split by Lymph Node Positive & Negative
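The slides do not say how the survival curves were estimated. As a sketch only, assuming the common Kaplan-Meier estimator, a curve for one group could be computed as follows. The function name `kaplan_meier` and the input encoding (1 for an observed death, 0 for a censored patient) are assumptions made for illustration.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up time for each patient
    events: 1 if death was observed at that time, 0 if the patient was censored
    Returns a list of (time, survival probability) at each observed death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve
```

Running this once per group (for example Good, Intermediate and Poor) gives one step curve per group that can be plotted on shared axes.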

Nonlinear SVM Classifier: 82.7% Tenfold Test Correctness
• Four groups from the clustering result: Good, Intermediate (ChemoGood), Intermediate (NoChemoPoor), Poor
• First SVM separates:
  • Good2: Good & ChemoGood
  • Poor2: NoChemoPoor & Poor
• For Good2: Compute LI(x) & CI(x), then a second SVM separates Good from Intermediate
• For Poor2: Compute LI(x) & CI(x), then a second SVM separates Poor from Intermediate
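The two-stage structure of the classifier can be sketched as a small cascade in Python. Everything here is hypothetical scaffolding: `top_svm`, `good_svm` and `poor_svm` stand for already-trained decision functions that return a signed value, with a positive sign assumed to mean the "Good2", "Good" and "Poor" side respectively. LI(x) and CI(x) are not defined on the slide, so they are left out of the sketch.

```python
def classify_patient(x, top_svm, good_svm, poor_svm):
    """Assign a new patient to Good, Intermediate or Poor with a two-stage cascade.

    top_svm(x)  > 0 : patient falls on the Good2 side (Good & ChemoGood)
    good_svm(x) > 0 : within Good2, patient is Good rather than Intermediate
    poor_svm(x) > 0 : within Poor2, patient is Poor rather than Intermediate
    """
    if top_svm(x) > 0:
        return "Good" if good_svm(x) > 0 else "Intermediate"
    return "Poor" if poor_svm(x) > 0 else "Intermediate"
```

Any callable that returns a signed score can be plugged in, for instance the decision function of a Gaussian-kernel SVM trained on the six selected features.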

Conclusion
• Used five cytological features & tumor size to cluster breast cancer patients into 3 groups:
  • Good: No chemotherapy recommended
  • Intermediate: Chemotherapy likely to prolong survival
  • Poor: Chemotherapy may or may not enhance survival
• 3 groups have very distinct survival curves
• First categorization of a breast cancer group for which chemotherapy enhances longevity
• SVM-based procedure assigns new patients into one of above three survival groups
