Y.-J. Lee & O. L. Mangasarian

Survival-Time Classification of Breast Cancer Patients
DIMACS Workshop on Data Mining and Scalable Algorithms, August 22-24, 2001, Rutgers University
Y.-J. Lee & O. L. Mangasarian, Data Mining Institute, University of Wisconsin - Madison
Second Annual Review, June 1, 2001

American Cancer Society 2001 Breast Cancer Estimates
• Breast cancer, the most common cancer among women, is the second leading cause of cancer deaths in women (after lung cancer)
• 192,200 new cases of breast cancer in women will be diagnosed in the United States
• 40,600 deaths will occur from breast cancer (40,200 among women, 400 among men) in the United States
• According to the World Health Organization, more than 1.2 million people will be diagnosed with breast cancer this year worldwide

Key Objective
• Identify breast cancer patients for whom adjuvant chemotherapy prolongs survival time
• Main Difficulty: Cannot carry out comparative tests on human subjects
  • Similar patients must be treated similarly
• Our Approach: Classify patients into: good, intermediate & poor groups
  • Characterize classes by: Tumor size & lymph node status
  • Classification based on: 5 cytological features plus tumor size

Principal Results for 253 Breast Cancer Patients
• All 69 patients in the good group:
  • Had no chemotherapy
  • Had the best survival rate
• All 73 patients in the poor group:
  • Had chemotherapy
  • Had the worst survival rate
• For the 111 patients in the intermediate group:
  • The 67 patients who had chemotherapy had a better survival rate than
  • The 44 patients who did not have chemotherapy
• This last result reverses the role of chemotherapy observed both in the overall population and in the good & poor groups

Outline
• Tools used
  • Support vector machines (SVMs)
    • Feature selection
    • Classification
  • Clustering
    • k-Median (k-Mean fails!)
• Cluster chemo patients into chemo-good & chemo-poor
• Cluster no-chemo patients into no-chemo-good & no-chemo-poor
• Three final classes
  • Good = No-chemo good
  • Poor = Chemo poor
  • Intermediate = Remaining patients
• Generate survival curves for three classes
• Use SVM to classify new patients into one of above three classes

Simplest Support Vector Machine: Linear Surface Maximizing the Margin
[Figure: linear separating surface with maximal margin between the point sets A+ and A-]
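Since the slide's figure is not recoverable, the following is one standard way to write a margin-maximizing linear SVM. It is a sketch in assumed notation, not necessarily the exact formulation used in the talk: the rows of $A$ are the data points, $D$ is a diagonal matrix with $+1$ for points in A+ and $-1$ for points in A-, $e$ is a vector of ones, $\xi$ is a slack vector, and $\nu > 0$ is a weight chosen by the user.

```latex
\min_{w,\,\gamma,\,\xi} \;\; \nu\, e^{T}\xi + \tfrac{1}{2}\,\|w\|_2^2
\quad \text{s.t.} \quad D(Aw - e\gamma) + \xi \ge e, \qquad \xi \ge 0
```

The separating plane is $x^{T}w = \gamma$. It lies midway between the bounding planes $x^{T}w = \gamma \pm 1$, whose distance apart (the margin) is $2/\|w\|_2$, so making $\|w\|_2$ small widens the margin.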

Clustering in Data Mining: General Objective
• Given: A dataset of m points in n-dimensional real space
• Problem: Extract hidden distinct properties by clustering the dataset

Concave Minimization Formulation of Clustering Problem
• Given: Set $\mathcal{A}$ of $m$ points in $R^n$ represented by the matrix $A \in R^{m \times n}$, and a number $k$ of desired clusters
• Problem: Determine centers $C_l \in R^n$, $l = 1, \dots, k$, such that the sum of the minima over $l \in \{1, \dots, k\}$ of the 1-norm distance between each point $A_i$, $i = 1, \dots, m$, and cluster centers $C_l$, $l = 1, \dots, k$, is minimized
• Objective: Sum of $m$ minima of $k$ linear functions, hence it is piecewise-linear concave
• Difficulty: Minimizing a general piecewise-linear concave function over a polyhedral set is NP-hard

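Clustering via Concave Minimization
• Minimize the sum of 1-norm distances between each data point $A_i$ and the closest cluster center $C_l$ (with $e$ a vector of ones):
  $\min_{C_l,\, D_{il} \in R^n} \; \sum_{i=1}^{m} \min_{l=1,\dots,k} \{ e^{T} D_{il} \}$
  s.t. $-D_{il} \le A_i^{T} - C_l \le D_{il}$, $i = 1, \dots, m$, $l = 1, \dots, k$
• Reformulation:
  $\min_{C_l,\, D_{il} \in R^n,\; T_{il} \in R} \; \sum_{i=1}^{m} \sum_{l=1}^{k} T_{il}\, e^{T} D_{il}$
  s.t. $-D_{il} \le A_i^{T} - C_l \le D_{il}$, $\sum_{l=1}^{k} T_{il} = 1$, $T_{il} \ge 0$, $i = 1, \dots, m$, $l = 1, \dots, k$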
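A small numerical sketch may help show why the inner minimum can be replaced by assignment weights. For fixed centers, put all of a point's weight on its closest center and the weighted sum of 1-norm distances equals the sum of minima. Any other nonnegative weights that sum to 1 per point can only give a larger or equal value. The function names, the toy points, and the weight matrices below are illustrative assumptions, not taken from the talk.

```python
def one_norm(p, q):
    """1-norm (Manhattan) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def min_objective(points, centers):
    """Sum over points of the 1-norm distance to the closest center."""
    return sum(min(one_norm(p, c) for c in centers) for p in points)


def weighted_objective(points, centers, weights):
    """Sum of weights[i][l] * (1-norm distance from point i to center l)."""
    return sum(
        weights[i][l] * one_norm(p, c)
        for i, p in enumerate(points)
        for l, c in enumerate(centers)
    )


def closest_center_weights(points, centers):
    """Weight 1 on each point's closest center, 0 elsewhere."""
    weights = []
    for p in points:
        dists = [one_norm(p, c) for c in centers]
        best = dists.index(min(dists))
        weights.append([1.0 if l == best else 0.0 for l in range(len(centers))])
    return weights


# Toy data (hypothetical): two obvious groups in the plane.
points = [(0, 0), (1, 0), (9, 9), (10, 8)]
centers = [(0, 0), (10, 9)]

best_weights = closest_center_weights(points, centers)
even_weights = [[0.5, 0.5] for _ in points]

print(min_objective(points, centers))                     # sum of minima
print(weighted_objective(points, centers, best_weights))  # same value
print(weighted_objective(points, centers, even_weights))  # never smaller
```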

Finite K-Median Clustering Algorithm (Minimizing Piecewise-linear Concave Function)
Step 0 (Initialization): Given k initial cluster centers
• Different initial centers will lead to different clusters
Step 1 (Cluster Assignment): Assign points to the cluster with the nearest cluster center in 1-norm
Step 2 (Center Update): Recompute location of center for each cluster as the cluster median (closest point to all cluster points in 1-norm)
Step 3 (Stopping Criterion): Stop if the cluster centers are unchanged, else go to Step 1
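The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation. The names `k_median` and `one_norm`, the `max_iter` safeguard, and the choice to leave an empty cluster's center where it is are all assumptions made for the example. The center update uses the coordinate-wise median, which minimizes the sum of 1-norm distances to the cluster's points.

```python
from statistics import median


def one_norm(p, q):
    """1-norm (Manhattan) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def k_median(points, initial_centers, max_iter=100):
    """Alternate assignment and median update until centers stop changing."""
    centers = [tuple(c) for c in initial_centers]  # Step 0: given centers
    labels = []
    for _ in range(max_iter):
        # Step 1: assign each point to the nearest center in the 1-norm.
        labels = [
            min(range(len(centers)), key=lambda l: one_norm(p, centers[l]))
            for p in points
        ]
        # Step 2: move each center to the coordinate-wise median of its cluster.
        new_centers = []
        for l, old_center in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == l]
            if members:
                new_centers.append(tuple(median(col) for col in zip(*members)))
            else:
                new_centers.append(old_center)  # assumption: keep empty cluster's center
        # Step 3: stop when no center moved.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Toy data (hypothetical) with two well-separated groups.
points = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 8), (10, 10)]
centers, labels = k_median(points, [(0, 0), (10, 10)])
print(centers, labels)
```

Because the result depends on the starting centers, calling `k_median` with different `initial_centers` on the same data can return different clusters.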

Clustering Process: Feature Selection & Initial Cluster Centers
• 6 out of 31 features selected by a linear SVM
  • SVM separating lymph node positive (Lymph>0) from lymph node negative (Lymph=0)
• Clustering performed in 6-dimensional feature space
• Initial cluster centers used:
  • Good: Median in 6-dimensional space of patients with Lymph=0 AND Tumor<2
  • Poor: Median in 6-dimensional space of patients with Lymph>4 OR Tumor>4
    • Typical indicator for chemotherapy

Clustering Process: 253 Patients
• Initial groups:
  • Good1: Lymph=0 AND Tumor<2 (compute median using 6 features)
  • Poor1: Lymph>=5 OR Tumor>=4 (compute median using 6 features)
  • Intermediate1: (0<Lymph<5 & Tumor<4) OR (Lymph<5 & 2<=Tumor<4)
• Cluster 113 NoChemo patients: use k-Median algorithm with initial centers: medians of Good1 & Poor1
  • 69 NoChemo Good
  • 44 NoChemo Poor
• Cluster 140 Chemo patients: use k-Median algorithm with initial centers: medians of Good1 & Poor1
  • 67 Chemo Good
  • 73 Chemo Poor
• Final classes:
  • Good: 69 NoChemo Good
  • Intermediate: 44 NoChemo Poor + 67 Chemo Good
  • Poor: 73 Chemo Poor
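The threshold rules on this slide can be written as a small function, which also makes it easy to check that Good1, Poor1 and Intermediate1 do not overlap and that the reported cluster sizes add up. The function name `initial_group` and the dictionary layout are illustrative assumptions. The thresholds and patient counts are those stated on the slide.

```python
def initial_group(lymph, tumor):
    """Assign a patient to Good1, Poor1 or Intermediate1 from the slide's rules."""
    if lymph == 0 and tumor < 2:
        return "Good1"
    if lymph >= 5 or tumor >= 4:
        return "Poor1"
    if (0 < lymph < 5 and tumor < 4) or (lymph < 5 and 2 <= tumor < 4):
        return "Intermediate1"
    raise ValueError("inputs outside the ranges covered by the rules")


# Cluster sizes reported on the slide.
clusters = {
    "NoChemo Good": 69,
    "NoChemo Poor": 44,
    "Chemo Good": 67,
    "Chemo Poor": 73,
}

# Final classes: Good = NoChemo Good, Poor = Chemo Poor, Intermediate = the rest.
final_classes = {
    "Good": clusters["NoChemo Good"],
    "Intermediate": clusters["NoChemo Poor"] + clusters["Chemo Good"],
    "Poor": clusters["Chemo Poor"],
}

print(initial_group(0, 1.5), initial_group(2, 3.0), initial_group(6, 1.0))
print(final_classes, sum(final_classes.values()))
```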

Survival Curves for Good, Intermediate & Poor Groups

Survival Curves for Intermediate Group: Split by Chemo & NoChemo

Survival Curves for All Patients: Split by Chemo & NoChemo

Survival Curves for Intermediate Group: Split by Lymph Node & Chemotherapy

Survival Curves for All Patients: Split by Lymph Node Positive & Negative
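The slides do not say how the survival curves were estimated. A common choice for data with censored follow-up is the Kaplan-Meier estimator, sketched here purely as an assumption. The function name `kaplan_meier` and the toy follow-up data are hypothetical and are not patient data from the study.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with an observed death.

    times:  follow-up time for each patient
    events: 1 if death was observed at that time, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy follow-up data (hypothetical): times in arbitrary units.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
print(curve)
```

Running the estimator separately on each group, for example Good, Intermediate and Poor, gives one step curve per group that can be plotted on shared axes.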

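Nonlinear SVM Classifier: 82.7% Tenfold Test Correctness
[Flowchart: the four clusters Good, ChemoGood, NoChemoPoor and Poor are merged into Good2 (Good & ChemoGood, labeled "Not Poor") and Poor2 (NoChemoPoor & Poor, labeled "Not Good"). The chart contains three SVM nodes and two "Compute LI(x) & CI(x)" steps, with final outputs Good, Intermediate, Intermediate and Poor.]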
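Tenfold test correctness means holding out each tenth of the data in turn, training on the other nine tenths, and reporting the fraction of held-out points classified correctly. The sketch below shows only that evaluation loop. The nonlinear SVM itself is not reproduced, and a trivial midpoint-threshold classifier stands in for it. All names (`k_fold_correctness`, `fit_midpoint`, `predict_midpoint`) and the toy data are assumptions made for illustration.

```python
import random


def k_fold_correctness(X, y, fit, predict, k=10, seed=0):
    """Fraction of points classified correctly when each fold is held out once."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::k] for f in range(k)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)


def fit_midpoint(X, y):
    """Stand-in classifier (NOT an SVM): threshold midway between class means."""
    mean0 = sum(x for x, lab in zip(X, y) if lab == 0) / y.count(0)
    mean1 = sum(x for x, lab in zip(X, y) if lab == 1) / y.count(1)
    return (mean0 + mean1) / 2


def predict_midpoint(threshold, x):
    """Predict class 1 above the threshold, class 0 otherwise."""
    return 1 if x > threshold else 0


# Toy one-dimensional data (hypothetical) with two well-separated classes.
X = list(range(20)) + list(range(100, 120))
y = [0] * 20 + [1] * 20
score = k_fold_correctness(X, y, fit_midpoint, predict_midpoint, k=10)
print(score)
```

To evaluate a real classifier, pass its training and prediction routines as `fit` and `predict` in place of the stand-in.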

Conclusion
• By using five features from a fine needle aspirate & tumor size, breast cancer patients can be classified into 3 classes
  • Good – Requiring no chemotherapy
  • Intermediate – Chemotherapy recommended for longer survival
  • Poor – Chemotherapy may or may not enhance survival
• 3 classes have very distinct survival curves
• First categorization of a breast cancer group for which chemotherapy enhances longevity
