Feature Selection and Error Tolerance for the Logical Analysis of Data
Craig Bowles (Cornell University) and Kathryn Davidson (University of Pennsylvania)
Mentor: Endre Boros, RUTCOR
Our Goals • Train a computer to tell us which attributes in a medical data set are important • Have the computer suggest possible formulas for distinguishing healthy and sick patients • Achieve these goals with as much tolerance for data error as possible
Training Data Set: Wisconsin Breast Cancer Database from University of Wisconsin Hospitals, Madison (Dr. William H. Wolberg)
Sample patient vector: 1000025,5,1,1,1,2,1,3,1,1,2 (ID #, 9 test results, class distinction)
There are 699 patients in total (458 benign "2", 241 malignant "4")
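The record layout above (ID, nine attribute values, class label) can be sketched as a small parser; the function name is our own, not from the slides:

```python
def parse_record(line):
    """Split a comma-separated WBCD line into (id, attributes, label).
    Layout per the slides: ID #, 9 test results, class (2 = benign, 4 = malignant)."""
    fields = [int(x) for x in line.split(",")]
    return fields[0], fields[1:-1], fields[-1]

pid, attrs, label = parse_record("1000025,5,1,1,1,2,1,3,1,1,2")
# pid == 1000025, attrs == [5, 1, 1, 1, 2, 1, 3, 1, 1], label == 2
```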
Minimal Difference Vectors → Dualization → Maximal Error Vectors
Difference vector: |(5 4 4 5 7 10 3 2 1) − (6 8 8 1 3 4 3 7 1)| = (1 4 4 4 4 6 0 5 0)
***An error of (1 4 4 4 4 6 0 5 0) would not distinguish (5 4 4 5 7 10 3 2 1) from (6 8 8 1 3 4 3 7 1)
Let D = {difference vectors between positive and negative patients} (over 90,000 for the WBCD)
Minimal difference vectors: those d in D such that there is no other vector in D that is less than or equal to d in every coordinate
Next step: input {minimal difference vectors} into the dualization algorithm
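The two definitions above (coordinate-wise absolute difference, then Pareto-minimal elements of the resulting set) can be sketched as follows; function names are ours, not from the slides:

```python
def difference_vector(p, n):
    """Coordinate-wise absolute difference of a positive and a negative patient vector."""
    return tuple(abs(a - b) for a, b in zip(p, n))

def minimal_vectors(vectors):
    """Keep only the vectors d such that no other vector in the set
    is less than or equal to d in every coordinate."""
    def dominates(a, b):
        # a dominates b if a <= b coordinate-wise and a != b
        return a != b and all(x <= y for x, y in zip(a, b))
    vs = set(vectors)
    return {d for d in vs if not any(dominates(e, d) for e in vs)}

# The slide's example pair:
d = difference_vector((5, 4, 4, 5, 7, 10, 3, 2, 1), (6, 8, 8, 1, 3, 4, 3, 7, 1))
# d == (1, 4, 4, 4, 4, 6, 0, 5, 0)
```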
Minimal Difference Vectors → Dualization → Maximal Error Vectors (2-D example)
Input: the minimal difference vectors (red points in the figure)
Output of the dualization algorithm: (5,0), (3,2), (2,3)
To find the blue points (what we want), subtract each coordinate of the outputs from the grid maximum, 5: (0,5), (2,3), (3,2)
Dualization code: http://rutcor.rutgers.edu/~boros/IDM/DualizationCode.html
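The figure's red points are not recoverable from the text, so here is a brute-force sketch of the same idea on illustrative minimal vectors {(1,3), (3,1)} in the grid [0,5]²: mark every grid point that dominates some minimal vector, then return the maximal points of the complement. (A real dualization algorithm, like the one at the URL above, avoids enumerating the grid.)

```python
from itertools import product

def brute_force_dual(minimal_vecs, box=5):
    """Maximal points of the grid [0, box]^d that do NOT dominate
    any of the given minimal vectors."""
    d = len(next(iter(minimal_vecs)))
    def covered(p):
        # p dominates some minimal vector -> p is in the upward-closed set
        return any(all(p[i] >= m[i] for i in range(d)) for m in minimal_vecs)
    complement = [p for p in product(range(box + 1), repeat=d) if not covered(p)]
    def below(a, b):
        return a != b and all(x <= y for x, y in zip(a, b))
    return sorted(p for p in complement
                  if not any(below(p, q) for q in complement))

brute_force_dual({(1, 3), (3, 1)})  # -> [(0, 5), (2, 2), (5, 0)]
```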
Minimal Difference Vectors → Dualization → Maximal Error Vectors
• The output of the dualization algorithm is another set of vectors
• For each coordinate in these vectors, take the complement (i.e., 10 − coordinate)
• Divide each vector by 2 to find a maximal error vector
• Sort these error vectors by greatest sum, most 5's, maximal minimum element, etc.
• Choose an epsilon from the sorted lists that looks good
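The complement-and-halve steps above can be sketched as follows (sorting here only by greatest sum, one of the criteria listed). The example input (9, 0, 10, 0, 10, 10, 0, 0, 0) is chosen so that the result is the error vector (0.5, 5, 0, 5, 0, 0, 5, 5, 5) used on the next slide; it is our reconstruction, not taken from the slides:

```python
def maximal_error_vectors(dual_outputs, max_val=10):
    """Complement each coordinate (max_val - c) and halve it,
    then sort candidate error vectors by descending coordinate sum."""
    errors = [tuple((max_val - c) / 2 for c in v) for v in dual_outputs]
    return sorted(errors, key=sum, reverse=True)

maximal_error_vectors([(9, 0, 10, 0, 10, 10, 0, 0, 0)])
# -> [(0.5, 5.0, 0.0, 5.0, 0.0, 0.0, 5.0, 5.0, 5.0)]
```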
Binarization
For a good error vector, we binarize the original data, e.g.:
Error = (0.5, 5, 0, 5, 0, 0, 5, 5, 5)
Thresholds: Col 1 = 4, Col 3 = 7, Col 5 = 5, Col 6 = 8
For a patient (5, 1, 1, 1, 2, 1, 3, 1, 1), we test each thresholded column to see whether the patient's value is greater or less than the threshold, or within the error (marked *):
(5, 1, 1, 1, 2, 1, 3, 1, 1) → 1 0 0 0
(4, 1, 1, 1, 2, 1, 3, 1, 1) → * 0 0 0
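The binarization rule above can be sketched directly; the function name and the exact tie-breaking (within-error means |value − threshold| ≤ error) are our assumptions:

```python
def binarize(patient, thresholds, error):
    """Map each thresholded attribute to '1' (above threshold by more than the
    allowed error), '0' (below by more than the error), or '*' (within error)."""
    bits = []
    for col, t in thresholds.items():   # columns are 1-indexed on the slides
        v, eps = patient[col - 1], error[col - 1]
        if abs(v - t) <= eps:
            bits.append("*")
        elif v > t:
            bits.append("1")
        else:
            bits.append("0")
    return "".join(bits)

thresholds = {1: 4, 3: 7, 5: 5, 6: 8}
error = (0.5, 5, 0, 5, 0, 0, 5, 5, 5)
binarize((5, 1, 1, 1, 2, 1, 3, 1, 1), thresholds, error)  # -> "1000"
binarize((4, 1, 1, 1, 2, 1, 3, 1, 1), thresholds, error)  # -> "*000"
```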
RESULTS FOR WISCONSIN BREAST CANCER DATA
Error tolerance: (0.5, 5, 0, 5, 0, 0, 5, 5, 5)
Attributes and corresponding thresholds: 1 : 4, 3 : 7, 5 : 5, 6 : 8
Total pos/neg entries per binary pattern (parenthesized counts as in the original):
1 0 0 0 : 96, 21 (162, 23)
1 0 0 1 : 1, 37 (1, 41)
0 1 1 1 : 0, 6 (0, 6)
1 0 1 0 : 2, 16 (3, 17)
1 0 1 1 : 1, 27 (1, 28)
1 1 0 0 : 1, 12 (1, 14)
1 1 0 1 : 0, 19 (0, 21)
1 1 1 0 : 0, 24 (0, 24)
1 1 1 1 : 2, 52 (2, 52)
0 0 0 0 : 268, 2 (334, 4)
0 0 0 1 : 1, 3 (1, 7)
0 0 1 0 : 5, 2 (6, 3)
0 0 1 1 : 0, 4 (0, 5)
0 1 0 1 : 0, 2 (0, 4)
Formula for WBCD
Let P = (Col 1 ≥ 4), Q = (Col 3 ≥ 7), R = (Col 5 ≥ 5), S = (Col 6 ≥ 8)
Then we can characterize most (432/444) positives with: ¬Q ∧ ¬R ∧ ¬S
Some example patient vectors:
Negatives: (8,7,5,10,7,9,5,5,4), (7,4,6,4,6,1,4,3,1)
Positives: (4,1,1,1,2,1,2,1,1), (4,1,1,1,2,1,3,1,1)
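The formula can be checked on the example vectors above; this is our own sketch, assuming the dropped relation symbols on the slide are ≥:

```python
def classify(patient):
    """Evaluate the formula not-Q and not-R and not-S on a 9-attribute vector,
    where Q = (Col 3 >= 7), R = (Col 5 >= 5), S = (Col 6 >= 8)."""
    Q = patient[2] >= 7   # columns are 1-indexed on the slides
    R = patient[4] >= 5
    S = patient[5] >= 8
    return (not Q) and (not R) and (not S)

classify((4, 1, 1, 1, 2, 1, 2, 1, 1))   # positive example -> True
classify((8, 7, 5, 10, 7, 9, 5, 5, 4))  # negative example -> False
```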
More to do:
• Test our procedure on different databases
• Study heuristic methods for threshold selection
• In general, explore ways to use more flexible error vectors and/or thresholds

References:
[1] Boros, E., Hammer, P.L., Ibaraki, T., and Kogan, A. "Logical Analysis of Numerical Data," Math. Programming, 79 (1997), 163-190.
[2] Boros, E., Ibaraki, T., and Makino, K. "Variations on Extending Partially Defined Boolean Functions with Missing Bits," June 6, 2000.
[3] Boros, E. http://rutcor.rutgers.edu/~boros/IDM/DualizationCode.html, July 1, 2005.
[4] Mangasarian, O.L. and Wolberg, W.H. "Cancer diagnosis via linear programming," SIAM News, Volume 23, Number 5, September 1990, pp. 1 & 18.
[5] Rudell, Richard. Espresso Boolean Minimization, http://www.csc.uvic.ca/~csc485c/espresso/instructions.html, July 18, 2005.