Feature Selection and Error Tolerance for the Logical Analysis of Data
Craig Bowles (Cornell University) and Kathryn Davidson (University of Pennsylvania)
Mentor: Endre Boros, RUTCOR
Our Goals • Train a computer to tell us which attributes in a medical data set are important • Have the computer suggest possible formulas for distinguishing healthy and sick patients • Achieve these goals with as much tolerance for data error as possible
Training Data Set: Wisconsin Breast Cancer Database from University of Wisconsin Hospitals, Madison (Dr. William H. Wolberg)
Sample patient vector: 1000025,5,1,1,1,2,1,3,1,1,2 (ID #, 9 test results, class distinction)
There are 699 patients in total (458 benign "2", 241 malignant "4")
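The record layout above (ID, nine attribute values, class label) can be sketched as a small parser; the function name is our own, not from the slides:

```python
def parse_record(line):
    """Split a comma-separated WBCD line into (id, attributes, label).
    Layout per the slides: ID #, 9 test results, class (2 = benign, 4 = malignant)."""
    fields = [int(x) for x in line.split(",")]
    return fields[0], fields[1:-1], fields[-1]

pid, attrs, label = parse_record("1000025,5,1,1,1,2,1,3,1,1,2")
# pid == 1000025, attrs == [5, 1, 1, 1, 2, 1, 3, 1, 1], label == 2
```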
Minimal Difference Vectors → Dualization → Maximal Error Vectors
Difference vector: |(5 4 4 5 7 10 3 2 1) − (6 8 8 1 3 4 3 7 1)| = (1 4 4 4 4 6 0 5 0)
***An error of (1 4 4 4 4 6 0 5 0) would not distinguish (5 4 4 5 7 10 3 2 1) from (6 8 8 1 3 4 3 7 1)
Let D = {difference vectors between positive and negative patients} (over 90,000 for the WBCD)
Minimal difference vectors: those d in D such that there is no other vector in D that is less than or equal to d in every coordinate
Next step: input {minimal difference vectors} into the dualization algorithm
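The two definitions above (coordinate-wise absolute difference, then Pareto-minimal elements of the resulting set) can be sketched as follows; function names are ours, not from the slides:

```python
def difference_vector(p, n):
    """Coordinate-wise absolute difference of a positive and a negative patient vector."""
    return tuple(abs(a - b) for a, b in zip(p, n))

def minimal_vectors(vectors):
    """Keep only the vectors d such that no other vector in the set
    is less than or equal to d in every coordinate."""
    def dominates(a, b):
        # a dominates b if a <= b coordinate-wise and a != b
        return a != b and all(x <= y for x, y in zip(a, b))
    vs = set(vectors)
    return {d for d in vs if not any(dominates(e, d) for e in vs)}

# The slide's example pair:
d = difference_vector((5, 4, 4, 5, 7, 10, 3, 2, 1), (6, 8, 8, 1, 3, 4, 3, 7, 1))
# d == (1, 4, 4, 4, 4, 6, 0, 5, 0)
```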
Minimal Difference Vectors → Dualization → Maximal Error Vectors (2-D example)
Input: the minimal difference vectors (red points in the figure)
Output of the dualization algorithm: (5,0), (3,2), (2,3)
To find the blue points (what we want), subtract each coordinate of the outputs from the grid maximum, 5: (0,5), (2,3), (3,2)
Dualization code: http://rutcor.rutgers.edu/~boros/IDM/DualizationCode.html
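The figure's red points are not recoverable from the text, so here is a brute-force sketch of the same idea on illustrative minimal vectors {(1,3), (3,1)} in the grid [0,5]²: mark every grid point that dominates some minimal vector, then return the maximal points of the complement. (A real dualization algorithm, like the one at the URL above, avoids enumerating the grid.)

```python
from itertools import product

def brute_force_dual(minimal_vecs, box=5):
    """Maximal points of the grid [0, box]^d that do NOT dominate
    any of the given minimal vectors."""
    d = len(next(iter(minimal_vecs)))
    def covered(p):
        # p dominates some minimal vector -> p is in the upward-closed set
        return any(all(p[i] >= m[i] for i in range(d)) for m in minimal_vecs)
    complement = [p for p in product(range(box + 1), repeat=d) if not covered(p)]
    def below(a, b):
        return a != b and all(x <= y for x, y in zip(a, b))
    return sorted(p for p in complement
                  if not any(below(p, q) for q in complement))

brute_force_dual({(1, 3), (3, 1)})  # -> [(0, 5), (2, 2), (5, 0)]
```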
Minimal Difference Vectors → Dualization → Maximal Error Vectors
• The output of the dualization algorithm is another set of vectors
• For each coordinate in these vectors, take the complement (i.e., 10 − coordinate)
• Divide each vector by 2 to find a maximal error vector
• Sort these error vectors by greatest sum, most 5's, maximal minimum element, etc.
• Choose an epsilon from the sorted lists that looks good
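The complement-and-halve steps above can be sketched as follows (sorting here only by greatest sum, one of the criteria listed). The example input (9, 0, 10, 0, 10, 10, 0, 0, 0) is chosen so that the result is the error vector (0.5, 5, 0, 5, 0, 0, 5, 5, 5) used on the next slide; it is our reconstruction, not taken from the slides:

```python
def maximal_error_vectors(dual_outputs, max_val=10):
    """Complement each coordinate (max_val - c) and halve it,
    then sort candidate error vectors by descending coordinate sum."""
    errors = [tuple((max_val - c) / 2 for c in v) for v in dual_outputs]
    return sorted(errors, key=sum, reverse=True)

maximal_error_vectors([(9, 0, 10, 0, 10, 10, 0, 0, 0)])
# -> [(0.5, 5.0, 0.0, 5.0, 0.0, 0.0, 5.0, 5.0, 5.0)]
```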
Binarization
For a good error vector, we binarize the original data, e.g.:
Error = (0.5, 5, 0, 5, 0, 0, 5, 5, 5)
Thresholds: Col 1 = 4, Col 3 = 7, Col 5 = 5, Col 6 = 8
For a patient (5, 1, 1, 1, 2, 1, 3, 1, 1), we test each thresholded column to see whether the patient's value is greater or less than the threshold, or within the error (marked *):
(5, 1, 1, 1, 2, 1, 3, 1, 1) → 1 0 0 0
(4, 1, 1, 1, 2, 1, 3, 1, 1) → * 0 0 0
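The binarization rule above can be sketched directly; the function name and the exact tie-breaking (within-error means |value − threshold| ≤ error) are our assumptions:

```python
def binarize(patient, thresholds, error):
    """Map each thresholded attribute to '1' (above threshold by more than the
    allowed error), '0' (below by more than the error), or '*' (within error)."""
    bits = []
    for col, t in thresholds.items():   # columns are 1-indexed on the slides
        v, eps = patient[col - 1], error[col - 1]
        if abs(v - t) <= eps:
            bits.append("*")
        elif v > t:
            bits.append("1")
        else:
            bits.append("0")
    return "".join(bits)

thresholds = {1: 4, 3: 7, 5: 5, 6: 8}
error = (0.5, 5, 0, 5, 0, 0, 5, 5, 5)
binarize((5, 1, 1, 1, 2, 1, 3, 1, 1), thresholds, error)  # -> "1000"
binarize((4, 1, 1, 1, 2, 1, 3, 1, 1), thresholds, error)  # -> "*000"
```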
RESULTS FOR WISCONSIN BREAST CANCER DATA
Error tolerance: (0.5, 5, 0, 5, 0, 0, 5, 5, 5)
Attributes and corresponding thresholds: 1 : 4, 3 : 7, 5 : 5, 6 : 8
Total pos/neg entries per binary pattern (parenthesized counts as in the original):
1 0 0 0 : 96, 21 (162, 23)
1 0 0 1 : 1, 37 (1, 41)
0 1 1 1 : 0, 6 (0, 6)
1 0 1 0 : 2, 16 (3, 17)
1 0 1 1 : 1, 27 (1, 28)
1 1 0 0 : 1, 12 (1, 14)
1 1 0 1 : 0, 19 (0, 21)
1 1 1 0 : 0, 24 (0, 24)
1 1 1 1 : 2, 52 (2, 52)
0 0 0 0 : 268, 2 (334, 4)
0 0 0 1 : 1, 3 (1, 7)
0 0 1 0 : 5, 2 (6, 3)
0 0 1 1 : 0, 4 (0, 5)
0 1 0 1 : 0, 2 (0, 4)
Formula for WBCD
Let P = (Col 1 ≥ 4), Q = (Col 3 ≥ 7), R = (Col 5 ≥ 5), S = (Col 6 ≥ 8)
Then we can characterize most (432/444) positives with: ¬Q ∧ ¬R ∧ ¬S
Some example patient vectors:
Negatives: (8,7,5,10,7,9,5,5,4), (7,4,6,4,6,1,4,3,1)
Positives: (4,1,1,1,2,1,2,1,1), (4,1,1,1,2,1,3,1,1)
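The formula can be checked on the example vectors above; this is our own sketch, assuming the dropped relation symbols on the slide are ≥:

```python
def classify(patient):
    """Evaluate the formula not-Q and not-R and not-S on a 9-attribute vector,
    where Q = (Col 3 >= 7), R = (Col 5 >= 5), S = (Col 6 >= 8)."""
    Q = patient[2] >= 7   # columns are 1-indexed on the slides
    R = patient[4] >= 5
    S = patient[5] >= 8
    return (not Q) and (not R) and (not S)

classify((4, 1, 1, 1, 2, 1, 2, 1, 1))   # positive example -> True
classify((8, 7, 5, 10, 7, 9, 5, 5, 4))  # negative example -> False
```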
More to do:
• Test our procedure on different databases
• Study heuristic methods for threshold selection
• In general, explore ways to use more flexible error vectors and/or thresholds

References:
[1] Boros, E., Hammer, P.L., Ibaraki, T., and Kogan, A. "Logical Analysis of Numerical Data," Math. Programming, 79 (1997), 163-190.
[2] Boros, E., Ibaraki, T., and Makino, K. "Variations on Extending Partially Defined Boolean Functions with Missing Bits," June 6, 2000.
[3] Boros, E. http://rutcor.rutgers.edu/~boros/IDM/DualizationCode.html, July 1, 2005.
[4] Mangasarian, O.L. and Wolberg, W.H. "Cancer diagnosis via linear programming," SIAM News, Volume 23, Number 5, September 1990, pp. 1 & 18.
[5] Rudell, Richard. Espresso Boolean Minimization, http://www.csc.uvic.ca/~csc485c/espresso/instructions.html, July 18, 2005.