
Feature Selection and Error Tolerance for the Logical Analysis of Data



  1. Feature Selection and Error Tolerance for the Logical Analysis of Data. Craig Bowles (Cornell University) and Kathryn Davidson (University of Pennsylvania). Mentor: Endre Boros (RUTCOR).

  2. Our Goals
  • Train a computer to tell us which attributes in a medical data set are important
  • Have the computer suggest possible formulas for distinguishing healthy and sick patients
  • Achieve these goals with as much tolerance for data error as possible

  3. Training Data Set: Wisconsin Breast Cancer Database from University of Wisconsin Hospitals, Madison (Dr. William H. Wolberg)
  Sample patient vector: 1000025,5,1,1,1,2,1,3,1,1,2
  Format: ID #, 9 test results, class label
  There are 699 patients in total (458 benign, labeled "2"; 241 malignant, labeled "4")
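A minimal sketch of reading one such record, assuming the comma-separated layout above (the `parse_record` name is ours, not from the slides; the full database also marks some missing values with "?", which this sketch ignores):

```python
# Parse one WBCD record of the form shown above.
def parse_record(line):
    fields = [int(x) for x in line.strip().split(",")]
    # ID #, 9 test results, class label (2 = benign, 4 = malignant)
    return fields[0], fields[1:10], fields[10]

pid, tests, label = parse_record("1000025,5,1,1,1,2,1,3,1,1,2")
print(pid, tests, label)  # 1000025 [5, 1, 1, 1, 2, 1, 3, 1, 1] 2
```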

  4. Minimal Difference Vectors → Dualization → Maximal Error
  Difference vector: |(5 4 4 5 7 10 3 2 1) − (6 8 8 1 3 4 3 7 1)| = (1 4 4 4 4 6 0 5 0)
  *** An error ε ≥ (1 4 4 4 4 6 0 5 0) would not distinguish (5 4 4 5 7 10 3 2 1) from (6 8 8 1 3 4 3 7 1)
  Let Δ = {difference vectors} (more than 90,000 for the WBCD)
  Minimal difference vectors: δ ∈ Δ such that no other vector in Δ is less than or equal to δ in every coordinate
  Next step: input {minimal difference vectors} into the dualization algorithm
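A small sketch of forming difference vectors between every (benign, malignant) pair and filtering them to the minimal ones, assuming patients are given as 9-value tuples; the quadratic minimality filter is only illustrative and would be slow on the 90,000+ WBCD difference vectors:

```python
# Difference vectors between every (benign, malignant) pair.
def difference_vectors(benign, malignant):
    return {tuple(abs(a - b) for a, b in zip(p, q))
            for p in benign for q in malignant}

# delta is minimal if no other vector is <= delta in every coordinate.
def minimal_vectors(vectors):
    return [d for d in vectors
            if not any(e != d and all(x <= y for x, y in zip(e, d))
                       for e in vectors)]

diffs = difference_vectors([(5, 4, 4, 5, 7, 10, 3, 2, 1)],
                           [(6, 8, 8, 1, 3, 4, 3, 7, 1)])
print(minimal_vectors(diffs))  # [(1, 4, 4, 4, 4, 6, 0, 5, 0)]
```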

  5. Minimal Difference Vectors → Dualization → Maximal Error Vectors
  [Figure: 2-D toy example; minimal difference vectors in red, maximal error vectors in blue]
  Input: minimal vectors (red)
  Output: (5,0), (3,2), (2,3)
  To find the blue points (what we want), take the complement of each output, coordinate-wise (5 − coordinate on this 0-5 grid): (0,5), (2,3), (3,2)
  http://rutcor.rutgers.edu/~boros/IDM/DualizationCode.html
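The red input points were only shown in the slide's figure, so the sketch below uses a hypothetical set of minimal vectors, chosen so that a brute-force search over the 0-5 grid reproduces the blue points quoted above, and then recovers the dualization output by coordinate-wise complement:

```python
from itertools import product

# Hypothetical red (minimal difference) vectors: the slide only showed
# them in a figure, so these are chosen to reproduce its quoted output.
red = [(1, 4), (3, 3), (4, 0)]
grid = list(product(range(6), repeat=2))   # the 0..5 toy grid

def dominates(p, r):
    """True if p >= r in every coordinate."""
    return all(x >= y for x, y in zip(p, r))

# "Good" points do not dominate any red vector; the blue points are
# the maximal good points.
good = [p for p in grid if not any(dominates(p, r) for r in red)]
blue = [p for p in good
        if not any(q != p and dominates(q, p) for q in good)]
print(sorted(blue))                                   # [(0, 5), (2, 3), (3, 2)]
print(sorted(tuple(5 - c for c in p) for p in blue))  # [(2, 3), (3, 2), (5, 0)]
```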

  6. Minimal Difference Vectors → Dualization → Maximal Error Vectors
  • The output of the dualization algorithm is another set of vectors
  • For each coordinate in these vectors, take the complement (i.e., 10 − coordinate)
  • Divide each vector by 2 to find a maximal error vector (errors of ε in each of two patient vectors can combine to shift their difference by 2ε)
  • Sort these error vectors by greatest sum, most 5's, maximal minimum element, etc.
  • Choose an epsilon from the sorted lists that looks good
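A sketch of these post-processing steps, assuming attribute values on a 1-10 scale; the single input vector here is back-computed from the ε chosen on the next slide, purely for illustration:

```python
# Complement each coordinate (10 - c), halve, then sort the candidate
# error vectors a few different ways before picking an epsilon by hand.
def error_candidates(dual_output):
    eps = [tuple((10 - c) / 2 for c in v) for v in dual_output]
    by_sum = sorted(eps, key=sum, reverse=True)
    by_min = sorted(eps, key=min, reverse=True)  # maximal minimum element
    return by_sum, by_min

# Hypothetical dualization output, back-computed from the next slide's epsilon.
by_sum, by_min = error_candidates([(9, 0, 10, 0, 10, 10, 0, 0, 0)])
print(by_sum[0])  # (0.5, 5.0, 0.0, 5.0, 0.0, 0.0, 5.0, 5.0, 5.0)
```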

  7. Binarization
  For a good error, we binarize the original data.
  e.g., error ε = (0.5, 5, 0, 5, 0, 0, 5, 5, 5)
  Thresholds: Col 1 = 4, Col 3 = 7, Col 5 = 5, Col 6 = 8
  For a patient vector such as (5, 1, 1, 1, 2, 1, 3, 1, 1), we test each thresholded column to see whether the patient's value is greater or less than the threshold value, or within error of it:
  (5, 1, 1, 1, 2, 1, 3, 1, 1) → 1 0 0 0
  (4, 1, 1, 1, 2, 1, 3, 1, 1) → * 0 0 0
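A sketch of this three-valued binarization (1, 0, or * when a value is within ε of its threshold), reproducing the two example rows above; column indices are 0-based in the code:

```python
# Binarize one patient vector against the chosen thresholds and epsilon.
def binarize(patient, thresholds, eps):
    bits = []
    for j, t in sorted(thresholds.items()):  # j: 0-based column index
        v = patient[j]
        if v > t + eps[j]:
            bits.append("1")
        elif v < t - eps[j]:
            bits.append("0")
        else:
            bits.append("*")                 # within error: indeterminate
    return " ".join(bits)

eps = (0.5, 5, 0, 5, 0, 0, 5, 5, 5)
thresholds = {0: 4, 2: 7, 4: 5, 5: 8}        # columns 1, 3, 5, 6 (1-based)
print(binarize((5, 1, 1, 1, 2, 1, 3, 1, 1), thresholds, eps))  # 1 0 0 0
print(binarize((4, 1, 1, 1, 2, 1, 3, 1, 1), thresholds, eps))  # * 0 0 0
```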

  8. RESULTS FOR WISCONSIN BREAST CANCER DATA
  Error tolerance: 0.5, 5, 0, 5, 0, 0, 5, 5, 5
  Attributes and corresponding thresholds: 1 : 4, 3 : 7, 5 : 5, 6 : 8
  Total Pos/Neg entries per binary pattern:
  0 0 0 0 : 268, 2 (334, 4)
  0 0 0 1 : 1, 3 (1, 7)
  0 0 1 0 : 5, 2 (6, 3)
  0 0 1 1 : 0, 4 (0, 5)
  0 1 0 1 : 0, 2 (0, 4)
  0 1 1 1 : 0, 6 (0, 6)
  1 0 0 0 : 96, 21 (162, 23)
  1 0 0 1 : 1, 37 (1, 41)
  1 0 1 0 : 2, 16 (3, 17)
  1 0 1 1 : 1, 27 (1, 28)
  1 1 0 0 : 1, 12 (1, 14)
  1 1 0 1 : 0, 19 (0, 21)
  1 1 1 0 : 0, 24 (0, 24)
  1 1 1 1 : 2, 52 (2, 52)

  9. Formula for WBCD:
  Let P = Col 1 ≥ 4, Q = Col 3 ≥ 7, R = Col 5 ≥ 5, S = Col 6 ≥ 8
  Then we can characterize most (432/444) positives with: ¬Q ∧ ¬R ∧ ¬S
  Some example patient vectors:
  Negatives: (8,7,5,10,7,9,5,5,4), (7,4,6,4,6,1,4,3,1)
  Positives: (4,1,1,1,2,1,2,1,1), (4,1,1,1,2,1,3,1,1)
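A sketch of this candidate formula as code, assuming the comparisons are "greater than or equal" (the comparison glyphs were lost in transcription); it reproduces the classifications of the four example vectors:

```python
# Evaluate the candidate formula ¬Q ∧ ¬R ∧ ¬S on one patient vector
# ("positive" = benign in this deck).
def is_positive(v):
    Q = v[2] >= 7   # Col 3 (0-based index 2)
    R = v[4] >= 5   # Col 5
    S = v[5] >= 8   # Col 6
    return (not Q) and (not R) and (not S)

for v in [(8, 7, 5, 10, 7, 9, 5, 5, 4), (7, 4, 6, 4, 6, 1, 4, 3, 1),
          (4, 1, 1, 1, 2, 1, 2, 1, 1), (4, 1, 1, 1, 2, 1, 3, 1, 1)]:
    print(v, is_positive(v))  # False, False, True, True
```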

  10. More to do:
  • Test our procedure on different databases
  • Study heuristic methods for threshold selection
  • In general, explore ways to use more flexible error vectors and/or thresholds

  References:
  [1] Boros, E., Hammer, P.L., Ibaraki, T., and Kogan, A. "Logical Analysis of Numerical Data." Math. Programming 79 (1997), 163-190.
  [2] Boros, E., Ibaraki, T., and Makino, K. "Variations on Extending Partially Defined Boolean Functions with Missing Bits." June 6, 2000.
  [3] Boros, E. Dualization code. http://rutcor.rutgers.edu/~boros/IDM/DualizationCode.html (accessed July 1, 2005).
  [4] Mangasarian, O.L. and Wolberg, W.H. "Cancer Diagnosis via Linear Programming." SIAM News 23(5), September 1990, pp. 1 and 18.
  [5] Rudell, R. Espresso Boolean Minimization. http://www.csc.uvic.ca/~csc485c/espresso/instructions.html (accessed July 18, 2005).
