Error Tolerance and Feature Selection for the Logical Analysis of Data
Presenter: Kathryn Davidson, University of Pennsylvania
Mentor: Dr. Endre Boros, RUTCOR
The Problem
Before, doctors used a small amount of information for diagnoses:
• a few tests on the patient
• experience with past patients
• limited acquired outside knowledge
Now, the available information is too much for humans to analyze:
• large-scale laboratory experiments
• gene mapping
A Solution
• Partition the data into two classes (for example, one healthy, one sick).
• Use these binary partitions to write a separating function which will place future patients into one of the two categories.
The Catch
• Medical data is subject to error. If we partition the data too strictly, we risk incorrect classification.
Our Goal
• We'll try to incorporate a large tolerance for error while producing the most useful formulas.
Red shows the positive data points. Blue shows the negative data points.
Another way to write this information: we can make binary columns, with a cut point at each value that lies halfway between a positive and a negative point. This means that a full binary table for this data will have 4 (positives) x 3 (negatives) x 2 (attributes) = 24 columns.
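The column count can be illustrated with a minimal Python sketch. The data points below are hypothetical (the slide's actual values are not reproduced here), and the helper names `cut_points` and `binarize` are made up for this illustration. Each (positive, negative, attribute) triple contributes one cut at the midpoint, so 4 positives, 3 negatives and 2 attributes give 24 columns.

```python
from itertools import product

# Hypothetical data (not from the slides): 4 positive and 3 negative
# points, each described by 2 numerical attributes (A1, A2).
positives = [(1.0, 40.0), (2.0, 90.0), (3.5, 60.0), (4.0, 20.0)]
negatives = [(0.5, 70.0), (2.5, 10.0), (5.0, 50.0)]


def cut_points(positives, negatives):
    """Return one (attribute index, threshold) per positive/negative pair
    and attribute, with the threshold halfway between the two values."""
    n_attrs = len(positives[0])
    cuts = []
    for p, n in product(positives, negatives):
        for j in range(n_attrs):
            if p[j] != n[j]:  # identical values cannot be separated
                cuts.append((j, (p[j] + n[j]) / 2))
    return cuts


def binarize(point, cuts):
    """Binary row: 1 if the attribute value is at or above the cut, else 0."""
    return [1 if point[j] >= t else 0 for j, t in cuts]


cuts = cut_points(positives, negatives)
pos_rows = [binarize(p, cuts) for p in positives]
neg_rows = [binarize(n, cuts) for n in negatives]
print(len(cuts))  # number of columns in the full binary table
```

In this sketch every positive row differs from every negative row in at least the column built from that pair's own midpoint, which is why the full table always separates the two classes when no error is allowed.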
What if we want to allow for error in the measuring of attributes?
• The separation lines in the data graph will be surrounded by an "unsure" zone
• Our binary chart starts to have missing pieces of information
• More error means larger unsure zones and more missing information
• How much error can we allow and still correctly separate the positive from the negative?
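One way to picture the "unsure" zone is the following sketch. It assumes that a binary entry is known only when the measured value lies at least the allowed error away from the cut point, and is marked missing (`None`) otherwise. The function name, the cut points and the sample point are hypothetical; only the error values 0.6 and 26 are taken from the example in this talk.

```python
def binarize_with_error(point, cuts, eps):
    """Binarize a point against (attribute index, threshold) cuts.

    eps[j] is the allowed measurement error for attribute j. An entry is
    1 or 0 only if the value is clearly above or below the cut; values
    inside the unsure zone around the cut become None (missing).
    """
    row = []
    for j, t in cuts:
        if point[j] >= t + eps[j]:
            row.append(1)
        elif point[j] <= t - eps[j]:
            row.append(0)
        else:
            row.append(None)  # within the unsure zone
    return row


# Hypothetical cut points: one on A1 (index 0), one on A2 (index 1).
cuts = [(0, 1.5), (1, 55.0)]
point = (2.0, 60.0)

small_error = binarize_with_error(point, cuts, eps=(0.1, 1.0))
large_error = binarize_with_error(point, cuts, eps=(0.6, 26.0))
print(small_error, large_error)
```

With the small error both entries stay known, while the larger error swallows both values into the unsure zones, so the row loses all of its information.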
If we allow an error of 0.6 for A1 and 26 for A2, our (reduced) binary table will look like this: Are we able to allow this much error and still arrive at a formula that correctly categorizes each positive and negative entry?
Answer: No. Since rows a and g can no longer be distinguished, we cannot have a reliable separating function.
Theorem: There exists a robust separating function if and only if for each pair of a positive row $p$ and a negative row $n$ there is an index $i$ such that $p_i \neq n_i$, $p_i \neq *$, and $n_i \neq *$, where $*$ denotes a missing entry.
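The condition can be checked directly on a binary table with missing entries. The sketch below assumes the theorem's condition is read as: every positive/negative pair has at least one column where both entries are known and differ. Missing entries are represented by `None`, and the function names and example rows are hypothetical.

```python
def distinguishable(p_row, n_row):
    """True if some column has known, different entries in both rows."""
    return any(
        a is not None and b is not None and a != b
        for a, b in zip(p_row, n_row)
    )


def has_robust_separation(pos_rows, neg_rows):
    """True if every positive row is distinguishable from every negative row."""
    return all(distinguishable(p, n) for p in pos_rows for n in neg_rows)


# Hypothetical tables; None marks an entry lost to the unsure zone.
pos_rows = [[1, None, 0], [1, 1, None]]
neg_rows = [[0, 0, 0], [None, 0, 1]]
separable = has_robust_separation(pos_rows, neg_rows)

# Here the only differing columns each contain a missing entry.
blurred = has_robust_separation([[None, 1]], [[0, None]])
print(separable, blurred)
```

The second call mirrors the situation of rows a and g: once the error tolerance hides every column in which two rows differ, no assignment of the missing bits can be ruled out, and the pair cannot be reliably separated.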
General Procedure:
1. Create the full binary table for the given input data.
2. Find the combinations of attribute error (error vectors) that are maximal. Example: {(infinity, 2.5), (0.7, 16), (0.55, 25.5), (0.05, infinity)}
3. For the most promising error vectors, create a binary table that involves only the columns that are relevant for distinguishing positive from negative with that error tolerance.
4. Examine which attributes' columns were used, how they were used, and how much error is allowed when we use them.

References:
[1] Boros, E., Hammer, P.L., Ibaraki, T., and Kogan, A., "Logical Analysis of Numerical Data," Math. Programming, 79 (1997), 163-190.
[2] Boros, E., Ibaraki, T., and Makino, K., "Variations on Extending Partially Defined Boolean Functions with Missing Bits," June 6, 2000.