Error Tolerance and Feature Selection for the Logical Analysis of Data

Presenter: Kathryn Davidson, University of Pennsylvania
Mentor: Dr. Endre Boros, RUTCOR

The Problem

Before, doctors used a small amount of information for diagnoses:
• a few tests on the patient
• experience with past patients
• limited acquired outside knowledge

Now, available information is too large for humans to analyze:
• large-scale laboratory experiments
• gene mapping

A Solution
• Partition the data into two classes (for example, one healthy, one sick).
• Use these binary partitions to write a separating function which will place future patients into one of the two categories.

The Catch
• Medical data is subject to error. If we partition the data too strictly, we risk incorrect manipulation.

Our Goal
• We’ll try to incorporate large tolerance for error while producing the most useful formulas.

Example data set:

Red shows the positive data points. Blue shows the negative data points.

Another way to write this information: We can make binary columns, each cut off at a value that is halfway between a positive and a negative point. This means that a full binary table for this data will have 4 (positives) x 3 (negatives) x 2 (attributes) = 24 columns.
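The column count can be sketched in a few lines of Python. The data points below are hypothetical (the example data set from the slides is not reproduced here), and the function names `candidate_cuts` and `binarize` are illustrative only. The sketch assumes one cut per positive/negative pair and per attribute, placed at the midpoint, and that a column holds 1 when the value is at or above the cut.

```python
from itertools import product

# Hypothetical data: 4 positive and 3 negative points, 2 attributes (A1, A2).
positives = [(1.0, 80.0), (2.5, 120.0), (3.0, 60.0), (4.2, 150.0)]
negatives = [(0.5, 100.0), (2.0, 40.0), (3.6, 130.0)]

def candidate_cuts(positives, negatives):
    """Return one (attribute index, cut value) per positive/negative pair and attribute."""
    n_attrs = len(positives[0])
    cuts = []
    for p, q in product(positives, negatives):
        for j in range(n_attrs):
            if p[j] != q[j]:  # equal values cannot be separated on this attribute
                cuts.append((j, (p[j] + q[j]) / 2))
    return cuts

def binarize(point, cuts):
    """One binary entry per cut: 1 if the value is at or above the cut, else 0."""
    return [1 if point[j] >= t else 0 for j, t in cuts]

cuts = candidate_cuts(positives, negatives)
print(len(cuts))  # 4 positives x 3 negatives x 2 attributes
print(binarize(positives[0], cuts))
```

Some of these columns may coincide or be redundant, which is why a reduced table is usually enough in practice.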

What if we want to allow for error in the measuring of attributes?
• The separation lines in the data graph will be surrounded by an “unsure” zone
• Our binary chart starts to have missing pieces of information
• More error means larger unsure zones and more missing information
• How much error can we allow and still correctly separate the positive from the negative?
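A minimal sketch of how an unsure zone turns binary entries into missing ones. The names `tolerant_bit` and `tolerant_row`, the sample point, and the cut values are hypothetical. The sketch assumes the unsure zone extends `eps` on either side of the cut and that a value exactly on the zone boundary still counts as known.

```python
def tolerant_bit(value, cut, eps):
    """Binarize `value` against `cut`, leaving an unsure zone of half-width `eps`."""
    if value - cut >= eps:
        return 1
    if cut - value >= eps:
        return 0
    return None  # inside the unsure zone: the entry is missing

def tolerant_row(point, cuts, errors):
    """Binarize a point against (attribute index, cut) pairs with per-attribute errors."""
    return [tolerant_bit(point[j], t, errors[j]) for j, t in cuts]

# Hypothetical point and cuts: one cut on A1, one on A2.
cuts = [(0, 2.0), (1, 100.0)]
row = tolerant_row((3.0, 110.0), cuts, errors=(0.6, 26.0))
print(row)  # the A2 entry falls inside the unsure zone and is reported as None
```

Larger values in `errors` widen the zones, so more entries come back as `None`.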

If we allow an error of 0.6 for A1 and 26 for A2, our (reduced) binary table will look like this:

Are we able to allow this much error and still arrive at a formula that correctly categorizes each positive and negative entry?

Answer: No. Since rows a and g can no longer be distinguished, we cannot have a reliable separating function.

Theorem: There exists a robust separating function if and only if for each pair of a positive row $\alpha$ and a negative row $\beta$ there is an index $i$ such that $\alpha_i \neq \beta_i$, $\alpha_i \neq *$, and $\beta_i \neq *$, where $*$ marks a missing entry.
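The condition in the theorem can be checked directly. The sketch below is one possible way to do it, with `None` standing in for a missing entry. The function names and the sample rows are hypothetical and are not taken from the slides.

```python
from itertools import product

def conflicting_pairs(positive_rows, negative_rows):
    """Return (positive index, negative index) pairs with no column where
    both entries are known and different."""
    conflicts = []
    for (pi, p), (ni, n) in product(enumerate(positive_rows), enumerate(negative_rows)):
        separated = any(
            a is not None and b is not None and a != b
            for a, b in zip(p, n)
        )
        if not separated:
            conflicts.append((pi, ni))
    return conflicts

def has_robust_separator(positive_rows, negative_rows):
    """True when every positive/negative pair satisfies the theorem's condition."""
    return not conflicting_pairs(positive_rows, negative_rows)

# Hypothetical binary rows with missing entries (None).
pos = [[1, None, 0], [1, 1, None]]
neg = [[0, 0, 0], [None, 1, 0]]
print(has_robust_separator(pos, neg))
print(conflicting_pairs(pos, neg))
```

Returning the conflicting pairs, rather than only a yes/no answer, shows which rows became indistinguishable, as rows a and g did in the example.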

General Procedure:
1. Create the full binary table for the given input data.
2. Find the combinations of attribute error (error vectors) that are maximal.
   Example: {(infinity, 2.5), (0.7, 16), (0.55, 25.5), (0.05, infinity)}

3. For the most promising error vectors, create a binary table that involves only the columns that are relevant for distinguishing positive from negative with that error tolerance.
4. Examine which attributes’ columns were used, how they were used, and how much error is allowed when we use them.

References:
[1] Boros, E., Hammer, P.L., Ibaraki, T., and Kogan, A., "Logical Analysis of Numerical Data," Math. Programming, 79 (1997), 163-190.
[2] Boros, E., Ibaraki, T., and Makino, K., "Variations on Extending Partially Defined Boolean Functions with Missing Bits," June 6, 2000.
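Step 2 of the general procedure (finding maximal error vectors) might be sketched as a brute-force search. This is not the method used in the presentation; it is an illustration under stated assumptions. It assumes a positive/negative pair can be separated on attribute j with error e_j when the two values differ and half their distance is at least e_j, and that an error of infinity means the attribute is ignored. The data and function names are hypothetical.

```python
import math
from itertools import product

def half_gaps(positives, negatives, j):
    """Half the distance between each positive/negative pair on attribute j (zeros dropped)."""
    return {abs(p[j] - q[j]) / 2 for p, q in product(positives, negatives) if p[j] != q[j]}

def is_feasible(errors, positives, negatives):
    """Every pair must be separable on at least one attribute at the given errors."""
    return all(
        any(p[j] != q[j] and abs(p[j] - q[j]) / 2 >= e for j, e in enumerate(errors))
        for p, q in product(positives, negatives)
    )

def maximal_error_vectors(positives, negatives):
    """Enumerate candidate error vectors and keep the feasible, undominated ones."""
    n_attrs = len(positives[0])
    grids = [sorted(half_gaps(positives, negatives, j)) + [math.inf] for j in range(n_attrs)]
    feasible = [e for e in product(*grids) if is_feasible(e, positives, negatives)]

    def dominated(e):
        # e is dominated if another feasible vector is at least as large everywhere
        return any(f != e and all(fj >= ej for fj, ej in zip(f, e)) for f in feasible)

    return sorted(e for e in feasible if not dominated(e))

# Hypothetical data: 2 positives, 1 negative, 2 attributes.
positives = [(1.0, 10.0), (4.0, 30.0)]
negatives = [(2.0, 50.0)]
result = maximal_error_vectors(positives, negatives)
print(result)
```

The search is exponential in the number of attributes, so it is only suitable for small illustrations like this one.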
