Type Independent Correction of Sample Selection Bias via Structural Discovery and Re-balancing • Jiangtao Ren, Xiaoxiao Shi, Wei Fan, Philip S. Yu
What is sample selection bias? • Inductive learning: training data (x,y) is sampled from the universe of examples. • In many applications, training data (x,y) is not sampled randomly. • Insurance and mortgage data: outcomes are known only for the people who were given a policy. • School data: students self-select.
Ubiquitous • Loan Approval • Drug screening • Weather forecasting • Ad Campaign • Fraud Detection • User Profiling • Biomedical Informatics • Intrusion Detection • Insurance • etc
Different types of sample selection bias • (x,y) can be selected in different ways. • S=1 denotes that (x,y) is chosen. • S is independent of x and y: completely random sample (no bias). • S depends on y but not on x: class bias. • S depends on x but not on y: feature bias. • S depends on both x and y: complete bias (both class and feature).
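The four cases above can be made concrete with a small simulation. This is only a sketch: the toy universe, the function names (`selection_prob`, `draw_sample`, `positive_rate`) and all selection probabilities are assumptions chosen for illustration, not values from the talk.

```python
import random

def selection_prob(x, y, bias):
    """Hypothetical P(S=1 | x, y) for each bias type; the numbers are arbitrary."""
    if bias == "none":       # S independent of x and y
        return 0.5
    if bias == "class":      # S depends on y only
        return 0.8 if y == 1 else 0.2
    if bias == "feature":    # S depends on x only
        return 0.8 if x > 0.5 else 0.2
    if bias == "complete":   # S depends on both x and y
        return 0.9 if (x > 0.5 and y == 1) else 0.1
    raise ValueError(f"unknown bias type: {bias}")

def draw_sample(universe, bias, rng):
    """Keep each (x, y) with probability P(S=1 | x, y)."""
    return [(x, y) for x, y in universe
            if rng.random() < selection_prob(x, y, bias)]

def positive_rate(data):
    """Fraction of examples with y = 1."""
    return sum(y for _, y in data) / len(data)

rng = random.Random(0)
# Toy universe (an assumption): x uniform on [0, 1], P(y=1 | x) = x.
universe = []
for _ in range(20000):
    x = rng.random()
    y = 1 if rng.random() < x else 0
    universe.append((x, y))

for bias in ("none", "class", "feature", "complete"):
    sample = draw_sample(universe, bias, rng)
    print(bias, round(positive_rate(sample), 3))
```

In this toy setup the random sample keeps the class ratio of the universe, while the biased samples shift it, which is the distortion a learner trained on the selected data would inherit.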
Our method • Original Dataset → Structural Discovery → Structural Re-balancing → Corrected Dataset
Our method • Structural Discovery via automatic clustering • Key idea: • Binary divide: split the data into two clusters, recursively. • Stop dividing a cluster when most of its labeled data have the same label.
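The binary-divide idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not name the clustering algorithm, so a plain 2-means split is assumed here, and `min_purity`, `min_size`, and every function name are hypothetical.

```python
import random
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def two_means(points, rng, iters=20):
    """Split points into two groups with a plain 2-means loop (an assumption)."""
    centers = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for i, p in enumerate(points):
            nearer = 0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
            groups[nearer].append(i)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. duplicate points
        centers = [mean([points[i] for i in g]) for g in groups]
    return groups

def purity(labels):
    """Share of labeled examples with the majority label (None = unlabeled)."""
    known = [l for l in labels if l is not None]
    if not known:
        return 1.0  # assumption: with no labels there is no reason to divide
    return Counter(known).most_common(1)[0][1] / len(known)

def discover(points, labels, rng, min_purity=0.9, min_size=4):
    """Return clusters as lists of indices into points."""
    done, todo = [], [list(range(len(points)))]
    while todo:
        idx = todo.pop()
        if len(idx) < min_size or purity([labels[i] for i in idx]) >= min_purity:
            done.append(idx)  # stop dividing: small or mostly one label
            continue
        left, right = two_means([points[i] for i in idx], rng)
        if not left or not right:
            done.append(idx)
            continue
        todo.append([idx[j] for j in left])
        todo.append([idx[j] for j in right])
    return done

rng = random.Random(0)
# Two separated blobs; only every fifth point carries a label.
points = ([(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(50)]
          + [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(50)])
labels = ([0 if i % 5 == 0 else None for i in range(50)]
          + [1 if i % 5 == 0 else None for i in range(50)])
clusters = discover(points, labels, rng)
print([len(c) for c in clusters])
```

Note that unlabeled points take part in the clustering; only the stopping test looks at labels.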
Our method • Structural Re-balancing via sample selection • Key idea: (1) Select the same proportion of examples from each cluster. (2) Select confident and representative examples. (3) Label the unlabeled examples by their neighbors.
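The three steps above can be sketched roughly as below. The slides do not state the selection criterion itself, so two stand-ins are assumed here and should not be read as the authors' rule: "representative" is approximated by closeness to the cluster centroid, and an unlabeled example takes the majority label of the labeled examples in its cluster. The names `rebalance` and `proportion` are hypothetical.

```python
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def rebalance(points, labels, clusters, proportion=0.5):
    """Pick the same fraction of examples from every cluster and label them."""
    selected = []
    for idx in clusters:
        known = [labels[i] for i in idx if labels[i] is not None]
        if not known:
            continue  # no labeled neighbor to borrow a label from
        majority = Counter(known).most_common(1)[0][0]
        center = centroid([points[i] for i in idx])
        # Stand-in for "representative": closest to the centroid first.
        ranked = sorted(idx, key=lambda i: dist2(points[i], center))
        k = max(1, round(proportion * len(idx)))  # same proportion per cluster
        for i in ranked[:k]:
            y = labels[i] if labels[i] is not None else majority
            selected.append((points[i], y))
    return selected

# Cluster A has 8 points and 4 labels; cluster B has 4 points and 1 label,
# so the labeled data over-represents A (4:1) relative to its size (2:1).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (0.5, 0.5), (0.4, 0.6), (0.6, 0.4), (0.5, 0.4),
          (5, 5), (5, 6), (6, 5), (5.4, 5.3)]
labels = [0, 0, 0, 0, None, None, None, None, 1, None, None, None]
clusters = [list(range(8)), list(range(8, 12))]

corrected = rebalance(points, labels, clusters)
print(Counter(y for _, y in corrected))
```

In this toy case the corrected set holds four examples from cluster A and two from cluster B, matching the 2:1 cluster sizes instead of the 4:1 ratio of the original labels.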
Our method • Theoretical analysis: • Lemma 3.1 explains why selecting the same proportion of examples from each cluster can reduce sample selection bias. • Lemma 3.2 derives a criterion for selecting confident examples.
Feature Bias • Figure: accuracy of corrected minus accuracy of original
Class Bias • Figure: accuracy of corrected minus accuracy of original
Complete Bias • Figure: corrected vs. original
Advantages: 1. Type independent 2. Model independent 3. Straightforward • The experiment datasets and the related MATLAB code can be downloaded at ftp://202.116.65.69/sxx/SDM08 or http://www.weifan.info