Predictive Modeling with Heterogeneous Sources

Predictive Modeling with Heterogeneous Sources Xiaoxiao Shi1 Qi Liu2 Wei Fan3 Qiang Yang4 Philip S. Yu1 1 University of Illinois at Chicago 2 Tongji University, China 3 IBM T. J. Watson Research Center 4Hong Kong University of Science and Technology

Why learning with heterogeneous sources? Standard Supervised Learning Training (labeled) Test (unlabeled) Classifier 85.5% New York Times New York Times

Why heterogeneous sources? How to improve the performance? In Reality… Training (labeled) Test (unlabeled) 47.3% Labeled data are insufficient! New York Times New York Times

Why heterogeneous sources? Labeled data from other sources Target domain test (unlabeled) 82.6% 47.3% New York Times Reuters • Different distributions • Different outputs • Different feature spaces

Real world examples • Social Network: • Can various bookmarking systems help predict social tags for a new system given that their outputs (social tags) and data (documents) are different? Wikipedia ODP Backflip Blink …… ? 4/18

Real world examples • Applied Sociology: • Can the suburban housing price census data help predict the downtown housing prices? ? #rooms #bathrooms #windows price 5 2 12 XXX 6 3 11 XXX #rooms #bathrooms #windows price 2 1 4 XXXXX 4 2 5 XXXXX 5/18

Other examples • Bioinformatics • Previous years’ flu data  new swine flu • Drug efficacy data against breast cancer  drug data against lung cancer • …… • Intrusion detection • Existing types of intrusions  unknown types of intrusions • Sentiment analysis • Review from SDM Review from KDD 6/18

Learning with Heterogeneous Sources • The paper mainly attacks two sub-problems: • Heterogeneous data distributions • Clustering based KL divergence and a corresponding sampling technique • Heterogeneous outputs (to regression problem) • Unifying outputs via preserving similarity. 7/18

Learning with Heterogeneous Sources • General Framework Source data Unifying data distributions Unifying outputs Target data Target data Source data 8/18

Unifying Data Distributions • Basic idea: • Combine the source and target data and perform clustering. • Select the clusters in which the target and source data are similarly distributed, evaluated by KL divergence. 9/18

An Example T D Adaptive Clustering Combined Data 10/18

Unifying Outputs • Basic idea: • Generate initial outputs according to the regression model • For the instances similar in the original output space, make their new outputs closer. 11/18

Modification Modification Initial Outputs Initial Outputs 21.25 31.75 37 16 26.5 12/18

Experiment • Bioinformatics data set: 13/18

Experiment 14/18

Experiment • Applied sociology data set: 15/18

Experiment 16/18

Conclusions • Problem: Learning with Heterogeneous Sources: • Heterogeneous data distributions • Heterogeneous outputs • Solution: • Clustering based KL divergence help perform sampling • Similarity preserving output generation help unify outputs

Thanks!

Predictive Modeling with Heterogeneous Sources