190 likes | 357 Views
Predictive Modeling with Heterogeneous Sources. Xiaoxiao Shi 1 Qi Liu 2 Wei Fan 3 Qiang Yang 4 Philip S. Yu 1 1 University of Illinois at Chicago 2 Tongji University, China 3 IBM T. J. Watson Research Center 4 Hong Kong University of Science and Technology.
E N D
Predictive Modeling with Heterogeneous Sources Xiaoxiao Shi1 Qi Liu2 Wei Fan3 Qiang Yang4 Philip S. Yu1 1 University of Illinois at Chicago 2 Tongji University, China 3 IBM T. J. Watson Research Center 4Hong Kong University of Science and Technology
Why learning with heterogeneous sources? Standard Supervised Learning Training (labeled) Test (unlabeled) Classifier 85.5% New York Times New York Times
Why heterogeneous sources? How to improve the performance? In Reality… Training (labeled) Test (unlabeled) 47.3% Labeled data are insufficient! New York Times New York Times
Why heterogeneous sources? Labeled data from other sources Target domain test (unlabeled) 82.6% 47.3% New York Times Reuters • Different distributions • Different outputs • Different feature spaces
Real world examples • Social Network: • Can various bookmarking systems help predict social tags for a new system given that their outputs (social tags) and data (documents) are different? Wikipedia ODP Backflip Blink …… ? 4/18
Real world examples • Applied Sociology: • Can the suburban housing price census data help predict the downtown housing prices? ? #rooms #bathrooms #windows price 5 2 12 XXX 6 3 11 XXX #rooms #bathrooms #windows price 2 1 4 XXXXX 4 2 5 XXXXX 5/18
Other examples • Bioinformatics • Previous years’ flu data new swine flu • Drug efficacy data against breast cancer drug data against lung cancer • …… • Intrusion detection • Existing types of intrusions unknown types of intrusions • Sentiment analysis • Review from SDM Review from KDD 6/18
Learning with Heterogeneous Sources • The paper mainly attacks two sub-problems: • Heterogeneous data distributions • Clustering based KL divergence and a corresponding sampling technique • Heterogeneous outputs (to regression problem) • Unifying outputs via preserving similarity. 7/18
Learning with Heterogeneous Sources • General Framework Source data Unifying data distributions Unifying outputs Target data Target data Source data 8/18
Unifying Data Distributions • Basic idea: • Combine the source and target data and perform clustering. • Select the clusters in which the target and source data are similarly distributed, evaluated by KL divergence. 9/18
An Example T D Adaptive Clustering Combined Data 10/18
Unifying Outputs • Basic idea: • Generate initial outputs according to the regression model • For the instances similar in the original output space, make their new outputs closer. 11/18
Modification Modification Initial Outputs Initial Outputs 21.25 31.75 37 16 26.5 12/18
Experiment • Bioinformatics data set: 13/18
Experiment 14/18
Experiment • Applied sociology data set: 15/18
Experiment 16/18
Conclusions • Problem: Learning with Heterogeneous Sources: • Heterogeneous data distributions • Heterogeneous outputs • Solution: • Clustering based KL divergence help perform sampling • Similarity preserving output generation help unify outputs