Transfer Learning with Applications to Text Classification
Jing Peng, Computer Science Department
Machine learning: • the study of algorithms that • improve performance P • on some task T • using experience E • A well-defined learning task: <P, T, E>
Growth of Machine Learning • Machine learning is the preferred approach to • Speech processing • Computer vision • Medical diagnosis • Robot control • News article processing • … • This machine learning niche is growing • Improved machine learning algorithms • Lots of data available • Software too complex to code by hand • …
Learning • Given training data, least squares methods fit a hypothesis from a hypothesis space H • Learning focuses on minimizing the approximation error (how well the best hypothesis in H can represent the target) plus the estimation error (how far the learned hypothesis is from the best one in H)
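As a concrete illustration of this setup, here is a minimal least-squares sketch in NumPy; the target function, the noise level, and the choice of a linear hypothesis space H are illustrative assumptions, not details from the talk.

```python
# Minimal least-squares sketch. Target function, noise level, and the
# linear hypothesis space H are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(n)  # target + noise

# Hypothesis space H: linear functions f(x) = w*x + b.
A = np.hstack([X, np.ones((n, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimize ||Aw - y||^2

train_err = np.mean((A @ w - y) ** 2)      # empirical squared error
print(f"training MSE: {train_err:.4f}")
# The gap between the best linear fit and sin(2x) is the approximation
# error of H; the gap due to fitting a finite sample is the estimation error.
```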
Transfer Learning with Applications to Text Classification • Main challenge: transfer learning across domains that are • High-dimensional (more than 4000 features) • Overlapping (fewer than 80% of the features are shared) • Contribution: a solution with performance bounds
Standard Supervised Learning • Train on labeled New York Times articles; test on unlabeled New York Times articles • Classifier accuracy: 85.5%
In Reality… • Labeled New York Times data are not available • Train on labeled Reuters articles; test on unlabeled New York Times articles • Classifier accuracy: 64.1%
Domain Difference Causes a Performance Drop • Ideal setting: train on NYT, test on NYT: 85.5% • Realistic setting: train on Reuters, test on NYT: 64.1%
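The talk's NYT/Reuters corpora are not publicly bundled, so here is a hedged, runnable stand-in for the same experiment using the public 20 Newsgroups corpus: the coarse task (sci vs. rec) stays fixed while training and test documents come from different subcategories, emulating the domain shift. The 85.5%/64.1% figures above are the talk's, not this script's.

```python
# Domain-shift stand-in: same coarse labels, different subdomains.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def load(cats):
    data = fetch_20newsgroups(subset="all", categories=cats,
                              remove=("headers", "footers", "quotes"))
    # label 0 = sci.*, label 1 = rec.*
    y = [0 if data.target_names[t].startswith("sci") else 1
         for t in data.target]
    return data.data, y

src_docs, src_y = load(["sci.space", "rec.autos"])        # source domain
tgt_docs, tgt_y = load(["sci.med", "rec.motorcycles"])    # target domain

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(src_docs), src_y)
in_domain = clf.score(vec.transform(src_docs), src_y)     # optimistic
cross = clf.score(vec.transform(tgt_docs), tgt_y)
print(f"in-domain (train set) accuracy: {in_domain:.3f}")
print(f"cross-domain accuracy:          {cross:.3f}")
```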
High Dimensional Data Transfer • High-dimensional data: text categorization, image classification • The number of features in our experiments is more than 4000 • Challenges: • High dimensionality: more features than training examples • Euclidean distance becomes meaningless
Why Dimension Reduction? • [Figure: maximum pairwise distance DMAX versus minimum pairwise distance DMIN as dimensionality grows]
Curse of Dimensionality • [Figure: plot over the number of dimensions]
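A small numerical check of the effect these figures depict: for points drawn i.i.d. in the unit cube, the ratio of the largest to the smallest pairwise distance approaches 1 as the dimensionality grows, so Euclidean nearest neighbors lose meaning. The sample size and dimensions below are arbitrary choices.

```python
# Distance concentration: Dmax/Dmin shrinks toward 1 as d grows.
import numpy as np

rng = np.random.default_rng(42)
n = 200
for d in (2, 10, 100, 1000, 4000):
    X = rng.uniform(size=(n, d))
    dists = np.linalg.norm(X[0] - X[1:], axis=1)  # distances from one point
    print(f"d={d:5d}  Dmax/Dmin = {dists.max() / dists.min():.3f}")
```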
High Dimensional Data Transfer • Additional challenges: • Are the feature sets completely overlapping? No: fewer than 80% of the features are the same • Are the marginal distributions closely related? Not necessarily: this makes transferable structures harder to find and requires a proper definition of similarity
PAC (Probably Approximately Correct) learning requirement • Training and test distributions must be the same
Transfer between high dimensional overlapping distributions • Overlapping distributions: data from the two domains may not lie in exactly the same space, but at most an overlapping one
Transfer between high dimensional overlapping distributions • Problems with overlapping distributions: the overlapping features alone may not provide sufficient predictive power, making it hard to predict correctly
Transfer between high dimensional overlapping distributions • A possible fix: use the union of all features and fill in the missing values with zeros. Does it help?
Transfer between high dimensional overlapping distributions • With zero-filling, D²(A, B) = 0.0181 > D²(A, C) = 0.0101, so A is misclassified into the class of C instead of the class of B
Transfer between high dimensional overlapping distributions • When one uses the union of overlapping and non-overlapping features and replaces missing values with zeros, the distance between the two marginal distributions p(x) can become asymptotically very large as a function of the non-overlapping features, which then become the dominant factor in the similarity measure
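A synthetic illustration of this claim: with zero-filling, the squared Euclidean distance between points from the two domains grows with the number of non-overlapping features, regardless of how similar the points are on the overlap. All values below are synthetic.

```python
# Zero-filling makes the non-overlapping features dominate the distance.
import numpy as np

rng = np.random.default_rng(7)
n_overlap = 50
for n_disjoint in (0, 50, 500, 4000):
    # a carries the overlap plus its own features; zeros stand in for b's.
    a = np.concatenate([rng.normal(size=n_overlap),
                        rng.normal(size=n_disjoint),   # a's own features
                        np.zeros(n_disjoint)])         # zeros for b's
    b = np.concatenate([rng.normal(size=n_overlap),
                        np.zeros(n_disjoint),          # zeros for a's
                        rng.normal(size=n_disjoint)])  # b's own features
    d2 = np.sum((a - b) ** 2)
    print(f"non-overlapping features per domain: {n_disjoint:5d}  D^2 = {d2:8.1f}")
```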
Transfer between high dimensional overlapping distributions • High dimensionality can mask important features
Transfer between high dimensional overlapping distributions • [Figure: in the zero-filled space, the "blue" points are closer to the "greens" than to the "reds"]
LatentMap: a two-step correction • Step 1: Missing-value regression • Brings the marginal distributions closer • Step 2: Latent-space dimensionality reduction • Brings the marginal distributions closer still • Ignores unimportant, noisy, and error-importing features • Identifies transferable substructures across the two domains
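To make the two-step flow concrete before the step-by-step slides, here is a hedged end-to-end sketch. The talk names only "regression" and a latent space; ridge regression and truncated SVD are assumed stand-ins, and all data here are synthetic.

```python
# End-to-end LatentMap-style sketch under assumed concrete choices:
# ridge regression for step 1, truncated SVD for step 2. Synthetic data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
n_src, n_tgt, n_z, n_x = 300, 100, 40, 20
Z_src = rng.normal(size=(n_src, n_z))   # overlapping features (source)
X_src = rng.normal(size=(n_src, n_x))   # source-only features
Z_tgt = rng.normal(size=(n_tgt, n_z))   # overlapping features (target)

# Step 1: regress the missing source-only features from the overlap.
X_tgt = Ridge(alpha=1.0).fit(Z_src, X_src).predict(Z_tgt)

# Stack both domains in the shared (overlap + imputed) representation.
full = np.vstack([np.hstack([Z_src, X_src]), np.hstack([Z_tgt, X_tgt])])

# Step 2: project everything into a low-dimensional latent space.
latent = TruncatedSVD(n_components=10).fit_transform(full)
print(latent.shape)  # (400, 10)
```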
Missing Value Regression • Predict missing values (recall the previous example): • 1. Project onto the overlapping features z • 2. Map from z to x, using a relationship found by regression • After imputation: D(img(A'), B) = 0.0109 < D(img(A'), C) = 0.0125, so A is correctly classified into the same class as B
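A minimal sketch of this step, assuming ridge regression as the regressor (the slide says only "regression"): the map from z to x is learned on source points that carry both feature sets, then used to fill in the missing values of a point observed only through its overlapping features. The printed numbers are synthetic, not the 0.0109/0.0125 from the slide.

```python
# Missing-value regression sketch; ridge regression is an assumed choice.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, n_z, n_x = 300, 40, 20
Z = rng.normal(size=(n, n_z))                  # overlapping features z
W = rng.normal(size=(n_z, n_x))
X = Z @ W + 0.1 * rng.normal(size=(n, n_x))    # source-only features x

reg = Ridge(alpha=1.0).fit(Z[1:], X[1:])       # 2. learn the map z -> x

z_a = Z[0]                                     # 1. A' observed only via z
true_a = np.concatenate([Z[0], X[0]])          # A's (unobserved) full vector
zero_fill = np.concatenate([z_a, np.zeros(n_x)])
imputed = np.concatenate([z_a, reg.predict(z_a[None, :])[0]])  # img(A')
print("zero-fill error:", np.linalg.norm(zero_fill - true_a))
print("imputed error:  ", np.linalg.norm(imputed - true_a))    # far smaller
```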
Dimensionality Reduction • Word-vector matrix: the overlapping features together with the filled-in missing values
Dimensionality Reduction • Project the word-vector matrix onto its most important, inherent subspace
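A hedged sketch of this projection, assuming truncated SVD (as in latent semantic analysis); the slides say only "most important inherent subspace", so the method and the number of components are assumptions.

```python
# Latent-space projection sketch; truncated SVD is an assumed choice.
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(1)
word_vector_matrix = rng.random((500, 4000))  # docs x features (synthetic)

svd = TruncatedSVD(n_components=100, random_state=1)
latent = svd.fit_transform(word_vector_matrix)  # docs x 100 latent dims
print(latent.shape, svd.explained_variance_ratio_.sum())
```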