Transfer Learning with Applications to Text Classification
Jing Peng, Computer Science Department
Machine learning: • the study of algorithms that • improve performance P • on some task T • using experience E • A well-defined learning task: <P, T, E>
Growth of Machine Learning • Machine learning is the preferred approach to • Speech processing • Computer vision • Medical diagnosis • Robot control • News article processing • … • This machine learning niche is growing • Improved machine learning algorithms • Lots of data available • Software too complex to code by hand • …
Learning • Given training data, least squares methods fit a hypothesis from a hypothesis space H • Learning focuses on minimizing the approximation error (how well the best hypothesis in H can represent the target) plus the estimation error (how far the learned hypothesis is from the best one in H)
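As a concrete illustration of this setup, here is a minimal least-squares sketch in NumPy; the target function, the noise level, and the choice of a linear hypothesis space H are illustrative assumptions, not details from the talk.

```python
# Minimal least-squares sketch. Target function, noise level, and the
# linear hypothesis space H are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(n)  # target + noise

# Hypothesis space H: linear functions f(x) = w*x + b.
A = np.hstack([X, np.ones((n, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimize ||Aw - y||^2

train_err = np.mean((A @ w - y) ** 2)      # empirical squared error
print(f"training MSE: {train_err:.4f}")
# The gap between the best linear fit and sin(2x) is the approximation
# error of H; the gap due to fitting a finite sample is the estimation error.
```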
Transfer Learning with Applications to Text Classification • Main challenge: transfer learning across domains that are • High-dimensional (more than 4000 features) • Overlapping (fewer than 80% of the features are shared) • Contribution: a solution with performance bounds
Standard Supervised Learning • Train on labeled New York Times articles; test on unlabeled New York Times articles • Classifier accuracy: 85.5%
In Reality… • Labeled New York Times data are not available • Train on labeled Reuters articles; test on unlabeled New York Times articles • Classifier accuracy: 64.1%
Domain Difference Causes a Performance Drop • Ideal setting: train on NYT, test on NYT: 85.5% • Realistic setting: train on Reuters, test on NYT: 64.1%
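The talk's NYT/Reuters corpora are not publicly bundled, so here is a hedged, runnable stand-in for the same experiment using the public 20 Newsgroups corpus: the coarse task (sci vs. rec) stays fixed while training and test documents come from different subcategories, emulating the domain shift. The 85.5%/64.1% figures above are the talk's, not this script's.

```python
# Domain-shift stand-in: same coarse labels, different subdomains.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def load(cats):
    data = fetch_20newsgroups(subset="all", categories=cats,
                              remove=("headers", "footers", "quotes"))
    # label 0 = sci.*, label 1 = rec.*
    y = [0 if data.target_names[t].startswith("sci") else 1
         for t in data.target]
    return data.data, y

src_docs, src_y = load(["sci.space", "rec.autos"])        # source domain
tgt_docs, tgt_y = load(["sci.med", "rec.motorcycles"])    # target domain

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(src_docs), src_y)
in_domain = clf.score(vec.transform(src_docs), src_y)     # optimistic
cross = clf.score(vec.transform(tgt_docs), tgt_y)
print(f"in-domain (train set) accuracy: {in_domain:.3f}")
print(f"cross-domain accuracy:          {cross:.3f}")
```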
High Dimensional Data Transfer • High-dimensional data: text categorization, image classification • The number of features in our experiments is more than 4000 • Challenges: • High dimensionality: more features than training examples • Euclidean distance becomes meaningless
Why Dimension Reduction? • [Figure: maximum pairwise distance DMAX versus minimum pairwise distance DMIN as dimensionality grows]
Curse of Dimensionality • [Figure: plot over the number of dimensions]
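A small numerical check of the effect these figures depict: for points drawn i.i.d. in the unit cube, the ratio of the largest to the smallest pairwise distance approaches 1 as the dimensionality grows, so Euclidean nearest neighbors lose meaning. The sample size and dimensions below are arbitrary choices.

```python
# Distance concentration: Dmax/Dmin shrinks toward 1 as d grows.
import numpy as np

rng = np.random.default_rng(42)
n = 200
for d in (2, 10, 100, 1000, 4000):
    X = rng.uniform(size=(n, d))
    dists = np.linalg.norm(X[0] - X[1:], axis=1)  # distances from one point
    print(f"d={d:5d}  Dmax/Dmin = {dists.max() / dists.min():.3f}")
```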
High Dimensional Data Transfer • Additional challenges: • Are the feature sets completely overlapping? No: fewer than 80% of the features are the same • Are the marginal distributions closely related? Not necessarily: this makes transferable structures harder to find and requires a proper definition of similarity
PAC (Probably Approximately Correct) learning requirement • Training and test distributions must be the same
Transfer between high dimensional overlapping distributions • Overlapping distributions: data from the two domains may not lie in exactly the same space, but at most an overlapping one
Transfer between high dimensional overlapping distributions • Problems with overlapping distributions: the overlapping features alone may not provide sufficient predictive power, making it hard to predict correctly
Transfer between high dimensional overlapping distributions • A possible fix: use the union of all features and fill in the missing values with zeros. Does it help?
Transfer between high dimensional overlapping distributions • With zero-filling, D²(A, B) = 0.0181 > D²(A, C) = 0.0101, so A is misclassified into the class of C instead of the class of B
Transfer between high dimensional overlapping distributions • When one uses the union of overlapping and non-overlapping features and replaces missing values with zeros, the distance between the two marginal distributions p(x) can become asymptotically very large as a function of the non-overlapping features, which then become the dominant factor in the similarity measure
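A synthetic illustration of this claim: with zero-filling, the squared Euclidean distance between points from the two domains grows with the number of non-overlapping features, regardless of how similar the points are on the overlap. All values below are synthetic.

```python
# Zero-filling makes the non-overlapping features dominate the distance.
import numpy as np

rng = np.random.default_rng(7)
n_overlap = 50
for n_disjoint in (0, 50, 500, 4000):
    # a carries the overlap plus its own features; zeros stand in for b's.
    a = np.concatenate([rng.normal(size=n_overlap),
                        rng.normal(size=n_disjoint),   # a's own features
                        np.zeros(n_disjoint)])         # zeros for b's
    b = np.concatenate([rng.normal(size=n_overlap),
                        np.zeros(n_disjoint),          # zeros for a's
                        rng.normal(size=n_disjoint)])  # b's own features
    d2 = np.sum((a - b) ** 2)
    print(f"non-overlapping features per domain: {n_disjoint:5d}  D^2 = {d2:8.1f}")
```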
Transfer between high dimensional overlapping distributions • High dimensionality can mask important features
Transfer between high dimensional overlapping distributions • [Figure: in the zero-filled space, the "blue" points are closer to the "greens" than to the "reds"]
LatentMap: a two-step correction • Step 1: Missing-value regression • Brings the marginal distributions closer • Step 2: Latent-space dimensionality reduction • Brings the marginal distributions closer still • Ignores unimportant, noisy, and error-importing features • Identifies transferable substructures across the two domains
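To make the two-step flow concrete before the step-by-step slides, here is a hedged end-to-end sketch. The talk names only "regression" and a latent space; ridge regression and truncated SVD are assumed stand-ins, and all data here are synthetic.

```python
# End-to-end LatentMap-style sketch under assumed concrete choices:
# ridge regression for step 1, truncated SVD for step 2. Synthetic data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
n_src, n_tgt, n_z, n_x = 300, 100, 40, 20
Z_src = rng.normal(size=(n_src, n_z))   # overlapping features (source)
X_src = rng.normal(size=(n_src, n_x))   # source-only features
Z_tgt = rng.normal(size=(n_tgt, n_z))   # overlapping features (target)

# Step 1: regress the missing source-only features from the overlap.
X_tgt = Ridge(alpha=1.0).fit(Z_src, X_src).predict(Z_tgt)

# Stack both domains in the shared (overlap + imputed) representation.
full = np.vstack([np.hstack([Z_src, X_src]), np.hstack([Z_tgt, X_tgt])])

# Step 2: project everything into a low-dimensional latent space.
latent = TruncatedSVD(n_components=10).fit_transform(full)
print(latent.shape)  # (400, 10)
```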
Missing Value Regression • Predict missing values (recall the previous example): • 1. Project onto the overlapping features z • 2. Map from z to x, using a relationship found by regression • After imputation: D(img(A'), B) = 0.0109 < D(img(A'), C) = 0.0125, so A is correctly classified into the same class as B
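A minimal sketch of this step, assuming ridge regression as the regressor (the slide says only "regression"): the map from z to x is learned on source points that carry both feature sets, then used to fill in the missing values of a point observed only through its overlapping features. The printed numbers are synthetic, not the 0.0109/0.0125 from the slide.

```python
# Missing-value regression sketch; ridge regression is an assumed choice.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, n_z, n_x = 300, 40, 20
Z = rng.normal(size=(n, n_z))                  # overlapping features z
W = rng.normal(size=(n_z, n_x))
X = Z @ W + 0.1 * rng.normal(size=(n, n_x))    # source-only features x

reg = Ridge(alpha=1.0).fit(Z[1:], X[1:])       # 2. learn the map z -> x

z_a = Z[0]                                     # 1. A' observed only via z
true_a = np.concatenate([Z[0], X[0]])          # A's (unobserved) full vector
zero_fill = np.concatenate([z_a, np.zeros(n_x)])
imputed = np.concatenate([z_a, reg.predict(z_a[None, :])[0]])  # img(A')
print("zero-fill error:", np.linalg.norm(zero_fill - true_a))
print("imputed error:  ", np.linalg.norm(imputed - true_a))    # far smaller
```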
Dimensionality Reduction • Word-vector matrix: the overlapping features together with the filled-in missing values
Dimensionality Reduction • Project the word-vector matrix onto its most important, inherent subspace
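A hedged sketch of this projection, assuming truncated SVD (as in latent semantic analysis); the slides say only "most important inherent subspace", so the method and the number of components are assumptions.

```python
# Latent-space projection sketch; truncated SVD is an assumed choice.
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(1)
word_vector_matrix = rng.random((500, 4000))  # docs x features (synthetic)

svd = TruncatedSVD(n_components=100, random_state=1)
latent = svd.fit_transform(word_vector_matrix)  # docs x 100 latent dims
print(latent.shape, svd.explained_variance_ratio_.sum())
```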