Learning from Partially Labeled Data Martin Szummer MIT AI lab & CBCL szummer@ai.mit.edu http://www.ai.mit.edu/people/szummer/
Outline • The partially labeled data problem • Data representations • Markov random walk • Classification criteria • Information Regularization
Learning from partially labeled data = semi-supervised learning [figure: comparison of the supervised, semi-supervised, and unsupervised settings]
Semi-supervised learning from an unsupervised perspective • Labels constrain and repair clusters [figure: the clustering is ambiguous from the data alone; including labels resolves it]
Semi-supervised learning from a supervised perspective [figure: decision boundary from the labeled points alone vs. the boundary from labeled + unlabeled points; legend: unlabeled, class 1, class -1]
Benefits of semi-supervised learning • Labeled data can be expensive: it may require human labor and additional experiments / measurements • Labeled data can be impossible to obtain: labels may be unavailable at the present time, e.g. when the task is prediction • Unlabeled data can be abundant and cheap! e.g. image sequences from video cameras, text documents from the web
Can we always benefit from partially labeled data? • Not always! Assumptions are required: • Labeled and unlabeled data drawn IID from the same distribution • Ignorable missingness mechanism • and…
Key assumption • The structure in the unlabeled data must relate to the desired classification; specifically: • A link between the marginal P(x) and the conditional P(y|x), which our classifier is equipped to exploit • Marginal distribution P(x): describes the input domain • Conditional distribution P(y|x): describes the classification • Example assumption: points in the same cluster should have the same label
The learning task • Task: classify [a subset of] the unlabeled points, training with both the labeled and unlabeled points
Previous approach: missing data with EM • Maximize the likelihood of a generative model that accounts for P(x) and P(x,y) • The models can be mixtures of Gaussians [Miller & Uyar] or Naïve Bayes [Nigam et al] • Issues: which model? How to weight unlabeled vs. labeled data?
Previous approach: Large margin on unlabeled data • Transduction with SVM or MED (max entropy discrimination) • Issues: computational cost
Outline • The partially labeled data problem • Data representations • Markov random walk • Classification criteria • Information Regularization
Clusters and low-dimensional structures [figure: mostly unlabeled points forming clusters and low-dimensional structures, with a few points labeled +1 and -1]
Representation desiderata • The conditional should follow the data manifold: the data may lie in a low-dimensional subspace. Example: a neighborhood graph • Robustly measure similarity between points: consider the volume of all paths, not just the shortest path. Example: Markov random walk • Variable resolution: adjustable cluster size or number (differentiate points at coarser scales, not at finer scales). Example: the number of time steps t of the Markov random walk determines whether two points appear indistinguishable • Goal: construct a representation P(i|xk) that satisfies these desiderata
Example: Markov random walk representation • Local neighborhood relation: Euclidean distance d(xi, xk) = ||xi - xk|| • Local transition probabilities to the K nearest neighbors: one-step probabilities pik proportional to exp(-d(xi, xk)/σ) for neighbors, 0 otherwise • Global transition probabilities in t steps: Pt(k|i), the (i, k) entry of the t-th power of the one-step transition matrix • Global representation renormalizes: P(i|xk), the posterior over starting points i given end point xk
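A minimal numpy sketch of one way to build this representation, assuming Euclidean distances, the exponential edge weighting above, a symmetrized K-nearest-neighbor graph, and a uniform prior over starting points; the function and parameter names (sigma, K, t) are illustrative, not the authors' code.

```python
import numpy as np
from scipy.spatial.distance import cdist

def markov_random_walk_representation(X, K=5, sigma=1.0, t=3):
    """Return P_rep[i, k], a sketch of P(start = i | end = x_k) after t steps."""
    N = X.shape[0]
    D = cdist(X, X)                          # pairwise Euclidean distances

    # Local weights to the K nearest neighbors (self included), zero elsewhere.
    W = np.zeros((N, N))
    nn = np.argsort(D, axis=1)[:, :K + 1]    # K neighbors plus the point itself
    for i in range(N):
        W[i, nn[i]] = np.exp(-D[i, nn[i]] / sigma)
    W = np.maximum(W, W.T)                   # symmetrize the neighborhood graph (assumption)

    # One-step transition matrix: rows normalized to sum to 1.
    A = W / W.sum(axis=1, keepdims=True)

    # Global transition probabilities in t steps.
    At = np.linalg.matrix_power(A, t)

    # Representation: posterior over starting points i for each end point k,
    # assuming a uniform prior over i (renormalize each column).
    P_rep = At / At.sum(axis=0, keepdims=True)
    return P_rep                              # P_rep[i, k] = P(i | x_k)
```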
Representation • Each point k is represented as a vector of (conditional) probabilities P(i|xk) over the possible starting states i of a t-step random walk that ends up in k • Two points are similar when their random walks have indistinguishable starting-point distributions
Parameter: length of random walk t • Higher t gives a coarser representation with fewer clusters • Limits: t = 0 and t → ∞ are degenerate • Choosing t based on unlabeled data alone: diameter of the graph; mixing time of the graph (2nd eigenvalue of the transition matrix; see the sketch below) • Choosing t based on both labeled + unlabeled data: when labels are consistent over large regions, t should be high; criteria: maximize likelihood, margin, or cross-validation performance
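A rough, label-free heuristic for the mixing-time criterion mentioned above: look at the second-largest eigenvalue modulus of the one-step transition matrix A (e.g. from the earlier sketch). The exact scaling used here is an assumption, not something stated in the talk.

```python
import numpy as np

def mixing_time_heuristic(A):
    """Pick t from the spectral gap of the row-stochastic transition matrix A.

    Relaxation time is roughly 1 / (1 - |lambda_2|); the ceiling and the
    absence of any extra constant are assumptions of this sketch.
    """
    moduli = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    lambda2 = min(moduli[1], 1.0 - 1e-9)     # guard against a disconnected graph
    return int(np.ceil(1.0 / (1.0 - lambda2)))
```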
A Generative Model for the Labels • Given: nodes i (corresponding to points xi) and label distributions Q(y|i) at each node i • The model generates a node identity and a label: 1. Draw a node identity i uniformly and a label y ~ Q(y|i) 2. Add t rounds of identity noise: node i is confused with node k according to P(k|i); the label y stays intact 3. Output the final identity k and the label y • During classification: only the noisy node identity is observed, and we want to determine the label y
Classification model • Given the noisy node identity k, infer the possible starting node identities i and average their label distributions Q(y|i), weighted by P(i|xk) • Question: how do we obtain Q(y|i)?
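In equation form, the rule the slide describes is presumably P(y|xk) = Σi P(i|xk) Q(y|i): average the node label distributions, weighted by the posterior over starting nodes. A minimal sketch (the matrix shape conventions are mine):

```python
import numpy as np

def predict_labels(P_rep, Q):
    """P_rep[i, k] = P(i | x_k) (columns sum to 1 over i);
    Q[i, c] = Q(y = c | i) (rows sum to 1 over c).

    Returns P_y[k, c] = sum_i P(i | x_k) * Q(y = c | i).
    """
    return P_rep.T @ Q

# Usage: hard decision for every point, e.g.
#   y_hat = predict_labels(P_rep, Q).argmax(axis=1)
```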
Classification model (2) • Unlike a linear classifier: the parameters Q(y|i) are bounded, limiting the effects of outliers; the classifier is directly applicable to multiple classes • Link between P(x) and P(y|x): the smoothness of the representation
Estimation: conditional maximum likelihood • Maximize the conditional log-likelihood over the labeled points, i.e. the sum over labeled k of log P(yk|xk) = log Σi P(i|xk) Q(yk|i) • Optimize with the EM algorithm
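A sketch of the standard EM updates for this mixture form, applied to the labeled points only; the uniform initialization and fixed iteration count are assumptions of the sketch.

```python
import numpy as np

def em_estimate_Q(P_rep, labeled_idx, y_labeled, C, n_iter=100):
    """EM for Q(y | i), maximizing sum over labeled k of
    log sum_i P(i | x_k) Q(y_k | i).

    P_rep[i, k] = P(i | x_k); labeled_idx are indices of the labeled points
    and y_labeled their labels in {0, ..., C-1}.
    """
    y_labeled = np.asarray(y_labeled)
    N = P_rep.shape[0]
    Q = np.full((N, C), 1.0 / C)              # start from uniform label distributions
    P_lab = P_rep[:, labeled_idx]             # shape (N, L)

    for _ in range(n_iter):
        # E-step: responsibility of start node i for labeled point k.
        R = P_lab * Q[:, y_labeled]           # [i, k] = P(i | x_k) Q(y_k | i)
        R /= R.sum(axis=0, keepdims=True)
        # M-step: Q(y = c | i) proportional to the responsibility mass
        # received from labeled points of class c.
        for c in range(C):
            Q[:, c] = R[:, y_labeled == c].sum(axis=1)
        row = Q.sum(axis=1, keepdims=True)
        Q = np.divide(Q, row, out=np.full_like(Q, 1.0 / C), where=row > 0)
    return Q
```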
Swiss roll problem [figure: Swiss roll data set, mostly unlabeled points with a few points labeled +1 and -1]
Summary: Markov Random Walk representation • Points are expressed as vectors of probabilities of having been generated by every other point • Related work: clustering: Markovian relaxation [Tishby & Slonim 00], spectral clustering [Shi & Malik 97; Meila & Shi 00; ++] • Visualization: Isomap [Tenenbaum 99], locally linear embedding [Roweis & Saul 00]
Outline • The partially labeled data problem • Data representations • Kernel expansion • Markov random walk • Classification criteria • conditional maximum likelihood with EM • maximize average margin • … • Information Regularization
Discriminative boundaries • Focus on classification decisions more directly than maximum likelihood does • Classify the labeled points with a margin • Margin at point xk: the confidence of the classifier at xk
Margin-based estimation • Criterion: maximize the average margin over the labeled points
Average margin: closed-form solution • The average-margin solution has a closed form: assign weight 1 to the class with the largest total "flow" to the point (see the sketch below) • Equivalent to two rounds of a weighted neighbor classifier: 1. Classify all points based on the labeled points 2. Classify all points based on the previous classification
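A sketch of one reading of that closed-form rule: each node gets the class receiving the largest total flow Σk P(i|xk) from labeled points of that class, and every point is then classified by averaging those hard node labels. Array conventions follow the earlier sketches.

```python
import numpy as np

def average_margin_classifier(P_rep, labeled_idx, y_labeled, C):
    """Closed-form average-margin rule (one reading of the slide).

    P_rep[i, k] = P(i | x_k); returns a predicted class for every point.
    """
    labeled_idx = np.asarray(labeled_idx)
    y_labeled = np.asarray(y_labeled)
    N = P_rep.shape[0]

    # Round 1: per-class flow from the labeled points into every node i.
    flow = np.zeros((N, C))
    for c in range(C):
        flow[:, c] = P_rep[:, labeled_idx[y_labeled == c]].sum(axis=1)
    Q = np.eye(C)[flow.argmax(axis=1)]        # hard 0/1 label distribution per node

    # Round 2: classify all points from the node label distributions.
    return (P_rep.T @ Q).argmax(axis=1)
```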
Text classification with Markov random walks • 20 Newsgroups dataset, Mac vs. PC • 2000 examples, 7500 dimensions, averages over 20 runs
Choosing t based on margin • Choose t to maximize the average margin on labeled and unlabeled points [plot: average margin per class (Mac, Win) as a function of t]
Car Detection • 2500 road scene images, split evenly between cars and non-cars
Markov random walk with 1 step (t = 1) [plot: error rate vs. number of labeled points, for NU = 0, 256, 512, 1024 unlabeled points]
Markov random walk with 1 step (t = 1), varying unlabeled [plot: error rate vs. number of unlabeled points, for NL = 16, 32, 64, 128 labeled points]
Markov random walk (t = 5) [plot: error rate vs. number of labeled points, for NU = 256, 512, 1024 unlabeled points]
Markov random walk (t = 5), varying unlabeled [plot: error rate vs. number of unlabeled points, for NL = 16, 32, 64, 128 labeled points]
Outline • The partially labeled data problem • Data representations • Kernel expansion • Markov random walk • Classification criteria • Information Regularization
Information Regularization Overview • Markov random walk: linked P(x) to P(y|x) indirectly, through the classification model • Information Regularization: explicitly and directly links P(x) to P(y|x), and makes no parametric assumptions about the link
• Assumption: inside small regions with a large number of points, the labeling should not change • Regularization approach: cover the domain with small regions, and penalize inhomogeneous labelings within the regions
Mutual information • Mutual information I(x; y) over a region • I(x; y) = how many bits of information knowledge about x contributes to knowledge about y, on average • I(x; y) = H(y) - H(y|x), a function of P(x) and P(y|x) • A measure of label homogeneity: I(x; y) is low when the labels in the region are homogeneous
Mutual Information: a homogeneity measure • Example: x = location within the circle; y ∈ {+, –} [figure: circular regions containing different mixtures of + and – labels] • I(x; y) is permutation invariant in both x and y
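A small worked sketch of I(x; y) = H(y) - H(y|x) inside one region, treating the points in the region as a uniform sample of x (an assumption of the sketch) and their soft labels P(y|x) as given:

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy in bits; p sums to 1 along `axis`."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log2(p)).sum(axis=axis)

def region_mutual_information(P_y_given_x):
    """I(x; y) = H(y) - H(y | x) for the points in one region.

    P_y_given_x[k, c] = P(y = c | x_k); x is taken uniform over the
    region's points (an assumption of this sketch).
    """
    P_y = P_y_given_x.mean(axis=0)                   # marginal P(y) in the region
    H_y = entropy(P_y)
    H_y_given_x = entropy(P_y_given_x, axis=1).mean()
    return H_y - H_y_given_x

# Homogeneous region: every point says P(+) = 0.9  ->  I(x; y) ≈ 0 bits.
print(region_mutual_information(np.array([[0.9, 0.1]] * 4)))
# Inhomogeneous region: half the points +, half -  ->  I(x; y) ≈ 1 bit.
print(region_mutual_information(np.array([[1.0, 0.0]] * 2 + [[0.0, 1.0]] * 2)))
```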
Information Regularization (in a small region) • Penalize the weighted mutual information IQ over a small region Q in the input domain • MQ = probability mass of x in region Q: high-density regions are penalized more • VQ = variance of x in region Q • IQ/VQ is independent of the size of Q as Q shrinks
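A sketch of the per-region penalty as I read these bullets: the region's mutual information IQ, weighted by its mass MQ and divided by its variance VQ; the exact combination MQ·IQ/VQ is my reading and should be treated as an assumption. It reuses region_mutual_information from the sketch above.

```python
import numpy as np

def region_penalty(X_region, P_y_given_x_region, N_total):
    """Penalty for one small region Q, combining MQ, IQ and VQ as
    M_Q * I_Q / V_Q (the combination is an assumption of this sketch).

    X_region: (n_points, dim) coordinates of the points falling in Q.
    P_y_given_x_region: (n_points, C) label distributions of those points.
    N_total: total number of points, so M_Q is the empirical mass of Q.
    """
    M_Q = X_region.shape[0] / N_total                           # probability mass of Q
    V_Q = np.trace(np.atleast_2d(np.cov(X_region, rowvar=False))) + 1e-12
    I_Q = region_mutual_information(P_y_given_x_region)         # from the previous sketch
    return M_Q * I_Q / V_Q
```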
Information Regularization (whole domain) • Cover the domain with small overlapping regions • Regularize each region • The cover should be connected • Example cover: balls centered at each data point • Trade-off: smaller regions vs. more overlap (small regions preserve spatial locality; overlap gives consistent regularization across regions)
Minimize Max Information Content • Minimize the maximum information contained in any region Q in the cover
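A sketch tying the last two slides together: build the example cover (a ball around each data point, with an assumed radius r), evaluate the per-region penalty from the earlier sketch, and report the maximum over the cover. Choosing r, and actually minimizing this quantity over P(y|x), are beyond this sketch.

```python
import numpy as np
from scipy.spatial.distance import cdist

def max_information_content(X, P_y_given_x, r=0.5):
    """Maximum per-region penalty over a cover of balls of radius r
    centered at each data point (r is an illustrative assumption).

    Reuses region_penalty / region_mutual_information from the sketches above.
    """
    N = X.shape[0]
    D = cdist(X, X)
    penalties = []
    for i in range(N):
        members = np.where(D[i] <= r)[0]      # points inside the ball around x_i
        if len(members) < 2:                  # need at least two points for a variance
            continue
        penalties.append(region_penalty(X[members], P_y_given_x[members], N))
    return max(penalties)
```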