Transductive Rademacher Complexity and its Applications
Ran El-Yaniv and Dmitry Pechyony
Technion – Israel Institute of Technology, Haifa, Israel
Induction vs. Transduction
Inductive learning: an unknown distribution of examples generates the training set; the learning algorithm produces a hypothesis, which is then used to label new unlabeled examples. Goal: minimize the expected error over the distribution.
Transductive learning (Vapnik '74, '98): the learning algorithm receives the labeled training set together with the unlabeled test set and directly outputs labels of the test set. Goal: minimize the error on the given test set.
Distribution-free Model [Vapnik '74, '98]
[Figure: the full sample of points; the test points to be labeled are marked with '?']
• Given: a "full sample" of unlabeled examples, each with its true (unknown) label.
• The full sample is partitioned into:
  • a training set (m points)
  • a test set (u points)
• The labels of the training examples are revealed.
• Goal: label the test examples. (A sketch of the random partition appears below.)
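A tiny sketch of the model's only source of randomness, assuming a uniformly random split of the full sample into m training and u test indices (the function and variable names are illustrative):

```python
# The distribution-free model: the full sample is fixed; the randomness is only
# over which m of the m+u points land in the training set.
import numpy as np

def random_partition(num_points, m, rng=None):
    """Uniformly random partition of {0, ..., num_points-1} into train (m) and test (u)."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(num_points)
    return perm[:m], perm[m:]          # training indices, test indices

train_idx, test_idx = random_partition(num_points=500, m=100)
# The labels of train_idx are revealed; the goal is to label test_idx.
```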
Rademacher complexity
Induction: the hypothesis space $\mathcal{H}$ is a set of functions $h:\mathcal{X}\to\mathbb{R}$; $x_1,\dots,x_m$ are the training points; $\sigma_1,\dots,\sigma_m$ are i.i.d. random variables with $\Pr(\sigma_i=1)=\Pr(\sigma_i=-1)=\tfrac12$. Rademacher complexity:
$R_m(\mathcal{H}) = \mathbb{E}_{\sigma}\bigl[\sup_{h\in\mathcal{H}} \tfrac{2}{m}\sum_{i=1}^{m}\sigma_i h(x_i)\bigr]$.
Transduction (version 1): the hypothesis space $\mathcal{V}$ is a set of vectors $v\in\mathbb{R}^{m+u}$; the full sample contains the $m$ training and $u$ test points; the $\sigma_i$ are distributed as in induction. Rademacher complexity:
$R_{m+u}(\mathcal{V}) = \bigl(\tfrac{1}{m}+\tfrac{1}{u}\bigr)\,\mathbb{E}_{\sigma}\bigl[\sup_{v\in\mathcal{V}} \sum_{i=1}^{m+u}\sigma_i v_i\bigr]$.
Transductive Rademacher complexity
Version 1: the full sample contains the $m$ training and $u$ test points; $\mathcal{V}\subseteq\mathbb{R}^{m+u}$ is the transductive hypothesis space; $\sigma_1,\dots,\sigma_{m+u}$ are i.i.d. random variables with $\Pr(\sigma_i=1)=\Pr(\sigma_i=-1)=\tfrac12$. Rademacher complexity: $R_{m+u}(\mathcal{V}) = \bigl(\tfrac{1}{m}+\tfrac{1}{u}\bigr)\,\mathbb{E}_{\sigma}\bigl[\sup_{v\in\mathcal{V}} \sigma\cdot v\bigr]$.
Version 2: a sparse distribution of Rademacher variables: $\Pr(\sigma_i=1)=\Pr(\sigma_i=-1)=p_0$ and $\Pr(\sigma_i=0)=1-2p_0$, with $p_0=\frac{mu}{(m+u)^2}$.
We develop risk bounds with Version 2. Lemma 1: the Version 2 complexity is upper bounded by the Version 1 complexity.
Risk bound
Notation: $\mathcal{L}_u(h)$ is the 0/1 error of $h$ on the $u$ test examples; $\widehat{\mathcal{L}}^{\gamma}_m(h)$ is the empirical $\gamma$-margin error of $h$ on the $m$ training examples.
Theorem: For any $\delta\in(0,1)$, with probability at least $1-\delta$ over the random partition of the full sample into training and test sets, for all hypotheses $h\in\mathcal{V}$ the test error $\mathcal{L}_u(h)$ is bounded by the empirical margin error plus a Rademacher-complexity term and slack terms in $m$, $u$ and $\delta$ (see the sketch below).
Proof: based on and inspired by the results of [McDiarmid, '89], [Bartlett and Mendelson, '02] and [Meir and Zhang, '03].
Previous results: [Lanckriet et al., '04] cover a special case.
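A hedged sketch of the bound's shape; the exact constants and lower-order terms appear in the paper, and the $O(\cdot)$ scaling below is an assumption:

$$\mathcal{L}_u(h) \;\le\; \widehat{\mathcal{L}}^{\gamma}_m(h) \;+\; \frac{R_{m+u}(\mathcal{V})}{\gamma} \;+\; O\!\left(\sqrt{\Bigl(\tfrac{1}{m}+\tfrac{1}{u}\Bigr)\Bigl(1+\ln\tfrac{1}{\delta}\Bigr)}\right).$$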
Inductive vs. Transductive hypothesis spaces
Induction: to use the risk bounds, the hypothesis space must be defined before observing the training set.
Transduction: the hypothesis space can be defined after observing the full sample, but before observing the actual partition into training and test sets.
Conclusion: transduction allows choosing a data-dependent hypothesis space. For example, it can be optimized to have low Rademacher complexity. This cannot be done in induction!
Another view on transductive algorithms: Unlabeled-Labeled Decomposition (ULD)
The learner computes a matrix $U$ from the unlabeled full sample and a vector $\alpha$ from the revealed training labels, and outputs the soft classification vector $h = U\alpha$.
Example: $U$ is the inverse of the graph Laplacian; $\alpha_i = y_i$ if example $i$ is a training example, and $\alpha_i = 0$ otherwise. (A minimal sketch of this decomposition appears below.)
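A minimal Python sketch of the ULD view, assuming a regularized inverse of an unnormalized graph Laplacian for $U$ (the function name, the regularization term, and the Laplacian variant are illustrative assumptions, not the authors' exact construction):

```python
# Unlabeled-Labeled Decomposition (ULD): h = U @ alpha, where U depends only on
# the unlabeled full sample and alpha only on the revealed training labels.
import numpy as np

def uld_predict(W, train_idx, y_train, reg=1e-2):
    """W: (m+u)x(m+u) similarity matrix over the full sample.
    train_idx: indices of the m training points; y_train: their +/-1 labels."""
    n = W.shape[0]
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # unnormalized graph Laplacian
    U = np.linalg.inv(L + reg * np.eye(n))     # regularized inverse of the Laplacian
    alpha = np.zeros(n)
    alpha[train_idx] = y_train                 # alpha_i = y_i on training points, 0 otherwise
    h = U @ alpha                              # soft classification of the full sample
    return np.sign(h)                          # predicted labels (test entries are the goal)
```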
Bounding Rademacher complexity
Hypothesis space: the set of all vectors $h = U\alpha$ obtained by running the transductive algorithm on all possible partitions of the full sample.
Notation: $\mathcal{H}_U$ is the set of $h$'s generated by the algorithm; $\lambda_1,\dots,\lambda_{m+u}$ are the singular values of $U$.
Lemma 2: the Rademacher complexity of $\mathcal{H}_U$ is upper bounded in terms of the singular values of $U$; shrinking the spectrum shrinks the bound.
Lemma 2 justifies the spectral transformations performed to improve the performance of transductive algorithms ([Chapelle et al., '02], [Joachims, '03], [Zhang and Ando, '05]).
Bounds for graph-based algorithms
Consistency method [Zhou, Bousquet, Lal, Weston, Scholkopf, '03]: the resulting Rademacher-complexity bound is expressed through the singular values $\lambda_i$ of the matrix $U$ used by the algorithm (see the sketch below).
Similar bounds hold for the algorithms of [Joachims, '03], [Belkin et al., '04], etc.
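A hedged sketch, assuming the standard consistency-method propagation matrix $U=(I-\mu S)^{-1}$ with the symmetrically normalized similarity matrix $S=D^{-1/2}WD^{-1/2}$ (the parameter name mu and the omission of constant factors are assumptions); it computes the singular values that such spectral bounds depend on:

```python
# Sketch: the matrix U used by the consistency method of Zhou et al. and its
# singular values, through which the Rademacher-complexity bound is expressed.
# The exact form of the bound is in the paper; only U and its spectrum are shown.
import numpy as np

def consistency_method_spectrum(W, mu=0.9):
    """W: (m+u)x(m+u) symmetric similarity matrix over the full sample."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt            # normalized similarity matrix
    n = W.shape[0]
    U = np.linalg.inv(np.eye(n) - mu * S)      # propagation matrix: h = U @ alpha
    return np.linalg.svd(U, compute_uv=False)  # singular values entering the bound
```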
Topics not covered
• Bounding the Rademacher complexity when the ULD matrix $U$ is a kernel matrix.
• For some algorithms: a data-dependent method for computing probabilistic upper and lower bounds on the Rademacher complexity.
• A risk bound for transductive mixtures.
Directions for future research
Tighten the risk bound to allow effective model selection:
• A bound depending on the 0/1 empirical error.
• Use of variance information to obtain better convergence rates.
• Local transductive Rademacher complexity.
• Clever data-dependent choice of low-Rademacher hypothesis spaces.
Monte Carlo estimation of transductive Rademacher complexity
Rademacher complexity: $R_{m+u}(\mathcal{V}) = \bigl(\tfrac{1}{m}+\tfrac{1}{u}\bigr)\,\mathbb{E}_{\sigma}\bigl[\sup_{v\in\mathcal{V}} \sigma\cdot v\bigr]$.
Draw $n$ vectors of Rademacher variables $\sigma^{(1)},\dots,\sigma^{(n)}$ independently and average the resulting suprema.
By Hoeffding's inequality: for any $\delta\in(0,1)$, with probability at least $1-\delta$, the empirical average is within a deviation term of order $\sqrt{\ln(1/\delta)/n}$ of the true expectation, yielding a probabilistic upper bound.
How to compute the supremum? For the Consistency Method of [Zhou et al., '03] it can be computed efficiently.
The symmetric Hoeffding inequality yields a probabilistic lower bound on the transductive Rademacher complexity. (A Monte Carlo sketch appears below.)
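A minimal Monte Carlo sketch, assuming a finite, explicitly enumerated hypothesis set $\mathcal{V}$ so that the supremum is a simple maximum (the function name, the hypothesis representation, and the boundedness assumption are illustrative; for real algorithms the supremum is computed algorithm-specifically):

```python
# Monte Carlo estimate of the transductive Rademacher complexity
#   R = (1/m + 1/u) * E_sigma[ sup_{v in V} sigma . v ]
# with the sparse (Version 2) Rademacher variables, plus a Hoeffding-style
# confidence radius around the empirical average.
import numpy as np

def estimate_transductive_rademacher(V, m, u, n_samples=1000, delta=0.05, rng=None):
    """V: array of shape (num_hypotheses, m+u), each row a soft-labeling vector."""
    rng = np.random.default_rng(rng)
    N = m + u
    p0 = m * u / (m + u) ** 2                        # sparse distribution parameter
    sups = np.empty(n_samples)
    for t in range(n_samples):
        sigma = rng.choice([-1.0, 0.0, 1.0], size=N, p=[p0, 1 - 2 * p0, p0])
        sups[t] = np.max(V @ sigma)                  # sup over the finite hypothesis set
    scale = 1.0 / m + 1.0 / u
    estimate = scale * sups.mean()
    # Two-sided Hoeffding radius; assumes hypothesis entries lie in [-1, 1],
    # so each scaled supremum has range at most 2 * scale * (m + u).
    B = 2.0 * scale * (m + u)
    radius = B * np.sqrt(np.log(2.0 / delta) / (2.0 * n_samples))
    return estimate, estimate - radius, estimate + radius
```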
Induction vs. Transduction: differences
Induction:
• Unknown underlying distribution.
• Independent training examples.
• Test examples not known; they will be sampled from the same distribution.
• Generate a general hypothesis. Want generalization!
Transduction:
• No unknown distribution. Each example has a unique label.
• Dependent training and test examples.
• Test examples are known.
• Only classify the given examples. No generalization!