The Power of Selective Memory
Shai Shalev-Shwartz
Joint work with Ofer Dekel and Yoram Singer
Hebrew University, Jerusalem
Outline
• Online learning, loss bounds, etc.
• Hypothesis space – prediction suffix trees (PSTs)
• Margin of prediction and hinge loss
• An online learning algorithm
• Trading margin for depth of the PST
• Automatic calibration
• A self-bounded online algorithm for learning PSTs
Online Learning
• For t = 1, 2, …
  • Get an instance x_t
  • Predict a target ŷ_t based on x_t
  • Get the true target y_t and suffer a loss
  • Update the prediction mechanism
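To make the protocol concrete, here is a minimal sketch of the loop in Python; the learner interface (predict/update) and the 0-1 loss used here are illustrative assumptions, not part of the talk.

```python
# Generic online learning loop (illustrative sketch; the learner
# interface and the 0-1 loss are assumptions, not from the talk).
def online_loop(learner, stream):
    cumulative_loss = 0
    for x_t, y_t in stream:            # examples arrive one at a time
        y_hat = learner.predict(x_t)   # predict before seeing the label
        cumulative_loss += 0 if y_hat == y_t else 1
        learner.update(x_t, y_t)       # adapt the prediction mechanism
    return cumulative_loss
```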
Analysis of Online Algorithms
• Relative loss bounds (external regret): for any fixed hypothesis h, the cumulative loss of the online algorithm is bounded by the cumulative loss of h plus a term that grows with the complexity of h.
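A bound of this flavor has the following general shape (the constants and the complexity measure are made concrete later for PSTs; this generic form is only a reading aid):

$\sum_{t=1}^{T} \ell(\hat{y}_t, y_t) \;\le\; a \sum_{t=1}^{T} \ell(h(x_t), y_t) \;+\; b \cdot \mathrm{complexity}(h)$ for some constants $a, b > 0$.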
Prediction Suffix Tree (PST)
Each hypothesis is parameterized by a triplet; its central component is a context function that assigns a real-valued weight to every context, i.e., to every node (suffix of the observed sequence) in the tree.
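One natural prediction rule for such a hypothesis, consistent with the numeric examples later in the talk but possibly differing from the exact rule used there (e.g., in per-depth scaling): sum the weights of all suffixes of the observed history that appear in the tree and take the sign,

$\hat{y}_t = \mathrm{sign}\Big(\textstyle\sum_{s \,\in\, \mathrm{suffixes}(y_1 \ldots y_{t-1})} g(s)\Big).$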
PST Example
[figure: a small PST whose nodes carry the weights 0, -3, -1, 1, 4, -2, 7]
Margin of Prediction
• Margin of prediction: the true target times the real-valued prediction
• Hinge loss: the amount by which this margin falls short of a required threshold (zero when the margin is large enough)
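In standard notation (the threshold of 1 is consistent with the numeric examples later in the talk, but is stated here as an assumption):

margin: $y_t \hat{y}_t$,  hinge loss: $\ell_t = \max\{0,\; 1 - y_t \hat{y}_t\}$.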
Complexity of a Hypothesis
• Define the complexity of a hypothesis as a squared norm of its context function g
• We can also extend g to contexts outside the tree (with zero weight) and get the same complexity for trees of unbounded depth
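A natural instance of such a complexity measure, consistent with the "Example revisited" slides below (a depth-1 tree with child weights ±1.41 has complexity ≈ 4), though the talk's exact weighting of contexts may differ:

$\|g\|^2 = \sum_{s} g(s)^2,$ where the sum ranges over all contexts $s$.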
Algorithm I: Learning Unbounded-Depth PSTs
• Init: start with the zero context function (an empty tree)
• For t = 1, 2, …
  • Get x_t and predict ŷ_t
  • Get y_t and suffer the hinge loss ℓ_t
  • Set an update coefficient based on the suffered loss
  • Update the weight vector (context function)
  • Update the tree, adding the current context as a new path
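The following Python sketch illustrates the flavor of such a learner. The dictionary-based tree, the additive (perceptron-style) update, and the 2^(-d/2) damping of deeper contexts are assumptions made for illustration; they are not the exact update rule from the talk.

```python
# Illustrative sketch of an unbounded-depth PST learner (assumptions:
# additive perceptron-style updates, 2**(-d/2) damping per depth d;
# not the exact rule from the talk).
class PSTLearner:
    def __init__(self):
        self.g = {}  # context function: maps a context (tuple of symbols) to a weight

    def score(self, history):
        # Sum the weights of all suffixes of the observed history.
        return sum(self.g.get(tuple(history[-d:]), 0.0)
                   for d in range(1, len(history) + 1))

    def predict(self, history):
        return 1 if self.score(history) >= 0 else -1

    def update(self, history, y, margin_threshold=1.0):
        loss = max(0.0, margin_threshold - y * self.score(history))
        if loss == 0.0:
            return
        # Grow the tree along the current context and push every suffix
        # weight toward the correct label, damping deeper suffixes.
        for d in range(1, len(history) + 1):
            s = tuple(history[-d:])
            self.g[s] = self.g.get(s, 0.0) + y * 2 ** (-d / 2)

# Example usage on an alternating sequence like the one in the example slides:
# learner, history = PSTLearner(), []
# for y in [1, -1, 1, -1, 1]:
#     print(learner.predict(history)); learner.update(history, y); history.append(y)
```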
Example
[animation: Algorithm I run on the sequence y = + - + - + …, predicting each symbol in turn, growing the PST after each round, and updating the node weights (e.g. 0, ±.23, ±.16, ±.42, ±.14, ±.09, ±.06, …)]
Analysis
• Let (x_1, y_1), …, (x_T, y_T) be a sequence of examples (under a mild boundedness assumption)
• Let h be an arbitrary fixed hypothesis with context function g
• Let L(h) be the cumulative loss of h on the sequence of examples. Then the cumulative loss of Algorithm I is bounded in terms of L(h) and the complexity of h.
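A mistake bound of the following linear shape is consistent with the numbers quoted later in the "Example revisited" slides (loss 2 and complexity 2 give 12; loss 1 and complexity 4 give 18, which matches a = 2 and b = 4); the theorem's exact form may differ:

$\#\{\text{prediction mistakes}\} \;\le\; a\, L(h) \;+\; b\, \|g\|^2.$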
Proof Sketch
• Define a progress measure: the distance between the current context function and the competitor
• Upper bound the total progress accumulated over the sequence
• Lower bound the progress made on each round in terms of the suffered loss
• Combining the upper and lower bounds gives the bound in the theorem
Proof Sketch (Cont.)
Where does the lower bound come from?
• For simplicity, assume a loss is suffered on round t
• Define a Hilbert space of context functions
• The context function g_{t+1} is the projection of g_t onto the half-space of functions f that attain zero hinge loss on the current example
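Written as an optimization problem (a standard way to express such a projection; the exact constraint used in the talk may differ):

$g_{t+1} = \arg\min_{f} \; \|f - g_t\|^2 \quad \text{s.t.} \quad \ell\big(f; (x_t, y_t)\big) = 0.$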
Example Revisited
y = + - + - + - + -
• The following hypothesis has a cumulative loss of 2 and a complexity of 2; therefore, the number of mistakes is bounded above by 12.
Example Revisited
y = + - + - + - + -
• The following hypothesis has a cumulative loss of 1 and a complexity of 4; therefore, the number of mistakes is bounded above by 18. But this tree is very shallow:
[figure: a depth-1 tree with root weight 0 and child weights 1.41 and -1.41]
Problem: the tree we learned is much deeper!
Geometric Intuition (Cont.)
Let's force g_{t+1} to be sparse by "canceling" the new coordinate.
Geometric Intuition (Cont.)
Now we can show that a similar lower bound on the progress still holds, up to a term that accounts for the canceled coordinate.
Trading Margin for Sparsity
• We got that the per-round progress is lower bounded by the loss minus a term contributed by the canceled coordinate
• If this term is much smaller than the loss, we can still get a loss bound!
• Problem: what happens if the loss is very small, so the canceled term is no longer negligible in comparison?
• Solution: tolerate small margin errors!
• Conclusion: if we tolerate small margin errors, we can get a sparser tree
Automatic Calibration
• Problem: the value of this tolerance parameter is unknown in advance
• Solution: use the data itself to estimate it! More specifically:
• Maintain a data-dependent estimate of the parameter during the run
• If the tolerated margin errors are kept below this estimate, then we still get a mistake bound
Algorithm II: Learning Self-Bounded-Depth PSTs
• Init: start with the zero context function (an empty tree)
• For t = 1, 2, …
  • Get x_t and predict ŷ_t
  • Get y_t and suffer the loss ℓ_t
  • If the loss is below the current tolerance, do nothing! Otherwise:
    • Set the update coefficient
    • Set the depth bound d_t
    • Set the new tolerance
    • Update w and the tree as in Algorithm I, up to depth d_t
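A minimal Python skeleton of this conservative, depth-capped update, building on the PSTLearner sketch above; the fixed tolerance and the logarithmic depth rule are placeholder assumptions, not the rules from the talk.

```python
import math

# Skeleton of a self-bounded-depth PST learner (illustrative; the fixed
# tolerance and the logarithmic depth rule are placeholder assumptions).
class SelfBoundedPSTLearner(PSTLearner):
    def __init__(self, tolerance=0.1):
        super().__init__()
        self.tolerance = tolerance
        self.num_updates = 0

    def update(self, history, y, margin_threshold=1.0):
        loss = max(0.0, margin_threshold - y * self.score(history))
        if loss <= self.tolerance:
            return  # tolerate small margin errors: no update, no tree growth
        self.num_updates += 1
        # Placeholder depth rule: let the depth cap grow slowly with the
        # number of updates performed, so the learned tree stays shallow.
        depth_cap = min(len(history), 1 + int(math.log2(self.num_updates + 1)))
        for d in range(1, depth_cap + 1):
            s = tuple(history[-d:])
            self.g[s] = self.g.get(s, 0.0) + y * 2 ** (-d / 2)
```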
Analysis – Loss Bound
• Let (x_1, y_1), …, (x_T, y_T) be a sequence of examples (under the same assumption as before)
• Let h be an arbitrary fixed hypothesis with context function g
• Let L(h) be the cumulative loss of h on the sequence of examples. Then the cumulative loss of Algorithm II is bounded in terms of L(h) and the complexity of h.
Analysis – Bounded Depth
• Under the previous conditions, the depth of all the trees learned by the algorithm is uniformly bounded from above.
Performance of Algorithm II (Example Revisited)
y = + - + - + - + - …
• Only 3 mistakes
• The last PST is of depth 5
• The margin is 0.61 (after normalization)
• The margin of the max-margin tree (of infinite depth) is 0.7071
[figure: the final PST, with node weights 0, ±.55, -.22, .39, ±.07, ±.05, -.03]
Conclusions
• Discriminative online learning of PSTs
• Loss bounds
• Trading margin for sparsity
• Automatic calibration
Future Work
• Experiments
• Feature selection and extraction
• Support vector selection