The Power of Selective Memory
Shai Shalev-Shwartz
Joint work with Ofer Dekel and Yoram Singer
Hebrew University, Jerusalem
Outline
• Online learning, loss bounds, etc.
• Hypothesis space – prediction suffix trees (PSTs)
• Margin of prediction and hinge loss
• An online learning algorithm
• Trading margin for depth of the PST
• Automatic calibration
• A self-bounded online algorithm for learning PSTs
Online Learning
• For t = 1, 2, …
  • Get an instance x_t
  • Predict a target ŷ_t based on x_t
  • Get the true target y_t and suffer a loss
  • Update the prediction mechanism
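To make the protocol concrete, here is a minimal sketch of the loop in Python; the learner interface (predict/update) and the 0-1 loss used here are illustrative assumptions, not part of the talk.

```python
# Generic online learning loop (illustrative sketch; the learner
# interface and the 0-1 loss are assumptions, not from the talk).
def online_loop(learner, stream):
    cumulative_loss = 0
    for x_t, y_t in stream:            # examples arrive one at a time
        y_hat = learner.predict(x_t)   # predict before seeing the label
        cumulative_loss += 0 if y_hat == y_t else 1
        learner.update(x_t, y_t)       # adapt the prediction mechanism
    return cumulative_loss
```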
Analysis of Online Algorithms
• Relative loss bounds (external regret): for any fixed hypothesis h, the cumulative loss of the online algorithm is bounded by the cumulative loss of h plus a term that grows with the complexity of h.
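A bound of this flavor has the following general shape (the constants and the complexity measure are made concrete later for PSTs; this generic form is only a reading aid):

$\sum_{t=1}^{T} \ell(\hat{y}_t, y_t) \;\le\; a \sum_{t=1}^{T} \ell(h(x_t), y_t) \;+\; b \cdot \mathrm{complexity}(h)$ for some constants $a, b > 0$.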
Prediction Suffix Tree (PST)
Each hypothesis is parameterized by a triplet; its central component is a context function that assigns a real-valued weight to every context, i.e., to every node (suffix of the observed sequence) in the tree.
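One natural prediction rule for such a hypothesis, consistent with the numeric examples later in the talk but possibly differing from the exact rule used there (e.g., in per-depth scaling): sum the weights of all suffixes of the observed history that appear in the tree and take the sign,

$\hat{y}_t = \mathrm{sign}\Big(\textstyle\sum_{s \,\in\, \mathrm{suffixes}(y_1 \ldots y_{t-1})} g(s)\Big).$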
PST Example
[figure: a small PST whose nodes carry the weights 0, -3, -1, 1, 4, -2, 7]
Margin of Prediction
• Margin of prediction: the true target times the real-valued prediction
• Hinge loss: the amount by which this margin falls short of a required threshold (zero when the margin is large enough)
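In standard notation (the threshold of 1 is consistent with the numeric examples later in the talk, but is stated here as an assumption):

margin: $y_t \hat{y}_t$,  hinge loss: $\ell_t = \max\{0,\; 1 - y_t \hat{y}_t\}$.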
Complexity of a Hypothesis
• Define the complexity of a hypothesis as a squared norm of its context function g
• We can also extend g to contexts outside the tree (with zero weight) and get the same complexity for trees of unbounded depth
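A natural instance of such a complexity measure, consistent with the "Example revisited" slides below (a depth-1 tree with child weights ±1.41 has complexity ≈ 4), though the talk's exact weighting of contexts may differ:

$\|g\|^2 = \sum_{s} g(s)^2,$ where the sum ranges over all contexts $s$.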
Algorithm I: Learning Unbounded-Depth PSTs
• Init: start with the zero context function (an empty tree)
• For t = 1, 2, …
  • Get x_t and predict ŷ_t
  • Get y_t and suffer the hinge loss ℓ_t
  • Set an update coefficient based on the suffered loss
  • Update the weight vector (context function)
  • Update the tree, adding the current context as a new path
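The following Python sketch illustrates the flavor of such a learner. The dictionary-based tree, the additive (perceptron-style) update, and the 2^(-d/2) damping of deeper contexts are assumptions made for illustration; they are not the exact update rule from the talk.

```python
# Illustrative sketch of an unbounded-depth PST learner (assumptions:
# additive perceptron-style updates, 2**(-d/2) damping per depth d;
# not the exact rule from the talk).
class PSTLearner:
    def __init__(self):
        self.g = {}  # context function: maps a context (tuple of symbols) to a weight

    def score(self, history):
        # Sum the weights of all suffixes of the observed history.
        return sum(self.g.get(tuple(history[-d:]), 0.0)
                   for d in range(1, len(history) + 1))

    def predict(self, history):
        return 1 if self.score(history) >= 0 else -1

    def update(self, history, y, margin_threshold=1.0):
        loss = max(0.0, margin_threshold - y * self.score(history))
        if loss == 0.0:
            return
        # Grow the tree along the current context and push every suffix
        # weight toward the correct label, damping deeper suffixes.
        for d in range(1, len(history) + 1):
            s = tuple(history[-d:])
            self.g[s] = self.g.get(s, 0.0) + y * 2 ** (-d / 2)

# Example usage on an alternating sequence like the one in the example slides:
# learner, history = PSTLearner(), []
# for y in [1, -1, 1, -1, 1]:
#     print(learner.predict(history)); learner.update(history, y); history.append(y)
```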
Example
[animation: Algorithm I run on the sequence y = + - + - + …, predicting each symbol in turn, growing the PST after each round, and updating the node weights (e.g. 0, ±.23, ±.16, ±.42, ±.14, ±.09, ±.06, …)]
Analysis
• Let (x_1, y_1), …, (x_T, y_T) be a sequence of examples (under a mild boundedness assumption)
• Let h be an arbitrary fixed hypothesis with context function g
• Let L(h) be the cumulative loss of h on the sequence of examples. Then the cumulative loss of Algorithm I is bounded in terms of L(h) and the complexity of h.
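A mistake bound of the following linear shape is consistent with the numbers quoted later in the "Example revisited" slides (loss 2 and complexity 2 give 12; loss 1 and complexity 4 give 18, which matches a = 2 and b = 4); the theorem's exact form may differ:

$\#\{\text{prediction mistakes}\} \;\le\; a\, L(h) \;+\; b\, \|g\|^2.$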
Proof Sketch
• Define a progress measure: the distance between the current context function and the competitor
• Upper bound the total progress accumulated over the sequence
• Lower bound the progress made on each round in terms of the suffered loss
• Combining the upper and lower bounds gives the bound in the theorem
Proof Sketch (Cont.)
Where does the lower bound come from?
• For simplicity, assume a loss is suffered on round t
• Define a Hilbert space of context functions
• The context function g_{t+1} is the projection of g_t onto the half-space of functions f that attain zero hinge loss on the current example
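Written as an optimization problem (a standard way to express such a projection; the exact constraint used in the talk may differ):

$g_{t+1} = \arg\min_{f} \; \|f - g_t\|^2 \quad \text{s.t.} \quad \ell\big(f; (x_t, y_t)\big) = 0.$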
Example Revisited
y = + - + - + - + -
• The following hypothesis has a cumulative loss of 2 and a complexity of 2; therefore, the number of mistakes is bounded above by 12.
Example Revisited
y = + - + - + - + -
• The following hypothesis has a cumulative loss of 1 and a complexity of 4; therefore, the number of mistakes is bounded above by 18. But this tree is very shallow:
[figure: a depth-1 tree with root weight 0 and child weights 1.41 and -1.41]
Problem: the tree we learned is much deeper!
Geometric Intuition (Cont.)
Let's force g_{t+1} to be sparse by "canceling" the new coordinate.
Geometric Intuition (Cont.)
Now we can show that a similar lower bound on the progress still holds, up to a term that accounts for the canceled coordinate.
Trading Margin for Sparsity
• We got that the per-round progress is lower bounded by the loss minus a term contributed by the canceled coordinate
• If this term is much smaller than the loss, we can still get a loss bound!
• Problem: what happens if the loss is very small, so the canceled term is no longer negligible in comparison?
• Solution: tolerate small margin errors!
• Conclusion: if we tolerate small margin errors, we can get a sparser tree
Automatic Calibration
• Problem: the value of this tolerance parameter is unknown in advance
• Solution: use the data itself to estimate it! More specifically:
• Maintain a data-dependent estimate of the parameter during the run
• If the tolerated margin errors are kept below this estimate, then we still get a mistake bound
Algorithm II: Learning Self-Bounded-Depth PSTs
• Init: start with the zero context function (an empty tree)
• For t = 1, 2, …
  • Get x_t and predict ŷ_t
  • Get y_t and suffer the loss ℓ_t
  • If the loss is below the current tolerance, do nothing! Otherwise:
    • Set the update coefficient
    • Set the depth bound d_t
    • Set the new tolerance
    • Update w and the tree as in Algorithm I, up to depth d_t
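A minimal Python skeleton of this conservative, depth-capped update, building on the PSTLearner sketch above; the fixed tolerance and the logarithmic depth rule are placeholder assumptions, not the rules from the talk.

```python
import math

# Skeleton of a self-bounded-depth PST learner (illustrative; the fixed
# tolerance and the logarithmic depth rule are placeholder assumptions).
class SelfBoundedPSTLearner(PSTLearner):
    def __init__(self, tolerance=0.1):
        super().__init__()
        self.tolerance = tolerance
        self.num_updates = 0

    def update(self, history, y, margin_threshold=1.0):
        loss = max(0.0, margin_threshold - y * self.score(history))
        if loss <= self.tolerance:
            return  # tolerate small margin errors: no update, no tree growth
        self.num_updates += 1
        # Placeholder depth rule: let the depth cap grow slowly with the
        # number of updates performed, so the learned tree stays shallow.
        depth_cap = min(len(history), 1 + int(math.log2(self.num_updates + 1)))
        for d in range(1, depth_cap + 1):
            s = tuple(history[-d:])
            self.g[s] = self.g.get(s, 0.0) + y * 2 ** (-d / 2)
```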
Analysis – Loss Bound
• Let (x_1, y_1), …, (x_T, y_T) be a sequence of examples (under the same assumption as before)
• Let h be an arbitrary fixed hypothesis with context function g
• Let L(h) be the cumulative loss of h on the sequence of examples. Then the cumulative loss of Algorithm II is bounded in terms of L(h) and the complexity of h.
Analysis – Bounded Depth
• Under the previous conditions, the depth of all the trees learned by the algorithm is uniformly bounded from above.
Performance of Algorithm II (Example Revisited)
y = + - + - + - + - …
• Only 3 mistakes
• The last PST is of depth 5
• The margin is 0.61 (after normalization)
• The margin of the max-margin tree (of infinite depth) is 0.7071
[figure: the final PST, with node weights 0, ±.55, -.22, .39, ±.07, ±.05, -.03]
Conclusions
• Discriminative online learning of PSTs
• Loss bounds
• Trading margin for sparsity
• Automatic calibration
Future Work
• Experiments
• Feature selection and extraction
• Support vector selection