Active Perspectives on Computational Learning and Testing Liu Yang Slide 1
Interaction
• An Interactive Protocol: have the algorithm interact with an oracle/experts
• Care about: - Query complexity - Computational efficiency
• Question: how much better can a learner do with interaction, vs. getting data in one shot?
Slide 2
Notation
• Instance space X = {0,1}^n
• Concept space C: a collection of functions h: X -> {-1,+1}
• Distribution D over X
• Unknown target function h*: the true labeling function (realizable case: h* ∈ C)
• err(h) = P_{x~D}[h(x) ≠ h*(x)] (a quick numerical check follows)
Slide 3
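As a worked instance of the err(h) formula above, here is a minimal Monte Carlo estimate (a sketch; the threshold example and function names are mine, and in the active model h* would only be reachable through label requests):

```python
import random

def estimate_err(h, h_star, sample_D, n=10_000):
    """Monte Carlo estimate of err(h) = P_{x~D}[h(x) != h_star(x)]."""
    xs = [sample_D() for _ in range(n)]
    return sum(h(x) != h_star(x) for x in xs) / n

# Two thresholds on [0,1] under uniform D disagree on mass ~0.1:
h      = lambda x: +1 if x >= 0.5 else -1
h_star = lambda x: +1 if x >= 0.6 else -1
print(estimate_err(h, h_star, random.random))   # ~ 0.1
```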
"Active" Means Label Request
• Label request: have a pool of unlabeled examples; pick any x and receive h*(x); repeat (a protocol sketch follows)
• An algorithm achieves query complexity S(ε, δ, h*) for (C, D) if it outputs h_n after ≤ n label requests, and for any h* ∈ C, ε > 0, δ > 0, and n ≥ S(ε, δ, h*), P[err(h_n) ≤ ε] ≥ 1 − δ
• Motivation: labeled data is expensive to get
• Using label requests, we can do: - Active Learning: find h with small err(h) - Active Testing: decide whether h* ∈ C or far from C
Slide 4
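A minimal sketch of the label-request protocol as code (the names pick, fit, and the budget handling are illustrative placeholders, not an algorithm from the thesis):

```python
import random

def active_learn(pool, oracle, pick, fit, budget):
    """pool: unlabeled examples; oracle(x) returns h*(x) in {-1,+1};
    pick: query strategy; fit: builds a hypothesis from labeled data;
    budget: number n of label requests, after which h_n is output."""
    labeled = []
    for _ in range(budget):
        x = pick(pool, labeled)          # learner chooses where to query
        labeled.append((x, oracle(x)))   # one label request
    return fit(labeled)                  # hypothesis h_n

# Passive learning is the special case where pick ignores what it has seen:
def passive_pick(pool, labeled):
    return random.choice(pool)
```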
This Talk: Thesis Outline
• Bayesian Active Learning Using Arbitrary Binary-Valued Queries (Published)
• Active Property Testing (Major results submitted)
• Self-Verifying Bayesian Active Learning (Published)
• A Theory of Transfer Learning (Accepted)
• Learning with General Types of Queries (In progress)
• Active Learning with a Drifting Distribution (Submitted)
Slide 5
Outline
• Active Property Testing (Submitted)
• Self-Verifying Bayesian Active Learning (Published)
• Transfer Learning (Accepted)
• Learning with General Types of Queries (In progress)
Slide 6
Property Testing
• What is property testing for? - Quickly tell whether the target is in the right function class - Estimate the complexity of a function without actually learning it
• Question: can you do it with fewer queries than learning?
• Yes! E.g., for unions of d intervals, testing helps: ----++++----+++++++++-----++---+++-------- - Testing tells how big d needs to be to get close to the target - #Labels: active testing needs O(1), passive testing needs Θ(√d), active learning needs Θ(d)
Slide 7
Active Property Testing
• Passive testing: no control over which examples get labeled
• Membership queries: unrealistic to query the function at arbitrary points
• Active queries: ask for labels only of points that exist in the environment
• Question: does active testing still get a significant benefit in label requests over passive testing?
Slide 8
Property Tester
• Accepts w.p. ≥ 2/3 if h* ∈ C
• Rejects w.p. ≥ 2/3 if dist(h*, C) = min_{g∈C} P_{x~D}[h*(x) ≠ g(x)] ≥ ε
Slide 9
Testing Unions of d Intervals
• ----++++----+++++++++-----++---+++--------
• Theorem: Testing unions of ≤ d intervals in the active testing model uses only O(1) queries (independent of d).
• Noise sensitivity := Pr[two close points are labeled differently]
• Proof idea (a tester sketch follows): - all unions of d intervals have low noise sensitivity - all functions that are far from this class have noticeably larger noise sensitivity - we introduce a tester that estimates the noise sensitivity of the input function
Slide 10
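A hedged sketch of such a noise-sensitivity tester, assuming D is uniform on [0,1]; the perturbation scale and acceptance threshold are illustrative placeholders, not the constants from the paper:

```python
import random

def noise_sensitivity(f, delta, num_pairs=2000):
    """Estimate Pr[f(x) != f(y)] where y is a delta-scale perturbation
    of a uniform x; each pair costs two label requests, so the total
    query count is a constant independent of d."""
    disagree = 0
    for _ in range(num_pairs):
        x = random.random()
        y = min(max(x + random.gauss(0.0, delta), 0.0), 1.0)
        if f(x) != f(y):
            disagree += 1
    return disagree / num_pairs

def test_union_of_intervals(f, d, eps):
    """Accept iff the estimated noise sensitivity is as low as every
    union of <= d intervals guarantees; functions eps-far from the
    class have noticeably larger noise sensitivity."""
    delta = eps / (4 * d)                # perturbation scale (illustrative)
    threshold = 4 * d * delta            # ~ mass within delta of a boundary
    return noise_sensitivity(f, delta) <= threshold
```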
Summary of Results
★★ Active testing (constant # of queries) is much better than passive testing and active learning for unions of intervals, the cluster assumption, and the margin assumption
★ Testing is easier than learning for linear threshold functions (LTFs)
✪ For dictator functions (single-variable functions), active testing does not help
Slide 11
Outline
• Active Property Testing (Submitted)
• Self-Verifying Bayesian Active Learning (Published)
• Transfer Learning (Accepted)
• Learning with General Types of Queries (In progress)
Slide 12
Self-Verifying Bayesian Active Learning
• Self-verifying (a special type of stopping criterion): - given ε, adaptively decides the # of queries, then halts - has the property that E[err] < ε when it halts
• Question: can you do this with E[#queries] = o(1/ε)? (passive learning needs Θ(1/ε) labels)
Slide 13
Example: Intervals
• Suppose D is uniform on [0,1] (figure: an interval target on [0,1], labeled '+' inside and '-' outside)
Slide 14
Example: Intervals (Verification Lower Bound)
• Suppose h* is the empty interval and D is uniform on [0,1]
• The algorithm somehow arrives at h with err(h) < ε; how can it verify that h is ε-close to h*? Every h' that is ε-far from h (outside the green ball around h) must be ruled out as a possible h*.
- Since h* is the empty interval, err(h) < ε means P(h = +) < ε
- In particular, the 1/(2ε) disjoint intervals of width 2ε all lie outside the green ball (on the red circle in the figure)
- Need one label inside each such interval to verify it is not h*
- So need Ω(1/ε) labels to verify the target isn't one of these (see the simulation below)
Slide 15
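A tiny simulation of the packing behind this lower bound (my own illustration, not the formal proof): after one label request, nearly all of the 1/(2ε) width-2ε candidate intervals still answer exactly like the empty interval.

```python
eps = 0.01
# 1/(2*eps) disjoint intervals of width 2*eps covering [0,1]:
alternatives = [(i * 2 * eps, (i + 1) * 2 * eps)
                for i in range(int(1 / (2 * eps)))]

def looks_empty(queries, interval):
    """True iff this interval labels every query '-', i.e. gives exactly
    the answers the empty interval would (no query landed inside it)."""
    a, b = interval
    return all(not (a < x < b) for x in queries)

queries = [0.51]                         # a single label request
survivors = [iv for iv in alternatives if looks_empty(queries, iv)]
print(len(survivors))                    # 49 of 50 candidates remain
```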
Learning with a Prior
• Suppose we know the distribution the target is sampled from; call it the prior
Slide 16
Interval Example with a Prior
• - - - - - |+++++++| - - - - - -
• Algorithm: query random points until finding the first +, then binary search to find the endpoints (a sketch follows). Halt upon reaching a pre-specified prior-based query budget. Output the posterior's Bayes classifier.
• Let the budget N be high enough that E[err] < ε:
- N = o(1/ε) suffices for E[err | w* > 0] < ε: each target interval of width w > 0 can be learned with o(1/ε) queries, and by the dominated convergence theorem the prior-average of these o(1/ε) query complexities is itself o(1/ε)
- N = o(1/ε) suffices for E[err | w* = 0] < ε: if P(w* = 0) > 0, then after some L = O(log(1/ε)) queries, w.p. > 1 − ε most of the posterior mass is on the empty interval, so the posterior's Bayes classifier has error rate 0
Slide 17
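A minimal sketch of this query-then-binary-search strategy, assuming D uniform on [0,1] (the helper names, tolerance, and budget handling are mine; choosing the budget from the prior is the substantive step and is left abstract here):

```python
import random

def learn_interval(oracle, budget, tol=1e-6):
    """oracle(x) -> +1 inside the target interval, -1 outside."""
    used, plus = 0, None
    while used < budget:                 # hunt for any positive point
        x = random.random()
        used += 1
        if oracle(x) == +1:
            plus = x
            break
    if plus is None:                     # budget spent without a '+':
        return lambda x: -1              # bet on the empty interval

    def boundary(lo, hi, plus_on_right):
        """Binary search for the endpoint between a '-' and a '+' region."""
        nonlocal used
        while hi - lo > tol and used < budget:
            mid = (lo + hi) / 2
            used += 1
            if (oracle(mid) == +1) == plus_on_right:
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2

    left = boundary(0.0, plus, plus_on_right=True)
    right = boundary(plus, 1.0, plus_on_right=False)
    return lambda x: +1 if left <= x <= right else -1

h = learn_interval(lambda x: +1 if 0.40 <= x <= 0.55 else -1, budget=200)
```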
Can Do o(1/ε) for Any VC Class
• Theorem: With the prior, can achieve o(1/ε) expected query complexity
• There are methods that find a good classifier in o(1/ε) queries (though they aren't self-verifying) [see TrueSampleComplexityAL08]
• Need to set a stopping criterion for those algorithms
• The stopping criterion we use: let the algorithm run until it makes a certain # of queries (set the budget just large enough that E[err] < ε); a wrapper sketch follows
Slide 18
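A sketch of that stopping criterion as a wrapper (the interface is hypothetical; computing the prior-dependent budget N(ε) is the substantive step and is only a stub here):

```python
def with_budget_stopping(learner, budget_for):
    """learner(oracle, n): runs for exactly n label requests and returns
    a hypothesis. budget_for(eps): the prior-dependent budget N(eps),
    set just large enough that E[err] < eps. The wrapper halts the
    (non-self-verifying) learner after N(eps) queries."""
    def run(oracle, eps):
        return learner(oracle, budget_for(eps))
    return run
```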
Outline • Active Property Testing (Submitted) • Self-Verifying Bayesian Active Learning (Published) • Transfer Learning (Accepted) • Learning with General Types of Query (in progress) Slide 19
Model of Transfer Learning
• Motivation: learners are often not too altruistic
• Two-layer model (figure: tasks 1, …, T with targets h_1*, …, h_T* and per-task labeled samples):
- Layer 1: draw tasks i.i.d. from an (unknown) prior, giving targets h_1*, h_2*, …, h_T*
- Layer 2: per task t, draw data (x_t1, y_t1), …, (x_tk, y_tk) i.i.d., labeled by that task's target
- More tasks yield a better estimate of the prior
• A sampling sketch follows
Slide 20
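A hedged sketch of the two-layer sampling process, using a uniform prior over thresholds on [0,1] as a stand-in (all names and the example prior are my assumptions):

```python
import random

def prior_sample():
    """Layer-1 draw: a target from the (here: uniform-threshold) prior."""
    t = random.random()
    return lambda x: +1 if x >= t else -1

def sample_tasks(prior_sample, sample_D, T, k):
    """Layer 1: one target per task; Layer 2: k i.i.d. labeled points."""
    tasks = []
    for _ in range(T):
        h_star = prior_sample()
        tasks.append([(x, h_star(x))
                      for x in (sample_D() for _ in range(k))])
    return tasks

tasks = sample_tasks(prior_sample, random.random, T=5, k=3)
```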
Insights
• Using a good estimate of the prior is almost as good as using the true prior
- Only need a VC-dimension # of additional points from each task to get a good estimate of the prior
- We've seen that the self-verifying algorithm, if given the true prior, has guarantees on the error and # of queries
- As #tasks → ∞, when we call the self-verifying algorithm, it outputs a classifier as good as if it had the true prior
Slide 21
Main Result
• Design an algorithm using this insight
- Uses at most a VC-dimension # of additional labeled points per task (vs. learning with the known true prior)
- Estimates the joint distribution on (x_1,y_1), …, (x_d,y_d)
- Inverts it to get a prior estimate
• Running this algorithm is asymptotically just as good as having direct knowledge of the prior (Bayesian)
- [HKS] showed passive learning saves constant factors in the Θ(1/ε) sample complexity per task (replacing VC-dim with a prior-dependent complexity measure)
- We showed access to the prior can improve active sample complexity, sometimes from Θ(1/ε) to o(1/ε)
Slide 22
Estimating the Prior
• Insight: identifiability of priors from the d-dimensional joint distribution
- The distribution of the full sequence (x_1,y_1), (x_2,y_2), … uniquely identifies the prior
- The set of joint distributions on (x_1,y_1), …, (x_k,y_k) for 1 ≤ k < ∞ identifies the distribution of the full sequence
- For any k > d = VC-dim, can express the distribution on (x_1,y_1), …, (x_k,y_k) in terms of the distribution of (x_1,y_1), …, (x_d,y_d)
• How to do it when d = 1? E.g., thresholds:
- For two points x_1 < x_2: Pr(+,−) = 0, Pr(+,+) = Pr(+,·), Pr(−,−) = Pr(·,−), and Pr(−,+) = Pr(·,+) − Pr(+,+) = Pr(·,+) − Pr(+,·)
- For any k > 1 points, can reduce directly from k to 1: P(−···−(−,+)+···+) = P((−,+)) = P((·,+)) − P((+,·))
• A numerical check follows
Slide 23
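A small numerical check of the threshold identities above with simulated targets (the uniform prior and the two query points are illustrative assumptions, not from the talk):

```python
import random

def joint_label_dist(x1, x2, n=100_000):
    """Empirical joint distribution of (y1, y2) at fixed x1 < x2 when
    the threshold t is drawn from the prior (uniform on [0,1] here)."""
    counts = {}
    for _ in range(n):
        t = random.random()
        y = (+1 if x1 >= t else -1, +1 if x2 >= t else -1)
        counts[y] = counts.get(y, 0) + 1
    return {y: c / n for y, c in counts.items()}

joint = joint_label_dist(0.3, 0.7)
p_plus_dot = joint.get((+1, +1), 0) + joint.get((+1, -1), 0)  # Pr(+,.)
p_dot_plus = joint.get((+1, +1), 0) + joint.get((-1, +1), 0)  # Pr(.,+)
print(joint.get((+1, -1), 0))                           # Pr(+,-) = 0
print(joint.get((-1, +1), 0), p_dot_plus - p_plus_dot)  # both ~ 0.4
```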
Outline
• Active Property Testing (Submitted)
• Self-Verifying Bayesian Active Learning (Published)
• Transfer Learning (Accepted)
• Learning with General Types of Queries (In progress) - Learning DNF formulas - Learning Voronoi
Slide 24
Learning with General Queries
• Construct problem-specific queries to efficiently learn problems for which no efficient PAC-learning algorithm is known.
Slide 25
Learning DNF Formulas
• DNF formulas: n is the # of variables; a poly-sized DNF has n^{O(1)} terms, e.g. f = (x1 ∧ x2) ∨ (x1 ∧ x4) (evaluated below)
• A natural form of knowledge representation [Valiant 1984]; a great challenge for over 20 years
• PAC-learning DNF formulas appears to be very hard: the fastest known algorithm [KS01] runs in time exp(n^{1/3} log^2 n)
• If the algorithm is forced to output a hypothesis that is itself a DNF, the problem is NP-hard
Slide 26
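For concreteness, the example DNF from this slide evaluated in Python (the dict encoding of the variables is my choice):

```python
def f(x):
    """f = (x1 AND x2) OR (x1 AND x4), with 1-based variable indices."""
    return (x[1] and x[2]) or (x[1] and x[4])

print(f({1: True, 2: False, 3: False, 4: True}))  # True: second term fires
```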
Thanks!
Slide 31