Domain Adaptation with Multiple Sources
Yishay Mansour, Tel Aviv Univ. & Google
Mehryar Mohri, NYU & Google
Afshin Rostamizadeh, NYU
Adaptation – motivation • High level: • The ability to generalize from one domain to another • Significance: • Basic human property • Essential in most learning environments • Implicit in many applications.
Adaptation - examples • Sentiment analysis: • Users leave reviews • products, sellers, movies, … • Goal: score reviews as positive or negative. • Adaptation example: • Learn for restaurants and airlines • Generalize to hotels
Adaptation - examples • Speech recognition • Adaptation: • Learn a few accents • Generalize to new accents • think “foreign accents”.
Adaptation and generalization • Machine Learning prediction: • Learn from examples drawn from distribution D • predict the label of unseen examples • drawn from the same distribution D • generalization within a distribution • Adaptation: • predict the label of unseen examples • drawn from a different distribution D’ • Generalization across distributions
Adaptation – Related Work • Learn from D and test on D’ • relating the increase in error to dist(D,D’) • Ben-David et al. (2006), Blitzer et al. (2007) • Single distribution, varying label quality • Crammer et al. (2005, 2006)
Our Model
Our Model – input • distributions D1, …, Dk and hypotheses h1, …, hk • target function f • expected loss: L(Di,hi,f) ≤ ε for every i • typical loss function: L(a,b) = |a−b| and L(D,h,f) = Ex~D[ |f(x)−h(x)| ]
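To make the loss concrete, here is a minimal Python sketch (assuming a finite domain, with a distribution represented as a dict {point: probability} and a hypothesis or target as a plain function; the names are illustrative, not from the talk):

```python
def expected_loss(D, h, f):
    """Expected absolute loss L(D, h, f) = E_{x~D}[ |f(x) - h(x)| ]
    over a finite domain, with D given as {point: probability}."""
    return sum(p * abs(f(x) - h(x)) for x, p in D.items())
```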
Our Model – target distribution • the target distribution Dλ is a mixture of the basic distributions D1, …, Dk with mixture weights λ1, …, λk: Dλ(x) = Σi λi Di(x)
Our Model – combination rule • combine h1, …, hk into a single hypothesis with low expected loss (hopefully at most ε) • combining rules use weights z with Σi zi = 1 and zi ≥ 0 • linear: hz(x) = Σi zi hi(x) • distribution weighted: hz(x) = Σi [ zi Di(x) / Σj zj Dj(x) ] hi(x)
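A sketch of the two combining rules under the same finite-domain representation as above (illustrative code, not the authors' implementation):

```python
def linear_combination(z, hypotheses):
    """Linear rule: h_z(x) = sum_i z_i * h_i(x)."""
    def h(x):
        return sum(zi * hi(x) for zi, hi in zip(z, hypotheses))
    return h

def distribution_weighted_combination(z, distributions, hypotheses):
    """Distribution weighted rule:
    h_z(x) = sum_i [ z_i * D_i(x) / sum_j z_j * D_j(x) ] * h_i(x).
    Left arbitrary (0) where the denominator vanishes; the eta-smoothed
    rule introduced later avoids this issue."""
    def h(x):
        denom = sum(zj * Dj.get(x, 0.0) for zj, Dj in zip(z, distributions))
        if denom == 0.0:
            return 0.0
        return sum(zi * Di.get(x, 0.0) / denom * hi(x)
                   for zi, Di, hi in zip(z, distributions, hypotheses))
    return h
```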
Combining Rules – Pros • Alternative: build a dataset for the mixture • learning the mixture parameters is non-trivial • the combined dataset might be huge • domain-dependent data may be unavailable • sometimes only classifiers are given/exist • privacy • MOST IMPORTANT: FUNDAMENTAL THEORY QUESTION
Our Results: • Linear Combining rule: • Seems like the first thing to try • Can be very bad • Simple settings where any linear combining rule performs badly.
Our Results: • Distribution weighted combining rules: • Given the mixture parameter λ: • there is a good distribution weighted combining rule • expected loss at most ε • For any target function f, • there is a good distribution weighted combining rule hz • expected loss at most ε • Extension for multiple “consistent” target functions • expected loss at most 3ε • OUTCOME: this is the “right” hypothesis class
Known Distribution
Linear combining rules – a bad example • two points a and b: Da concentrated on a with f(a) = 1, Db concentrated on b with f(b) = 0; h1 ≡ 1 and h0 ≡ 0 • Original loss: ε = 0 !!! • yet any linear combining rule h has expected absolute loss ½ on the uniform mixture of Da and Db
Distribution weighted combining rule • Target distribution – a mixture: Dλ(x)=Σλi Di(x) • Set z=λ : • Claim: L(Dλ,hλ,f) ≤ ε
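A short derivation of the claim (a sketch over a countable domain, using the definitions above and the triangle inequality):

```latex
\[
L(D_\lambda, h_\lambda, f)
  = \sum_x D_\lambda(x)\,\Bigl|\sum_i \tfrac{\lambda_i D_i(x)}{D_\lambda(x)}\,\bigl(f(x)-h_i(x)\bigr)\Bigr|
  \le \sum_x \sum_i \lambda_i D_i(x)\,\bigl|f(x)-h_i(x)\bigr|
  = \sum_i \lambda_i\, L(D_i, h_i, f)
  \le \varepsilon .
\]
```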
Back to the bad example • Original loss: ε = 0 !!! • the distribution weighted combination h+ picks the right hypothesis at each point: at x = a, h+(a) = h1(a) = 1; at x = b, h+(b) = h0(b) = 0 • so its expected loss is 0 on every mixture
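A worked check of the example (a sketch reusing the helper functions sketched earlier; the constant hypotheses h1 ≡ 1 and h0 ≡ 0 are as assumed above):

```python
# Two-point example: Da puts all mass on 'a' with f(a)=1, Db on 'b' with f(b)=0.
Da, Db = {'a': 1.0}, {'b': 1.0}
f = lambda x: 1.0 if x == 'a' else 0.0
h1 = lambda x: 1.0   # perfect on Da
h0 = lambda x: 0.0   # perfect on Db
D_mix = {'a': 0.5, 'b': 0.5}   # uniform mixture of Da and Db

h_lin = linear_combination([0.5, 0.5], [h1, h0])
print(expected_loss(D_mix, h_lin, f))  # -> 0.5; the slides show any linear rule gets 1/2 here

h_dw = distribution_weighted_combination([0.5, 0.5], [Da, Db], [h1, h0])
print(expected_loss(D_mix, h_dw, f))   # -> 0.0; picks h1 at a and h0 at b
```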
Unknown Distribution
Unknown mixture distribution • Zero-sum game: • NATURE selects a distribution Di • LEARNER selects a z • hypothesis hz • Payoff: L(Di,hz,f) • Restating the previous result: • For any mixed action λ of NATURE • LEARNER has a pure action z = λ • such that the expected loss is at most ε
Unknown mixture distribution • Consequence: • LEARNER has a mixed action (over z’s) • for any mixed action λ of NATURE • a mixture distribution Dλ • The loss is at most ε • Challenge: • show a specific hypothesis hz • pure, not mixed, action
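One way to write this game-theoretic step (a sketch: the expectation over NATURE's mixed action λ equals the mixture loss because the loss is linear in the distribution, and exchanging min and max needs a minimax theorem for a game whose LEARNER action set is the simplex, which relies on the continuity issues handled later):

```latex
\[
\min_{\mu \in \Delta(Z)}\; \max_{i}\; \mathbb{E}_{z \sim \mu}\bigl[L(D_i, h_z, f)\bigr]
 \;=\; \max_{\lambda}\; \min_{z}\; L(D_\lambda, h_z, f)
 \;\le\; \max_{\lambda}\; L(D_\lambda, h_\lambda, f) \;\le\; \varepsilon .
\]
```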
Searching for a good hypothesis • Uniformly good hypothesis hz: • for any Di we have L(Di, hz,f) ≤ ε • Assume all the hi are identical • Extremely lucky and unlikely case • If we have a good hypothesis we are done! • L(Dλ,hz,f) = Σ λi L(Di,hz,f) ≤ Σ λiε = ε • We need to show in general a good hz !
Proof Outline: • Balancing the losses: • Show that some hz has identical loss on any Di • uses Brouwer Fixed Point Theorem • holds very generally • Bounding the losses: • Show this hz has low loss for some mixture • specifically Dz
Brouwer Fixed Point Theorem: For any compact and convex set A and any continuous mapping φ : A → A, there exists a point x in A such that φ(x) = x
Balancing Losses • A = { z : Σi zi = 1 and zi ≥ 0 } • define φ(z) by re-weighting each zi in proportion to its loss Li,z = L(Di,hz,f) • Problem 1: need to get φ continuous
Balancing Losses • A = { z : Σi zi = 1 and zi ≥ 0 } • Fixed point: z = φ(z) • Problem 2: needs that zi ≠ 0 for every i
Bounding the losses • balancing alone is not enough: we can guarantee balanced losses even for a linear combining rule! • in the bad example, for z = (½, ½) we have L(Da,hz,f) = ½ and L(Db,hz,f) = ½
Bounding Losses • Consider the previous z • from Brouwer fixed point theorem • Consider the mixture Dz • its expected loss is at most ε (by the known-distribution result with λ = z) • Also: L(Dz,hz,f) = Σj zj L(Dj,hz,f) = γ • Conclusion: • for any mixture, expected loss at most γ ≤ ε
Solving the problems: • redefine the distribution weighted rule as hz,η: smooth it with a parameter η > 0 so that every hypothesis keeps a strictly positive weight at every point (e.g., by mixing a small η-weighted term into each zi Di(x) in the numerator and denominator) • Claim: for any distribution D, L(D,hz,η,f) is continuous in z
Main Theorem: For any target function f and any δ > 0, there exist η > 0 and z such that for any mixture weight λ we have L(Dλ,hz,η,f) ≤ ε + δ
Balancing Losses • The set A = { z : Σi zi = 1 and zi ≥ 0 } • the simplex • The mapping φ with parameters η and η’: [φ(z)]i = (zi Li,z + η’/k) / (Σj zj Lj,z + η’), where Li,z = L(Di,hz,η,f) • For some z in A we have φ(z) = z • zi = (zi Li,z + η’/k) / (Σj zj Lj,z + η’) > 0 • Li,z = (Σj zj Lj,z) + η’ − η’/(zi k) < (Σj zj Lj,z) + η’
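For intuition, a small sketch of one application of the mapping φ (illustrative only: the fixed point is obtained non-constructively via Brouwer, and finding it efficiently is the open problem below; `losses_of` is an assumed helper returning the losses Li,z of the η-smoothed rule):

```python
def phi(z, losses_of, eta_prime):
    """Balancing map: [phi(z)]_i = (z_i * L_i,z + eta'/k) / (sum_j z_j * L_j,z + eta').
    At a fixed point z = phi(z), L_i,z = sum_j z_j L_j,z + eta' - eta'/(z_i * k),
    so the losses are approximately balanced."""
    k = len(z)
    L = losses_of(z)                                   # [L_1,z, ..., L_k,z]
    denom = sum(zj * Lj for zj, Lj in zip(z, L)) + eta_prime
    return [(zi * Li + eta_prime / k) / denom for zi, Li in zip(z, L)]
```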
Bounding Losses • Consider the previous z • from Brouwer fixed point theorem • Consider the mixture Dz • its expected loss is at most ε + η • By definition Σj zj Lj,z = L(Dz,hz,η,f) • Conclusion: γ = Σj zj Lj,z ≤ ε + η
Putting it together • There exists (z,η) such that: • the expected losses of hz,η are approximately balanced • L(Di,hz,η,f) ≤ γ + η’ • Bounding γ using Dz • γ = L(Dz,hz,η,f) ≤ ε + η • For any mixture Dλ • L(Dλ,hz,η,f) ≤ ε + η + η’
A more general model • So far: NATURE first fixes the target function f • call f consistent if its expected loss w.r.t. Di is at most ε for each of the k distributions • Function class F = { f : f is consistent } • New model: • LEARNER picks a hypothesis h • NATURE picks f in F and a mixture Dλ • Loss: L(Dλ,h,f) • RESULT: there is an h with L(Dλ,h,f) ≤ 3ε
Simple Algorithms
Uniform Algorithm • Hypothesis sets z = (1/k, …, 1/k): • Performance: • for any mixture, expected error ≤ kε • there exists a mixture with expected error Ω(kε) • for k = 2, there exists a mixture with error 2ε − ε²
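A short argument for the kε upper bound (a sketch, assuming the uniform algorithm uses the distribution weighted rule with z = (1/k, …, 1/k), written h_u, over a countable domain; the middle step uses Dλ(x) = Σm λm Dm(x) ≤ Σj Dj(x)):

```latex
\[
L(D_\lambda, h_u, f)
  \le \sum_x D_\lambda(x) \sum_i \frac{D_i(x)}{\sum_j D_j(x)}\,\bigl|f(x)-h_i(x)\bigr|
  \le \sum_x \sum_i D_i(x)\,\bigl|f(x)-h_i(x)\bigr|
  = \sum_i L(D_i, h_i, f)
  \le k\,\varepsilon .
\]
```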
Open Problem • Find a uniformly good hypothesis • efficiently !!! • algorithmic issues: • Search over the z’s • Multiple local minima.
Empirical Results
Empirical Results • Sentiment-analysis dataset – sample reviews (verbatim, typos included): • good product takes a little time to start operating very good for the price a little trouble using it inside ca • it rocks man this is the rockinest think i've ever seen or buyed dudes check it ou • does not retract agree with the prior reviewers i can not get it to retract any longer and that was only after 3 uses • dont buy not worth a cent got it at walmart can't even remove a scuff i give it 100 good thing i could return it • flash drive excelent hard drive good price and good time for seller thanks
Empirical analysis • Multiple domains: • dvd, books, electronics, kitchen appliances • Language model: • build a model for each domain • unlike the theory, this is an additional error source • Tested on a mixture distribution • known mixture parameters • Target: score (1–5) • Error: mean squared error (MSE)
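A sketch of how this setup feeds the distribution weighted rule (illustrative only: `domain_lm_prob` and `domain_score` stand for assumed per-domain language-model probability estimators and per-domain score predictors, not functions from the talk):

```python
def combined_score(review, lam, domain_lm_prob, domain_score):
    """Distribution weighted prediction of a review's score, with known mixture weights lam.
    domain_lm_prob[i](review): estimated probability of the text under domain i's language model.
    domain_score[i](review):   score (1-5) predicted by domain i's model."""
    weights = [li * p(review) for li, p in zip(lam, domain_lm_prob)]
    total = sum(weights)
    if total == 0.0:
        # fall back to the linear rule if every language model assigns zero probability
        return sum(li * s(review) for li, s in zip(lam, domain_score))
    return sum(w / total * s(review) for w, s in zip(weights, domain_score))
```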
[Figure: MSE of the distribution weighted vs. linear combining rules on mixtures of the kitchen, dvd, books, and electronics domains]
Summary • Adaptation model • combining rules • linear • distribution weighted • Theoretical analysis • mixture distribution • Future research • algorithms for combining rules • beyond mixtures