Domain Adaptation with Multiple Sources Yishay Mansour, Tel Aviv Univ. & Google Mehryar Mohri, NYU & Google Afshin Rostami, NYU
Adaptation – motivation • High level: • The ability to generalize from one domain to another • Significance: • Basic human property • Essential in most learning environments • Implicit in many applications.
Adaptation - examples • Sentiment analysis: • Users leave reviews • products, sellers, movies, … • Goal: score reviews as positive or negative. • Adaptation example: • Learn for restaurants and airlines • Generalize to hotels
Adaptation - examples • Speech recognition • Adaptation: • Learn a few accents • Generalize to new accents • think “foreign accents”.
Adaptation and generalization • Machine Learning prediction: • Learn from examples drawn from distribution D • predict the label of unseen examples • drawn from the same distribution D • generalization within a distribution • Adaptation: • predict the label of unseen examples • drawn from a different distribution D’ • Generalization across distributions
Adaptation – Related Work • Learn from D and test on D' • relating the increase in error to dist(D,D') • Ben-David et al. (2006), Blitzer et al. (2007) • Single distribution with varying label quality • Crammer et al. (2005, 2006)
Our Model
Our Model – input • k distributions D1, …, Dk and k hypotheses h1, …, hk, with expected loss L(Di,hi,f) ≤ ε for each i, where f is the target function • Typical loss function: L(a,b)=|a−b| and L(D,h,f)= Ex~D[ |f(x)−h(x)| ]
Our Model – target distribution • The target distribution Dλ is a mixture of the basic distributions D1, …, Dk with mixture weights λ1, …, λk: Dλ(x) = Σ λi Di(x)
Our Model – Combination Rule • Combine h1, …, hk into a single hypothesis h* with low expected loss, hopefully at most ε • Combining rules: let z satisfy Σ zi = 1 and zi ≥ 0 • linear: h*(x) = Σ zi hi(x) • distribution weighted: hz(x) = Σi [ zi Di(x) / Σj zj Dj(x) ] hi(x)
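To make the two combining rules concrete, here is a minimal sketch (not from the talk; it assumes each hi is a real-valued function and each Di can be evaluated pointwise as a density or probability mass function):

```python
import numpy as np

def linear_combine(z, hypotheses):
    """Linear combining rule: h*(x) = sum_i z_i * h_i(x)."""
    def h_star(x):
        return sum(zi * h(x) for zi, h in zip(z, hypotheses))
    return h_star

def distribution_weighted_combine(z, densities, hypotheses):
    """Distribution weighted combining rule:
    h_z(x) = sum_i [ z_i D_i(x) / sum_j z_j D_j(x) ] * h_i(x).
    Assumes x has positive mass under the z-mixture of the D_i."""
    def h_z(x):
        weights = np.array([zi * D(x) for zi, D in zip(z, densities)])
        return float(np.dot(weights / weights.sum(),
                            [h(x) for h in hypotheses]))
    return h_z
```

The linear rule weights every hypothesis the same way at every point; the distribution weighted rule trusts each hi exactly on the points where Di carries the mass, which is what the bad example below exploits.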
Combining Rules – Pros • Alternative: build a dataset for the mixture and learn from it • Learning the mixture parameters is non-trivial • The combined data set might be huge • Domain-dependent data may be unavailable • Sometimes only the classifiers are given/exist • privacy • MOST IMPORTANT: FUNDAMENTAL THEORY QUESTION
Our Results: • Linear Combining rule: • Seems like the first thing to try • Can be very bad • Simple settings where any linear combining rule performs badly.
Our Results: • Distribution weighted combining rules: • Given the mixture parameter λ: there is a good distribution weighted combining rule with expected loss at most ε • For any target function f, there is a good distribution weighted combining rule hz with expected loss at most ε • Extension to multiple “consistent” target functions: expected loss at most 3ε • OUTCOME: This is the “right” hypothesis class
Known Distribution
Linear combining rules – a bad example • Two points a and b with f(a)=1, f(b)=0; Da is concentrated on a and paired with h1 ≡ 1, Db is concentrated on b and paired with h0 ≡ 0 • Original loss: ε=0 !!! • Any linear combining rule h has expected absolute loss ½ on the uniform mixture of Da and Db
Distribution weighted combining rule • Target distribution – a mixture: Dλ(x)=Σλi Di(x) • Set z=λ : • Claim: L(Dλ,hλ,f) ≤ ε
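The claim follows in one line from the definition of hλ; a short derivation, assuming the absolute loss and a countable domain so the expectation can be written as a sum:

```latex
L(D_\lambda, h_\lambda, f)
  = \sum_x D_\lambda(x)\,\Bigl|\sum_i \tfrac{\lambda_i D_i(x)}{D_\lambda(x)}\bigl(h_i(x)-f(x)\bigr)\Bigr|
  \le \sum_i \lambda_i \sum_x D_i(x)\,\bigl|h_i(x)-f(x)\bigr|
  = \sum_i \lambda_i\, L(D_i, h_i, f)
  \le \varepsilon .
```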
Back to the bad example • Original loss: ε=0 !!! • The distribution weighted rule recovers f exactly: at x=a, hz(x)=h1(x)=1; at x=b, hz(x)=h0(x)=0
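A tiny numerical check of the bad example (a sketch, not from the talk; point a is index 0 and point b is index 1):

```python
import numpy as np

f  = np.array([1.0, 0.0])   # target: f(a)=1, f(b)=0
h1 = np.array([1.0, 1.0])   # constant 1: zero loss on D_a
h0 = np.array([0.0, 0.0])   # constant 0: zero loss on D_b
Da = np.array([1.0, 0.0])   # D_a concentrated on a
Db = np.array([0.0, 1.0])   # D_b concentrated on b

def loss(D, h):
    """Expected absolute loss L(D, h, f)."""
    return float(np.dot(D, np.abs(h - f)))

D_mix = 0.5 * Da + 0.5 * Db  # uniform mixture

# Any linear rule z*h1 + (1-z)*h0 is the constant z, so its loss is 1/2.
for z in (0.0, 0.3, 0.5, 1.0):
    print("linear z=%.1f  loss=%.2f" % (z, loss(D_mix, z * h1 + (1 - z) * h0)))

# Distribution weighted rule with z=(1/2,1/2): h_z(a)=h1(a)=1, h_z(b)=h0(b)=0.
w1 = 0.5 * Da / (0.5 * Da + 0.5 * Db)
w0 = 0.5 * Db / (0.5 * Da + 0.5 * Db)
print("distribution weighted loss =", loss(D_mix, w1 * h1 + w0 * h0))  # 0.0
```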
Unknown Distribution
Unknown mixture distribution • Zero-sum game: • NATURE: selects a distribution Di • LEARNER: selects a z, i.e. a hypothesis hz • Payoff: L(Di,hz,f) • Restating the previous result: • For any mixed action λ of NATURE, LEARNER has a pure action z=λ such that the expected loss is at most ε
Unknown mixture distribution • Consequence: • LEARNER has a mixed action (over z’s) such that for any mixed action λ of NATURE, i.e. any mixture distribution Dλ, the loss is at most ε • Challenge: • show a specific hypothesis hz • a pure, not mixed, action
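The step from the previous slide to this consequence is a minimax argument over the zero-sum game above; informally (glossing over the conditions needed to apply the minimax theorem to the continuum of pure actions z):

```latex
\max_{\lambda}\;\min_{z}\; L(D_\lambda, h_z, f) \;\le\; \varepsilon
\quad\Longrightarrow\quad
\min_{\text{mixed } Z}\;\max_{i}\; \mathbb{E}_{z \sim Z}\bigl[\,L(D_i, h_z, f)\,\bigr] \;\le\; \varepsilon .
```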
Searching for a good hypothesis • Uniformly good hypothesis hz: for any Di we have L(Di,hz,f) ≤ ε • Assume all the hi are identical: an extremely lucky and unlikely case • If we have a uniformly good hypothesis we are done! L(Dλ,hz,f) = Σ λi L(Di,hz,f) ≤ Σ λi ε = ε • We need to show that a good hz exists in general!
Proof Outline: • Balancing the losses: • Show that some hz has identical loss on any Di • uses Brouwer Fixed Point Theorem • holds very generally • Bounding the losses: • Show this hz has low loss for some mixture • specifically Dz
Brouwer Fixed Point Theorem: For any compact and convex set A and any continuous mapping φ: A→A, there exists a point x in A such that φ(x)=x
Balancing Losses • A = { z : Σi zi = 1 and zi ≥ 0 } • Natural mapping: [φ(z)]i = zi L(Di,hz,f) / Σj zj L(Dj,hz,f) • Problem 1: need φ to be continuous
Balancing Losses • At a fixed point z = φ(z): for every i with zi ≠ 0, L(Di,hz,f) = Σj zj L(Dj,hz,f), so the losses are balanced • Problem 2: needs that zi ≠ 0
Bounding the losses • Balancing alone is not sufficient: we can guarantee balanced losses even for the linear combining rule! • In the bad example, the linear rule with z=(½,½) gives L(Da,hz,f) = ½ and L(Db,hz,f) = ½
Bounding Losses • Consider the previous z (from the Brouwer fixed point theorem) and the mixture Dz • Its expected loss is at most ε • Also: L(Dz,hz,f) = Σ zj L(Dj,hz,f) = γ • Conclusion: for any mixture, the expected loss is at most γ ≤ ε
Solving the problems: • Redefine the distribution weighted rule as hz,η, smoothed by a parameter η > 0 • Claim: For any distribution D, L(D,hz,η,f) is continuous in z.
Main Theorem: For any target function f and any δ > 0, there exist η > 0 and z such that for any mixture λ we have L(Dλ,hz,η,f) ≤ ε + δ
Balancing Losses • The set A = { z : Σ zi = 1 and zi ≥ 0 }, the simplex • The mapping φ with parameters η and η': [φ(z)]i = (zi Li,z + η'/k) / (Σ zj Lj,z + η'), where Li,z = L(Di,hz,η,f) • For some z in A we have φ(z) = z • Then zi = (zi Li,z + η'/k) / (Σ zj Lj,z + η') > 0 • and Li,z = (Σ zj Lj,z) + η' − η'/(zi k) < (Σ zj Lj,z) + η'
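The existence argument via Brouwer is non-constructive, but the mapping φ is easy to write down. Below is a heuristic sketch on a finite domain (not an algorithm from the talk): the smoothing of hz,η with a uniform-mass term is an assumption standing in for the paper's exact definition, and iterating z ← φ(z) is only a heuristic, since Brouwer guarantees that a fixed point exists, not that this iteration finds it.

```python
import numpy as np

def h_z_eta(z, D, H, eta):
    """Eta-smoothed distribution weighted rule on a finite domain.
    D: (k, n) array of distributions, H: (k, n) array of hypothesis values.
    Smoothing assumption: add eta/(k*n) to each numerator term so the rule
    stays well defined and continuous in z even where D_z(x) = 0."""
    k, n = D.shape
    num = z[:, None] * D + eta / (k * n)      # (k, n)
    return (num / num.sum(axis=0) * H).sum(axis=0)

def phi(z, D, H, f, eta, eta_p):
    """[phi(z)]_i = (z_i L_i,z + eta'/k) / (sum_j z_j L_j,z + eta')."""
    h = h_z_eta(z, D, H, eta)
    L = (D * np.abs(h - f)).sum(axis=1)       # L_i,z for each source i
    return (z * L + eta_p / len(z)) / (np.dot(z, L) + eta_p)

def balance(D, H, f, eta=1e-3, eta_p=1e-3, iters=1000):
    """Heuristic search for an (approximately) balanced z."""
    z = np.full(D.shape[0], 1.0 / D.shape[0])
    for _ in range(iters):
        z = phi(z, D, H, f, eta, eta_p)
    return z
```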
Bounding Losses • Consider the previous z (from the Brouwer fixed point theorem) and the mixture Dz • Its expected loss is at most ε + η • By definition Σ zj Lj,z = L(Dz,hz,η,f) • Conclusion: γ = Σ zj Lj,z ≤ ε + η
Putting it together • There exists (z,η) such that: • The expected losses of hz,η are approximately balanced: L(Di,hz,η,f) ≤ γ + η' • Bounding γ using Dz: γ = L(Dz,hz,η,f) ≤ ε + η • For any mixture Dλ: L(Dλ,hz,η,f) ≤ ε + η + η'
A more general model • So far: NATURE first fixes the target function f • Call f consistent if its expected loss w.r.t. Di is at most ε, i.e. L(Di,hi,f) ≤ ε for each of the k distributions • Function class F = { f : f is consistent } • New Model: • LEARNER picks a hypothesis h • NATURE picks f in F and a mixture Dλ • Loss: L(Dλ,h,f) • RESULT: L(Dλ,h,f) ≤ 3ε.
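The slide states the 3ε bound without a proof; one plausible way to see where the factor 3 comes from (not spelled out in the talk) is a triangle-inequality argument: fix any consistent f0, let h = hz be the good distribution weighted rule for f0, and compare against any other consistent f:

```latex
L(D_\lambda, h, f)
 \;\le\; L(D_\lambda, h, f_0) + L(D_\lambda, f_0, f)
 \;\le\; \varepsilon + \sum_i \lambda_i \bigl[ L(D_i, f_0, h_i) + L(D_i, h_i, f) \bigr]
 \;\le\; \varepsilon + 2\varepsilon \;=\; 3\varepsilon .
```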
Simple Algorithms
Uniform Algorithm • Hypothesis: set z = (1/k, …, 1/k) • Performance: • For any mixture, expected error ≤ kε • There exists a mixture with expected error Ω(kε) • For k=2, there exists a mixture with expected error 2ε − ε²
Open Problem • Find a uniformly good hypothesis • efficiently !!! • algorithmic issues: • Search over the z’s • Multiple local minima.
Empirical Results
Empirical Results • Sentiment analysis dataset (sample reviews, quoted verbatim): • good product takes a little time to start operating very good for the price a little trouble using it inside ca • it rocks man this is the rockinest think i've ever seen or buyed dudes check it ou • does not retract agree with the prior reviewers i can not get it to retract any longer and that was only after 3 uses • dont buy not worth a cent got it at walmart can't even remove a scuff i give it 100 good thing i could return it • flash drive excelent hard drive good price and good time for seller thanks
Empirical analysis • Multiple domains: dvd, books, electronics, kitchen appliances • Language model: build one per domain to stand in for Di • unlike in the theory, this estimation is an additional error source • Tested on a mixture distribution with known mixture parameters • Target: score (1-5) • Error: Mean Squared Error (MSE)
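A rough sketch of this evaluation protocol (all names hypothetical; the per-domain score predictors and per-domain language models stand in for whatever was actually trained, with the language-model probability of a review playing the role of Di(x)):

```python
import numpy as np

def mixture_mse(reviews, true_scores, lm_probs, predictors, lam):
    """MSE of the distribution weighted combination on a test set drawn
    from the mixture with known parameters lam (used here as z).

    reviews     : list of test reviews
    true_scores : gold scores in 1..5
    lm_probs    : k functions, lm_probs[i](review) -> P(review | domain i)
    predictors  : k functions, predictors[i](review) -> predicted score
    lam         : mixture weights over the k domains
    """
    preds = []
    for r in reviews:
        w = np.array([l * p(r) for l, p in zip(lam, lm_probs)])
        w = w / w.sum()                                  # weights per domain
        preds.append(float(np.dot(w, [h(r) for h in predictors])))
    return float(np.mean((np.array(preds) - np.asarray(true_scores)) ** 2))
```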
[Results chart: MSE of the linear and distribution weighted combining rules on mixtures of the kitchen, dvd, books, and electronics domains]
Summary • Adaptation model • combining rules • linear • distribution weighted • Theoretical analysis • mixture distribution • Future research • algorithms for combining rules • beyond mixtures
Adaptation – Our Model • Input: • target function f • k distributions D1, …, Dk • k hypotheses h1, …, hk • For every i: L(Di,hi,f) ≤ ε • where L(D,h,f) denotes the expected loss • think L(D,h,f) = Ex~D[ |f(x)−h(x)| ]