Submodular Functions: Learnability, Structure & Optimization
Nick Harvey, UBC CS
Maria-Florina Balcan, Georgia Tech
Who studies submodular functions?
• CS, Approximation Algorithms
• Machine Learning
• OR, Optimization
• AGT, Economics
Valuation Functions
A first step in economic modeling:
• individuals have valuation functions giving utility for different outcomes or events.
Focus on combinatorial settings:
• n items, {1,2,…,n} = [n]
• f : 2^[n] → R
Learning Valuation Functions
This talk: learning valuation functions from past data.
• Bundle pricing
• Package travel deals
Submodular valuations
• [n] = {1,…,n}; a function f : 2^[n] → R is submodular if for all S, T ⊆ [n]:
      f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T)
• Equivalent to decreasing marginal returns: for all T ⊆ S and x ∉ S,
      f(T ∪ {x}) − f(T) ≥ f(S ∪ {x}) − f(S)
  (adding x to the small set T gives a large improvement; adding x to the large set S gives a small improvement)
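As a quick sanity check of the definition, here is a minimal sketch (ours, not from the talk) that brute-forces the inequality f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T) for a small coverage function, a standard example of a submodular function; the ground set and the covers dictionary are illustrative choices.

```python
# Brute-force check of submodularity for a small coverage function.
from itertools import chain, combinations

n = 4
# Hypothetical items, each covering some elements.
covers = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}, 3: {"a", "d"}}

def f(S):
    """Coverage function: number of elements covered by the items in S."""
    return len(set().union(*(covers[i] for i in S))) if S else 0

def subsets(ground):
    return chain.from_iterable(combinations(ground, r) for r in range(len(ground) + 1))

for S in map(frozenset, subsets(range(n))):
    for T in map(frozenset, subsets(range(n))):
        assert f(S) + f(T) >= f(S | T) + f(S & T), (S, T)
print("coverage function is submodular on this instance")
```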
Submodular valuations
E.g.,
• Vector spaces: let V = {v₁,…,vₙ}, each vᵢ ∈ F^n. For each S ⊆ [n], let f(S) = rank({ vᵢ : i ∈ S }).
• Concave functions: let h : R → R be concave. For each S ⊆ [n], let f(S) = h(|S|).
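The vector-space example can be checked the same way. The sketch below (our illustration, treating the vectors as real-valued so numpy's rank routine applies) evaluates f(S) as the rank of the selected rows of a small random matrix and verifies decreasing marginal returns; the matrix and dimensions are arbitrary.

```python
# f(S) = rank of the vectors indexed by S; check decreasing marginal returns.
import numpy as np
from itertools import chain, combinations

rng = np.random.default_rng(0)
V = rng.integers(0, 2, size=(5, 4))   # 5 vectors (rows) in R^4; arbitrary choice

def rank_fn(S):
    return 0 if not S else int(np.linalg.matrix_rank(V[list(S)]))

def subsets(ground):
    return chain.from_iterable(combinations(ground, r) for r in range(len(ground) + 1))

# For every T subset of S and x not in S:
#   f(T + {x}) - f(T) >= f(S + {x}) - f(S)
for S in map(frozenset, subsets(range(5))):
    for T in map(frozenset, subsets(S)):
        for x in set(range(5)) - S:
            gain_T = rank_fn(T | {x}) - rank_fn(T)
            gain_S = rank_fn(S | {x}) - rank_fn(S)
            assert gain_T >= gain_S
print("rank function has decreasing marginal returns on this instance")
```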
Passive Supervised Learning
• Data source: distribution D on 2^[n]; target function f : 2^[n] → R⁺
• Expert / oracle labels the samples: S₁,…,S_k drawn from D become labeled examples (S₁, f(S₁)),…,(S_k, f(S_k))
• Learning algorithm sees the labeled examples and outputs g : 2^[n] → R⁺
PMAC model for learning real-valued functions
(generalizes the Boolean PAC model, where labels lie in {0,1})
• Alg. sees (S₁, f(S₁)),…,(S_k, f(S_k)), with the Sᵢ i.i.d. from D, and produces g : 2^[n] → R⁺
• With probability ≥ 1 − δ, we have Pr_S[ g(S) ≤ f(S) ≤ α·g(S) ] ≥ 1 − ε
• Probably Mostly Approximately Correct
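To make the guarantee concrete, the following sketch (our illustration, with placeholder choices of f, g, α and D) estimates the quantity Pr_S[ g(S) ≤ f(S) ≤ α·g(S) ] that the PMAC definition bounds.

```python
# Estimate the probability bounded by the PMAC guarantee on fresh samples.
import random

random.seed(0)
n, alpha, trials = 20, 4.0, 10_000

def f(S):                      # placeholder target: a concave function of |S|
    return len(S) ** 0.5

def g(S):                      # placeholder hypothesis returned by some learner
    return 0.5 * len(S) ** 0.5

def sample_set():              # placeholder D: each item included independently
    return frozenset(i for i in range(n) if random.random() < 0.5)

hits = 0
for _ in range(trials):
    S = sample_set()
    if g(S) <= f(S) <= alpha * g(S):
        hits += 1
print(f"empirical Pr[ g(S) <= f(S) <= alpha*g(S) ] = {hits / trials:.3f}")
```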
Learning submodular functions
• Theorem (our general upper bound): Monotone submodular functions can be PMAC-learned (w.r.t. an arbitrary distribution) with approximation factor α = O(n^{1/2}).
• Theorem (our general lower bound): Monotone submodular functions cannot be PMAC-learned with approximation factor õ(n^{1/3}).
• Corollary: Gross substitutes functions do not have a concise, approximate representation.
• Theorem (product distributions): Lipschitz, monotone submodular functions can be PMAC-learned under a product distribution with approximation factor O(1).
Computing Linear Separators
• Given {+,–}-labeled points in R^n, find a hyperplane cᵀx = b that separates the +s and –s.
• Easily solved by linear programming.
Learning Linear Separators
• Given a random sample of {+,–}-labeled points in R^n, find a hyperplane cᵀx = b that separates most of the +s and –s (a few errors are allowed).
• Classic machine learning problem.
Learning Linear Separators
• Classic Theorem [Vapnik-Chervonenkis 1971]: Õ( n/ε² ) samples suffice to get error ε.
Submodular Functions are Approximately Linear
• Let f be non-negative, monotone and submodular:
  Non-negativity: f(S) ≥ 0 for all S ⊆ V
  Monotonicity: f(S) ≤ f(T) for all S ⊆ T
  Submodularity: f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T) for all S, T ⊆ V
• Claim: f can be approximated to within factor n by a linear function g.
• Proof sketch: let g(S) = Σ_{s∈S} f({s}). Then f(S) ≤ g(S) ≤ n·f(S).
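A small sketch of the claim (ours, with f(S) = min(|S|, k) as an illustrative monotone non-negative submodular function): it checks f(S) ≤ g(S) ≤ n·f(S) for every subset, where g(S) = Σ_{s∈S} f({s}).

```python
# Verify f(S) <= g(S) <= n*f(S) for the linear function built from singleton values.
from itertools import chain, combinations

n, k = 5, 2

def f(S):                       # budget function: monotone, non-negative, submodular
    return min(len(S), k)

def g(S):                       # linear: each singleton contributes f({i}) = 1
    return sum(f({i}) for i in S)

def subsets(ground):
    return chain.from_iterable(combinations(ground, r) for r in range(len(ground) + 1))

for S in map(frozenset, subsets(range(n))):
    assert f(S) <= g(S) <= n * f(S)
print("f <= g <= n*f holds for every subset on this instance")
```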
• Randomly sample {S₁,…,S_k} from the distribution
• For each sample, create a + labeled point with value f(Sᵢ) and a – labeled point with value n·f(Sᵢ)
• Now just learn a linear separator!
  (figure: curves f and n·f over subsets of V, with the + points on f, the – points on n·f, and the learned linear g between them)
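The reduction can be sketched as follows (our reading of the slide, with an illustrative coverage function): each sampled set contributes two points in R^{n+1}, its indicator vector with f(S) appended (labeled +) and with n·f(S) appended (labeled −); the proof's linear function g supplies an explicit separating hyperplane, which is what a separator-learning algorithm would recover.

```python
# Build the labeled point set and exhibit a separating hyperplane explicitly.
import random

random.seed(0)
covers = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}, 3: {"a", "d"}, 4: {"a", "c"}}
n = len(covers)

def f(S):
    return len(set().union(*(covers[i] for i in S))) if S else 0

def indicator(S):
    return [1.0 if i in S else 0.0 for i in range(n)]

# Each sample S yields (indicator(S), f(S)) labeled +1 and (indicator(S), n*f(S)) labeled -1.
points = []
for _ in range(200):
    S = frozenset(i for i in range(n) if random.random() < 0.5)
    points.append((indicator(S) + [float(f(S))], +1))
    points.append((indicator(S) + [float(n * f(S))], -1))

# Separator suggested by the proof: w_i = f({i}) on the item coordinates, -1 on the value.
w = [float(f({i})) for i in range(n)] + [-1.0]

for x, label in points:
    score = sum(wi * xi for wi, xi in zip(w, x))
    # + points satisfy score >= 0, - points score <= 0 (ties only in degenerate
    # cases such as S being the empty set).
    assert score >= 0 if label == +1 else score <= 0
print("the two labeled point clouds are (weakly) linearly separated")
```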
• Theorem: g approximates f to within a factor n on a 1−ε fraction of the distribution.
  (figure: f, g and n·f over subsets of V, with g sandwiched between f and n·f)
• Can improve to O(n^{1/2}): in fact f² and n·f² are separated by a linear function [Goemans et al. '09]
• John's ellipsoid theorem: any centrally symmetric convex body is approximated by an ellipsoid to within factor n^{1/2}
Next: the lower bound. Recall the statement: monotone submodular functions cannot be PMAC-learned with approximation factor õ(n^{1/3}).
f(S) = min{ |S|, k }:
  f(S) = |S|   (if |S| ≤ k)
  f(S) = k     (otherwise)
A single "bump": fix a set A with |A| = k and lower its value:
  f(S) = |S|   (if |S| ≤ k, S ≠ A)
  f(S) = k−1   (if S = A)
  f(S) = k     (otherwise)
Many bumps: A = {A₁,…,A_m}, with |Aᵢ| = k:
  f(S) = |S|   (if |S| ≤ k, S ∉ A)
  f(S) = k−1   (if S ∈ A)
  f(S) = k     (otherwise)
Claim: f is submodular if |Aᵢ ∩ Aⱼ| ≤ k−2 for all i ≠ j.
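The claim can be checked by brute force on a tiny instance. The sketch below (ours) builds the bump function for two families of Aᵢ's, one with pairwise intersections ≤ k−2 and one with an intersection of size k−1, and tests submodularity.

```python
# Bump construction: submodular when the A_i are far apart, can fail otherwise.
from itertools import combinations

def make_f(A_family, k):
    A_family = {frozenset(A) for A in A_family}
    def f(S):
        return k - 1 if S in A_family else min(len(S), k)
    return f

def is_submodular(f, n):
    subs = [frozenset(c) for r in range(n + 1) for c in combinations(range(n), r)]
    return all(f(S) + f(T) >= f(S | T) + f(S & T) for S in subs for T in subs)

n, k = 6, 3
far_apart = [{0, 1, 2}, {0, 3, 4}]      # intersection size 1 <= k-2
too_close = [{0, 1, 2}, {0, 1, 3}]      # intersection size 2  = k-1

print(is_submodular(make_f(far_apart, k), n))   # expected: True
print(is_submodular(make_f(too_close, k), n))   # expected: False
```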
Delete half of the bumps at random:
  f(S) = |S|   (if |S| ≤ k)
  f(S) = k−1   (if S ∈ A and its bump wasn't deleted)
  f(S) = k     (otherwise)
If the algorithm sees only examples whose bumps were deleted, then f can't be predicted on the remaining Aᵢ: f is very unconcentrated on A, so any algorithm to learn f has additive error 1.
Can we force a bigger error with bigger bumps? Yes, if the Aᵢ's are very "far apart". This can be achieved by picking them randomly.
Theorem (main lower bound construction): there is a distribution D and a randomly chosen function f s.t.
• f is monotone, submodular
• Knowing the value of f on poly(n) random samples from D does not suffice to predict the value of f on future samples from D, even to within a factor õ(n^{1/3}).
Plan:
• Choose two values High = n^{1/3} and Low = O(log² n).
• Choose random sets A₁,…,A_m ⊆ [n], with |Aᵢ| = High and m = n^{log n}.
• D is the uniform distribution on {A₁,…,A_m}.
• Create a function f : 2^[n] → R: for each i, randomly set f(Aᵢ) = High or f(Aᵢ) = Low.
• Extend f to a monotone, submodular function on 2^[n].
Creating the function f
• We choose f to be a matroid rank function
• Such functions have a rich combinatorial structure, and are always submodular
• The randomly chosen Aᵢ's satisfy an expansion property involving H = { j : f(Aⱼ) = High }
• The expansion property can be leveraged to ensure f(Aᵢ) = High or f(Aᵢ) = Low, as desired.
Next: the corollary. Recall the statement: gross substitutes functions do not have a concise, approximate representation.
Gross Substitutes Functions
• Class of utility functions commonly used in mechanism design [Kelso-Crawford '82, Gul-Stacchetti '99, Milgrom '00, …]
• Intuitively, increasing the prices for some items does not decrease demand for the other items.
• Question [Blumrosen-Nisan, Bing-Lehman-Milgrom]: do GS functions have a concise representation?
• Fact: every matroid rank function is GS.
• Corollary: the answer to the question is no.
Theorem (main lower bound construction): there is a distribution D and a randomly chosen function f s.t.
• f is a matroid rank function
• poly(n) bits of information do not suffice to predict the value of f on samples from D, even to within a factor õ(n^{1/3}).
Next: the product distribution result. Recall the statement: Lipschitz, monotone submodular functions can be PMAC-learned under a product distribution with approximation factor O(1).
Learning submodular functions
Hypotheses:
• Pr_{X∼D}[ X = x ] = ∏ᵢ Pr[ Xᵢ = xᵢ ]  ("product distribution")
• f({i}) ∈ [0,1] for all i ∈ [n]  ("Lipschitz function")
• f({i}) ∈ {0,1} for all i ∈ [n]  (a stronger condition)
Theorem (product distributions): Lipschitz, monotone submodular functions can be PMAC-learned under a product distribution with approximation factor O(1).
Technical Theorem: for any ε > 0, there exists a concave function h : [0,n] → R s.t. for every k ∈ [n], and for a 1−ε fraction of S ⊆ V with |S| = k, we have:
      h(k) ≤ f(S) ≤ O(log²(1/ε))·h(k).
In fact, h(k) is just E[ f(S) ], where S is uniform on sets of size k.
Algorithm (using the technical theorem above):
• Let μ = (1/m) Σᵢ₌₁^m f(Sᵢ)
• Let g be the constant function with value μ
This achieves approximation factor O(log²(1/ε)) on a 1−ε fraction of points, with high probability, proving:
Theorem (product distributions): Lipschitz, monotone submodular functions can be PMAC-learned under a product distribution with approximation factor O(1).
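A minimal sketch of this algorithm (ours; f, the product distribution, and the factor 4 below are illustrative choices): estimate μ from samples, output the constant hypothesis g = μ, and measure how often f stays within a constant factor of μ on fresh samples.

```python
# Constant-hypothesis learner under a product distribution.
import random

random.seed(0)
n, p, m = 100, 0.5, 200

def f(S):                      # illustrative target: concave of |S| -> monotone,
    return len(S) ** 0.5       # submodular, Lipschitz (singleton values are 1)

def sample_set():              # product distribution: each item included w.p. p
    return frozenset(i for i in range(n) if random.random() < p)

mu = sum(f(sample_set()) for _ in range(m)) / m     # empirical mean of f
# The hypothesis is the constant function g(S) = mu.

trials, within = 5000, 0
for _ in range(trials):
    if mu / 4 <= f(sample_set()) <= 4 * mu:         # factor 4 chosen arbitrarily
        within += 1
print(f"g = {mu:.2f}; fraction of samples within factor 4 of g: {within / trials:.3f}")
```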
Technical Theorem (restated): for any ε > 0, there is a concave h : [0,n] → R with h(k) ≤ f(S) ≤ O(log²(1/ε))·h(k) for a 1−ε fraction of sets S of each size k; here h(k) = E[ f(S) ] for S uniform on sets of size k.
Concentration Lemma: let X have a product distribution. For any α ∈ [0,1], f(X) is concentrated around its expectation [precise inequality omitted].
Proof: based on Talagrand's concentration inequality.
Follow-up work
• Subadditive & XOS functions [Badanidiyuru et al., Balcan et al.]
  – O(n^{1/2}) approximation
  – Ω(n^{1/2}) inapproximability
• Symmetric submodular functions [Balcan et al.]
  – O(n^{1/2}) approximation
  – Ω(n^{1/3}) inapproximability
Conclusions
• Learning-theoretic view of submodular functions
• Structural properties:
  – Very "bumpy" under arbitrary distributions
  – Very "smooth" under product distributions
• Learnability in the PMAC model:
  – O(n^{1/2}) approximation algorithm
  – Ω(n^{1/3}) inapproximability
  – O(1) approximation for Lipschitz functions & product distributions
• No concise representation for gross substitutes