Learning Submodular Functions
Maria Florina Balcan
LGO, 11/16/2010
Submodular functions
V = {1, 2, …, n}; set function f : 2^V → R
f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T), ∀ S, T ⊆ V
• Decreasing marginal returns: f(T ∪ {x}) − f(T) ≥ f(S ∪ {x}) − f(S), ∀ T ⊆ S ⊆ V, x ∉ S
Examples:
• Vector spaces: let V = {v1, …, vn}, each vi ∈ F^n. For each S ⊆ V, let f(S) = rank(V[S]).
• Concave functions: let h : R → R be concave. For each S ⊆ V, let f(S) = h(|S|).
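The two example families above can be written directly as set functions. A minimal Python sketch (the particular concave h and the random vectors in R^4 are arbitrary illustrative choices, not from the slides):

```python
import math
import numpy as np

# Concave function of the cardinality: f(S) = h(|S|), h concave.
h = math.sqrt
f_concave = lambda S: h(len(S))

# Vector-space example: f(S) = rank of the vectors indexed by S.
rng = np.random.default_rng(0)
V = rng.standard_normal((4, 6))          # columns v_1, ..., v_6 in R^4
f_rank = lambda S: int(np.linalg.matrix_rank(V[:, sorted(S)])) if S else 0

print(f_concave({1, 2, 3}))              # sqrt(3)
print(f_rank({0, 1, 2, 3, 4}))           # 4: five generic vectors in R^4 have rank 4
```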
Submodular set functions
• A set function f on V is called submodular if, for all S, T ⊆ V: f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T).
• Equivalent diminishing-returns characterization: for T ⊆ S and x ∉ S, f(T ∪ {x}) − f(T) ≥ f(S ∪ {x}) − f(S)
  (adding x to the small set T gives a large improvement; adding x to the large set S gives a small improvement).
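As a quick illustration of the diminishing-returns characterization, here is a brute-force checker over a tiny ground set (a minimal sketch; the ground set and the two test functions are arbitrary choices, not from the slides):

```python
from itertools import combinations

def is_submodular(f, ground):
    """Brute-force check: f(T|{x}) - f(T) >= f(S|{x}) - f(S) for all T subset S, x not in S."""
    elems = list(ground)
    subsets = [frozenset(c) for r in range(len(elems) + 1)
               for c in combinations(elems, r)]
    for S in subsets:
        for T in subsets:
            if not T <= S:
                continue
            for x in ground - S:
                if f(T | {x}) - f(T) < f(S | {x}) - f(S) - 1e-9:
                    return False
    return True

ground = frozenset({1, 2, 3, 4})
print(is_submodular(lambda S: min(len(S), 2), ground))   # True: truncated cardinality
print(is_submodular(lambda S: len(S) ** 2, ground))      # False: convex in |S|
```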
Example: Set cover
• Place sensors in a building; V = possible locations. Want to cover the floorplan with discs: each sensor covers (predicts values at) the positions within some radius.
• For S ⊆ V: f(S) = area (number of locations) covered by the sensors placed at S.
• Formally: W is a finite set, with a collection of n subsets Wi ⊆ W. For S ⊆ V = {1, …, n}, define f(S) = |∪_{i∈S} Wi|.
Set cover is submodular
• Take T = {W1, W2}, S = {W1, W2, W3, W4}, and a new set x.
• Then f(T ∪ {x}) − f(T) ≥ f(S ∪ {x}) − f(S): the new set covers at least as much new area when added to the smaller collection.
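A small numeric version of this picture (the contents of W1–W4 and the extra set W5, standing in for x, are made up for illustration):

```python
W = {
    1: {"a", "b", "c"},
    2: {"c", "d"},
    3: {"d", "e", "f"},
    4: {"f", "g"},
    5: {"g", "h"},          # the candidate new sensor x
}

def coverage(S):
    """f(S) = number of locations covered by the sensors in S."""
    covered = set()
    for i in S:
        covered |= W[i]
    return len(covered)

T = {1, 2}                  # small configuration
S = {1, 2, 3, 4}            # larger configuration containing T
x = 5
print(coverage(T | {x}) - coverage(T))   # 2: x adds 'g' and 'h' on top of T
print(coverage(S | {x}) - coverage(S))   # 1: only 'h' is new on top of S
```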
Submodular functions
V = {1, 2, …, n}; set function f : 2^V → R
f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T), ∀ S, T ⊆ V
• Decreasing marginal returns: f(T ∪ {x}) − f(T) ≤ f(S ∪ {x}) − f(S), ∀ S ⊆ T ⊆ V, x ∉ T
Examples:
• Vector spaces: let V = {v1, …, vn}, each vi ∈ F^n. For each S ⊆ V, let f(S) = rank(V[S]).
• Concave functions: let h : R → R be concave. For each S ⊆ V, let f(S) = h(|S|).
Submodular functions
V = {1, 2, …, n}; set function f : 2^V → R
f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T), ∀ S, T ⊆ V
• Monotone: f(S) ≤ f(T), ∀ S ⊆ T
• Non-negative: f(S) ≥ 0, ∀ S ⊆ V
Submodular functions
• A lot of work on optimization and submodularity:
  • can be minimized in polynomial time;
  • algorithmic game theory: decreasing marginal utilities.
• Substantial interest in the ML community recently:
  • tutorials and workshops at ICML, NIPS, etc.;
  • www.submodularity.org/ owned by ML.
Learnability of Submodular Fns
• Important to also understand their learnability.
• Previous work: exact learning with value queries [Goemans, Harvey, Iwata, Mirrokni, SODA 2009].
• The [GHIM'09] model:
  • There is an unknown submodular target function.
  • The algorithm is allowed to (adaptively) pick sets and query the value of the target on those sets.
  • Can we learn the target with a polynomial number of queries, in poly time?
  • Goal: output a function that approximates the target within a factor α on every single subset.
Exact learning with value queries [Goemans, Harvey, Iwata, Mirrokni, SODA 2009]
• Theorem (general upper bound): there is an algorithm for learning a submodular function with an approximation factor O(n^{1/2}).
• Theorem (general lower bound): any algorithm for learning a submodular function must have an approximation factor of Ω(n^{1/2}).
Problems with the GHIM model
• The lower bound fails if our goal is only to do well on most of the points.
• Many simple functions that are easy to learn in the PAC model (e.g., conjunctions) are impossible to recover exactly from a polynomial number of queries.
• It is well known that value queries are undesirable in some learning applications.
Is there a better model that gets around these problems?
Answer: learn submodular fns in a distributional learning setting [BH10].
Our model: Passive Supervised Learning
• Data source: distribution D on {0,1}^n; an expert/oracle labels examples with the target f : {0,1}^n → R+.
• The algorithm sees labeled examples (x1, f(x1)), …, (xk, f(xk)), with the xi drawn i.i.d. from D.
• The algorithm outputs a hypothesis g : {0,1}^n → R+ (hopefully g ≈ f).
• Guarantee: Pr_{x1,…,xm}[ Pr_x[ g(x) ≤ f(x) ≤ α·g(x) ] ≥ 1 − ε ] ≥ 1 − δ.
• "Probably Mostly Approximately Correct" (PMAC).
Main results
• Theorem (our general upper bound): there is an algorithm for PMAC-learning the class of non-negative, monotone, submodular fns (w.r.t. an arbitrary distribution) with an approximation factor O(n^{1/2}).
  Note: much simpler algorithm compared to GHIM'09.
• Theorem (our general lower bound): no algorithm can PMAC-learn the class of non-negative, monotone, submodular fns with an approximation factor of õ(n^{1/3}).
  Note: the GHIM'09 lower bound fails in our model.
• Theorem (product distributions): matroid rank functions can be learned with a constant approximation factor.
A General Upper Bound
• Theorem: there is an algorithm for PMAC-learning the class of non-negative, monotone, submodular fns (w.r.t. an arbitrary distribution) with an approximation factor O(n^{1/2}).
Subadditive Fns are Approximately Linear
• Let f be non-negative, monotone, and subadditive:
  • Subadditive: f(S) + f(T) ≥ f(S ∪ T), ∀ S, T ⊆ V
  • Monotone: f(S) ≤ f(T), ∀ S ⊆ T
  • Non-negative: f(S) ≥ 0, ∀ S ⊆ V
• Claim: f can be approximated to within a factor n by a linear function g.
• Proof sketch: let g(S) = Σ_{s∈S} f({s}). Then f(S) ≤ g(S) ≤ n·f(S): the first inequality holds by subadditivity, the second because monotonicity gives f({s}) ≤ f(S) for each of the at most n terms.
Subaddtive Fns are Approximately Linear • f(S) ·g(S) ·n¢f(S). n¢f g f V
PMAC Learning Subadditive Fns
• f non-negative, monotone, subadditive is approximated to within a factor n by a linear function g, g(S) = w·χ(S).
• Idea: learn a linear separator — the labeled examples ((χ(S), f(S)), +) and ((χ(S), n·f(S)), −) are linearly separable in R^{n+1} — and use standard sample-complexity bounds for linear separators.
• Problem: the data constructed this way is not i.i.d.
• Solution: create a related distribution — sample S from D and flip a coin; if heads add ((χ(S), f(S)), +), else add ((χ(S), n·f(S)), −).
PMAC Learning Subadditive Fns
• Algorithm:
  Input: (S1, f(S1)), …, (Sm, f(Sm))
  • For each Si, flip a coin:
    • if heads, add ((χ(Si), f(Si)), +);
    • else, add ((χ(Si), n·f(Si)), −).
  • Learn a linear separator u = (w, −z) in R^{n+1}.
  • Output: g(S) = 1/(n+1) · w·χ(S).
• Note: deal with the set {S : f(S) = 0} separately.
• Theorem: for m = Θ(n/ε), g approximates f to within a factor n on a 1−ε fraction of the distribution.
PMAC Learning Submodular Fns
• Algorithm:
  Input: (S1, f(S1)), …, (Sm, f(Sm))
  • For each Si, flip a coin:
    • if heads, add ((χ(Si), f²(Si)), +);
    • else, add ((χ(Si), n·f²(Si)), −).
  • Learn a linear separator u = (w, −z) in R^{n+1}.
  • Output: g(S) = ( w·χ(S) / (n+1) )^{1/2}.
• Note: deal with the set {S : f(S) = 0} separately.
• Theorem: for m = Θ(n/ε), g approximates f to within a factor √n on a 1−ε fraction of the distribution.
• Proof idea: f non-negative, monotone, submodular is approximated to within a factor √n by the square root of a linear function [GHIM '09]. (A code sketch of the reduction follows below.)
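A minimal end-to-end sketch of this reduction, under several stated assumptions: the target f is a toy coverage function, an off-the-shelf linear classifier (scikit-learn's LinearSVC) stands in for the "learn a linear separator" step, z is simply read off the learned separator, and the constants and feature scaling are illustrative rather than those used in the analysis:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, m, universe = 20, 600, 60

# Toy monotone submodular target: coverage of random subsets of a universe.
W = [set(rng.choice(universe, size=8, replace=False).tolist()) for _ in range(n)]
def f(ind):                                   # ind: 0/1 indicator vector of S
    picked = np.flatnonzero(ind)
    return float(len(set().union(*(W[i] for i in picked)))) if picked.size else 0.0

# Build one labeled point per sample, chosen by a coin flip as on the slide:
# ((chi(S), f(S)^2), +) or ((chi(S), n*f(S)^2), -).
X, y = [], []
for _ in range(m):
    ind = (rng.random(n) < 0.3).astype(float)
    val = f(ind) ** 2
    if val == 0.0:                            # {S : f(S) = 0} is handled separately on the slide
        continue
    if rng.random() < 0.5:
        X.append(np.append(ind, val)); y.append(+1)
    else:
        X.append(np.append(ind, n * val)); y.append(-1)

# "Learn a linear separator u = (w, -z) in R^{n+1}" -- an off-the-shelf classifier
# is used here purely as a stand-in for the separator-learning step.
clf = LinearSVC(C=1e4, max_iter=50000).fit(np.array(X), np.array(y))
w, z = clf.coef_[0][:n], -clf.coef_[0][n]

def g(ind):
    """Predicted value for the set with indicator `ind`: sqrt of a linear function."""
    return np.sqrt(max(w @ ind, 0.0) / max(z, 1e-9) / (n + 1))

test = (rng.random(n) < 0.3).astype(float)
print(f(test), g(test))                       # compare the true and predicted values
```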
PMAC Learning Submodular Fns
• (Same algorithm as on the previous slide.)
• Much simpler than GHIM'09, and more robust to variations:
  • the target only needs to be within a factor β of a submodular fn; or
  • there exists a submodular fn that agrees with the target on all but an η fraction of the points (on the points where it disagrees it can be arbitrarily far) — the algorithm is inefficient in this case.
A General Lower Bound
• Theorem (our general lower bound): no algorithm can PMAC-learn the class of non-negative, monotone, submodular fns with an approximation factor of õ(n^{1/3}).
• Plan: use the fact that any matroid rank fn is submodular, and construct a hard family of matroid rank functions.
[Figure: sets A1, A2, …, AL with L = n^{log log n}; on each Ai the target's value is either high (≈ n^{1/3}) or low (≈ log² n).]
Partition Matroids
• A1, A2, …, Ak ⊆ V = {1, 2, …, n}, all disjoint; uj ≤ |Aj| − 1.
• Ind = { I : |I ∩ Aj| ≤ uj for all j }. Then (V, Ind) is a matroid.
• If the sets Ai are not disjoint, then (V, Ind) might not be a matroid.
  • E.g., n = 5, A1 = {1,2,3}, A2 = {3,4,5}, u1 = u2 = 2: {1,2,4,5} and {2,3,4} are both maximal sets in Ind, yet they do not have the same cardinality.
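A minimal sketch of the independence test, run on exactly the overlapping example from this slide:

```python
def independent(I, parts, caps):
    """Partition-matroid test: I is independent iff |I ∩ A_j| <= u_j for every j."""
    return all(len(I & A) <= u for A, u in zip(parts, caps))

A = [{1, 2, 3}, {3, 4, 5}]      # overlapping, as in the slide's example
u = [2, 2]
print(independent({1, 2, 4, 5}, A, u))   # True: a maximal independent set of size 4
print(independent({2, 3, 4}, A, u))      # True: also maximal (nothing can be added),
                                         # but of size 3 -- so (V, Ind) is not a matroid
```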
Almost partition matroids
• k = 2; A1, A2 ⊆ V (not necessarily disjoint); uj ≤ |Aj| − 1.
• Ind = { I : |I ∩ Aj| ≤ uj for j = 1, 2, and |I ∩ (A1 ∪ A2)| ≤ u1 + u2 − |A1 ∩ A2| }.
• Then (V, Ind) is a matroid.
Almost partition matroids
• More generally: A1, A2, …, Ak ⊆ V = {1, 2, …, n}, uj ≤ |Aj| − 1, and f : 2^[k] → Z with
  f(J) = Σ_{j∈J} uj + |A(J)| − Σ_{j∈J} |Aj|, ∀ J ⊆ [k], where A(J) = ∪_{j∈J} Aj.
• Ind = { I : |I ∩ A(J)| ≤ f(J), ∀ J ⊆ [k] }. Then (V, Ind) is a matroid (if nonempty).
• Rewriting f: f(J) = |A(J)| − Σ_{j∈J} (|Aj| − uj), ∀ J ⊆ [k].
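The constraint function and independence test above are easy to spell out for small k. A brute-force sketch, reusing the overlapping example A1 = {1,2,3}, A2 = {3,4,5}, u1 = u2 = 2 from two slides back:

```python
from itertools import combinations

def f_J(J, A, u):
    """f(J) = |A(J)| - sum over j in J of (|A_j| - u_j)."""
    AJ = set().union(*(A[j] for j in J)) if J else set()
    return len(AJ) - sum(len(A[j]) - u[j] for j in J)

def independent(I, A, u):
    """I is independent iff |I ∩ A(J)| <= f(J) for every nonempty J ⊆ [k]."""
    k = len(A)
    for r in range(1, k + 1):
        for J in combinations(range(k), r):
            AJ = set().union(*(A[j] for j in J))
            if len(I & AJ) > f_J(J, A, u):
                return False
    return True

A = [{1, 2, 3}, {3, 4, 5}]
u = [2, 2]
print(f_J((0, 1), A, u))                 # 3 = |A1 ∪ A2| - (1 + 1)
print(independent({2, 3, 4}, A, u))      # True
print(independent({1, 2, 4, 5}, A, u))   # False: violates |I ∩ (A1 ∪ A2)| <= 3
```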
A generalization of partition matroids
• More generally, f : 2^[k] → Z with f(J) = |A(J)| − Σ_{j∈J} (|Aj| − uj), ∀ J ⊆ [k].
• Ind = { I : |I ∩ A(J)| ≤ f(J), ∀ J ⊆ [k] }. Then (V, Ind) is a matroid (if nonempty).
• Proof technique: an uncrossing argument.
  • For a set I, define T(I) to be the set of tight constraints: T(I) = { J ⊆ [k] : |I ∩ A(J)| = f(J) }.
  • ∀ I ∈ Ind and J1, J2 ∈ T(I): either J1 ∪ J2 ∈ T(I) or J1 ∩ J2 = ∅.
  • Hence Ind is the family of independent sets of a matroid.
A generalization of almost partition matroids
• f : 2^[k] → Z, f(J) = |A(J)| − Σ_{j∈J} (|Aj| − uj), ∀ J ⊆ [k]; uj ≤ |Aj| − 1.
• Note: this requires k ≤ n (for k > n, f becomes negative). But we want k = n^{log log n}, so we apply a truncation to allow k ≫ n.
• Say f is (μ, τ)-good if f(J) ≥ 0 for all J ⊆ [k] with |J| ≤ τ, and f(J) ≥ μ for all J ⊆ [k] with τ ≤ |J| ≤ 2τ − 2.
• Define h(J) = f(J) if |J| ≤ τ, and h(J) = μ otherwise.
• Ind = { I : |I ∩ A(J)| ≤ h(J), ∀ J ⊆ [k] }. Then (V, Ind) is a matroid (if nonempty).
A generalization of partition matroids
• Let L = n^{log log n}. Let A1, A2, …, AL be random subsets of V (each Ai includes each element of V independently with probability n^{−2/3}).
• Let μ = n^{1/3} log² n, u = log² n, τ = n^{1/3}.
• Each subset J ⊆ {1, 2, …, L} induces a matroid such that, for any i ∉ J, Ai is independent in this matroid:
  • rank(Ai), for i ∉ J, is roughly |Ai| (i.e., Θ(n^{1/3}));
  • the rank of the sets Aj, j ∈ J, is u = log² n.
[Figure: the family A1, …, AL, L = n^{log log n}, with high rank (n^{1/3}) outside J and low rank (log² n) inside J.]
Product distributions, Matroid Rank Fns
• Talagrand's inequality implies: let D be a product distribution on V and R = rank(X) for X drawn from D.
  • If E[R] ≥ 4000, then R is concentrated around E[R] (the quantitative bound appeared as a displayed formula on the slide).
  • If E[R] ≤ 500 log(1/ε), a corresponding concentration bound holds (formula likewise on the slide).
• Related work: [Chekuri, Vondrak '09] and [Vondrak '10] prove a slightly more general result by two different techniques.
Product distributions, Matroid Rank Fns
• Talagrand's concentration (previous slide) covers both the case E[R] ≥ 4000 and the case E[R] ≤ 500 log(1/ε).
• Algorithm:
  • Let μ = Σ_{i=1}^m f(xi) / m.
  • Let g be the constant function with value μ.
• This achieves approximation factor O(log(1/ε)) on a 1−ε fraction of points, with high probability.
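A minimal sketch of this constant-function learner; the toy matroid rank function (rank of random real vectors, roughly min(|S|, 8) here) and the fair-coin product distribution are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, rows = 30, 300, 8

# Toy matroid rank function: rank of a random matrix restricted to the chosen columns.
M = rng.standard_normal((rows, n))
def rank_fn(x):                                  # x: 0/1 vector in {0,1}^n
    cols = np.flatnonzero(x)
    return 0 if cols.size == 0 else int(np.linalg.matrix_rank(M[:, cols]))

# Product distribution: each coordinate set to 1 independently with probability 1/2.
samples = [(rng.random(n) < 0.5).astype(int) for _ in range(m)]

mu = np.mean([rank_fn(x) for x in samples])      # empirical mean of the observed values
g = lambda x: mu                                 # the constant hypothesis

fresh = (rng.random(n) < 0.5).astype(int)
print(rank_fn(fresh), g(fresh))                  # typically close, by concentration
```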
Conclusions and Open Questions
• We analyze the intrinsic learnability of submodular fns.
• Our analysis reveals interesting novel extremal and structural properties of submodular fns.
• Open questions:
  • Improve the Ω(n^{1/3}) lower bound to Ω(n^{1/2}).
  • Non-monotone submodular functions.
Other interesting structural properties
• Let h : R → R+ be concave and non-decreasing. For each S ⊆ V, let f(S) = h(|S|).
• Claim: these functions f are submodular, monotone, and non-negative.
[Figure: h(|S|) as a concave curve over set sizes from ∅ to V.]
Theorem: every submodular function looks like this (with lots of "approximately" and "usually" caveats).
• Theorem: let f be a non-negative, monotone, submodular, 1-Lipschitz function. For any ε > 0, there exists a concave function h : [0, n] → R such that for every k ∈ [0, n], and for a 1−ε fraction of S ⊆ V with |S| = k, we have:
  h(k) ≤ f(S) ≤ O(log²(1/ε)) · h(k).
• In fact, h(k) is just E[f(S)], where S is uniform on sets of size k.
• Proof: based on Talagrand's inequality.
[Figure: a concave curve over set sizes from ∅ to V.]
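An empirical illustration of the theorem (a minimal sketch; the 1-Lipschitz submodular f below is an assumed toy partition-matroid rank function, and h(k) = E[f(S)] is estimated by sampling):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, trials = 40, 15, 2000

# Toy 1-Lipschitz, monotone, submodular f: partition-matroid rank,
# f(S) = sum over blocks A_j of min(|S ∩ A_j|, 2), with disjoint blocks of size 5.
parts = [set(range(i, i + 5)) for i in range(0, n, 5)]
def f(S):
    return sum(min(len(S & A), 2) for A in parts)

vals = []
for _ in range(trials):
    S = set(rng.choice(n, size=k, replace=False).tolist())
    vals.append(f(S))

h_k = np.mean(vals)                              # estimate of h(k) = E[f(S)], |S| = k
print(h_k, min(vals), max(vals))                 # the values cluster around h(k)
```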
Conclusions and Open Questions
• We analyze the intrinsic learnability of submodular fns.
• Our analysis reveals interesting novel extremal and structural properties of submodular fns.
• Open questions:
  • Improve the Ω(n^{1/3}) lower bound to Ω(n^{1/2}).
  • Non-monotone submodular functions: is there any algorithm? Is there a lower bound better than Ω(n^{1/3})?